[PATCH net v3] net: sched: do not requeue a NULL skb
A failure in validate_xmit_skb_list() triggered an unconditional call to dev_requeue_skb with skb=NULL. This slowly grows the queue discipline's qlen count until all traffic through the queue stops. We take the optimistic approach and continue running the queue after a failure since it is unknown if later packets also will fail in the validate path. Fixes: 55a93b3ea780 ("qdisc: validate skb without holding lock") Signed-off-by: Lars Persson --- v3: After a discussion with Eric and Cong I went back to v1 and added the likely() for the common path. --- net/sched/sch_generic.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index f18c350..80742ed 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -159,12 +159,15 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q, if (validate) skb = validate_xmit_skb_list(skb, dev); - if (skb) { + if (likely(skb)) { HARD_TX_LOCK(dev, txq, smp_processor_id()); if (!netif_xmit_frozen_or_stopped(txq)) skb = dev_hard_start_xmit(skb, dev, txq, &ret); HARD_TX_UNLOCK(dev, txq); + } else { + spin_lock(root_lock); + return qdisc_qlen(q); } spin_lock(root_lock); -- 2.1.4
[PATCH V3] net: mediatek: update the IRQ part of the binding document
The current binding document only describes a single interrupt. Update the document by adding the 2 other interrupts. The driver currently only uses a single interrupt. The HW is however able to using IRQ grouping to split TX and RX onto separate GIC irqs. Signed-off-by: John Crispin Cc: devicet...@vger.kernel.org --- This binding doc was merged in 4.6-rc1 and there are no users yet. The current driver only uses 1 irq but will work fine with all 3 listed in the devicetree. This patch should be merged before v4.6 is final such that listing all 3 irqs becomes part of the ABI. I have already posted a patch that utilizes all 3 irqs for next-next for v4.7 inclusion. Changes in V3: * be verbose about the 3 irqs and their ordering Changes in V2: * split this patch out of the series that fixes tx stalls in the driver Documentation/devicetree/bindings/net/mediatek-net.txt |7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/Documentation/devicetree/bindings/net/mediatek-net.txt b/Documentation/devicetree/bindings/net/mediatek-net.txt index 5ca7929..32eaaca 100644 --- a/Documentation/devicetree/bindings/net/mediatek-net.txt +++ b/Documentation/devicetree/bindings/net/mediatek-net.txt @@ -9,7 +9,8 @@ have dual GMAC each represented by a child node.. Required properties: - compatible: Should be "mediatek,mt7623-eth" - reg: Address and length of the register set for the device -- interrupts: Should contain the frame engines interrupt +- interrupts: Should contain the three frame engines interrupts in numeric + order. These are fe_int0, fe_int1 and fe_int2. - clocks: the clock used by the core - clock-names: the names of the clock listed in the clocks property. These are "ethif", "esw", "gp2", "gp1" @@ -42,7 +43,9 @@ eth: ethernet@1b10 { <ðsys CLK_ETHSYS_GP2>, <ðsys CLK_ETHSYS_GP1>; clock-names = "ethif", "esw", "gp2", "gp1"; - interrupts = ; + interrupts = ; power-domains = <&scpsys MT2701_POWER_DOMAIN_ETH>; resets = <ðsys MT2701_ETHSYS_ETH_RST>; reset-names = "eth"; -- 1.7.10.4
Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
On Mon, 11 Apr 2016 15:02:51 -0700 Alexander Duyck wrote: > Have you taken a look at possibly trying to optimize the DMA pool API > to work with pages? It sounds like it is supposed to do something > similar to what you are wanting to do. Yes, I have looked at the mm/dmapool.c API. AFAIK this is for DMA coherent memory (see use of dma_alloc_coherent/dma_free_coherent). What we are doing is "streaming" DMA memory, when processing the RX ring. (NIC are only using DMA coherent memory for the descriptors, which are allocated on driver init) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
On Mon, 11 Apr 2016 15:21:26 -0700 Alexei Starovoitov wrote: > On Mon, Apr 11, 2016 at 11:41:57PM +0200, Jesper Dangaard Brouer wrote: > > > > On Sun, 10 Apr 2016 21:45:47 +0300 Sagi Grimberg wrote: > > [...] > > > > > > If we go down this road how about also attaching some driver opaques > > > to the page sets? > > > > That was the ultimate plan... to leave some opaques bytes left in the > > page struct that drivers could use. > > > > In struct page I would need a pointer back to my page_pool struct and a > > page flag. Then, I would need room to store the dma_unmap address. > > (And then some of the usual fields are still needed, like the refcnt, > > and reusing some of the list constructs). And a zero-copy cross-domain > > id. > > I don't think we need to add anything to struct page. > This is supposed to be small cache of dma_mapped pages with lockless access. > It can be implemented as an array or link list where every element > is dma_addr and pointer to page. If it is full, dma_unmap_page+put_page to > send it to back to page allocator. It sounds like the Intel drivers recycle facility, where they split the page into two parts, and keep page in RX-ring, by swapping to other half of page, if page_count(page) is <= 2. Thus, they use the atomic page ref count to synchronize on. Thus, we end-up having two atomic operations per RX packet, on the page refcnt. Where DPDK have zero... By fully taking over the page as an allocator, almost like slab. I can optimize the common case (of the packet-page getting allocated and free'ed on the same CPU), and remove these atomic operations. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH for-next 2/2] net/mlx5: Update mlx5_ifc hardware features
On Tue, Apr 12, 2016 at 08:36:21AM +0300, Or Gerlitz wrote: > Conflicts happens @ all times, life. > ... > > I understand your desire to get it down to zero, but it's not gonna > work, pick another target. Maybe you are right and the time will show, but now we (Saeed, Matan and me) are trying hard to achieve this goal. > > For example, the networking community has a fairly large rc activity > (I would say 10-20x > vs rdma), so when Dave does his "merge-rebases" for net-next over net > and linus tree > (4-5 times in a release), he has to this way or another solve > conflicts, yes! ditto for > Linus during merge windows and to some extent in rc times too. I don't see any harm in our desire to decrease work overhead from these busy people. > > > It won't help to anyone to split this commit to more than one patch. > > The commit change-log should make it clear what this is about, and it doesn't. > If you believe in something, state that clear, be precise. I agree. > > As Saeed admitted the shared code in the commit spans maybe 2% of it. > > The 1st commit deals with a field which is not used in the driver, > this is a cleanup > that you can do in rc (net) patch (remove the field all together) and > overall, w.o seeing I don't agree with your point that cleanup should go to RC. > the down-stream patches that depend on the newly introduced fields, > how do you know there aren't such (unused) bits in the 2nd commit? No, I don't know in advance, but the truth is that it doesn't bother anyone, because we are exposing our internal HW to kernel clients and doing it with minimal impact on the maintainers. signature.asc Description: Digital signature
Re: [PATCH for-next 2/2] net/mlx5: Update mlx5_ifc hardware features
On Tue, Apr 12, 2016 at 8:15 AM, Leon Romanovsky wrote: > On Tue, Apr 12, 2016 at 12:37:34AM +0300, Or Gerlitz wrote: >> On Tue, Apr 12, 2016 at 12:24 AM, Saeed Mahameed >> wrote: >> > On Tue, Apr 12, 2016 at 12:17 AM, Or Gerlitz wrote: >> >> >> feature --> features >> >> > Correct, will fix. >> >> >>> * Add vport to steering commands for SRIOV ACL support >> >>> * Add mlcr, pcmr and mcia registers for dump module EEPROM >> >>> * Add support for FCS, baeacon led and disable_link bits to hca caps >> >>> * Add CQE period mode bit in CQ context for CQE based CQ >> >>> moderation support >> >>> * Add umr SQ bit for fragmented memory registration >> >>> * Add needed bits and caps for Striding RQ support >> >> >> AFAIK, all the above are features will go through net-next, what made >> >> you anticipate conflicts with linux-rdma? >> >> > FCS bit is needed also for rdma, so we took the liberty of updating >> > all the needed HW structs, bits, caps, etc .. >> > at once for all mlx5 features planned for 4.7 regardless of rdma/net >> > conflicts. >> >> The cover letter states that this series deals with shared code. >> >> I guess you might also could extend it a bit to deal also with code >> that you suspect could lead to conflicts, but I don't see why it >> evolved to that extent. > > Or, > All these micro-optimizations on this shared file can potentially lead > to undesired merge conflicts. Subsystem maintainers and Linus don't need to > deal with these conflicts at all. Leon, Conflicts happens @ all times, life. We (MLNX) didn't do a good job to minimize them as much as possible on the 4.5 cycle (understatement) and did vast improvement in the 4.6 cycle (one or two conflicts AFAIK and communicated to Linus). It's correct that Linus got really angry on these two conflicts but I have communicated to him the fact that we did that improvement and we're talking on two commits for a fairly large volume of patches, the response was, if you do things right, we will happily keep working with you, people, go look. I understand your desire to get it down to zero, but it's not gonna work, pick another target. For example, the networking community has a fairly large rc activity (I would say 10-20x vs rdma), so when Dave does his "merge-rebases" for net-next over net and linus tree (4-5 times in a release), he has to this way or another solve conflicts, yes! ditto for Linus during merge windows and to some extent in rc times too. > It won't help to anyone to split this commit to more than one patch. The commit change-log should make it clear what this is about, and it doesn't. If you believe in something, state that clear, be precise. As Saeed admitted the shared code in the commit spans maybe 2% of it. The 1st commit deals with a field which is not used in the driver, this is a cleanup that you can do in rc (net) patch (remove the field all together) and overall, w.o seeing the down-stream patches that depend on the newly introduced fields, how do you know there aren't such (unused) bits in the 2nd commit? Or.
[REGRESSION, bisect] pci: cxgb4 probe fails after commit 104daa71b3961434 ("PCI: Determine actual VPD size on first access")
Hi All, The following patch introduced a regression, causing cxgb4 driver to fail in PCIe probe. commit 104daa71b39614343929e1982170d5fcb0569bb5 Author: Hannes Reinecke Author: Hannes Reinecke Date: Mon Feb 15 09:42:01 2016 +0100 PCI: Determine actual VPD size on first access PCI-2.2 VPD entries have a maximum size of 32k, but might actually be smaller than that. To figure out the actual size one has to read the VPD area until the 'end marker' is reached. Per spec, reading outside of the VPD space is "not allowed." In practice, it may cause simple read errors or even crash the card. To make matters worse not every PCI card implements this properly, leaving us with no 'end' marker or even completely invalid data. Try to determine the size of the VPD data when it's first accessed. If no valid data can be read an I/O error will be returned when reading or writing the sysfs attribute. As the amount of VPD data is unknown initially the size of the sysfs attribute will always be set to '0'. [bhelgaas: changelog, use 0/1 (not false/true) for bitfield, tweak pci_vpd_pci22_read() error checking] Tested-by: Shane Seymour Tested-by: Babu Moger Signed-off-by: Hannes Reinecke Signed-off-by: Bjorn Helgaas Cc: Alexander Duyck The problem is stemming from the fact that the Chelsio adapters actually have two VPD structures stored in the VPD. An abbreviated on at Offset 0x0 and the complete VPD at Offset 0x400. The abbreviated one only contains the PN, SN and EC Keywords, while the complete VPD contains those plus various adapter constants contained in V0, V1, etc. And it also contains the Base Ethernet MAC Address in the "NA" Keyword which the cxgb4 driver needs when it can't contact the adapter firmware. (We don't have the "NA" Keywork in the VPD Structure at Offset 0x0 because that's not an allowed VPD Keyword in the PCI-E 3.0 specification.) With the new code, the computed size of the VPD is 0x200 and so our efforts to read the VPD at Offset 0x400 silently fails. We check the result of the read looking for a signature 0x82 byte but we're checking against random stack garbage. The end result is that the cxgb4 driver now fails the PCI-E Probe. Thanks, Hari
Re: [PATCH for-next 2/2] net/mlx5: Update mlx5_ifc hardware features
On Tue, Apr 12, 2016 at 12:37:34AM +0300, Or Gerlitz wrote: > On Tue, Apr 12, 2016 at 12:24 AM, Saeed Mahameed > wrote: > > On Tue, Apr 12, 2016 at 12:17 AM, Or Gerlitz wrote: > > >> feature --> features > > > Correct, will fix. > > >>> * Add vport to steering commands for SRIOV ACL support > >>> * Add mlcr, pcmr and mcia registers for dump module EEPROM > >>> * Add support for FCS, baeacon led and disable_link bits to hca caps > >>> * Add CQE period mode bit in CQ context for CQE based CQ > >>> moderation support > >>> * Add umr SQ bit for fragmented memory registration > >>> * Add needed bits and caps for Striding RQ support > > >> AFAIK, all the above are features will go through net-next, what made > >> you anticipate conflicts with linux-rdma? > > > FCS bit is needed also for rdma, so we took the liberty of updating > > all the needed HW structs, bits, caps, etc .. > > at once for all mlx5 features planned for 4.7 regardless of rdma/net > > conflicts. > > The cover letter states that this series deals with shared code. > > I guess you might also could extend it a bit to deal also with code > that you suspect could lead to conflicts, but I don't see why it > evolved to that extent. Or, All these micro-optimizations on this shared file can potentially lead to undesired merge conflicts. Subsystem maintainers and Linus don't need to deal with these conflicts at all. It won't help to anyone to split this commit to more than one patch. signature.asc Description: Digital signature
Re: [PATCH net-next v2 1/2] rtnetlink: add new RTM_GETSTATS message to dump link stats
On 4/10/16, 11:28 AM, roopa wrote: > On 4/10/16, 1:16 AM, Thomas Graf wrote: [snip] >> >> This currently ties everything to a net_device with a selector to >> include certain bits of that net_device. How about we take it half a >> step further and allow for non net_device stats such as IP, TCP, >> routing or ipsec stats to be retrieved as well? > yes, absolutely. and that is also the goal. >> A simple array of nested attributes replacing IFLA_STATS_* would >> allow for that, e.g. >> >> 1. {.type = ST_IPSTATS, value = { ...} } >> >> 2. {.type = ST_LINK, .value = { >> {.type = ST_LINK_NAME, .value = "eth0"}, >> {.type = ST_LINK_Q, .value = 10} >> }} >> >> 3. ... > One thing though, Its unclear to me if we absolutely need the additional > nest. > Every stats netlink msg has an ifindex in the header (if_stats_msg) if the > scope > of the stats is a netdev. If the msg does not have an ifindex in the > if_stats_msg, > it represents a global stat. ie Cant a dump, include other stats netlink msgs > after > all the netdev msgs are done when the filter has global stat filters ?. > same will apply to RTM_GETSTATS (without NLM_F_DUMP). > > Since the msg may potentially have more nest levels > in the IFLA_EXT_STATS categories, just trying to see if i can avoid adding > another > top-level nest. We can sure add it if there is no other way to include global > stats in the same dump. > Just wanted to elaborate on what i was trying to say: Top level stats attributes can be netdev or global attributes: We can include string "LINK" in the names of all stats belonging to a netdev to make it easier to recognize the netdev stats (example): IFLA_STATS_LINK64, (netdev) IFLA_STATS_LINK_INET6,(netdev) IFLA_STATS_TCP,(non-netdev, global tcp stats) RTM_GETSTATS (NLM_F_DUMP) with user given filter_mask { If filter-mask contains any link stats, start with per netdev stats messages: { if_stats_msg.ifindex = 1, if_stats_msg.filter_mask = mask of included link stats, } { if_stats_msg.ifindex = 2, if_stats_msg.filter_mask = mask of included link stats, } global stats (if user given filter mask contains global filters.): { if_stats_msg.ifindex = 0, if_stats_msg.filter_mask = mask of included global stats, } } We will need a field in netlink_callback to indicate global or netdev stats when the stats crosses skb boundaries. A single nlmsg cannot have both netdev and global stats. Non-dump RTM_GETSTATS examples: RTM_GETSTATS with valid ifindex and filter_mask { filter_mask cannot have global stats (return -EINVAL) { if_stats_msg.ifindex = , if_stats_msg.filter_mask = mask of included link stats, } } RTM_GETSTATS with ifindex = 0 and filter_mask { filter_mask cannot have link stats (return -EINVAL) { if_stats_msg.ifindex = 0, if_stats_msg.filter_mask = mask of included global link stats, } } Will this not work ? Thanks, Roopa
Re: [PATCH net-next 0/7] DSA refactoring: set 1
Andrew Lunn writes: > There has been a long running effort to refractor DSA probing to make > the switches true linux devices. Here are a small collection of > patches moving in this direction. Most have been seen before. > > We take a little step forward by passing the dsa device point to the > driver, thus allowing it to perform resource allocations using the > normal mechanisms. This device structure will later be replaced by the > devices own device structure. > > Future patches will add a true driver probe function, so we rename the > current probe function, cleaning up the namespace. > > phys_port_mask continually confuses me, thinking it is about PHYs. But > it is actually about ports to the outside world, user ports. So rename > it. > > Lots more patches yet to follow, this is just doing some ground work. > > Andrew Lunn (7): > net: dsa: Pass the dsa device to the switch drivers > net: dsa: Have the switch driver allocate there own private memory > net: dsa: Remove allocation of driver private memory > net: dsa: Keep the mii bus and address in the private structure > net: dsa: Rename DSA probe function. > dsa: Rename phys_port_mask to user_port_mask > dsa: mv88e6xxx: Use bus in mv88e6xxx_lookup_name() > > drivers/net/dsa/bcm_sf2.c | 24 +--- > drivers/net/dsa/mv88e6060.c | 47 +++--- > drivers/net/dsa/mv88e6060.h | 11 + > drivers/net/dsa/mv88e6123.c | 14 +++- > drivers/net/dsa/mv88e6131.c | 14 +++- > drivers/net/dsa/mv88e6171.c | 14 +++- > drivers/net/dsa/mv88e6352.c | 14 +++- > drivers/net/dsa/mv88e6xxx.c | 55 > +++-- > drivers/net/dsa/mv88e6xxx.h | 17 +++--- > include/net/dsa.h | 16 - > net/dsa/dsa.c | 19 +--- > 11 files changed, 166 insertions(+), 79 deletions(-) Tested-by: Vivien Didelot
Re: [PATCH net-next 7/7] dsa: mv88e6xxx: Use bus in mv88e6xxx_lookup_name()
Andrew Lunn writes: > mv88e6xxx_lookup_name() returns the model name of a switch at a given > address on an MII bus. Using mii_bus to identify the bus rather than > the host device is more logical, so change the parameter. > > Signed-off-by: Andrew Lunn Reviewed-by: Vivien Didelot
Re: [PATCH RFC] net: decrease the length of backlog queue immediately after it's detached from sk
On 2016/4/11 20:13, Eric Dumazet wrote: On Mon, 2016-04-11 at 19:57 +0800, Yang Yingliang wrote: On 2016/4/8 22:44, Eric Dumazet wrote: On Fri, 2016-04-08 at 19:18 +0800, Yang Yingliang wrote: I expand tcp_adv_win_scale and tcp_rmem. It has no effect. Try : echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale And restart your flows. cat /proc/sys/net/ipv4/tcp_rmem 10240 2097152 10485760 What about leaving the default values ? I tried, it did not work. $ cat /proc/sys/net/ipv4/tcp_rmem 409687380 6291456 echo 102400 20971520 104857600 > /proc/sys/net/ipv4/tcp_rmem echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale It seems has not effect. I have no idea what you did on the sender side to allow it to send more than 1.5 MB then. We are doing performance test. The sender send 256KB per-block with 128 threads to one socket. And the receiver uses 10Gb NIC to handle the data on ARM64. The data flow is driver->ip layer->tcp layer->iscsi. I added some debug messages and found handling backlog packets in __release_sock() cost about 11ms at most. This can cause backlog queue overflow. The sk_data_ready is re-assigned, it may cost time in our program. I will check it out.
Re: [PATCH RFC 1/5] net: phy: sun8i-h3-ephy: Add bindings for Allwinner H3 Ethernet PHY
On Tue, Apr 12, 2016 at 3:23 AM, Florian Fainelli wrote: > On 04/04/16 09:22, Chen-Yu Tsai wrote: >> The Allwinner H3 SoC incorporates an Ethernet PHY. This is enabled and >> configured through a memory mapped hardware register. >> >> This same register also configures the MAC interface mode and TX clock >> source. Also covered by the register, but not supported in these bindings, >> are TX/RX clock delay chains and inverters, and an RMII module. >> >> Signed-off-by: Chen-Yu Tsai >> --- >> .../bindings/net/allwinner,sun8i-h3-ephy.txt | 44 >> ++ >> 1 file changed, 44 insertions(+) >> create mode 100644 >> Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt >> >> diff --git >> a/Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt >> b/Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt >> new file mode 100644 >> index ..146f227e6d88 >> --- /dev/null >> +++ b/Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt >> @@ -0,0 +1,44 @@ >> +* Allwinner H3 E(thernet) PHY >> + >> +The Allwinner H3 integrates an MII ethernet PHY. As with external PHYs, >> +before it can be configured over the MDIO bus and used, certain hardware >> +features must be configured, such as the PHY address and LED polarity. > > Is the internal PHY address really configurable? Not that there is > anything wrong with it, this is good. It is. Things that are configured or provided to a discrete PHY are routed to registers in the SoC, things such as PHY address, clocks, resets. >> +The PHY must also be powered on and brought out of reset. >> + >> +This is accomplished with regulators and pull-up/downs for external PHYs. >> +For this internal PHY, a hardware register is programmed. >> + >> +The same hardware register also contains clock and interface controls >> +for the MAC. This is also present in earlier SoCs, and is covered by >> +"allwinner,sun7i-a20-gmac-clk". The controls in the H3 are slightly >> +different due to the inclusion of what appears to be an RMII-MII >> +bridge. >> + >> +Required properties: >> +- compatible: should be "allwinner,sun8i-h3-ephy" >> +- reg: address and length of the register set for the device >> +- clocks: A phandle to the reference clock for this device >> +- resets: A phandle to the reset control for this device >> + >> +Ethernet PHY related properties: >> +- allwinner,ephy-addr: the MDIO bus address the PHY should respond to. >> +If this is not set, the external PHY is used, and >> +everything else in this section is ignored. > > So we are going to put the same value here, and in the actual Ethernet > PHY device tree node that should be hanging off the EMAC/MDIO bus > controller, this is confusing and error prone. Yes, that would be an issue when writing the DTS. > >> +- allwinner,leds-active-low: LEDs are active low. Without this, LEDs are >> + active high. >> + >> +Ethernet MAC clock related properties: >> +- #clock-cells: should be 0 >> +- clock-output-names: "mac_tx" >> + >> +Example: >> + >> +ethernet-phy@01c00030 { >> + compatible = "allwinner,sun8i-h3-ephy"; >> + reg = <0x01c00030 0x4>; > > Looking at this register space this looks more like an internal PHY SHIM > that is needed to be configured before the internal PHY can be access, > not a proper Ethernet PHY per-se, see replies in aptch 2. > > Should not this block be a second cell associated with the Ethernet MAC > block? One or the other are not going to be very useful without > knowledge of each other. True. However the lower half of the same register also controls the MAC interface mode and TX clock source and delays. This we had a clock driver that was used in conjuction with stmmac on earlier SoCs. I was hoping to keep that model with the newer EMAC. At the time it was argued that what seemed like a clock should be handled by a clock driver, instead of just a "syscon". If this is reaching too far to handle this new use case, I will happily just provide patches to merge this into the MAC. I would like to know how to deal with things like a PHY requiring some sort of shim driver, be it an internal one, or an external mfd chip that happens to have an Ethernet PHY included? How do we tie this into the PHY node under the MDIO bus? Regards ChenYu
Re: [PATCH] devlink: add missing install of header
From: Stephen Hemminger Date: Mon, 11 Apr 2016 13:39:08 -0700 > The new devlink.h in uapi was not being installed by > make headers_install > > Signed-off-by: Stephen Hemminger Applied, thanks Stephen. I wish we could just have everything under uapi/linux automatically get this treatment, but I guess there is a very good reason for the current behavior.
Re: Configuring ethernet link fails with No such device
From: Stefan Agner Date: Mon, 11 Apr 2016 15:46:08 -0700 > What is the expectation/definition when link configuration should be > possible? Only after the network device got opened or before? Only after it is open. Drivers almost always have the entire chip in powerdown state when it is not open, so we wouldn't be able to properly do link settings even if we wanted to when the device is closed.
Re: [PATCH net-next] net: mdio: Fix lockdep falls positive splat
From: Andrew Lunn Date: Mon, 11 Apr 2016 21:40:05 +0200 > MDIO devices can be stacked upon each other. The current code supports > two levels, which until recently has been enough for a DSA mdio bus on > top of another bus. Now we have hardware which has an MDIO mux in the > middle. > > Define an MDIO MUTEX class with three levels. > > Signed-off-by: Andrew Lunn Applied, thanks Andrew.
Re: [PATCH 0/9] RxRPC: 2nd rewrite part 1
From: David Howells Date: Mon, 11 Apr 2016 23:01:39 +0100 > David Miller wrote: > >> Series applied, but I had to fix up some trailing whitespace (reported by >> GIT) by hand. > > Do you perchance have a git hook script for checking for trailing whitespace? Just try to apply the series yourself with "git am", that's what I do. You can do "git apply --check --whitespace=error-all" on individual patches, but that doesn't work for a series.
Re: [PATCH net v2] net: sched: do not requeue a NULL skb
On Mon, 2016-04-11 at 16:19 -0700, Cong Wang wrote: > My point is, for example, in OOM case, we don't know processin > more SKB would make it better or worse. Maybe we really need to > check the error code to decide to continue to exit? Really, given this bug has been there for a long time (v3.18 ???), I doubt it matters. Nothing can tell that following packets in the qdisc need any transformation, and memory allocations. So I would just fix the bug in the simplest way. __qdisc_run() has all the checks needed to yield when needed (if (quota <= 0 || need_resched())) , no need to add more complexity there.
Re: [PATCH net v2] net: sched: do not requeue a NULL skb
On Mon, Apr 11, 2016 at 11:30 AM, Eric Dumazet wrote: > On Mon, 2016-04-11 at 11:26 -0700, Eric Dumazet wrote: >> On Mon, 2016-04-11 at 11:02 -0700, Cong Wang wrote: >> >> > I am fine with either way as long as the loop stops on failure. > > > Note that skb that could not be validated is already freed. > > So I do not see any value from stopping the loop, since > we need to schedule the queue to avoid tx hang. > > Just process following skb if there is one, fact that skb is sent or > dropped does not matter. My point is, for example, in OOM case, we don't know processing more SKB would make it better or worse. Maybe we really need to check the error code to decide to continue to exit?
Re: Configuring ethernet link fails with No such device
On 11/04/16 15:46, Stefan Agner wrote: > Hi All, > > I traced an issue in which we tried configuring duplex mode and speed in > a systemd-udevd .link file failed with the following error: > "Could not set speed or duplex of eth0 to 0 Mbps (half): No such device" > > The FEC driver (fec_main.c) does not initialize phy_dev until the device > has been opened, and therefor the callback fec_enet_(get|set)_settings > returns -19. > > What is the expectation/definition when link configuration should be > possible? Only after the network device got opened or before? It seems to me like systemd tries to configure the device while we are still opening it, which is arguably something that can be done, but you are in a transient state, the interface is not completely ready yet. > > Or in other words: Is this a Kernel or systemd issue? At what point is > link configuration "guaranteed" to work? Past ndo_open() had a chance to return, therefore connect to the PHY device, is when the configuration/information querying is able to succeed. > > It seems that the FEC driver used to probe the PHY earlier, it has been > moved to the open callback with commit 418bd0d4df ("netdev/fec: fix > ifconfig eth0 down hang issue")... Part of the reason for doing that, is that probing for the PHY during the driver's probe function could leave you with unused HW that is powered on for no good reason. Until we get into ndo_open, we have no clue if the interface is going to be used or not, so it is better to move part of the HW initialization as late as possible in the process. > > There has been a similar report on the systemd-devel mailing list a > while ago, probably the same underlying issue: > https://www.mail-archive.com/systemd-devel@lists.freedesktop.org/msg33186.html > > -- > Stefan > -- Florian
Configuring ethernet link fails with No such device
Hi All, I traced an issue in which we tried configuring duplex mode and speed in a systemd-udevd .link file failed with the following error: "Could not set speed or duplex of eth0 to 0 Mbps (half): No such device" The FEC driver (fec_main.c) does not initialize phy_dev until the device has been opened, and therefor the callback fec_enet_(get|set)_settings returns -19. What is the expectation/definition when link configuration should be possible? Only after the network device got opened or before? Or in other words: Is this a Kernel or systemd issue? At what point is link configuration "guaranteed" to work? It seems that the FEC driver used to probe the PHY earlier, it has been moved to the open callback with commit 418bd0d4df ("netdev/fec: fix ifconfig eth0 down hang issue")... There has been a similar report on the systemd-devel mailing list a while ago, probably the same underlying issue: https://www.mail-archive.com/systemd-devel@lists.freedesktop.org/msg33186.html -- Stefan
Re: [PATCH RFT 1/2] phylib: add device reset GPIO support
On Mon, Apr 11, 2016 at 2:28 PM, Sergei Shtylyov wrote: > On 04/11/2016 10:25 PM, Rob Herring wrote: > >>> The PHY devices sometimes do have their reset signal (maybe even power >>> supply?) tied to some GPIO and sometimes it also does happen that a boot >>> loader does not leave it deasserted. So far this issue has been attacked >>> from (as I believe) a wrong angle: by teaching the MAC driver to >>> manipulate >>> the GPIO in question; that solution, when applied to the device trees, >>> led to adding the PHY reset GPIO properties to the MAC device node, with >>> one exception: Cadence MACB driver which could handle the "reset-gpios" >>> prop in a PHY device subnode. I believe that the correct approach is >>> to >>> teach the 'phylib' to get the MDIO device reset GPIO from the device tree >>> node corresponding to this device -- which this patch is doing... >>> >>> Note that I had to modify the AT803x PHY driver as it would stop working >>> otherwise as it made use of the reset GPIO for its own purposes... >> >> >> Lots of double spaces in here. Please fix. > > >Oh, it's you again! :-D Yep, one of those picky kernel maintainers that like a bad rash just won't go away. :) >>> Signed-off-by: Sergei Shtylyov >>> >>> --- >>> Documentation/devicetree/bindings/net/phy.txt |2 + >>> drivers/net/phy/at803x.c | 19 ++ >>> drivers/net/phy/mdio_bus.c|4 +++ >>> drivers/net/phy/mdio_device.c | 27 >>> +++-- >>> drivers/net/phy/phy_device.c | 33 >>> -- >>> drivers/of/of_mdio.c | 16 >>> include/linux/mdio.h |3 ++ >>> include/linux/phy.h |5 +++ >>> 8 files changed, 89 insertions(+), 20 deletions(-) >> >> >> [...] >> >>> Index: net-next/drivers/of/of_mdio.c >>> === >>> --- net-next.orig/drivers/of/of_mdio.c >>> +++ net-next/drivers/of/of_mdio.c >>> @@ -44,6 +44,7 @@ static int of_get_phy_id(struct device_n >>> static int of_mdiobus_register_phy(struct mii_bus *mdio, struct >>> device_node *child, >>>u32 addr) >>> { >>> + struct gpio_desc *gpiod; >>> struct phy_device *phy; >>> bool is_c45; >>> int rc; >>> @@ -52,10 +53,17 @@ static int of_mdiobus_register_phy(struc >>> is_c45 = of_device_is_compatible(child, >>> "ethernet-phy-ieee802.3-c45"); >>> >>> + gpiod = fwnode_get_named_gpiod(&child->fwnode, "reset-gpios"); >> >> >> Calling fwnode_* functions in a DT specific file/function? That doesn't >> make sense. > > >Really?! 8-) >Where is a DT-only analog I wonder... Ah, you're right. NM. Rob
[PATCH v2 net 4/4] ipv6: udp: Do a route lookup and update during release_cb
This patch adds a release_cb for UDPv6. It does a route lookup and updates sk->sk_dst_cache if it is needed. It picks up the left-over job from ip6_sk_update_pmtu() if the sk was owned by user during the pmtu update. It takes a rcu_read_lock to protect the __sk_dst_get() operations because another thread may do ip6_dst_store() without taking the sk lock (e.g. sendmsg). Fixes: 45e4fd26683c ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception") Signed-off-by: Martin KaFai Lau Reported-by: Wei Wang Cc: Cong Wang Cc: Eric Dumazet Cc: Wei Wang --- include/net/ipv6.h | 1 + net/ipv6/datagram.c | 20 net/ipv6/udp.c | 1 + 3 files changed, 22 insertions(+) diff --git a/include/net/ipv6.h b/include/net/ipv6.h index fd02e90..1be050a 100644 --- a/include/net/ipv6.h +++ b/include/net/ipv6.h @@ -960,6 +960,7 @@ int ip6_datagram_connect(struct sock *sk, struct sockaddr *addr, int addr_len); int ip6_datagram_connect_v6_only(struct sock *sk, struct sockaddr *addr, int addr_len); int ip6_datagram_dst_update(struct sock *sk, bool fix_sk_saddr); +void ip6_datagram_release_cb(struct sock *sk); int ipv6_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len); diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c index 59e01f27..9dd3882 100644 --- a/net/ipv6/datagram.c +++ b/net/ipv6/datagram.c @@ -119,6 +119,26 @@ out: return err; } +void ip6_datagram_release_cb(struct sock *sk) +{ + struct dst_entry *dst; + + if (ipv6_addr_v4mapped(&sk->sk_v6_daddr)) + return; + + rcu_read_lock(); + dst = __sk_dst_get(sk); + if (!dst || !dst->obsolete || + dst->ops->check(dst, inet6_sk(sk)->dst_cookie)) { + rcu_read_unlock(); + return; + } + rcu_read_unlock(); + + ip6_datagram_dst_update(sk, false); +} +EXPORT_SYMBOL_GPL(ip6_datagram_release_cb); + static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len) { struct sockaddr_in6 *usin = (struct sockaddr_in6 *) uaddr; diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index 8125931..6bc5c66 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -1539,6 +1539,7 @@ struct proto udpv6_prot = { .sendmsg = udpv6_sendmsg, .recvmsg = udpv6_recvmsg, .backlog_rcv = __udpv6_queue_rcv_skb, + .release_cb= ip6_datagram_release_cb, .hash = udp_lib_hash, .unhash= udp_lib_unhash, .rehash= udp_v6_rehash, -- 2.5.1
Re: [PATCH 0/9] RxRPC: 2nd rewrite part 1
David Howells wrote: > > Series applied, but I had to fix up some trailing whitespace (reported by > > GIT) by hand. > > Do you perchance have a git hook script for checking for trailing whitespace? No matter - there's already such a script installed by git clone - I just have to enable it. David
[PATCH v2 net 3/4] ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update
There is a case in connected UDP socket such that getsockopt(IPV6_MTU) will return a stale MTU value. The reproducible sequence could be the following: 1. Create a connected UDP socket 2. Send some datagrams out 3. Receive a ICMPV6_PKT_TOOBIG 4. No new outgoing datagrams to trigger the sk_dst_check() logic to update the sk->sk_dst_cache. 5. getsockopt(IPV6_MTU) returns the mtu from the invalid sk->sk_dst_cache instead of the newly created RTF_CACHE clone. This patch updates the sk->sk_dst_cache for a connected datagram sk during pmtu-update code path. Note that the sk->sk_v6_daddr is used to do the route lookup instead of skb->data (i.e. iph). It is because a UDP socket can become connected after sending out some datagrams in un-connected state. or It can be connected multiple times to different destinations. Hence, iph may not be related to where sk is currently connected to. It is done under '!sock_owned_by_user(sk)' condition because the user may make another ip6_datagram_connect() (i.e changing the sk->sk_v6_daddr) while dst lookup is happening in the pmtu-update code path. For the sock_owned_by_user(sk) == true case, the next patch will introduce a release_cb() which will update the sk->sk_dst_cache. Test: Server (Connected UDP Socket): ~ Route Details: [root@arch-fb-vm1 ~]# ip -6 r show | egrep '2fac' 2fac::/64 dev eth0 proto kernel metric 256 pref medium 2fac:face::/64 via 2fac::face dev eth0 metric 1024 pref medium A simple python code to create a connected UDP socket: import socket import errno HOST = '2fac::1' PORT = 8080 s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM) s.bind((HOST, PORT)) s.connect(('2fac:face::face', 53)) print("connected") while True: try: data = s.recv(1024) except socket.error as se: if se.errno == errno.EMSGSIZE: pmtu = s.getsockopt(41, 24) print("PMTU:%d" % pmtu) break s.close() Python program output after getting a ICMPV6_PKT_TOOBIG: [root@arch-fb-vm1 ~]# python2 ~/devshare/kernel/tasks/fib6/udp-connect-53-8080.py connected PMTU:1300 Cache routes after recieving TOOBIG: [root@arch-fb-vm1 ~]# ip -6 r show table cache 2fac:face::face via 2fac::face dev eth0 metric 0 cache expires 463sec mtu 1300 pref medium Client (Send the ICMPV6_PKT_TOOBIG): ~~~ scapy is used to generate the TOOBIG message. Here is the scapy script I have used: >>> p=Ether(src='da:75:4d:36:ac:32', dst='52:54:00:12:34:66', >>> type=0x86dd)/IPv6(src='2fac::face', >>> dst='2fac::1')/ICMPv6PacketTooBig(mtu=1300)/IPv6(src='2fac:: 1',dst='2fac:face::face', nh='UDP')/UDP(sport=8080,dport=53) >>> sendp(p, iface='qemubr0') Fixes: 45e4fd26683c ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception") Signed-off-by: Martin KaFai Lau Reported-by: Wei Wang Cc: Cong Wang Cc: Eric Dumazet Cc: Wei Wang --- include/net/ipv6.h | 1 + net/ipv6/datagram.c | 20 +++- net/ipv6/route.c| 12 3 files changed, 24 insertions(+), 9 deletions(-) diff --git a/include/net/ipv6.h b/include/net/ipv6.h index d0aeb97..fd02e90 100644 --- a/include/net/ipv6.h +++ b/include/net/ipv6.h @@ -959,6 +959,7 @@ int compat_ipv6_getsockopt(struct sock *sk, int level, int optname, int ip6_datagram_connect(struct sock *sk, struct sockaddr *addr, int addr_len); int ip6_datagram_connect_v6_only(struct sock *sk, struct sockaddr *addr, int addr_len); +int ip6_datagram_dst_update(struct sock *sk, bool fix_sk_saddr); int ipv6_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len); diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c index 669585e..59e01f27 100644 --- a/net/ipv6/datagram.c +++ b/net/ipv6/datagram.c @@ -64,7 +64,7 @@ static void ip6_datagram_flow_key_init(struct flowi6 *fl6, struct sock *sk) security_sk_classify_flow(sk, flowi6_to_flowi(fl6)); } -static int ip6_datagram_dst_update(struct sock *sk) +int ip6_datagram_dst_update(struct sock *sk, bool fix_sk_saddr) { struct ip6_flowlabel *flowlabel = NULL; struct in6_addr *final_p, final; @@ -93,14 +93,16 @@ static int ip6_datagram_dst_update(struct sock *sk) goto out; } - if (ipv6_addr_any(&np->saddr)) - np->saddr = fl6.saddr; + if (fix_sk_saddr) { + if (ipv6_addr_any(&np->saddr)) + np->saddr = fl6.saddr; - if (ipv6_addr_any(&sk->sk_v6_rcv_saddr)) { - sk->sk_v6_rcv_saddr = fl6.saddr; - inet->inet_rcv_saddr = LOOPBACK4_IPV6; - if (sk->sk_prot->rehash) - sk->sk_prot->rehash(sk); + if (ipv6_addr_any(&sk->sk_v6_rcv_saddr)) { + sk->sk_v6_rcv_saddr = fl6.saddr; + inet->inet_rcv_saddr = LOOPBACK4_IPV6; + if (sk->sk_prot->rehash) +
[PATCH v2 net 1/4] ipv6: datagram: Refactor flowi6 init codes to a new function
Move flowi6 init codes for connected datagram sk to a newly created function ip6_datagram_flow_key_init(). Notes: 1. fl6_flowlabel is used instead of fl6.flowlabel in __ip6_datagram_connect 2. ipv6_addr_is_multicast(&fl6->daddr) is used instead of (addr_type & IPV6_ADDR_MULTICAST) in ip6_datagram_flow_key_init() This new function will be reused during pmtu update in the later patch. Signed-off-by: Martin KaFai Lau Cc: Cong Wang Cc: Eric Dumazet Cc: Wei Wang --- net/ipv6/datagram.c | 50 ++ 1 file changed, 30 insertions(+), 20 deletions(-) diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c index 4281621..f07c1dd 100644 --- a/net/ipv6/datagram.c +++ b/net/ipv6/datagram.c @@ -40,6 +40,30 @@ static bool ipv6_mapped_addr_any(const struct in6_addr *a) return ipv6_addr_v4mapped(a) && (a->s6_addr32[3] == 0); } +static void ip6_datagram_flow_key_init(struct flowi6 *fl6, struct sock *sk) +{ + struct inet_sock *inet = inet_sk(sk); + struct ipv6_pinfo *np = inet6_sk(sk); + + memset(fl6, 0, sizeof(*fl6)); + fl6->flowi6_proto = sk->sk_protocol; + fl6->daddr = sk->sk_v6_daddr; + fl6->saddr = np->saddr; + fl6->flowi6_oif = sk->sk_bound_dev_if; + fl6->flowi6_mark = sk->sk_mark; + fl6->fl6_dport = inet->inet_dport; + fl6->fl6_sport = inet->inet_sport; + fl6->flowlabel = np->flow_label; + + if (!fl6->flowi6_oif) + fl6->flowi6_oif = np->sticky_pktinfo.ipi6_ifindex; + + if (!fl6->flowi6_oif && ipv6_addr_is_multicast(&fl6->daddr)) + fl6->flowi6_oif = np->mcast_oif; + + security_sk_classify_flow(sk, flowi6_to_flowi(fl6)); +} + static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len) { struct sockaddr_in6 *usin = (struct sockaddr_in6 *) uaddr; @@ -52,6 +76,7 @@ static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int a struct ipv6_txoptions *opt; int addr_type; int err; + __be32 fl6_flowlabel = 0; if (usin->sin6_family == AF_INET) { if (__ipv6_only_sock(sk)) @@ -66,11 +91,10 @@ static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int a if (usin->sin6_family != AF_INET6) return -EAFNOSUPPORT; - memset(&fl6, 0, sizeof(fl6)); if (np->sndflow) { - fl6.flowlabel = usin->sin6_flowinfo&IPV6_FLOWINFO_MASK; - if (fl6.flowlabel&IPV6_FLOWLABEL_MASK) { - flowlabel = fl6_sock_lookup(sk, fl6.flowlabel); + fl6_flowlabel = usin->sin6_flowinfo & IPV6_FLOWINFO_MASK; + if (fl6_flowlabel & IPV6_FLOWLABEL_MASK) { + flowlabel = fl6_sock_lookup(sk, fl6_flowlabel); if (!flowlabel) return -EINVAL; } @@ -145,7 +169,7 @@ ipv4_connected: } sk->sk_v6_daddr = *daddr; - np->flow_label = fl6.flowlabel; + np->flow_label = fl6_flowlabel; inet->inet_dport = usin->sin6_port; @@ -154,21 +178,7 @@ ipv4_connected: * destination cache for it. */ - fl6.flowi6_proto = sk->sk_protocol; - fl6.daddr = sk->sk_v6_daddr; - fl6.saddr = np->saddr; - fl6.flowi6_oif = sk->sk_bound_dev_if; - fl6.flowi6_mark = sk->sk_mark; - fl6.fl6_dport = inet->inet_dport; - fl6.fl6_sport = inet->inet_sport; - - if (!fl6.flowi6_oif) - fl6.flowi6_oif = np->sticky_pktinfo.ipi6_ifindex; - - if (!fl6.flowi6_oif && (addr_type&IPV6_ADDR_MULTICAST)) - fl6.flowi6_oif = np->mcast_oif; - - security_sk_classify_flow(sk, flowi6_to_flowi(&fl6)); + ip6_datagram_flow_key_init(&fl6, sk); rcu_read_lock(); opt = flowlabel ? flowlabel->opt : rcu_dereference(np->opt); -- 2.5.1
[PATCH v2 net 0/4] ipv6: datagram: Update dst cache of a connected udp sk during pmtu update
v2: ~ Protect __sk_dst_get() operations with rcu_read_lock in release_cb() because another thread may do ip6_dst_store() for a udp sk without taking the sk lock (e.g. in sendmsg). ~ Do a ipv6_addr_v4mapped(&sk->sk_v6_daddr) check before calling ip6_datagram_dst_update() in patch 3 and 4. It is similar to how __ip6_datagram_connect handles it. ~ One fix in ip6_datagram_dst_update() in patch 2. It needs to check (np->flow_label & IPV6_FLOWLABEL_MASK) before doing fl6_sock_lookup. I was confused with the naming of IPV6_FLOWLABEL_MASK and IPV6_FLOWINFO_MASK. ~ Check dst->obsolete just on the safe side, although I think it should at least have DST_OBSOLETE_FORCE_CHK by now. ~ Add Fixes tag to patch 3 and 4 ~ Add some points from the previous discussion about holding sk lock to the commit message in patch 3. v1: There is a case in connected UDP socket such that getsockopt(IPV6_MTU) will return a stale MTU value. The reproducible sequence could be the following: 1. Create a connected UDP socket 2. Send some datagrams out 3. Receive a ICMPV6_PKT_TOOBIG 4. No new outgoing datagrams to trigger the sk_dst_check() logic to update the sk->sk_dst_cache. 5. getsockopt(IPV6_MTU) returns the mtu from the invalid sk->sk_dst_cache instead of the newly created RTF_CACHE clone. Patch 1 and 2 are the prep work. Patch 3 and 4 are the fixes.
[PATCH v2 net 2/4] ipv6: datagram: Refactor dst lookup and update codes to a new function
This patch moves the route lookup and update codes for connected datagram sk to a newly created function ip6_datagram_dst_update() It will be reused during the pmtu update in the later patch. Signed-off-by: Martin KaFai Lau Cc: Cong Wang Cc: Eric Dumazet Cc: Wei Wang --- net/ipv6/datagram.c | 103 +--- 1 file changed, 57 insertions(+), 46 deletions(-) diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c index f07c1dd..669585e 100644 --- a/net/ipv6/datagram.c +++ b/net/ipv6/datagram.c @@ -64,16 +64,65 @@ static void ip6_datagram_flow_key_init(struct flowi6 *fl6, struct sock *sk) security_sk_classify_flow(sk, flowi6_to_flowi(fl6)); } +static int ip6_datagram_dst_update(struct sock *sk) +{ + struct ip6_flowlabel *flowlabel = NULL; + struct in6_addr *final_p, final; + struct ipv6_txoptions *opt; + struct dst_entry *dst; + struct inet_sock *inet = inet_sk(sk); + struct ipv6_pinfo *np = inet6_sk(sk); + struct flowi6 fl6; + int err = 0; + + if (np->sndflow && (np->flow_label & IPV6_FLOWLABEL_MASK)) { + flowlabel = fl6_sock_lookup(sk, np->flow_label); + if (!flowlabel) + return -EINVAL; + } + ip6_datagram_flow_key_init(&fl6, sk); + + rcu_read_lock(); + opt = flowlabel ? flowlabel->opt : rcu_dereference(np->opt); + final_p = fl6_update_dst(&fl6, opt, &final); + rcu_read_unlock(); + + dst = ip6_dst_lookup_flow(sk, &fl6, final_p); + if (IS_ERR(dst)) { + err = PTR_ERR(dst); + goto out; + } + + if (ipv6_addr_any(&np->saddr)) + np->saddr = fl6.saddr; + + if (ipv6_addr_any(&sk->sk_v6_rcv_saddr)) { + sk->sk_v6_rcv_saddr = fl6.saddr; + inet->inet_rcv_saddr = LOOPBACK4_IPV6; + if (sk->sk_prot->rehash) + sk->sk_prot->rehash(sk); + } + + ip6_dst_store(sk, dst, + ipv6_addr_equal(&fl6.daddr, &sk->sk_v6_daddr) ? + &sk->sk_v6_daddr : NULL, +#ifdef CONFIG_IPV6_SUBTREES + ipv6_addr_equal(&fl6.saddr, &np->saddr) ? + &np->saddr : +#endif + NULL); + +out: + fl6_sock_release(flowlabel); + return err; +} + static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len) { struct sockaddr_in6 *usin = (struct sockaddr_in6 *) uaddr; struct inet_sock*inet = inet_sk(sk); struct ipv6_pinfo *np = inet6_sk(sk); - struct in6_addr *daddr, *final_p, final; - struct dst_entry*dst; - struct flowi6 fl6; - struct ip6_flowlabel*flowlabel = NULL; - struct ipv6_txoptions *opt; + struct in6_addr *daddr; int addr_type; int err; __be32 fl6_flowlabel = 0; @@ -91,14 +140,8 @@ static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int a if (usin->sin6_family != AF_INET6) return -EAFNOSUPPORT; - if (np->sndflow) { + if (np->sndflow) fl6_flowlabel = usin->sin6_flowinfo & IPV6_FLOWINFO_MASK; - if (fl6_flowlabel & IPV6_FLOWLABEL_MASK) { - flowlabel = fl6_sock_lookup(sk, fl6_flowlabel); - if (!flowlabel) - return -EINVAL; - } - } addr_type = ipv6_addr_type(&usin->sin6_addr); @@ -178,45 +221,13 @@ ipv4_connected: * destination cache for it. */ - ip6_datagram_flow_key_init(&fl6, sk); - - rcu_read_lock(); - opt = flowlabel ? flowlabel->opt : rcu_dereference(np->opt); - final_p = fl6_update_dst(&fl6, opt, &final); - rcu_read_unlock(); - - dst = ip6_dst_lookup_flow(sk, &fl6, final_p); - err = 0; - if (IS_ERR(dst)) { - err = PTR_ERR(dst); + err = ip6_datagram_dst_update(sk); + if (err) goto out; - } - - /* source address lookup done in ip6_dst_lookup */ - - if (ipv6_addr_any(&np->saddr)) - np->saddr = fl6.saddr; - - if (ipv6_addr_any(&sk->sk_v6_rcv_saddr)) { - sk->sk_v6_rcv_saddr = fl6.saddr; - inet->inet_rcv_saddr = LOOPBACK4_IPV6; - if (sk->sk_prot->rehash) - sk->sk_prot->rehash(sk); - } - - ip6_dst_store(sk, dst, - ipv6_addr_equal(&fl6.daddr, &sk->sk_v6_daddr) ? - &sk->sk_v6_daddr : NULL, -#ifdef CONFIG_IPV6_SUBTREES - ipv6_addr_equal(&fl6.saddr, &np->saddr) ? - &np->saddr : -#endif - NULL); sk->sk_state = TCP_ESTABLISHED; sk_set_txhash(sk); out: - fl6_so
Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
On Mon, Apr 11, 2016 at 11:41:57PM +0200, Jesper Dangaard Brouer wrote: > > On Sun, 10 Apr 2016 21:45:47 +0300 Sagi Grimberg wrote: > > > >> This is also very interesting for storage targets, which face the same > > >> issue. SCST has a mode where it caches some fully constructed SGLs, > > >> which is probably very similar to what NICs want to do. > > > > > > I think a cached allocator for page sets + the scatterlists that > > > describe these page sets would not only be useful for SCSI target > > > implementations but also for the Linux SCSI initiator. Today the scsi-mq > > > code reserves space in each scsi_cmnd for a scatterlist of > > > SCSI_MAX_SG_SEGMENTS. If scatterlists would be cached together with page > > > sets less memory would be needed per scsi_cmnd. > > > > If we go down this road how about also attaching some driver opaques > > to the page sets? > > That was the ultimate plan... to leave some opaques bytes left in the > page struct that drivers could use. > > In struct page I would need a pointer back to my page_pool struct and a > page flag. Then, I would need room to store the dma_unmap address. > (And then some of the usual fields are still needed, like the refcnt, > and reusing some of the list constructs). And a zero-copy cross-domain > id. I don't think we need to add anything to struct page. This is supposed to be small cache of dma_mapped pages with lockless access. It can be implemented as an array or link list where every element is dma_addr and pointer to page. If it is full, dma_unmap_page+put_page to send it to back to page allocator.
Re: [PATCH iproute2 0/4] vxlan: add VXLAN-GPE support
On Thu, 7 Apr 2016 14:36:25 +0200 Jiri Benc wrote: > This patchset adds a couple of refinements to the 'external' keyword and > adds support for configuring VXLAN-GPE interfaces. > > The VXLAN-GPE support was recently merged to net-next. > > The first patch is not intended to be applied directly but I still include > it in order for this patchset to be compilable and testable by those > interested. > > Jiri Benc (4): > if_link.h: rebase to the kernel > vxlan: 'external' implies 'nolearning' > ip-link.8: document "external" flag for vxlan > vxlan: add support for VXLAN-GPE > > include/linux/if_link.h | 1 + > ip/iplink_vxlan.c | 14 -- > man/man8/ip-link.8.in | 17 + > 3 files changed, 30 insertions(+), 2 deletions(-) > Applied to net-next
Re: [PATCH] iproute2: tc_bpf.c: fix building with musl libc
On Fri, 8 Apr 2016 09:59:33 -0300 Gustavo Zacarias wrote: > We need limits.h for PATH_MAX, fixes: > > tc_bpf.c: In function ‘bpf_map_selfcheck_pinned’: > tc_bpf.c:222:12: error: ‘PATH_MAX’ undeclared (first use in this > function) > char file[PATH_MAX], buff[4096]; > > Signed-off-by: Gustavo Zacarias Applied
Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
On Mon, Apr 11, 2016 at 2:41 PM, Jesper Dangaard Brouer wrote: > > On Sun, 10 Apr 2016 21:45:47 +0300 Sagi Grimberg wrote: > >> >> This is also very interesting for storage targets, which face the same >> >> issue. SCST has a mode where it caches some fully constructed SGLs, >> >> which is probably very similar to what NICs want to do. >> > >> > I think a cached allocator for page sets + the scatterlists that >> > describe these page sets would not only be useful for SCSI target >> > implementations but also for the Linux SCSI initiator. Today the scsi-mq >> > code reserves space in each scsi_cmnd for a scatterlist of >> > SCSI_MAX_SG_SEGMENTS. If scatterlists would be cached together with page >> > sets less memory would be needed per scsi_cmnd. >> >> If we go down this road how about also attaching some driver opaques >> to the page sets? > > That was the ultimate plan... to leave some opaques bytes left in the > page struct that drivers could use. > > In struct page I would need a pointer back to my page_pool struct and a > page flag. Then, I would need room to store the dma_unmap address. > (And then some of the usual fields are still needed, like the refcnt, > and reusing some of the list constructs). And a zero-copy cross-domain > id. > > > For my packet-page idea, I would need a packet length and an offset > where data starts (I can derive the "head-room" for encap from these > two). Have you taken a look at possibly trying to optimize the DMA pool API to work with pages? It sounds like it is supposed to do something similar to what you are wanting to do. - Alex
Re: [PATCH 0/9] RxRPC: 2nd rewrite part 1
David Miller wrote: > Series applied, but I had to fix up some trailing whitespace (reported by > GIT) by hand. Do you perchance have a git hook script for checking for trailing whitespace? David
Re: [PATCH iproute2 -master 0/3] Minor tc/bpf updates
On Sat, 9 Apr 2016 00:32:02 +0200 Daniel Borkmann wrote: > Some minor updates to improve error reporting, add signatures > and recently introduced map flags attribute. Set is against > master branch. > > Thanks! > > Daniel Borkmann (3): > tc, bpf: add new csum and tunnel signatures > tc, bpf: further improve error reporting > tc, bpf: add support for map pre/allocation > > examples/bpf/bpf_cyclic.c | 9 - > examples/bpf/bpf_graft.c| 8 +++- > examples/bpf/bpf_prog.c | 2 + > examples/bpf/bpf_shared.c | 8 +++- > examples/bpf/bpf_tailcall.c | 29 -- > include/bpf_api.h | 52 > include/bpf_elf.h | 1 + > tc/tc_bpf.c | 98 > +++-- > tc/tc_bpf.h | 4 ++ > 9 files changed, 138 insertions(+), 73 deletions(-) > Applied thanks
Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
On Sun, 10 Apr 2016 21:45:47 +0300 Sagi Grimberg wrote: > >> This is also very interesting for storage targets, which face the same > >> issue. SCST has a mode where it caches some fully constructed SGLs, > >> which is probably very similar to what NICs want to do. > > > > I think a cached allocator for page sets + the scatterlists that > > describe these page sets would not only be useful for SCSI target > > implementations but also for the Linux SCSI initiator. Today the scsi-mq > > code reserves space in each scsi_cmnd for a scatterlist of > > SCSI_MAX_SG_SEGMENTS. If scatterlists would be cached together with page > > sets less memory would be needed per scsi_cmnd. > > If we go down this road how about also attaching some driver opaques > to the page sets? That was the ultimate plan... to leave some opaques bytes left in the page struct that drivers could use. In struct page I would need a pointer back to my page_pool struct and a page flag. Then, I would need room to store the dma_unmap address. (And then some of the usual fields are still needed, like the refcnt, and reusing some of the list constructs). And a zero-copy cross-domain id. For my packet-page idea, I would need a packet length and an offset where data starts (I can derive the "head-room" for encap from these two). -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH for-next 2/2] net/mlx5: Update mlx5_ifc hardware features
On Tue, Apr 12, 2016 at 12:24 AM, Saeed Mahameed wrote: > On Tue, Apr 12, 2016 at 12:17 AM, Or Gerlitz wrote: >> feature --> features > Correct, will fix. >>> * Add vport to steering commands for SRIOV ACL support >>> * Add mlcr, pcmr and mcia registers for dump module EEPROM >>> * Add support for FCS, baeacon led and disable_link bits to hca caps >>> * Add CQE period mode bit in CQ context for CQE based CQ >>> moderation support >>> * Add umr SQ bit for fragmented memory registration >>> * Add needed bits and caps for Striding RQ support >> AFAIK, all the above are features will go through net-next, what made >> you anticipate conflicts with linux-rdma? > FCS bit is needed also for rdma, so we took the liberty of updating > all the needed HW structs, bits, caps, etc .. > at once for all mlx5 features planned for 4.7 regardless of rdma/net > conflicts. The cover letter states that this series deals with shared code. I guess you might also could extend it a bit to deal also with code that you suspect could lead to conflicts, but I don't see why it evolved to that extent. Or.
Re: [Lsf] [LSF/MM TOPIC] Generic page-pool recycle facility?
On Mon, 2016-04-11 at 22:23 +0200, Jesper Dangaard Brouer wrote: > If we have a page-pool recycle facility, then we could use the trick, > right? (As we know that get_page_unless_zero() cannot happen for pages > in the pool). Well, if you disable everything that possibly use get_page_unless_zero(), I guess this could work. But then, you'll have to spy lkml traffic forever to make sure no new feature is added in the kernel, using this get_page_unless_zero() in a new clever way. You could use a page flag so that z BUG() triggers if get_page_unless_zero() is attempted on one of your precious pages ;)\ We had very subtle issues before my fixes (check 35b7a1915aa33da812074744647db0d9262a555c and children), so I would not waste time on the lock prefix avoidance at this point.
Re: [PATCH for-next 2/2] net/mlx5: Update mlx5_ifc hardware features
On Tue, Apr 12, 2016 at 12:17 AM, Or Gerlitz wrote: > > feature --> features Correct, will fix. > >> * Add vport to steering commands for SRIOV ACL support >> * Add mlcr, pcmr and mcia registers for dump module EEPROM >> * Add support for FCS, baeacon led and disable_link bits to >> hca caps >> * Add CQE period mode bit in CQ context for CQE based CQ >> moderation support >> * Add umr SQ bit for fragmented memory registration >> * Add needed bits and caps for Striding RQ support > > AFAIK, all the above are features will go through net-next, what made > you anticipate conflicts with linux-rdma? FCS bit is needed also for rdma, so we took the liberty of updating all the needed HW structs, bits, caps, etc .. at once for all mlx5 features planned for 4.7 regardless of rdma/net conflicts.
Re: Getting at the UDP headers from ->data_ready()
Willem de Bruijn wrote: > The network and transport header pointers are still valid. Commit e6afc8ace6dd > only changes where skb->data points to. It does not discard the data between > skb->head and skb->data. This rxrpc follow-up patch fixes some offset > arithmetic to the payload, which is computed relative to skb->data. > > There are other uses of ip_hdr and udp_hdr in udp_recvmsg and similar > recvmsg handlers for other protocols. For instance, in the source address > processing for recvfrom ("if (sin) { .. }") Excellent, thanks! David
Re: [PATCHv2 net-next 4/6] sctp: add the sctp_diag.c file
On Fri, Apr 08, 2016 at 10:51:56PM -0700, Eric Dumazet wrote: > On Sat, 2016-04-09 at 12:53 +0800, Xin Long wrote: > > This one will implement all the interface of inet_diag, inet_diag_handler. > > which includes sctp_diag_dump, sctp_diag_dump_one and sctp_diag_get_info. > > > > +static int inet_assoc_diag_fill(struct sock *sk, > > + struct sctp_association *asoc, > > + struct sk_buff *skb, > > + const struct inet_diag_req_v2 *req, > > + struct user_namespace *user_ns, > > + int portid, u32 seq, u16 nlmsg_flags, > > + const struct nlmsghdr *unlh) > > +{ > > + const struct inet_sock *inet = inet_sk(sk); > > + const struct inet_diag_handler *handler; > > + int ext = req->idiag_ext; > > + struct inet_diag_msg *r; > > + struct nlmsghdr *nlh; > > + struct nlattr *attr; > > + void *info = NULL; > > + union sctp_addr laddr, paddr; > > + struct dst_entry *dst; > > + struct sctp_infox infox; > > + > > + handler = inet_diag_get_handler(req->sdiag_protocol); > > + BUG_ON(!handler); > > + > > + nlh = nlmsg_put(skb, portid, seq, unlh->nlmsg_type, sizeof(*r), > > + nlmsg_flags); > > + if (!nlh) > > + return -EMSGSIZE; > > + > > + r = nlmsg_data(nlh); > > + BUG_ON(!sk_fullsock(sk)); > > + > > + laddr = list_entry(asoc->base.bind_addr.address_list.next, > > + struct sctp_sockaddr_entry, list)->a; > > + paddr = asoc->peer.primary_path->ipaddr; > > + dst = asoc->peer.primary_path->dst; > > + > > + r->idiag_family = sk->sk_family; > > + r->id.idiag_sport = htons(asoc->base.bind_addr.port); > > + r->id.idiag_dport = htons(asoc->peer.port); > > + r->id.idiag_if = dst ? dst->dev->ifindex : 0; > > + sock_diag_save_cookie(sk, r->id.idiag_cookie); > > + > > +#if IS_ENABLED(CONFIG_IPV6) > > + if (sk->sk_family == AF_INET6) { > > + *(struct in6_addr *)r->id.idiag_src = laddr.v6.sin6_addr; > > + *(struct in6_addr *)r->id.idiag_dst = paddr.v6.sin6_addr; > > + } else > > +#endif > > + { > > + memset(&r->id.idiag_src, 0, sizeof(r->id.idiag_src)); > > + memset(&r->id.idiag_dst, 0, sizeof(r->id.idiag_dst)); > > + > > + r->id.idiag_src[0] = laddr.v4.sin_addr.s_addr; > > + r->id.idiag_dst[0] = paddr.v4.sin_addr.s_addr; > > + } > > + > > + r->idiag_state = asoc->state; > > + r->idiag_timer = SCTP_EVENT_TIMEOUT_T3_RTX; > > + r->idiag_retrans = asoc->rtx_data_chunks; > > +#define EXPIRES_IN_MS(tmo) DIV_ROUND_UP((tmo - jiffies) * 1000, HZ) > > + r->idiag_expires = > > + EXPIRES_IN_MS(asoc->timeouts[SCTP_EVENT_TIMEOUT_T3_RTX]); > > +#undef EXPIRES_IN_MS > > + vvv > > + if (nla_put_u8(skb, INET_DIAG_SHUTDOWN, sk->sk_shutdown)) > > + goto errout; > > + > > + /* IPv6 dual-stack sockets use inet->tos for IPv4 connections, > > +* hence this needs to be included regardless of socket family. > > +*/ > > + if (ext & (1 << (INET_DIAG_TOS - 1))) > > + if (nla_put_u8(skb, INET_DIAG_TOS, inet->tos) < 0) > > + goto errout; > > + > > +#if IS_ENABLED(CONFIG_IPV6) > > + if (r->idiag_family == AF_INET6) { > > + if (ext & (1 << (INET_DIAG_TCLASS - 1))) > > + if (nla_put_u8(skb, INET_DIAG_TCLASS, > > + inet6_sk(sk)->tclass) < 0) > > + goto errout; > > + > > + if (((1 << sk->sk_state) & (TCPF_LISTEN | TCPF_CLOSE)) && > > + nla_put_u8(skb, INET_DIAG_SKV6ONLY, ipv6_only_sock(sk))) > > + goto errout; > > + } > > +#endif > > + > > + r->idiag_uid = from_kuid_munged(user_ns, sock_i_uid(sk)); > > + r->idiag_inode = sock_i_ino(sk); > > + > > + if (ext & (1 << (INET_DIAG_MEMINFO - 1))) { > > + struct inet_diag_meminfo minfo = { > > + .idiag_rmem = sk_rmem_alloc_get(sk), > > + .idiag_wmem = sk->sk_wmem_queued, > > + .idiag_fmem = sk->sk_forward_alloc, > > + .idiag_tmem = sk_wmem_alloc_get(sk), > > + }; > > + > > All this code looks familiar. > > Why inet_sk_diag_fill() is not used instead ? Actually we have an issue here. idiag_tmem is using sk_wmem_alloc_get() directly, but it shouldn't. SCTP supports using sndbuf per socket or per assoc on that socket. We need an if and report the correct value, as it's done in sctp_wspace(). For inet_ep_diag_fill, it's reporting an endpoint, so this is not needed as we don't have this option in there. > > + if (nla_put(skb, INET_DIAG_MEMINFO, sizeof(minfo), &minfo) < 0) > > + goto errout; > > + } > > + > > + if (ext & (1 << (INET_DIAG_SKMEMINFO - 1))) > > + if (sock_diag_put_meminfo(sk
Re: Getting at the UDP headers from ->data_ready()
On Mon, Apr 11, 2016 at 5:04 PM, David Howells wrote: > Hi Willem, > > With regards to: > > commit 4d0fc73ebe94ac984a187f21fbf4f3a1ac846f5a > Author: Willem de Bruijn > Date: Thu Apr 7 11:44:59 2016 -0400 > > rxrpc: do not pull udp headers on receive > > Commit e6afc8ace6dd modified the udp receive path by pulling the udp > header before queuing an skbuff onto the receive queue. > > Does that mean that I can no longer access the UDP header via udp_hdr(skb) > from with the ->data_ready() handler? > > I'm guessing that's not actually the case since ip_hdr(skb) seems to work - or > is that something I shouldn't be using since the part of the buffer containing > the IP header might've been discarded? The network and transport header pointers are still valid. Commit e6afc8ace6dd only changes where skb->data points to. It does not discard the data between skb->head and skb->data. This rxrpc follow-up patch fixes some offset arithmetic to the payload, which is computed relative to skb->data. There are other uses of ip_hdr and udp_hdr in udp_recvmsg and similar recvmsg handlers for other protocols. For instance, in the source address processing for recvfrom ("if (sin) { .. }")
Re: [PATCH for-next 2/2] net/mlx5: Update mlx5_ifc hardware features
On Mon, Apr 11, 2016 at 11:10 PM, Saeed Mahameed wrote: > Adding the needed mlx5_ifc hardware bits and structs > for the following feature: feature --> features > * Add vport to steering commands for SRIOV ACL support > * Add mlcr, pcmr and mcia registers for dump module EEPROM > * Add support for FCS, baeacon led and disable_link bits to > hca caps > * Add CQE period mode bit in CQ context for CQE based CQ > moderation support > * Add umr SQ bit for fragmented memory registration > * Add needed bits and caps for Striding RQ support AFAIK, all the above are features will go through net-next, what made you anticipate conflicts with linux-rdma?
Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
On Mon, 2016-04-11 at 21:47 +0200, Jesper Dangaard Brouer wrote: > On Mon, 11 Apr 2016 09:53:54 -0700 > Eric Dumazet wrote: > > > On Mon, 2016-04-11 at 18:19 +0200, Jesper Dangaard Brouer wrote: > > > > > Drivers also do tricks where they fallback to smaller order pages. E.g. > > > lookup function mlx4_alloc_pages(). I've tried to simulate that > > > function here: > > > https://github.com/netoptimizer/prototype-kernel/blob/91d323fc53/kernel/mm/bench/page_bench01.c#L69 > > > > > > > We use order-0 pages on mlx4 at Google, as order-3 pages are very > > dangerous for some kind of attacks... > > Interesting! > > > An out of order TCP packet can hold an order-3 pages, while claiming to > > use 1.5 KBvia skb->truesize. > > > > order-0 only pages allow the page recycle trick used by Intel driver, > > and we hardly see any page allocations in typical workloads. > > Yes, I looked at the Intel ixgbe drivers page recycle trick. > > It is actually quite cool, but code wise it is a little hard to > follow. I started to look at the variant in i40e, specifically > function i40e_clean_rx_irq_ps() explains it a bit more explicit. > > > > While order-3 pages are 'nice' for friendly datacenter kind of > > traffic, they also are a higher risk on hosts connected to the wild > > Internet. > > > > Maybe I should upstream this patch ;) > > Definitely! > > Does this patch also include a page recycle trick? Else how do you get > around the cost of allocating a single order-0 page? > Yes, we use the page recycle trick. Obviously not on powerpc (or any arch with PAGE_SIZE >= 8192), but definitely on x86.
Getting at the UDP headers from ->data_ready()
Hi Willem, With regards to: commit 4d0fc73ebe94ac984a187f21fbf4f3a1ac846f5a Author: Willem de Bruijn Date: Thu Apr 7 11:44:59 2016 -0400 rxrpc: do not pull udp headers on receive Commit e6afc8ace6dd modified the udp receive path by pulling the udp header before queuing an skbuff onto the receive queue. Does that mean that I can no longer access the UDP header via udp_hdr(skb) from with the ->data_ready() handler? I'm guessing that's not actually the case since ip_hdr(skb) seems to work - or is that something I shouldn't be using since the part of the buffer containing the IP header might've been discarded? David
[PATCH] devlink: add missing install of header
The new devlink.h in uapi was not being installed by make headers_install Signed-off-by: Stephen Hemminger --- include/uapi/linux/Kbuild | 1 + 1 file changed, 1 insertion(+) diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild index b71fd0b..813ffb2e 100644 --- a/include/uapi/linux/Kbuild +++ b/include/uapi/linux/Kbuild @@ -96,6 +96,7 @@ header-y += cyclades.h header-y += cycx_cfm.h header-y += dcbnl.h header-y += dccp.h +header-y += devlink.h header-y += dlmconstants.h header-y += dlm_device.h header-y += dlm.h -- 2.1.4
Re: [PATCH 0/9] RxRPC: 2nd rewrite part 1
David Miller wrote: > Series applied, but I had to fix up some trailing whitespace (reported by > GIT) by hand. Okay, thanks. David
Re: [PATCH net-next 6/7] dsa: Rename phys_port_mask to user_port_mask
On Mon, Apr 11, 2016 at 01:03:52PM -0700, Florian Fainelli wrote: > On 11/04/16 12:50, Andrew Lunn wrote: > > The phys in phys_port_mask suggests this mask is about PHYs. In fact, > > it means physical ports. Rename to user_port_mask, indicating user > > ports of the switch, which is hopefully less confusing. > > Even though the change looks fine in principle, I am more worried about > the difficulty for people to backport fixes because of the renaming > happening here. How about "enabled_ports_mask" as a name? I'm fine with that. Anything, so long as it does not contain phys. > Did not > Guenter had a helper function at some point which tested for (1 << port > & ds->phys_port_mask)? Maybe you are thinking of: static inline bool dsa_is_port_initialized(struct dsa_switch *ds, int p) { return ds->phys_port_mask & (1 << p) && ds->ports[p]; } So how about initialized_port_mask, although it is a bit long. Andrew
Re: [Lsf] [LSF/MM TOPIC] Generic page-pool recycle facility?
On Sat, 09 Apr 2016 05:34:38 -0700 Eric Dumazet wrote: > On Sat, 2016-04-09 at 11:11 +0200, Jesper Dangaard Brouer wrote: > > > Above code is okay. But do you think we also can get away with the same > > trick we do with the SKB refcnf? Where we avoid an atomic operation if > > refcnt==1. > > > > void kfree_skb(struct sk_buff *skb) > > { > > if (unlikely(!skb)) > > return; > > if (likely(atomic_read(&skb->users) == 1)) > > smp_rmb(); > > else if (likely(!atomic_dec_and_test(&skb->users))) > > return; > > trace_kfree_skb(skb, __builtin_return_address(0)); > > __kfree_skb(skb); > > } > > EXPORT_SYMBOL(kfree_skb); > > No we can not use this trick this for pages : > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ec91698360b3818ff426488a1529811f7a7ab87f > If we have a page-pool recycle facility, then we could use the trick, right? (As we know that get_page_unless_zero() cannot happen for pages in the pool). -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
[PATCH for-next 2/2] net/mlx5: Update mlx5_ifc hardware features
Adding the needed mlx5_ifc hardware bits and structs for the following feature: * Add vport to steering commands for SRIOV ACL support * Add mlcr, pcmr and mcia registers for dump module EEPROM * Add support for FCS, baeacon led and disable_link bits to hca caps * Add CQE period mode bit in CQ context for CQE based CQ moderation support * Add umr SQ bit for fragmented memory registration * Add needed bits and caps for Striding RQ support Signed-off-by: Saeed Mahameed Signed-off-by: Matan Barak --- include/linux/mlx5/mlx5_ifc.h | 146 ++-- 1 files changed, 124 insertions(+), 22 deletions(-) diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h index c300e74..4ce4ea4 100644 --- a/include/linux/mlx5/mlx5_ifc.h +++ b/include/linux/mlx5/mlx5_ifc.h @@ -513,7 +513,9 @@ struct mlx5_ifc_per_protocol_networking_offload_caps_bits { u8 max_lso_cap[0x5]; u8 reserved_at_10[0x4]; u8 rss_ind_tbl_cap[0x4]; - u8 reserved_at_18[0x3]; + u8 reg_umr_sq[0x1]; + u8 scatter_fcs[0x1]; + u8 reserved_at_1a[0x1]; u8 tunnel_lso_const_out_ip_id[0x1]; u8 reserved_at_1c[0x2]; u8 tunnel_statless_gre[0x1]; @@ -648,7 +650,7 @@ struct mlx5_ifc_vector_calc_cap_bits { enum { MLX5_WQ_TYPE_LINKED_LIST = 0x0, MLX5_WQ_TYPE_CYCLIC = 0x1, - MLX5_WQ_TYPE_STRQ = 0x2, + MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ = 0x2, }; enum { @@ -753,7 +755,11 @@ struct mlx5_ifc_cmd_hca_cap_bits { u8 early_vf_enable[0x1]; u8 reserved_at_1a9[0x2]; u8 local_ca_ack_delay[0x5]; - u8 reserved_at_1af[0x6]; + u8 reserved_at_1af[0x2]; + u8 ports_check[0x1]; + u8 reserved_at_1b2[0x1]; + u8 disable_link_up[0x1]; + u8 beacon_led[0x1]; u8 port_type[0x2]; u8 num_ports[0x8]; @@ -778,7 +784,8 @@ struct mlx5_ifc_cmd_hca_cap_bits { u8 cqe_version[0x4]; u8 compact_address_vector[0x1]; - u8 reserved_at_200[0x3]; + u8 striding_rq[0x1]; + u8 reserved_at_201[0x2]; u8 ipoib_basic_offloads[0x1]; u8 reserved_at_205[0xa]; u8 drain_sigerr[0x1]; @@ -807,12 +814,12 @@ struct mlx5_ifc_cmd_hca_cap_bits { u8 block_lb_mc[0x1]; u8 reserved_at_229[0x1]; u8 scqe_break_moderation[0x1]; - u8 reserved_at_22a[0x1]; + u8 cq_period_start_from_cqe[0x1]; u8 cd[0x1]; u8 reserved_at_22d[0x1]; u8 apm[0x1]; u8 vector_calc[0x1]; - u8 reserved_at_22f[0x1]; + u8 umr_ptr_rlky[0x1]; u8 imaicl[0x1]; u8 reserved_at_232[0x4]; u8 qkv[0x1]; @@ -913,10 +920,10 @@ struct mlx5_ifc_cmd_hca_cap_bits { u8 reserved_at_500[0x80]; u8 reserved_at_580[0x3f]; - u8 cqe_zip[0x1]; + u8 cqe_compression[0x1]; - u8 cqe_zip_timeout[0x10]; - u8 cqe_zip_max_num[0x10]; + u8 cqe_compression_timeout[0x10]; + u8 cqe_compression_max_num[0x10]; u8 reserved_at_5e0[0x220]; }; @@ -1000,7 +1007,13 @@ struct mlx5_ifc_wq_bits { u8 reserved_at_118[0x3]; u8 log_wq_sz[0x5]; - u8 reserved_at_120[0x4e0]; + u8 reserved_at_120[0x15]; + u8 log_wqe_num_of_strides[0x3]; + u8 two_byte_shift_en[0x1]; + u8 reserved_at_139[0x4]; + u8 log_wqe_stride_size[0x3]; + + u8 reserved_at_140[0x4c0]; struct mlx5_ifc_cmd_pas_bits pas[0]; }; @@ -2199,7 +2212,8 @@ struct mlx5_ifc_sqc_bits { u8 flush_in_error_en[0x1]; u8 reserved_at_4[0x4]; u8 state[0x4]; - u8 reserved_at_c[0x14]; + u8 reg_umr[0x1]; + u8 reserved_at_d[0x13]; u8 reserved_at_20[0x8]; u8 user_index[0x18]; @@ -2247,7 +2261,8 @@ enum { struct mlx5_ifc_rqc_bits { u8 rlky[0x1]; - u8 reserved_at_1[0x2]; + u8 reserved_at_1[0x1]; + u8 scatter_fcs[0x1]; u8 vsd[0x1]; u8 mem_rq_type[0x4]; u8 state[0x4]; @@ -2604,6 +2619,11 @@ enum { MLX5_CQC_ST_FIRED = 0xa, }; +enum { + MLX5_CQ_PERIOD_MODE_START_FROM_EQE = 0x0, + MLX5_CQ_PERIOD_MODE_START_FROM_CQE = 0x1, +}; + struct mlx5_ifc_cqc_bits { u8 status[0x4]; u8 reserved_at_4[0x4]; @@ -2612,8 +2632,8 @@ struct mlx5_ifc_cqc_bits { u8
[PATCH for-next 1/2] net/mlx5: Fix mlx5 ifc cmd_hca_cap bad offsets
From: Tariq Toukan All reserved fields after early_vf_enable are off by 1, since early_vf_enable was not explicitly declared as array of size 1. Reserved field before cqe_zip had a wrong size, it should be 0x80 + 0x3f. Fixes: b084590e ("net/mlx5_core: Introduce access function to read internal timer ") Fixes: b4ff3a36d3e4 ("net/mlx5: Use offset based reserved field names in the IFC header file") Signed-off-by: Tariq Toukan Signed-off-by: Saeed Mahameed Signed-off-by: Matan Barak --- include/linux/mlx5/mlx5_ifc.h | 107 + 1 files changed, 55 insertions(+), 52 deletions(-) diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h index c15b8a8..c300e74 100644 --- a/include/linux/mlx5/mlx5_ifc.h +++ b/include/linux/mlx5/mlx5_ifc.h @@ -750,21 +750,21 @@ struct mlx5_ifc_cmd_hca_cap_bits { u8 ets[0x1]; u8 nic_flow_table[0x1]; u8 eswitch_flow_table[0x1]; - u8 early_vf_enable; - u8 reserved_at_1a8[0x2]; + u8 early_vf_enable[0x1]; + u8 reserved_at_1a9[0x2]; u8 local_ca_ack_delay[0x5]; u8 reserved_at_1af[0x6]; u8 port_type[0x2]; u8 num_ports[0x8]; - u8 reserved_at_1bf[0x3]; + u8 reserved_at_1c0[0x3]; u8 log_max_msg[0x5]; - u8 reserved_at_1c7[0x4]; + u8 reserved_at_1c8[0x4]; u8 max_tc[0x4]; - u8 reserved_at_1cf[0x6]; + u8 reserved_at_1d0[0x6]; u8 rol_s[0x1]; u8 rol_g[0x1]; - u8 reserved_at_1d7[0x1]; + u8 reserved_at_1d8[0x1]; u8 wol_s[0x1]; u8 wol_g[0x1]; u8 wol_a[0x1]; @@ -774,47 +774,47 @@ struct mlx5_ifc_cmd_hca_cap_bits { u8 wol_p[0x1]; u8 stat_rate_support[0x10]; - u8 reserved_at_1ef[0xc]; + u8 reserved_at_1f0[0xc]; u8 cqe_version[0x4]; u8 compact_address_vector[0x1]; u8 reserved_at_200[0x3]; u8 ipoib_basic_offloads[0x1]; - u8 reserved_at_204[0xa]; + u8 reserved_at_205[0xa]; u8 drain_sigerr[0x1]; u8 cmdif_checksum[0x2]; u8 sigerr_cqe[0x1]; - u8 reserved_at_212[0x1]; + u8 reserved_at_213[0x1]; u8 wq_signature[0x1]; u8 sctr_data_cqe[0x1]; - u8 reserved_at_215[0x1]; + u8 reserved_at_216[0x1]; u8 sho[0x1]; u8 tph[0x1]; u8 rf[0x1]; u8 dct[0x1]; - u8 reserved_at_21a[0x1]; + u8 reserved_at_21b[0x1]; u8 eth_net_offloads[0x1]; u8 roce[0x1]; u8 atomic[0x1]; - u8 reserved_at_21e[0x1]; + u8 reserved_at_21f[0x1]; u8 cq_oi[0x1]; u8 cq_resize[0x1]; u8 cq_moderation[0x1]; - u8 reserved_at_222[0x3]; + u8 reserved_at_223[0x3]; u8 cq_eq_remap[0x1]; u8 pg[0x1]; u8 block_lb_mc[0x1]; - u8 reserved_at_228[0x1]; + u8 reserved_at_229[0x1]; u8 scqe_break_moderation[0x1]; u8 reserved_at_22a[0x1]; u8 cd[0x1]; - u8 reserved_at_22c[0x1]; + u8 reserved_at_22d[0x1]; u8 apm[0x1]; u8 vector_calc[0x1]; u8 reserved_at_22f[0x1]; u8 imaicl[0x1]; - u8 reserved_at_231[0x4]; + u8 reserved_at_232[0x4]; u8 qkv[0x1]; u8 pkv[0x1]; u8 set_deth_sqpn[0x1]; @@ -824,98 +824,101 @@ struct mlx5_ifc_cmd_hca_cap_bits { u8 uc[0x1]; u8 rc[0x1]; - u8 reserved_at_23f[0xa]; + u8 reserved_at_240[0xa]; u8 uar_sz[0x6]; - u8 reserved_at_24f[0x8]; + u8 reserved_at_250[0x8]; u8 log_pg_sz[0x8]; u8 bf[0x1]; - u8 reserved_at_260[0x1]; + u8 reserved_at_261[0x1]; u8 pad_tx_eth_packet[0x1]; - u8 reserved_at_262[0x8]; + u8 reserved_at_263[0x8]; u8 log_bf_reg_size[0x5]; - u8 reserved_at_26f[0x10]; + u8 reserved_at_270[0x10]; - u8 reserved_at_27f[0x10]; + u8 reserved_at_280[0x10]; u8 max_wqe_sz_sq[0x10]; - u8 reserved_at_29f[0x10]; + u8 reserved_at_2a0[0x10]; u8 max_wqe_sz_rq[0x10]; - u8 reserved_at_2bf[0x10]; + u8 reserved_at_2c0[0x10];
[PATCH for-next 0/2] mlx5_core: mlx5_ifc updates
Hi Dave and Doug This series include mlx5_core updates for both net-next and rdma trees for 4.7 kernel cycle. This is the only shared code planned for 4.7 between rdma and net trees. Hopefully, this will prevent future conflicts when merging between ib-next and net-next once 4.7 cycle is over and merge window is opened. Both Mellanox rdma and net submissions will proceed once this series is applied into both trees. Future shared code will be sent to both maintainers as pull requests from Mellanox's kernel.org tree. We have included all the maintainers of respective drivers. Kindly review the change and let us know in case of any review comments. Saeed Mahameed (1): net/mlx5: Update mlx5_ifc hardware features Tariq Toukan (1): net/mlx5: Fix mlx5 ifc cmd_hca_cap bad offsets include/linux/mlx5/mlx5_ifc.h | 253 + 1 files changed, 179 insertions(+), 74 deletions(-)
Re: [PATCH net-next 6/7] dsa: Rename phys_port_mask to user_port_mask
On 11/04/16 12:50, Andrew Lunn wrote: > The phys in phys_port_mask suggests this mask is about PHYs. In fact, > it means physical ports. Rename to user_port_mask, indicating user > ports of the switch, which is hopefully less confusing. Even though the change looks fine in principle, I am more worried about the difficulty for people to backport fixes because of the renaming happening here. How about "enabled_ports_mask" as a name? Did not Guenter had a helper function at some point which tested for (1 << port & ds->phys_port_mask)? > > Signed-off-by: Andrew Lunn > --- > drivers/net/dsa/bcm_sf2.c | 10 +- > drivers/net/dsa/mv88e6060.c | 2 +- > include/net/dsa.h | 4 ++-- > net/dsa/dsa.c | 8 > 4 files changed, 12 insertions(+), 12 deletions(-) > > diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c > index 50caa525cda3..e1e8070fb9f9 100644 > --- a/drivers/net/dsa/bcm_sf2.c > +++ b/drivers/net/dsa/bcm_sf2.c > @@ -160,7 +160,7 @@ static void bcm_sf2_imp_vlan_setup(struct dsa_switch *ds, > int cpu_port) >* the same VLAN. >*/ > for (i = 0; i < priv->hw_params.num_ports; i++) { > - if (!((1 << i) & ds->phys_port_mask)) > + if (!((1 << i) & ds->user_port_mask)) > continue; > > reg = core_readl(priv, CORE_PORT_VLAN_CTL_PORT(i)); > @@ -1009,7 +1009,7 @@ static int bcm_sf2_sw_setup(struct dsa_switch *ds) > /* Enable all valid ports and disable those unused */ > for (port = 0; port < priv->hw_params.num_ports; port++) { > /* IMP port receives special treatment */ > - if ((1 << port) & ds->phys_port_mask) > + if ((1 << port) & ds->user_port_mask) > bcm_sf2_port_setup(ds, port, NULL); > else if (dsa_is_cpu_port(ds, port)) > bcm_sf2_imp_setup(ds, port); > @@ -1022,7 +1022,7 @@ static int bcm_sf2_sw_setup(struct dsa_switch *ds) >* 7445D0, since 7445E0 disconnects the internal switch pseudo-PHY such >* that we can use the regular SWITCH_MDIO master controller instead. >* > - * By default, DSA initializes ds->phys_mii_mask to ds->phys_port_mask > + * By default, DSA initializes ds->phys_mii_mask to ds->user_port_mask >* to have a 1:1 mapping between Port address and PHY address in order >* to utilize the slave_mii_bus instance to read from Port PHYs. This is >* not what we want here, so we initialize phys_mii_mask 0 to always > @@ -1284,7 +1284,7 @@ static int bcm_sf2_sw_suspend(struct dsa_switch *ds) >* bcm_sf2_sw_setup >*/ > for (port = 0; port < DSA_MAX_PORTS; port++) { > - if ((1 << port) & ds->phys_port_mask || > + if ((1 << port) & ds->user_port_mask || > dsa_is_cpu_port(ds, port)) > bcm_sf2_port_disable(ds, port, NULL); > } > @@ -1308,7 +1308,7 @@ static int bcm_sf2_sw_resume(struct dsa_switch *ds) > bcm_sf2_gphy_enable_set(ds, true); > > for (port = 0; port < DSA_MAX_PORTS; port++) { > - if ((1 << port) & ds->phys_port_mask) > + if ((1 << port) & ds->user_port_mask) > bcm_sf2_port_setup(ds, port, NULL); > else if (dsa_is_cpu_port(ds, port)) > bcm_sf2_imp_setup(ds, port); > diff --git a/drivers/net/dsa/mv88e6060.c b/drivers/net/dsa/mv88e6060.c > index adb608ccd9aa..0ba0d413cf73 100644 > --- a/drivers/net/dsa/mv88e6060.c > +++ b/drivers/net/dsa/mv88e6060.c > @@ -170,7 +170,7 @@ static int mv88e6060_setup_port(struct dsa_switch *ds, > int p) > REG_WRITE(addr, PORT_VLAN_MAP, > ((p & 0xf) << PORT_VLAN_MAP_DBNUM_SHIFT) | > (dsa_is_cpu_port(ds, p) ? > - ds->phys_port_mask : > + ds->user_port_mask : > BIT(ds->dst->cpu_port))); > > /* Port Association Vector: when learning source addresses > diff --git a/include/net/dsa.h b/include/net/dsa.h > index 165c2e10615c..b36c2a5c206f 100644 > --- a/include/net/dsa.h > +++ b/include/net/dsa.h > @@ -167,7 +167,7 @@ struct dsa_switch { >* Slave mii_bus and devices for the individual ports. >*/ > u32 dsa_port_mask; > - u32 phys_port_mask; > + u32 user_port_mask; > u32 phys_mii_mask; > struct mii_bus *slave_mii_bus; > struct net_device *ports[DSA_MAX_PORTS]; > @@ -185,7 +185,7 @@ static inline bool dsa_is_dsa_port(struct dsa_switch *ds, > int p) > > static inline bool dsa_is_port_initialized(struct dsa_switch *ds, int p) > { > - return ds->phys_port_mask & (1 << p) && ds->ports[p]; > + return ds->user_port_mask & (1 << p) && ds->ports[p]; > } > > static inline u8 dsa_upstream_port(struct dsa_switch *ds) > diff --git
[PATCH net-next 7/7] dsa: mv88e6xxx: Use bus in mv88e6xxx_lookup_name()
mv88e6xxx_lookup_name() returns the model name of a switch at a given address on an MII bus. Using mii_bus to identify the bus rather than the host device is more logical, so change the parameter. Signed-off-by: Andrew Lunn --- v2: Check bus is valid before using it. --- drivers/net/dsa/mv88e6xxx.c | 12 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c index c242ffd8eb09..9985a0cf31f1 100644 --- a/drivers/net/dsa/mv88e6xxx.c +++ b/drivers/net/dsa/mv88e6xxx.c @@ -3068,11 +3068,10 @@ int mv88e6xxx_get_temp_alarm(struct dsa_switch *ds, bool *alarm) } #endif /* CONFIG_NET_DSA_HWMON */ -static char *mv88e6xxx_lookup_name(struct device *host_dev, int sw_addr, +static char *mv88e6xxx_lookup_name(struct mii_bus *bus, int sw_addr, const struct mv88e6xxx_switch_id *table, unsigned int num) { - struct mii_bus *bus = dsa_host_dev_to_mii_bus(host_dev); int i, ret; if (!bus) @@ -3090,7 +3089,8 @@ static char *mv88e6xxx_lookup_name(struct device *host_dev, int sw_addr, /* Look up only the product number */ for (i = 0; i < num; ++i) { if (table[i].id == (ret & PORT_SWITCH_ID_PROD_NUM_MASK)) { - dev_warn(host_dev, "unknown revision %d, using base switch 0x%x\n", + dev_warn(&bus->dev, +"unknown revision %d, using base switch 0x%x\n", ret & PORT_SWITCH_ID_REV_MASK, ret & PORT_SWITCH_ID_PROD_NUM_MASK); return table[i].name; @@ -3106,9 +3106,13 @@ char *mv88e6xxx_drv_probe(struct device *dsa_dev, struct device *host_dev, unsigned int num) { struct mv88e6xxx_priv_state *ps; + struct mii_bus *bus = dsa_host_dev_to_mii_bus(host_dev); char *name; - name = mv88e6xxx_lookup_name(host_dev, sw_addr, table, num); + if (!bus) + return NULL; + + name = mv88e6xxx_lookup_name(bus, sw_addr, table, num); if (name) { ps = devm_kzalloc(dsa_dev, sizeof(*ps), GFP_KERNEL); if (!ps) -- 2.7.0
[PATCH net-next 5/7] net: dsa: Rename DSA probe function.
Rename the function called from the DSA to perform a probe for the switch. This makes the normal _probe() name available for a standard Linux device driver probe function. Signed-off-by: Andrew Lunn --- drivers/net/dsa/bcm_sf2.c | 7 --- drivers/net/dsa/mv88e6060.c | 7 --- drivers/net/dsa/mv88e6123.c | 7 --- drivers/net/dsa/mv88e6131.c | 7 --- drivers/net/dsa/mv88e6171.c | 7 --- drivers/net/dsa/mv88e6352.c | 7 --- 6 files changed, 24 insertions(+), 18 deletions(-) diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c index 7d62802a66d5..50caa525cda3 100644 --- a/drivers/net/dsa/bcm_sf2.c +++ b/drivers/net/dsa/bcm_sf2.c @@ -135,8 +135,9 @@ static int bcm_sf2_sw_get_sset_count(struct dsa_switch *ds) return BCM_SF2_STATS_SIZE; } -static char *bcm_sf2_sw_probe(struct device *dsa_dev, struct device *host_dev, - int sw_addr, void **_priv) +static char *bcm_sf2_sw_drv_probe(struct device *dsa_dev, + struct device *host_dev, + int sw_addr, void **_priv) { struct bcm_sf2_priv *priv; @@ -1370,7 +1371,7 @@ static int bcm_sf2_sw_set_wol(struct dsa_switch *ds, int port, static struct dsa_switch_driver bcm_sf2_switch_driver = { .tag_protocol = DSA_TAG_PROTO_BRCM, - .probe = bcm_sf2_sw_probe, + .probe = bcm_sf2_sw_drv_probe, .setup = bcm_sf2_sw_setup, .set_addr = bcm_sf2_sw_set_addr, .get_phy_flags = bcm_sf2_sw_get_phy_flags, diff --git a/drivers/net/dsa/mv88e6060.c b/drivers/net/dsa/mv88e6060.c index 46fe5dc65a99..adb608ccd9aa 100644 --- a/drivers/net/dsa/mv88e6060.c +++ b/drivers/net/dsa/mv88e6060.c @@ -69,8 +69,9 @@ static char *mv88e6060_get_name(struct mii_bus *bus, int sw_addr) return NULL; } -static char *mv88e6060_probe(struct device *dsa_dev, struct device *host_dev, -int sw_addr, void **_priv) +static char *mv88e6060_drv_probe(struct device *dsa_dev, +struct device *host_dev, +int sw_addr, void **_priv) { struct mii_bus *bus = dsa_host_dev_to_mii_bus(host_dev); struct mv88e6060_priv *priv; @@ -248,7 +249,7 @@ mv88e6060_phy_write(struct dsa_switch *ds, int port, int regnum, u16 val) static struct dsa_switch_driver mv88e6060_switch_driver = { .tag_protocol = DSA_TAG_PROTO_TRAILER, - .probe = mv88e6060_probe, + .probe = mv88e6060_drv_probe, .setup = mv88e6060_setup, .set_addr = mv88e6060_set_addr, .phy_read = mv88e6060_phy_read, diff --git a/drivers/net/dsa/mv88e6123.c b/drivers/net/dsa/mv88e6123.c index 8aaac0894752..c34283d929c4 100644 --- a/drivers/net/dsa/mv88e6123.c +++ b/drivers/net/dsa/mv88e6123.c @@ -29,8 +29,9 @@ static const struct mv88e6xxx_switch_id mv88e6123_table[] = { { PORT_SWITCH_ID_6165_A2, "Marvell 88e6165 (A2)" }, }; -static char *mv88e6123_probe(struct device *dsa_dev, struct device *host_dev, -int sw_addr, void **priv) +static char *mv88e6123_drv_probe(struct device *dsa_dev, +struct device *host_dev, +int sw_addr, void **priv) { return mv88e6xxx_drv_probe(dsa_dev, host_dev, sw_addr, priv, mv88e6123_table, @@ -106,7 +107,7 @@ static int mv88e6123_setup(struct dsa_switch *ds) struct dsa_switch_driver mv88e6123_switch_driver = { .tag_protocol = DSA_TAG_PROTO_EDSA, - .probe = mv88e6123_probe, + .probe = mv88e6123_drv_probe, .setup = mv88e6123_setup, .set_addr = mv88e6xxx_set_addr_indirect, .phy_read = mv88e6xxx_phy_read, diff --git a/drivers/net/dsa/mv88e6131.c b/drivers/net/dsa/mv88e6131.c index 9e6edf94a855..f5d75fce1e96 100644 --- a/drivers/net/dsa/mv88e6131.c +++ b/drivers/net/dsa/mv88e6131.c @@ -25,8 +25,9 @@ static const struct mv88e6xxx_switch_id mv88e6131_table[] = { { PORT_SWITCH_ID_6185, "Marvell 88E6185" }, }; -static char *mv88e6131_probe(struct device *dsa_dev, struct device *host_dev, -int sw_addr, void **priv) +static char *mv88e6131_drv_probe(struct device *dsa_dev, +struct device *host_dev, +int sw_addr, void **priv) { return mv88e6xxx_drv_probe(dsa_dev, host_dev, sw_addr, priv, mv88e6131_table, @@ -163,7 +164,7 @@ mv88e6131_phy_write(struct dsa_switch *ds, struct dsa_switch_driver mv88e6131_switch_driver = { .tag_protocol = DSA_TAG_PROTO_DSA, - .probe = mv88e6131_probe, + .probe = mv88e6131_drv_p
[PATCH net-next 3/7] net: dsa: Remove allocation of driver private memory
The drivers now allocate their own memory for private usage. Remove the allocation from the core code. Signed-off-by: Andrew Lunn Acked-by: Florian Fainelli --- include/net/dsa.h | 1 - net/dsa/dsa.c | 2 +- 2 files changed, 1 insertion(+), 2 deletions(-) diff --git a/include/net/dsa.h b/include/net/dsa.h index 7bc7bd9b5ded..165c2e10615c 100644 --- a/include/net/dsa.h +++ b/include/net/dsa.h @@ -213,7 +213,6 @@ struct dsa_switch_driver { struct list_headlist; enum dsa_tag_protocol tag_protocol; - int priv_size; /* * Probing and setup. diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c index 7ef8a92a9e39..14bf12f637d2 100644 --- a/net/dsa/dsa.c +++ b/net/dsa/dsa.c @@ -402,7 +402,7 @@ dsa_switch_setup(struct dsa_switch_tree *dst, int index, /* * Allocate and initialise switch state. */ - ds = devm_kzalloc(parent, sizeof(*ds) + drv->priv_size, GFP_KERNEL); + ds = devm_kzalloc(parent, sizeof(*ds), GFP_KERNEL); if (ds == NULL) return ERR_PTR(-ENOMEM); -- 2.7.0
[PATCH net-next 6/7] dsa: Rename phys_port_mask to user_port_mask
The phys in phys_port_mask suggests this mask is about PHYs. In fact, it means physical ports. Rename to user_port_mask, indicating user ports of the switch, which is hopefully less confusing. Signed-off-by: Andrew Lunn --- drivers/net/dsa/bcm_sf2.c | 10 +- drivers/net/dsa/mv88e6060.c | 2 +- include/net/dsa.h | 4 ++-- net/dsa/dsa.c | 8 4 files changed, 12 insertions(+), 12 deletions(-) diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c index 50caa525cda3..e1e8070fb9f9 100644 --- a/drivers/net/dsa/bcm_sf2.c +++ b/drivers/net/dsa/bcm_sf2.c @@ -160,7 +160,7 @@ static void bcm_sf2_imp_vlan_setup(struct dsa_switch *ds, int cpu_port) * the same VLAN. */ for (i = 0; i < priv->hw_params.num_ports; i++) { - if (!((1 << i) & ds->phys_port_mask)) + if (!((1 << i) & ds->user_port_mask)) continue; reg = core_readl(priv, CORE_PORT_VLAN_CTL_PORT(i)); @@ -1009,7 +1009,7 @@ static int bcm_sf2_sw_setup(struct dsa_switch *ds) /* Enable all valid ports and disable those unused */ for (port = 0; port < priv->hw_params.num_ports; port++) { /* IMP port receives special treatment */ - if ((1 << port) & ds->phys_port_mask) + if ((1 << port) & ds->user_port_mask) bcm_sf2_port_setup(ds, port, NULL); else if (dsa_is_cpu_port(ds, port)) bcm_sf2_imp_setup(ds, port); @@ -1022,7 +1022,7 @@ static int bcm_sf2_sw_setup(struct dsa_switch *ds) * 7445D0, since 7445E0 disconnects the internal switch pseudo-PHY such * that we can use the regular SWITCH_MDIO master controller instead. * -* By default, DSA initializes ds->phys_mii_mask to ds->phys_port_mask +* By default, DSA initializes ds->phys_mii_mask to ds->user_port_mask * to have a 1:1 mapping between Port address and PHY address in order * to utilize the slave_mii_bus instance to read from Port PHYs. This is * not what we want here, so we initialize phys_mii_mask 0 to always @@ -1284,7 +1284,7 @@ static int bcm_sf2_sw_suspend(struct dsa_switch *ds) * bcm_sf2_sw_setup */ for (port = 0; port < DSA_MAX_PORTS; port++) { - if ((1 << port) & ds->phys_port_mask || + if ((1 << port) & ds->user_port_mask || dsa_is_cpu_port(ds, port)) bcm_sf2_port_disable(ds, port, NULL); } @@ -1308,7 +1308,7 @@ static int bcm_sf2_sw_resume(struct dsa_switch *ds) bcm_sf2_gphy_enable_set(ds, true); for (port = 0; port < DSA_MAX_PORTS; port++) { - if ((1 << port) & ds->phys_port_mask) + if ((1 << port) & ds->user_port_mask) bcm_sf2_port_setup(ds, port, NULL); else if (dsa_is_cpu_port(ds, port)) bcm_sf2_imp_setup(ds, port); diff --git a/drivers/net/dsa/mv88e6060.c b/drivers/net/dsa/mv88e6060.c index adb608ccd9aa..0ba0d413cf73 100644 --- a/drivers/net/dsa/mv88e6060.c +++ b/drivers/net/dsa/mv88e6060.c @@ -170,7 +170,7 @@ static int mv88e6060_setup_port(struct dsa_switch *ds, int p) REG_WRITE(addr, PORT_VLAN_MAP, ((p & 0xf) << PORT_VLAN_MAP_DBNUM_SHIFT) | (dsa_is_cpu_port(ds, p) ? - ds->phys_port_mask : + ds->user_port_mask : BIT(ds->dst->cpu_port))); /* Port Association Vector: when learning source addresses diff --git a/include/net/dsa.h b/include/net/dsa.h index 165c2e10615c..b36c2a5c206f 100644 --- a/include/net/dsa.h +++ b/include/net/dsa.h @@ -167,7 +167,7 @@ struct dsa_switch { * Slave mii_bus and devices for the individual ports. */ u32 dsa_port_mask; - u32 phys_port_mask; + u32 user_port_mask; u32 phys_mii_mask; struct mii_bus *slave_mii_bus; struct net_device *ports[DSA_MAX_PORTS]; @@ -185,7 +185,7 @@ static inline bool dsa_is_dsa_port(struct dsa_switch *ds, int p) static inline bool dsa_is_port_initialized(struct dsa_switch *ds, int p) { - return ds->phys_port_mask & (1 << p) && ds->ports[p]; + return ds->user_port_mask & (1 << p) && ds->ports[p]; } static inline u8 dsa_upstream_port(struct dsa_switch *ds) diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c index 14bf12f637d2..20293d1a5a93 100644 --- a/net/dsa/dsa.c +++ b/net/dsa/dsa.c @@ -246,7 +246,7 @@ static int dsa_switch_setup_one(struct dsa_switch *ds, struct device *parent) } else if (!strcmp(name, "dsa")) { ds->dsa_port_mask |= 1 << i; } else { - ds->phys_port_mask |= 1 << i; + ds->user_port
[PATCH net-next 4/7] net: dsa: Keep the mii bus and address in the private structure
Rather than looking up the mii bus and address every time, do it once at probe, and keep it in the private structure. Centralise this probe code in mv88e6xxx. Signed-off-by: Andrew Lunn Acked-by: Florian Fainelli --- v2: Add check for valid mii_bus. --- drivers/net/dsa/mv88e6060.c | 44 ++-- drivers/net/dsa/mv88e6060.h | 11 +++ drivers/net/dsa/mv88e6123.c | 15 +++ drivers/net/dsa/mv88e6131.c | 15 +++ drivers/net/dsa/mv88e6171.c | 15 +++ drivers/net/dsa/mv88e6352.c | 15 +++ drivers/net/dsa/mv88e6xxx.c | 43 +-- drivers/net/dsa/mv88e6xxx.h | 14 +++--- 8 files changed, 89 insertions(+), 83 deletions(-) diff --git a/drivers/net/dsa/mv88e6060.c b/drivers/net/dsa/mv88e6060.c index 41195f1417f1..46fe5dc65a99 100644 --- a/drivers/net/dsa/mv88e6060.c +++ b/drivers/net/dsa/mv88e6060.c @@ -19,12 +19,9 @@ static int reg_read(struct dsa_switch *ds, int addr, int reg) { - struct mii_bus *bus = dsa_host_dev_to_mii_bus(ds->master_dev); + struct mv88e6060_priv *priv = ds_to_priv(ds); - if (bus == NULL) - return -EINVAL; - - return mdiobus_read_nested(bus, ds->pd->sw_addr + addr, reg); + return mdiobus_read_nested(priv->bus, priv->sw_addr + addr, reg); } #define REG_READ(addr, reg)\ @@ -40,12 +37,9 @@ static int reg_read(struct dsa_switch *ds, int addr, int reg) static int reg_write(struct dsa_switch *ds, int addr, int reg, u16 val) { - struct mii_bus *bus = dsa_host_dev_to_mii_bus(ds->master_dev); - - if (bus == NULL) - return -EINVAL; + struct mv88e6060_priv *priv = ds_to_priv(ds); - return mdiobus_write_nested(bus, ds->pd->sw_addr + addr, reg, val); + return mdiobus_write_nested(priv->bus, priv->sw_addr + addr, reg, val); } #define REG_WRITE(addr, reg, val) \ @@ -57,16 +51,10 @@ static int reg_write(struct dsa_switch *ds, int addr, int reg, u16 val) return __ret; \ }) -static char *mv88e6060_probe(struct device *dsa_dev, struct device *host_dev, -int sw_addr, void **priv) +static char *mv88e6060_get_name(struct mii_bus *bus, int sw_addr) { - struct mii_bus *bus = dsa_host_dev_to_mii_bus(host_dev); int ret; - *priv = NULL; - if (bus == NULL) - return NULL; - ret = mdiobus_read(bus, sw_addr + REG_PORT(0), PORT_SWITCH_ID); if (ret >= 0) { if (ret == PORT_SWITCH_ID_6060) @@ -81,6 +69,26 @@ static char *mv88e6060_probe(struct device *dsa_dev, struct device *host_dev, return NULL; } +static char *mv88e6060_probe(struct device *dsa_dev, struct device *host_dev, +int sw_addr, void **_priv) +{ + struct mii_bus *bus = dsa_host_dev_to_mii_bus(host_dev); + struct mv88e6060_priv *priv; + char *name; + + name = mv88e6060_get_name(bus, sw_addr); + if (name) { + priv = devm_kzalloc(dsa_dev, sizeof(*priv), GFP_KERNEL); + if (!priv) + return NULL; + *_priv = priv; + priv->bus = bus; + priv->sw_addr = sw_addr; + } + + return name; +} + static int mv88e6060_switch_reset(struct dsa_switch *ds) { int i; @@ -176,8 +184,8 @@ static int mv88e6060_setup_port(struct dsa_switch *ds, int p) static int mv88e6060_setup(struct dsa_switch *ds) { - int i; int ret; + int i; ret = mv88e6060_switch_reset(ds); if (ret < 0) diff --git a/drivers/net/dsa/mv88e6060.h b/drivers/net/dsa/mv88e6060.h index cc9b2ed4aff4..10249bd16292 100644 --- a/drivers/net/dsa/mv88e6060.h +++ b/drivers/net/dsa/mv88e6060.h @@ -108,4 +108,15 @@ #define GLOBAL_ATU_MAC_23 0x0e #define GLOBAL_ATU_MAC_45 0x0f +struct mv88e6060_priv { + /* MDIO bus and address on bus to use. When in single chip +* mode, address is 0, and the switch uses multiple addresses +* on the bus. When in multi-chip mode, the switch uses a +* single address which contains two registers used for +* indirect access to more registers. +*/ + struct mii_bus *bus; + int sw_addr; +}; + #endif diff --git a/drivers/net/dsa/mv88e6123.c b/drivers/net/dsa/mv88e6123.c index bcab3ef22448..8aaac0894752 100644 --- a/drivers/net/dsa/mv88e6123.c +++ b/drivers/net/dsa/mv88e6123.c @@ -32,18 +32,9 @@ static const struct mv88e6xxx_switch_id mv88e6123_table[] = { static char *mv88e6123_probe(struct device *dsa_dev, struct device *host_dev, int sw_addr, void **priv) { - struct mv88e6xxx_priv_state *ps; - char *name; - - name = mv88e6xxx_lookup_name(host_dev, sw_addr, mv88e6123_table, -
[PATCH net-next 1/7] net: dsa: Pass the dsa device to the switch drivers
By passing a device structure to the switch devices, it allows them to use devm_* methods for resource management. Signed-off-by: Andrew Lunn Acked-by: Florian Fainelli --- drivers/net/dsa/bcm_sf2.c | 3 ++- drivers/net/dsa/mv88e6060.c | 3 ++- drivers/net/dsa/mv88e6123.c | 3 ++- drivers/net/dsa/mv88e6131.c | 3 ++- drivers/net/dsa/mv88e6171.c | 3 ++- drivers/net/dsa/mv88e6352.c | 3 ++- include/net/dsa.h | 3 ++- net/dsa/dsa.c | 7 --- 8 files changed, 18 insertions(+), 10 deletions(-) diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c index 780f22876538..18a79579141f 100644 --- a/drivers/net/dsa/bcm_sf2.c +++ b/drivers/net/dsa/bcm_sf2.c @@ -135,7 +135,8 @@ static int bcm_sf2_sw_get_sset_count(struct dsa_switch *ds) return BCM_SF2_STATS_SIZE; } -static char *bcm_sf2_sw_probe(struct device *host_dev, int sw_addr) +static char *bcm_sf2_sw_probe(struct device *dsa_dev, struct device *host_dev, + int sw_addr) { return "Broadcom Starfighter 2"; } diff --git a/drivers/net/dsa/mv88e6060.c b/drivers/net/dsa/mv88e6060.c index 0527f485c3dc..34d0f9fe19db 100644 --- a/drivers/net/dsa/mv88e6060.c +++ b/drivers/net/dsa/mv88e6060.c @@ -57,7 +57,8 @@ static int reg_write(struct dsa_switch *ds, int addr, int reg, u16 val) return __ret; \ }) -static char *mv88e6060_probe(struct device *host_dev, int sw_addr) +static char *mv88e6060_probe(struct device *dsa_dev, struct device *host_dev, +int sw_addr) { struct mii_bus *bus = dsa_host_dev_to_mii_bus(host_dev); int ret; diff --git a/drivers/net/dsa/mv88e6123.c b/drivers/net/dsa/mv88e6123.c index 69a6f79dcb10..ede4c6f0b31a 100644 --- a/drivers/net/dsa/mv88e6123.c +++ b/drivers/net/dsa/mv88e6123.c @@ -29,7 +29,8 @@ static const struct mv88e6xxx_switch_id mv88e6123_table[] = { { PORT_SWITCH_ID_6165_A2, "Marvell 88e6165 (A2)" }, }; -static char *mv88e6123_probe(struct device *host_dev, int sw_addr) +static char *mv88e6123_probe(struct device *dsa_dev, struct device *host_dev, +int sw_addr) { return mv88e6xxx_lookup_name(host_dev, sw_addr, mv88e6123_table, ARRAY_SIZE(mv88e6123_table)); diff --git a/drivers/net/dsa/mv88e6131.c b/drivers/net/dsa/mv88e6131.c index 24070287c2bc..bfadfd2cbb8d 100644 --- a/drivers/net/dsa/mv88e6131.c +++ b/drivers/net/dsa/mv88e6131.c @@ -25,7 +25,8 @@ static const struct mv88e6xxx_switch_id mv88e6131_table[] = { { PORT_SWITCH_ID_6185, "Marvell 88E6185" }, }; -static char *mv88e6131_probe(struct device *host_dev, int sw_addr) +static char *mv88e6131_probe(struct device *dsa_dev, struct device *host_dev, +int sw_addr) { return mv88e6xxx_lookup_name(host_dev, sw_addr, mv88e6131_table, ARRAY_SIZE(mv88e6131_table)); diff --git a/drivers/net/dsa/mv88e6171.c b/drivers/net/dsa/mv88e6171.c index 0e62f3b5bc81..fb35d3ac1644 100644 --- a/drivers/net/dsa/mv88e6171.c +++ b/drivers/net/dsa/mv88e6171.c @@ -24,7 +24,8 @@ static const struct mv88e6xxx_switch_id mv88e6171_table[] = { { PORT_SWITCH_ID_6351, "Marvell 88E6351" }, }; -static char *mv88e6171_probe(struct device *host_dev, int sw_addr) +static char *mv88e6171_probe(struct device *dsa_dev, struct device *host_dev, +int sw_addr) { return mv88e6xxx_lookup_name(host_dev, sw_addr, mv88e6171_table, ARRAY_SIZE(mv88e6171_table)); diff --git a/drivers/net/dsa/mv88e6352.c b/drivers/net/dsa/mv88e6352.c index 7f452e4a04a5..577ab2cfa944 100644 --- a/drivers/net/dsa/mv88e6352.c +++ b/drivers/net/dsa/mv88e6352.c @@ -37,7 +37,8 @@ static const struct mv88e6xxx_switch_id mv88e6352_table[] = { { PORT_SWITCH_ID_6352_A1, "Marvell 88E6352 (A1)" }, }; -static char *mv88e6352_probe(struct device *host_dev, int sw_addr) +static char *mv88e6352_probe(struct device *dsa_dev, struct device *host_dev, +int sw_addr) { return mv88e6xxx_lookup_name(host_dev, sw_addr, mv88e6352_table, ARRAY_SIZE(mv88e6352_table)); diff --git a/include/net/dsa.h b/include/net/dsa.h index 18d1be3ad62d..0f9f6f38f255 100644 --- a/include/net/dsa.h +++ b/include/net/dsa.h @@ -212,7 +212,8 @@ struct dsa_switch_driver { /* * Probing and setup. */ - char*(*probe)(struct device *host_dev, int sw_addr); + char*(*probe)(struct device *dsa_dev, struct device *host_dev, + int sw_addr); int (*setup)(struct dsa_switch *ds); int (*set_addr)(struct dsa_switch *ds, u8 *addr); u32 (*get_phy_flags)(struct dsa_switch *ds, int port); diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c index c28c47463b7e..c06275311cb2 100644 --- a/net/dsa/dsa.c +++ b
[PATCH net-next 0/7] DSA refactoring: set 1
There has been a long running effort to refractor DSA probing to make the switches true linux devices. Here are a small collection of patches moving in this direction. Most have been seen before. We take a little step forward by passing the dsa device point to the driver, thus allowing it to perform resource allocations using the normal mechanisms. This device structure will later be replaced by the devices own device structure. Future patches will add a true driver probe function, so we rename the current probe function, cleaning up the namespace. phys_port_mask continually confuses me, thinking it is about PHYs. But it is actually about ports to the outside world, user ports. So rename it. Lots more patches yet to follow, this is just doing some ground work. Andrew Lunn (7): net: dsa: Pass the dsa device to the switch drivers net: dsa: Have the switch driver allocate there own private memory net: dsa: Remove allocation of driver private memory net: dsa: Keep the mii bus and address in the private structure net: dsa: Rename DSA probe function. dsa: Rename phys_port_mask to user_port_mask dsa: mv88e6xxx: Use bus in mv88e6xxx_lookup_name() drivers/net/dsa/bcm_sf2.c | 24 +--- drivers/net/dsa/mv88e6060.c | 47 +++--- drivers/net/dsa/mv88e6060.h | 11 + drivers/net/dsa/mv88e6123.c | 14 +++- drivers/net/dsa/mv88e6131.c | 14 +++- drivers/net/dsa/mv88e6171.c | 14 +++- drivers/net/dsa/mv88e6352.c | 14 +++- drivers/net/dsa/mv88e6xxx.c | 55 +++-- drivers/net/dsa/mv88e6xxx.h | 17 +++--- include/net/dsa.h | 16 - net/dsa/dsa.c | 19 +--- 11 files changed, 166 insertions(+), 79 deletions(-) -- 2.7.0
[PATCH net-next 2/7] net: dsa: Have the switch driver allocate there own private memory
Now the switch devices have a dev pointer, make use of it for allocating the drivers private data structures using a devm_kzalloc(). Signed-off-by: Andrew Lunn Acked-by: Florian Fainelli --- v2: Added missing assignment of priv to ds->priv. --- drivers/net/dsa/bcm_sf2.c | 10 -- drivers/net/dsa/mv88e6060.c | 3 ++- drivers/net/dsa/mv88e6123.c | 17 ++--- drivers/net/dsa/mv88e6131.c | 17 ++--- drivers/net/dsa/mv88e6171.c | 17 ++--- drivers/net/dsa/mv88e6352.c | 17 ++--- drivers/net/dsa/mv88e6xxx.c | 6 -- drivers/net/dsa/mv88e6xxx.h | 3 +++ include/net/dsa.h | 10 -- net/dsa/dsa.c | 8 +--- 10 files changed, 86 insertions(+), 22 deletions(-) diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c index 18a79579141f..7d62802a66d5 100644 --- a/drivers/net/dsa/bcm_sf2.c +++ b/drivers/net/dsa/bcm_sf2.c @@ -136,8 +136,15 @@ static int bcm_sf2_sw_get_sset_count(struct dsa_switch *ds) } static char *bcm_sf2_sw_probe(struct device *dsa_dev, struct device *host_dev, - int sw_addr) + int sw_addr, void **_priv) { + struct bcm_sf2_priv *priv; + + priv = devm_kzalloc(dsa_dev, sizeof(*priv), GFP_KERNEL); + if (!priv) + return NULL; + *_priv = priv; + return "Broadcom Starfighter 2"; } @@ -1363,7 +1370,6 @@ static int bcm_sf2_sw_set_wol(struct dsa_switch *ds, int port, static struct dsa_switch_driver bcm_sf2_switch_driver = { .tag_protocol = DSA_TAG_PROTO_BRCM, - .priv_size = sizeof(struct bcm_sf2_priv), .probe = bcm_sf2_sw_probe, .setup = bcm_sf2_sw_setup, .set_addr = bcm_sf2_sw_set_addr, diff --git a/drivers/net/dsa/mv88e6060.c b/drivers/net/dsa/mv88e6060.c index 34d0f9fe19db..41195f1417f1 100644 --- a/drivers/net/dsa/mv88e6060.c +++ b/drivers/net/dsa/mv88e6060.c @@ -58,11 +58,12 @@ static int reg_write(struct dsa_switch *ds, int addr, int reg, u16 val) }) static char *mv88e6060_probe(struct device *dsa_dev, struct device *host_dev, -int sw_addr) +int sw_addr, void **priv) { struct mii_bus *bus = dsa_host_dev_to_mii_bus(host_dev); int ret; + *priv = NULL; if (bus == NULL) return NULL; diff --git a/drivers/net/dsa/mv88e6123.c b/drivers/net/dsa/mv88e6123.c index ede4c6f0b31a..bcab3ef22448 100644 --- a/drivers/net/dsa/mv88e6123.c +++ b/drivers/net/dsa/mv88e6123.c @@ -30,10 +30,20 @@ static const struct mv88e6xxx_switch_id mv88e6123_table[] = { }; static char *mv88e6123_probe(struct device *dsa_dev, struct device *host_dev, -int sw_addr) +int sw_addr, void **priv) { - return mv88e6xxx_lookup_name(host_dev, sw_addr, mv88e6123_table, + struct mv88e6xxx_priv_state *ps; + char *name; + + name = mv88e6xxx_lookup_name(host_dev, sw_addr, mv88e6123_table, ARRAY_SIZE(mv88e6123_table)); + if (name) { + ps = devm_kzalloc(dsa_dev, sizeof(*ps), GFP_KERNEL); + if (!ps) + return NULL; + *priv = ps; + } + return name; } static int mv88e6123_setup_global(struct dsa_switch *ds) @@ -74,6 +84,8 @@ static int mv88e6123_setup(struct dsa_switch *ds) struct mv88e6xxx_priv_state *ps = ds_to_priv(ds); int ret; + ps->ds = ds; + ret = mv88e6xxx_setup_common(ds); if (ret < 0) return ret; @@ -103,7 +115,6 @@ static int mv88e6123_setup(struct dsa_switch *ds) struct dsa_switch_driver mv88e6123_switch_driver = { .tag_protocol = DSA_TAG_PROTO_EDSA, - .priv_size = sizeof(struct mv88e6xxx_priv_state), .probe = mv88e6123_probe, .setup = mv88e6123_setup, .set_addr = mv88e6xxx_set_addr_indirect, diff --git a/drivers/net/dsa/mv88e6131.c b/drivers/net/dsa/mv88e6131.c index bfadfd2cbb8d..b9f9b009f65a 100644 --- a/drivers/net/dsa/mv88e6131.c +++ b/drivers/net/dsa/mv88e6131.c @@ -26,10 +26,20 @@ static const struct mv88e6xxx_switch_id mv88e6131_table[] = { }; static char *mv88e6131_probe(struct device *dsa_dev, struct device *host_dev, -int sw_addr) +int sw_addr, void **priv) { - return mv88e6xxx_lookup_name(host_dev, sw_addr, mv88e6131_table, + struct mv88e6xxx_priv_state *ps; + char *name; + + name = mv88e6xxx_lookup_name(host_dev, sw_addr, mv88e6131_table, ARRAY_SIZE(mv88e6131_table)); + if (name) { + ps = devm_kzalloc(dsa_dev, sizeof(*ps), GFP_KERNEL); + if (!ps) +
Re: FWD: [PATCH v2] Marvell phy: add fiber status check for some components
On 04/04/16 06:25, Andrew Lunn wrote: >> >From 564b767163d19355a3b5efaad195e93796570c71 Mon Sep 17 00:00:00 2001 >> From: Charles-Antoine Couret >> Date: Fri, 1 Apr 2016 16:16:35 +0200 >> Subject: [PATCH] Marvell phy: add fiber status check for some components >> >> Marvell's phy could have two modes: fiber and copper. Currently, the driver >> checks only the copper mode registers to get the status link which could be >> wrong. >> >> This commit add a handler to check fiber then copper status link. >> If the fiber link is activated, the driver would use this information. >> Else, it would use the copper status. > > Hi Florian > > What do you think about this? > > This works for basic status information. But what about other ethtool > options? Setting the speed and duplex, turning pause on/off, etc. Agreed, it seems like a PHY configured for fiber will need to provide these informations differently, or does the standard BMSR register provide accurate information already? > > Do we actually need to stay on page 1 if fibre is in use? How do we > initially change to page 1 when the fibre link is still down? I also do not feel very comfortable with reading the fiber status first, and then copper and then combine these two. At the very best, could we do something like: - identify if the PHY is configured for fiber in drv->probe or drv->config_init, retain that information - have two paths in drv->read_status which take care of reading one status or the other? Are there other side effects for other register accesses (say, statistics, or auto-negotiation) that need to be fiber vs. copper aware? > > Should we be using the old mechanism to swap between TP, BNC and AUI > to swap between copper and fibre? Did you mean using ethtool -s port fibre for instance? -- Florian
Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
On Mon, 11 Apr 2016 09:53:54 -0700 Eric Dumazet wrote: > On Mon, 2016-04-11 at 18:19 +0200, Jesper Dangaard Brouer wrote: > > > Drivers also do tricks where they fallback to smaller order pages. E.g. > > lookup function mlx4_alloc_pages(). I've tried to simulate that > > function here: > > https://github.com/netoptimizer/prototype-kernel/blob/91d323fc53/kernel/mm/bench/page_bench01.c#L69 > > > > We use order-0 pages on mlx4 at Google, as order-3 pages are very > dangerous for some kind of attacks... Interesting! > An out of order TCP packet can hold an order-3 pages, while claiming to > use 1.5 KBvia skb->truesize. > > order-0 only pages allow the page recycle trick used by Intel driver, > and we hardly see any page allocations in typical workloads. Yes, I looked at the Intel ixgbe drivers page recycle trick. It is actually quite cool, but code wise it is a little hard to follow. I started to look at the variant in i40e, specifically function i40e_clean_rx_irq_ps() explains it a bit more explicit. > While order-3 pages are 'nice' for friendly datacenter kind of > traffic, they also are a higher risk on hosts connected to the wild > Internet. > > Maybe I should upstream this patch ;) Definitely! Does this patch also include a page recycle trick? Else how do you get around the cost of allocating a single order-0 page? -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
[PATCH net-next] net: mdio: Fix lockdep falls positive splat
MDIO devices can be stacked upon each other. The current code supports two levels, which until recently has been enough for a DSA mdio bus on top of another bus. Now we have hardware which has an MDIO mux in the middle. Define an MDIO MUTEX class with three levels. Signed-off-by: Andrew Lunn --- drivers/net/phy/mdio-mux.c | 10 ++ drivers/net/phy/mdio_bus.c | 4 ++-- include/linux/mdio.h | 11 +++ 3 files changed, 15 insertions(+), 10 deletions(-) diff --git a/drivers/net/phy/mdio-mux.c b/drivers/net/phy/mdio-mux.c index 308ade0eb1b6..5c81d6faf304 100644 --- a/drivers/net/phy/mdio-mux.c +++ b/drivers/net/phy/mdio-mux.c @@ -45,13 +45,7 @@ static int mdio_mux_read(struct mii_bus *bus, int phy_id, int regnum) struct mdio_mux_parent_bus *pb = cb->parent; int r; - /* In theory multiple mdio_mux could be stacked, thus creating -* more than a single level of nesting. But in practice, -* SINGLE_DEPTH_NESTING will cover the vast majority of use -* cases. We use it, instead of trying to handle the general -* case. -*/ - mutex_lock_nested(&pb->mii_bus->mdio_lock, SINGLE_DEPTH_NESTING); + mutex_lock_nested(&pb->mii_bus->mdio_lock, MDIO_MUTEX_MUX); r = pb->switch_fn(pb->current_child, cb->bus_number, pb->switch_data); if (r) goto out; @@ -76,7 +70,7 @@ static int mdio_mux_write(struct mii_bus *bus, int phy_id, int r; - mutex_lock_nested(&pb->mii_bus->mdio_lock, SINGLE_DEPTH_NESTING); + mutex_lock_nested(&pb->mii_bus->mdio_lock, MDIO_MUTEX_MUX); r = pb->switch_fn(pb->current_child, cb->bus_number, pb->switch_data); if (r) goto out; diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c index 0cba64f1ecf4..751202a285a6 100644 --- a/drivers/net/phy/mdio_bus.c +++ b/drivers/net/phy/mdio_bus.c @@ -457,7 +457,7 @@ int mdiobus_read_nested(struct mii_bus *bus, int addr, u32 regnum) BUG_ON(in_interrupt()); - mutex_lock_nested(&bus->mdio_lock, SINGLE_DEPTH_NESTING); + mutex_lock_nested(&bus->mdio_lock, MDIO_MUTEX_NESTED); retval = bus->read(bus, addr, regnum); mutex_unlock(&bus->mdio_lock); @@ -509,7 +509,7 @@ int mdiobus_write_nested(struct mii_bus *bus, int addr, u32 regnum, u16 val) BUG_ON(in_interrupt()); - mutex_lock_nested(&bus->mdio_lock, SINGLE_DEPTH_NESTING); + mutex_lock_nested(&bus->mdio_lock, MDIO_MUTEX_NESTED); err = bus->write(bus, addr, regnum, val); mutex_unlock(&bus->mdio_lock); diff --git a/include/linux/mdio.h b/include/linux/mdio.h index 5bfd99d1a40a..bf9d1d750693 100644 --- a/include/linux/mdio.h +++ b/include/linux/mdio.h @@ -13,6 +13,17 @@ struct mii_bus; +/* Multiple levels of nesting are possible. However typically this is + * limited to nested DSA like layer, a MUX layer, and the normal + * user. Instead of trying to handle the general case, just define + * these cases. + */ +enum mdio_mutex_lock_class { + MDIO_MUTEX_NORMAL, + MDIO_MUTEX_MUX, + MDIO_MUTEX_NESTED, +}; + struct mdio_device { struct device dev; -- 2.7.0
Re: [PATCH 0/9] RxRPC: 2nd rewrite part 1
From: David Howells Date: Thu, 07 Apr 2016 17:22:56 +0100 > Okay, I'm in the process of rewriting the RxRPC rewrite. The primary aim of > this second rewrite is to strictly control the number of active connections we > know about and to get rid of connections we don't need much more quickly. > > On top of this, there are fixes to the protocol handling which will all occur > in later parts. Series applied, but I had to fix up some trailing whitespace (reported by GIT) by hand. Thanks.
Re: [PATCH net-next 0/2] fix udp pull header breakage
From: Willem de Bruijn Date: Thu, 7 Apr 2016 11:44:57 -0400 > Commit e6afc8ace6dd ("udp: remove headers from UDP packets before > queueing") modified udp receive processing to pull headers before > enqueue and to not expect them on dequeue. > > The patch missed protocols on top of udp with in-kernel > implementations that have their own skb_recv_datagram calls and > dequeue logic. Modify these datapaths to also no longer expect > a udp header at skb->data. > > Sunrpc and rxrpc are the only two protocols that call this > function and contain references to udphr (some others, like tipc, > are based on encap_rcv, which acts before enqueue, before the > the header pull). Series applied, thanks for fixing this regression so quickly.
Re: [PATCH RFT 1/2] phylib: add device reset GPIO support
On 04/11/2016 10:25 PM, Rob Herring wrote: The PHY devices sometimes do have their reset signal (maybe even power supply?) tied to some GPIO and sometimes it also does happen that a boot loader does not leave it deasserted. So far this issue has been attacked from (as I believe) a wrong angle: by teaching the MAC driver to manipulate the GPIO in question; that solution, when applied to the device trees, led to adding the PHY reset GPIO properties to the MAC device node, with one exception: Cadence MACB driver which could handle the "reset-gpios" prop in a PHY device subnode. I believe that the correct approach is to teach the 'phylib' to get the MDIO device reset GPIO from the device tree node corresponding to this device -- which this patch is doing... Note that I had to modify the AT803x PHY driver as it would stop working otherwise as it made use of the reset GPIO for its own purposes... Lots of double spaces in here. Please fix. Oh, it's you again! :-D Signed-off-by: Sergei Shtylyov --- Documentation/devicetree/bindings/net/phy.txt |2 + drivers/net/phy/at803x.c | 19 ++ drivers/net/phy/mdio_bus.c|4 +++ drivers/net/phy/mdio_device.c | 27 +++-- drivers/net/phy/phy_device.c | 33 -- drivers/of/of_mdio.c | 16 include/linux/mdio.h |3 ++ include/linux/phy.h |5 +++ 8 files changed, 89 insertions(+), 20 deletions(-) [...] Index: net-next/drivers/of/of_mdio.c === --- net-next.orig/drivers/of/of_mdio.c +++ net-next/drivers/of/of_mdio.c @@ -44,6 +44,7 @@ static int of_get_phy_id(struct device_n static int of_mdiobus_register_phy(struct mii_bus *mdio, struct device_node *child, u32 addr) { + struct gpio_desc *gpiod; struct phy_device *phy; bool is_c45; int rc; @@ -52,10 +53,17 @@ static int of_mdiobus_register_phy(struc is_c45 = of_device_is_compatible(child, "ethernet-phy-ieee802.3-c45"); + gpiod = fwnode_get_named_gpiod(&child->fwnode, "reset-gpios"); Calling fwnode_* functions in a DT specific file/function? That doesn't make sense. Really?! 8-) Where is a DT-only analog I wonder... MBR, Sergei
Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
On Mon, 11 Apr 2016 19:07:03 +0100 Mel Gorman wrote: > On Mon, Apr 11, 2016 at 06:19:07PM +0200, Jesper Dangaard Brouer wrote: > > > http://git.kernel.org/cgit/linux/kernel/git/mel/linux.git/log/?h=mm-vmscan-node-lru-v4r5 > > > > > > > The cost decreased to: 228 cycles(tsc), but there are some variations, > > sometimes it increase to 238 cycles(tsc). > > > > In the free path, a bulk pcp free adds to the cycles. In the alloc path, > a refill of the pcp lists costs quite a bit. Either option introduces > variances. The bulk free path can be optimised a little so I chucked > some additional patches at it that are not released yet but I suspect the > benefit will be marginal. The real heavy costs there are splitting/merging > buddies. Fixing that is much more fundamental but even fronting the allocator > with a new recycle allocator would not offset that as the refill of the > page-recycling thing would incur high costs. > Yes, re-filling page-pool (in the non-steady state) could be problematic for performance. That is why I'm very motivated in helping out with a bulk alloc/free scheme for the page allocator. > > Nice, but there is still a looong way to my performance target, where I > > can spend 201 cycles for the entire forwarding path > > > > While I accept the cost is still too high, I think the effort should still > be spent on improving the allocator in general than trying to bypass it. > I do think improving the page allocator is very important work. I just don't see how we can ever reach my performance target, without a page-pool recycle facility. I work in the area, where I think the cost of a single atomic operation is too high. I work on amortizing the individual atomic operations. That is what I did for the SLUB allocator, with the bulk API. see: Commit d0ecd894e3d5 ("slub: optimize bulk slowpath free by detached freelist") https://git.kernel.org/torvalds/c/d0ecd894e3d5 Commit fbd02630c6e3 ("slub: initial bulk free implementation") https://git.kernel.org/torvalds/c/fbd02630c6e3 This is now also used in the network stack: Commit 3134b9f019f2 ("Merge branch 'net-mitigate-kmem_free-slowpath'") Commit a3a8749d34d8 ("ixgbe: bulk free SKBs during TX completion cleanup cycle") > > > This is an unreleased series that contains both the page allocator > > > optimisations and the one-LRU-per-node series which in combination remove > > > a > > > lot of code from the page allocator fast paths. I have no data on how the > > > combined series behaves but each series individually is known to improve > > > page allocator performance. > > > > > > Once you have that, do a hackjob to remove the debugging checks from both > > > the > > > alloc and free path and see what that leaves. They could be bypassed > > > properly > > > with a __GFP_NOACCT flag used only by drivers that absolutely require > > > pages > > > as quickly as possible and willing to be less safe to get that > > > performance. > > > > I would be interested in testing/benchmarking a patch where you remove > > the debugging checks... > > > > Right now, I'm not proposing to remove the debugging checks despite their > cost. They catch really difficult problems in the field unfortunately > including corruption from buggy hardware. A GFP flag that disables them > for a very specific case would be ok but I expect it to be resisted by > others if it's done for the general case. Even a static branch for runtime > debugging checks may be resisted. > > Even if GFP flags are tight, I have a patch that deletes __GFP_COLD on > the grounds it is of questionable value. Applying that would free a flag > for __GFP_NOACCT that bypasses debugging checks and statistic updates. > That would work for the allocation side at least but doing the same for > the free side would be hard (potentially impossible) to do transparently > for drivers. Before spending too much work on something, I usually try to determine what the maximum benefit of something would be. Thus, I propose you create a patch that hack remove all the debug checks that you think could be beneficial to remove. And then benchmark it yourself or send it to me for benchmarking... that is the quickest way to determine if this is worth spending time on. > > You are also welcome to try out my benchmarking modules yourself: > > > > https://github.com/netoptimizer/prototype-kernel/blob/master/getting_started.rst > > > > I took a quick look and functionally it's similar to the systemtap-based > microbenchmark I'm using in mmtests so I don't think we have a problem > with reproduction at the moment. > > > > Be aware that compound order allocs like this are a double edged sword as > > > it'll be fast sometimes and other times require reclaim/compaction which > > > can stall for prolonged periods of time. > > > > Yes, I've notice that there can be a fairly high variation, when doing > > compound order allocs, which is not so nice! I really don't like these
Re: [PATCH RFC 5/5] ARM: dts: sun8i: Enable Ethernet controller on the Orange PI PC
On 04/04/16 09:22, Chen-Yu Tsai wrote: > The Orange PI PC uses the H3's internal Ethernet PHY with the EMAC > Ethernet controller. > > Set a proper address for the PHY and enable the EMAC. > > Signed-off-by: Chen-Yu Tsai > --- > > This patch depends on "ARM: dts: sun8i-h3: Add Ethernet controller device", > which uses an binding still in development. > > Do not merge. > > --- > arch/arm/boot/dts/sun8i-h3-orangepi-pc.dts | 14 ++ > 1 file changed, 14 insertions(+) > > diff --git a/arch/arm/boot/dts/sun8i-h3-orangepi-pc.dts > b/arch/arm/boot/dts/sun8i-h3-orangepi-pc.dts > index daf50b9a6657..f01e10df812a 100644 > --- a/arch/arm/boot/dts/sun8i-h3-orangepi-pc.dts > +++ b/arch/arm/boot/dts/sun8i-h3-orangepi-pc.dts > @@ -102,6 +102,20 @@ > status = "okay"; > }; > > +&ephy { > + allwinner,ephy-addr = <0x1>; > +}; > + > +&emac { > + phy = <&phy1>; > + phy-mode = "mii"; > + status = "okay"; > + > + phy1: ethernet-phy@1 { > + reg = <1>; > + }; > +}; As commented in patch 1, the fact that you have to put the Ethernet PHY address twice here is not really a good thing, because they cannot be dissociated from eath other, I would rather have a standard Ethernet PHY DT node represent the desired PHY address, and have your glue/SHIM SUN8I_H3_EMAC driver scan the Device Tree to know what address to program for the Ethernet PHY. -- Florian
Re: [PATCH RFT 1/2] phylib: add device reset GPIO support
On Sat, Apr 09, 2016 at 01:22:47AM +0300, Sergei Shtylyov wrote: > The PHY devices sometimes do have their reset signal (maybe even power > supply?) tied to some GPIO and sometimes it also does happen that a boot > loader does not leave it deasserted. So far this issue has been attacked > from (as I believe) a wrong angle: by teaching the MAC driver to manipulate > the GPIO in question; that solution, when applied to the device trees, > led to adding the PHY reset GPIO properties to the MAC device node, with > one exception: Cadence MACB driver which could handle the "reset-gpios" > prop in a PHY device subnode. I believe that the correct approach is to > teach the 'phylib' to get the MDIO device reset GPIO from the device tree > node corresponding to this device -- which this patch is doing... > > Note that I had to modify the AT803x PHY driver as it would stop working > otherwise as it made use of the reset GPIO for its own purposes... Lots of double spaces in here. Please fix. > > Signed-off-by: Sergei Shtylyov > > --- > Documentation/devicetree/bindings/net/phy.txt |2 + > drivers/net/phy/at803x.c | 19 ++ > drivers/net/phy/mdio_bus.c|4 +++ > drivers/net/phy/mdio_device.c | 27 +++-- > drivers/net/phy/phy_device.c | 33 > -- > drivers/of/of_mdio.c | 16 > include/linux/mdio.h |3 ++ > include/linux/phy.h |5 +++ > 8 files changed, 89 insertions(+), 20 deletions(-) [...] > Index: net-next/drivers/of/of_mdio.c > === > --- net-next.orig/drivers/of/of_mdio.c > +++ net-next/drivers/of/of_mdio.c > @@ -44,6 +44,7 @@ static int of_get_phy_id(struct device_n > static int of_mdiobus_register_phy(struct mii_bus *mdio, struct device_node > *child, > u32 addr) > { > + struct gpio_desc *gpiod; > struct phy_device *phy; > bool is_c45; > int rc; > @@ -52,10 +53,17 @@ static int of_mdiobus_register_phy(struc > is_c45 = of_device_is_compatible(child, >"ethernet-phy-ieee802.3-c45"); > > + gpiod = fwnode_get_named_gpiod(&child->fwnode, "reset-gpios"); Calling fwnode_* functions in a DT specific file/function? That doesn't make sense. > + /* Deassert the reset signal */ > + if (!IS_ERR(gpiod)) > + gpiod_direction_output(gpiod, 0); > if (!is_c45 && !of_get_phy_id(child, &phy_id)) > phy = phy_device_create(mdio, addr, phy_id, 0, NULL); > else > phy = get_phy_device(mdio, addr, is_c45); > + /* Assert the reset signal again */ > + if (!IS_ERR(gpiod)) > + gpiod_set_value(gpiod, 1); > if (IS_ERR_OR_NULL(phy)) > return 1; >
Re: [PATCH RFC 1/5] net: phy: sun8i-h3-ephy: Add bindings for Allwinner H3 Ethernet PHY
On 04/04/16 09:22, Chen-Yu Tsai wrote: > The Allwinner H3 SoC incorporates an Ethernet PHY. This is enabled and > configured through a memory mapped hardware register. > > This same register also configures the MAC interface mode and TX clock > source. Also covered by the register, but not supported in these bindings, > are TX/RX clock delay chains and inverters, and an RMII module. > > Signed-off-by: Chen-Yu Tsai > --- > .../bindings/net/allwinner,sun8i-h3-ephy.txt | 44 > ++ > 1 file changed, 44 insertions(+) > create mode 100644 > Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt > > diff --git > a/Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt > b/Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt > new file mode 100644 > index ..146f227e6d88 > --- /dev/null > +++ b/Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt > @@ -0,0 +1,44 @@ > +* Allwinner H3 E(thernet) PHY > + > +The Allwinner H3 integrates an MII ethernet PHY. As with external PHYs, > +before it can be configured over the MDIO bus and used, certain hardware > +features must be configured, such as the PHY address and LED polarity. Is the internal PHY address really configurable? Not that there is anything wrong with it, this is good. > +The PHY must also be powered on and brought out of reset. > + > +This is accomplished with regulators and pull-up/downs for external PHYs. > +For this internal PHY, a hardware register is programmed. > + > +The same hardware register also contains clock and interface controls > +for the MAC. This is also present in earlier SoCs, and is covered by > +"allwinner,sun7i-a20-gmac-clk". The controls in the H3 are slightly > +different due to the inclusion of what appears to be an RMII-MII > +bridge. > + > +Required properties: > +- compatible: should be "allwinner,sun8i-h3-ephy" > +- reg: address and length of the register set for the device > +- clocks: A phandle to the reference clock for this device > +- resets: A phandle to the reset control for this device > + > +Ethernet PHY related properties: > +- allwinner,ephy-addr: the MDIO bus address the PHY should respond to. > +If this is not set, the external PHY is used, and > +everything else in this section is ignored. So we are going to put the same value here, and in the actual Ethernet PHY device tree node that should be hanging off the EMAC/MDIO bus controller, this is confusing and error prone. > +- allwinner,leds-active-low: LEDs are active low. Without this, LEDs are > + active high. > + > +Ethernet MAC clock related properties: > +- #clock-cells: should be 0 > +- clock-output-names: "mac_tx" > + > +Example: > + > +ethernet-phy@01c00030 { > + compatible = "allwinner,sun8i-h3-ephy"; > + reg = <0x01c00030 0x4>; Looking at this register space this looks more like an internal PHY SHIM that is needed to be configured before the internal PHY can be access, not a proper Ethernet PHY per-se, see replies in aptch 2. Should not this block be a second cell associated with the Ethernet MAC block? One or the other are not going to be very useful without knowledge of each other. -- Florian
Re: [PATCH RFC 2/5] net: phy: sun8i-h3-ephy: Add driver for Allwinner H3 Ethernet PHY
On 04/04/16 09:22, Chen-Yu Tsai wrote: > The Allwinner H3 SoC incorporates an Ethernet PHY. This is enabled and > configured through a memory mapped hardware register. > > This same register also configures the MAC interface mode and TX clock > source. Also covered by the register, but not supported in this driver, > are TX/RX clock delay chains and inverters, and an RMII module. This is not really a PHY driver, more a driver for a special piece of hardware responsible for properly configuring a more standard integrated PHY which is then driven via standard MDIO accesses, right? The intention to make this driver re-usable is fine, but still makes me wonder if it should not be put in a file which is linked into the Ethernet MAC driver, and utilized by this one in a way that may be more "ad-hoc" than what you are proposing here. One thing that is not obvious here, is how is the device parenting done? Are we able to associate a phy_device to this SUN8I_H3_EPHY platform device here? Another thing is that the Ethernet MAC driver is fully aware of when putting an Ethernet PHY into suspend, shutdown, or fully functional power state should occur, if you have a separate platform driver here which does not listen for such kinds of events (hint: none are generated right now), then you cannot implement a working power state interface between the MAC, SHIM and PHY here, even though you would want that. -- Florian
Re: [PATCH net 0/2] tipc: name distributor pernet queue
From: Jon Maloy Date: Thu, 7 Apr 2016 10:40:42 -0400 > Commit #1 fixes a potential issue with deferred binding table > updates being pushed to the wrong namespace. > > Commit #2 solves a problem with deferred binding table updates > remaining in the the defer queue after the issuing node has gone > down. Series applied, thanks.
Re: [PATCH v6 net-next] net: ipv4: Consider failed nexthops in multipath routes
From: David Ahern Date: Thu, 7 Apr 2016 07:21:00 -0700 > Multipath route lookups should consider knowledge about next hops and not > select a hop that is known to be failed. > > Example: ... > The failed path can be avoided by considering known neighbor information > when selecting next hops. If the neighbor lookup fails we have no > knowledge about the nexthop, so give it a shot. If there is an entry > then only select the nexthop if the state is sane. This is similar to > what fib_detect_death does. > > To maintain backward compatibility use of the neighbor information is > based on a new sysctl, fib_multipath_use_neigh. > > Signed-off-by: David Ahern Applied, thanks.
Re: [PATCH 0/2] drivers: net: cpsw: fix ale calls and drop host_port field from cpsw_priv
From: Grygorii Strashko Date: Thu, 7 Apr 2016 15:16:42 +0300 > This clean up series intended to: > - fix port_mask parameters in ale calls and drop unnecessary shifts > - drop host_port field from struct cpsw_priv > > Nothing critical. Tested on am437x-idk-evm in dual mac and switch modes. Looks good, series applied, thanks.
Re: [PATCH net-next] vxlan: fix incorrect type
From: Jiri Benc Date: Mon, 11 Apr 2016 17:06:08 +0200 > The protocol is 16bit, not 32bit. > > Fixes: e1e5314de08ba ("vxlan: implement GPE") > Reported-by: Dan Carpenter > Signed-off-by: Jiri Benc Applied.
Re: [PATCH] net: fjes: Use resource_size
From: Vaishali Thakkar Date: Mon, 11 Apr 2016 15:58:17 +0530 > Use the function resource_size instead of explicit computation. > > Problem found using Coccinelle. > > Signed-off-by: Vaishali Thakkar Applied.
Re: [PATCH RFT 2/2] macb: kill PHY reset code
On 04/11/2016 09:51 PM, Andrew Lunn wrote: The code you are deleting would of ignored the flags in the gpio property, i.e. active low. Hm, you're right -- I forgot about that... :-/ The new code in the previous patch does however take the flags into account. Did you check if there are any device trees which have flags, which were never used, but are now going to be used and thus break... Checked this now and found out arch/arm/boot/dts/ar91-vinco.dts. Looks like it needs to be fixed indeed... And this is where it gets tricky. You are breaking backwards compatibility by now respecting the flag. An old DT blob is not going to work. Do we care that much about the DT blobs that are just *wrong*? Wrong, but currently works. Note that it's not only using GPIO_ACTIVE_HIGH but does that against what the MACB binding documents. You potentially need to add a new property and deprecate the old one. I would like to avoid that... You will need the agreement from the at91-vinco maintainer. I'll try submitting a formal DT patch... Andrew MBR, Sergei
Re: [PATCH net-next 0/4] bnxt_en: Update for net-next
From: Michael Chan Date: Mon, 11 Apr 2016 04:11:10 -0400 > Misc. changes for link speed and VF MAC address change. Applied, thanks Michael.
Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
On 04/11/2016 11:37 AM, Jesper Dangaard Brouer wrote: On Mon, 11 Apr 2016 14:46:25 -0300 Thadeu Lima de Souza Cascardo wrote: So, Jesper, please take into consideration that this pool design would rather be per device. Otherwise, we allow some device to write into another's device/driver memory. Yes, that was my intended use. I want to have a page-pool per device. I actually, want to go as far as a page-pool per NIC HW RX-ring queue. Because the other use-case for the page-pool is zero-copy RX. The NIC HW trick is that we today can create a HW filter in the NIC (via ethtool) and place that traffic into a separate RX queue in the NIC. Lets say matching NFS traffic or guest traffic. Then we can allow RX zero-copy of these pages, into the application/guest, somehow binding it to RX queue, e.g. introducing a "cross-domain-id" in the page-pool page that need to match. I think it is important to keep in mind that using a page pool for zero-copy RX is specific to protocols that are based on TCP/IP. Protocols like FC, SRP and iSER have been designed such that the side that allocates the buffers also initiates the data transfer (the target side). With TCP/IP however transferring data and allocating receive buffers happens by opposite sides of the connection. Bart.
Re: [PATCH RFT 2/2] macb: kill PHY reset code
On Mon, Apr 11, 2016 at 09:39:02PM +0300, Sergei Shtylyov wrote: > Hello. > > On 04/11/2016 09:19 PM, Andrew Lunn wrote: > > >>>The code you are deleting would of ignored the flags in the gpio > >>>property, i.e. active low. > >> > >>Hm, you're right -- I forgot about that... :-/ > >> > >>>The new code in the previous patch does > >>>however take the flags into account. Did you check if there are any > >>>device trees which have flags, which were never used, but are now > >>>going to be used and thus break... > >> > >>Checked this now and found out arch/arm/boot/dts/ar91-vinco.dts. > >>Looks like it needs to be fixed indeed... > >> > >And this is where it gets tricky. You are breaking backwards > >compatibility by now respecting the flag. An old DT blob is not going > >to work. > >Do we care that much about the DT blobs that are just *wrong*? Wrong, but currently works. > >You potentially need to add a new property and deprecate the old one. > >I would like to avoid that... You will need the agreement from the at91-vinco maintainer. Andrew
Re: [PATCH RFT 2/2] macb: kill PHY reset code
Hello. On 04/11/2016 09:19 PM, Andrew Lunn wrote: The code you are deleting would of ignored the flags in the gpio property, i.e. active low. Hm, you're right -- I forgot about that... :-/ The new code in the previous patch does however take the flags into account. Did you check if there are any device trees which have flags, which were never used, but are now going to be used and thus break... Checked this now and found out arch/arm/boot/dts/ar91-vinco.dts. Looks like it needs to be fixed indeed... And this is where it gets tricky. You are breaking backwards compatibility by now respecting the flag. An old DT blob is not going to work. Do we care that much about the DT blobs that are just *wrong*? You potentially need to add a new property and deprecate the old one. I would like to avoid that... Andrew MBR, Sergei
Re: [PATCH] mwifiex: fix possible NULL dereference
On Mon, Apr 11, 2016 at 8:27 AM, Sudip Mukherjee wrote: > > From: Sudip Mukherjee > > We have a check for card just after dereferencing it. So if it is NULL > we have already dereferenced it before its check. Lets dereference it > after checking card for NULL. > > Signed-off-by: Sudip Mukherjee > --- > drivers/net/wireless/marvell/mwifiex/pcie.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/wireless/marvell/mwifiex/pcie.c > b/drivers/net/wireless/marvell/mwifiex/pcie.c > index edf8b07..84562d0 100644 > --- a/drivers/net/wireless/marvell/mwifiex/pcie.c > +++ b/drivers/net/wireless/marvell/mwifiex/pcie.c > @@ -2884,10 +2884,11 @@ static void mwifiex_unregister_dev(struct > mwifiex_adapter *adapter) > { > struct pcie_service_card *card = adapter->card; > const struct mwifiex_pcie_card_reg *reg; > - struct pci_dev *pdev = card->dev; > + struct pci_dev *pdev; you might want to move the variable declaration into the if block below to avoid it being left undefined > int i; > > if (card) { > + pdev = card->dev; to here +struct pci_dev pdev = card->dev; > > if (card->msix_enable) { > for (i = 0; i < MWIFIEX_NUM_MSIX_VECTORS; i++) > synchronize_irq(card->msix_entries[i].vector); > cheers, csd
Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
On Mon, 11 Apr 2016 14:46:25 -0300 Thadeu Lima de Souza Cascardo wrote: > So, Jesper, please take into consideration that this pool design > would rather be per device. Otherwise, we allow some device to write > into another's device/driver memory. Yes, that was my intended use. I want to have a page-pool per device. I actually, want to go as far as a page-pool per NIC HW RX-ring queue. Because the other use-case for the page-pool is zero-copy RX. The NIC HW trick is that we today can create a HW filter in the NIC (via ethtool) and place that traffic into a separate RX queue in the NIC. Lets say matching NFS traffic or guest traffic. Then we can allow RX zero-copy of these pages, into the application/guest, somehow binding it to RX queue, e.g. introducing a "cross-domain-id" in the page-pool page that need to match. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH net v2] net: sched: do not requeue a NULL skb
On Mon, 2016-04-11 at 11:26 -0700, Eric Dumazet wrote: > On Mon, 2016-04-11 at 11:02 -0700, Cong Wang wrote: > > > I am fine with either way as long as the loop stops on failure. Note that skb that could not be validated is already freed. So I do not see any value from stopping the loop, since we need to schedule the queue to avoid tx hang. Just process following skb if there is one, fact that skb is sent or dropped does not matter.
Re: [PATCH net v2] net: sched: do not requeue a NULL skb
On Mon, 2016-04-11 at 11:02 -0700, Cong Wang wrote: > I am fine with either way as long as the loop stops on failure. > Folding the test "if (skb)" into one also requires to retake the spinlock. Adding the likely() in this path would probably help as well. diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index f18c35024207..07202d9ac4f6 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -159,12 +159,14 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q, if (validate) skb = validate_xmit_skb_list(skb, dev); - if (skb) { + if (likely(skb)) { HARD_TX_LOCK(dev, txq, smp_processor_id()); if (!netif_xmit_frozen_or_stopped(txq)) skb = dev_hard_start_xmit(skb, dev, txq, &ret); HARD_TX_UNLOCK(dev, txq); + } else { + ... does all and return... } spin_lock(root_lock);
Re: Unhandled fault during system suspend in sky2_shutdown
On Mon, 11 Apr 2016 17:24:37 +0100 Sudeep Holla wrote: > Hi, > > I am seeing unhandled fault during system suspend in sky2_shutdown. > I am not sure if it's something missing in the firmware, but just wanted > to check. I see that networkmanager is invoking calling to > netlink_sendmsg which calls sky2_get_stats after the device is shutdown. > > Unhandled fault: synchronous external abort (0x96000210) at > 0x091c2918 > Internal error: : 96000210 [#1] PREEMPT SMP > Modules linked in: > CPU: 3 PID: 2029 Comm: NetworkManager Not tainted 4.6.0-rc3 #126 > Hardware name: ARM Juno development board (r2) (DT) > task: 80007a673000 ti: 800940b5c000 task.ti: 800940b5c000 > PC is at sky2_get_stats+0x44/0x3b8 > LR is at dev_get_stats+0x58/0xc8 > sky2_get_stats+0x44/0x3b8 > rtnl_fill_stats+0x20/0x138 > rtnl_fill_ifinfo+0x440/0xb38 > rtnl_getlink+0xe8/0x198 > rtnetlink_rcv_msg+0xe4/0x220 > netlink_rcv_skb+0xc4/0xf8 > rtnetlink_rcv+0x2c/0x40 > netlink_unicast+0x160/0x238 > netlink_sendmsg+0x2f0/0x358 > sock_sendmsg+0x18/0x30 > ___sys_sendmsg+0x204/0x218 > __sys_sendmsg+0x44/0x88 > SyS_sendmsg+0xc/0x18 > el0_svc_naked+0x24/0x28 > > The below patch is the hack I came up to check if the netdev is detached > and unregistered, I no longer see the issue. > > Regards, > Sudeep > > -->8 > > diff --git i/drivers/net/ethernet/marvell/sky2.c > w/drivers/net/ethernet/marvell/sky2.c > index ec0a22119e09..0ff0434e32fc 100644 > --- i/drivers/net/ethernet/marvell/sky2.c > +++ w/drivers/net/ethernet/marvell/sky2.c > @@ -5220,6 +5220,13 @@ static SIMPLE_DEV_PM_OPS(sky2_pm_ops, > sky2_suspend, sky2_resume); > > static void sky2_shutdown(struct pci_dev *pdev) > { > + struct sky2_hw *hw = pci_get_drvdata(pdev); > + int i; > + > + for (i = hw->ports - 1; i >= 0; --i) { > + sky2_detach(hw->dev[i]); > + unregister_netdev(hw->dev[i]); > + } > sky2_suspend(&pdev->dev); > pci_wake_from_d3(pdev, device_may_wakeup(&pdev->dev)); > pci_set_power_state(pdev, PCI_D3hot); This is not the correct fix, the device is supposed to stay registered. The correct way to fix this would be to make get_stats ignore requests for device when suspended.
Re: [PATCH RFT 2/2] macb: kill PHY reset code
> >The code you are deleting would of ignored the flags in the gpio > >property, i.e. active low. > >Hm, you're right -- I forgot about that... :-/ > > >The new code in the previous patch does > >however take the flags into account. Did you check if there are any > >device trees which have flags, which were never used, but are now > >going to be used and thus break... > >Checked this now and found out arch/arm/boot/dts/ar91-vinco.dts. > Looks like it needs to be fixed indeed... And this is where it gets tricky. You are breaking backwards compatibility by now respecting the flag. An old DT blob is not going to work. You potentially need to add a new property and deprecate the old one. Andrew
Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
On Mon, Apr 11, 2016 at 06:19:07PM +0200, Jesper Dangaard Brouer wrote: > > http://git.kernel.org/cgit/linux/kernel/git/mel/linux.git/log/?h=mm-vmscan-node-lru-v4r5 > > > > The cost decreased to: 228 cycles(tsc), but there are some variations, > sometimes it increase to 238 cycles(tsc). > In the free path, a bulk pcp free adds to the cycles. In the alloc path, a refill of the pcp lists costs quite a bit. Either option introduces variances. The bulk free path can be optimised a little so I chucked some additional patches at it that are not released yet but I suspect the benefit will be marginal. The real heavy costs there are splitting/merging buddies. Fixing that is much more fundamental but even fronting the allocator with a new recycle allocator would not offset that as the refill of the page-recycling thing would incur high costs. > Nice, but there is still a looong way to my performance target, where I > can spend 201 cycles for the entire forwarding path > While I accept the cost is still too high, I think the effort should still be spent on improving the allocator in general than trying to bypass it. > > > This is an unreleased series that contains both the page allocator > > optimisations and the one-LRU-per-node series which in combination remove a > > lot of code from the page allocator fast paths. I have no data on how the > > combined series behaves but each series individually is known to improve > > page allocator performance. > > > > Once you have that, do a hackjob to remove the debugging checks from both > > the > > alloc and free path and see what that leaves. They could be bypassed > > properly > > with a __GFP_NOACCT flag used only by drivers that absolutely require pages > > as quickly as possible and willing to be less safe to get that performance. > > I would be interested in testing/benchmarking a patch where you remove > the debugging checks... > Right now, I'm not proposing to remove the debugging checks despite their cost. They catch really difficult problems in the field unfortunately including corruption from buggy hardware. A GFP flag that disables them for a very specific case would be ok but I expect it to be resisted by others if it's done for the general case. Even a static branch for runtime debugging checks may be resisted. Even if GFP flags are tight, I have a patch that deletes __GFP_COLD on the grounds it is of questionable value. Applying that would free a flag for __GFP_NOACCT that bypasses debugging checks and statistic updates. That would work for the allocation side at least but doing the same for the free side would be hard (potentially impossible) to do transparently for drivers. > You are also welcome to try out my benchmarking modules yourself: > > https://github.com/netoptimizer/prototype-kernel/blob/master/getting_started.rst > I took a quick look and functionally it's similar to the systemtap-based microbenchmark I'm using in mmtests so I don't think we have a problem with reproduction at the moment. > > Be aware that compound order allocs like this are a double edged sword as > > it'll be fast sometimes and other times require reclaim/compaction which > > can stall for prolonged periods of time. > > Yes, I've notice that there can be a fairly high variation, when doing > compound order allocs, which is not so nice! I really don't like these > variations > They can cripple you which is why I'm very wary of performance patches that require compound pages. It tends to look great only on benchmarks and then the corner cases hit in the real world and the bug reports are unpleasant. > Drivers also do tricks where they fallback to smaller order pages. E.g. > lookup function mlx4_alloc_pages(). I've tried to simulate that > function here: > https://github.com/netoptimizer/prototype-kernel/blob/91d323fc53/kernel/mm/bench/page_bench01.c#L69 > > It does not seem very optimal. I tried to mem pressure the system a bit > to cause the alloc_pages() to fail, and then the result were very bad, > something like 2500 cycles, and it usually got the next order pages. The options for fallback tend to have one hazard after the next. It's partially why the last series focused on order-0 pages only. > > > I've measured order 3 (32KB) alloc_pages(order=3) + __free_pages() to > > > cost approx 500 cycles(tsc). That was more expensive, BUT an order=3 > > > page 32Kb correspond to 8 pages (32768/4096), thus 500/8 = 62.5 > > > cycles. Usually a network RX-frame only need to be 2048 bytes, thus > > > the "bulk" effect speed up is x16 (32768/2048), thus 31.25 cycles. > > The order=3 cost were reduced to: 417 cycles(tsc), nice! But I've also > seen it jump to 611 cycles. > The corner cases can be minimised to some extent -- lazy buddy merging for example but it unfortunately has other consequences for users that require high-order pages for functional reasons. I tried something like that once (http://thread.gmane.org/gmane.linux.kernel/807683) b
Re: [PATCH net v2] net: sched: do not requeue a NULL skb
On Mon, Apr 11, 2016 at 8:52 AM, Eric Dumazet wrote: > On Mon, 2016-04-11 at 17:17 +0200, Lars Persson wrote: >> >> On 04/11/2016 04:22 PM, Eric Dumazet wrote: >> > On Mon, 2016-04-11 at 15:38 +0200, Lars Persson wrote: >> > >> >> I though it would be prudent because the queue can be non-empty even for >> >> the case of skb=NULL. So should it be there in this patch, another patch >> >> or not at all ? >> > >> > Then maybe change return code ? >> > >> > It seems strange that a validate_xmit_skb_list() failure stops the >> > __qdisc_run() loop but schedules another round. >> > >> > >> >> It was suggested by Cong Wang to return 0 in order to stop the loop. Do >> you guys agree that the loop should be stopped for such failures ? Then >> I will put the schedule call inside the if as you proposed earlier. > > What are the causes of validate_xmit_skb_list() failures ? > > If gso segmentations fail because of memory pressure, better free more > skbs right now. > > In any case, having a single test " if (skb) " sounds better to me, > to have a fast path. > > So your first patch was probably a better idea. > > v2 has two tests instead of one. I am fine with either way as long as the loop stops on failure. Folding the test "if (skb)" into one also requires to retake the spinlock.
Re: [RFC PATCH net 3/4] ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update
On Tue, Apr 05, 2016 at 07:56:54PM -0400, David Miller wrote: > From: Cong Wang > Date: Mon, 4 Apr 2016 13:45:02 -0700 > > > On Sat, Apr 2, 2016 at 7:33 PM, Martin KaFai Lau wrote: > >> One thing to note is that this patch uses the addresses from the sk > >> instead of iph when updating sk->sk_dst_cache. It is basically the > >> same logic that the __ip6_datagram_connect() is doing, so some > >> refactoring works in the first two patches. > >> > >> AFAIK, a UDP socket can become connected after sending out some > >> datagrams in un-connected state. or It can be connected > >> multiple times to different destinations. I did some quick > >> tests but I could be wrong. > >> > >> I am thinking if there could be a chance that the skb->data, which > >> has the original outgoing iph, is not related to the current > >> connected address. If it is possible, we have to specifically > >> use the addresses in the sk instead of skb->data (i.e. iph) when > >> updating the sk->sk_dst_cache. > >> > >> If we need to use the sk addresses (and other info) to find out a > >> new dst for a connected udp socket, it is better not doing it while > >> the userland is connecting to somewhere else. > >> > >> If the above case is impossible, we can keep using the info from iph to > >> do the dst update for a connected-udp sk without taking the lock. > > > > I see your point, but calling __ip6_datagram_connect() seems overkill > > here, we don't need to update so many things in the pmtu update context, > > at least IPv4 doesn't do that either. I don't think you have to do that. > > > > So why just updating the dst cache (also some addr cache) here is not > > enough? > > I think we are steadily getting closer to a version of this fix that > we have some agreement on, right? > > Martin can you address Cong's feedback and spin another version of this > series? Dave, I will spin v2 shortly with some minor changes. Cong, I believe the loose ends have been tied up after the last email thread since last Thursday?
Re: [PATCH net v2] net: sched: do not requeue a NULL skb
On Mon, Apr 11, 2016 at 8:17 AM, Lars Persson wrote: > > > On 04/11/2016 04:22 PM, Eric Dumazet wrote: >> >> On Mon, 2016-04-11 at 15:38 +0200, Lars Persson wrote: >> >>> I though it would be prudent because the queue can be non-empty even for >>> the case of skb=NULL. So should it be there in this patch, another patch >>> or not at all ? >> >> >> Then maybe change return code ? >> >> It seems strange that a validate_xmit_skb_list() failure stops the >> __qdisc_run() loop but schedules another round. >> >> > > It was suggested by Cong Wang to return 0 in order to stop the loop. Do you > guys agree that the loop should be stopped for such failures ? Then I will > put the schedule call inside the if as you proposed earlier. I don't see any reason why we should continue on failure. And, I didn't suggest you to return reschedule it either. I was suggesting to just return 0 for skb == NULL case.
Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
On Mon, Apr 11, 2016 at 12:20:47PM -0400, Matthew Wilcox wrote: > On Mon, Apr 11, 2016 at 02:08:27PM +0100, Mel Gorman wrote: > > On Mon, Apr 11, 2016 at 02:26:39PM +0200, Jesper Dangaard Brouer wrote: > > > On arch's like PowerPC, the DMA API is the bottleneck. To workaround > > > the cost of DMA calls, NIC driver alloc large order (compound) pages. > > > (dma_map compound page, handout page-fragments for RX ring, and later > > > dma_unmap when last RX page-fragments is seen). > > > > So, IMO only holding onto the DMA pages is all that is justified but not a > > recycle of order-0 pages built on top of the core allocator. For DMA pages, > > it would take a bit of legwork but the per-cpu allocator could be split > > and converted to hold arbitrary sized pages with a constructer/destructor > > to do the DMA coherency step when pages are taken from or handed back to > > the core allocator. I'm not volunteering to do that unfortunately but I > > estimate it'd be a few days work unless it needs to be per-CPU and NUMA > > aware in which case the memory footprint will be high. > > Have "we" tried to accelerate the DMA calls in PowerPC? For example, it > could hold onto a cache of recently used mappings and recycle them if that > still works. It trades off a bit of security (a device can continue to DMA > after the memory should no longer be accessible to it) for speed, but then > so does the per-driver hack of keeping pages around still mapped. > There are two problems on the DMA calls on Power servers. One is scalability. A new allocation method for the address space would be necessary to fix it. The other one is the latency or the cost of updating the TCE tables. The only number I have is that I could push around 1M updates per second. So, we could guess 1us per operation, which is pretty much a no-no for Jesper use case. Your solution could address both. But I am concerned about the security problem. Here is why I think this problem should be ignored if we go this way. IOMMU can be used for three problems: virtualization, paranoia security and debuggability. For virtualization, there is a solution already, and it's in place for Power and x86. Power servers have the ability to enlarge the DMA window, allowing the entire VM memory to be mapped during PCI driver probe time. After that, dma_map is a simple sum and dma_unmap is a nop. x86 KVM maps the entire VM memory even before booting the guest. Unless we want to fix this for old Power servers, I see no point in fixing it. Now, if you are using IOMMU on the host with no passthrough or linear system memory mapping, you are paranoid. It's not just a matter of security, in fact. It's also a matter of stability. Hardware, firmware and drivers can be buggy, and they are. When I worked with drivers on Power servers, I found and fixed a lot of driver bugs that caused the device to write to memory it was not supposed to. Good thing is that IOMMU prevented that memory write to happen and the driver would be reset by EEH. If we can make this scenario faster, and if we want it to be the default we need to, then your solution might not be desired. Otherwise, just turn your IOMMU off or put it into passthrough. Now, the driver keeps pages mapped, but those pages belong to the driver. They are not pages we decide to give to a userspace process because it's no longer in use by the driver. So, I don't quite agree this would be a good tradeoff. Certainly not if we can do it in a way that does not require this. So, Jesper, please take into consideration that this pool design would rather be per device. Otherwise, we allow some device to write into another's device/driver memory. Cascardo.
Re: [PATCH RFT 2/2] macb: kill PHY reset code
Hello. On 04/11/2016 05:28 AM, Andrew Lunn wrote: With the 'phylib' now being aware of the "reset-gpios" PHY node property, there should be no need to frob the PHY reset in this driver anymore... Signed-off-by: Sergei Shtylyov --- drivers/net/ethernet/cadence/macb.c | 17 - drivers/net/ethernet/cadence/macb.h |1 - 2 files changed, 18 deletions(-) Index: net-next/drivers/net/ethernet/cadence/macb.c === --- net-next.orig/drivers/net/ethernet/cadence/macb.c +++ net-next/drivers/net/ethernet/cadence/macb.c [...] @@ -2977,18 +2976,6 @@ static int macb_probe(struct platform_de else macb_get_hwaddr(bp); - /* Power up the PHY if there is a GPIO reset */ - phy_node = of_get_next_available_child(np, NULL); - if (phy_node) { - int gpio = of_get_named_gpio(phy_node, "reset-gpios", 0); - - if (gpio_is_valid(gpio)) { - bp->reset_gpio = gpio_to_desc(gpio); - gpiod_direction_output(bp->reset_gpio, 1); Hi Sergei The code you are deleting would of ignored the flags in the gpio property, i.e. active low. Hm, you're right -- I forgot about that... :-/ The new code in the previous patch does however take the flags into account. Did you check if there are any device trees which have flags, which were never used, but are now going to be used and thus break... Checked this now and found out arch/arm/boot/dts/ar91-vinco.dts. Looks like it needs to be fixed indeed... Andrew MBR, Sergei
Re: [PATCH] sh_eth: re-enable-E-MAC interrupts in sh_eth_set_ringparam()
Hello. On 04/10/2016 04:27 AM, David Miller wrote: The E-MAC interrupts are left disabled when the ring parameters are changed via 'ethtool'. In order to fix this, it's enough to call sh_eth_dev_init() with 'true' instead of 'false' for the second argument (which conveniently allows us to remove the following code re-enabling E-DMAC interrupts and reception). Signed-off-by: Sergei Shtylyov Applied, thanks. Thanks! Unfortunately, this patch isn't fit for the stable kernels because there's a prerequisite cleanup (most recent patch merged to this driver). I clearly hadn't thought this out well -- should have enabled the E-MAC interrupts outside sh_eth_dev_init() and then removed the duplicate code like I did here. MBR, Sergei
Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
On Mon, 2016-04-11 at 18:19 +0200, Jesper Dangaard Brouer wrote: > Drivers also do tricks where they fallback to smaller order pages. E.g. > lookup function mlx4_alloc_pages(). I've tried to simulate that > function here: > https://github.com/netoptimizer/prototype-kernel/blob/91d323fc53/kernel/mm/bench/page_bench01.c#L69 We use order-0 pages on mlx4 at Google, as order-3 pages are very dangerous for some kind of attacks... An out of order TCP packet can hold an order-3 pages, while claiming to use 1.5 KBvia skb->truesize. order-0 only pages allow the page recycle trick used by Intel driver, and we hardly see any page allocations in typical workloads. While order-3 pages are 'nice' for friendly datacenter kind of traffic, they also are a higher risk on hosts connected to the wild Internet. Maybe I should upstream this patch ;)
Re: [PATCH net-next 2/2] net: exit busy loop when another process is runnable
On Fri, Aug 22, 2014 at 09:36:53AM +0200, Ingo Molnar wrote: > > > > diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h > > > index 1d67fb6..8a33fb2 100644 > > > --- a/include/net/busy_poll.h > > > +++ b/include/net/busy_poll.h > > > @@ -109,7 +109,8 @@ static inline bool sk_busy_loop(struct sock *sk, int > > > nonblock) > > > cpu_relax(); > > > > > > } while (!nonblock && skb_queue_empty(&sk->sk_receive_queue) && > > > - !need_resched() && !busy_loop_timeout(end_time)); > > > + !need_resched() && !busy_loop_timeout(end_time) && > > > + nr_running_this_cpu() < 2); > > So it's generally a bad idea to couple to the scheduler through > such a low level, implementation dependent value like > 'nr_running', causing various problems: > > - It misses important work that might be pending on this CPU, >like RCU callbacks. > > - It will also over-credit task contexts that might be >runnable, but which are less important than the currently >running one: such as a SCHED_IDLE task > > - It will also over-credit even regular SCHED_NORMAL tasks, if >this current task is more important than them: say >SCHED_FIFO. A SCHED_FIFO workload should run just as fast >with SCHED_NORMAL tasks around, as a SCHED_NORMAL workload >on an otherwise idle system. > > So what you want is a more sophisticated query to the > scheduler, a sched_expected_runtime() method that returns the > number of nsecs this task is expected to run in the future, > which returns 0 if you will be scheduled away on the next > schedule(), and returns infinity for a high prio SCHED_FIFO > task, or if this SCHED_NORMAL task is on an otherwise idle CPU. > > It will return a regular time slice value in other cases, when > there's some load on the CPU. > > The polling logic can then do its decision based on that time > value. > > All this can be done reasonably fast and lockless in most > cases, so that it can be called from busy-polling code. > > An added advantage would be that this approach consolidates the > somewhat random need_resched() checks into this method as well. > > In any case I don't agree with the nr_running_this_cpu() > method. > > (Please Cc: me and lkml to future iterations of this patchset.) > > Thanks, > > Ingo I tried to look into this: it might be even nicer to add sched_expected_to_run(time) which tells us whether we expect the current task to keep running for the next XX nsecs. For the fair scheduler, it seems that it could be as simple as +static bool expected_to_run_fair(struct cfs_rq *cfs_rq, s64 t) +{ + struct sched_entity *left; + struct sched_entity *curr = cfs_rq->curr; + + if (!curr || !curr->on_rq) + return false; + + left = __pick_first_entity(cfs_rq); + if (!left) + return true; + + return (s64)(curr->vruntime + calc_delta_fair(t, curr) - +left->vruntime) < 0; +} The reason it seems easier is because that way we can reuse calc_delta_fair and don't have to do the reverse translation from vruntime to nsec. And I guess if we do this with interrupts disabled, and only poke at the current CPU's rq, we know first entity won't go away so we don't need locks? Is this close to what you had in mind? Thanks, -- MST
Unhandled fault during system suspend in sky2_shutdown
Hi, I am seeing unhandled fault during system suspend in sky2_shutdown. I am not sure if it's something missing in the firmware, but just wanted to check. I see that networkmanager is invoking calling to netlink_sendmsg which calls sky2_get_stats after the device is shutdown. Unhandled fault: synchronous external abort (0x96000210) at 0x091c2918 Internal error: : 96000210 [#1] PREEMPT SMP Modules linked in: CPU: 3 PID: 2029 Comm: NetworkManager Not tainted 4.6.0-rc3 #126 Hardware name: ARM Juno development board (r2) (DT) task: 80007a673000 ti: 800940b5c000 task.ti: 800940b5c000 PC is at sky2_get_stats+0x44/0x3b8 LR is at dev_get_stats+0x58/0xc8 sky2_get_stats+0x44/0x3b8 rtnl_fill_stats+0x20/0x138 rtnl_fill_ifinfo+0x440/0xb38 rtnl_getlink+0xe8/0x198 rtnetlink_rcv_msg+0xe4/0x220 netlink_rcv_skb+0xc4/0xf8 rtnetlink_rcv+0x2c/0x40 netlink_unicast+0x160/0x238 netlink_sendmsg+0x2f0/0x358 sock_sendmsg+0x18/0x30 ___sys_sendmsg+0x204/0x218 __sys_sendmsg+0x44/0x88 SyS_sendmsg+0xc/0x18 el0_svc_naked+0x24/0x28 The below patch is the hack I came up to check if the netdev is detached and unregistered, I no longer see the issue. Regards, Sudeep -->8 diff --git i/drivers/net/ethernet/marvell/sky2.c w/drivers/net/ethernet/marvell/sky2.c index ec0a22119e09..0ff0434e32fc 100644 --- i/drivers/net/ethernet/marvell/sky2.c +++ w/drivers/net/ethernet/marvell/sky2.c @@ -5220,6 +5220,13 @@ static SIMPLE_DEV_PM_OPS(sky2_pm_ops, sky2_suspend, sky2_resume); static void sky2_shutdown(struct pci_dev *pdev) { + struct sky2_hw *hw = pci_get_drvdata(pdev); + int i; + + for (i = hw->ports - 1; i >= 0; --i) { + sky2_detach(hw->dev[i]); + unregister_netdev(hw->dev[i]); + } sky2_suspend(&pdev->dev); pci_wake_from_d3(pdev, device_may_wakeup(&pdev->dev)); pci_set_power_state(pdev, PCI_D3hot);
Re: pull-request: wireless-drivers-next 2016-04-11
From: Kalle Valo Date: Mon, 11 Apr 2016 15:48:48 +0300 > here's a pull request for 4.7. More features, but nothing really > standing out. Please let me know if you have any problems. Pulled, thanks.
Re: [PATCH net] cxgb4: Stop Rx Queues before freeing it up
From: Hariprasad Shenai Date: Mon, 11 Apr 2016 11:07:58 +0530 > Stop all Ethernet RX Queues before freeing up various Ingress/Egress > Queues, etc. We were seeing cases of Ingress Queues not getting serviced > during the shutdown process leading to Ingress Paths jamming up through > the chip and blocking the shutdown effort itself. > > One such case involved the Firmware sending a "Flush Token" through the > ULP-TX -> ULP-RX path for an Ethernet TX Queue being freed in order to > make sure there weren't any remaining TX Work Requests in the pipeline. > But the return path was stalled by Ingress Data unable to be delivered to > the Host because those Ingress Queues were no longer being serviced. > > Based on original work by Casey Leedom > > Signed-off-by: Hariprasad Shenai Applied, thank you.
Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
On Mon, Apr 11, 2016 at 02:08:27PM +0100, Mel Gorman wrote: > On Mon, Apr 11, 2016 at 02:26:39PM +0200, Jesper Dangaard Brouer wrote: > > On arch's like PowerPC, the DMA API is the bottleneck. To workaround > > the cost of DMA calls, NIC driver alloc large order (compound) pages. > > (dma_map compound page, handout page-fragments for RX ring, and later > > dma_unmap when last RX page-fragments is seen). > > So, IMO only holding onto the DMA pages is all that is justified but not a > recycle of order-0 pages built on top of the core allocator. For DMA pages, > it would take a bit of legwork but the per-cpu allocator could be split > and converted to hold arbitrary sized pages with a constructer/destructor > to do the DMA coherency step when pages are taken from or handed back to > the core allocator. I'm not volunteering to do that unfortunately but I > estimate it'd be a few days work unless it needs to be per-CPU and NUMA > aware in which case the memory footprint will be high. Have "we" tried to accelerate the DMA calls in PowerPC? For example, it could hold onto a cache of recently used mappings and recycle them if that still works. It trades off a bit of security (a device can continue to DMA after the memory should no longer be accessible to it) for speed, but then so does the per-driver hack of keeping pages around still mapped.