date:20170412

On Thu, 2017-04-06 at 16:31 -0700, Matthias Kaehlcke wrote:
> When clang detects a non-boolean constant in a logical operation it
> generates a 'constant-logical-operand' warning. In
> ieee80211_try_rate_control_ops_get() the result of strlen(<const str>)
> is used in a logical operation, clang resolves the expression to an
> (integer) constant at compile time when clang's builtin strlen
> function is used.
> 
> Change the condition to check for strlen() > 0 to make the constant
> operand boolean and thus avoid the warning.
> 
Applied.

johannes
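
For readers following the thread, here is a minimal sketch of the pattern the patch changes. DEFAULT_ALGO and want_default_algo() are hypothetical stand-ins, not the mac80211 code; the only point is that comparing the strlen() result against 0 turns the constant operand into a boolean.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for a build-time default algorithm name. */
#define DEFAULT_ALGO "minstrel_ht"

/*
 * The form that triggers the warning would be:
 *
 *     return !found && strlen(DEFAULT_ALGO);
 *
 * because the integer result of strlen() on a literal can be folded to
 * a constant. Comparing against 0 gives a boolean operand instead and
 * does not change the result.
 */
bool want_default_algo(bool found)
{
	return !found && strlen(DEFAULT_ALGO) > 0;
}
```

The two forms evaluate identically; only the type of the folded operand differs.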

Re: [PATCH v3 net-next RFC] Generic XDP

On Wed, 2017-04-12 at 21:20 -0700, Alexei Starovoitov wrote:

> > +   if (skb_linearize(skb))
> > +   goto do_drop;
> 
> when we discussed supporting jumbo frames in XDP, the idea
> was that the program would need to look at first 3+k bytes only
> and the rest of the packet will be in non-contiguous pages.
> If we do that, it means that XDP program would have to assume
> that the packet is more than [data, data_end] and this range
> only covers linear part.
> If that's the future, we don't need to linearize the skb here
> and can let the program access headlen only.

I'm not sure how you think that would work - at least with our (wifi)
driver, the headlen should be maybe ETH_HLEN or so at this point. We'd
let the program know that it can only look at so much, but then the
program can't do anything at all with those frames. At some point then
we go back to bpf_skb_load_bytes() being necessary in one form or
another, no?

johannes
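
A small sketch of the concern raised above, under the assumption that a program may only touch bytes in the linear range. struct toy_ctx and toy_peek_l3() are hypothetical and are not the XDP API; they only model what happens when the linear part ends right after the Ethernet header.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_ETH_HLEN 14 /* Ethernet header length in bytes */

/* Hypothetical view of the linear part of a packet: a program may only
 * touch bytes in [data, data_end). */
struct toy_ctx {
	const uint8_t *data;
	const uint8_t *data_end;
};

/* Return the first byte after the Ethernet header, or -1 if that byte
 * lies outside the linear range the program is allowed to read. */
int toy_peek_l3(const struct toy_ctx *ctx)
{
	size_t len = (size_t)(ctx->data_end - ctx->data);

	if (len < TOY_ETH_HLEN + 1)
		return -1;
	return ctx->data[TOY_ETH_HLEN];
}
```

With a head length of exactly ETH_HLEN the function can only report failure, which is the "can't do anything at all with those frames" case.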

Re: [RFC 3/3] mac80211: support bpf monitor filter


> @@ -551,6 +551,9 @@ struct ieee80211_hw *ieee80211_alloc_hw_nm(size_t
> priv_data_len,
>      NL80211_FEATURE_FULL_AP_CLIENT_STATE;
>   wiphy_ext_feature_set(wiphy, NL80211_EXT_FEATURE_FILS_STA);
>  
> + if (IS_ENABLED(CONFIG_BPF_WIFIMON))
> + wiphy_ext_feature_isset(wiphy,
> NL80211_EXT_FEATURE_WIFIMON_BPF);
> 
That obviously needs to be _set(), not _isset().

johannes
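
To illustrate why the distinction matters, a sketch of a set/isset pair on a feature bitmap. struct feature_map, feature_set() and feature_isset() are hypothetical names loosely modelled on the helpers in the quoted hunk, not the cfg80211 implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bitmap; names and size are illustrative only. */
struct feature_map {
	uint8_t bits[8];
};

/* Mutates the map: marks feature `idx` as supported. */
void feature_set(struct feature_map *map, unsigned int idx)
{
	map->bits[idx / 8] |= (uint8_t)(1u << (idx % 8));
}

/* Pure query: reports whether feature `idx` is marked, changes nothing. */
bool feature_isset(const struct feature_map *map, unsigned int idx)
{
	return (map->bits[idx / 8] >> (idx % 8)) & 1u;
}
```

Calling the query where the setter was intended compiles cleanly but leaves the feature unadvertised.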

Re: eBPF - little-endian load instructions?

On Wed, 2017-04-12 at 20:08 -0700, Alexei Starovoitov wrote:

> it's really llvm bug that i need fix. It's plain broken
> to generate what effectively is nop insn for march=bpfeb
> My only excuse that when that code was written llvm had only
> march=bpfel.
> bpfeb was added much later.

So I'm confused now. Is bpf intended to be endian-independent or not?
It sounded at first like it was, even if I have a hard time imagining
how that would even work.

> > #define be32_to_cpu bswap32
> > or
> > #define be32_to_cpu(x) (x)
> > depending on the build architecture, I guess.
> 
> yeah. that's what we should have in bpf_helpers.h

But that sounds more like it isn't.

> ntoh is enough for any networking code,
> so I guess we can live without real bswap insn.

Well, my reason for asking this is that wireless actually has a
little-endian wire protocol, unlike other network stuff :)
(Even at a bit level it's defined to transfer the LSB first, but that
doesn't really get visible at the level of the CPU that can only
address bytes.)

johannes
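
As a sketch of the byte-order handling under discussion, the helpers below read wire fields one byte at a time, which gives the same result on any host. The names load_le16(), load_be32() and swap32() are made up for this example and are not proposed bpf_helpers.h contents.

```c
#include <stdint.h>

/* Read a 16-bit little-endian field (the order wireless uses on the
 * wire) one byte at a time; correct on any host endianness. */
uint16_t load_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/* Read a 32-bit big-endian ("network order") field the same way. */
uint32_t load_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Unconditional byte swap. Whether a be32_to_cpu()-style macro should
 * map to this or to the identity depends on the host byte order. */
uint32_t swap32(uint32_t x)
{
	return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
	       ((x << 8) & 0x00ff0000u) | (x << 24);
}
```

A little-endian wire format needs the mirror-image conversion, which is why ntoh-style helpers alone do not cover it.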

[Patch net-next v2] net_sched: move the empty tp check from ->destroy() to ->delete()

Roi reported we could have a race condition where in ->classify() path
we dereference tp->root and meanwhile a parallel ->destroy() makes it
a NULL.

This is possible because ->destroy() could be called when deleting
a filter to check if we are the last one in tp, this tp is still
linked and visible at that time.

Daniel fixed this in commit d936377414fa
("net, sched: respect rcu grace period on cls destruction"), but
the root cause of this problem is the semantics of ->destroy(): it
does two things (for non-force case):

1) check if tp is empty
2) if tp is empty we could really destroy it

and its caller, if it cares, needs to check its return value to see if
it is really destroyed. Therefore we can't unlink tp unless we know
it is empty.

As suggested by Daniel, we could actually move the test logic to ->delete()
so that we can safely unlink tp after ->delete() tells us the last one is
just deleted and before ->destroy().

What's more, even if we unlink it before ->destroy(), it could still have
readers since we don't wait for a grace period here, so we should not modify
tp->root in ->destroy() either.

Reported-by: Roi Dayan 
Cc: Daniel Borkmann 
Cc: John Fastabend 
Signed-off-by: Cong Wang 
---
 include/net/sch_generic.h |  4 +--
 net/sched/cls_api.c   | 27 +-
 net/sched/cls_basic.c | 10 +++
 net/sched/cls_bpf.c   | 11 
 net/sched/cls_cgroup.c|  8 ++
 net/sched/cls_flow.c  | 10 +++
 net/sched/cls_flower.c| 10 ++-
 net/sched/cls_fw.c| 30 +++-
 net/sched/cls_matchall.c  |  4 +--
 net/sched/cls_route.c | 30 ++--
 net/sched/cls_rsvp.h  | 34 +++
 net/sched/cls_tcindex.c   | 14 +-
 net/sched/cls_u32.c   | 71 +++
 13 files changed, 133 insertions(+), 130 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 65d5026..22e5209 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -204,14 +204,14 @@ struct tcf_proto_ops {
const struct tcf_proto *,
struct tcf_result *);
int (*init)(struct tcf_proto*);
-   bool(*destroy)(struct tcf_proto*, bool);
+   void(*destroy)(struct tcf_proto*);
 
unsigned long   (*get)(struct tcf_proto*, u32 handle);
int (*change)(struct net *net, struct sk_buff *,
struct tcf_proto*, unsigned long,
u32 handle, struct nlattr **,
unsigned long *, bool);
-   int (*delete)(struct tcf_proto*, unsigned long);
+   int (*delete)(struct tcf_proto*, unsigned long, 
bool*);
void(*walk)(struct tcf_proto*, struct tcf_walker 
*arg);
 
/* rtnetlink specific */
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 732f7ca..ca1eab9 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -178,14 +178,11 @@ static struct tcf_proto *tcf_proto_create(const char 
*kind, u32 protocol,
return ERR_PTR(err);
 }
 
-static bool tcf_proto_destroy(struct tcf_proto *tp, bool force)
+static void tcf_proto_destroy(struct tcf_proto *tp)
 {
-   if (tp->ops->destroy(tp, force)) {
-   module_put(tp->ops->owner);
-   kfree_rcu(tp, rcu);
-   return true;
-   }
-   return false;
+   tp->ops->destroy(tp);
+   module_put(tp->ops->owner);
+   kfree_rcu(tp, rcu);
 }
 
 void tcf_destroy_chain(struct tcf_proto __rcu **fl)
@@ -194,7 +191,7 @@ void tcf_destroy_chain(struct tcf_proto __rcu **fl)
 
while ((tp = rtnl_dereference(*fl)) != NULL) {
RCU_INIT_POINTER(*fl, tp->next);
-   tcf_proto_destroy(tp, true);
+   tcf_proto_destroy(tp);
}
 }
 EXPORT_SYMBOL(tcf_destroy_chain);
@@ -360,7 +357,7 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct 
nlmsghdr *n)
RCU_INIT_POINTER(*back, next);
tfilter_notify(net, skb, n, tp, fh,
   RTM_DELTFILTER, false);
-   tcf_proto_destroy(tp, true);
+   tcf_proto_destroy(tp);
err = 0;
goto errout;
}
@@ -371,24 +368,28 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct 
nlmsghdr *n)
goto errout;
}
} else {
+   bool last;
+
switch (n->nlmsg_type) {
case RTM_NEWTFILTER:
if (n->nlmsg_flags & NLM_F_EXCL) {
if (tp_created)
-   tcf_proto_destroy(tp, true);
+
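
The split described in the commit message above can be sketched in plain C as follows. This is a toy model with hypothetical names (toy_tp, toy_delete(), toy_destroy(), toy_del_filter()), not the cls_api code, and it ignores RCU entirely; it only shows the delete operation reporting "last" so that the caller can unlink before an unconditional destroy.

```c
#include <stdbool.h>

/* Toy classifier instance: `nfilters` stands in for the filter list and
 * `linked` for visibility on the chain. All names are hypothetical. */
struct toy_tp {
	int nfilters;
	bool linked;
	bool destroyed;
};

/* ->delete(): remove one filter and report whether it was the last. */
int toy_delete(struct toy_tp *tp, bool *last)
{
	if (tp->nfilters == 0)
		return -1;
	tp->nfilters--;
	*last = (tp->nfilters == 0);
	return 0;
}

/* ->destroy(): unconditional teardown; no emptiness check, no result. */
void toy_destroy(struct toy_tp *tp)
{
	tp->destroyed = true;
}

/* Caller: unlink only once ->delete() says the tp is empty, then destroy. */
int toy_del_filter(struct toy_tp *tp)
{
	bool last = false;
	int err = toy_delete(tp, &last);

	if (err)
		return err;
	if (last) {
		tp->linked = false;
		toy_destroy(tp);
	}
	return 0;
}
```

The destroy step no longer needs a return value because the emptiness decision has already been made by the time it runs.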

Re: IMX6 FEC connection drops occasionally with 'MDIO read timeout'

2017-04-12 Thread Andy Duan


On 2017年04月13日 00:54, Tim Harvey wrote:
> On Wed, Apr 12, 2017 at 9:26 AM, Fabio Estevam  wrote:
>> Hi Tim,
>>
>> On Wed, Apr 12, 2017 at 1:15 PM, Tim Harvey  wrote:
>>> Andrew,
>>>
>>> Thanks for the reply. You're talking about suspend/resume power
>>> management right? The users reporting this were not using
>>> suspend/resume.
>>>
>>> With regards to clock are you talking about the IPG clock? Is there
>>> any other way that would get turned off other than fec suspend/resume?
>> Yes, through pm_runtime.
>>
>> Can you check if this quick debug change helps?
>>
>> --- a/drivers/net/ethernet/freescale/fec_main.c
>> +++ b/drivers/net/ethernet/freescale/fec_main.c
>> @@ -3606,8 +3606,6 @@ static int __maybe_unused
>> fec_runtime_suspend(struct device *dev)
>>  struct net_device *ndev = dev_get_drvdata(dev);
>>  struct fec_enet_private *fep = netdev_priv(ndev);
>>
>> -   clk_disable_unprepare(fep->clk_ipg);
>> -
>>  return 0;
>>   }
>>
>>
>> If you don't see the problem with it, then it means we need to fix the
>> pm runtime support in this driver.
>>
>> Most likely pm runtime is turning off the clocks when it should not.
> Fabio,
>
> Ok, I understand now. We will remove the IPG clock disable and see if
> that makes a difference.
>
> Tim
First, please try the change suggested by Andrew.
My guess is that the system enters wait mode and the ENET IRQ cannot wake it
up in real time, which adds a lot of latency to the MII interrupt, so the
MII bus access times out.
If so, can you add the change below to your dts file and try it?

iomux pinctrl:
 pinctrl_enet_irq: enetirqgrp {
 fsl,pins = <
MX6QDL_PAD_GPIO_6__ENET_IRQ 0x000b1
 >;
 };


&fec {
 ...
 pinctrl-0 = <&pinctrl_enet &pinctrl_enet_irq>;
 interrupts-extended = <&gpio1 6 0x04>, <&gpc 0 119 0x04>;
 ...
};

&i2c3 {
 ...
 status = "disabled";
};


Regards,
Andy

Re: [PATCH net-next 4/8] net/ncsi: Add debugging infrastructure

On Wed, Apr 12, 2017 at 09:28:26PM -0700, Joe Perches wrote:
>On Thu, 2017-04-13 at 12:46 +1000, Gavin Shan wrote:
>> This creates procfs directories as NCSI debugging infrastructure.
>> With the patch applied, we will see the procfs directories below. Every
>> NCSI package and channel has one corresponding directory. Other than
>> presenting the NCSI topology, no real function has been achieved
>> through these procfs directories so far.
>
>/proc is meant to be stable.
>
>Why not use debugfs?
>

Joe, thanks for the comment. I think debugfs makes more sense than
procfs in this case. I will use it in the next respin. The directory
structure won't change, meaning /sys/kernel/debug/ncsi/ will be
created.

>>  /proc/ncsi/eth0
>>  /proc/ncsi/eth0/p0
>>  /proc/ncsi/eth0/p0/c0
>>  /proc/ncsi/eth0/p0/c1
>

Thanks,
Gavin

[PATCH v2 6/9] ftgmac100: Allow configuration of phy interface via device-tree

This uses the standard phy-mode property

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 42 +---
 1 file changed, 39 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index b1fb729..7c607eb 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -1049,7 +1050,7 @@ static void ftgmac100_adjust_link(struct net_device 
*netdev)
schedule_work(&priv->reset_task);
 }
 
-static int ftgmac100_mii_probe(struct ftgmac100 *priv)
+static int ftgmac100_mii_probe(struct ftgmac100 *priv, phy_interface_t intf)
 {
struct net_device *netdev = priv->netdev;
struct phy_device *phydev;
@@ -1061,7 +1062,7 @@ static int ftgmac100_mii_probe(struct ftgmac100 *priv)
}
 
phydev = phy_connect(netdev, phydev_name(phydev),
-&ftgmac100_adjust_link, PHY_INTERFACE_MODE_GMII);
+&ftgmac100_adjust_link, intf);
 
if (IS_ERR(phydev)) {
netdev_err(netdev, "%s: Could not attach to PHY\n", 
netdev->name);
@@ -1616,6 +1617,8 @@ static int ftgmac100_setup_mdio(struct net_device *netdev)
 {
struct ftgmac100 *priv = netdev_priv(netdev);
struct platform_device *pdev = to_platform_device(priv->dev);
+   int phy_intf = PHY_INTERFACE_MODE_RGMII;
+   struct device_node *np = pdev->dev.of_node;
int i, err = 0;
u32 reg;
 
@@ -1631,6 +1634,39 @@ static int ftgmac100_setup_mdio(struct net_device 
*netdev)
iowrite32(reg, priv->base + FTGMAC100_OFFSET_REVR);
};
 
+   /* Get PHY mode from device-tree */
+   if (np) {
+   /* Default to RGMII. It's a gigabit part after all */
+   phy_intf = of_get_phy_mode(np);
+   if (phy_intf < 0)
+   phy_intf = PHY_INTERFACE_MODE_RGMII;
+
+   /* Aspeed only supports these. I don't know about other IP
+* block vendors so I'm going to just let them through for
+* now. Note that this is only a warning if for some obscure
+* reason the DT really means to lie about it or it's a newer
+* part we don't know about.
+*
+* On the Aspeed SoC there are additionally straps and SCU
+* control bits that could tell us what the interface is
+* (or allow us to configure it while the IP block is held
+* in reset). For now I chose to keep this driver away from
+* those SoC specific bits and assume the device-tree is
+* right and the SCU has been configured properly by pinmux
+* or the firmware.
+*/
+   if (priv->is_aspeed &&
+   phy_intf != PHY_INTERFACE_MODE_RMII &&
+   phy_intf != PHY_INTERFACE_MODE_RGMII &&
+   phy_intf != PHY_INTERFACE_MODE_RGMII_ID &&
+   phy_intf != PHY_INTERFACE_MODE_RGMII_RXID &&
+   phy_intf != PHY_INTERFACE_MODE_RGMII_TXID) {
+   netdev_warn(netdev,
+  "Unsupported PHY mode %s !\n",
+  phy_modes(phy_intf));
+   }
+   }
+
priv->mii_bus->name = "ftgmac100_mdio";
snprintf(priv->mii_bus->id, MII_BUS_ID_SIZE, "%s-%d",
 pdev->name, pdev->id);
@@ -1647,7 +1683,7 @@ static int ftgmac100_setup_mdio(struct net_device *netdev)
goto err_register_mdiobus;
}
 
-   err = ftgmac100_mii_probe(priv);
+   err = ftgmac100_mii_probe(priv, phy_intf);
if (err) {
dev_err(priv->dev, "MII Probe failed!\n");
goto err_mii_probe;
-- 
2.9.3
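
The selection logic in the patch above (look up the property, fall back to RGMII, warn about modes outside the supported set) can be modelled in userspace as below. The enum, toy_get_phy_mode() and toy_select_mode() are hypothetical and do not match the kernel's phy_interface_t or of_get_phy_mode(); only the control flow is meant to correspond, and the supported set is reduced to two modes for brevity.

```c
#include <string.h>

/* Hypothetical subset of interface modes; the values are illustrative
 * and do not match the kernel's numbering. */
enum toy_phy_mode {
	TOY_MODE_GMII,
	TOY_MODE_RMII,
	TOY_MODE_RGMII,
};

/* Map a "phy-mode" string to a mode; return -1 when the property is
 * absent (NULL) or not recognised, mimicking a negative errno. */
int toy_get_phy_mode(const char *prop)
{
	if (!prop)
		return -1;
	if (strcmp(prop, "gmii") == 0)
		return TOY_MODE_GMII;
	if (strcmp(prop, "rmii") == 0)
		return TOY_MODE_RMII;
	if (strcmp(prop, "rgmii") == 0)
		return TOY_MODE_RGMII;
	return -1;
}

/* Fall back to RGMII when the lookup fails; flag (but keep) modes
 * outside the supported set, since the patch only warns about them. */
int toy_select_mode(const char *prop, int *unsupported)
{
	int mode = toy_get_phy_mode(prop);

	if (mode < 0)
		mode = TOY_MODE_RGMII;
	*unsupported = (mode != TOY_MODE_RMII && mode != TOY_MODE_RGMII);
	return mode;
}
```

Keeping an unsupported mode instead of rejecting it matches the comment in the patch about letting other IP block vendors through.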

[PATCH v2 9/9] ftgmac100: Document device-tree binding

Signed-off-by: Benjamin Herrenschmidt 
---
 .../devicetree/bindings/net/ftgmac100.txt  | 36 ++
 1 file changed, 36 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/ftgmac100.txt

diff --git a/Documentation/devicetree/bindings/net/ftgmac100.txt 
b/Documentation/devicetree/bindings/net/ftgmac100.txt
new file mode 100644
index 0000000..68a694a
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/ftgmac100.txt
@@ -0,0 +1,36 @@
+* Faraday Technology FTGMAC100 gigabit ethernet controller
+
+Required properties:
+- compatible: "faraday,ftgmac100"
+
+  Must also contain one of these if used as part of an Aspeed AST2400
+  or 2500 family SoC as they have some subtle tweaks to the
+  implementation:
+
+ - "aspeed,ast2400-mac"
+ - "aspeed,ast2500-mac"
+
+- reg: Address and length of the register set for the device
+- interrupts: Should contain ethernet controller interrupt
+
+Optional properties:
+- phy-mode: See ethernet.txt file in the same directory. If the property is
+  absent, "rgmii" is assumed. Supported values are "rgmii" and "rmii"
+- use-ncsi: Use the NC-SI stack instead of an MDIO PHY. Currently assumes
+  rmii (100bT) but kept as a separate property in case NC-SI grows support
+  for a gigabit link.
+- no-hw-checksum: Used to disable HW checksum support. Here for backward
+  compatibility as the driver now should have correct defaults based on
+  the SoC.
+
+Example:
+
+   mac0: ethernet@1e660000 {
+   compatible = "aspeed,ast2500-mac", "faraday,ftgmac100";
+   reg = <0x1e660000 0x180>;
+   interrupts = <2>;
+   status = "okay";
+   use-ncsi;
+   };
+
+
-- 
2.9.3

[PATCH v2 8/9] ftgmac100: Fix potential ordering issue in NAPI poll

We need to ensure the loads from the descriptor are done after the
MMIO store clearing the interrupts has completed, otherwise we
might still miss work.

A read back from the MMIO register will "push" the posted store and
ioread32 has a barrier on weakly ordered architectures that will
order subsequent accesses.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index 71763e4..9b7a24e 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -1347,6 +1347,13 @@ static int ftgmac100_poll(struct napi_struct *napi, int 
budget)
 */
iowrite32(FTGMAC100_INT_RXTX,
  priv->base + FTGMAC100_OFFSET_ISR);
+
+   /* Push the above (and provides a barrier vs. subsequent
+* reads of the descriptor).
+*/
+   ioread32(priv->base + FTGMAC100_OFFSET_ISR);
+
+   /* Check RX and TX descriptors for more work to do */
if (ftgmac100_check_rx(priv) ||
ftgmac100_tx_buf_cleanable(priv))
return budget;
-- 
2.9.3

[PATCH v2 3/9] ftgmac100: Add ndo_set_rx_mode() and support for multicast & promisc

This adds the ndo_set_rx_mode() callback to configure the
multicast filters, promisc and allmulti options.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 52 
 1 file changed, 52 insertions(+)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index 4be8bf9..551ab3e 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -97,6 +98,10 @@ struct ftgmac100 {
int cur_duplex;
bool use_ncsi;
 
+   /* Multicast filter settings */
+   u32 maht0;
+   u32 maht1;
+
/* Flow control settings */
bool tx_pause;
bool rx_pause;
@@ -264,6 +269,10 @@ static void ftgmac100_init_hw(struct ftgmac100 *priv)
/* Write MAC address */
ftgmac100_write_mac_addr(priv, priv->netdev->dev_addr);
 
+   /* Write multicast filter */
+   iowrite32(priv->maht0, priv->base + FTGMAC100_OFFSET_MAHT0);
+   iowrite32(priv->maht1, priv->base + FTGMAC100_OFFSET_MAHT1);
+
/* Configure descriptor sizes and increase burst sizes according
 * to values in Aspeed SDK. The FIFO arbitration is enabled and
 * the thresholds set based on the recommended values in the
@@ -317,6 +326,12 @@ static void ftgmac100_start_hw(struct ftgmac100 *priv)
/* Add other bits as needed */
if (priv->cur_duplex == DUPLEX_FULL)
maccr |= FTGMAC100_MACCR_FULLDUP;
+   if (priv->netdev->flags & IFF_PROMISC)
+   maccr |= FTGMAC100_MACCR_RX_ALL;
+   if (priv->netdev->flags & IFF_ALLMULTI)
+   maccr |= FTGMAC100_MACCR_RX_MULTIPKT;
+   else if (netdev_mc_count(priv->netdev))
+   maccr |= FTGMAC100_MACCR_HT_MULTI_EN;
 
/* Hit the HW */
iowrite32(maccr, priv->base + FTGMAC100_OFFSET_MACCR);
@@ -327,6 +342,42 @@ static void ftgmac100_stop_hw(struct ftgmac100 *priv)
iowrite32(0, priv->base + FTGMAC100_OFFSET_MACCR);
 }
 
+static void ftgmac100_calc_mc_hash(struct ftgmac100 *priv)
+{
+   struct netdev_hw_addr *ha;
+
+   priv->maht1 = 0;
+   priv->maht0 = 0;
+   netdev_for_each_mc_addr(ha, priv->netdev) {
+   u32 crc_val = ether_crc_le(ETH_ALEN, ha->addr);
+
+   crc_val = (~(crc_val >> 2)) & 0x3f;
+   if (crc_val >= 32)
+   priv->maht1 |= 1ul << (crc_val - 32);
+   else
+   priv->maht0 |= 1ul << (crc_val);
+   }
+}
+
+static void ftgmac100_set_rx_mode(struct net_device *netdev)
+{
+   struct ftgmac100 *priv = netdev_priv(netdev);
+
+   /* Setup the hash filter */
+   ftgmac100_calc_mc_hash(priv);
+
+   /* Interface down ? that's all there is to do */
+   if (!netif_running(netdev))
+   return;
+
+   /* Update the HW */
+   iowrite32(priv->maht0, priv->base + FTGMAC100_OFFSET_MAHT0);
+   iowrite32(priv->maht1, priv->base + FTGMAC100_OFFSET_MAHT1);
+
+   /* Reconfigure MACCR */
+   ftgmac100_start_hw(priv);
+}
+
 static int ftgmac100_alloc_rx_buf(struct ftgmac100 *priv, unsigned int entry,
  struct ftgmac100_rxdes *rxdes, gfp_t gfp)
 {
@@ -1501,6 +1552,7 @@ static const struct net_device_ops ftgmac100_netdev_ops = 
{
.ndo_validate_addr  = eth_validate_addr,
.ndo_do_ioctl   = ftgmac100_do_ioctl,
.ndo_tx_timeout = ftgmac100_tx_timeout,
+   .ndo_set_rx_mode= ftgmac100_set_rx_mode,
 };
 
 static int ftgmac100_setup_mdio(struct net_device *netdev)
-- 
2.9.3
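
For reference, the hash-bit selection in ftgmac100_calc_mc_hash() above can be reproduced in userspace roughly as follows. toy_ether_crc_le() is a bitwise CRC written for this sketch; that it matches the kernel's ether_crc_le() (little-endian CRC-32 seeded with all ones, no final inversion) is an assumption here, and toy_mc_hash_add() is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise little-endian CRC-32 (reflected polynomial 0xedb88320),
 * seeded with all ones and with no final inversion. Assumed, not
 * verified, to compute the same value as the kernel's ether_crc_le(). */
uint32_t toy_ether_crc_le(size_t len, const uint8_t *data)
{
	uint32_t crc = 0xffffffffu;

	while (len--) {
		crc ^= *data++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ ((crc & 1u) ? 0xedb88320u : 0u);
	}
	return crc;
}

/* Fold one MAC address into the two 32-bit hash-table words, using the
 * same index computation as the patch: six bits taken from the
 * inverted CRC shifted right by two. */
void toy_mc_hash_add(const uint8_t mac[6], uint32_t *maht0, uint32_t *maht1)
{
	uint32_t idx = (~(toy_ether_crc_le(6, mac) >> 2)) & 0x3f;

	if (idx >= 32)
		*maht1 |= UINT32_C(1) << (idx - 32);
	else
		*maht0 |= UINT32_C(1) << idx;
}
```

Each address sets exactly one of the 64 filter bits, so distinct addresses can collide and the filter is only approximate.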

[PATCH v2 5/9] ftgmac100: Add netpoll support

Just call the interrupt handler with interrupts locally disabled

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index 08fe228..b1fb729 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -1586,6 +1586,17 @@ static int ftgmac100_set_features(struct net_device 
*netdev,
return 0;
 }
 
+#ifdef CONFIG_NET_POLL_CONTROLLER
+static void ftgmac100_poll_controller(struct net_device *netdev)
+{
+   unsigned long flags;
+
+   local_irq_save(flags);
+   ftgmac100_interrupt(netdev->irq, netdev);
+   local_irq_restore(flags);
+}
+#endif
+
 static const struct net_device_ops ftgmac100_netdev_ops = {
.ndo_open   = ftgmac100_open,
.ndo_stop   = ftgmac100_stop,
@@ -1596,6 +1607,9 @@ static const struct net_device_ops ftgmac100_netdev_ops = 
{
.ndo_tx_timeout = ftgmac100_tx_timeout,
.ndo_set_rx_mode= ftgmac100_set_rx_mode,
.ndo_set_features   = ftgmac100_set_features,
+#ifdef CONFIG_NET_POLL_CONTROLLER
+   .ndo_poll_controller= ftgmac100_poll_controller,
+#endif
 };
 
 static int ftgmac100_setup_mdio(struct net_device *netdev)
-- 
2.9.3

[PATCH v2 7/9] ftgmac100: Display the discovered PHY device info

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index 7c607eb..71763e4 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -1075,6 +1075,9 @@ static int ftgmac100_mii_probe(struct ftgmac100 *priv, 
phy_interface_t intf)
phydev->supported |= SUPPORTED_Pause | SUPPORTED_Asym_Pause;
phydev->advertising = phydev->supported;
 
+   /* Display what we found */
+   phy_attached_info(phydev);
+
return 0;
 }
 
-- 
2.9.3

[PATCH v2 4/9] ftgmac100: Add vlan HW offload

The chip supports HW vlan tag insertion and extraction. Add support
for it.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 46 +++-
 1 file changed, 45 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index 551ab3e..08fe228 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -333,6 +334,10 @@ static void ftgmac100_start_hw(struct ftgmac100 *priv)
else if (netdev_mc_count(priv->netdev))
maccr |= FTGMAC100_MACCR_HT_MULTI_EN;
 
+   /* Vlan filtering enabled */
+   if (priv->netdev->features & NETIF_F_HW_VLAN_CTAG_RX)
+   maccr |= FTGMAC100_MACCR_RM_VLAN;
+
/* Hit the HW */
iowrite32(maccr, priv->base + FTGMAC100_OFFSET_MACCR);
 }
@@ -528,6 +533,12 @@ static bool ftgmac100_rx_packet(struct ftgmac100 *priv, 
int *processed)
/* Transfer received size to skb */
skb_put(skb, size);
 
+   /* Extract vlan tag */
+   if ((netdev->features & NETIF_F_HW_VLAN_CTAG_RX) &&
+   (csum_vlan & FTGMAC100_RXDES1_VLANTAG_AVAIL))
+   __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q),
+  csum_vlan & 0xffff);
+
/* Tear down DMA mapping, do necessary cache management */
map = le32_to_cpu(rxdes->rxdes3);
 
@@ -752,6 +763,13 @@ static int ftgmac100_hard_start_xmit(struct sk_buff *skb,
if (skb->ip_summed == CHECKSUM_PARTIAL &&
!ftgmac100_prep_tx_csum(skb, &csum_vlan))
goto drop;
+
+   /* Add VLAN tag */
+   if (skb_vlan_tag_present(skb)) {
+   csum_vlan |= FTGMAC100_TXDES1_INS_VLANTAG;
+   csum_vlan |= skb_vlan_tag_get(skb) & 0xffff;
+   }
+
txdes->txdes1 = cpu_to_le32(csum_vlan);
 
/* Next descriptor */
@@ -1544,6 +1562,30 @@ static void ftgmac100_tx_timeout(struct net_device 
*netdev)
schedule_work(&priv->reset_task);
 }
 
+static int ftgmac100_set_features(struct net_device *netdev,
+ netdev_features_t features)
+{
+   struct ftgmac100 *priv = netdev_priv(netdev);
+   netdev_features_t changed = netdev->features ^ features;
+
+   if (!netif_running(netdev))
+   return 0;
+
+   /* Update the vlan filtering bit */
+   if (changed & NETIF_F_HW_VLAN_CTAG_RX) {
+   u32 maccr;
+
+   maccr = ioread32(priv->base + FTGMAC100_OFFSET_MACCR);
+   if (priv->netdev->features & NETIF_F_HW_VLAN_CTAG_RX)
+   maccr |= FTGMAC100_MACCR_RM_VLAN;
+   else
+   maccr &= ~FTGMAC100_MACCR_RM_VLAN;
+   iowrite32(maccr, priv->base + FTGMAC100_OFFSET_MACCR);
+   }
+
+   return 0;
+}
+
 static const struct net_device_ops ftgmac100_netdev_ops = {
.ndo_open   = ftgmac100_open,
.ndo_stop   = ftgmac100_stop,
@@ -1553,6 +1595,7 @@ static const struct net_device_ops ftgmac100_netdev_ops = 
{
.ndo_do_ioctl   = ftgmac100_do_ioctl,
.ndo_tx_timeout = ftgmac100_tx_timeout,
.ndo_set_rx_mode= ftgmac100_set_rx_mode,
+   .ndo_set_features   = ftgmac100_set_features,
 };
 
 static int ftgmac100_setup_mdio(struct net_device *netdev)
@@ -1728,7 +1771,8 @@ static int ftgmac100_probe(struct platform_device *pdev)
 
/* Base feature set */
netdev->hw_features = NETIF_F_RXCSUM | NETIF_F_HW_CSUM |
-   NETIF_F_GRO | NETIF_F_SG;
+   NETIF_F_GRO | NETIF_F_SG | NETIF_F_HW_VLAN_CTAG_RX |
+   NETIF_F_HW_VLAN_CTAG_TX;
 
/* AST2400  doesn't have working HW checksum generation */
if (np && (of_device_is_compatible(np, "aspeed,ast2400-mac")))
-- 
2.9.3
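
A sketch of how a VLAN tag shares a descriptor word with flag bits, as the TX and RX hunks above do. The TOY_* bit positions and the 16-bit tag mask are assumptions made for this example; the real definitions live in ftgmac100.h and are not part of this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag positions, chosen for illustration only. */
#define TOY_TXDES1_INS_VLANTAG    (UINT32_C(1) << 16)
#define TOY_RXDES1_VLANTAG_AVAIL  (UINT32_C(1) << 24)
/* Assumed: the 16-bit tag control field occupies the low 16 bits. */
#define TOY_VLAN_TCI_MASK         UINT32_C(0xffff)

/* TX: ask the MAC to insert `tci` by setting the flag and the tag. */
uint32_t toy_tx_add_vlan(uint32_t desc1, uint16_t tci)
{
	return desc1 | TOY_TXDES1_INS_VLANTAG | (tci & TOY_VLAN_TCI_MASK);
}

/* RX: extract the tag only if the MAC flagged one as available. */
bool toy_rx_get_vlan(uint32_t desc1, uint16_t *tci)
{
	if (!(desc1 & TOY_RXDES1_VLANTAG_AVAIL))
		return false;
	*tci = (uint16_t)(desc1 & TOY_VLAN_TCI_MASK);
	return true;
}
```

Checking the "available" flag first matters because the low bits of the word are not guaranteed to be a tag when the flag is clear.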

[PATCH v2 2/9] ftgmac100: Add pause frames configuration and support

Hopefully my understanding of how the hardware works is correct,
as the documentation isn't completely clear. So far I have seen
no obvious issue. Pause seems to also work with NC-SI.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 96 +++-
 drivers/net/ethernet/faraday/ftgmac100.h |  7 +++
 2 files changed, 102 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index 66a5065..4be8bf9 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -97,6 +97,11 @@ struct ftgmac100 {
int cur_duplex;
bool use_ncsi;
 
+   /* Flow control settings */
+   bool tx_pause;
+   bool rx_pause;
+   bool aneg_pause;
+
/* Misc */
bool need_mac_restart;
bool is_aspeed;
@@ -217,6 +222,23 @@ static int ftgmac100_set_mac_addr(struct net_device *dev, 
void *p)
return 0;
 }
 
+static void ftgmac100_config_pause(struct ftgmac100 *priv)
+{
+   u32 fcr = FTGMAC100_FCR_PAUSE_TIME(16);
+
+   /* Throttle tx queue when receiving pause frames */
+   if (priv->rx_pause)
+   fcr |= FTGMAC100_FCR_FC_EN;
+
+   /* Enables sending pause frames when the RX queue is past a
+* certain threshold.
+*/
+   if (priv->tx_pause)
+   fcr |= FTGMAC100_FCR_FCTHR_EN;
+
+   iowrite32(fcr, priv->base + FTGMAC100_OFFSET_FCR);
+}
+
 static void ftgmac100_init_hw(struct ftgmac100 *priv)
 {
u32 reg, rfifo_sz, tfifo_sz;
@@ -910,6 +932,7 @@ static void ftgmac100_adjust_link(struct net_device *netdev)
 {
struct ftgmac100 *priv = netdev_priv(netdev);
struct phy_device *phydev = netdev->phydev;
+   bool tx_pause, rx_pause;
int new_speed;
 
/* We store "no link" as speed 0 */
@@ -918,8 +941,21 @@ static void ftgmac100_adjust_link(struct net_device 
*netdev)
else
new_speed = phydev->speed;
 
+   /* Grab pause settings from PHY if configured to do so */
+   if (priv->aneg_pause) {
+   rx_pause = tx_pause = phydev->pause;
+   if (phydev->asym_pause)
+   tx_pause = !rx_pause;
+   } else {
+   rx_pause = priv->rx_pause;
+   tx_pause = priv->tx_pause;
+   }
+
+   /* Link hasn't changed, do nothing */
if (phydev->speed == priv->cur_speed &&
-   phydev->duplex == priv->cur_duplex)
+   phydev->duplex == priv->cur_duplex &&
+   rx_pause == priv->rx_pause &&
+   tx_pause == priv->tx_pause)
return;
 
/* Print status if we have a link or we had one and just lost it,
@@ -930,6 +966,8 @@ static void ftgmac100_adjust_link(struct net_device *netdev)
 
priv->cur_speed = new_speed;
priv->cur_duplex = phydev->duplex;
+   priv->rx_pause = rx_pause;
+   priv->tx_pause = tx_pause;
 
/* Link is down, do nothing else */
if (!new_speed)
@@ -961,6 +999,12 @@ static int ftgmac100_mii_probe(struct ftgmac100 *priv)
return PTR_ERR(phydev);
}
 
+   /* Indicate that we support PAUSE frames (see comment in
+* Documentation/networking/phy.txt)
+*/
+   phydev->supported |= SUPPORTED_Pause | SUPPORTED_Asym_Pause;
+   phydev->advertising = phydev->supported;
+
return 0;
 }
 
@@ -1076,6 +1120,48 @@ static int ftgmac100_set_ringparam(struct net_device 
*netdev,
return 0;
 }
 
+static void ftgmac100_get_pauseparam(struct net_device *netdev,
+struct ethtool_pauseparam *pause)
+{
+   struct ftgmac100 *priv = netdev_priv(netdev);
+
+   pause->autoneg = priv->aneg_pause;
+   pause->tx_pause = priv->tx_pause;
+   pause->rx_pause = priv->rx_pause;
+}
+
+static int ftgmac100_set_pauseparam(struct net_device *netdev,
+   struct ethtool_pauseparam *pause)
+{
+   struct ftgmac100 *priv = netdev_priv(netdev);
+   struct phy_device *phydev = netdev->phydev;
+
+   priv->aneg_pause = pause->autoneg;
+   priv->tx_pause = pause->tx_pause;
+   priv->rx_pause = pause->rx_pause;
+
+   if (phydev) {
+   phydev->advertising &= ~ADVERTISED_Pause;
+   phydev->advertising &= ~ADVERTISED_Asym_Pause;
+
+   if (pause->rx_pause) {
+   phydev->advertising |= ADVERTISED_Pause;
+   phydev->advertising |= ADVERTISED_Asym_Pause;
+   }
+
+   if (pause->tx_pause)
+   phydev->advertising ^= ADVERTISED_Asym_Pause;
+   }
+   if (netif_running(netdev)) {
+   if (phydev && priv->aneg_pause)
+   phy_start_aneg(phydev);
+   else
+   ftgmac100_config_pause(priv);
+   }
+
+   return 0;
+}
+
 static const struct ethtool
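
The advertisement bits computed in ftgmac100_set_pauseparam() above follow a set-then-toggle pattern that is easy to misread, so here is the same manipulation as a standalone function. The TOY_ADV_* values are illustrative placeholders for ADVERTISED_Pause and ADVERTISED_Asym_Pause, and toy_pause_advert() is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values standing in for the ethtool definitions. */
#define TOY_ADV_PAUSE       (UINT32_C(1) << 13)
#define TOY_ADV_ASYM_PAUSE  (UINT32_C(1) << 14)

/* Same bit manipulation as in the patch: rx sets both bits, tx then
 * toggles the asymmetric bit. */
uint32_t toy_pause_advert(bool rx_pause, bool tx_pause)
{
	uint32_t adv = 0;

	if (rx_pause)
		adv |= TOY_ADV_PAUSE | TOY_ADV_ASYM_PAUSE;
	if (tx_pause)
		adv ^= TOY_ADV_ASYM_PAUSE;
	return adv;
}
```

The four combinations work out to: neither direction gives no bits, rx only gives Pause and Asym_Pause, tx only gives Asym_Pause, and both directions give Pause alone.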

[PATCH v2 0/9] ftgmac100: Rework batch 5 - Features

This is the second spin of the fifth and last batch of
updates to the ftgmac100 driver.

This contains a few additional "features" such as:

 - Support for ethtool n-way reset
 - Multicast filtering & promisc support
 - Vlan offload
 - netpoll

And a couple of misc bits. This also adds the device-tree binding
documentation.

v2. - Addresses review comments and adds a new patch fixing a
  theoretical ordering issue in my new NAPI poll implementation
- Add a bug fix (Patch 8/9) for a potential ordering issue
  in the new NAPI poll code.

[PATCH v2 1/9] ftgmac100: Add ethtool n-way reset call

A non-wired up implementation accidentally made its way in
a previous patch (Make ring sizes configurable via ethtool).

This removes it and wires up the generic phy_ethtool_nway_reset
instead.

Signed-off-by: Benjamin Herrenschmidt 
--

v2. - Use phy_ethtool_nway_reset() instead of custom implementation
---
 drivers/net/ethernet/faraday/ftgmac100.c | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index 796b37e..66a5065 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -1043,13 +1043,6 @@ static void ftgmac100_get_drvinfo(struct net_device 
*netdev,
strlcpy(info->bus_info, dev_name(&netdev->dev), sizeof(info->bus_info));
 }
 
-static int ftgmac100_nway_reset(struct net_device *ndev)
-{
-   if (!ndev->phydev)
-   return -ENXIO;
-   return phy_start_aneg(ndev->phydev);
-}
-
 static void ftgmac100_get_ringparam(struct net_device *netdev,
struct ethtool_ringparam *ering)
 {
@@ -1088,6 +1081,7 @@ static const struct ethtool_ops ftgmac100_ethtool_ops = {
.get_link   = ethtool_op_get_link,
.get_link_ksettings = phy_ethtool_get_link_ksettings,
.set_link_ksettings = phy_ethtool_set_link_ksettings,
+   .nway_reset = phy_ethtool_nway_reset,
.get_ringparam  = ftgmac100_get_ringparam,
.set_ringparam  = ftgmac100_set_ringparam,
 };
-- 
2.9.3

Re: [PATCH net-next 4/8] net/ncsi: Add debugging infrastructure

2017-04-12 Thread Joe Perches

On Thu, 2017-04-13 at 12:46 +1000, Gavin Shan wrote:
> This creates procfs directories as NCSI debugging infrastructure.
> With the patch applied, we will see the procfs directories below. Every
> NCSI package and channel has one corresponding directory. Other than
> presenting the NCSI topology, no real function has been achieved
> through these procfs directories so far.

/proc is meant to be stable.

Why not use debugfs?

>  /proc/ncsi/eth0
>  /proc/ncsi/eth0/p0
>  /proc/ncsi/eth0/p0/c0
>  /proc/ncsi/eth0/p0/c1

Re: [PATCH linux 2/2] net sched actions: fix refcount decrement on error

On Wed, Apr 12, 2017 at 7:21 AM, Wolfgang Bumiller
 wrote:
> If memory allocation for nla_memdup_cookie() fails
> module_put has to be guarded by the same condition as it was
> before the TCA_ACT_COOKIE has been added as stated in the
> comment afterwards:
>
> /* module count goes up only when brand new policy is created
>  * if it exists and is only bound to in a_o->init() then
>  * ACT_P_CREATED is not returned (a zero is).
>  */

Yeah, this patch makes sense to me too. Just one comment below.

>
> Signed-off-by: Wolfgang Bumiller 
> ---
>
> Note that I'm unsure about this patch. The hangups weren't very reliable
> and I couldn't actually reproduce them when building from git/master (as
> I can only test a fraction of my usual workload with it as a lot of my
> data (VMs & containers utilizing veths and tap devices) is on ZFS...).
> In any case it can't harm to take another look at the error handling
> here.
>
>  net/sched/act_api.c | 12 
>  1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/net/sched/act_api.c b/net/sched/act_api.c
> index 8cc883c063f0..795ac092b723 100644
> --- a/net/sched/act_api.c
> +++ b/net/sched/act_api.c
> @@ -608,15 +608,19 @@ struct tc_action *tcf_action_init_1(struct net *net, 
> struct nlattr *nla,
> int cklen = nla_len(tb[TCA_ACT_COOKIE]);
>
> if (cklen > TC_COOKIE_MAX_SIZE) {
> -   err = -EINVAL;
> tcf_hash_release(a, bind);
> -   goto err_mod;
> +   if (err != ACT_P_CREATED)
> +   module_put(a_o->owner);
> +   err = -EINVAL;
> +   goto err_out;
> }
>
> if (nla_memdup_cookie(a, tb) < 0) {
> -   err = -ENOMEM;
> tcf_hash_release(a, bind);
> -   goto err_mod;
> +   if (err != ACT_P_CREATED)
> +   module_put(a_o->owner);
> +   err = -ENOMEM;
> +   goto err_out;

Instead of duplicating the code, could you add the check
to the module_put() next to the err_mod label? I mean:

@@ -630,7 +630,8 @@ struct tc_action *tcf_action_init_1(struct net
*net, struct nlattr *nla,
return a;

 err_mod:
-   module_put(a_o->owner);
+   if (err != ACT_P_CREATED)
+   module_put(a_o->owner);
 err_out:
return ERR_PTR(err);
 }
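
One detail to watch with the shared-label form: in the hunks quoted above,
err is overwritten with -EINVAL or -ENOMEM before the jump, so a test of err
at the label no longer sees what ->init() returned. A minimal userspace
sketch of that effect follows. owner_model and both function names are
hypothetical stand-ins, not kernel code, and the sketch takes no position on
which refcount outcome the cookie failure path should have; it only shows
that the two forms differ.

```c
#include <errno.h>
#include <stdbool.h>

#define ACT_P_CREATED 1	/* "brand new action was created" */

/* Hypothetical stand-in for a module's reference count. */
struct owner_model {
	int refcnt;
};

/* Shared-label variant: err is overwritten before the jump. */
static int init_check_err_at_label(struct owner_model *owner, int init_ret,
				   bool cookie_fails)
{
	int err = init_ret;

	owner->refcnt++;	/* reference taken when the ops were looked up */
	if (err < 0)
		goto err_mod;
	if (cookie_fails) {
		err = -ENOMEM;	/* the init() result is lost here */
		goto err_mod;
	}
	if (err != ACT_P_CREATED)
		owner->refcnt--;
	return err;

err_mod:
	if (err != ACT_P_CREATED)	/* err is negative here: always true */
		owner->refcnt--;
	return err;
}

/* Same flow, but the init() result is remembered separately. */
static int init_check_saved_result(struct owner_model *owner, int init_ret,
				   bool cookie_fails)
{
	bool created = (init_ret == ACT_P_CREATED);
	int err = init_ret;

	owner->refcnt++;
	if (err < 0)
		goto err_mod;
	if (cookie_fails) {
		err = -ENOMEM;
		goto err_mod;
	}
	if (!created)
		owner->refcnt--;
	return err;

err_mod:
	if (!created)
		owner->refcnt--;
	return err;
}
```

With init_ret == ACT_P_CREATED and a failing cookie allocation, the first
variant drops the reference and the second keeps it.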

Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-12 Thread Alexei Starovoitov

On Wed, Apr 12, 2017 at 02:54:15PM -0400, David Miller wrote:
> 
> This provides a generic SKB based non-optimized XDP path which is used
> if either the driver lacks a specific XDP implementation, or the user
> requests it via a new IFLA_XDP_FLAGS value named XDP_FLAGS_SKB_MODE.
> 
> It is arguable that perhaps I should have required something like
> this as part of the initial XDP feature merge.
> 
> I believe this is critical for two reasons:
> 
> 1) Accessibility.  More people can play with XDP with fewer
>    dependencies.  Yes I know we have XDP support in virtio_net, but
>    that just creates another dependency for learning how to use this
>    facility.
> 
>    I wrote this to make life easier for the XDP newbies.
> 
> 2) As a model for what the expected semantics are.  If there is a pure
>    generic core implementation, it serves as a semantic example for
>    driver folks adding XDP support.
> 
> This is just a rough draft and is untested.
> 
> One thing I have not tried to address here is the issue of
> XDP_PACKET_HEADROOM, thanks to Daniel for spotting that.  It seems
> incredibly expensive to do a skb_cow(skb, XDP_PACKET_HEADROOM) or
> whatever even if the XDP program doesn't try to push headers at all.
> I think we really need the verifier to somehow propagate whether
> certain XDP helpers are used or not.

Looks like we need to relax the headroom requirement.
I really wanted to simplify the life of program writers, but that dream
isn't going to come true: intel drivers already insist on 192 bytes of
headroom, for skb every driver does 64, and netronome doesn't have the room
by default at all, even for XDP, and relies on an expensive copy when
xdp_adjust_head() is used.
I guess for now every driver should _try_ to give XDP_PACKET_HEADROOM
bytes, but for performance reasons the driver can omit it completely if
xdp_adjust_head() is not used.
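
A rough userspace sketch of that policy, to make the rule concrete. The
uses_adjust_head flag is hypothetical (the thread only proposes that the
verifier could propagate such information), and rx_headroom() is an
illustrative name, not an existing kernel function.

```c
#include <stdbool.h>
#include <stddef.h>

#define XDP_PACKET_HEADROOM 256	/* value from include/uapi/linux/bpf.h */

/* Hypothetical per-program info; not an existing kernel structure. */
struct xdp_prog_model {
	bool uses_adjust_head;
};

/* Headroom a driver would reserve in front of each received packet. */
static unsigned int rx_headroom(const struct xdp_prog_model *prog,
				unsigned int driver_default)
{
	if (!prog)
		return driver_default;	/* no XDP program attached */
	if (!prog->uses_adjust_head)
		return driver_default;	/* program never pushes headers */
	return driver_default > XDP_PACKET_HEADROOM ?
	       driver_default : XDP_PACKET_HEADROOM;
}
```

Only a program that actually calls xdp_adjust_head() would make the driver
pay for the larger headroom in this model.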

> +static inline bool netif_elide_gro(const struct net_device *dev)
> +{
> + if (!(dev->features & NETIF_F_GRO) || dev->xdp_prog)
> + return true;
> + return false;
> +}

I think that's the right call.
I've been thinking back and forth about it.
On one side it's not cool to disable it and not inform user
about it in ethtool, so it might feel that doing
ethtool_set_one_feature(~GRO) may be cleaner, but then
we'd need to remember the current gro on/off status and
restore that after prog is detached.
And while the prog is attached the user shouldn't be able
to change gro via ethtool -K, so we'd need extra boolean
flag anyway. If we go with this netif_elide_gro() approach,
we don't need to mess with ndo_fix_features() and only
need to hack ethtool_get_features() to report GRO disabled
if (dev->xdp_prog) and mark it as [fixed].
So overall imo it's cleaner than messing with dev->features
directly while attaching/detaching the prog.
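
To illustrate why this avoids the save/restore dance, here is a small
userspace model. The struct, the feature bit value and report_gro() are
made up for the example; only the logic of netif_elide_gro() is taken from
the quoted patch.

```c
#include <stdbool.h>
#include <stddef.h>

#define NETIF_F_GRO_MODEL (1u << 0)	/* hypothetical bit, not the kernel value */

struct net_device_model {
	unsigned int features;	/* what the user configured via ethtool -K */
	const void *xdp_prog;	/* non-NULL while a program is attached */
};

/* Mirrors the logic of the quoted netif_elide_gro(). */
static bool netif_elide_gro_model(const struct net_device_model *dev)
{
	return !(dev->features & NETIF_F_GRO_MODEL) || dev->xdp_prog != NULL;
}

/* What a features query could report while dev->features stays untouched. */
struct gro_report {
	bool enabled;
	bool fixed;
};

static struct gro_report report_gro(const struct net_device_model *dev)
{
	struct gro_report r = {
		.enabled = !netif_elide_gro_model(dev),
		.fixed = dev->xdp_prog != NULL,
	};

	return r;
}
```

Because dev->features is never modified, detaching the program brings back
whatever GRO setting the user had, with nothing to remember in between.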

> + if (skb_linearize(skb))
> + goto do_drop;

when we discussed supporting jumbo frames in XDP, the idea
was that the program would need to look at first 3+k bytes only
and the rest of the packet will be in non-contiguous pages.
If we do that, it means that XDP program would have to assume
that the packet is more than [data, data_end] and this range
only covers linear part.
If that's the future, we don't need to linearize the skb here
and can let the program access headlen only.
It means that we'd need to add 'len' field to 'struct xdp_md' uapi.
For our existing programs (ddos and l4lb) it's not a problem,
since for MTU < 3.xK and physical XDP the driver guarantees that
data_end - data == len, so nothing will break.
So I'm proposing to do two things:
- drop skb_linearize here
- introduce 'len' to 'struct xdp_md' and update it here and
in the existing drivers that support xdp.
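
A sketch of what a program would see under that proposal. This is a
userspace model: xdp_md_model and the helper names are hypothetical, and the
real 'struct xdp_md' has no 'len' member at the time of this mail.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of the context with the proposed 'len' added. */
struct xdp_md_model {
	const unsigned char *data;	/* start of the linear part */
	const unsigned char *data_end;	/* end of the linear part */
	unsigned int len;		/* total packet length */
};

/* True when [data, data_end) covers the whole packet. */
static bool xdp_is_fully_linear(const struct xdp_md_model *ctx)
{
	return (size_t)(ctx->data_end - ctx->data) == ctx->len;
}

/* Bounds check still needed before touching [off, off + size). */
static bool xdp_can_access(const struct xdp_md_model *ctx, size_t off,
			   size_t size)
{
	size_t linear = (size_t)(ctx->data_end - ctx->data);

	return off <= linear && size <= linear - off;
}
```

Existing programs only ever see the first case, so their
data/data_end checks keep working unchanged.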

> + if (act == XDP_TX)
> + dev_queue_xmit(skb);

this will go through qdisc which is not a problem per-se,
but may mislead poor users that XDP_TX goes through qdisc
even for in-driver XDP which is not the case.
So I think we need to bypass qdisc somehow. Like
txq = netdev_pick_tx(skb,..
HARD_TX_LOCK(dev, txq..
if (!netif_xmit_stopped(txq)) {
  dev_hard_start_xmit(skb, dev, txq,
} else {
  kfree_skb(skb);
  txq->xdp_tx_full++;
}
HARD_TX_UNLOCK(dev, txq);

this way it will be similar to in-driver XDP which
also has xdp_tx_full counter when HW TX queue is full.

Regarding the csum and skb->encapsulate issues that were raised in the
previous thread:

All cls_bpf csum helpers are currently disabled for XDP,
and if the program mangles the packet and then does XDP_PASS,
the packet will be dropped by the stack due to incorrect csum.
So we're no better here and need a solution for both in-driver XDP and
generic XDP.

I think the green light to apply this patch will be when
samples/bpf/xdp1, xdp2 and xdp_tx_iptunnel
work just like they do in in-driver XDP and I think we're pretty close.

If desired I can start hacking on this patch and testing mid next week.

Re: [PATCH v2 0/3] of: Make of_match_node() an inline stub for CONFIG_OF=n

2017-04-12 Thread Frank Rowand

On 04/11/17 21:41, Florian Fainelli wrote:
> Hi all,
> 
> This patch series makes of_match_node() an inline stub for CONFIG_OF=n. kbuild
> reported two build errors which are fixed as prerequisite patches.
> 
> This is based on Linus' master, not sure which tree would merge this, Frank's?

It would come in via Rob.

I am not comfortable with patch 3/3 at this moment.

Version 1 of the patch resulted in two errors from the kbuild test robot.  This
version results in another error from the kbuild test robot.  I know it is a
lot of work, but please look at all of the callers of of_match_node() and
check whether any of the other callers will have the same type of error that
the kbuild test robot is catching.
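
For anyone auditing the callers, a small userspace sketch of the kind of
breakage to look for. For CONFIG_OF=n, of_match_node() is currently a macro
that expands to NULL, so a match table hidden behind #ifdef CONFIG_OF is
never referenced; an inline stub evaluates its arguments, so the table has
to exist. All *_model names and CONFIG_OF_MODEL are hypothetical stand-ins
for the kernel types and config symbol.

```c
#include <stddef.h>
#include <string.h>

struct of_device_id_model { const char *compatible; };
struct device_node_model { const char *compatible; };

#ifdef CONFIG_OF_MODEL
/* "Real" lookup: walk the table until the empty sentinel entry. */
static inline const struct of_device_id_model *
of_match_node_model(const struct of_device_id_model *matches,
		    const struct device_node_model *node)
{
	for (; matches && matches->compatible; matches++)
		if (node && node->compatible &&
		    !strcmp(matches->compatible, node->compatible))
			return matches;
	return NULL;
}
#else
/* Inline stub: arguments are type-checked, so 'matches' must be defined. */
static inline const struct of_device_id_model *
of_match_node_model(const struct of_device_id_model *matches,
		    const struct device_node_model *node)
{
	(void)matches;
	(void)node;
	return NULL;
}
#endif

/* No #ifdef around the table: the stub above references it in both cases. */
static const struct of_device_id_model demo_dt_match[] = {
	{ .compatible = "vendor,demo" },
	{ NULL }
};

static int demo_probe(const struct device_node_model *node)
{
	return of_match_node_model(demo_dt_match, node) ? 0 : -1;
}
```

Any caller whose table is still wrapped in #ifdef CONFIG_OF would fail to
build in the stub configuration, which is presumably what the robot keeps
finding.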

-Frank

> 
> Thanks!
> 
> Florian Fainelli (3):
>   mfd: max8998: Remove CONFIG_OF around max8998_dt_match
>   net: macb: Remove CONFIG_OF around DT match table
>   of: Make of_match_node() an inline stub for CONFIG_OF=n
> 
>  drivers/mfd/max8998.c   | 2 --
>  drivers/net/ethernet/cadence/macb.c | 2 --
>  include/linux/of.h  | 6 +-
>  3 files changed, 5 insertions(+), 5 deletions(-)
>

[PATCH net-next 09/16] net/mlx5e: IPoIB, Basic netdev ndos open/close

Implement open/close of IPoIB netdevice ndos using mlx5e's
channels API to manage data path resources (RQs/SQs/CQs).

Set IPoIB netdev address on dev_init ndo.

Signed-off-by: Saeed Mahameed 
Reviewed-by: Erez Shitrit 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  2 +
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |  4 +-
 drivers/net/ethernet/mellanox/mlx5/core/ipoib.c   | 90 ++-
 3 files changed, 93 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 5345d875b695..23b92ec54e12 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -883,6 +883,8 @@ typedef int (*mlx5e_fp_hw_modify)(struct mlx5e_priv *priv);
 void mlx5e_switch_priv_channels(struct mlx5e_priv *priv,
struct mlx5e_channels *new_chs,
mlx5e_fp_hw_modify hw_modify);
+void mlx5e_activate_priv_channels(struct mlx5e_priv *priv);
+void mlx5e_deactivate_priv_channels(struct mlx5e_priv *priv);
 
 void mlx5e_build_default_indir_rqt(struct mlx5_core_dev *mdev,
   u32 *indirection_rqt, int len,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 1fde4e2301a4..eb657987e9b5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2547,7 +2547,7 @@ static void mlx5e_build_channels_tx_maps(struct 
mlx5e_priv *priv)
}
 }
 
-static void mlx5e_activate_priv_channels(struct mlx5e_priv *priv)
+void mlx5e_activate_priv_channels(struct mlx5e_priv *priv)
 {
int num_txqs = priv->channels.num * priv->channels.params.num_tc;
struct net_device *netdev = priv->netdev;
@@ -2567,7 +2567,7 @@ static void mlx5e_activate_priv_channels(struct 
mlx5e_priv *priv)
mlx5e_redirect_rqts_to_channels(priv, &priv->channels);
 }
 
-static void mlx5e_deactivate_priv_channels(struct mlx5e_priv *priv)
+void mlx5e_deactivate_priv_channels(struct mlx5e_priv *priv)
 {
mlx5e_redirect_rqts_to_drop(priv);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c 
b/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c
index d7d705c840ae..e188d067bc97 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c
@@ -34,6 +34,18 @@
 #include "en.h"
 #include "ipoib.h"
 
+static int mlx5i_open(struct net_device *netdev);
+static int mlx5i_close(struct net_device *netdev);
+static int  mlx5i_dev_init(struct net_device *dev);
+static void mlx5i_dev_cleanup(struct net_device *dev);
+
+static const struct net_device_ops mlx5i_netdev_ops = {
+   .ndo_open= mlx5i_open,
+   .ndo_stop= mlx5i_close,
+   .ndo_init= mlx5i_dev_init,
+   .ndo_uninit  = mlx5i_dev_cleanup,
+};
+
 /* IPoIB mlx5 netdev profile */
 
 /* Called directly after IPoIB netdevice was created to initialize SW structs 
*/
@@ -52,7 +64,17 @@ static void mlx5i_init(struct mlx5_core_dev *mdev,
mlx5e_build_nic_params(mdev, &priv->channels.params, 
profile->max_nch(mdev));
 
mutex_init(&priv->state_lock);
-   /* TODO : init netdev features here */
+
+   netdev->hw_features|= NETIF_F_SG;
+   netdev->hw_features|= NETIF_F_IP_CSUM;
+   netdev->hw_features|= NETIF_F_IPV6_CSUM;
+   netdev->hw_features|= NETIF_F_GRO;
+   netdev->hw_features|= NETIF_F_TSO;
+   netdev->hw_features|= NETIF_F_TSO6;
+   netdev->hw_features|= NETIF_F_RXCSUM;
+   netdev->hw_features|= NETIF_F_RXHASH;
+
+   netdev->netdev_ops = &mlx5i_netdev_ops;
 }
 
 /* Called directly before IPoIB netdevice is destroyed to cleanup SW structs */
@@ -181,6 +203,72 @@ static const struct mlx5e_profile mlx5i_nic_profile = {
.max_tc= MLX5I_MAX_NUM_TC,
 };
 
+/* mlx5i netdev NDos */
+
+static int mlx5i_dev_init(struct net_device *dev)
+{
+   struct mlx5e_priv*priv   = mlx5i_epriv(dev);
+   struct mlx5i_priv*ipriv  = priv->ppriv;
+
+   /* Set dev address using underlay QP */
+   dev->dev_addr[1] = (ipriv->qp.qpn >> 16) & 0xff;
+   dev->dev_addr[2] = (ipriv->qp.qpn >>  8) & 0xff;
+   dev->dev_addr[3] = (ipriv->qp.qpn) & 0xff;
+
+   return 0;
+}
+
+static void mlx5i_dev_cleanup(struct net_device *dev)
+{
+   /* TODO: detach underlay qp from flow-steering by reset it */
+}
+
+static int mlx5i_open(struct net_device *netdev)
+{
+   struct mlx5e_priv *priv = mlx5i_epriv(netdev);
+   int err;
+
+   mutex_lock(&priv->state_lock);
+
+   set_bit(MLX5E_STATE_OPENED, &priv->state);
+
+   err = mlx5e_open_channels(priv, &priv->channels);
+   if (err)
+   goto err_clear_state_opened_flag;
+
+   mlx5e_refresh_tirs(priv, false);
+   mlx5e_activate_pr

[PATCH net-next 07/16] net/mlx5e: IPoIB, RSS flow steering tables

Like the mlx5e ethernet mode, in IPoIB mode we need to create RX steering
tables, but IPoIB does not require MAC and VLAN steering tables, so the
only tables we create here are:
1. TTC Table (Traffic Type Classifier table for RSS steering)
2. ARFS Table (for accelerated RFS support)

Creation of those tables is identical to mlx5e ethernet mode, hence the
use of mlx5e_create_ttc_table and mlx5e_arfs_create_tables.

Signed-off-by: Saeed Mahameed 
Reviewed-by: Erez Shitrit 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h|  4 +++
 drivers/net/ethernet/mellanox/mlx5/core/en_fs.c |  7 ++--
 drivers/net/ethernet/mellanox/mlx5/core/ipoib.c | 46 +
 3 files changed, 54 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index e5518536d56f..c813eab5d764 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -999,6 +999,7 @@ int mlx5e_attr_get(struct net_device *dev, struct 
switchdev_attr *attr);
 void mlx5e_handle_rx_cqe_rep(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
 void mlx5e_update_hw_rep_counters(struct mlx5e_priv *priv);
 
+/* common netdev helpers */
 int mlx5e_create_indirect_rqt(struct mlx5e_priv *priv);
 
 int mlx5e_create_indirect_tirs(struct mlx5e_priv *priv);
@@ -1010,6 +1011,9 @@ int mlx5e_create_direct_tirs(struct mlx5e_priv *priv);
 void mlx5e_destroy_direct_tirs(struct mlx5e_priv *priv);
 void mlx5e_destroy_rqt(struct mlx5e_priv *priv, struct mlx5e_rqt *rqt);
 
+int mlx5e_create_ttc_table(struct mlx5e_priv *priv, u32 underlay_qpn);
+void mlx5e_destroy_ttc_table(struct mlx5e_priv *priv);
+
 int mlx5e_create_tises(struct mlx5e_priv *priv);
 void mlx5e_cleanup_nic_tx(struct mlx5e_priv *priv);
 int mlx5e_close(struct net_device *netdev);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
index 729904c43801..576d6787b484 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
@@ -792,7 +792,7 @@ static int mlx5e_create_ttc_table_groups(struct 
mlx5e_ttc_table *ttc)
return err;
 }
 
-static void mlx5e_destroy_ttc_table(struct mlx5e_priv *priv)
+void mlx5e_destroy_ttc_table(struct mlx5e_priv *priv)
 {
struct mlx5e_ttc_table *ttc = &priv->fs.ttc;
 
@@ -800,7 +800,7 @@ static void mlx5e_destroy_ttc_table(struct mlx5e_priv *priv)
mlx5e_destroy_flow_table(&ttc->ft);
 }
 
-static int mlx5e_create_ttc_table(struct mlx5e_priv *priv)
+int mlx5e_create_ttc_table(struct mlx5e_priv *priv, u32 underlay_qpn)
 {
struct mlx5e_ttc_table *ttc = &priv->fs.ttc;
struct mlx5_flow_table_attr ft_attr = {};
@@ -810,6 +810,7 @@ static int mlx5e_create_ttc_table(struct mlx5e_priv *priv)
ft_attr.max_fte = MLX5E_TTC_TABLE_SIZE;
ft_attr.level = MLX5E_TTC_FT_LEVEL;
ft_attr.prio = MLX5E_NIC_PRIO;
+   ft_attr.underlay_qpn = underlay_qpn;
 
ft->t = mlx5_create_flow_table(priv->fs.ns, &ft_attr);
if (IS_ERR(ft->t)) {
@@ -1146,7 +1147,7 @@ int mlx5e_create_flow_steering(struct mlx5e_priv *priv)
priv->netdev->hw_features &= ~NETIF_F_NTUPLE;
}
 
-   err = mlx5e_create_ttc_table(priv);
+   err = mlx5e_create_ttc_table(priv, 0);
if (err) {
netdev_err(priv->netdev, "Failed to create ttc table, err=%d\n",
   err);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c 
b/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c
index f0318920844e..e16e1c7b246e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c
@@ -72,6 +72,45 @@ static void mlx5i_cleanup_tx(struct mlx5e_priv *priv)
 {
 }
 
+static int mlx5i_create_flow_steering(struct mlx5e_priv *priv)
+{
+   struct mlx5i_priv *ipriv = priv->ppriv;
+   int err;
+
+   priv->fs.ns = mlx5_get_flow_namespace(priv->mdev,
+  MLX5_FLOW_NAMESPACE_KERNEL);
+
+   if (!priv->fs.ns)
+   return -EINVAL;
+
+   err = mlx5e_arfs_create_tables(priv);
+   if (err) {
+   netdev_err(priv->netdev, "Failed to create arfs tables, err=%d\n",
+  err);
+   priv->netdev->hw_features &= ~NETIF_F_NTUPLE;
+   }
+
+   err = mlx5e_create_ttc_table(priv, ipriv->qp.qpn);
+   if (err) {
+   netdev_err(priv->netdev, "Failed to create ttc table, err=%d\n",
+  err);
+   goto err_destroy_arfs_tables;
+   }
+
+   return 0;
+
+err_destroy_arfs_tables:
+   mlx5e_arfs_destroy_tables(priv);
+
+   return err;
+}
+
+static void mlx5i_destroy_flow_steering(struct mlx5e_priv *priv)
+{
+   mlx5e_destroy_ttc_table(priv);
+   mlx5e_arfs_destroy_tables(priv);
+}
+
 static int mlx5i_init_rx(struct mlx5e_priv *priv)
 {
int e

[PATCH net-next 04/16] net/mlx5e: More generic netdev management API

In preparation for mlx5e RDMA net_device support, here we generalize
mlx5e_attach/detach in a way that those functions will be agnostic
to link type.  For that we move ethernet specific NIC net device logic out
of those functions into {nic,rep}_{enable/disable} mlx5e NIC and
representor profiles callbacks.

Also some of the logic was moved only to NIC profile since it is not right
to have this logic for representor net device (e.g. set port MTU).

Signed-off-by: Saeed Mahameed 
Reviewed-by: Erez Shitrit 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  15 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 160 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c  |  12 +-
 3 files changed, 96 insertions(+), 91 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index b7feecfbb5a5..ced31906b8fd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -999,12 +999,6 @@ void mlx5e_cleanup_nic_tx(struct mlx5e_priv *priv);
 int mlx5e_close(struct net_device *netdev);
 int mlx5e_open(struct net_device *netdev);
 void mlx5e_update_stats_work(struct work_struct *work);
-struct net_device *mlx5e_create_netdev(struct mlx5_core_dev *mdev,
-  const struct mlx5e_profile *profile,
-  void *ppriv);
-void mlx5e_destroy_netdev(struct mlx5_core_dev *mdev, struct mlx5e_priv *priv);
-int mlx5e_attach_netdev(struct mlx5_core_dev *mdev, struct net_device *netdev);
-void mlx5e_detach_netdev(struct mlx5_core_dev *mdev, struct net_device 
*netdev);
 u32 mlx5e_choose_lro_timeout(struct mlx5_core_dev *mdev, u32 wanted_timeout);
 
 int mlx5e_get_offload_stats(int attr_id, const struct net_device *dev,
@@ -1013,4 +1007,13 @@ bool mlx5e_has_offload_stats(const struct net_device 
*dev, int attr_id);
 
 bool mlx5e_is_uplink_rep(struct mlx5e_priv *priv);
 bool mlx5e_is_vf_vport_rep(struct mlx5e_priv *priv);
+
+/* mlx5e generic netdev management API */
+struct net_device*
+mlx5e_create_netdev(struct mlx5_core_dev *mdev, const struct mlx5e_profile 
*profile,
+   void *ppriv);
+int mlx5e_attach_netdev(struct mlx5e_priv *priv);
+void mlx5e_detach_netdev(struct mlx5e_priv *priv);
+void mlx5e_destroy_netdev(struct mlx5e_priv *priv);
+
 #endif /* __MLX5_EN_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 8b7b7e604ea0..cdc34ba354c8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4121,12 +4121,57 @@ static int mlx5e_init_nic_tx(struct mlx5e_priv *priv)
return 0;
 }
 
+static void mlx5e_register_vport_rep(struct mlx5_core_dev *mdev)
+{
+   struct mlx5_eswitch *esw = mdev->priv.eswitch;
+   int total_vfs = MLX5_TOTAL_VPORTS(mdev);
+   int vport;
+   u8 mac[ETH_ALEN];
+
+   if (!MLX5_CAP_GEN(mdev, vport_group_manager))
+   return;
+
+   mlx5_query_nic_vport_mac_address(mdev, 0, mac);
+
+   for (vport = 1; vport < total_vfs; vport++) {
+   struct mlx5_eswitch_rep rep;
+
+   rep.load = mlx5e_vport_rep_load;
+   rep.unload = mlx5e_vport_rep_unload;
+   rep.vport = vport;
+   ether_addr_copy(rep.hw_id, mac);
+   mlx5_eswitch_register_vport_rep(esw, vport, &rep);
+   }
+}
+
+static void mlx5e_unregister_vport_rep(struct mlx5_core_dev *mdev)
+{
+   struct mlx5_eswitch *esw = mdev->priv.eswitch;
+   int total_vfs = MLX5_TOTAL_VPORTS(mdev);
+   int vport;
+
+   if (!MLX5_CAP_GEN(mdev, vport_group_manager))
+   return;
+
+   for (vport = 1; vport < total_vfs; vport++)
+   mlx5_eswitch_unregister_vport_rep(esw, vport);
+}
+
 static void mlx5e_nic_enable(struct mlx5e_priv *priv)
 {
struct net_device *netdev = priv->netdev;
struct mlx5_core_dev *mdev = priv->mdev;
struct mlx5_eswitch *esw = mdev->priv.eswitch;
struct mlx5_eswitch_rep rep;
+   u16 max_mtu;
+
+   mlx5e_init_l2_addr(priv);
+
+   /* MTU range: 68 - hw-specific max */
+   netdev->min_mtu = ETH_MIN_MTU;
+   mlx5_query_port_max_mtu(priv->mdev, &max_mtu, 1);
+   netdev->max_mtu = MLX5E_HW2SW_MTU(max_mtu);
+   mlx5e_set_dev_port_mtu(priv);
 
mlx5_lag_add(mdev, netdev);
 
@@ -4141,6 +4186,8 @@ static void mlx5e_nic_enable(struct mlx5e_priv *priv)
mlx5_eswitch_register_vport_rep(esw, 0, &rep);
}
 
+   mlx5e_register_vport_rep(mdev);
+
if (netdev->reg_state != NETREG_REGISTERED)
return;
 
@@ -4152,6 +4199,12 @@ static void mlx5e_nic_enable(struct mlx5e_priv *priv)
}
 
queue_work(priv->wq, &priv->set_rx_mode_work);
+
+   rtnl_lock();
+   if (netif_running(netdev))
+   mlx5e_open(netdev);
+   net

[PATCH net-next 02/16] net/mlx5: Refactor create flow table method to accept underlay QP

From: Erez Shitrit 

IB flow tables need the underlay qp to perform flow steering.
Here we change the API of the flow tables creation to accept the
underlay QP number as a parameter in order to support IB (IPoIB) flow
steering.

Signed-off-by: Erez Shitrit 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c  | 10 ++-
 drivers/net/ethernet/mellanox/mlx5/core/en_fs.c| 25 +--
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c  |  5 +-
 .../ethernet/mellanox/mlx5/core/eswitch_offloads.c | 16 +++--
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c   |  8 +++
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c  | 84 +-
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.h  |  1 +
 include/linux/mlx5/fs.h| 14 ++--
 8 files changed, 113 insertions(+), 50 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c
index c4e9cc79f5c7..c8a005326e30 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c
@@ -321,10 +321,16 @@ static int arfs_create_table(struct mlx5e_priv *priv,
 {
struct mlx5e_arfs_tables *arfs = &priv->fs.arfs;
struct mlx5e_flow_table *ft = &arfs->arfs_tables[type].ft;
+   struct mlx5_flow_table_attr ft_attr = {};
int err;
 
-   ft->t = mlx5_create_flow_table(priv->fs.ns, MLX5E_NIC_PRIO,
-  MLX5E_ARFS_TABLE_SIZE, 
MLX5E_ARFS_FT_LEVEL, 0);
+   ft->num_groups = 0;
+
+   ft_attr.max_fte = MLX5E_ARFS_TABLE_SIZE;
+   ft_attr.level = MLX5E_ARFS_FT_LEVEL;
+   ft_attr.prio = MLX5E_NIC_PRIO;
+
+   ft->t = mlx5_create_flow_table(priv->fs.ns, &ft_attr);
if (IS_ERR(ft->t)) {
err = PTR_ERR(ft->t);
ft->t = NULL;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
index 5376d69a6b1a..729904c43801 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
@@ -803,11 +803,15 @@ static void mlx5e_destroy_ttc_table(struct mlx5e_priv 
*priv)
 static int mlx5e_create_ttc_table(struct mlx5e_priv *priv)
 {
struct mlx5e_ttc_table *ttc = &priv->fs.ttc;
+   struct mlx5_flow_table_attr ft_attr = {};
struct mlx5e_flow_table *ft = &ttc->ft;
int err;
 
-   ft->t = mlx5_create_flow_table(priv->fs.ns, MLX5E_NIC_PRIO,
-  MLX5E_TTC_TABLE_SIZE, 
MLX5E_TTC_FT_LEVEL, 0);
+   ft_attr.max_fte = MLX5E_TTC_TABLE_SIZE;
+   ft_attr.level = MLX5E_TTC_FT_LEVEL;
+   ft_attr.prio = MLX5E_NIC_PRIO;
+
+   ft->t = mlx5_create_flow_table(priv->fs.ns, &ft_attr);
if (IS_ERR(ft->t)) {
err = PTR_ERR(ft->t);
ft->t = NULL;
@@ -973,12 +977,16 @@ static int mlx5e_create_l2_table(struct mlx5e_priv *priv)
 {
struct mlx5e_l2_table *l2_table = &priv->fs.l2;
struct mlx5e_flow_table *ft = &l2_table->ft;
+   struct mlx5_flow_table_attr ft_attr = {};
int err;
 
ft->num_groups = 0;
-   ft->t = mlx5_create_flow_table(priv->fs.ns, MLX5E_NIC_PRIO,
-  MLX5E_L2_TABLE_SIZE, MLX5E_L2_FT_LEVEL, 
0);
 
+   ft_attr.max_fte = MLX5E_L2_TABLE_SIZE;
+   ft_attr.level = MLX5E_L2_FT_LEVEL;
+   ft_attr.prio = MLX5E_NIC_PRIO;
+
+   ft->t = mlx5_create_flow_table(priv->fs.ns, &ft_attr);
if (IS_ERR(ft->t)) {
err = PTR_ERR(ft->t);
ft->t = NULL;
@@ -1076,11 +1084,16 @@ static int mlx5e_create_vlan_table_groups(struct 
mlx5e_flow_table *ft)
 static int mlx5e_create_vlan_table(struct mlx5e_priv *priv)
 {
struct mlx5e_flow_table *ft = &priv->fs.vlan.ft;
+   struct mlx5_flow_table_attr ft_attr = {};
int err;
 
ft->num_groups = 0;
-   ft->t = mlx5_create_flow_table(priv->fs.ns, MLX5E_NIC_PRIO,
-  MLX5E_VLAN_TABLE_SIZE, 
MLX5E_VLAN_FT_LEVEL, 0);
+
+   ft_attr.max_fte = MLX5E_VLAN_TABLE_SIZE;
+   ft_attr.level = MLX5E_VLAN_FT_LEVEL;
+   ft_attr.prio = MLX5E_NIC_PRIO;
+
+   ft->t = mlx5_create_flow_table(priv->fs.ns, &ft_attr);
 
if (IS_ERR(ft->t)) {
err = PTR_ERR(ft->t);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index fcd5bc7e31db..b3281d1118b3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -337,6 +337,7 @@ esw_fdb_set_vport_promisc_rule(struct mlx5_eswitch *esw, 
u32 vport)
 static int esw_create_legacy_fdb_table(struct mlx5_eswitch *esw, int nvports)
 {
int inlen = MLX5_ST_SZ_BYTES(create_flow_group_in);
+   struct mlx5_flow_table_attr ft_attr = {};
struct mlx5_core_dev *dev = esw->dev;
struct mlx5_flow_nam

[PATCH net-next 00/16] Mellanox, mlx5 RDMA net device support

Hi Dave and Doug.

This series provides the lower level mlx5 support of RDMA netdevice
creation API [1] suggested and introduced by Intel's HFI OPA VNIC
netdevice driver [2], to enable IPoIB mlx5 RDMA netdevice creation.

mlx5 IPoIB RDMA netdev will serve as an acceleration netdevice for the current
IPoIB ULP generic netdevice, providing:
- mlx5 RSS support.
- mlx5 HW RX,TX offloads (checksum, TSO, LRO, etc ..).
- Full mlx5 HW features transparent to the ULP itself.

The idea here is to reuse and benefit from the already implemented mlx5e
netdevice management and channels API for both Ethernet and RDMA netdevices.
Since both IPoIB and Ethernet netdevices share the same common mlx5 HW
resources (with some small exceptions) and share most of the control/data
path logic, it is more natural to have them share the same code.

The differences between IPoIB and Ethernet netdevices can be summarized as:

Steering:
In mlx5, IPoIB traffic is sent and received from a special underlay QP, while
in Ethernet the traffic is handled by vports, and vport steering is managed by
the e-switch or FW.

For IPoIB traffic to get steered correctly, the only thing we need to do is to
create RSS HW contexts for RX and TX HW contexts for TX (similar to mlx5e) with
the underlay QP attached to them (the underlay QP will be 0 in case of
Ethernet).

RX,TX:
Since IPoIB traffic is different, slightly modified RX and TX handlers are
required; still, we do some code reuse in the data path via common helper
functions.

All of the other generic netdevice and mlx5 aspects will be shared between
mlx5 Ethernet and IPoIB netdevices, e.g.
- Channels creation and handling (RQs, SQs, CQs, NAPI, interrupt
  moderation, etc.)
- Offloads: checksum, GRO, LRO, TSO, and more.
- netdevice logic and non-Ethernet-specific ndos (open/close, etc.)

In order to achieve what we want:

In patches 1 to 3, Erez added support for the underlay QP in mlx5_ifc,
refactored the mlx5 steering code to accept the underlay QP as a parameter for
creating steering objects, and enabled flow steering for the IB link.

Then we are going to use the mlx5e netdevice profile, which is already used to
separate NIC and VF representor netdevices, to create a new type of IPoIB
netdevice profile.

For that, one small refactoring is required to make the mlx5e netdevice profile
management more generic and agnostic to link type, which is done in patch #4.

In patch #5, we introduce ipoib.c to host all of the mlx5 IPoIB (mlx5i) specific
logic and a skeleton for the IPoIB mlx5 netdevice profile, which we will start
filling in the next patches, using the already existing mlx5e APIs.
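
As a rough illustration of the profile idea (a userspace sketch only; the
*_model types, the two callbacks and the l2_addr_ready field are simplified,
hypothetical stand-ins and not the actual mlx5e_profile layout):

```c
#include <stddef.h>

struct priv_model;

/* One set of callbacks per netdevice flavour. */
struct profile_model {
	const char *name;
	void (*enable)(struct priv_model *priv);
	void (*disable)(struct priv_model *priv);
};

struct priv_model {
	const struct profile_model *profile;
	int l2_addr_ready;	/* Ethernet-only state, set by the NIC profile */
	int attached;
};

static void nic_enable(struct priv_model *priv)
{
	priv->l2_addr_ready = 1;
}

static void nic_disable(struct priv_model *priv)
{
	priv->l2_addr_ready = 0;
}

static const struct profile_model nic_profile = {
	"nic", nic_enable, nic_disable
};
static const struct profile_model ipoib_profile = {
	"ipoib", NULL, NULL
};

/* Link-type agnostic: anything specific lives behind the callbacks. */
static void attach_netdev(struct priv_model *priv)
{
	priv->attached = 1;
	if (priv->profile->enable)
		priv->profile->enable(priv);
}

static void detach_netdev(struct priv_model *priv)
{
	if (priv->profile->disable)
		priv->profile->disable(priv);
	priv->attached = 0;
}
```

The attach/detach helpers never look at the link type; an IPoIB profile
simply leaves out the Ethernet-only steps.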

Patches #6 and #7 implement the init/cleanup RX mlx5i netdev profile handlers
to create mlx5 RSS resources, same as mlx5e but without VLAN and L2 steering
tables.

Patch #8 implements the init/cleanup TX mlx5i netdev profile handlers, to
create TX resources same as mlx5e but with support for one TC (tc = 0).

Patch #9 implements the mlx5i open/close ndos, where we reuse the mlx5e
channels API to start/stop TX/RX channels.

Patch #10 creates the underlay QP and attaches it to the mlx5i RSS and TX HW
contexts.

Patches #11 and #12 break down the mlx5e xmit flow into smaller helper
functions and implement the mlx5i IPoIB xmit routine.

Patches #13 and #14 provide an RX handler per netdevice profile. Before this
series we already did this, in a non-clean way, to separate the NIC netdev and
VF representor RX handlers; in patch 13 we make the RX handler generic and
bound to a profile, and in patch 14 we implement the IPoIB RX handlers.

Patch #15 is a small cleanup to avoid e-switch with the IPoIB netdev.

In order to enable mlx5 IPoIB, a merge between the IPoIB RDMA netdev offload
support [3] (which was already submitted to the rdma mailing list) and this
series is required, plus an extra small patch [4] which connects both sides
and actually enables the offload.

Once both patch sets are merged into linux, we will have to submit the extra
small patch [4] to enable the feature.

Thanks,
Saeed.

[1] https://patchwork.kernel.org/patch/9676637/

[2] https://lwn.net/Articles/715453/
https://patchwork.kernel.org/patch/9587815/

[3] https://patchwork.kernel.org/patch/9672069/
[4]
https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux.git/commit/?id=0141db6a686e32294dee015b7d07706162ba48d8

Erez Shitrit (4):
net/mlx5: Add IPoIB enhanced offloads bits to mlx5_ifc
net/mlx5: Refactor create flow table method to accept underlay QP
net/mlx5: Enable flow-steering for IB link
hw/mlx5: Add New bit to check over QP creation

Saeed Mahameed (12):
net/mlx5e: More generic netdev management API
net/mlx5e: IPoIB, Add netdevice profile skeleton
net/mlx5e: IPoIB, RX steering RSS RQTs and TIRs
net/mlx5e: IPoIB, RSS flow steering tables
net/mlx5e: IPoIB, TX TIS creation
net/mlx5e: IPoIB, Basic netdev ndos open/close
net/mlx5e: IPoIB, Underlay QP
net/mlx5e: Xmit flow break down
net/

[PATCH net-next 03/16] net/mlx5: Enable flow-steering for IB link

From: Erez Shitrit 

Query the relevant capabilities if the device supports
ipoib_enhanced_offloads, and init the flow steering table accordingly.

Signed-off-by: Erez Shitrit 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 11 ---
 drivers/net/ethernet/mellanox/mlx5/core/fw.c  |  3 ++-
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index 55182d0b06e8..b8a176503d38 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -1905,9 +1905,6 @@ void mlx5_cleanup_fs(struct mlx5_core_dev *dev)
 {
struct mlx5_flow_steering *steering = dev->priv.steering;
 
-   if (MLX5_CAP_GEN(dev, port_type) != MLX5_CAP_PORT_TYPE_ETH)
-   return;
-
cleanup_root_ns(steering->root_ns);
cleanup_root_ns(steering->esw_egress_root_ns);
cleanup_root_ns(steering->esw_ingress_root_ns);
@@ -2010,9 +2007,6 @@ int mlx5_init_fs(struct mlx5_core_dev *dev)
struct mlx5_flow_steering *steering;
int err = 0;
 
-   if (MLX5_CAP_GEN(dev, port_type) != MLX5_CAP_PORT_TYPE_ETH)
-   return 0;
-
err = mlx5_init_fc_stats(dev);
if (err)
return err;
@@ -2023,7 +2017,10 @@ int mlx5_init_fs(struct mlx5_core_dev *dev)
steering->dev = dev;
dev->priv.steering = steering;
 
-   if (MLX5_CAP_GEN(dev, nic_flow_table) &&
+   if ((((MLX5_CAP_GEN(dev, port_type) == MLX5_CAP_PORT_TYPE_ETH) &&
+ (MLX5_CAP_GEN(dev, nic_flow_table))) ||
+((MLX5_CAP_GEN(dev, port_type) == MLX5_CAP_PORT_TYPE_IB) &&
+ MLX5_CAP_GEN(dev, ipoib_enhanced_offloads))) &&
MLX5_CAP_FLOWTABLE_NIC_RX(dev, ft_support)) {
err = init_root_ns(steering);
if (err)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fw.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fw.c
index d0bbefa08af7..1bc14d0fded8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fw.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fw.c
@@ -137,7 +137,8 @@ int mlx5_query_hca_caps(struct mlx5_core_dev *dev)
return err;
}
 
-   if (MLX5_CAP_GEN(dev, nic_flow_table)) {
+   if (MLX5_CAP_GEN(dev, nic_flow_table) ||
+   MLX5_CAP_GEN(dev, ipoib_enhanced_offloads)) {
err = mlx5_core_get_caps(dev, MLX5_CAP_FLOW_TABLE);
if (err)
return err;
-- 
2.11.0
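
For readers following the fs_core.c hunk, the new condition in
mlx5_init_fs() can be read as a small predicate. The sketch below is a
standalone illustration, not driver code: struct fs_caps, enum link_type
and wants_nic_rx_root_ns() are hypothetical names standing in for the
MLX5_CAP_GEN() and MLX5_CAP_FLOWTABLE_NIC_RX() lookups.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the capability bits read in the hunk above. */
enum link_type { LINK_IB, LINK_ETH };

struct fs_caps {
	enum link_type port_type;
	bool nic_flow_table;
	bool ipoib_enhanced_offloads;
	bool ft_support;	/* NIC RX flow table support */
};

/* Mirrors the condition guarding init_root_ns() after this patch. */
static bool wants_nic_rx_root_ns(const struct fs_caps *c)
{
	bool eth_ok = c->port_type == LINK_ETH && c->nic_flow_table;
	bool ib_ok = c->port_type == LINK_IB && c->ipoib_enhanced_offloads;

	return (eth_ok || ib_ok) && c->ft_support;
}

/* Small self-check covering the interesting combinations. */
static void fs_caps_demo(void)
{
	struct fs_caps eth = { LINK_ETH, true, false, true };
	struct fs_caps ib = { LINK_IB, false, true, true };
	struct fs_caps ib_plain = { LINK_IB, true, false, true };

	assert(wants_nic_rx_root_ns(&eth));
	assert(wants_nic_rx_root_ns(&ib));
	/* nic_flow_table alone is not enough on an IB link */
	assert(!wants_nic_rx_root_ns(&ib_plain));
}
```

Splitting the test into eth_ok and ib_ok keeps the four levels of
parentheses in the original expression out of the way; the behaviour is
meant to be the same.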

[PATCH net-next 4/8] net/ncsi: Add debugging infrastructure

This creates procfs directories as NCSI debugging infrastructure.
With the patch applied, we will see the procfs directories below. Every
NCSI package and channel has one corresponding directory. Other than
presenting the NCSI topology, no real function has been implemented
through these procfs directories so far.

 /proc/ncsi/eth0
 /proc/ncsi/eth0/p0
 /proc/ncsi/eth0/p0/c0
 /proc/ncsi/eth0/p0/c1

Signed-off-by: Gavin Shan 
---
 net/ncsi/Kconfig   |  9 +
 net/ncsi/Makefile  |  1 +
 net/ncsi/internal.h| 45 
 net/ncsi/ncsi-debug.c  | 95 ++
 net/ncsi/ncsi-manage.c | 16 +
 5 files changed, 166 insertions(+)
 create mode 100644 net/ncsi/ncsi-debug.c

diff --git a/net/ncsi/Kconfig b/net/ncsi/Kconfig
index 08a8a60..b73ce24 100644
--- a/net/ncsi/Kconfig
+++ b/net/ncsi/Kconfig
@@ -10,3 +10,12 @@ config NET_NCSI
  support. Enable this only if your system connects to a network
  device via NCSI and the ethernet driver you're using supports
  the protocol explicitly.
+
+config NET_NCSI_DEBUG
+   bool "Enable NCSI debugging"
+   depends on NET_NCSI && PROC_FS
+   default n
+   ---help---
+ This enables the interfaces (e.g. procfs) for NCSI debugging purpose.
+
+ If unsure, say N.
diff --git a/net/ncsi/Makefile b/net/ncsi/Makefile
index dd12b56..2897fa0 100644
--- a/net/ncsi/Makefile
+++ b/net/ncsi/Makefile
@@ -2,3 +2,4 @@
 # Makefile for NCSI API
 #
 obj-$(CONFIG_NET_NCSI) += ncsi-cmd.o ncsi-rsp.o ncsi-aen.o ncsi-manage.o
+obj-$(CONFIG_NET_NCSI_DEBUG) += ncsi-debug.o
diff --git a/net/ncsi/internal.h b/net/ncsi/internal.h
index 1308a56..2a08168 100644
--- a/net/ncsi/internal.h
+++ b/net/ncsi/internal.h
@@ -198,6 +198,9 @@ struct ncsi_channel {
} monitor;
struct list_headnode;
struct list_headlink;
+#ifdef CONFIG_NET_NCSI_DEBUG
+   struct proc_dir_entry   *pde;   /* Procfs directory*/
+#endif
 };
 
 struct ncsi_package {
@@ -208,6 +211,9 @@ struct ncsi_package {
unsigned int channel_num; /* Number of channels */
struct list_head channels;/* List of chanels*/
struct list_head node;/* Form list of packages  */
+#ifdef CONFIG_NET_NCSI_DEBUG
+   struct proc_dir_entry *pde;   /* Procfs directory   */
+#endif
 };
 
 struct ncsi_request {
@@ -276,6 +282,9 @@ struct ncsi_dev_priv {
struct work_struct  work;/* For channel management */
struct packet_type  ptype;   /* NCSI packet Rx handler */
struct list_headnode;/* Form NCSI device list  */
+#ifdef CONFIG_NET_NCSI_DEBUG
+   struct proc_dir_entry*pde;   /* Procfs directory   */
+#endif
 };
 
 struct ncsi_cmd_arg {
@@ -337,4 +346,40 @@ int ncsi_rcv_rsp(struct sk_buff *skb, struct net_device 
*dev,
 struct packet_type *pt, struct net_device *orig_dev);
 int ncsi_aen_handler(struct ncsi_dev_priv *ndp, struct sk_buff *skb);
 
+/* Debugging functionality */
+#ifdef CONFIG_NET_NCSI_DEBUG
+int ncsi_dev_init_debug(struct ncsi_dev_priv *ndp);
+void ncsi_dev_release_debug(struct ncsi_dev_priv *ndp);
+int ncsi_package_init_debug(struct ncsi_package *np);
+void ncsi_package_release_debug(struct ncsi_package *np);
+int ncsi_channel_init_debug(struct ncsi_channel *nc);
+void ncsi_channel_release_debug(struct ncsi_channel *nc);
+#else
+static inline int ncsi_dev_init_debug(struct ncsi_dev_priv *ndp)
+{
+   return -ENOTTY;
+}
+
+static inline void ncsi_dev_release_debug(struct ncsi_dev_priv *ndp)
+{
+}
+
+static inline int ncsi_package_init_debug(struct ncsi_package *np)
+{
+   return -ENOTTY;
+}
+
+static inline void ncsi_package_release_debug(struct ncsi_package *np)
+{
+}
+
+static inline int ncsi_channel_init_debug(struct ncsi_channel *nc)
+{
+   return -ENOTTY;
+}
+
+static inline void ncsi_channel_release_debug(struct ncsi_channel *nc)
+{
+}
+#endif /* CONFIG_NET_NCSI_DEBUG */
 #endif /* __NCSI_INTERNAL_H__ */
diff --git a/net/ncsi/ncsi-debug.c b/net/ncsi/ncsi-debug.c
new file mode 100644
index 000..1909b00
--- /dev/null
+++ b/net/ncsi/ncsi-debug.c
@@ -0,0 +1,95 @@
+/*
+ * Copyright Gavin Shan, IBM Corporation 2017.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+#include "internal.h"
+#include "ncsi-pkt.h"
+
+static struct proc_dir_entry *ncsi_pde;
+
+int ncsi_dev_init_debug(struct ncsi_dev_priv *ndp)
+{
+   if (WARN_ON_ONCE(ndp->pde))
+   return 0;
+
+   if (!ncsi_pde) {
+   ncsi_pde = proc_mkdir("ncsi", NULL);
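
The diff is cut off here. As a rough illustration of the directory
layout listed in the commit message, the sketch below builds the same
paths in userspace. It is an assumption that the p<N> and c<N>
components are the package and channel ids; the truncated diff does not
show how the names are formed, and ncsi_debug_path() is a hypothetical
helper, not part of the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the procfs path for a device, package or channel.
 * Pass a negative package or channel id to stop at the level above.
 * Returns the snprintf() result, i.e. the length the full path needs.
 */
static int ncsi_debug_path(char *buf, size_t len, const char *dev,
			   int package, int channel)
{
	if (package < 0)
		return snprintf(buf, len, "/proc/ncsi/%s", dev);
	if (channel < 0)
		return snprintf(buf, len, "/proc/ncsi/%s/p%d", dev, package);
	return snprintf(buf, len, "/proc/ncsi/%s/p%d/c%d",
			dev, package, channel);
}

/* Reproduces two of the paths from the commit message. */
static void ncsi_debug_path_demo(void)
{
	char buf[64];

	ncsi_debug_path(buf, sizeof(buf), "eth0", 0, -1);
	assert(strcmp(buf, "/proc/ncsi/eth0/p0") == 0);

	ncsi_debug_path(buf, sizeof(buf), "eth0", 0, 1);
	assert(strcmp(buf, "/proc/ncsi/eth0/p0/c1") == 0);
}
```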

[PATCH net-next 01/16] net/mlx5: Add IPoIB enhanced offloads bits to mlx5_ifc

From: Erez Shitrit 

New capability bit: ipoib_enhanced_offloads, which indicates that a UD
QP can do RSS and enhanced IPoIB offloads and acceleration.

Add underlay_qpn to the TIS and flow_table objects in order to support
the SET_ROOT command, which connects IPoIB QPs to flow steering tables.

Signed-off-by: Erez Shitrit 
Signed-off-by: Saeed Mahameed 
---
 include/linux/mlx5/mlx5_ifc.h | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 1993adbd2c82..7c50bd39b297 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -872,7 +872,8 @@ struct mlx5_ifc_cmd_hca_cap_bits {
 
u8 compact_address_vector[0x1];
u8 striding_rq[0x1];
-   u8 reserved_at_202[0x2];
+   u8 reserved_at_202[0x1];
+   u8 ipoib_enhanced_offloads[0x1];
u8 ipoib_basic_offloads[0x1];
u8 reserved_at_205[0xa];
u8 drain_sigerr[0x1];
@@ -2293,7 +2294,9 @@ struct mlx5_ifc_tisc_bits {
u8 reserved_at_120[0x8];
u8 transport_domain[0x18];
 
-   u8 reserved_at_140[0x3c0];
+   u8 reserved_at_140[0x8];
+   u8 underlay_qpn[0x18];
+   u8 reserved_at_160[0x3a0];
 };
 
 enum {
@@ -8218,7 +8221,9 @@ struct mlx5_ifc_set_flow_table_root_in_bits {
u8 reserved_at_a0[0x8];
u8 table_id[0x18];
 
-   u8 reserved_at_c0[0x140];
+   u8 reserved_at_c0[0x8];
+   u8 underlay_qpn[0x18];
+   u8 reserved_at_e0[0x120];
 };
 
 enum {
-- 
2.11.0
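
Each hunk above carves a new field out of a reserved range, so the
widths of the pieces must add up to the old reserved width. The sketch
below checks that arithmetic in plain C. The widths and the
reserved_at_* offsets are copied from the hunks; the underlay_qpn
offsets (0x148 and 0xc8) are derived from them, and struct ifc_field and
fields_tile() are hypothetical helpers for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* One field of an mlx5_ifc-style layout: bit offset and width in bits. */
struct ifc_field {
	unsigned int offset;
	unsigned int width;
};

/*
 * Check that the fields tile [start, start + total) with no gap or
 * overlap. Returns 1 when they do, 0 otherwise.
 */
static int fields_tile(const struct ifc_field *f, size_t n,
		       unsigned int start, unsigned int total)
{
	unsigned int pos = start;
	size_t i;

	for (i = 0; i < n; i++) {
		if (f[i].offset != pos)
			return 0;
		pos += f[i].width;
	}
	return pos == start + total;
}

/* tisc: old reserved_at_140[0x3c0] */
static const struct ifc_field tisc_split[] = {
	{ 0x140, 0x8 },		/* reserved_at_140 */
	{ 0x148, 0x18 },	/* underlay_qpn */
	{ 0x160, 0x3a0 },	/* reserved_at_160 */
};

/* set_flow_table_root_in: old reserved_at_c0[0x140] */
static const struct ifc_field set_root_split[] = {
	{ 0xc0, 0x8 },		/* reserved_at_c0 */
	{ 0xc8, 0x18 },		/* underlay_qpn */
	{ 0xe0, 0x120 },	/* reserved_at_e0 */
};

static void ifc_layout_demo(void)
{
	assert(fields_tile(tisc_split, 3, 0x140, 0x3c0));
	assert(fields_tile(set_root_split, 3, 0xc0, 0x140));
}
```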

[PATCH net-next 05/16] net/mlx5e: IPoIB, Add netdevice profile skeleton

Create the mlx5e IPoIB netdevice profile skeleton in the new ipoib.c
file, with an empty implementation.

Downstream patches will add the full mlx5 rdma netdevice acceleration
support for IPoIB to this new file, using the mlx5e netdevice profile
and the new mlx5_channels APIs and infrastructure, the same way it is
already done for the mlx5e NIC netdevice and the switchdev mode VF
representors.

Signed-off-by: Saeed Mahameed 
Reviewed-by: Erez Shitrit 
---
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig|   7 +
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |   9 +
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   9 -
 drivers/net/ethernet/mellanox/mlx5/core/ipoib.c| 181 +
 .../ethernet/mellanox/mlx5/core/ipoib.h}   |  22 ++-
 6 files changed, 215 insertions(+), 15 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/ipoib.c
 copy drivers/{infiniband/hw/mlx5/cmd.h => 
net/ethernet/mellanox/mlx5/core/ipoib.h} (77%)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig 
b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
index 117170014e88..a84b652f9b54 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
@@ -31,3 +31,10 @@ config MLX5_CORE_EN_DCB
  This flag is depended on the kernel's DCB support.
 
  If unsure, set to Y
+
+config MLX5_CORE_IPOIB
+   bool "Mellanox Technologies ConnectX-4 IPoIB offloads support"
+   depends on MLX5_CORE_EN
+   default y
+   ---help---
+ MLX5 IPoIB offloads & acceleration support.
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile 
b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 9f43beb86250..9e644615f07a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -11,3 +11,5 @@ mlx5_core-$(CONFIG_MLX5_CORE_EN) += wq.o eswitch.o 
eswitch_offloads.o \
en_tc.o en_arfs.o en_rep.o en_fs_ethtool.o en_selftest.o
 
 mlx5_core-$(CONFIG_MLX5_CORE_EN_DCB) +=  en_dcbnl.o
+
+mlx5_core-$(CONFIG_MLX5_CORE_IPOIB) += ipoib.o
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index ced31906b8fd..02aa3cc59dc3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -153,6 +154,14 @@ static inline int mlx5_max_log_rq_size(int wq_type)
}
 }
 
+static inline int mlx5e_get_max_num_channels(struct mlx5_core_dev *mdev)
+{
+   return is_kdump_kernel() ?
+   MLX5E_MIN_NUM_CHANNELS :
+   min_t(int, mdev->priv.eq_table.num_comp_vectors,
+ MLX5E_MAX_NUM_CHANNELS);
+}
+
 struct mlx5e_tx_wqe {
struct mlx5_wqe_ctrl_seg ctrl;
struct mlx5_wqe_eth_seg  eth;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index cdc34ba354c8..14c7452a6348 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -31,7 +31,6 @@
  */
 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -1710,14 +1709,6 @@ static int mlx5e_set_tx_maxrate(struct net_device *dev, 
int index, u32 rate)
return err;
 }
 
-static inline int mlx5e_get_max_num_channels(struct mlx5_core_dev *mdev)
-{
-   return is_kdump_kernel() ?
-   MLX5E_MIN_NUM_CHANNELS :
-   min_t(int, mdev->priv.eq_table.num_comp_vectors,
- MLX5E_MAX_NUM_CHANNELS);
-}
-
 static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
  struct mlx5e_params *params,
  struct mlx5e_channel_param *cparam,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c 
b/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c
new file mode 100644
index ..2f65927a8d03
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c
@@ -0,0 +1,181 @@
+/*
+ * Copyright (c) 2017, Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of condit

[PATCH net-next 10/16] net/mlx5e: IPoIB, Underlay QP

Create the IPoIB underlay QP, which the IPoIB netdevice profile needs so
that the RSS and TX HW contexts can operate on IPoIB traffic.

Reset the underlay QP on dev_uninit ndo to stop IPoIB traffic going
through this QP when the ULP IPoIB decides to clean up.

Implement attach/detach mcast RDMA netdev callbacks for later RDMA
netdev use.

Signed-off-by: Saeed Mahameed 
Reviewed-by: Erez Shitrit 
---
 drivers/net/ethernet/mellanox/mlx5/core/ipoib.c | 126 +++-
 1 file changed, 124 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c 
b/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c
index e188d067bc97..bd56f36066b3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c
@@ -34,6 +34,8 @@
 #include "en.h"
 #include "ipoib.h"
 
+#define IB_DEFAULT_Q_KEY   0xb1b
+
 static int mlx5i_open(struct net_device *netdev);
 static int mlx5i_close(struct net_device *netdev);
 static int  mlx5i_dev_init(struct net_device *dev);
@@ -83,12 +85,89 @@ static void mlx5i_cleanup(struct mlx5e_priv *priv)
/* Do nothing .. */
 }
 
+#define MLX5_QP_ENHANCED_ULP_STATELESS_MODE 2
+
+static int mlx5i_create_underlay_qp(struct mlx5_core_dev *mdev, struct 
mlx5_core_qp *qp)
+{
+   struct mlx5_qp_context *context = NULL;
+   u32 *in = NULL;
+   void *addr_path;
+   int ret = 0;
+   int inlen;
+   void *qpc;
+
+   inlen = MLX5_ST_SZ_BYTES(create_qp_in);
+   in = mlx5_vzalloc(inlen);
+   if (!in)
+   return -ENOMEM;
+
+   qpc = MLX5_ADDR_OF(create_qp_in, in, qpc);
+   MLX5_SET(qpc, qpc, st, MLX5_QP_ST_UD);
+   MLX5_SET(qpc, qpc, pm_state, MLX5_QP_PM_MIGRATED);
+   MLX5_SET(qpc, qpc, ulp_stateless_offload_mode,
+MLX5_QP_ENHANCED_ULP_STATELESS_MODE);
+
+   addr_path = MLX5_ADDR_OF(qpc, qpc, primary_address_path);
+   MLX5_SET(ads, addr_path, port, 1);
+   MLX5_SET(ads, addr_path, grh, 1);
+
+   ret = mlx5_core_create_qp(mdev, qp, in, inlen);
+   if (ret) {
+   mlx5_core_err(mdev, "Failed creating IPoIB QP err : %d\n", ret);
+   goto out;
+   }
+
+   /* QP states */
+   context = kzalloc(sizeof(*context), GFP_KERNEL);
+   if (!context) {
+   ret = -ENOMEM;
+   goto out;
+   }
+
+   context->flags = cpu_to_be32(MLX5_QP_PM_MIGRATED << 11);
+   context->pri_path.port = 1;
+   context->qkey = cpu_to_be32(IB_DEFAULT_Q_KEY);
+
+   ret = mlx5_core_qp_modify(mdev, MLX5_CMD_OP_RST2INIT_QP, 0, context, 
qp);
+   if (ret) {
+   mlx5_core_err(mdev, "Failed to modify qp RST2INIT, err: %d\n", 
ret);
+   goto out;
+   }
+   memset(context, 0, sizeof(*context));
+
+   ret = mlx5_core_qp_modify(mdev, MLX5_CMD_OP_INIT2RTR_QP, 0, context, 
qp);
+   if (ret) {
+   mlx5_core_err(mdev, "Failed to modify qp INIT2RTR, err: %d\n", 
ret);
+   goto out;
+   }
+
+   ret = mlx5_core_qp_modify(mdev, MLX5_CMD_OP_RTR2RTS_QP, 0, context, qp);
+   if (ret) {
+   mlx5_core_err(mdev, "Failed to modify qp RTR2RTS, err: %d\n", 
ret);
+   goto out;
+   }
+
+out:
+   kfree(context);
+   kvfree(in);
+   return ret;
+}
+
+static void mlx5i_destroy_underlay_qp(struct mlx5_core_dev *mdev, struct 
mlx5_core_qp *qp)
+{
+   mlx5_core_destroy_qp(mdev, qp);
+}
+
 static int mlx5i_init_tx(struct mlx5e_priv *priv)
 {
struct mlx5i_priv *ipriv = priv->ppriv;
int err;
 
-   /* TODO: Create IPoIB underlay QP */
+   err = mlx5i_create_underlay_qp(priv->mdev, &ipriv->qp);
+   if (err) {
+   mlx5_core_warn(priv->mdev, "create underlay QP failed, %d\n", 
err);
+   return err;
+   }
 
err = mlx5e_create_tis(priv->mdev, 0 /* tc */, ipriv->qp.qpn, 
&priv->tisn[0]);
if (err) {
@@ -101,7 +180,10 @@ static int mlx5i_init_tx(struct mlx5e_priv *priv)
 
 void mlx5i_cleanup_tx(struct mlx5e_priv *priv)
 {
+   struct mlx5i_priv *ipriv = priv->ppriv;
+
mlx5e_destroy_tis(priv->mdev, priv->tisn[0]);
+   mlx5i_destroy_underlay_qp(priv->mdev, &ipriv->qp);
 }
 
 static int mlx5i_create_flow_steering(struct mlx5e_priv *priv)
@@ -220,7 +302,13 @@ static int mlx5i_dev_init(struct net_device *dev)
 
 static void mlx5i_dev_cleanup(struct net_device *dev)
 {
-   /* TODO: detach underlay qp from flow-steering by reset it */
+   struct mlx5e_priv*priv   = mlx5i_epriv(dev);
+   struct mlx5_core_dev *mdev   = priv->mdev;
+   struct mlx5i_priv*ipriv  = priv->ppriv;
+   struct mlx5_qp_context context;
+
+   /* detach qp from flow-steering by reset it */
+   mlx5_core_qp_modify(mdev, MLX5_CMD_OP_2RST_QP, 0, &context, &ipriv->qp);
 }
 
 static int mlx5i_open(struct net_device *netdev)
@@ -270,6 +358,40 @@ static int mlx5i_close(struct net_device *netdev)
 }
 
 /* IPoIB RDMA netdev callbacks */
+int mlx
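
The diff is cut off here. The core of mlx5i_create_underlay_qp() above
is a fixed walk through the QP states (RST2INIT, INIT2RTR, RTR2RTS) that
stops at the first failing command. A minimal standalone sketch of that
control flow follows; enum qp_state, qp_modify_fn and qp_bring_up() are
hypothetical names, and the callback stands in for
mlx5_core_qp_modify().

```c
#include <assert.h>
#include <stddef.h>

enum qp_state { QP_RST, QP_INIT, QP_RTR, QP_RTS };

/* Hypothetical transition callback: 0 on success, negative on error. */
typedef int (*qp_modify_fn)(enum qp_state from, enum qp_state to, void *ctx);

/*
 * Walk RST -> INIT -> RTR -> RTS, stopping at the first failure.
 * On return *state holds the last state actually reached.
 */
static int qp_bring_up(qp_modify_fn modify, void *ctx, enum qp_state *state)
{
	static const enum qp_state order[] = {
		QP_RST, QP_INIT, QP_RTR, QP_RTS
	};
	size_t i;
	int err;

	*state = QP_RST;
	for (i = 0; i + 1 < sizeof(order) / sizeof(order[0]); i++) {
		err = modify(order[i], order[i + 1], ctx);
		if (err)
			return err;
		*state = order[i + 1];
	}
	return 0;
}

/* Test double: fails the transition whose target state equals *ctx. */
static int fail_at(enum qp_state from, enum qp_state to, void *ctx)
{
	(void)from;
	return (ctx && *(enum qp_state *)ctx == to) ? -1 : 0;
}

static void qp_bring_up_demo(void)
{
	enum qp_state st;

	assert(qp_bring_up(fail_at, NULL, &st) == 0);
	assert(st == QP_RTS);
}
```

One thing the sketch makes visible: in the hunk above, a failed
transition jumps to the out label and returns without destroying the QP
that mlx5_core_create_qp() already created. Whether the caller is
expected to clean that up is not clear from the part of the patch shown
here.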

[PATCH net-next 06/16] net/mlx5e: IPoIB, RX steering RSS RQTs and TIRs

Implement IPoIB RX RSS (RQTs and TIRs) HW objects creation. All we do
here is simply reuse the mlx5e implementation to create direct and
indirect (RSS) steering HW objects.

For that we just expose the
mlx5e_{create,destroy}_{direct,indirect}_{rqt,tir} functions in en.h
and call them from ipoib.c in the init/cleanup_rx IPoIB netdevice
profile callbacks.

Signed-off-by: Saeed Mahameed 
Reviewed-by: Erez Shitrit 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  | 12 -
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 56 ---
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c  | 17 ++-
 drivers/net/ethernet/mellanox/mlx5/core/ipoib.c   | 42 +++--
 4 files changed, 83 insertions(+), 44 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 02aa3cc59dc3..e5518536d56f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -999,10 +999,17 @@ int mlx5e_attr_get(struct net_device *dev, struct 
switchdev_attr *attr);
 void mlx5e_handle_rx_cqe_rep(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
 void mlx5e_update_hw_rep_counters(struct mlx5e_priv *priv);
 
+int mlx5e_create_indirect_rqt(struct mlx5e_priv *priv);
+
+int mlx5e_create_indirect_tirs(struct mlx5e_priv *priv);
+void mlx5e_destroy_indirect_tirs(struct mlx5e_priv *priv);
+
 int mlx5e_create_direct_rqts(struct mlx5e_priv *priv);
-void mlx5e_destroy_rqt(struct mlx5e_priv *priv, struct mlx5e_rqt *rqt);
+void mlx5e_destroy_direct_rqts(struct mlx5e_priv *priv);
 int mlx5e_create_direct_tirs(struct mlx5e_priv *priv);
 void mlx5e_destroy_direct_tirs(struct mlx5e_priv *priv);
+void mlx5e_destroy_rqt(struct mlx5e_priv *priv, struct mlx5e_rqt *rqt);
+
 int mlx5e_create_tises(struct mlx5e_priv *priv);
 void mlx5e_cleanup_nic_tx(struct mlx5e_priv *priv);
 int mlx5e_close(struct net_device *netdev);
@@ -1024,5 +1031,8 @@ mlx5e_create_netdev(struct mlx5_core_dev *mdev, const 
struct mlx5e_profile *prof
 int mlx5e_attach_netdev(struct mlx5e_priv *priv);
 void mlx5e_detach_netdev(struct mlx5e_priv *priv);
 void mlx5e_destroy_netdev(struct mlx5e_priv *priv);
+void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
+   struct mlx5e_params *params,
+   u16 max_channels);
 
 #endif /* __MLX5_EN_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 14c7452a6348..08b67aa24644 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2115,11 +2115,15 @@ void mlx5e_destroy_rqt(struct mlx5e_priv *priv, struct 
mlx5e_rqt *rqt)
mlx5_core_destroy_rqt(priv->mdev, rqt->rqtn);
 }
 
-static int mlx5e_create_indirect_rqts(struct mlx5e_priv *priv)
+int mlx5e_create_indirect_rqt(struct mlx5e_priv *priv)
 {
struct mlx5e_rqt *rqt = &priv->indir_rqt;
+   int err;
 
-   return mlx5e_create_rqt(priv, MLX5E_INDIR_RQT_SIZE, rqt);
+   err = mlx5e_create_rqt(priv, MLX5E_INDIR_RQT_SIZE, rqt);
+   if (err)
+   mlx5_core_warn(priv->mdev, "create indirect rqts failed, %d\n", 
err);
+   return err;
 }
 
 int mlx5e_create_direct_rqts(struct mlx5e_priv *priv)
@@ -2138,12 +2142,21 @@ int mlx5e_create_direct_rqts(struct mlx5e_priv *priv)
return 0;
 
 err_destroy_rqts:
+   mlx5_core_warn(priv->mdev, "create direct rqts failed, %d\n", err);
for (ix--; ix >= 0; ix--)
mlx5e_destroy_rqt(priv, &priv->direct_tir[ix].rqt);
 
return err;
 }
 
+void mlx5e_destroy_direct_rqts(struct mlx5e_priv *priv)
+{
+   int i;
+
+   for (i = 0; i < priv->profile->max_nch(priv->mdev); i++)
+   mlx5e_destroy_rqt(priv, &priv->direct_tir[i].rqt);
+}
+
 static int mlx5e_rx_hash_fn(int hfunc)
 {
return (hfunc == ETH_RSS_HASH_TOP) ?
@@ -2818,7 +2831,7 @@ static void mlx5e_build_direct_tir_ctx(struct mlx5e_priv 
*priv, u32 rqtn, u32 *t
MLX5_SET(tirc, tirc, rx_hash_fn, MLX5_RX_HASH_FN_INVERTED_XOR8);
 }
 
-static int mlx5e_create_indirect_tirs(struct mlx5e_priv *priv)
+int mlx5e_create_indirect_tirs(struct mlx5e_priv *priv)
 {
struct mlx5e_tir *tir;
void *tirc;
@@ -2847,6 +2860,7 @@ static int mlx5e_create_indirect_tirs(struct mlx5e_priv 
*priv)
return 0;
 
 err_destroy_tirs:
+   mlx5_core_warn(priv->mdev, "create indirect tirs failed, %d\n", err);
for (tt--; tt >= 0; tt--)
mlx5e_destroy_tir(priv->mdev, &priv->indir_tir[tt]);
 
@@ -2885,6 +2899,7 @@ int mlx5e_create_direct_tirs(struct mlx5e_priv *priv)
return 0;
 
 err_destroy_ch_tirs:
+   mlx5_core_warn(priv->mdev, "create direct tirs failed, %d\n", err);
for (ix--; ix >= 0; ix--)
mlx5e_destroy_tir(priv->mdev, &priv->direct_tir[ix]);
 
@@ -2893,7 +2908,7 @@ int mlx5e_create_direct_tirs(struct mlx5e_priv *priv)
return err;

[PATCH net-next 12/16] net/mlx5e: IPoIB, Xmit flow

Implement mlx5e's IPoIB SKB transmit using the helper functions provided
by the mlx5e ethernet tx flow. The only difference in the code between
mlx5e_xmit and mlx5i_xmit is that IPoIB has some extra fields to fill
(UD datagram segment) in the TX descriptor (WQE) and it doesn't need
any vlan handling.

Signed-off-by: Saeed Mahameed 
Reviewed-by: Erez Shitrit 
---
 drivers/infiniband/hw/mlx5/mlx5_ib.h| 10 ---
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 87 +
 drivers/net/ethernet/mellanox/mlx5/core/ipoib.c | 10 +++
 drivers/net/ethernet/mellanox/mlx5/core/ipoib.h |  3 +
 include/linux/mlx5/qp.h | 10 +++
 5 files changed, 110 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 3cd064b5f0bf..ce8ba617d46e 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -729,16 +729,6 @@ static inline struct mlx5_ib_mw *to_mmw(struct ib_mw *ibmw)
return container_of(ibmw, struct mlx5_ib_mw, ibmw);
 }
 
-struct mlx5_ib_ah {
-   struct ib_ahibah;
-   struct mlx5_av  av;
-};
-
-static inline struct mlx5_ib_ah *to_mah(struct ib_ah *ibah)
-{
-   return container_of(ibah, struct mlx5_ib_ah, ibah);
-}
-
 int mlx5_ib_db_map_user(struct mlx5_ib_ucontext *context, unsigned long virt,
struct mlx5_db *db);
 void mlx5_ib_db_unmap_user(struct mlx5_ib_ucontext *context, struct mlx5_db 
*db);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index ba664a1126cf..dda7db503043 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -503,3 +503,90 @@ void mlx5e_free_txqsq_descs(struct mlx5e_txqsq *sq)
sq->cc += wi->num_wqebbs;
}
 }
+
+#ifdef CONFIG_MLX5_CORE_IPOIB
+
+struct mlx5_wqe_eth_pad {
+   u8 rsvd0[16];
+};
+
+struct mlx5i_tx_wqe {
+   struct mlx5_wqe_ctrl_seg ctrl;
+   struct mlx5_wqe_datagram_seg datagram;
+   struct mlx5_wqe_eth_pad  pad;
+   struct mlx5_wqe_eth_seg  eth;
+};
+
+static inline void
+mlx5i_txwqe_build_datagram(struct mlx5_av *av, u32 dqpn, u32 dqkey,
+  struct mlx5_wqe_datagram_seg *dseg)
+{
+   memcpy(&dseg->av, av, sizeof(struct mlx5_av));
+   dseg->av.dqp_dct = cpu_to_be32(dqpn | MLX5_EXTENDED_UD_AV);
+   dseg->av.key.qkey.qkey = cpu_to_be32(dqkey);
+}
+
+netdev_tx_t mlx5i_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
+ struct mlx5_av *av, u32 dqpn, u32 dqkey)
+{
+   struct mlx5_wq_cyc   *wq   = &sq->wq;
+   u16   pi   = sq->pc & wq->sz_m1;
+   struct mlx5i_tx_wqe  *wqe  = mlx5_wq_cyc_get_wqe(wq, pi);
+   struct mlx5e_tx_wqe_info *wi   = &sq->db.wqe_info[pi];
+
+   struct mlx5_wqe_ctrl_seg *cseg = &wqe->ctrl;
+   struct mlx5_wqe_datagram_seg *datagram = &wqe->datagram;
+   struct mlx5_wqe_eth_seg  *eseg = &wqe->eth;
+
+   unsigned char *skb_data = skb->data;
+   unsigned int skb_len = skb->len;
+   u8  opcode = MLX5_OPCODE_SEND;
+   unsigned int num_bytes;
+   int num_dma;
+   u16 headlen;
+   u16 ds_cnt;
+   u16 ihs;
+
+   memset(wqe, 0, sizeof(*wqe));
+
+   mlx5i_txwqe_build_datagram(av, dqpn, dqkey, datagram);
+
+   mlx5e_txwqe_build_eseg_csum(sq, skb, eseg);
+
+   if (skb_is_gso(skb)) {
+   opcode = MLX5_OPCODE_LSO;
+   ihs = mlx5e_txwqe_build_eseg_gso(sq, skb, eseg, &num_bytes);
+   } else {
+   ihs = mlx5e_calc_min_inline(sq->min_inline_mode, skb);
+   num_bytes = max_t(unsigned int, skb->len, ETH_ZLEN);
+   }
+
+   ds_cnt = sizeof(*wqe) / MLX5_SEND_WQE_DS;
+   if (ihs) {
+   memcpy(eseg->inline_hdr.start, skb_data, ihs);
+   mlx5e_tx_skb_pull_inline(&skb_data, &skb_len, ihs);
+   eseg->inline_hdr.sz = cpu_to_be16(ihs);
+   ds_cnt += DIV_ROUND_UP(ihs - sizeof(eseg->inline_hdr.start), 
MLX5_SEND_WQE_DS);
+   }
+
+   headlen = skb_len - skb->data_len;
+   num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb_data, headlen,
+ (struct mlx5_wqe_data_seg *)cseg + 
ds_cnt);
+   if (unlikely(num_dma < 0))
+   goto dma_unmap_wqe_err;
+
+   mlx5e_txwqe_complete(sq, skb, opcode, ds_cnt + num_dma,
+num_bytes, num_dma, wi, cseg);
+
+   return NETDEV_TX_OK;
+
+dma_unmap_wqe_err:
+   sq->stats.dropped++;
+   mlx5e_dma_unmap_wqe_err(sq, wi->num_dma);
+
+   dev_kfree_skb_any(skb);
+
+   return NETDEV_TX_OK;
+}
+
+#endif
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c 
b/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c
index bd56f36066b3..c468aaedf0a6 100644
--- a/drivers/net/ethernet/mellanox/m
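
The diff is cut off here. In mlx5i_sq_xmit() above, ds_cnt counts how
many data-segment-sized units the descriptor occupies before the DMA
segments start: the fixed WQE struct, plus any inlined header bytes that
do not fit in the inline space the struct already has. The sketch below
reproduces only that arithmetic. It assumes a 16-byte unit for
MLX5_SEND_WQE_DS (the value is not shown in this series), and
wqe_ds_count() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed size in bytes of one WQE data segment (MLX5_SEND_WQE_DS). */
#define SEND_WQE_DS 16

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Number of DS units taken by the fixed WQE plus inlined headers.
 *   wqe_bytes:          sizeof() of the fixed WQE struct
 *   ihs:                inline header size, 0 when nothing is inlined
 *   inline_start_bytes: inline space already inside the WQE struct
 * Assumes ihs >= inline_start_bytes whenever ihs is nonzero.
 */
static size_t wqe_ds_count(size_t wqe_bytes, size_t ihs,
			   size_t inline_start_bytes)
{
	size_t ds_cnt = wqe_bytes / SEND_WQE_DS;

	if (ihs)
		ds_cnt += DIV_ROUND_UP(ihs - inline_start_bytes, SEND_WQE_DS);
	return ds_cnt;
}

/* The sizes below (96 and 2) are made up for illustration. */
static void wqe_ds_count_demo(void)
{
	assert(wqe_ds_count(96, 0, 2) == 6);	/* nothing inlined */
	assert(wqe_ds_count(96, 18, 2) == 7);	/* 16 extra bytes: 1 DS */
	assert(wqe_ds_count(96, 19, 2) == 8);	/* 17 extra bytes: 2 DS */
}
```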

[PATCH net-next 15/16] net/mlx5e: E-switch vport manager is valid for ethernet only

Currently the driver supports only the ethernet eswitch, and we want to
protect the downstream IPoIB netdev from trying to access it on an IB
link.

Signed-off-by: Saeed Mahameed 
Reviewed-by: Erez Shitrit 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 6a164aff404c..061b20c73071 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2548,6 +2548,12 @@ static void mlx5e_build_channels_tx_maps(struct 
mlx5e_priv *priv)
}
 }
 
+static bool mlx5e_is_eswitch_vport_mngr(struct mlx5_core_dev *mdev)
+{
+   return (MLX5_CAP_GEN(mdev, vport_group_manager) &&
+   MLX5_CAP_GEN(mdev, port_type) == MLX5_CAP_PORT_TYPE_ETH);
+}
+
 void mlx5e_activate_priv_channels(struct mlx5e_priv *priv)
 {
int num_txqs = priv->channels.num * priv->channels.params.num_tc;
@@ -2561,7 +2567,7 @@ void mlx5e_activate_priv_channels(struct mlx5e_priv *priv)
mlx5e_activate_channels(&priv->channels);
netif_tx_start_all_queues(priv->netdev);
 
-   if (MLX5_CAP_GEN(priv->mdev, vport_group_manager))
+   if (mlx5e_is_eswitch_vport_mngr(priv->mdev))
mlx5e_add_sqs_fwd_rules(priv);
 
mlx5e_wait_channels_min_rx_wqes(&priv->channels);
@@ -2572,7 +2578,7 @@ void mlx5e_deactivate_priv_channels(struct mlx5e_priv 
*priv)
 {
mlx5e_redirect_rqts_to_drop(priv);
 
-   if (MLX5_CAP_GEN(priv->mdev, vport_group_manager))
+   if (mlx5e_is_eswitch_vport_mngr(priv->mdev))
mlx5e_remove_sqs_fwd_rules(priv);
 
/* FIXME: This is a W/A only for tx timeout watch dog false alarm when
-- 
2.11.0

[PATCH net-next 14/16] net/mlx5e: IPoIB, RX handler

Implement IPoIB RX SKB handler.

Signed-off-by: Saeed Mahameed 
Reviewed-by: Erez Shitrit 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 78 +
 drivers/net/ethernet/mellanox/mlx5/core/ipoib.c |  2 +
 drivers/net/ethernet/mellanox/mlx5/core/ipoib.h |  1 +
 3 files changed, 81 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 1a9532b31635..43308243f519 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -1031,3 +1031,81 @@ void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq)
mlx5e_page_release(rq, di, false);
}
 }
+
+#ifdef CONFIG_MLX5_CORE_IPOIB
+
+#define MLX5_IB_GRH_DGID_OFFSET 24
+#define MLX5_IB_GRH_BYTES   40
+#define MLX5_IPOIB_ENCAP_LEN4
+#define MLX5_GID_SIZE   16
+
+static inline void mlx5i_complete_rx_cqe(struct mlx5e_rq *rq,
+struct mlx5_cqe64 *cqe,
+u32 cqe_bcnt,
+struct sk_buff *skb)
+{
+   struct net_device *netdev = rq->netdev;
+   u8 *dgid;
+   u8 g;
+
+   g = (be32_to_cpu(cqe->flags_rqpn) >> 28) & 3;
+   dgid = skb->data + MLX5_IB_GRH_DGID_OFFSET;
+   if ((!g) || dgid[0] != 0xff)
+   skb->pkt_type = PACKET_HOST;
+   else if (memcmp(dgid, netdev->broadcast + 4, MLX5_GID_SIZE) == 0)
+   skb->pkt_type = PACKET_BROADCAST;
+   else
+   skb->pkt_type = PACKET_MULTICAST;
+
+   /* TODO: IB/ipoib: Allow mcast packets from other VFs
+* 68996a6e760e5c74654723eeb57bf65628ae87f4
+*/
+
+   skb_pull(skb, MLX5_IB_GRH_BYTES);
+
+   skb->protocol = *((__be16 *)(skb->data));
+
+   skb->ip_summed = CHECKSUM_COMPLETE;
+   skb->csum = csum_unfold((__force __sum16)cqe->check_sum);
+
+   skb_record_rx_queue(skb, rq->ix);
+
+   if (likely(netdev->features & NETIF_F_RXHASH))
+   mlx5e_skb_set_hash(cqe, skb);
+
+   skb_reset_mac_header(skb);
+   skb_pull(skb, MLX5_IPOIB_ENCAP_LEN);
+
+   skb->dev = netdev;
+
+   rq->stats.csum_complete++;
+   rq->stats.packets++;
+   rq->stats.bytes += cqe_bcnt;
+}
+
+void mlx5i_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
+{
+   struct mlx5e_rx_wqe *wqe;
+   __be16 wqe_counter_be;
+   struct sk_buff *skb;
+   u16 wqe_counter;
+   u32 cqe_bcnt;
+
+   wqe_counter_be = cqe->wqe_counter;
+   wqe_counter= be16_to_cpu(wqe_counter_be);
+   wqe= mlx5_wq_ll_get_wqe(&rq->wq, wqe_counter);
+   cqe_bcnt   = be32_to_cpu(cqe->byte_cnt);
+
+   skb = skb_from_cqe(rq, cqe, wqe_counter, cqe_bcnt);
+   if (!skb)
+   goto wq_ll_pop;
+
+   mlx5i_complete_rx_cqe(rq, cqe, cqe_bcnt, skb);
+   napi_gro_receive(rq->cq.napi, skb);
+
+wq_ll_pop:
+   mlx5_wq_ll_pop(&rq->wq, wqe_counter_be,
+  &wqe->next.next_wqe_index);
+}
+
+#endif /* CONFIG_MLX5_CORE_IPOIB */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c 
b/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c
index c468aaedf0a6..001d2953cb6d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c
@@ -282,6 +282,8 @@ static const struct mlx5e_profile mlx5i_nic_profile = {
.disable   = NULL, /* mlx5i_disable */
.update_stats  = NULL, /* mlx5i_update_stats */
.max_nch   = mlx5e_get_max_num_channels,
+   .rx_handlers.handle_rx_cqe   = mlx5i_handle_rx_cqe,
+   .rx_handlers.handle_rx_cqe_mpwqe = NULL, /* Not supported */
.max_tc= MLX5I_MAX_NUM_TC,
 };
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/ipoib.h 
b/drivers/net/ethernet/mellanox/mlx5/core/ipoib.h
index 89bca182464c..bae0a5cbc8ad 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/ipoib.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/ipoib.h
@@ -49,5 +49,6 @@ struct mlx5i_priv {
 
 netdev_tx_t mlx5i_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
  struct mlx5_av *av, u32 dqpn, u32 dqkey);
+void mlx5i_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
 
 #endif /* __MLX5E_IPOB_H__ */
-- 
2.11.0
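
The packet type decision at the top of mlx5i_complete_rx_cqe() depends
on two inputs: the GRH indication bits in the CQE and the destination
GID inside the GRH. A standalone sketch of that decision follows. The
offsets come from the defines in the patch; enum pkt_type, grh_bits()
and classify() are hypothetical names, and the caller passes the
broadcast GID directly instead of netdev->broadcast + 4.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GRH_DGID_OFFSET 24	/* MLX5_IB_GRH_DGID_OFFSET in the patch */
#define GID_SIZE        16	/* MLX5_GID_SIZE in the patch */

enum pkt_type { PKT_HOST, PKT_BROADCAST, PKT_MULTICAST };

/* Extract the 2-bit GRH indication from a host-order flags_rqpn word. */
static unsigned int grh_bits(uint32_t flags_rqpn)
{
	return (flags_rqpn >> 28) & 3;
}

/*
 * grh points at the start of the GRH, bcast_gid at the 16-byte
 * broadcast GID of the interface.
 */
static enum pkt_type classify(const uint8_t *grh, uint32_t flags_rqpn,
			      const uint8_t *bcast_gid)
{
	const uint8_t *dgid = grh + GRH_DGID_OFFSET;

	/* no GRH, or DGID is not a multicast GID: unicast to this host */
	if (!grh_bits(flags_rqpn) || dgid[0] != 0xff)
		return PKT_HOST;
	if (memcmp(dgid, bcast_gid, GID_SIZE) == 0)
		return PKT_BROADCAST;
	return PKT_MULTICAST;
}

static void classify_demo(void)
{
	uint8_t grh[40] = { 0 };
	uint8_t bcast[GID_SIZE] = { 0 };

	bcast[0] = 0xff;
	bcast[15] = 0xff;

	/* DGID is all zeroes, so not multicast */
	assert(classify(grh, 1u << 28, bcast) == PKT_HOST);

	memcpy(grh + GRH_DGID_OFFSET, bcast, GID_SIZE);
	assert(classify(grh, 1u << 28, bcast) == PKT_BROADCAST);
}
```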

[PATCH net-next 16/16] hw/mlx5: Add New bit to check over QP creation

From: Erez Shitrit 

Accept the IB_QP_CREATE_NETIF_QP flag in create_kernel_qp() by adding it
to the set of create flags that do not trigger -EINVAL.

Signed-off-by: Erez Shitrit 
Signed-off-by: Saeed Mahameed 
---
 drivers/infiniband/hw/mlx5/qp.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index ad8a2638e339..ed6320186f89 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -897,6 +897,7 @@ static int create_kernel_qp(struct mlx5_ib_dev *dev,
if (init_attr->create_flags & ~(IB_QP_CREATE_SIGNATURE_EN |
IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK |
IB_QP_CREATE_IPOIB_UD_LSO |
+   IB_QP_CREATE_NETIF_QP |
mlx5_ib_create_qp_sqpn_qp1()))
return -EINVAL;
 
-- 
2.11.0

[PATCH net-next 11/16] net/mlx5e: Xmit flow break down

Break current mlx5e xmit flow into smaller blocks (helper functions)
in order to reuse them for IPoIB SKB transmission.

Signed-off-by: Saeed Mahameed 
Reviewed-by: Erez Shitrit 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |   7 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c   | 199 +-
 3 files changed, 119 insertions(+), 89 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 23b92ec54e12..25185f8c3562 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -304,6 +304,7 @@ struct mlx5e_cq {
 } cacheline_aligned_in_smp;
 
 struct mlx5e_tx_wqe_info {
+   struct sk_buff *skb;
u32 num_bytes;
u8  num_wqebbs;
u8  num_dma;
@@ -345,7 +346,6 @@ struct mlx5e_txqsq {
 
/* write@xmit, read@completion */
struct {
-   struct sk_buff   **skb;
struct mlx5e_sq_dma   *dma_fifo;
struct mlx5e_tx_wqe_info  *wqe_info;
} db;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index eb657987e9b5..2201b7ea05f4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -1042,7 +1042,6 @@ static void mlx5e_free_txqsq_db(struct mlx5e_txqsq *sq)
 {
kfree(sq->db.wqe_info);
kfree(sq->db.dma_fifo);
-   kfree(sq->db.skb);
 }
 
 static int mlx5e_alloc_txqsq_db(struct mlx5e_txqsq *sq, int numa)
@@ -1050,13 +1049,11 @@ static int mlx5e_alloc_txqsq_db(struct mlx5e_txqsq *sq, int numa)
int wq_sz = mlx5_wq_cyc_get_size(&sq->wq);
int df_sz = wq_sz * MLX5_SEND_WQEBB_NUM_DS;
 
-   sq->db.skb = kzalloc_node(wq_sz * sizeof(*sq->db.skb),
- GFP_KERNEL, numa);
sq->db.dma_fifo = kzalloc_node(df_sz * sizeof(*sq->db.dma_fifo),
   GFP_KERNEL, numa);
sq->db.wqe_info = kzalloc_node(wq_sz * sizeof(*sq->db.wqe_info),
   GFP_KERNEL, numa);
-   if (!sq->db.skb || !sq->db.dma_fifo || !sq->db.wqe_info) {
+   if (!sq->db.dma_fifo || !sq->db.wqe_info) {
mlx5e_free_txqsq_db(sq);
return -ENOMEM;
}
@@ -1295,7 +1292,7 @@ static void mlx5e_deactivate_txqsq(struct mlx5e_txqsq *sq)
if (mlx5e_wqc_has_room_for(&sq->wq, sq->cc, sq->pc, 1)) {
struct mlx5e_tx_wqe *nop;
 
-   sq->db.skb[(sq->pc & sq->wq.sz_m1)] = NULL;
+   sq->db.wqe_info[(sq->pc & sq->wq.sz_m1)].skb = NULL;
nop = mlx5e_post_nop(&sq->wq, sq->sqn, &sq->pc);
mlx5e_notify_hw(&sq->wq, sq->pc, sq->uar_map, &nop->ctrl);
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 5bbc313e70c5..ba664a1126cf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -177,30 +177,9 @@ static inline void mlx5e_insert_vlan(void *start, struct sk_buff *skb, u16 ihs,
mlx5e_tx_skb_pull_inline(skb_data, skb_len, cpy2_sz);
 }
 
-static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb)
+static inline void
+mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb, struct mlx5_wqe_eth_seg *eseg)
 {
-   struct mlx5_wq_cyc   *wq   = &sq->wq;
-
-   u16 pi = sq->pc & wq->sz_m1;
-   struct mlx5e_tx_wqe  *wqe  = mlx5_wq_cyc_get_wqe(wq, pi);
-   struct mlx5e_tx_wqe_info *wi   = &sq->db.wqe_info[pi];
-
-   struct mlx5_wqe_ctrl_seg *cseg = &wqe->ctrl;
-   struct mlx5_wqe_eth_seg  *eseg = &wqe->eth;
-   struct mlx5_wqe_data_seg *dseg;
-
-   unsigned char *skb_data = skb->data;
-   unsigned int skb_len = skb->len;
-   u8  opcode = MLX5_OPCODE_SEND;
-   dma_addr_t dma_addr = 0;
-   unsigned int num_bytes;
-   u16 headlen;
-   u16 ds_cnt;
-   u16 ihs;
-   int i;
-
-   memset(wqe, 0, sizeof(*wqe));
-
if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM;
if (skb->encapsulation) {
@@ -212,66 +191,51 @@ static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb)
}
} else
sq->stats.csum_none++;
+}
 
-   if (skb_is_gso(skb)) {
-   eseg->mss= cpu_to_be16(skb_shinfo(skb)->gso_size);
-   opcode   = MLX5_OPCODE_LSO;
+static inline u16
+mlx5e_txwqe_build_eseg_gso(struct mlx5e_txqsq *sq, struct sk_buff *skb,
+  struct mlx5_wqe_eth_seg *eseg, unsigned int *num_bytes)
+{
+   u16 ihs;
 
-   if (skb->encapsulation) {
-   ihs = skb_inner_

[PATCH net-next 08/16] net/mlx5e: IPoIB, TX TIS creation

Modify mlx5e tis creation function to accept underlay qp number, which
will be needed by IPoIB.

Implement mlx5i (IPoIB) tx init/cleanup netdevice profile flows to
create one TIS with the IPoIB underlay qp, for IPoIB TX SQs.
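
The reshaped helper can be pictured with a small user-space sketch. Everything
here (struct demo_dev, the demo_* functions, DEMO_MAX_TC) is hypothetical and
only mimics the calling convention described above: the create function takes
the device, a traffic class, an underlay QP number and an output pointer, so
the Ethernet path can keep looping over traffic classes with underlay QPN 0
and unwinding on failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define DEMO_MAX_TC 4

/* Hypothetical device: hands out TIS numbers and can be told to fail. */
struct demo_dev {
	uint32_t next_tisn;
	int live;       /* number of TIS objects currently allocated */
	int fail_at_tc; /* traffic class whose creation fails, or -1 */
};

/* Create one TIS; the caller decides where the resulting number is stored. */
static int demo_create_tis(struct demo_dev *dev, int tc,
			   uint32_t underlay_qpn, uint32_t *tisn)
{
	(void)underlay_qpn; /* a real device would program this into the TIS */
	if (tc == dev->fail_at_tc)
		return -ENOMEM;
	*tisn = dev->next_tisn++;
	dev->live++;
	return 0;
}

static void demo_destroy_tis(struct demo_dev *dev, uint32_t tisn)
{
	(void)tisn;
	assert(dev->live > 0);
	dev->live--;
}

/* Ethernet-style user: one TIS per traffic class, no underlay QP. */
static int demo_create_tises(struct demo_dev *dev, uint32_t tisn[DEMO_MAX_TC])
{
	int tc, err;

	for (tc = 0; tc < DEMO_MAX_TC; tc++) {
		err = demo_create_tis(dev, tc, 0, &tisn[tc]);
		if (err)
			goto err_close;
	}
	return 0;

err_close:
	/* Destroy only the TISes that were created before the failure. */
	for (tc--; tc >= 0; tc--)
		demo_destroy_tis(dev, tisn[tc]);
	return err;
}
```

An IPoIB-style user would instead make a single demo_create_tis() call with
tc 0 and its own underlay QP number.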

Signed-off-by: Saeed Mahameed 
Reviewed-by: Erez Shitrit 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  4 
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 18 ++
 drivers/net/ethernet/mellanox/mlx5/core/ipoib.c   | 14 --
 3 files changed, 26 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index c813eab5d764..5345d875b695 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -1014,6 +1014,10 @@ void mlx5e_destroy_rqt(struct mlx5e_priv *priv, struct mlx5e_rqt *rqt);
 int mlx5e_create_ttc_table(struct mlx5e_priv *priv, u32 underlay_qpn);
 void mlx5e_destroy_ttc_table(struct mlx5e_priv *priv);
 
+int mlx5e_create_tis(struct mlx5_core_dev *mdev, int tc,
+u32 underlay_qpn, u32 *tisn);
+void mlx5e_destroy_tis(struct mlx5_core_dev *mdev, u32 tisn);
+
 int mlx5e_create_tises(struct mlx5e_priv *priv);
 void mlx5e_cleanup_nic_tx(struct mlx5e_priv *priv);
 int mlx5e_close(struct net_device *netdev);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 08b67aa24644..1fde4e2301a4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2759,24 +2759,25 @@ static void mlx5e_close_drop_rq(struct mlx5e_rq *drop_rq)
mlx5e_free_cq(&drop_rq->cq);
 }
 
-static int mlx5e_create_tis(struct mlx5e_priv *priv, int tc)
+int mlx5e_create_tis(struct mlx5_core_dev *mdev, int tc,
+u32 underlay_qpn, u32 *tisn)
 {
-   struct mlx5_core_dev *mdev = priv->mdev;
u32 in[MLX5_ST_SZ_DW(create_tis_in)] = {0};
void *tisc = MLX5_ADDR_OF(create_tis_in, in, ctx);
 
MLX5_SET(tisc, tisc, prio, tc << 1);
+   MLX5_SET(tisc, tisc, underlay_qpn, underlay_qpn);
MLX5_SET(tisc, tisc, transport_domain, mdev->mlx5e_res.td.tdn);
 
if (mlx5_lag_is_lacp_owner(mdev))
MLX5_SET(tisc, tisc, strict_lag_tx_port_affinity, 1);
 
-   return mlx5_core_create_tis(mdev, in, sizeof(in), &priv->tisn[tc]);
+   return mlx5_core_create_tis(mdev, in, sizeof(in), tisn);
 }
 
-static void mlx5e_destroy_tis(struct mlx5e_priv *priv, int tc)
+void mlx5e_destroy_tis(struct mlx5_core_dev *mdev, u32 tisn)
 {
-   mlx5_core_destroy_tis(priv->mdev, priv->tisn[tc]);
+   mlx5_core_destroy_tis(mdev, tisn);
 }
 
 int mlx5e_create_tises(struct mlx5e_priv *priv)
@@ -2785,7 +2786,7 @@ int mlx5e_create_tises(struct mlx5e_priv *priv)
int tc;
 
for (tc = 0; tc < priv->profile->max_tc; tc++) {
-   err = mlx5e_create_tis(priv, tc);
+   err = mlx5e_create_tis(priv->mdev, tc, 0, &priv->tisn[tc]);
if (err)
goto err_close_tises;
}
@@ -2794,7 +2795,7 @@ int mlx5e_create_tises(struct mlx5e_priv *priv)
 
 err_close_tises:
for (tc--; tc >= 0; tc--)
-   mlx5e_destroy_tis(priv, tc);
+   mlx5e_destroy_tis(priv->mdev, priv->tisn[tc]);
 
return err;
 }
@@ -2804,7 +2805,7 @@ void mlx5e_cleanup_nic_tx(struct mlx5e_priv *priv)
int tc;
 
for (tc = 0; tc < priv->profile->max_tc; tc++)
-   mlx5e_destroy_tis(priv, tc);
+   mlx5e_destroy_tis(priv->mdev, priv->tisn[tc]);
 }
 
 static void mlx5e_build_indir_tir_ctx(struct mlx5e_priv *priv,
@@ -3841,6 +3842,7 @@ void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
mlx5e_set_rq_params(mdev, params);
 
/* HW LRO */
+   /* TODO: && MLX5_CAP_ETH(mdev, lro_cap) */
if (params->rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ)
params->lro_en = true;
params->lro_timeout = mlx5e_choose_lro_timeout(mdev, MLX5E_DEFAULT_LRO_TIMEOUT);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c b/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c
index e16e1c7b246e..d7d705c840ae 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c
@@ -63,13 +63,23 @@ static void mlx5i_cleanup(struct mlx5e_priv *priv)
 
 static int mlx5i_init_tx(struct mlx5e_priv *priv)
 {
+   struct mlx5i_priv *ipriv = priv->ppriv;
+   int err;
+
/* TODO: Create IPoIB underlay QP */
-   /* TODO: create IPoIB TX HW TIS */
+
+   err = mlx5e_create_tis(priv->mdev, 0 /* tc */, ipriv->qp.qpn, &priv->tisn[0]);
+   if (err) {
+   mlx5_core_warn(priv->mdev, "create tis failed, %d\n", err);
+   return err;
+   }
+
return 0;
 }
 
-static void mlx5i_cleanup_tx(struct mlx5e_priv *priv)
+void mlx5i_cleanup_tx(struct ml

[PATCH net-next 13/16] net/mlx5e: RX handlers per netdev profile

In order to have a different RX handler per profile, fix and refactor the
current code to take the RX handler directly from the netdevice profile
rather than computing it at runtime, as was done for the switchdev
mode representor RX handler.

This will also remove the current wrong assumption in mlx5e_alloc_rq
code that mlx5e_priv->ppriv is of the type vport_rep.
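
As a rough illustration of the lookup this describes, here is a stand-alone
sketch. The types and names (struct profile, struct rq, rq_pick_handler,
ipoib_like_profile) are hypothetical stand-ins, not the mlx5e definitions;
the point is only that the handler comes straight from a per-profile table
and that a NULL entry is reported as -EINVAL rather than being inferred from
the private data type.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver types. */
struct rq;
typedef void (*handle_rx_cqe_fn)(struct rq *rq);

enum wq_type { WQ_TYPE_LINKED_LIST, WQ_TYPE_STRIDING };

struct profile {
	struct {
		handle_rx_cqe_fn handle_rx_cqe;
		handle_rx_cqe_fn handle_rx_cqe_mpwqe; /* NULL: not supported */
	} rx_handlers;
};

struct rq {
	enum wq_type wq_type;
	handle_rx_cqe_fn handle_rx_cqe;
	int handled; /* counts completions seen by the demo handler */
};

static void demo_handle_rx_cqe(struct rq *rq)
{
	rq->handled += 1;
}

/* A profile that supports only the linked-list RQ type. */
static const struct profile ipoib_like_profile = {
	.rx_handlers.handle_rx_cqe       = demo_handle_rx_cqe,
	.rx_handlers.handle_rx_cqe_mpwqe = NULL,
};

/* Pick the handler for this RQ type from the profile, or fail. */
static int rq_pick_handler(struct rq *rq, const struct profile *p)
{
	assert(rq && p);
	rq->handle_rx_cqe = (rq->wq_type == WQ_TYPE_STRIDING) ?
		p->rx_handlers.handle_rx_cqe_mpwqe :
		p->rx_handlers.handle_rx_cqe;
	return rq->handle_rx_cqe ? 0 : -EINVAL;
}
```

With this shape, adding a new profile means filling in the table; the RQ
allocation path does not need to know which kind of netdev it is serving.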

Signed-off-by: Saeed Mahameed 
Reviewed-by: Erez Shitrit 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  5 +++-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 28 ++-
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c  |  4 +++-
 3 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 25185f8c3562..0881325fba04 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -779,6 +779,10 @@ struct mlx5e_profile {
void(*disable)(struct mlx5e_priv *priv);
void(*update_stats)(struct mlx5e_priv *priv);
int (*max_nch)(struct mlx5_core_dev *mdev);
+   struct {
+   mlx5e_fp_handle_rx_cqe handle_rx_cqe;
+   mlx5e_fp_handle_rx_cqe handle_rx_cqe_mpwqe;
+   } rx_handlers;
int max_tc;
 };
 
@@ -1032,7 +1036,6 @@ int mlx5e_get_offload_stats(int attr_id, const struct net_device *dev,
 bool mlx5e_has_offload_stats(const struct net_device *dev, int attr_id);
 
 bool mlx5e_is_uplink_rep(struct mlx5e_priv *priv);
-bool mlx5e_is_vf_vport_rep(struct mlx5e_priv *priv);
 
 /* mlx5e generic netdev management API */
 struct net_device*
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 2201b7ea05f4..6a164aff404c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -585,15 +585,17 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 
switch (rq->wq_type) {
case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
-   if (mlx5e_is_vf_vport_rep(c->priv)) {
-   err = -EINVAL;
-   goto err_rq_wq_destroy;
-   }
 
-   rq->handle_rx_cqe = mlx5e_handle_rx_cqe_mpwrq;
rq->alloc_wqe = mlx5e_alloc_rx_mpwqe;
rq->dealloc_wqe = mlx5e_dealloc_rx_mpwqe;
 
+   rq->handle_rx_cqe = c->priv->profile->rx_handlers.handle_rx_cqe_mpwqe;
+   if (!rq->handle_rx_cqe) {
+   err = -EINVAL;
+   netdev_err(c->netdev, "RX handler of MPWQE RQ is not set, err %d\n", err);
+   goto err_rq_wq_destroy;
+   }
+
rq->mpwqe_stride_sz = BIT(params->mpwqe_log_stride_sz);
rq->mpwqe_num_strides = BIT(params->mpwqe_log_num_strides);
 
@@ -616,15 +618,17 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
err = -ENOMEM;
goto err_rq_wq_destroy;
}
-
-   if (mlx5e_is_vf_vport_rep(c->priv))
-   rq->handle_rx_cqe = mlx5e_handle_rx_cqe_rep;
-   else
-   rq->handle_rx_cqe = mlx5e_handle_rx_cqe;
-
rq->alloc_wqe = mlx5e_alloc_rx_wqe;
rq->dealloc_wqe = mlx5e_dealloc_rx_wqe;
 
+   rq->handle_rx_cqe = c->priv->profile->rx_handlers.handle_rx_cqe;
+   if (!rq->handle_rx_cqe) {
+   kfree(rq->dma_info);
+   err = -EINVAL;
+   netdev_err(c->netdev, "RX handler of RQ is not set, err %d\n", err);
+   goto err_rq_wq_destroy;
+   }
+
rq->buff.wqe_sz = params->lro_en  ?
params->lro_wqe_sz :
MLX5E_SW2HW_MTU(c->netdev->mtu);
@@ -4229,6 +4233,8 @@ static const struct mlx5e_profile mlx5e_nic_profile = {
.disable   = mlx5e_nic_disable,
.update_stats  = mlx5e_update_stats,
.max_nch   = mlx5e_get_max_num_channels,
+   .rx_handlers.handle_rx_cqe   = mlx5e_handle_rx_cqe,
+   .rx_handlers.handle_rx_cqe_mpwqe = mlx5e_handle_rx_cqe_mpwrq,
.max_tc= MLX5E_MAX_NUM_TC,
 };
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index da85b0ad3e92..16b683e8226d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -329,7 +329,7 @@ bool mlx5e_is_uplink_rep(struct mlx5e_priv *priv)
return false;
 }
 
-bool mlx5e_is_vf_vport_rep(struct mlx5e_priv *priv)
+static bool mlx5e_is_vf_vport_rep(struct mlx5e_priv *priv)
 {
struct mlx5_eswitch_rep *rep = (struct mlx5_eswitch_rep *)priv->ppriv;
 
@@ -538,6 +538,8 @@ static struct mlx5e_profile mlx5e_rep_profile

[PATCH rdma-next v2 12/12] IB/hfi1: VNIC SDMA support

HFI1 VNIC SDMA support enables transmission of VNIC packets over SDMA.
Map VNIC queues to SDMA engines and support halting and wakeup of the
VNIC queues.

Reviewed-by: Dennis Dalessandro 
Reviewed-by: Ira Weiny 
Signed-off-by: Niranjana Vishwanathapura 
---
 drivers/infiniband/hw/hfi1/Makefile|   2 +-
 drivers/infiniband/hw/hfi1/hfi.h   |   1 +
 drivers/infiniband/hw/hfi1/init.c  |   1 +
 drivers/infiniband/hw/hfi1/vnic.h  |  28 +++
 drivers/infiniband/hw/hfi1/vnic_main.c |  24 ++-
 drivers/infiniband/hw/hfi1/vnic_sdma.c | 323 +
 6 files changed, 376 insertions(+), 3 deletions(-)
 create mode 100644 drivers/infiniband/hw/hfi1/vnic_sdma.c

diff --git a/drivers/infiniband/hw/hfi1/Makefile b/drivers/infiniband/hw/hfi1/Makefile
index 2280538..88085f6 100644
--- a/drivers/infiniband/hw/hfi1/Makefile
+++ b/drivers/infiniband/hw/hfi1/Makefile
@@ -12,7 +12,7 @@ hfi1-y := affinity.o chip.o device.o driver.o efivar.o \
init.o intr.o mad.o mmu_rb.o pcie.o pio.o pio_copy.o platform.o \
qp.o qsfp.o rc.o ruc.o sdma.o sysfs.o trace.o \
uc.o ud.o user_exp_rcv.o user_pages.o user_sdma.o verbs.o \
-   verbs_txreq.o vnic_main.o
+   verbs_txreq.o vnic_main.o vnic_sdma.o
 hfi1-$(CONFIG_DEBUG_FS) += debugfs.o
 
 CFLAGS_trace.o = -I$(src)
diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index a12bb46..2862b14 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -834,6 +834,7 @@ struct hfi1_asic_data {
 /* Virtual NIC information */
 struct hfi1_vnic_data {
struct hfi1_ctxtdata *ctxt[HFI1_NUM_VNIC_CTXT];
+   struct kmem_cache *txreq_cache;
u8 num_vports;
struct idr vesw_idr;
u8 rmt_start;
diff --git a/drivers/infiniband/hw/hfi1/init.c b/drivers/infiniband/hw/hfi1/init.c
index de2eec4..b4c7e04 100644
--- a/drivers/infiniband/hw/hfi1/init.c
+++ b/drivers/infiniband/hw/hfi1/init.c
@@ -681,6 +681,7 @@ int hfi1_init(struct hfi1_devdata *dd, int reinit)
dd->process_pio_send = hfi1_verbs_send_pio;
dd->process_dma_send = hfi1_verbs_send_dma;
dd->pio_inline_send = pio_copy;
+   dd->process_vnic_dma_send = hfi1_vnic_send_dma;
 
if (is_ax(dd)) {
atomic_set(&dd->drop_packet, DROP_PACKET_ON);
diff --git a/drivers/infiniband/hw/hfi1/vnic.h b/drivers/infiniband/hw/hfi1/vnic.h
index 9bed40d..e2c4552 100644
--- a/drivers/infiniband/hw/hfi1/vnic.h
+++ b/drivers/infiniband/hw/hfi1/vnic.h
@@ -49,6 +49,7 @@
 
 #include 
 #include "hfi.h"
+#include "sdma.h"
 
 #define HFI1_VNIC_MAX_TXQ 16
 #define HFI1_VNIC_MAX_PAD 12
@@ -85,6 +86,26 @@
 #define HFI1_VNIC_MAX_QUEUE 16
 
 /**
+ * struct hfi1_vnic_sdma - VNIC per Tx ring SDMA information
+ * @dd - device data pointer
+ * @sde - sdma engine
+ * @vinfo - vnic info pointer
+ * @wait - iowait structure
+ * @stx - sdma tx request
+ * @state - vnic Tx ring SDMA state
+ * @q_idx - vnic Tx queue index
+ */
+struct hfi1_vnic_sdma {
+   struct hfi1_devdata *dd;
+   struct sdma_engine  *sde;
+   struct hfi1_vnic_vport_info *vinfo;
+   struct iowait wait;
+   struct sdma_txreq stx;
+   unsigned int state;
+   u8 q_idx;
+};
+
+/**
  * struct hfi1_vnic_rx_queue - HFI1 VNIC receive queue
  * @idx: queue index
  * @vinfo: pointer to vport information
@@ -111,6 +132,7 @@ struct hfi1_vnic_rx_queue {
  * @vesw_id: virtual switch id
  * @rxq: Array of receive queues
  * @stats: per queue stats
+ * @sdma: VNIC SDMA structure per TXQ
  */
 struct hfi1_vnic_vport_info {
struct hfi1_devdata *dd;
@@ -126,6 +148,7 @@ struct hfi1_vnic_vport_info {
struct hfi1_vnic_rx_queue rxq[HFI1_NUM_VNIC_CTXT];
 
struct opa_vnic_stats  stats[HFI1_VNIC_MAX_QUEUE];
+   struct hfi1_vnic_sdma  sdma[HFI1_VNIC_MAX_TXQ];
 };
 
 #define v_dbg(format, arg...) \
@@ -138,8 +161,13 @@ struct hfi1_vnic_vport_info {
 /* vnic hfi1 internal functions */
 void hfi1_vnic_setup(struct hfi1_devdata *dd);
 void hfi1_vnic_cleanup(struct hfi1_devdata *dd);
+int hfi1_vnic_txreq_init(struct hfi1_devdata *dd);
+void hfi1_vnic_txreq_deinit(struct hfi1_devdata *dd);
 
 void hfi1_vnic_bypass_rcv(struct hfi1_packet *packet);
+void hfi1_vnic_sdma_init(struct hfi1_vnic_vport_info *vinfo);
+bool hfi1_vnic_sdma_write_avail(struct hfi1_vnic_vport_info *vinfo,
+   u8 q_idx);
 
 /* vnic rdma netdev operations */
 struct net_device *hfi1_vnic_alloc_rn(struct ib_device *device,
diff --git a/drivers/infiniband/hw/hfi1/vnic_main.c b/drivers/infiniband/hw/hfi1/vnic_main.c
index 32d91b6..392f4d5 100644
--- a/drivers/infiniband/hw/hfi1/vnic_main.c
+++ b/drivers/infiniband/hw/hfi1/vnic_main.c
@@ -406,6 +406,10 @@ static void hfi1_vnic_maybe_stop_tx(struct hfi1_vnic_vport_info *vinfo,
u8 q_idx)
 {
netif_stop_subqueue(vinfo->netdev, q_idx);
+   if (!hfi1_vnic_sdma_write_avail(vinfo, q_idx))
+   return;
+

[PATCH rdma-next v2 05/12] IB/opa-vnic: VNIC Ethernet Management (EM) structure definitions

Define VNIC EM MAD structures and the associated macros. These structures
are used for information exchange between VNIC EM agent (EMA) on the host
and the Ethernet manager. These include the virtual ethernet switch (vesw)
port information, vesw port mac table, summary and error counters,
vesw port interface mac lists and the EMA trap.
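
As a small worked example of the macros, the sketch below copies the two
OPA_VNIC_DLID_SD_* accessors from this patch and pairs them with a packing
helper. The helper demo_make_dlid_sd() is hypothetical and not part of the
patch, and the example assumes the value has already been converted to host
byte order and that only bit 5 and the bits above bit 7 are of interest.

```c
#include <assert.h>
#include <stdint.h>

/* Accessors as defined in this patch. */
#define OPA_VNIC_DLID_SD_IS_SRC_MAC(dlid_sd)  (!!((dlid_sd) & 0x20))
#define OPA_VNIC_DLID_SD_GET_DLID(dlid_sd)    ((dlid_sd) >> 8)

/*
 * Hypothetical inverse of the accessors: place the DLID above bit 7 and
 * set bit 5 when the entry describes a source MAC.
 */
static uint32_t demo_make_dlid_sd(uint32_t dlid, int is_src_mac)
{
	assert(dlid < (1u << 24)); /* must fit in the upper 24 bits */
	return (dlid << 8) | (is_src_mac ? 0x20u : 0u);
}
```

For instance, packing DLID 0x1234 with the source-MAC flag set gives
0x123420, from which the two macros recover 0x1234 and 1.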

Reviewed-by: Dennis Dalessandro 
Reviewed-by: Ira Weiny 
Signed-off-by: Niranjana Vishwanathapura 
Signed-off-by: Sadanand Warrier 
---
 drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h   | 423 +
 .../infiniband/ulp/opa_vnic/opa_vnic_internal.h|  33 ++
 2 files changed, 456 insertions(+)

diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h b/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h
index 176fca9..c025cde 100644
--- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h
@@ -52,6 +52,28 @@
  * and decapsulation of Ethernet packets
  */
 
+#include 
+#include 
+
+/* EMA class version */
+#define OPA_EMA_CLASS_VERSION   0x80
+
+/*
+ * Define the Intel vendor management class for OPA
+ * ETHERNET MANAGEMENT
+ */
+#define OPA_MGMT_CLASS_INTEL_EMA0x34
+
+/* EM attribute IDs */
+#define OPA_EM_ATTR_CLASS_PORT_INFO 0x0001
+#define OPA_EM_ATTR_VESWPORT_INFO   0x0011
+#define OPA_EM_ATTR_VESWPORT_MAC_ENTRIES0x0012
+#define OPA_EM_ATTR_IFACE_UCAST_MACS0x0013
+#define OPA_EM_ATTR_IFACE_MCAST_MACS0x0014
+#define OPA_EM_ATTR_DELETE_VESW 0x0015
+#define OPA_EM_ATTR_VESWPORT_SUMMARY_COUNTERS   0x0020
+#define OPA_EM_ATTR_VESWPORT_ERROR_COUNTERS 0x0022
+
 /* VNIC configured and operational state values */
 #define OPA_VNIC_STATE_DROP_ALL0x1
 #define OPA_VNIC_STATE_FORWARDING  0x3
@@ -59,4 +81,405 @@
 #define OPA_VESW_MAX_NUM_DEF_PORT   16
 #define OPA_VNIC_MAX_NUM_PCP8
 
+#define OPA_VNIC_EMA_DATA(OPA_MGMT_MAD_SIZE - IB_MGMT_VENDOR_HDR)
+
+/* Defines for vendor specific notice(trap) attributes */
+#define OPA_INTEL_EMA_NOTICE_TYPE_INFO 0x04
+
+/* INTEL OUI */
+#define INTEL_OUI_1 0x00
+#define INTEL_OUI_2 0x06
+#define INTEL_OUI_3 0x6a
+
+/* Trap opcodes sent from VNIC */
+#define OPA_VESWPORT_TRAP_IFACE_UCAST_MAC_CHANGE 0x1
+#define OPA_VESWPORT_TRAP_IFACE_MCAST_MAC_CHANGE 0x2
+#define OPA_VESWPORT_TRAP_ETH_LINK_STATUS_CHANGE 0x3
+
+#define OPA_VNIC_DLID_SD_IS_SRC_MAC(dlid_sd)  (!!((dlid_sd) & 0x20))
+#define OPA_VNIC_DLID_SD_GET_DLID(dlid_sd)((dlid_sd) >> 8)
+
+/**
+ * struct opa_vesw_info - OPA vnic switch information
+ * @fabric_id: 10-bit fabric id
+ * @vesw_id: 12-bit virtual ethernet switch id
+ * @def_port_mask: bitmask of default ports
+ * @pkey: partition key
+ * @u_mcast_dlid: unknown multicast dlid
+ * @u_ucast_dlid: array of unknown unicast dlids
+ * @eth_mtu: MTUs for each vlan PCP
+ * @eth_mtu_non_vlan: MTU for non vlan packets
+ */
+struct opa_vesw_info {
+   __be16  fabric_id;
+   __be16  vesw_id;
+
+   u8  rsvd0[6];
+   __be16  def_port_mask;
+
+   u8  rsvd1[2];
+   __be16  pkey;
+
+   u8  rsvd2[4];
+   __be32  u_mcast_dlid;
+   __be32  u_ucast_dlid[OPA_VESW_MAX_NUM_DEF_PORT];
+
+   u8  rsvd3[44];
+   __be16  eth_mtu[OPA_VNIC_MAX_NUM_PCP];
+   __be16  eth_mtu_non_vlan;
+   u8  rsvd4[2];
+} __packed;
+
+/**
+ * struct opa_per_veswport_info - OPA vnic per port information
+ * @port_num: port number
+ * @eth_link_status: current ethernet link state
+ * @base_mac_addr: base mac address
+ * @config_state: configured port state
+ * @oper_state: operational port state
+ * @max_mac_tbl_ent: max number of mac table entries
+ * @max_smac_ent: max smac entries in mac table
+ * @mac_tbl_digest: mac table digest
+ * @encap_slid: base slid for the port
+ * @pcp_to_sc_uc: sc by pcp index for unicast ethernet packets
+ * @pcp_to_vl_uc: vl by pcp index for unicast ethernet packets
+ * @pcp_to_sc_mc: sc by pcp index for multicast ethernet packets
+ * @pcp_to_vl_mc: vl by pcp index for multicast ethernet packets
+ * @non_vlan_sc_uc: sc for non-vlan unicast ethernet packets
+ * @non_vlan_vl_uc: vl for non-vlan unicast ethernet packets
+ * @non_vlan_sc_mc: sc for non-vlan multicast ethernet packets
+ * @non_vlan_vl_mc: vl for non-vlan multicast ethernet packets
+ * @uc_macs_gen_count: generation count for unicast macs list
+ * @mc_macs_gen_count: generation count for multicast macs list
+ */
+struct opa_per_veswport_info {
+   __be32  port_num;
+
+   u8  eth_link_status;
+   u8  rsvd0[3];
+
+   u8  base_mac_addr[ETH_ALEN];
+   u8  config_state;
+   u8  oper_state;
+
+   __be16  max_mac_tbl_ent;
+   __be16  max_smac_ent;
+   __be32  mac_tbl_digest;
+   u8  rsvd1[4];
+
+   __be32  encap_slid;
+
+   u8  pcp_to_sc_uc[OPA_VNIC_MAX_NUM_PCP];
+   u8  pcp_to_vl_uc[OPA_VNIC_MAX_NUM_PCP];
+   u8

[PATCH rdma-next v2 06/12] IB/opa-vnic: VNIC statistics support

OPA VNIC driver statistics support maintains various counters including
standard netdev counters and the Ethernet manager defined counters.
Add the Ethtool hook to read the counters.
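
The counters are exported through a descriptor table of (name, size, offset)
entries. A minimal user-space sketch of that pattern follows; struct
demo_stats, demo_descs and demo_get_stats() are hypothetical and much smaller
than the driver's tables, and memcpy() is used for the reads here purely to
keep the sketch portable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical counters block mixing 64-bit and 32-bit fields. */
struct demo_stats {
	uint64_t rx_packets;
	uint32_t rx_runt;
};

/* One table entry per exported counter. */
struct stat_desc {
	const char *name;
	size_t size;   /* sizeof the field */
	size_t offset; /* offsetof the field within struct demo_stats */
};

#define DEMO_STAT(m) \
	sizeof(((struct demo_stats *)0)->m), offsetof(struct demo_stats, m)

static const struct stat_desc demo_descs[] = {
	{ "rx_packets", DEMO_STAT(rx_packets) },
	{ "rx_runt",    DEMO_STAT(rx_runt) },
};

#define DEMO_STATS_LEN (sizeof(demo_descs) / sizeof(demo_descs[0]))

/* Widen every counter to 64 bits, in table order. */
static void demo_get_stats(const struct demo_stats *s, uint64_t *data)
{
	size_t i;

	assert(s && data);
	for (i = 0; i < DEMO_STATS_LEN; i++) {
		const char *p = (const char *)s + demo_descs[i].offset;

		if (demo_descs[i].size == sizeof(uint64_t)) {
			uint64_t v;

			memcpy(&v, p, sizeof(v));
			data[i] = v;
		} else {
			uint32_t v;

			memcpy(&v, p, sizeof(v));
			data[i] = v;
		}
	}
}
```

Because the same table drives both the string list and the value list, the
two stay in the same order without any per-counter code.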

Reviewed-by: Dennis Dalessandro 
Reviewed-by: Ira Weiny 
Signed-off-by: Niranjana Vishwanathapura 
---
 drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c | 110 +
 .../infiniband/ulp/opa_vnic/opa_vnic_internal.h|   4 +
 drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c  |  18 
 3 files changed, 132 insertions(+)

diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c b/drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c
index b74f6ad..a98948c 100644
--- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c
@@ -53,9 +53,119 @@
 
 #include "opa_vnic_internal.h"
 
+enum {NETDEV_STATS, VNIC_STATS};
+
+struct vnic_stats {
+   char stat_string[ETH_GSTRING_LEN];
+   struct {
+   int sizeof_stat;
+   int stat_offset;
+   };
+};
+
+#define VNIC_STAT(m){ FIELD_SIZEOF(struct opa_vnic_stats, m),   \
+ offsetof(struct opa_vnic_stats, m) }
+
+static struct vnic_stats vnic_gstrings_stats[] = {
+   /* NETDEV stats */
+   {"rx_packets", VNIC_STAT(netstats.rx_packets)},
+   {"tx_packets", VNIC_STAT(netstats.tx_packets)},
+   {"rx_bytes", VNIC_STAT(netstats.rx_bytes)},
+   {"tx_bytes", VNIC_STAT(netstats.tx_bytes)},
+   {"rx_errors", VNIC_STAT(netstats.rx_errors)},
+   {"tx_errors", VNIC_STAT(netstats.tx_errors)},
+   {"rx_dropped", VNIC_STAT(netstats.rx_dropped)},
+   {"tx_dropped", VNIC_STAT(netstats.tx_dropped)},
+
+   /* SUMMARY counters */
+   {"tx_unicast", VNIC_STAT(tx_grp.unicast)},
+   {"tx_mcastbcast", VNIC_STAT(tx_grp.mcastbcast)},
+   {"tx_untagged", VNIC_STAT(tx_grp.untagged)},
+   {"tx_vlan", VNIC_STAT(tx_grp.vlan)},
+
+   {"tx_64_size", VNIC_STAT(tx_grp.s_64)},
+   {"tx_65_127", VNIC_STAT(tx_grp.s_65_127)},
+   {"tx_128_255", VNIC_STAT(tx_grp.s_128_255)},
+   {"tx_256_511", VNIC_STAT(tx_grp.s_256_511)},
+   {"tx_512_1023", VNIC_STAT(tx_grp.s_512_1023)},
+   {"tx_1024_1518", VNIC_STAT(tx_grp.s_1024_1518)},
+   {"tx_1519_max", VNIC_STAT(tx_grp.s_1519_max)},
+
+   {"rx_unicast", VNIC_STAT(rx_grp.unicast)},
+   {"rx_mcastbcast", VNIC_STAT(rx_grp.mcastbcast)},
+   {"rx_untagged", VNIC_STAT(rx_grp.untagged)},
+   {"rx_vlan", VNIC_STAT(rx_grp.vlan)},
+
+   {"rx_64_size", VNIC_STAT(rx_grp.s_64)},
+   {"rx_65_127", VNIC_STAT(rx_grp.s_65_127)},
+   {"rx_128_255", VNIC_STAT(rx_grp.s_128_255)},
+   {"rx_256_511", VNIC_STAT(rx_grp.s_256_511)},
+   {"rx_512_1023", VNIC_STAT(rx_grp.s_512_1023)},
+   {"rx_1024_1518", VNIC_STAT(rx_grp.s_1024_1518)},
+   {"rx_1519_max", VNIC_STAT(rx_grp.s_1519_max)},
+
+   /* ERROR counters */
+   {"rx_fifo_errors", VNIC_STAT(netstats.rx_fifo_errors)},
+   {"rx_length_errors", VNIC_STAT(netstats.rx_length_errors)},
+
+   {"tx_fifo_errors", VNIC_STAT(netstats.tx_fifo_errors)},
+   {"tx_carrier_errors", VNIC_STAT(netstats.tx_carrier_errors)},
+
+   {"tx_dlid_zero", VNIC_STAT(tx_dlid_zero)},
+   {"tx_drop_state", VNIC_STAT(tx_drop_state)},
+   {"rx_drop_state", VNIC_STAT(rx_drop_state)},
+   {"rx_oversize", VNIC_STAT(rx_oversize)},
+   {"rx_runt", VNIC_STAT(rx_runt)},
+};
+
+#define VNIC_STATS_LEN  ARRAY_SIZE(vnic_gstrings_stats)
+
+/* vnic_get_sset_count - get string set count */
+static int vnic_get_sset_count(struct net_device *netdev, int sset)
+{
+   return (sset == ETH_SS_STATS) ? VNIC_STATS_LEN : -EOPNOTSUPP;
+}
+
+/* vnic_get_ethtool_stats - get statistics */
+static void vnic_get_ethtool_stats(struct net_device *netdev,
+  struct ethtool_stats *stats, u64 *data)
+{
+   struct opa_vnic_adapter *adapter = opa_vnic_priv(netdev);
+   struct opa_vnic_stats vstats;
+   int i;
+
+   memset(&vstats, 0, sizeof(vstats));
+   mutex_lock(&adapter->stats_lock);
+   adapter->rn_ops->ndo_get_stats64(netdev, &vstats.netstats);
+   for (i = 0; i < VNIC_STATS_LEN; i++) {
+   char *p = (char *)&vstats + vnic_gstrings_stats[i].stat_offset;
+
+   data[i] = (vnic_gstrings_stats[i].sizeof_stat ==
+  sizeof(u64)) ? *(u64 *)p : *(u32 *)p;
+   }
+   mutex_unlock(&adapter->stats_lock);
+}
+
+/* vnic_get_strings - get strings */
+static void vnic_get_strings(struct net_device *netdev, u32 stringset, u8 *data)
+{
+   int i;
+
+   if (stringset != ETH_SS_STATS)
+   return;
+
+   for (i = 0; i < VNIC_STATS_LEN; i++)
+   memcpy(data + i * ETH_GSTRING_LEN,
+  vnic_gstrings_stats[i].stat_string,
+  ETH_GSTRING_LEN);
+}
+
 /* ethtool ops */
 static const struct ethtool_ops opa_vnic_ethtool_ops = {
.get_link

[PATCH rdma-next v2 02/12] IB/opa-vnic: RDMA NETDEV interface

Add an rdma netdev interface to the ib device structure, allowing rdma
netdev devices to be allocated by ib clients.

Reviewed-by: Dennis Dalessandro 
Reviewed-by: Ira Weiny 
Signed-off-by: Niranjana Vishwanathapura 
---
 include/rdma/ib_verbs.h | 33 +
 1 file changed, 33 insertions(+)

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 3a8e058..5c6b8c0 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -55,6 +55,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1877,6 +1878,24 @@ struct ib_port_immutable {
u32   max_mad_size;
 };
 
+/* rdma netdev type - specifies protocol type */
+enum rdma_netdev_t {
+   RDMA_NETDEV_OPA_VNIC
+};
+
+/**
+ * struct rdma_netdev - rdma netdev
+ * For cases where netstack interfacing is required.
+ */
+struct rdma_netdev {
+   void  *clnt_priv;
+   struct ib_device  *hca;
+   u8 port_num;
+
+   /* control functions */
+   void (*set_id)(struct net_device *netdev, int id);
+};
+
 struct ib_device {
char  name[IB_DEVICE_NAME_MAX];
 
@@ -2127,6 +2146,20 @@ struct ib_device {
   struct ib_rwq_ind_table_init_attr *init_attr,
   struct ib_udata *udata);
int(*destroy_rwq_ind_table)(struct ib_rwq_ind_table *wq_ind_table);
+   /**
+* rdma netdev operations
+*
+* Driver implementing alloc_rdma_netdev must return -EOPNOTSUPP if it
+* doesn't support the specified rdma netdev type.
+*/
+   struct net_device *(*alloc_rdma_netdev)(
+   struct ib_device *device,
+   u8 port_num,
+   enum rdma_netdev_t type,
+   const char *name,
+   unsigned char name_assign_type,
+   void (*setup)(struct net_device *));
+   void (*free_rdma_netdev)(struct net_device *netdev);
 
struct module   *owner;
struct devicedev;
-- 
1.8.3.1

[PATCH rdma-next v2 04/12] IB/opa-vnic: Virtual Network Interface Controller (VNIC) netdev

The OPA VNIC netdev function supports Ethernet functionality over the
Omni-Path fabric by encapsulating Ethernet packets inside an Omni-Path
packet header. It allocates an rdma netdev device and interfaces with the
network stack to provide standard Ethernet network interfaces. It overrides
the HFI1 device's netdev operations where required.

Reviewed-by: Dennis Dalessandro 
Reviewed-by: Ira Weiny 
Signed-off-by: Niranjana Vishwanathapura 
Signed-off-by: Sadanand Warrier 
Signed-off-by: Sudeep Dutt 
Signed-off-by: Andrzej Kacprowski 
---
 MAINTAINERS|   7 +
 drivers/infiniband/Kconfig |   1 +
 drivers/infiniband/ulp/Makefile|   1 +
 drivers/infiniband/ulp/opa_vnic/Kconfig|   8 +
 drivers/infiniband/ulp/opa_vnic/Makefile   |   6 +
 drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c   | 239 +
 drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h   |  62 ++
 drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c |  65 ++
 .../infiniband/ulp/opa_vnic/opa_vnic_internal.h| 186 
 drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c  | 227 +++
 10 files changed, 802 insertions(+)
 create mode 100644 drivers/infiniband/ulp/opa_vnic/Kconfig
 create mode 100644 drivers/infiniband/ulp/opa_vnic/Makefile
 create mode 100644 drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c
 create mode 100644 drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h
 create mode 100644 drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c
 create mode 100644 drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h
 create mode 100644 drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c

diff --git a/MAINTAINERS b/MAINTAINERS
index c776906..fc32256 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5843,6 +5843,13 @@ F:   drivers/block/cciss*
 F: include/linux/cciss_ioctl.h
 F: include/uapi/linux/cciss_ioctl.h
 
+OPA-VNIC DRIVER
+M: Dennis Dalessandro 
+M: Niranjana Vishwanathapura 
+L: linux-r...@vger.kernel.org
+S: Supported
+F: drivers/infiniband/ulp/opa_vnic
+
 HFI1 DRIVER
 M: Mike Marciniszyn 
 M: Dennis Dalessandro 
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 66f8602..234fe01 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -85,6 +85,7 @@ source "drivers/infiniband/ulp/srpt/Kconfig"
 source "drivers/infiniband/ulp/iser/Kconfig"
 source "drivers/infiniband/ulp/isert/Kconfig"
 
+source "drivers/infiniband/ulp/opa_vnic/Kconfig"
 source "drivers/infiniband/sw/rdmavt/Kconfig"
 source "drivers/infiniband/sw/rxe/Kconfig"
 
diff --git a/drivers/infiniband/ulp/Makefile b/drivers/infiniband/ulp/Makefile
index f3c7dcf..c28af18 100644
--- a/drivers/infiniband/ulp/Makefile
+++ b/drivers/infiniband/ulp/Makefile
@@ -3,3 +3,4 @@ obj-$(CONFIG_INFINIBAND_SRP)+= srp/
 obj-$(CONFIG_INFINIBAND_SRPT)  += srpt/
 obj-$(CONFIG_INFINIBAND_ISER)  += iser/
 obj-$(CONFIG_INFINIBAND_ISERT) += isert/
+obj-$(CONFIG_INFINIBAND_OPA_VNIC)  += opa_vnic/
diff --git a/drivers/infiniband/ulp/opa_vnic/Kconfig b/drivers/infiniband/ulp/opa_vnic/Kconfig
new file mode 100644
index 000..48132ab
--- /dev/null
+++ b/drivers/infiniband/ulp/opa_vnic/Kconfig
@@ -0,0 +1,8 @@
+config INFINIBAND_OPA_VNIC
+   tristate "Intel OPA VNIC support"
+   depends on X86_64 && INFINIBAND
+   ---help---
+   This is Omni-Path (OPA) Virtual Network Interface Controller (VNIC)
+   driver for Ethernet over Omni-Path feature. It implements the HW
+   independent VNIC functionality. It interfaces with Linux stack for
+   data path and IB MAD for the control path.
diff --git a/drivers/infiniband/ulp/opa_vnic/Makefile b/drivers/infiniband/ulp/opa_vnic/Makefile
new file mode 100644
index 000..975c313
--- /dev/null
+++ b/drivers/infiniband/ulp/opa_vnic/Makefile
@@ -0,0 +1,6 @@
+# Makefile - Intel Omni-Path Virtual Network Controller driver
+# Copyright(c) 2017, Intel Corporation.
+#
+obj-$(CONFIG_INFINIBAND_OPA_VNIC) += opa_vnic.o
+
+opa_vnic-y := opa_vnic_netdev.o opa_vnic_encap.o opa_vnic_ethtool.o
diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c b/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c
new file mode 100644
index 000..c74d02a
--- /dev/null
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c
@@ -0,0 +1,239 @@
+/*
+ * Copyright(c) 2017 Intel Corporation.
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR

[PATCH rdma-next v2 07/12] IB/opa-vnic: VNIC MAC table support

The OPA VNIC MAC table contains the MAC address to DLID mappings provided by
the Ethernet Manager. During transmission, the MAC table provides the MAC
address to DLID translation. Implement the MAC table as a simple hash list.
Also add support for the Ethernet Manager to update and query the MAC table.
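
As an illustration only (this is not the patch's code), the lookup side of
such a table can be sketched in user-space C. The bucket count of 256, the
type names and the function names below are assumptions made for the sketch;
the patch itself uses kernel hlists, RCU and OPA_VNIC_MAC_TBL_SIZE, whose
value is not shown here. The only detail taken from the patch is that the
last octet of the MAC address serves as the hash key.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAC_LEN     6
#define TBL_BUCKETS 256 /* assumption: one bucket per last-octet value */

/* Hypothetical node: one MAC address to DLID mapping. */
struct mac_node {
	uint8_t mac[MAC_LEN];
	uint32_t dlid;
	struct mac_node *next;
};

/* Hypothetical table: array of singly linked bucket chains. */
struct mac_tbl {
	struct mac_node *bucket[TBL_BUCKETS];
};

/* Insert or overwrite a mapping. Returns 0 on success, -1 on allocation failure. */
static int mac_tbl_set(struct mac_tbl *t, const uint8_t *mac, uint32_t dlid)
{
	uint8_t key = mac[MAC_LEN - 1]; /* last octet is the hash key */
	struct mac_node *n;

	assert(t != NULL);
	for (n = t->bucket[key]; n; n = n->next) {
		if (!memcmp(n->mac, mac, MAC_LEN)) {
			n->dlid = dlid;
			return 0;
		}
	}
	n = malloc(sizeof(*n));
	if (!n)
		return -1;
	memcpy(n->mac, mac, MAC_LEN);
	n->dlid = dlid;
	n->next = t->bucket[key];
	t->bucket[key] = n;
	return 0;
}

/* Translate a MAC address to a DLID. Returns 1 on hit, 0 on miss. */
static int mac_tbl_lookup(const struct mac_tbl *t, const uint8_t *mac,
			  uint32_t *dlid)
{
	const struct mac_node *n;

	for (n = t->bucket[mac[MAC_LEN - 1]]; n; n = n->next) {
		if (!memcmp(n->mac, mac, MAC_LEN)) {
			*dlid = n->dlid;
			return 1;
		}
	}
	return 0;
}

/* Free every node and leave the table empty. */
static void mac_tbl_free(struct mac_tbl *t)
{
	for (int i = 0; i < TBL_BUCKETS; i++) {
		struct mac_node *n = t->bucket[i];

		while (n) {
			struct mac_node *next = n->next;

			free(n);
			n = next;
		}
		t->bucket[i] = NULL;
	}
}
```

Two addresses that share a last octet land in the same bucket and are told
apart by the full six-byte compare, which is why the sketch never trusts the
key alone.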

Reviewed-by: Dennis Dalessandro 
Reviewed-by: Ira Weiny 
Signed-off-by: Niranjana Vishwanathapura 
Signed-off-by: Sadanand Warrier 
---
 drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c   | 236 +
 .../infiniband/ulp/opa_vnic/opa_vnic_internal.h|  51 +
 drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c  |   4 +
 3 files changed, 291 insertions(+)

diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c b/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c
index c74d02a..2e8fee9 100644
--- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c
@@ -96,6 +96,238 @@ static inline void opa_vnic_make_header(u8 *hdr, u32 slid, u32 dlid, u16 len,
memcpy(hdr, h, OPA_VNIC_HDR_LEN);
 }
 
+/*
+ * Using a simple hash table for mac table implementation with the last octet
+ * of mac address as a key.
+ */
+static void opa_vnic_free_mac_tbl(struct hlist_head *mactbl)
+{
+   struct opa_vnic_mac_tbl_node *node;
+   struct hlist_node *tmp;
+   int bkt;
+
+   if (!mactbl)
+   return;
+
+   vnic_hash_for_each_safe(mactbl, bkt, tmp, node, hlist) {
+   hash_del(&node->hlist);
+   kfree(node);
+   }
+   kfree(mactbl);
+}
+
+static struct hlist_head *opa_vnic_alloc_mac_tbl(void)
+{
+   u32 size = sizeof(struct hlist_head) * OPA_VNIC_MAC_TBL_SIZE;
+   struct hlist_head *mactbl;
+
+   mactbl = kzalloc(size, GFP_KERNEL);
+   if (!mactbl)
+   return ERR_PTR(-ENOMEM);
+
+   vnic_hash_init(mactbl);
+   return mactbl;
+}
+
+/* opa_vnic_release_mac_tbl - empty and free the mac table */
+void opa_vnic_release_mac_tbl(struct opa_vnic_adapter *adapter)
+{
+   struct hlist_head *mactbl;
+
+   mutex_lock(&adapter->mactbl_lock);
+   mactbl = rcu_access_pointer(adapter->mactbl);
+   rcu_assign_pointer(adapter->mactbl, NULL);
+   synchronize_rcu();
+   opa_vnic_free_mac_tbl(mactbl);
+   mutex_unlock(&adapter->mactbl_lock);
+}
+
+/*
+ * opa_vnic_query_mac_tbl - query the mac table for a section
+ *
+ * This function implements query of specific function of the mac table.
+ * The function also expects the requested range to be valid.
+ */
+void opa_vnic_query_mac_tbl(struct opa_vnic_adapter *adapter,
+   struct opa_veswport_mactable *tbl)
+{
+   struct opa_vnic_mac_tbl_node *node;
+   struct hlist_head *mactbl;
+   int bkt;
+   u16 loffset, lnum_entries;
+
+   rcu_read_lock();
+   mactbl = rcu_dereference(adapter->mactbl);
+   if (!mactbl)
+   goto get_mac_done;
+
+   loffset = be16_to_cpu(tbl->offset);
+   lnum_entries = be16_to_cpu(tbl->num_entries);
+
+   vnic_hash_for_each(mactbl, bkt, node, hlist) {
+   struct __opa_vnic_mactable_entry *nentry = &node->entry;
+   struct opa_veswport_mactable_entry *entry;
+
+   if ((node->index < loffset) ||
+   (node->index >= (loffset + lnum_entries)))
+   continue;
+
+   /* populate entry in the tbl corresponding to the index */
+   entry = &tbl->tbl_entries[node->index - loffset];
+   memcpy(entry->mac_addr, nentry->mac_addr,
+  ARRAY_SIZE(entry->mac_addr));
+   memcpy(entry->mac_addr_mask, nentry->mac_addr_mask,
+  ARRAY_SIZE(entry->mac_addr_mask));
+   entry->dlid_sd = cpu_to_be32(nentry->dlid_sd);
+   }
+   tbl->mac_tbl_digest = cpu_to_be32(adapter->info.vport.mac_tbl_digest);
+get_mac_done:
+   rcu_read_unlock();
+}
+
+/*
+ * opa_vnic_update_mac_tbl - update mac table section
+ *
+ * This function updates the specified section of the mac table.
+ * The procedure includes following steps.
+ *  - Allocate a new mac (hash) table.
+ *  - Add the specified entries to the new table.
+ *(except the ones that are requested to be deleted).
+ *  - Add all the other entries from the old mac table.
+ *  - If there is a failure, free the new table and return.
+ *  - Switch to the new table.
+ *  - Free the old table and return.
+ *
+ * The function also expects the requested range to be valid.
+ */
+int opa_vnic_update_mac_tbl(struct opa_vnic_adapter *adapter,
+   struct opa_veswport_mactable *tbl)
+{
+   struct opa_vnic_mac_tbl_node *node, *new_node;
+   struct hlist_head *new_mactbl, *old_mactbl;
+   int i, bkt, rc = 0;
+   u8 key;
+   u16 loffset, lnum_entries;
+
+   mutex_lock(&adapter->mactbl_lock);
+   /* allocate new mac table */
+   new_mactbl = opa_vnic_alloc_mac_tbl();
+   if (IS_ERR(

[PATCH rdma-next v2 10/12] IB/hfi1: OPA_VNIC RDMA netdev support

Add support to create and free OPA_VNIC rdma netdev devices.
Implement netstack interface functionality including xmit_skb,
receive side NAPI etc. Also implement rdma netdev control functions.

Reviewed-by: Dennis Dalessandro 
Reviewed-by: Ira Weiny 
Signed-off-by: Niranjana Vishwanathapura 
Signed-off-by: Andrzej Kacprowski 
---
 drivers/infiniband/hw/hfi1/Makefile|   2 +-
 drivers/infiniband/hw/hfi1/driver.c|  25 +-
 drivers/infiniband/hw/hfi1/hfi.h   |  27 +-
 drivers/infiniband/hw/hfi1/init.c  |   9 +-
 drivers/infiniband/hw/hfi1/vnic.h  | 153 
 drivers/infiniband/hw/hfi1/vnic_main.c | 644 +
 6 files changed, 853 insertions(+), 7 deletions(-)
 create mode 100644 drivers/infiniband/hw/hfi1/vnic.h
 create mode 100644 drivers/infiniband/hw/hfi1/vnic_main.c

diff --git a/drivers/infiniband/hw/hfi1/Makefile b/drivers/infiniband/hw/hfi1/Makefile
index 0cf97a0..2280538 100644
--- a/drivers/infiniband/hw/hfi1/Makefile
+++ b/drivers/infiniband/hw/hfi1/Makefile
@@ -12,7 +12,7 @@ hfi1-y := affinity.o chip.o device.o driver.o efivar.o \
init.o intr.o mad.o mmu_rb.o pcie.o pio.o pio_copy.o platform.o \
qp.o qsfp.o rc.o ruc.o sdma.o sysfs.o trace.o \
uc.o ud.o user_exp_rcv.o user_pages.o user_sdma.o verbs.o \
-   verbs_txreq.o
+   verbs_txreq.o vnic_main.o
 hfi1-$(CONFIG_DEBUG_FS) += debugfs.o
 
 CFLAGS_trace.o = -I$(src)
diff --git a/drivers/infiniband/hw/hfi1/driver.c b/drivers/infiniband/hw/hfi1/driver.c
index 64bdbce..e4dc6a5 100644
--- a/drivers/infiniband/hw/hfi1/driver.c
+++ b/drivers/infiniband/hw/hfi1/driver.c
@@ -1,5 +1,5 @@
 /*
- * Copyright(c) 2015, 2016 Intel Corporation.
+ * Copyright(c) 2015-2017 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
  * redistributing this file, you may do so under either license.
@@ -60,6 +60,7 @@
 #include "qp.h"
 #include "sdma.h"
 #include "debugfs.h"
+#include "vnic.h"
 
 #undef pr_fmt
 #define pr_fmt(fmt) DRIVER_NAME ": " fmt
@@ -1381,15 +1382,31 @@ int process_receive_ib(struct hfi1_packet *packet)
return RHF_RCV_CONTINUE;
 }
 
+static inline bool hfi1_is_vnic_packet(struct hfi1_packet *packet)
+{
+   /* Packet received in VNIC context via RSM */
+   if (packet->rcd->is_vnic)
+   return true;
+
+   if ((HFI1_GET_L2_TYPE(packet->ebuf) == OPA_VNIC_L2_TYPE) &&
+   (HFI1_GET_L4_TYPE(packet->ebuf) == OPA_VNIC_L4_ETHR))
+   return true;
+
+   return false;
+}
+
 int process_receive_bypass(struct hfi1_packet *packet)
 {
struct hfi1_devdata *dd = packet->rcd->dd;
 
-   if (unlikely(rhf_err_flags(packet->rhf)))
+   if (unlikely(rhf_err_flags(packet->rhf))) {
handle_eflags(packet);
+   } else if (hfi1_is_vnic_packet(packet)) {
+   hfi1_vnic_bypass_rcv(packet);
+   return RHF_RCV_CONTINUE;
+   }
 
-   dd_dev_err(dd,
-  "Bypass packets are not supported in normal operation. Dropping\n");
+   dd_dev_err(dd, "Unsupported bypass packet. Dropping\n");
incr_cntr64(&dd->sw_rcv_bypass_packet_errors);
if (!(dd->err_info_rcvport.status_and_code & OPA_EI_STATUS_SMASK)) {
u64 *flits = packet->ebuf;
diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index a31638c..f85e8f4 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -1,7 +1,7 @@
 #ifndef _HFI1_KERNEL_H
 #define _HFI1_KERNEL_H
 /*
- * Copyright(c) 2015, 2016 Intel Corporation.
+ * Copyright(c) 2015-2017 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
  * redistributing this file, you may do so under either license.
@@ -337,6 +337,12 @@ struct hfi1_ctxtdata {
 * packets with the wrong interrupt handler.
 */
int (*do_interrupt)(struct hfi1_ctxtdata *rcd, int threaded);
+
+   /* Indicates that this is vnic context */
+   bool is_vnic;
+
+   /* vnic queue index this context is mapped to */
+   u8 vnic_q_idx;
 };
 
 /*
@@ -808,6 +814,19 @@ struct hfi1_asic_data {
struct hfi1_i2c_bus *i2c_bus1;
 };
 
+/*
+ * Number of VNIC contexts used. Ensure it is less than or equal to
+ * max queues supported by VNIC (HFI1_VNIC_MAX_QUEUE).
+ */
+#define HFI1_NUM_VNIC_CTXT   8
+
+/* Virtual NIC information */
+struct hfi1_vnic_data {
+   struct idr vesw_idr;
+};
+
+struct hfi1_vnic_vport_info;
+
 /* device data struct now contains only "general per-device" info.
  * fields related to a physical IB port are in a hfi1_pportdata struct.
  */
@@ -1115,6 +1134,9 @@ struct hfi1_devdata {
send_routine process_dma_send;
void (*pio_inline_send)(struct hfi1_devdata *dd, struct pio_buf *pbuf,
u64 pbc, const void *from, size_t count);
+   int (*process_vnic_dma_send)(struct hfi1_devdata *dd, u8 q_idx,
+s

[PATCH rdma-next v2 11/12] IB/hfi1: Virtual Network Interface Controller (VNIC) HW support

Add HFI1 HW-specific support for VNIC functionality.
Dynamically allocate a set of contexts for VNIC when the first VNIC
port is instantiated. Allocate VNIC contexts from the user contexts pool
and return them to the same pool when they are freed. Set aside
enough MSI-X interrupts for VNIC contexts and assign them when the
contexts are allocated. On the receive side, use an RSM rule to
spread TCP/UDP streams among VNIC contexts.

Reviewed-by: Dennis Dalessandro 
Reviewed-by: Ira Weiny 
Signed-off-by: Niranjana Vishwanathapura 
Signed-off-by: Andrzej Kacprowski 
---
 drivers/infiniband/hw/hfi1/aspm.h |  15 +-
 drivers/infiniband/hw/hfi1/chip.c | 291 +-
 drivers/infiniband/hw/hfi1/chip.h |   2 +
 drivers/infiniband/hw/hfi1/debugfs.c  |   8 +-
 drivers/infiniband/hw/hfi1/driver.c   |  52 --
 drivers/infiniband/hw/hfi1/file_ops.c |  27 ++-
 drivers/infiniband/hw/hfi1/hfi.h  |  29 ++-
 drivers/infiniband/hw/hfi1/init.c |  29 +--
 drivers/infiniband/hw/hfi1/mad.c  |  10 +-
 drivers/infiniband/hw/hfi1/pio.c  |  19 +-
 drivers/infiniband/hw/hfi1/pio.h  |   8 +-
 drivers/infiniband/hw/hfi1/sysfs.c|   4 +-
 drivers/infiniband/hw/hfi1/user_exp_rcv.c |   8 +-
 drivers/infiniband/hw/hfi1/user_pages.c   |   5 +-
 drivers/infiniband/hw/hfi1/verbs.c|   6 +-
 drivers/infiniband/hw/hfi1/vnic.h |   3 +
 drivers/infiniband/hw/hfi1/vnic_main.c| 245 -
 include/rdma/opa_port_info.h  |   3 +-
 18 files changed, 660 insertions(+), 104 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/aspm.h b/drivers/infiniband/hw/hfi1/aspm.h
index 0d58fe3..794e681 100644
--- a/drivers/infiniband/hw/hfi1/aspm.h
+++ b/drivers/infiniband/hw/hfi1/aspm.h
@@ -1,5 +1,5 @@
 /*
- * Copyright(c) 2015, 2016 Intel Corporation.
+ * Copyright(c) 2015-2017 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
  * redistributing this file, you may do so under either license.
@@ -229,14 +229,17 @@ static inline void aspm_ctx_timer_function(unsigned long data)
spin_unlock_irqrestore(&rcd->aspm_lock, flags);
 }
 
-/* Disable interrupt processing for verbs contexts when PSM contexts are open */
+/*
+ * Disable interrupt processing for verbs contexts when PSM or VNIC contexts
+ * are open.
+ */
 static inline void aspm_disable_all(struct hfi1_devdata *dd)
 {
struct hfi1_ctxtdata *rcd;
unsigned long flags;
unsigned i;
 
-   for (i = 0; i < dd->first_user_ctxt; i++) {
+   for (i = 0; i < dd->first_dyn_alloc_ctxt; i++) {
rcd = dd->rcd[i];
del_timer_sync(&rcd->aspm_timer);
spin_lock_irqsave(&rcd->aspm_lock, flags);
@@ -260,7 +263,7 @@ static inline void aspm_enable_all(struct hfi1_devdata *dd)
if (aspm_mode != ASPM_MODE_DYNAMIC)
return;
 
-   for (i = 0; i < dd->first_user_ctxt; i++) {
+   for (i = 0; i < dd->first_dyn_alloc_ctxt; i++) {
rcd = dd->rcd[i];
spin_lock_irqsave(&rcd->aspm_lock, flags);
rcd->aspm_intr_enable = true;
@@ -276,7 +279,7 @@ static inline void aspm_ctx_init(struct hfi1_ctxtdata *rcd)
(unsigned long)rcd);
rcd->aspm_intr_supported = rcd->dd->aspm_supported &&
aspm_mode == ASPM_MODE_DYNAMIC &&
-   rcd->ctxt < rcd->dd->first_user_ctxt;
+   rcd->ctxt < rcd->dd->first_dyn_alloc_ctxt;
 }
 
 static inline void aspm_init(struct hfi1_devdata *dd)
@@ -286,7 +289,7 @@ static inline void aspm_init(struct hfi1_devdata *dd)
spin_lock_init(&dd->aspm_lock);
dd->aspm_supported = aspm_hw_l1_supported(dd);
 
-   for (i = 0; i < dd->first_user_ctxt; i++)
+   for (i = 0; i < dd->first_dyn_alloc_ctxt; i++)
aspm_ctx_init(dd->rcd[i]);
 
/* Start with ASPM disabled */
diff --git a/drivers/infiniband/hw/hfi1/chip.c b/drivers/infiniband/hw/hfi1/chip.c
index 79a316a..e520929 100644
--- a/drivers/infiniband/hw/hfi1/chip.c
+++ b/drivers/infiniband/hw/hfi1/chip.c
@@ -126,9 +126,16 @@ struct flag_table {
 #define DEFAULT_KRCVQS   2
 #define MIN_KERNEL_KCTXTS 2
 #define FIRST_KERNEL_KCTXT 1
-/* sizes for both the QP and RSM map tables */
-#define NUM_MAP_ENTRIES 256
-#define NUM_MAP_REGS 32
+
+/*
+ * RSM instance allocation
+ *   0 - Verbs
+ *   1 - User Fecn Handling
+ *   2 - Vnic
+ */
+#define RSM_INS_VERBS 0
+#define RSM_INS_FECN  1
+#define RSM_INS_VNIC  2
 
 /* Bit offset into the GUID which carries HFI id information */
 #define GUID_HFI_INDEX_SHIFT 39
@@ -139,8 +146,7 @@ struct flag_table {
 #define is_emulator_p(dd) ((((dd)->irev) & 0xf) == 3)
 #define is_emulator_s(dd) ((((dd)->irev) & 0xf) == 4)
 
-/* RSM fields */
-
+/* RSM fields for Verbs */
 /* packet type */
 #define IB_PACKET_TYPE

[PATCH rdma-next v2 08/12] IB/opa-vnic: VNIC Ethernet Management Agent (VEMA) interface

OPA VNIC EMA interface functions are the management interfaces to the OPA
VNIC netdev. Add support to add and remove VNIC ports. Implement the
required GET/SET management interface functions and processing of new
management information. Add support to send trap notifications upon various
events like interface status change, unicast/multicast mac list update and
mac address change.

Reviewed-by: Dennis Dalessandro 
Reviewed-by: Ira Weiny 
Signed-off-by: Niranjana Vishwanathapura 
Signed-off-by: Sadanand Warrier 
---
 drivers/infiniband/ulp/opa_vnic/Makefile   |   3 +-
 drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h   |   4 +
 .../infiniband/ulp/opa_vnic/opa_vnic_internal.h|  44 +++
 drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c  | 142 +++-
 .../infiniband/ulp/opa_vnic/opa_vnic_vema_iface.c  | 390 +
 5 files changed, 581 insertions(+), 2 deletions(-)
 create mode 100644 drivers/infiniband/ulp/opa_vnic/opa_vnic_vema_iface.c

diff --git a/drivers/infiniband/ulp/opa_vnic/Makefile b/drivers/infiniband/ulp/opa_vnic/Makefile
index 975c313..e8d1ea1 100644
--- a/drivers/infiniband/ulp/opa_vnic/Makefile
+++ b/drivers/infiniband/ulp/opa_vnic/Makefile
@@ -3,4 +3,5 @@
 #
 obj-$(CONFIG_INFINIBAND_OPA_VNIC) += opa_vnic.o
 
-opa_vnic-y := opa_vnic_netdev.o opa_vnic_encap.o opa_vnic_ethtool.o
+opa_vnic-y := opa_vnic_netdev.o opa_vnic_encap.o opa_vnic_ethtool.o \
+  opa_vnic_vema_iface.o
diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h b/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h
index c025cde..4c434b9 100644
--- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h
@@ -99,6 +99,10 @@
 #define OPA_VNIC_DLID_SD_IS_SRC_MAC(dlid_sd)  (!!((dlid_sd) & 0x20))
 #define OPA_VNIC_DLID_SD_GET_DLID(dlid_sd)((dlid_sd) >> 8)
 
+/* VNIC Ethernet link status */
+#define OPA_VNIC_ETH_LINK_UP 1
+#define OPA_VNIC_ETH_LINK_DOWN   2
+
 /**
  * struct opa_vesw_info - OPA vnic switch information
  * @fabric_id: 10-bit fabric id
diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h b/drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h
index bec4866..b49f5d7 100644
--- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h
@@ -161,14 +161,28 @@ struct __opa_veswport_trap {
 } __packed;
 
 /**
+ * struct opa_vnic_ctrl_port - OPA virtual NIC control port
+ * @ibdev: pointer to ib device
+ * @ops: opa vnic control operations
+ */
+struct opa_vnic_ctrl_port {
+   struct ib_device   *ibdev;
+   struct opa_vnic_ctrl_ops   *ops;
+};
+
+/**
  * struct opa_vnic_adapter - OPA VNIC netdev private data structure
  * @netdev: pointer to associated netdev
  * @ibdev: ib device
+ * @cport: pointer to opa vnic control port
  * @rn_ops: rdma netdev's net_device_ops
  * @port_num: OPA port number
  * @vport_num: vesw port number
  * @lock: adapter lock
  * @info: virtual ethernet switch port information
+ * @vema_mac_addr: mac address configured by vema
+ * @umac_hash: unicast maclist hash
+ * @mmac_hash: multicast maclist hash
  * @mactbl: hash table of MAC entries
  * @mactbl_lock: mac table lock
  * @stats_lock: statistics lock
@@ -177,6 +191,7 @@ struct __opa_veswport_trap {
 struct opa_vnic_adapter {
struct net_device *netdev;
struct ib_device  *ibdev;
+   struct opa_vnic_ctrl_port *cport;
const struct net_device_ops   *rn_ops;
 
u8 port_num;
@@ -186,6 +201,9 @@ struct opa_vnic_adapter {
struct mutex lock;
 
struct __opa_veswport_info  info;
+   u8  vema_mac_addr[ETH_ALEN];
+   u32 umac_hash;
+   u32 mmac_hash;
struct hlist_head  __rcu   *mactbl;
 
/* Lock used to protect updates to mac table */
@@ -225,6 +243,11 @@ struct opa_vnic_mac_tbl_node {
 #define v_warn(format, arg...) \
netdev_warn(adapter->netdev, format, ## arg)
 
+#define c_err(format, arg...) \
+   dev_err(&cport->ibdev->dev, format, ## arg)
+#define c_info(format, arg...) \
+   dev_info(&cport->ibdev->dev, format, ## arg)
+
 /* The maximum allowed entries in the mac table */
 #define OPA_VNIC_MAC_TBL_MAX_ENTRIES  2048
 /* Limit of smac entries in mac table */
@@ -264,11 +287,32 @@ struct opa_vnic_adapter *opa_vnic_add_netdev(struct ib_device *ibdev,
 void opa_vnic_encap_skb(struct opa_vnic_adapter *adapter, struct sk_buff *skb);
 u8 opa_vnic_get_vl(struct opa_vnic_adapter *adapter, struct sk_buff *skb);
 u8 opa_vnic_calc_entropy(struct opa_vnic_adapter *adapter, struct sk_buff *skb);
+void opa_vnic_process_vema_config(struct opa_vnic_adapter *adapter);
 void opa_vnic_release_mac_tbl(struct opa_vnic_adapter *adapter);
 void opa_vnic_query_mac_tbl(struct opa_vnic_adapter *adapter,
struct opa_veswport_mactable *tbl);
 int opa_vnic_update_mac_tbl(struct opa_vnic_

[PATCH rdma-next v2 09/12] IB/opa-vnic: VNIC Ethernet Management Agent (VEMA) function

The OPA VEMA function interfaces with the InfiniBand MAD stack to exchange
management information packets with the Ethernet Manager (EM).
It interfaces with the OPA VNIC netdev function to SET/GET the management
information. The information exchanged with the EM includes class port
details, encapsulation configuration, various counters, the unicast and
multicast MAC lists, and the MAC table. It also supports sending traps
to the EM.

Reviewed-by: Dennis Dalessandro 
Reviewed-by: Ira Weiny 
Signed-off-by: Sadanand Warrier 
Signed-off-by: Niranjana Vishwanathapura 
Signed-off-by: Sudeep Dutt 
---
 drivers/infiniband/ulp/opa_vnic/Makefile   |2 +-
 drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c |   12 +
 .../infiniband/ulp/opa_vnic/opa_vnic_internal.h|   17 +-
 drivers/infiniband/ulp/opa_vnic/opa_vnic_vema.c| 1078 
 .../infiniband/ulp/opa_vnic/opa_vnic_vema_iface.c  |2 +-
 5 files changed, 1106 insertions(+), 5 deletions(-)
 create mode 100644 drivers/infiniband/ulp/opa_vnic/opa_vnic_vema.c

diff --git a/drivers/infiniband/ulp/opa_vnic/Makefile 
b/drivers/infiniband/ulp/opa_vnic/Makefile
index e8d1ea1..8061b28 100644
--- a/drivers/infiniband/ulp/opa_vnic/Makefile
+++ b/drivers/infiniband/ulp/opa_vnic/Makefile
@@ -4,4 +4,4 @@
 obj-$(CONFIG_INFINIBAND_OPA_VNIC) += opa_vnic.o
 
 opa_vnic-y := opa_vnic_netdev.o opa_vnic_encap.o opa_vnic_ethtool.o \
-  opa_vnic_vema_iface.o
+  opa_vnic_vema.o opa_vnic_vema_iface.o
diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c b/drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c
index a98948c..d66540e 100644
--- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c
@@ -120,6 +120,17 @@ struct vnic_stats {
 
 #define VNIC_STATS_LEN  ARRAY_SIZE(vnic_gstrings_stats)
 
+/* vnic_get_drvinfo - get driver info */
+static void vnic_get_drvinfo(struct net_device *netdev,
+struct ethtool_drvinfo *drvinfo)
+{
+   strlcpy(drvinfo->driver, opa_vnic_driver_name, sizeof(drvinfo->driver));
+   strlcpy(drvinfo->version, opa_vnic_driver_version,
+   sizeof(drvinfo->version));
+   strlcpy(drvinfo->bus_info, dev_name(netdev->dev.parent),
+   sizeof(drvinfo->bus_info));
+}
+
 /* vnic_get_sset_count - get string set count */
 static int vnic_get_sset_count(struct net_device *netdev, int sset)
 {
@@ -162,6 +173,7 @@ static void vnic_get_strings(struct net_device *netdev, u32 stringset, u8 *data)
 
 /* ethtool ops */
 static const struct ethtool_ops opa_vnic_ethtool_ops = {
+   .get_drvinfo = vnic_get_drvinfo,
.get_link = ethtool_op_get_link,
.get_strings = vnic_get_strings,
.get_sset_count = vnic_get_sset_count,
diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h b/drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h
index b49f5d7..6bba886 100644
--- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h
@@ -164,10 +164,12 @@ struct __opa_veswport_trap {
  * struct opa_vnic_ctrl_port - OPA virtual NIC control port
  * @ibdev: pointer to ib device
  * @ops: opa vnic control operations
+ * @num_ports: number of opa ports
  */
 struct opa_vnic_ctrl_port {
struct ib_device   *ibdev;
struct opa_vnic_ctrl_ops   *ops;
+   u8  num_ports;
 };
 
 /**
@@ -187,6 +189,8 @@ struct opa_vnic_ctrl_port {
  * @mactbl_lock: mac table lock
  * @stats_lock: statistics lock
  * @flow_tbl: flow to default port redirection table
+ * @trap_timeout: trap timeout
+ * @trap_count: no. of traps allowed within timeout period
  */
 struct opa_vnic_adapter {
struct net_device *netdev;
@@ -213,6 +217,9 @@ struct opa_vnic_adapter {
struct mutex stats_lock;
 
u8 flow_tbl[OPA_VNIC_FLOW_TBL_SIZE];
+
+   unsigned long trap_timeout;
+   u8 trap_count;
 };
 
 /* Same as opa_veswport_mactable_entry, but without bitwise attribute */
@@ -247,6 +254,8 @@ struct opa_vnic_mac_tbl_node {
dev_err(&cport->ibdev->dev, format, ## arg)
 #define c_info(format, arg...) \
dev_info(&cport->ibdev->dev, format, ## arg)
+#define c_dbg(format, arg...) \
+   dev_dbg(&cport->ibdev->dev, format, ## arg)
 
 /* The maximum allowed entries in the mac table */
 #define OPA_VNIC_MAC_TBL_MAX_ENTRIES  2048
@@ -281,6 +290,9 @@ struct opa_vnic_mac_tbl_node {
!obj && (bkt) < OPA_VNIC_MAC_TBL_SIZE; (bkt)++)   \
hlist_for_each_entry(obj, &name[bkt], member)
 
+extern char opa_vnic_driver_name[];
+extern const char opa_vnic_driver_version[];
+
 struct opa_vnic_adapter *opa_vnic_add_netdev(struct ib_device *ibdev,
 u8 port_num, u8 vport_num);
 void opa_vnic_rem_netdev(struct opa_vnic_adapter *adapter);
@@ -310,9 +322,8 @@ void opa_vnic_get_per_veswport_info(struct opa_vni

[PATCH rdma-next v2 01/12] IB/opa-vnic: Virtual Network Interface Controller (VNIC) documentation

Add OPA VNIC design document explaining the VNIC architecture and the
driver design.

Reviewed-by: Dennis Dalessandro 
Reviewed-by: Ira Weiny 
Signed-off-by: Niranjana Vishwanathapura 
---
 Documentation/infiniband/opa_vnic.txt | 153 ++
 1 file changed, 153 insertions(+)
 create mode 100644 Documentation/infiniband/opa_vnic.txt

diff --git a/Documentation/infiniband/opa_vnic.txt b/Documentation/infiniband/opa_vnic.txt
new file mode 100644
index 000..282e17b
--- /dev/null
+++ b/Documentation/infiniband/opa_vnic.txt
@@ -0,0 +1,153 @@
+Intel Omni-Path (OPA) Virtual Network Interface Controller (VNIC) feature
+supports Ethernet functionality over Omni-Path fabric by encapsulating
+the Ethernet packets between HFI nodes.
+
+Architecture
+============
+The patterns of exchanges of Omni-Path encapsulated Ethernet packets
+involves one or more virtual Ethernet switches overlaid on the Omni-Path
+fabric topology. A subset of HFI nodes on the Omni-Path fabric are
+permitted to exchange encapsulated Ethernet packets across a particular
+virtual Ethernet switch. The virtual Ethernet switches are logical
+abstractions achieved by configuring the HFI nodes on the fabric for
+header generation and processing. In the simplest configuration all HFI
+nodes across the fabric exchange encapsulated Ethernet packets over a
+single virtual Ethernet switch. A virtual Ethernet switch, is effectively
+an independent Ethernet network. The configuration is performed by an
+Ethernet Manager (EM) which is part of the trusted Fabric Manager (FM)
+application. HFI nodes can have multiple VNICs each connected to a
+different virtual Ethernet switch. The below diagram presents a case
+of two virtual Ethernet switches with two HFI nodes.
+
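+                              +-------------------+
+                              |   Subnet/         |
+                              |   Ethernet        |
+                              |   Manager         |
+                              +-------------------+
+                                 /          /
+                               /           /
+                             /            /
+                           /             /
++-----------------------------+  +------------------------------+
+|  Virtual Ethernet Switch    |  |  Virtual Ethernet Switch     |
+|  +---------+    +---------+ |  | +---------+    +---------+   |
+|  | VPORT   |    |  VPORT  | |  | |  VPORT  |    |  VPORT  |   |
++--+---------+----+---------+-+  +-+---------+----+---------+---+
+         |                 \        /                 |
+         |                   \    /                   |
+         |                     \/                     |
+         |                    /  \                    |
+         |                  /      \                  |
+     +-----------+------------+  +-----------+------------+
+     |   VNIC    |    VNIC    |  |    VNIC   |    VNIC    |
+     +-----------+------------+  +-----------+------------+
+     |          HFI           |  |          HFI           |
+     +------------------------+  +------------------------+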
+
+
+The Omni-Path encapsulated Ethernet packet format is as described below.
+
+Bits      Field
+
+Quad Word 0:
+0-19      SLID (lower 20 bits)
+20-30     Length (in Quad Words)
+31        BECN bit
+32-51     DLID (lower 20 bits)
+52-56     SC (Service Class)
+57-59     RC (Routing Control)
+60        FECN bit
+61-62     L2 (=10, 16B format)
+63        LT (=1, Link Transfer Head Flit)
+
+Quad Word 1:
+0-7       L4 type (=0x78 ETHERNET)
+8-11      SLID[23:20]
+12-15     DLID[23:20]
+16-31     PKEY
+32-47     Entropy
+48-63     Reserved
+
+Quad Word 2:
+0-15      Reserved
+16-31     L4 header
+32-63     Ethernet Packet
+
+Quad Words 3 to N-1:
+0-63      Ethernet packet (pad extended)
+
+Quad Word N (last):
+0-23      Ethernet packet (pad extended)
+24-55     ICRC
+56-61     Tail
+62-63     LT (=01, Link Transfer Tail Flit)
+
+Ethernet packet is padded on the transmit side to ensure that the VNIC OPA
+packet is quad word aligned. The 'Tail' field contains the number of bytes
+padded. On the receive side the 'Tail' field is read and the padding is
+removed (along with ICRC, Tail and OPA header) before passing packet up
+the network stack.
+
+The L4 header field contains the virtual Ethernet switch id the VNIC port
+belongs to. On the receive side, this field is used to de-multiplex the
+received VNIC packets to different VNIC ports.
+
+Driver Design
+=============
+Intel OPA VNIC software design is presented in the below diagram.
+OPA VNIC functionality has a HW dependent component and a HW
+independent component.
+
+The support has been added for IB device to allocate and free the RDMA
+netdev devices. The RDMA netdev supports interfacing with the network
+stack thus creating standard network interfaces. OPA_VNIC is an RDMA
+netdev device type.
+
+The HW dependent VNIC functionality is part of

[PATCH rdma-next v2 00/12] Omni-Path Virtual Network Interface Controller (VNIC)

ChangeLog:
==========
v1 => v2:
a) Change error code for unsupported rdma netdev type
b) Remove some debug messages

v1: posted @ https://www.spinics.net/lists/linux-rdma/msg48518.html

v0 => v1:
a) changes as required by new kernel base (for-4.12)
b) moved rdma netdev interface into a separate patch 
c) Return specific error code if specified rdma netdev type is not supported
d) Some minor fixes (no changes to overall design or interface)

v0: Initial post @ https://www.spinics.net/lists/linux-rdma/msg46604.html

Description:

Intel Omni-Path (OPA) Virtual Network Interface Controller (VNIC) feature
supports Ethernet functionality over Omni-Path fabric by encapsulating
the Ethernet packets between HFI nodes.

Architecture
============
The exchange of Omni-Path encapsulated Ethernet packets involves one or
more virtual Ethernet switches overlaid on the Omni-Path fabric topology.
A subset of the HFI nodes on the Omni-Path fabric is permitted to
exchange encapsulated Ethernet packets across a particular virtual
Ethernet switch. The virtual Ethernet switches are logical abstractions
achieved by configuring the HFI nodes on the fabric for header generation
and processing. In the simplest configuration, all HFI nodes across the
fabric exchange encapsulated Ethernet packets over a single virtual
Ethernet switch. A virtual Ethernet switch is effectively an independent
Ethernet network. The configuration is performed by an Ethernet Manager
(EM), which is part of the trusted Fabric Manager (FM) application. HFI
nodes can have multiple VNICs, each connected to a different virtual
Ethernet switch. The diagram below presents a case of two virtual
Ethernet switches with two HFI nodes.

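                              +-------------------+
                              |   Subnet/         |
                              |   Ethernet        |
                              |   Manager         |
                              +-------------------+
                                 /          /
                               /           /
                             /            /
                           /             /
+-----------------------------+  +------------------------------+
|  Virtual Ethernet Switch    |  |  Virtual Ethernet Switch     |
|  +---------+    +---------+ |  | +---------+    +---------+   |
|  | VPORT   |    |  VPORT  | |  | |  VPORT  |    |  VPORT  |   |
+--+---------+----+---------+-+  +-+---------+----+---------+---+
         |                 \        /                 |
         |                   \    /                   |
         |                     \/                     |
         |                    /  \                    |
         |                  /      \                  |
     +-----------+------------+  +-----------+------------+
     |   VNIC    |    VNIC    |  |    VNIC   |    VNIC    |
     +-----------+------------+  +-----------+------------+
     |          HFI           |  |          HFI           |
     +------------------------+  +------------------------+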


The Omni-Path encapsulated Ethernet packet format is as described below.

Bits  Field

Quad Word 0:
0-19  SLID (lower 20 bits)
20-30 Length (in Quad Words)
31    BECN bit
32-51 DLID (lower 20 bits)
52-56 SC (Service Class)
57-59 RC (Routing Control)
60    FECN bit
61-62 L2 (=10, 16B format)
63    LT (=1, Link Transfer Head Flit)

Quad Word 1:
0-7   L4 type (=0x78 ETHERNET)
8-11  SLID[23:20]
12-15 DLID[23:20]
16-31 PKEY
32-47 Entropy
48-63 Reserved

Quad Word 2:
0-15  Reserved
16-31 L4 header
32-63 Ethernet Packet

Quad Words 3 to N-1:
0-63  Ethernet packet (pad extended)

Quad Word N (last):
0-23  Ethernet packet (pad extended)
24-55 ICRC
56-61 Tail
62-63 LT (=01, Link Transfer Tail Flit)

The Ethernet packet is padded on the transmit side to ensure that the VNIC
OPA packet is quad word aligned. The 'Tail' field contains the number of
bytes padded. On the receive side the 'Tail' field is read and the padding
is removed (along with the ICRC, Tail and OPA header) before passing the
packet up the network stack.
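
As a rough illustration of the padding rule above, the sketch below
computes the pad length and the resulting wire length in quad words. It
assumes that the length being aligned is the OPA header plus the Ethernet
packet, followed by the 4-byte ICRC and the 1-byte Tail. The helper names
vnic_pad_len() and vnic_wire_qwords() are hypothetical and not part of
this series.

```c
#include <stdint.h>

#define QW_BYTES        8u  /* one quad word */
#define ICRC_TAIL_BYTES 5u  /* 4-byte ICRC + 1-byte Tail, as in opa_vnic.h */

/* Bytes of padding needed so that (len + ICRC + Tail) is a multiple of 8.
 * 'len' is assumed to be the OPA header plus the Ethernet packet. */
static inline uint32_t vnic_pad_len(uint32_t len)
{
	return (QW_BYTES - (len + ICRC_TAIL_BYTES) % QW_BYTES) % QW_BYTES;
}

/* Total length on the wire, in quad words (assumption: header, packet,
 * padding, ICRC and Tail are all counted). */
static inline uint32_t vnic_wire_qwords(uint32_t len)
{
	return (len + vnic_pad_len(len) + ICRC_TAIL_BYTES) / QW_BYTES;
}
```

For example, an 80-byte header-plus-packet needs 3 bytes of padding under
these assumptions, and the receive side would strip those 3 bytes after
reading the Tail field.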

The L4 header field contains the virtual Ethernet switch id the VNIC port
belongs to. On the receive side, this field is used to de-multiplex the
received VNIC packets to different VNIC ports.
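
The field layout above can be sketched as a few bit-extraction helpers.
This is only an illustration: it assumes each quad word is already held
in a host-order 64-bit integer and that bit 0 is the least significant
bit. The function names are hypothetical and do not come from the driver.

```c
#include <stdint.h>

/* Extract 'width' bits starting at bit 'lsb' (bit 0 = least significant;
 * this bit numbering is an assumption of the sketch). */
static inline uint64_t qw_bits(uint64_t qw, unsigned int lsb,
			       unsigned int width)
{
	return (qw >> lsb) & ((UINT64_C(1) << width) - 1);
}

/* Quad Word 1, bits 0-7: L4 type (0x78 for Ethernet). */
static inline uint8_t vnic_l4_type(uint64_t qw1)
{
	return (uint8_t)qw_bits(qw1, 0, 8);
}

/* Quad Word 2, bits 16-31: L4 header, i.e. the virtual Ethernet switch
 * id used to de-multiplex received packets to VNIC ports. */
static inline uint16_t vnic_vesw_id(uint64_t qw2)
{
	return (uint16_t)qw_bits(qw2, 16, 16);
}

/* 24-bit DLID: lower 20 bits from QW0[32..51], upper 4 from QW1[12..15]. */
static inline uint32_t vnic_dlid(uint64_t qw0, uint64_t qw1)
{
	return (uint32_t)(qw_bits(qw0, 32, 20) | (qw_bits(qw1, 12, 4) << 20));
}
```

A receive path built this way would first check that the L4 type is 0x78
and then use the switch id to pick the VNIC port.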

Driver Design
=============
The Intel OPA VNIC software design is presented in the diagram below.
OPA VNIC functionality has a HW dependent component and a HW
independent component.

Support has been added for IB devices to allocate and free RDMA
netdev devices. The RDMA netdev interfaces with the network
stack, thus creating standard network interfaces. OPA_VNIC is an RDMA
netdev device type.

The HW dependent VNIC functionality is part of the HFI1 driver. It
implements the verbs to allocate and free the OPA_VNIC RDMA netdev.
It involves HW resource allocation/management f

[PATCH rdma-next v2 03/12] IB/opa-vnic: Virtual Network Interface Controller (VNIC) interface

Define OPA VNIC interface between hardware independent VNIC
functionality and the hardware dependent VNIC functionality.

Reviewed-by: Dennis Dalessandro 
Reviewed-by: Ira Weiny 
Signed-off-by: Niranjana Vishwanathapura 
---
 include/rdma/ib_verbs.h |   1 +
 include/rdma/opa_vnic.h | 141 
 2 files changed, 142 insertions(+)
 create mode 100644 include/rdma/opa_vnic.h

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 5c6b8c0..88abef8 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -225,6 +225,7 @@ enum ib_device_cap_flags {
IB_DEVICE_VIRTUAL_FUNCTION  = (1ULL << 33),
/* Deprecated. Please use IB_RAW_PACKET_CAP_SCATTER_FCS. */
IB_DEVICE_RAW_SCATTER_FCS   = (1ULL << 34),
+   IB_DEVICE_RDMA_NETDEV_OPA_VNIC  = (1ULL << 35),
 };
 
 enum ib_signature_prot_cap {
diff --git a/include/rdma/opa_vnic.h b/include/rdma/opa_vnic.h
new file mode 100644
index 000..39d6890
--- /dev/null
+++ b/include/rdma/opa_vnic.h
@@ -0,0 +1,141 @@
+#ifndef _OPA_VNIC_H
+#define _OPA_VNIC_H
+/*
+ * Copyright(c) 2017 Intel Corporation.
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in
+ *the documentation and/or other materials provided with the
+ *distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *contributors may be used to endorse or promote products derived
+ *from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+
+/*
+ * This file contains Intel Omni-Path (OPA) Virtual Network Interface
+ * Controller (VNIC) specific declarations.
+ */
+
+#include 
+
+/* VNIC uses 16B header format */
+#define OPA_VNIC_L2_TYPE  0x2
+
+/* 16 header bytes + 2 reserved bytes */
+#define OPA_VNIC_L2_HDR_LEN   (16 + 2)
+
+#define OPA_VNIC_L4_HDR_LEN   2
+
+#define OPA_VNIC_HDR_LEN  (OPA_VNIC_L2_HDR_LEN + \
+  OPA_VNIC_L4_HDR_LEN)
+
+#define OPA_VNIC_L4_ETHR  0x78
+
+#define OPA_VNIC_ICRC_LEN   4
+#define OPA_VNIC_TAIL_LEN   1
+#define OPA_VNIC_ICRC_TAIL_LEN  (OPA_VNIC_ICRC_LEN + OPA_VNIC_TAIL_LEN)
+
+#define OPA_VNIC_SKB_MDATA_LEN 4
+#define OPA_VNIC_SKB_MDATA_ENCAP_ERR   0x1
+
+/* opa vnic rdma netdev's private data structure */
+struct opa_vnic_rdma_netdev {
+   struct rdma_netdev rn;  /* keep this first */
+   /* followed by device private data */
+   char *dev_priv[0];
+};
+
+static inline void *opa_vnic_priv(const struct net_device *dev)
+{
+   struct rdma_netdev *rn = netdev_priv(dev);
+
+   return rn->clnt_priv;
+}
+
+static inline void *opa_vnic_dev_priv(const struct net_device *dev)
+{
+   struct opa_vnic_rdma_netdev *oparn = netdev_priv(dev);
+
+   return oparn->dev_priv;
+}
+
+/* opa_vnic skb meta data structure */
+struct opa_vnic_skb_mdata {
+   u8 vl;
+   u8 entropy;
+   u8 flags;
+   u8 rsvd;
+} __packed;
+
+/* OPA VNIC group statistics */
+struct opa_vnic_grp_stats {
+   u64 unicast;
+   u64 mcastbcast;
+   u64 untagged;
+   u64 vlan;
+   u64 s_64;
+   u64 s_65_127;
+   u64 s_128_255;
+   u64 s_256_511;
+

Re: eBPF - little-endian load instructions?

2017-04-12 Thread Alexei Starovoitov

On Wed, Apr 12, 2017 at 09:38:39PM +0200, Johannes Berg wrote:
> On Wed, 2017-04-12 at 09:58 -0700, Alexei Starovoitov wrote:
> > 
> > > Are these hooked up to llvm intrinsics or so? If not, can I do that
> > > through some kind of inline asm statement?
> > 
> > llvm doesn't support bpf inline asm yet.
> 
> Ok.
> 
> > > In the samples, I only see people doing
> > > 
> > > #define _htonl __builtin_bswap32
> > > 
> > > but I'm not even completely convinced that's correct, since it
> > > assumes
> > > a little-endian host?
> > 
> > oh well, time to face the music.
> > 
> > In llvm backend I did:
> > // bswap16, bswap32, bswap64
> > class BSWAP SizeOp, string OpcodeStr, list Pattern>
> > ...
> >   let op = 0xd; // BPF_END
> >   let BPFSrc = 1;   // BPF_TO_BE (TODO: use BPF_TO_LE for big-endian
> > target)
> >   let BPFClass = 4; // BPF_ALU
> > 
> > so __builtin_bswap32() is not a normal bswap. It's only doing bswap
> > if the compiled program running on little endian arch.
> > The plan was to fix it up for -march=bpfeb target (hence the comment
> > above), but it turned out that such __builtin_bswap32 matches
> > perfectly to _htonl() semantics, so I left it as-is even for
> > -march=bpfeb.
> > 
> > On little endian:
> > ld_abs_W = *(u32*) + real bswap32
> > __builtin_bswap32() == bpf_to_be insn = real bswap32
> > 
> > On big endian:
> > ld_abs_W = *(u32*)
> > __builtin_bswap32() == bpf_to_be insn = nop
> > 
> > so in samples/bpf/*.c:
> > load_word() + _htonl()(__builtin_bswap32) has the same semantics
> > for both little and big endian archs, hence all networking sample
> > code in
> > samples/bpf/*_kern.c works fine.
> > 
> > imo the whole thing is crazy ugly. llvm doesn't have 'htonl'
> > equivalent builtin, so builtin_bswap was the closest I could use to
> > generate bpf_to_[bl]e insn.
> > 
> 
> Awkward. How can this even be fixed without breaking all the existing
> code?

It's really an llvm bug that I need to fix. It's plain broken
to generate what is effectively a nop insn for march=bpfeb.
My only excuse is that when that code was written llvm had only
march=bpfel. bpfeb was added much later.

> I assume the BPF machine is intended to be endian-independent, which is
> really the problem - normally you'd either
>   #define be32_to_cpu bswap32
> or
>   #define be32_to_cpu(x) (x)
> depending on the build architecture, I guess.

yeah. that's what we should have in bpf_helpers.h
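
A minimal sketch of what such helpers could look like, assuming a
compiler that predefines __BYTE_ORDER__ (gcc and clang do). The
sketch_* names and load_be32() are hypothetical. On an ordinary host
__builtin_bswap32() is a real byte swap; as described above, the BPF
backend currently treats it differently, so this shows the intended
shape only, not something known to work for the bpf target today.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers; __BYTE_ORDER__ is a gcc/clang predefined macro. */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define sketch_be32_to_cpu(x) __builtin_bswap32(x)
#else
#define sketch_be32_to_cpu(x) (x)
#endif
/* The conversion is its own inverse, so both directions share it. */
#define sketch_cpu_to_be32(x) sketch_be32_to_cpu(x)

/* Read a big-endian 32-bit value from a byte buffer. */
static inline uint32_t load_be32(const unsigned char *p)
{
	uint32_t v;

	memcpy(&v, p, sizeof(v));	/* avoids unaligned access */
	return sketch_be32_to_cpu(v);
}
```

With this shape the same source gives the same result on little-endian
and big-endian builds, which is the property the samples rely on.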

> > To solve this properly I think we need two things:
> > . proper bswap32 insn in BPF
> 
> Not sure you need that - what for? Normally this doesn't really get used 
> directly, I think? At least I don't really see a good reason for using it 
> directly. And reimplementing that now would break existing C code.

I think bswap is only used in crypto and things like crypto.
In the kernel it's 802.15.4 mac.
ntoh is enough for any networking code,
so I guess we can live without real bswap insn.

> > . extend llvm with bpf_to_be/le builtins
> > Both are not trivial amount of work.
> 
> It seems that perhaps the best way to solve this would be to actually
> implement inline assembly. Then, existing C code that relies on the
> (broken) bswap32 semantics can actually continue to work, if that
> built-in isn't touched, and one could then implement the various
> cpu_to_be32 and friends as inline assembly?
> 
> That would make it invisible to the LLVM optimiser though, so perhaps
> not the best idea either.

In llvm the inline asm is actually visible to the optimizer (unlike gcc),
so it can be ok-ish.
Inline asm needs to be done anyway.

[PATCH net-next 2/8] net/ncsi: Properly track channel monitor timer state

The field @monitor.enabled in the NCSI channel descriptor tracks the
state of the channel monitor timer, indicating whether the timer is
pending. The timer's handler can return without starting the timer
again, and in that case @monitor.enabled is not updated to false. This
leads to the warning below, printed by WARN_ON_ONCE(), when the monitor
is restarted afterwards.

   [ cut here ]
   WARNING: CPU: 0 PID: 411 at /var/lib/jenkins/workspace/openbmc-build \
   /distro/ubuntu/target/palmetto/openbmc/build/tmp/work-shared/palmetto \
   net/ncsi/ncsi-manage.c:240 ncsi_start_channel_monitor+0x44/0x7c
   CPU: 0 PID: 411 Comm: kworker/0:3 Not tainted \
   4.7.10-f26558191540830589fe03932d05577957670b8d #1
   Hardware name: ASpeed SoC
   Workqueue: events ncsi_dev_work
   [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
   [] (show_stack) from [] (__warn+0xc4/0xf0)
   [] (__warn) from [] (warn_slowpath_null+0x1c/0x24)
   [] (warn_slowpath_null) from [] 
(ncsi_start_channel_monitor+0x44/0x7c)
   [] (ncsi_start_channel_monitor) from [] 
(ncsi_configure_channel+0x27c/0x2dc)
   [] (ncsi_configure_channel) from [] 
(ncsi_dev_work+0x39c/0x3e8)
   [] (ncsi_dev_work) from [] (process_one_work+0x1b8/0x2fc)
   [] (process_one_work) from [] (worker_thread+0x2c0/0x3f8)
   [] (worker_thread) from [] (kthread+0xd0/0xe8)
   [] (kthread) from [] (ret_from_fork+0x14/0x24)
   ---[ end trace 110cccf2b038c44d ]---

This fixes the issue by updating @monitor.enabled to false if needed.
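
The failure mode can be modelled in user space with a flag and a
counter. This is a simplified sketch, not the kernel code: struct
monitor and the monitor_*() functions are hypothetical stand-ins, and
the 'fixed' argument only selects between the old and new behaviour.

```c
#include <stdbool.h>

/* Simplified stand-in for the channel monitor; not the kernel structure. */
struct monitor {
	bool enabled;	/* true while the timer is considered pending */
	int warnings;	/* counts what WARN_ON_ONCE() would have reported */
};

static void monitor_start(struct monitor *m)
{
	if (m->enabled)
		m->warnings++;	/* starting an already "enabled" monitor */
	m->enabled = true;
}

static void monitor_stop(struct monitor *m)
{
	m->enabled = false;
}

/* Timer handler: 'rearm' says whether the handler restarts the timer.
 * 'fixed' selects the behaviour with the patch applied. */
static void monitor_handler(struct monitor *m, bool rearm, bool fixed)
{
	if (rearm)
		return;			/* timer pending again, flag stays true */
	if (fixed)
		monitor_stop(m);	/* flag now matches the timer state */
}
```

Without the fix, a handler that returns early leaves the flag set, so
the next start trips the warning. With the fix the flag is cleared on
every exit path that does not re-arm the timer.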

Reported-by: Sridevi Ramesh 
Signed-off-by: Gavin Shan 
---
 net/ncsi/ncsi-manage.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c
index 5073e15..c71a3a5 100644
--- a/net/ncsi/ncsi-manage.c
+++ b/net/ncsi/ncsi-manage.c
@@ -183,11 +183,16 @@ static void ncsi_channel_monitor(unsigned long data)
monitor_state = nc->monitor.state;
spin_unlock_irqrestore(&nc->lock, flags);
 
-   if (!enabled || chained)
+   if (!enabled || chained) {
+   ncsi_stop_channel_monitor(nc);
return;
+   }
+
if (state != NCSI_CHANNEL_INACTIVE &&
-   state != NCSI_CHANNEL_ACTIVE)
+   state != NCSI_CHANNEL_ACTIVE) {
+   ncsi_stop_channel_monitor(nc);
return;
+   }
 
switch (monitor_state) {
case NCSI_CHANNEL_MONITOR_START:
@@ -199,6 +204,7 @@ static void ncsi_channel_monitor(unsigned long data)
nca.req_flags = 0;
ret = ncsi_xmit_cmd(&nca);
if (ret) {
+   ncsi_stop_channel_monitor(nc);
netdev_err(ndp->ndev.dev, "Error %d sending GLS\n",
   ret);
return;
@@ -218,6 +224,8 @@ static void ncsi_channel_monitor(unsigned long data)
nc->state = NCSI_CHANNEL_INVISIBLE;
spin_unlock_irqrestore(&nc->lock, flags);
 
+   ncsi_stop_channel_monitor(nc);
+
spin_lock_irqsave(&ndp->lock, flags);
nc->state = NCSI_CHANNEL_INACTIVE;
list_add_tail_rcu(&nc->link, &ndp->channel_queue);
@@ -257,6 +265,10 @@ void ncsi_stop_channel_monitor(struct ncsi_channel *nc)
nc->monitor.enabled = false;
spin_unlock_irqrestore(&nc->lock, flags);
 
+   /* The timer isn't in pending state if we're deleting the timer
+* in its handler. del_timer_sync() can detect it and just does
+* nothing.
+*/
del_timer_sync(&nc->monitor.timer);
 }
 
-- 
2.7.4

[PATCH net-next 3/8] net/ncsi: Enforce failover on link monitor timeout

The NCSI channel has been configured to provide service if its link
monitor timer is enabled, regardless of its state (inactive or active).
So a timeout event on the link monitor indicates that the channel is
out of service, and a failover is needed.

This sets the NCSI_DEV_RESHUFFLE flag to enforce failover on link monitor
timeout, regardless of the channel's original state (inactive or active).
Also, the link is put into the "down" state to give the failing channel
the lowest priority when selecting the active channel. The state of the
failing channel should be set to active in order for deinitialization
and failover to be done.

Signed-off-by: Gavin Shan 
---
 net/ncsi/ncsi-manage.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c
index c71a3a5..13ad1f26 100644
--- a/net/ncsi/ncsi-manage.c
+++ b/net/ncsi/ncsi-manage.c
@@ -170,6 +170,7 @@ static void ncsi_channel_monitor(unsigned long data)
struct ncsi_channel *nc = (struct ncsi_channel *)data;
struct ncsi_package *np = nc->package;
struct ncsi_dev_priv *ndp = np->ndp;
+   struct ncsi_channel_mode *ncm;
struct ncsi_cmd_arg nca;
bool enabled, chained;
unsigned int monitor_state;
@@ -214,20 +215,21 @@ static void ncsi_channel_monitor(unsigned long data)
case NCSI_CHANNEL_MONITOR_WAIT ... NCSI_CHANNEL_MONITOR_WAIT_MAX:
break;
default:
-   if (!(ndp->flags & NCSI_DEV_HWA) &&
-   state == NCSI_CHANNEL_ACTIVE) {
+   if (!(ndp->flags & NCSI_DEV_HWA)) {
ncsi_report_link(ndp, true);
ndp->flags |= NCSI_DEV_RESHUFFLE;
}
 
+   ncm = &nc->modes[NCSI_MODE_LINK];
spin_lock_irqsave(&nc->lock, flags);
nc->state = NCSI_CHANNEL_INVISIBLE;
+   ncm->data[2] &= ~0x1;
spin_unlock_irqrestore(&nc->lock, flags);
 
ncsi_stop_channel_monitor(nc);
 
spin_lock_irqsave(&ndp->lock, flags);
-   nc->state = NCSI_CHANNEL_INACTIVE;
+   nc->state = NCSI_CHANNEL_ACTIVE;
list_add_tail_rcu(&nc->link, &ndp->channel_queue);
spin_unlock_irqrestore(&ndp->lock, flags);
ncsi_process_next_channel(ndp);
-- 
2.7.4

[PATCH net-next 1/8] net/ncsi: Disable HWA mode when no channels are found

When no NCSI channels are probed, HWA (Hardware Arbitration) mode is
enabled. This is incorrect because HWA requires that NCSI channels
exist and that all of them support HWA mode. This disables HWA when no
channels are probed.
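
The underlying pitfall is that a check of the form "every channel
supports HWA" is trivially true for an empty set. A minimal user-space
sketch, with a hypothetical CAP_HWA bit standing in for the real
capability test:

```c
#include <stdbool.h>
#include <stddef.h>

#define CAP_HWA 0x1u	/* hypothetical "supports HWA" capability bit */

/* HWA is usable only if at least one channel exists and every channel
 * advertises support; an empty channel list must not count as "all". */
static bool all_channels_support_hwa(const unsigned int *caps, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (!(caps[i] & CAP_HWA))
			return false;
	}

	return n > 0;
}
```

The final "n > 0" plays the role of the has_channel flag in the patch.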

Signed-off-by: Gavin Shan 
---
 net/ncsi/ncsi-manage.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c
index a3bd5fa..5073e15 100644
--- a/net/ncsi/ncsi-manage.c
+++ b/net/ncsi/ncsi-manage.c
@@ -839,12 +839,15 @@ static bool ncsi_check_hwa(struct ncsi_dev_priv *ndp)
struct ncsi_package *np;
struct ncsi_channel *nc;
unsigned int cap;
+   bool has_channel = false;
 
/* The hardware arbitration is disabled if any one channel
 * doesn't support explicitly.
 */
NCSI_FOR_EACH_PACKAGE(ndp, np) {
NCSI_FOR_EACH_CHANNEL(np, nc) {
+   has_channel = true;
+
cap = nc->caps[NCSI_CAP_GENERIC].cap;
if (!(cap & NCSI_CAP_GENERIC_HWA) ||
(cap & NCSI_CAP_GENERIC_HWA_MASK) !=
@@ -855,8 +858,13 @@ static bool ncsi_check_hwa(struct ncsi_dev_priv *ndp)
}
}
 
-   ndp->flags |= NCSI_DEV_HWA;
-   return true;
+   if (has_channel) {
+   ndp->flags |= NCSI_DEV_HWA;
+   return true;
+   }
+
+   ndp->flags &= ~NCSI_DEV_HWA;
+   return false;
 }
 
 static int ncsi_enable_hwa(struct ncsi_dev_priv *ndp)
-- 
2.7.4

[PATCH net-next 0/8] net/ncsi: Add debugging functionality

This series adds NCSI debugging infrastructure in the form of several
procfs files. It was inspired by reported issues where no package or
channel could be probed successfully. Until now, the NCSI stack has had
no debugging infrastructure.

The first 3 patches fix some issues and aren't relevant to the
subject. I included them because I expect they can be merged before
the code for the debugging infrastructure. PATCH[4,5,6/8] add procfs
directories and files to support the debugging infrastructure for
several purposes: presenting the NCSI topology, collecting statistics on
sent and received NCSI packets, and generating NCSI command packets
manually. PATCH[7,8/8] fix two issues found with the debugging
functionality.

Gavin Shan (8):
  net/ncsi: Disable HWA mode when no channels are found
  net/ncsi: Properly track channel monitor timer state
  net/ncsi: Enforce failover on link monitor timeout
  net/ncsi: Add debugging infrastructure
  net/ncsi: Dump NCSI packet statistics
  net/ncsi: Support NCSI packet generation
  net/ncsi: No error report on DP response to non-existing package
  net/ncsi: Fix length of GVI response packet

 net/ncsi/Kconfig   |   9 +
 net/ncsi/Makefile  |   1 +
 net/ncsi/internal.h|  68 
 net/ncsi/ncsi-aen.c|  15 +-
 net/ncsi/ncsi-cmd.c|  23 +-
 net/ncsi/ncsi-debug.c  | 961 +
 net/ncsi/ncsi-manage.c |  63 +++-
 net/ncsi/ncsi-rsp.c|  37 +-
 8 files changed, 1165 insertions(+), 12 deletions(-)
 create mode 100644 net/ncsi/ncsi-debug.c

-- 
2.7.4

[PATCH net-next 6/8] net/ncsi: Support NCSI packet generation

This introduces /proc/ncsi/eth0/pkt. The procfs entry accepts
parameters to produce an NCSI command packet. The received NCSI
response packet is dumped on read. Below is an example that sends a
CIS command and dumps its response.

   # echo CIS,0,0 > /proc/ncsi/eth0/pkt
   # cat /proc/ncsi/eth0/pkt
   NCSI response [CIS] packet received

   00 01 dd 80 00 0004  
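
For illustration, the parameter part of such a command string (the
"0,0" after the command name) could be parsed along the following
lines. This is a user-space sketch that assumes comma-separated
hexadecimal values; parse_hex_params() is a hypothetical name and not
the function added by this patch.

```c
#include <stdio.h>

/* Parse 'count' comma-separated hexadecimal values from 'buf' into 'out'.
 * Returns 0 on success, -1 if fewer than 'count' values could be read. */
static int parse_hex_params(const char *buf, unsigned int *out, int count)
{
	int i, num;

	for (i = 0; i < count; i++) {
		/* %n stores how many characters this sscanf() call consumed */
		if (sscanf(buf, "%x%n", &out[i], &num) != 1)
			return -1;

		buf += num;
		if (*buf == ',')
			buf++;	/* skip the separator before the next value */
	}

	return 0;
}
```

The important detail is that the buffer advances by the number of
characters consumed, not by the number of parameters requested.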

Signed-off-by: Gavin Shan 
---
 net/ncsi/internal.h|  13 +
 net/ncsi/ncsi-cmd.c|   5 +
 net/ncsi/ncsi-debug.c  | 693 ++---
 net/ncsi/ncsi-manage.c |   3 +
 net/ncsi/ncsi-rsp.c|  14 +-
 5 files changed, 697 insertions(+), 31 deletions(-)

diff --git a/net/ncsi/internal.h b/net/ncsi/internal.h
index 5978500..e91f701 100644
--- a/net/ncsi/internal.h
+++ b/net/ncsi/internal.h
@@ -221,6 +221,9 @@ struct ncsi_request {
bool used;/* Request that has been assigned  */
unsigned int flags;   /* NCSI request property   */
 #define NCSI_REQ_FLAG_EVENT_DRIVEN 1
+#ifdef CONFIG_NET_NCSI_DEBUG
+#define NCSI_REQ_FLAG_DEBUG 2
+#endif
struct ncsi_dev_priv *ndp;/* Associated NCSI device  */
struct sk_buff   *cmd;/* Associated NCSI command packet  */
struct sk_buff   *rsp;/* Associated NCSI response packet */
@@ -285,6 +288,14 @@ struct ncsi_dev_priv {
 #ifdef CONFIG_NET_NCSI_DEBUG
struct {
struct proc_dir_entry *pde;
+   unsigned int  req;
+#define NCSI_PKT_REQ_FREE  0
+#define NCSI_PKT_REQ_BUSY  0x
+   int   errno;
+   struct sk_buff*rsp;
+   } pkt;
+   struct {
+   struct proc_dir_entry *pde;
 #define NCSI_PKT_STAT_OK   0
 #define NCSI_PKT_STAT_TIMEOUT  1
 #define NCSI_PKT_STAT_ERROR  2
@@ -364,6 +375,8 @@ int ncsi_package_init_debug(struct ncsi_package *np);
 void ncsi_package_release_debug(struct ncsi_package *np);
 int ncsi_channel_init_debug(struct ncsi_channel *nc);
 void ncsi_channel_release_debug(struct ncsi_channel *nc);
+void ncsi_dev_reset_pkt_debug(struct ncsi_dev_priv *ndp,
+ struct sk_buff *skb, int errno);
 #else
 static inline int ncsi_dev_init_debug(struct ncsi_dev_priv *ndp)
 {
diff --git a/net/ncsi/ncsi-cmd.c b/net/ncsi/ncsi-cmd.c
index c6b2bc5..3455465 100644
--- a/net/ncsi/ncsi-cmd.c
+++ b/net/ncsi/ncsi-cmd.c
@@ -365,6 +365,11 @@ int ncsi_xmit_cmd(struct ncsi_cmd_arg *nca)
nr->enabled = true;
mod_timer(&nr->timer, jiffies + 1 * HZ);
 
+#ifdef CONFIG_NET_NCSI_DEBUG
+   if (nr->flags & NCSI_REQ_FLAG_DEBUG)
+   nca->ndp->pkt.req = nr->id;
+#endif
+
/* Send NCSI packet */
skb_get(nr->cmd);
ret = dev_queue_xmit(nr->cmd);
diff --git a/net/ncsi/ncsi-debug.c b/net/ncsi/ncsi-debug.c
index 7352981..016e580 100644
--- a/net/ncsi/ncsi-debug.c
+++ b/net/ncsi/ncsi-debug.c
@@ -23,40 +23,668 @@
 #include "ncsi-pkt.h"
 
 static struct proc_dir_entry *ncsi_pde;
+
+static const char *ncsi_pkt_type_name(unsigned int type);
+
+static int ncsi_pkt_input_default(struct ncsi_dev_priv *ndp,
+ struct ncsi_cmd_arg *nca, char *buf)
+{
+   return 0;
+}
+
+static int ncsi_pkt_input_params(char *buf, int *outval, int count)
+{
+   int num, i;
+
+   for (i = 0; i < count; i++, outval++) {
+   if (sscanf(buf, "%x%n", outval, &num) != 1)
+   return -EINVAL;
+
+   if (buf[num] == ',')
+   buf += (num + 1);
+   else
+   buf += num;
+   }
+
+   return 0;
+}
+
+static int ncsi_pkt_input_sp(struct ncsi_dev_priv *ndp,
+struct ncsi_cmd_arg *nca, char *buf)
+{
+   int param, ret;
+
+   /* The hardware arbitration will be configured according
+* to the NCSI's capability if it's not specified.
+*/
+   ret = ncsi_pkt_input_params(buf, &param, 1);
+   if (!ret && param != 0 && param != 1)
+   return -EINVAL;
+   else if (ret)
+   param = (ndp->flags & NCSI_DEV_HWA) ? 1 : 0;
+
+   nca->bytes[0] = param;
+
+   return 0;
+}
+
+static int ncsi_pkt_input_dc(struct ncsi_dev_priv *ndp,
+struct ncsi_cmd_arg *nca, char *buf)
+{
+   int param, ret;
+
+   /* Allow link down will be disallowed if it's not specified */
+   ret = ncsi_pkt_input_params(buf, &param, 1);
+   if (!ret && param != 0 && param != 1)
+   return -EINVAL;
+   else if (ret)
+   param = 0;
+
+   nca->bytes[0] = param;
+
+   return 0;
+}
+
+static int ncsi_pkt_input_ae(struct ncsi_dev_priv *ndp,
+struct ncsi_cmd_arg *nca, char *buf)
+{
+   int param[2], ret;
+
+   /* MC ID and AE mode are mandatory */
+   ret = ncsi_pkt_input_params(buf, param, 2);
+   if (ret)
+   return -EINV

[PATCH net-next 5/8] net/ncsi: Dump NCSI packet statistics

This creates /proc/ncsi//stats to dump the NCSI packets sent
and received over all packages and channels. It's useful to diagnose
NCSI problems, especially when NCSI packages and channels aren't
probed properly.

The statistics can be read from the procfs file as shown below:

 # cat /proc/ncsi/eth0/stats

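 CMD  OK       TIMEOUT  ERROR
 ======================================
 CIS  32       29       0
 SP   10       7        0
 DP   17       14       0
 EC   1        0        0
 ECNT 1        0        0
 AE   1        0        0
 GLS  11       0        0
 SMA  1        0        0
 EBF  1        0        0
 GVI  2        0        0
 GC   2        0        0

 RSP  OK       TIMEOUT  ERROR
 ======================================
 CIS  3        0        0
 SP   3        0        0
 DP   2        0        1
 EC   1        0        0
 ECNT 1        0        0
 AE   1        0        0
 GLS  11       0        0
 SMA  1        0        0
 EBF  1        0        0
 GVI  0        0        2
 GC   2        0        0

 AEN  OK       TIMEOUT  ERROR
 ======================================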
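
The bookkeeping behind such a table is a two-dimensional counter array
indexed by packet type and outcome. A stripped-down user-space sketch
follows; struct pkt_stats and stats_count_cmd() are hypothetical names,
and the bounds check is an addition of the sketch.

```c
enum { STAT_OK, STAT_TIMEOUT, STAT_ERROR, STAT_MAX };

#define NUM_TYPES 128	/* mirrors the cmd[128][...] array in the patch */

struct pkt_stats {
	unsigned long cmd[NUM_TYPES][STAT_MAX];
};

/* Count one command of 'type' with the given outcome; out-of-range
 * arguments are ignored rather than indexing past the array. */
static void stats_count_cmd(struct pkt_stats *s, unsigned int type,
			    int outcome)
{
	if (type < NUM_TYPES && outcome >= 0 && outcome < STAT_MAX)
		s->cmd[type][outcome]++;
}
```

Each row of the table then corresponds to one cmd[type] entry, printed
as its OK, TIMEOUT and ERROR counters.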

Signed-off-by: Gavin Shan 
---
 net/ncsi/internal.h|  10 +++
 net/ncsi/ncsi-aen.c|  15 +++-
 net/ncsi/ncsi-cmd.c|  18 +++-
 net/ncsi/ncsi-debug.c  | 233 +
 net/ncsi/ncsi-manage.c |   8 ++
 net/ncsi/ncsi-rsp.c|  21 -
 6 files changed, 302 insertions(+), 3 deletions(-)

diff --git a/net/ncsi/internal.h b/net/ncsi/internal.h
index 2a08168..5978500 100644
--- a/net/ncsi/internal.h
+++ b/net/ncsi/internal.h
@@ -283,6 +283,16 @@ struct ncsi_dev_priv {
struct packet_type  ptype;   /* NCSI packet Rx handler */
struct list_head    node;    /* Form NCSI device list  */
 #ifdef CONFIG_NET_NCSI_DEBUG
+   struct {
+   struct proc_dir_entry *pde;
+#define NCSI_PKT_STAT_OK   0
+#define NCSI_PKT_STAT_TIMEOUT  1
+#define NCSI_PKT_STAT_ERROR  2
+#define NCSI_PKT_STAT_MAX  3
+   unsigned long cmd[128][NCSI_PKT_STAT_MAX];
+   unsigned long rsp[128][NCSI_PKT_STAT_MAX];
+   unsigned long aen[256][NCSI_PKT_STAT_MAX];
+   } stats;
struct proc_dir_entry*pde;   /* Procfs directory   */
 #endif
 };
diff --git a/net/ncsi/ncsi-aen.c b/net/ncsi/ncsi-aen.c
index 6898e72..adcaa56 100644
--- a/net/ncsi/ncsi-aen.c
+++ b/net/ncsi/ncsi-aen.c
@@ -206,16 +206,29 @@ int ncsi_aen_handler(struct ncsi_dev_priv *ndp, struct 
sk_buff *skb)
}
 
if (!nah) {
+#ifdef CONFIG_NET_NCSI_DEBUG
+   ndp->stats.aen[h->type][NCSI_PKT_STAT_ERROR]++;
+#endif
netdev_warn(ndp->ndev.dev, "Invalid AEN (0x%x) received\n",
h->type);
return -ENOENT;
}
 
ret = ncsi_validate_aen_pkt(h, nah->payload);
-   if (ret)
+   if (ret) {
+#ifdef CONFIG_NET_NCSI_DEBUG
+   ndp->stats.aen[h->type][NCSI_PKT_STAT_ERROR]++;
+#endif
goto out;
+   }
 
ret = nah->handler(ndp, h);
+#ifdef CONFIG_NET_NCSI_DEBUG
+   if (ret)
+   ndp->stats.aen[h->type][NCSI_PKT_STAT_ERROR]++;
+   else
+   ndp->stats.aen[h->type][NCSI_PKT_STAT_OK]++;
+#endif
 out:
consume_skb(skb);
return ret;
diff --git a/net/ncsi/ncsi-cmd.c b/net/ncsi/ncsi-cmd.c
index db7083b..c6b2bc5 100644
--- a/net/ncsi/ncsi-cmd.c
+++ b/net/ncsi/ncsi-cmd.c
@@ -323,6 +323,9 @@ int ncsi_xmit_cmd(struct ncsi_cmd_arg *nca)
}
 
if (!nch) {
+#ifdef CONFIG_NET_NCSI_DEBUG
+   nca->ndp->stats.cmd[nca->type][NCSI_PKT_STAT_ERROR]++;
+#endif
netdev_err(nca->ndp->ndev.dev,
   "Cannot send packet with type 0x%02x\n", nca->type);
return -ENOENT;
@@ -331,13 +334,20 @@ int ncsi_xmit_cmd(struct ncsi_cmd_arg *nca)
/* Get packet payload length and allocate the request */
nca->payload = nch->payload;
nr = ncsi_alloc_command(nca);
-   if (!nr)
+   if (!nr) {
+#ifdef CONFIG_NET_NCSI_DEBUG
+   nca->ndp->stats.cmd[nca->type][NCSI_PKT_STAT_ERROR]++;
+#endif
return -ENOMEM;
+   }
 
/* Prepare the packet */
nca->id = nr->id;
ret = nch->handler(nr->cmd, nca);
if (ret) {
+#ifdef CONFIG_NET_NCSI_DEBUG
+   nca->ndp->stats.cmd[nca->type][NCSI_PKT_STAT_ERROR]++;
+#endif
ncsi_free_request(nr);
return ret;
}
@@ -359,9 +369,15 @@ int ncsi_xmit_cmd(struct ncsi_cmd_arg *nca)
skb_get(nr->cmd);
ret = dev_queue_xmit(nr->cmd);
if (ret < 0) {
+#ifdef CONFIG_NET_NCSI_DEBUG
+   nca->ndp->stats.cmd[nca->type][NCSI_PKT_STAT_ERROR]++;
+#endif

[PATCH net-next 7/8] net/ncsi: No error report on DP response to non-existing package

The issue was found from /proc/ncsi/eth0/stats. The first step
in NCSI package/channel enumeration is to deselect all packages by
sending DP (Deselect Package) commands. The remote NIC replies
with a response while the corresponding package isn't populated
yet, and the response is wrongly treated as an error.

 # cat /proc/ncsi/eth0/stats
 :
 RSP  OK       TIMEOUT  ERROR
 ======================================
 CIS  3        0        0
 SP   3        0        0
 DP   2        0        1

This fixes the issue by ignoring the error in the DP response handler
when the corresponding package doesn't exist. With this applied,
no error is reported for DP response packets.

 # cat /proc/ncsi/eth0/stats
 :
 RSP  OK       TIMEOUT  ERROR
 ======================================
 CIS  3        0        0
 SP   3        0        0
 DP   3        0        0

Signed-off-by: Gavin Shan 
---
 net/ncsi/ncsi-rsp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ncsi/ncsi-rsp.c b/net/ncsi/ncsi-rsp.c
index 93ebe0f..095726f 100644
--- a/net/ncsi/ncsi-rsp.c
+++ b/net/ncsi/ncsi-rsp.c
@@ -118,7 +118,7 @@ static int ncsi_rsp_handler_dp(struct ncsi_request *nr)
ncsi_find_package_and_channel(ndp, rsp->rsp.common.channel,
  &np, NULL);
if (!np)
-   return -ENODEV;
+   return 0;
 
/* Change state of all channels attached to the package */
NCSI_FOR_EACH_CHANNEL(np, nc) {
-- 
2.7.4

[PATCH net-next 8/8] net/ncsi: Fix length of GVI response packet

The length of the GVI (GetVersionInfo) response packet should be 40
instead of 36. This issue was found from /proc/ncsi/eth0/stats.

 # cat /proc/ncsi/eth0/stats
 :
 RSP  OK       TIMEOUT  ERROR
 ======================================
 GVI  0        0        2

With this applied, no error is reported on GVI response packets:

 # cat /proc/ncsi/eth0/stats
 :
 RSP  OK       TIMEOUT  ERROR
 ======================================
 GVI  2        0        0

Signed-off-by: Gavin Shan 
---
 net/ncsi/ncsi-rsp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ncsi/ncsi-rsp.c b/net/ncsi/ncsi-rsp.c
index 095726f..67260f3 100644
--- a/net/ncsi/ncsi-rsp.c
+++ b/net/ncsi/ncsi-rsp.c
@@ -951,7 +951,7 @@ static struct ncsi_rsp_handler {
{ NCSI_PKT_RSP_EGMF,4, ncsi_rsp_handler_egmf},
{ NCSI_PKT_RSP_DGMF,4, ncsi_rsp_handler_dgmf},
{ NCSI_PKT_RSP_SNFC,4, ncsi_rsp_handler_snfc},
-   { NCSI_PKT_RSP_GVI,36, ncsi_rsp_handler_gvi },
+   { NCSI_PKT_RSP_GVI,40, ncsi_rsp_handler_gvi },
{ NCSI_PKT_RSP_GC, 32, ncsi_rsp_handler_gc  },
{ NCSI_PKT_RSP_GP, -1, ncsi_rsp_handler_gp  },
{ NCSI_PKT_RSP_GCPS,  172, ncsi_rsp_handler_gcps},
-- 
2.7.4

Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-12 Thread David Ahern

On 4/12/17 8:08 PM, David Miller wrote:
> From: David Ahern 
> Date: Wed, 12 Apr 2017 13:54:20 -0600
> 
>> packet passed to xdp seems to be messed up
> Ok, the problem is that skb->mac_len isn't set properly at this point.
> That doesn't happen until __netif_receive_skb_core().
> 
> I'm also not setting xdp.data_hard_start properly.
> 
> It should work better with this small diff:

it does.

Re: [PATCH v3 net-next RFC] Generic XDP

From: David Ahern 
Date: Wed, 12 Apr 2017 13:54:20 -0600

> packet passed to xdp seems to be messed up

Ok, the problem is that skb->mac_len isn't set properly at this point.
That doesn't happen until __netif_receive_skb_core().

I'm also not setting xdp.data_hard_start properly.

It should work better with this small diff:

diff --git a/net/core/dev.c b/net/core/dev.c
index 9ed4569..d36ae8f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4289,6 +4289,7 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
u32 act = XDP_DROP;
void *orig_data;
int hlen, off;
+   u32 mac_len;
 
if (skb_linearize(skb))
goto do_drop;
@@ -4296,10 +4297,11 @@ static u32 netif_receive_generic_xdp(struct sk_buff 
*skb,
/* The XDP program wants to see the packet starting at the MAC
 * header.
 */
-   hlen = skb_headlen(skb) + skb->mac_len;
-   xdp.data = skb->data - skb->mac_len;
+   mac_len = skb->data - skb_mac_header(skb);
+   hlen = skb_headlen(skb) + mac_len;
+   xdp.data = skb->data - mac_len;
xdp.data_end = xdp.data + hlen;
-   xdp.data_hard_start = xdp.data - skb_headroom(skb);
+   xdp.data_hard_start = skb->data - skb_headroom(skb);
orig_data = xdp.data;
 
act = bpf_prog_run_xdp(xdp_prog, &xdp);
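
The pointer arithmetic in this diff can be checked with a small
user-space model. struct fake_skb and struct fake_xdp below are
hypothetical stand-ins that only carry the fields involved; they are
not the kernel structures, and headroom is modelled as data - head.

```c
#include <stddef.h>

/* Minimal stand-in for the sk_buff fields involved; not the kernel struct. */
struct fake_skb {
	unsigned char *head;		/* start of the allocated buffer */
	unsigned char *mac_header;	/* start of the MAC (Ethernet) header */
	unsigned char *data;		/* data pointer, past the MAC header */
	size_t headlen;			/* linear bytes available from 'data' */
};

struct fake_xdp {
	unsigned char *data_hard_start;
	unsigned char *data;
	unsigned char *data_end;
};

static void fake_xdp_init(struct fake_xdp *xdp, const struct fake_skb *skb)
{
	/* Distance from the MAC header to 'data', computed from pointers
	 * instead of relying on a cached mac_len field. */
	size_t mac_len = (size_t)(skb->data - skb->mac_header);
	size_t headroom = (size_t)(skb->data - skb->head);

	xdp->data = skb->data - mac_len;		/* == skb->mac_header */
	xdp->data_end = xdp->data + skb->headlen + mac_len;
	xdp->data_hard_start = skb->data - headroom;	/* == skb->head */
}
```

In this model data_hard_start always lands on the start of the buffer,
whereas subtracting the headroom from xdp.data would point mac_len
bytes before it.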

Re: [PATCH net-next v2] net: allow configuring default qdisc

From: Stephen Hemminger 
Date: Wed, 12 Apr 2017 15:59:25 -0700

> The most recent example was in the block layer.

Linus complained about this, loudly.

Re: [PATCH v3 net-next RFC] Generic XDP

From: Stephen Hemminger 
Date: Wed, 12 Apr 2017 14:30:37 -0700

> On Wed, 12 Apr 2017 14:54:15 -0400 (EDT)
> David Miller  wrote:
> 
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index b0aa089..071a58b 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -1891,9 +1891,17 @@ struct net_device {
>>  struct lock_class_key   *qdisc_tx_busylock;
>>  struct lock_class_key   *qdisc_running_key;
>>  bool                    proto_down;
>> +struct bpf_prog __rcu   *xdp_prog;
> 
> It would be good if all devices could reuse this for the xdp_prog pointer.
> It would allow it to be used by introspection utility functions in the
> future.

We plan to do so.

Re: [PATCH v3 net-next RFC] Generic XDP

From: Eric Dumazet 
Date: Wed, 12 Apr 2017 14:49:53 -0700

> On Wed, 2017-04-12 at 14:30 -0700, Stephen Hemminger wrote:
>> On Wed, 12 Apr 2017 14:54:15 -0400 (EDT)
>> David Miller  wrote:
>> 
>> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> > index b0aa089..071a58b 100644
>> > --- a/include/linux/netdevice.h
>> > +++ b/include/linux/netdevice.h
>> > @@ -1891,9 +1891,17 @@ struct net_device {
>> >struct lock_class_key   *qdisc_tx_busylock;
>> >struct lock_class_key   *qdisc_running_key;
>> >bool                    proto_down;
>> > +  struct bpf_prog __rcu   *xdp_prog;
>> 
>> It would be good if all devices could reuse this for the xdp_prog pointer.
>> It would allow it to be used by introspection utility functions in
>> future.
> 
> Problem is that some xdp usages were envisioning a per RX queue xdp
> program.

True, but that hasn't materialized yet so designing for it so soon
doesn't make a lot of sense.

[PATCH iproute2] ip vrf: Add command name next to pid

2017-04-12 Thread David Ahern

'ip vrf pids' is used to list processes bound to a vrf, but it only
shows the pid, leaving a lot of work for the user. Add the command
name to the output. With this patch you get the more user-friendly:

$ ip vrf pids mgmt
 1121  ntpd
 1418  gdm-session-wor
 1488  gnome-session
 1491  dbus-launch
 1492  dbus-daemon
 1565  sshd
 ...

Signed-off-by: David Ahern 
---
 include/utils.h |  1 +
 ip/ipvrf.c  | 24 ++--
 lib/fs.c| 42 ++
 3 files changed, 57 insertions(+), 10 deletions(-)

diff --git a/include/utils.h b/include/utils.h
index 22369e0b4e03..d66784674633 100644
--- a/include/utils.h
+++ b/include/utils.h
@@ -260,5 +260,6 @@ int get_real_family(int rtm_type, int rtm_family);
 int cmd_exec(const char *cmd, char **argv, bool do_fork);
 int make_path(const char *path, mode_t mode);
 char *find_cgroup2_mount(void);
+int get_comm(char *pid, char *comm, int len);
 
 #endif /* __UTILS_H__ */
diff --git a/ip/ipvrf.c b/ip/ipvrf.c
index 5e204a9ebbb1..6d4bd6206cc8 100644
--- a/ip/ipvrf.c
+++ b/ip/ipvrf.c
@@ -111,27 +111,31 @@ static void read_cgroup_pids(const char *base_path, char *name)
 {
char path[PATH_MAX];
char buf[4096];
-   ssize_t n;
-   int fd;
+   FILE *fp;
 
if (snprintf(path, sizeof(path), "%s/vrf/%s%s",
 base_path, name, CGRP_PROC_FILE) >= sizeof(path))
return;
 
-   fd = open(path, O_RDONLY);
-   if (fd < 0)
+   fp = fopen(path, "r");
+   if (!fp)
return; /* no cgroup file, nothing to show */
 
/* dump contents (pids) of cgroup.procs */
-   while (1) {
-   n = read(fd, buf, sizeof(buf) - 1);
-   if (n <= 0)
-   break;
+   while (fgets(buf, sizeof(buf), fp)) {
+   char *nl, comm[32];
 
-   printf("%s", buf);
+   nl = strchr(buf, '\n');
+   if (nl)
+   *nl = '\0';
+
+   if (get_comm(buf, comm, sizeof(comm)))
+   strcpy(comm, "");
+
+   printf("%5s  %s\n", buf, comm);
}
 
-   close(fd);
+   fclose(fp);
 }
 
 /* recurse path looking for PATH[/NETNS]/vrf/NAME */
diff --git a/lib/fs.c b/lib/fs.c
index 12a4657a0bc9..81298f0ad91a 100644
--- a/lib/fs.c
+++ b/lib/fs.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -149,3 +150,44 @@ int make_path(const char *path, mode_t mode)
 
return rc;
 }
+
+int get_comm(char *pid, char *comm, int len)
+{
+   char path[PATH_MAX];
+   char line[128];
+   FILE *fp;
+
+   if (snprintf(path, sizeof(path),
+"/proc/%s/status", pid) >= sizeof(path)) {
+   return -1;
+   }
+
+   fp = fopen(path, "r");
+   if (!fp)
+   return -1;
+
+   comm[0] = '\0';
+   while (fgets(line, sizeof(line), fp)) {
+   char *nl, *name;
+
+   name = strstr(line, "Name:");
+   if (!name)
+   continue;
+
+   name += 5;
+   while (isspace(*name))
+   name++;
+
+   nl = strchr(name, '\n');
+   if (nl)
+   *nl = '\0';
+
+   strncpy(comm, name, len - 1);
+   comm[len - 1] = '\0';
+   break;
+   }
+
+   fclose(fp);
+
+   return 0;
+}
-- 
2.1.4
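
For reference, the "Name:" parsing done by get_comm() can be exercised on
its own. The sketch below is not part of the patch: parse_name_line() is a
hypothetical helper that works on a single in-memory line, and it matches
"Name:" as a line prefix with strncmp() rather than strstr(), which is an
assumption about the intended behavior.

```c
#include <ctype.h>
#include <string.h>

/*
 * Copy the command name out of one "Name:\t<comm>\n" line.
 * Returns 0 on success, -1 if the line is not a Name: line.
 * The result is truncated to len - 1 bytes and always terminated.
 */
static int parse_name_line(const char *line, char *comm, size_t len)
{
	const char *name;
	size_t n;

	if (len == 0 || strncmp(line, "Name:", 5) != 0)
		return -1;

	/* skip the whitespace between the tag and the value */
	name = line + 5;
	while (isspace((unsigned char)*name))
		name++;

	/* copy up to the newline, bounded by the output buffer */
	n = strcspn(name, "\n");
	if (n > len - 1)
		n = len - 1;
	memcpy(comm, name, n);
	comm[n] = '\0';
	return 0;
}
```

Casting to unsigned char before isspace() avoids undefined behavior for
bytes with the high bit set.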

[PATCH net-next] l2tp: device MTU setup, tunnel socket needs a lock

2017-04-12 Thread R. Parameswaran


The MTU overhead calculation in L2TP device set-up
merged via commit b784e7ebfce8cfb16c6f95e14e8532d0768ab7ff
needs to be adjusted to lock the tunnel socket while
referencing the sub-data structures to derive the
socket's IP overhead.

Reported-by: Guillaume Nault 
Tested-by: Guillaume Nault 
Signed-off-by: R. Parameswaran 
---
 include/linux/net.h | 2 +-
 net/l2tp/l2tp_eth.c | 2 ++
 net/socket.c| 2 +-
 3 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index a42fab2..abcfa46 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -298,7 +298,7 @@ int kernel_sendpage(struct socket *sock, struct page *page, int offset,
 int kernel_sock_ioctl(struct socket *sock, int cmd, unsigned long arg);
 int kernel_sock_shutdown(struct socket *sock, enum sock_shutdown_cmd how);
 
-/* Following routine returns the IP overhead imposed by a socket.  */
+/* Routine returns the IP overhead imposed by a (caller-protected) socket. */
 u32 kernel_sock_ip_overhead(struct sock *sk);
 
 #define MODULE_ALIAS_NETPROTO(proto) \
diff --git a/net/l2tp/l2tp_eth.c b/net/l2tp/l2tp_eth.c
index 138566a..b722d55 100644
--- a/net/l2tp/l2tp_eth.c
+++ b/net/l2tp/l2tp_eth.c
@@ -225,7 +225,9 @@ static void l2tp_eth_adjust_mtu(struct l2tp_tunnel *tunnel,
dev->needed_headroom += session->hdr_len;
return;
}
+   lock_sock(tunnel->sock);
l3_overhead = kernel_sock_ip_overhead(tunnel->sock);
+   release_sock(tunnel->sock);
if (l3_overhead == 0) {
/* L3 Overhead couldn't be identified, this could be
 * because tunnel->sock was NULL or the socket's
diff --git a/net/socket.c b/net/socket.c
index eea9970..c2564eb 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -3360,7 +3360,7 @@ EXPORT_SYMBOL(kernel_sock_shutdown);
 /* This routine returns the IP overhead imposed by a socket i.e.
  * the length of the underlying IP header, depending on whether
  * this is an IPv4 or IPv6 socket and the length from IP options turned
- * on at the socket.
+ * on at the socket. Assumes that the caller has a lock on the socket.
  */
 u32 kernel_sock_ip_overhead(struct sock *sk)
 {
-- 
2.1.4

Re: [RFC PATCH linux 0/2] net sched actions: access to uninitialized data and error handling

On Wed, Apr 12, 2017 at 7:21 AM, Wolfgang Bumiller
 wrote:
> Commit 1045ba77a ("net sched actions: Add support for user cookies")
> added code to net/sched/act_api.c's tcf_action_init_1 using the `tb`
> nlattr array unconditionally, while it was otherwise used as well as
> initialized only when `name == NULL`:
>
> if (name == NULL) {
> err = nla_parse_nested(tb, TCA_ACT_MAX, nla, NULL);
>
> In the other case `nla` is instead passed over to ->init to be parsed
> there (using a different set of TCA_ enum values, iow. TCA_ACT_COOKIE
> then "clashes" with some other value). This led to the following three
> example commands resulting in errors (sometimes followed by more traces
> and hangups some time later (although the hangups happened seconds or
> sometimes minutes later, sometimes not at all - results differed between
> different kernel versions (linux git-master vs ubuntu's mainline 4.11
> rc6 vs. pve 4.10.5 (based off ubuntu's zesty kernel where the commit is
> cherry-picked)...))):


Makes sense.

>
>  # ip link add ve0 type veth peer name ve0b
>  # tc qdisc add dev ve0 handle : ingress
>  # tc filter add dev ve0 parent : prio 50 basic police rate 1000bps burst 1000b drop
>
> The 3rd command would sometimes succeed, sometimes error with:
>
>  RTNETLINK answers: Invalid argument
>  We have an error talking to the kernel
>
> and sometimes error with:
>
>  RTNETLINK answers: Cannot allocate memory
>  We have an error talking to the kernel
>
> In the latter case I assume `cklen` became negative, which passes the
> TC_COOKIE_MAX_SIZE check since it is signed but becomes unsigned later
> in kmemdup() (see the crash dump below)


Yeah because tb[] contains some random pointers when not initialized.
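
A userspace sketch of the assumption quoted above, not the act_api.c code:
COOKIE_MAX_SIZE and both helpers are made-up names. It only shows that a
negative signed length passes a "greater than max" check and then turns
into a huge value once it is treated as a size_t.

```c
#include <stddef.h>

#define COOKIE_MAX_SIZE 16	/* hypothetical limit, stands in for TC_COOKIE_MAX_SIZE */

/* Signed upper-bound check only: a negative length is not > max, so it passes. */
static int length_check_passes(int cklen)
{
	return !(cklen > COOKIE_MAX_SIZE);
}

/* What an allocator taking a size_t argument would see for that length. */
static size_t as_alloc_size(int cklen)
{
	return (size_t)cklen;
}
```

With cklen == -1 the check passes and the allocation size becomes
SIZE_MAX, which would fit the "Cannot allocate memory" error.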

>
> When the `tc filter add` command fails a backtrace shows up in dmesg,
> added below.
>
> I'm not sure why the TC_ACT_COOKIE code was added to tcf_action_init_1
> where it is now. It makes me think that it's supposed to be available
> universally, but the `name == NULL` check for how nla is used or passed
> to ->init() shows that there are various different TC_ACT_* enums in
> use at this point, hence the 'RFC' part of the patches, I'm not that
> familiar with the code yet.
>

According to commit 1045ba77a5962a22bce67, it is generic,
but if we need it for act_police too, we should add it to TCA_POLICE*.

Thanks.

Re: [PATCH 1/8] ftgmac100: Add ethtool n-way reset call

On Wed, 2017-04-12 at 17:00 -0700, Florian Fainelli wrote:
> > -static int ftgmac100_nway_reset(struct net_device *ndev)
> > +static int ftgmac100_nway_reset(struct net_device *netdev)
> >   {
> > - if (!ndev->phydev)
> > + if (!netdev->phydev)
> >    return -ENXIO;
> > - return phy_start_aneg(ndev->phydev);
> > + return phy_start_aneg(netdev->phydev);
> 
> Can you use phy_ethtool_nway_reset() which does that (and also checks
> if phydev->drv is NULL which would be the case after an unbind).

Ah sure, I didn't notice that one, grepped the wrong driver :-)

I'll respin later today.

Thanks !

Cheers,
Ben.

Re: net/key: slab-out-of-bounds in parse_ipsecrequests

On Wed, Apr 12, 2017 at 8:02 AM, Andrey Konovalov  wrote:
> Hi,
>
> I've got the following error report while fuzzing the kernel with syzkaller.
>
> On commit 39da7c509acff13fc8cb12ec1bb20337c988ed36 (4.11-rc6).
>
> A reproducer and .config are attached.
>
> When subtracting rq->sadb_x_ipsecrequest_len from len it can become
> negative and the while loop condition remains true.

Good catch! Seems the fix is pretty straightforward:

diff --git a/net/key/af_key.c b/net/key/af_key.c
index c6252ed..cbce595 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -1945,7 +1945,7 @@ parse_ipsecrequests(struct xfrm_policy *xp, struct sadb_x_policy *pol)
if (pol->sadb_x_policy_len * 8 < sizeof(struct sadb_x_policy))
return -EINVAL;

-   while (len >= sizeof(struct sadb_x_ipsecrequest)) {
+   while (len >= (int)sizeof(struct sadb_x_ipsecrequest)) {
if ((err = parse_ipsecrequest(xp, rq)) < 0)
return err;
len -= rq->sadb_x_ipsecrequest_len;

But pol->sadb_x_policy_len and rq->sadb_x_ipsecrequest_len
are controllable by the user (fortunately only root), so I suspect there
may be other problems I am missing too.
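
To make the effect of the cast concrete, here is a minimal userspace
sketch, not the kernel code. struct req and both helper names are made up
for illustration; only the comparison pattern comes from the diff above.
Comparing an int against sizeof(...) converts the int to size_t, so a
negative len looks like a huge positive value.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sadb_x_ipsecrequest; only its size (8) matters. */
struct req {
	unsigned short len;
	unsigned short proto;
	unsigned int reqid;
};

/* Original form: len is converted to size_t, so a negative len compares as huge. */
static bool keep_going_unsigned(int len)
{
	return len >= sizeof(struct req);
}

/* Patched form: both operands are int, so a negative len ends the loop. */
static bool keep_going_signed(int len)
{
	return len >= (int)sizeof(struct req);
}
```

The cast only stops the loop; it does not validate
rq->sadb_x_ipsecrequest_len itself.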

Re: [PATCH 1/8] ftgmac100: Add ethtool n-way reset call

2017-04-12 Thread Florian Fainelli

On 04/12/2017 03:44 PM, Benjamin Herrenschmidt wrote:
> Signed-off-by: Benjamin Herrenschmidt 
> ---
>  drivers/net/ethernet/faraday/ftgmac100.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/faraday/ftgmac100.c b/drivers/net/ethernet/faraday/ftgmac100.c
> index 796b37e..bbeb8e7 100644
> --- a/drivers/net/ethernet/faraday/ftgmac100.c
> +++ b/drivers/net/ethernet/faraday/ftgmac100.c
> @@ -1043,11 +1043,11 @@ static void ftgmac100_get_drvinfo(struct net_device *netdev,
>   strlcpy(info->bus_info, dev_name(&netdev->dev), sizeof(info->bus_info));
>  }
>  
> -static int ftgmac100_nway_reset(struct net_device *ndev)
> +static int ftgmac100_nway_reset(struct net_device *netdev)
>  {
> - if (!ndev->phydev)
> + if (!netdev->phydev)
>   return -ENXIO;
> - return phy_start_aneg(ndev->phydev);
> + return phy_start_aneg(netdev->phydev);

Can you use phy_ethtool_nway_reset() which does that (and also checks if
phydev->drv is NULL which would be the case after an unbind).

>  }
>  
>  static void ftgmac100_get_ringparam(struct net_device *netdev,
> @@ -1088,6 +1088,7 @@ static const struct ethtool_ops ftgmac100_ethtool_ops = {
>   .get_link   = ethtool_op_get_link,
>   .get_link_ksettings = phy_ethtool_get_link_ksettings,
>   .set_link_ksettings = phy_ethtool_set_link_ksettings,
> + .nway_reset = ftgmac100_nway_reset,
>   .get_ringparam  = ftgmac100_get_ringparam,
>   .set_ringparam  = ftgmac100_set_ringparam,
>  };
> 


-- 
Florian

Re: [RFC net-next] of: mdio: Honor hints from MDIO bus drivers

2017-04-12 Thread Florian Fainelli

On 04/12/2017 03:10 PM, Andrew Lunn wrote:
> To give some more background and rationale for this change.
>
> On a platform where we have a parent MDIO bus, backed by the
> mdio-bcm-unimac.c driver, we also register a slave MII bus (through
> net/dsa/dsa2.c) which is parented to this UniMAC MDIO bus through an
> assignment of of_node. This slave MII bus is created in order to
> intercept reads/writes to problematic addresses (e.g: that clashes with
> another piece of hardware).
>
> This means that the slave DSA MII bus inherits all child nodes from the
> originating master MII bus. This also means that when the slave MII bus
> is probed via of_mdiobus_register(), we probe the same devices twice:
> once through the master, another time through the slave.

 Ah, O.K. This makes more sense. On the hardware i have, we get three
 deep in MDIO busses. We have the FEC mdio bus. On top of that we have
 a gpio-mux-mdio, and on top of that we have the mv88e6xxx mdio
 bus. And i've never seen issues.

 So your real problem here is you have two mdio busses using the same
 device tree properties. I would actually say that is just plain
 broken.
>>>
>>> From a Device Tree/HW representation perspective, we do have the
>>> external BCM53125 switch physically attached to the 7445/7278
>>> SWITCH_MDIO bus (backed by mdio-bcm-unimac) so in that regard the
>>> representation is correct. There is also an integrated Gigabit PHY
>>> (bcm7xxx) which is attached to that bus.
> 
> This is made harder by you talking about a board which does not appear
> to have its DT file in mainline. So i'm having to guess what it looks
> like.

The DT binding is in tree and provides an example of what the switch
looks like. Below is the example, but I am also adding the MDIO bus and
the PHYs just so you can see how things wind up:

switch_top@f0b0 {
compatible = "simple-bus";
#size-cells = <1>;
#address-cells = <1>;
ranges = <0 0xf0b0 0x40804>;

ethernet_switch@0 {
compatible = "brcm,bcm7445-switch-v4.0";
#size-cells = <0>;
#address-cells = <1>;
reg = <0x0 0x4
0x4 0x110
0x40340 0x30
0x40380 0x30
0x40400 0x34
0x40600 0x208>;
reg-names = "core", "reg", "intrl2_0", "intrl2_1",
"fcb", "acb";
interrupts = <0 0x18 0
0 0x19 0>;
brcm,num-gphy = <1>;
brcm,num-rgmii-ports = <2>;
brcm,fcb-pause-override;
brcm,acb-packets-inflight;

ports {
#address-cells = <1>;
#size-cells = <0>;

port@0 {
label = "gphy";
reg = <0>;
phy-handle = <&phy5>;
};

sw0port1: port@1 {
label = "rgmii_1";
reg = <1>;
phy-mode = "rgmii";
fixed-link {
speed = <1000>;
full-duplex;
};
};
};
};

mdio@403c0 {
reg = <0x403c0 0x8 0x40300 0x18>;
#address-cells = <0x1>;
#size-cells = <0x0>;
compatible = "brcm,unimac-mdio";
reg-names = "mdio", "mdio_indir_rw";

switch: switch@0 {
broken-turn-around;
reg = <0x0>;
compatible = "brcm,bcm53125";
#address-cells = <1>;
#size-cells = <0>;

ports {
..
port@8 {
ethernet = <&sw0port1>;
};
...
};
};

phy5: ethernet-phy@5 {
reg = <0x5>;
compatible = "ethernet-phy-ieee802.3-c22";
};
};
};


> 
> So what i think we are talking about is this bit of code:
> 
> static int bcm_sf2_mdio_register(struct dsa_switch *ds)
> {
> struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
> struct device_node *dn;
> static int index;
> int err;
> 
> /* Find our integrated MDIO bus node */
> dn = of_find_compatible_node(NULL, NULL, "brcm,unimac-mdio");
> priv->master_mii_bus = of_mdio_find_bus(dn);
> if (!pr

Re: [patch iproute2/net-next repost] devlink: Add support for pipeline debug (dpipe)

2017-04-12 Thread Stephen Hemminger

On Tue, 28 Mar 2017 17:26:54 +0200
Jiri Pirko  wrote:

>  
>  #define pr_err(args...) fprintf(stderr, ##args)
> -#define pr_out(args...) fprintf(stdout, ##args)
> +#define pr_out(args...)  \
> + do {\
> + if (g_indent_newline) { \
> + fprintf(stdout, "%s", g_indent_str);\
> + g_indent_newline = false;   \
> + }   \
> + fprintf(stdout, ##args);\
> + } while (0)
> +
>  #define pr_out_sp(num, args...)  \
>   do {\
>   int ret = fprintf(stdout, ##args);  \
> @@ -42,6 +50,35 @@
>   fprintf(stdout, "%*s", num - ret, "");  \
>   } while (0)
>  
> +static int g_indent_level;
> +static bool g_indent_newline;
> +#define INDENT_STR_STEP 2
> +#define INDENT_STR_MAXLEN 32
> +static char g_indent_str[INDENT_STR_MAXLEN + 1] = "";
> +
> +static void __pr_out_indent_inc(void)
> +{
> + if (g_indent_level + INDENT_STR_STEP > INDENT_STR_MAXLEN)
> + return;
> + g_indent_level += INDENT_STR_STEP;
> + memset(g_indent_str, ' ', sizeof(g_indent_str));
> + g_indent_str[g_indent_level] = '\0';
> +}
> +
> +static void __pr_out_indent_dec(void)
> +{
> + if (g_indent_level - INDENT_STR_STEP < 0)
> + return;
> + g_indent_level -= INDENT_STR_STEP;
> + g_indent_str[g_indent_level] = '\0';
> +}
> +
> +static void __pr_out_newline(void)
> +{
> + pr_out("\n");
> + g_indent_newline = true;
> +}
> +

Thanks for adding the support. As with many reviews, I am fine with the
functionality, but the details of the implementation are of
concern.

Why this new set of output formatting routines? It doesn't resemble other code
in the ip utilities. Looks more like you copied and pasted it from somewhere else.
The indentation in existing ip command output isn't pretty or fancy, but it
works.

Please try to make this code look like all the other code. Yes, I know it
may not be to your taste (it isn't mine either), but consistency is more
important than individual style.

> @@ -314,6 +356,102 @@ static int attr_cb(const struct nlattr *attr, void *data)
>   if (type == DEVLINK_ATTR_ESWITCH_INLINE_MODE &&
>   mnl_attr_validate(attr, MNL_TYPE_U8) < 0)
>   return MNL_CB_ERROR;
> + if (type == DEVLINK_ATTR_DPIPE_TABLES &&
> + mnl_attr_validate(attr, MNL_TYPE_NESTED) < 0)
> + return MNL_CB_ERROR;
> + if (type == DEVLINK_ATTR_DPIPE_TABLE &&
> + mnl_attr_validate(attr, MNL_TYPE_NESTED) < 0)
> + return MNL_CB_ERROR;
> + if (type == DEVLINK_ATTR_DPIPE_TABLE_NAME &&
> + mnl_attr_validate(attr, MNL_TYPE_STRING) < 0)
> + return MNL_CB_ERROR;
> + if (type == DEVLINK_ATTR_DPIPE_TABLE_SIZE &&
> + mnl_attr_validate(attr, MNL_TYPE_U64) < 0)
> + return MNL_CB_ERROR;
> + if (type == DEVLINK_ATTR_DPIPE_TABLE_MATCHES &&
> + mnl_attr_validate(attr, MNL_TYPE_NESTED) < 0)
> + return MNL_CB_ERROR;
> + if (type == DEVLINK_ATTR_DPIPE_TABLE_ACTIONS &&
> + mnl_attr_validate(attr, MNL_TYPE_NESTED) < 0)
> + return MNL_CB_ERROR;
> + if (type == DEVLINK_ATTR_DPIPE_TABLE_COUNTERS_ENABLED &&
> + mnl_attr_validate(attr, MNL_TYPE_U8) < 0)
> + return MNL_CB_ERROR;
> + if (type == DEVLINK_ATTR_DPIPE_ENTRIES &&
> + mnl_attr_validate(attr, MNL_TYPE_NESTED) < 0)
> + return MNL_CB_ERROR;
> + if (type == DEVLINK_ATTR_DPIPE_ENTRY &&
> + mnl_attr_validate(attr, MNL_TYPE_NESTED) < 0)
> + return MNL_CB_ERROR;
> + if (type == DEVLINK_ATTR_DPIPE_ENTRY_INDEX &&
> + mnl_attr_validate(attr, MNL_TYPE_U64) < 0)
> + return MNL_CB_ERROR;
> + if (type == DEVLINK_ATTR_DPIPE_ENTRY_MATCH_VALUES &&
> + mnl_attr_validate(attr, MNL_TYPE_NESTED) < 0)
> + return MNL_CB_ERROR;
> + if (type == DEVLINK_ATTR_DPIPE_ENTRY_ACTION_VALUES &&
> + mnl_attr_validate(attr, MNL_TYPE_NESTED) < 0)
> + return MNL_CB_ERROR;
> + if (type == DEVLINK_ATTR_DPIPE_ENTRY_COUNTER &&
> + mnl_attr_validate(attr, MNL_TYPE_U64) < 0)
> + return MNL_CB_ERROR;
> + if (type == DEVLINK_ATTR_DPIPE_MATCH &&
> + mnl_attr_validate(attr, MNL_TYPE_NESTED) < 0)
> + return MNL_CB_ERROR;
> + if (type == DEVLINK_ATTR_DPIPE_MATCH_VALUE &&
> + mnl_attr_validate(attr, MNL_TYPE_NESTED) < 0)
> + return MNL_CB_ERROR;
> + if (type == DEVLINK_ATTR_DPIPE_MATCH_TYPE &&
> + mnl_attr_validate(attr, MNL_TYPE_U32) < 0)
> + return MNL_CB_ERROR;
> + if (type == DEVLINK_ATTR_DPIPE_ACTION &&
> +

Re: net/ipv4: use-after-free in ipv4_datagram_support_cmsg

2017-04-12 Thread Willem de Bruijn

On Wed, Apr 12, 2017 at 6:25 PM, Willem de Bruijn
 wrote:
> On Wed, Apr 12, 2017 at 4:47 PM, Eric Dumazet  wrote:
>> On Wed, 2017-04-12 at 13:07 -0700, Cong Wang wrote:
>>> On Wed, Apr 12, 2017 at 8:39 AM, Willem de Bruijn
>>>  wrote:
>>> > ===
>>> >> BUG: KASAN: use-after-free in ipv4_datagram_support_cmsg
>>> >> net/ipv4/ip_sockglue.c:500 [inline] at addr 880059be0128
>>> >
>>> > Thanks for the report. This is accessing skb->dev from within recvmsg() 
>>> > at line
>>> >
>>> > info->ipi_ifindex = skb->dev->ifindex;
>>> >
>>> > Introduced in 829ae9d61165 ("net-timestamp: allow reading recv cmsg on
>>> > errqueue with origin tstamp"). At this time the device may indeed have
>>> > gone away. I'm having a look at a way to read this in the receive BH
>>> > and store the ifindex.
>>>
>>> Why not use skb_iif?
>
> This code is called from the error path for transmit timestamps.
>
> We can make use of the fact that SKB_EXT_ERR used on enqueue has iif as
> the first field in its control block. This also holds for the PKTINFO_SKB_CB
> struct to which skb->cb is cast on dequeue when it copies pktinfo to 
> userspace.
> So if set on enqueue in __skb_complete_tx_timestamp, no conversion operation
> is even needed on dequeue, let alone the currently buggy line that touches
> skb->dev.
>
> This iif cast was added for this purpose in the receive path in 0b922b7a829c
> ("net: original ingress device index in PKTINFO").
>
> The device pointer is valid on enqueue for all paths called from device 
> drivers,
> as well as from dev_queue_xmit for SCM_TSTAMP_SCHED generation in
> __dev_queue_xmit. The exception is SCM_TSTAMP_ACK generation, but
> there skb->dev is NULL.
>
> The v6 path does need a conversion, but already does this in
> ip6_datagram_recv_common_ctl. There, too, we can remove the buggy
> logic to set it from skb->dev->ifindex in ip6_datagram_support_cmsg.
>
> I will send a patch.

Sent http://patchwork.ozlabs.org/patch/750197
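
The idea above (store the ifindex on enqueue, read it back through a
different struct on dequeue) can be sketched in userspace. Everything
below is a simplified, hypothetical stand-in: enqueue_view plays the role
of the error-queue control block, dequeue_view the role of the pktinfo
control block, and packet.cb the role of skb->cb. memcpy() is used instead
of the kernel's pointer casts to keep the sketch free of alignment and
aliasing concerns.

```c
#include <string.h>

/* Hypothetical view used when the timestamp is queued. */
struct enqueue_view {
	int iif;		/* first field: interface index */
	int err_info;
};

/* Hypothetical view used when the cmsg is built on dequeue. */
struct dequeue_view {
	int ifindex;		/* first field: interface index */
	unsigned int spec_dst;
};

struct packet {
	char cb[48];		/* scratch area shared by both views */
};

/* Enqueue side: the device is still valid here, so record its index (0 = none). */
static void save_on_enqueue(struct packet *p, int ifindex)
{
	struct enqueue_view v = { .iif = ifindex, .err_info = 0 };

	memcpy(p->cb, &v, sizeof(v));
}

/* Dequeue side: no device pointer is touched, only the saved index is read. */
static int read_on_dequeue(const struct packet *p)
{
	struct dequeue_view v;

	memcpy(&v, p->cb, sizeof(v));
	return v.ifindex;
}

/* A zero index means "no device", so no pktinfo should be reported. */
static int should_report_pktinfo(const struct packet *p)
{
	return read_on_dequeue(p) != 0;
}
```

This works only because both views keep the index as their first member
with the same type.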

[PATCH net] net-timestamp: avoid use-after-free in ip_recv_error

2017-04-12 Thread Willem de Bruijn

From: Willem de Bruijn 

Syzkaller reported a use-after-free in ip_recv_error at line

info->ipi_ifindex = skb->dev->ifindex;

This function is called on dequeue from the error queue, at which
point the device pointer may no longer be valid.

Save ifindex on enqueue in __skb_complete_tx_timestamp, when the
pointer is valid or NULL. Store it in temporary storage skb->cb.

It is safe to reference skb->dev here, as called from device drivers
or dev_queue_xmit. The exception is when called from tcp_ack_tstamp;
in that case it is NULL and ifindex is set to 0 (invalid).

Do not return a pktinfo cmsg if ifindex is 0. This maintains the
current behavior of not returning a cmsg if skb->dev was NULL.

On dequeue, the ipv4 path will cast from sock_exterr_skb to
in_pktinfo. Both have ifindex as their first element, so no explicit
conversion is needed. This is by design, introduced in commit
0b922b7a829c ("net: original ingress device index in PKTINFO"). For
ipv6 ip6_datagram_support_cmsg converts to in6_pktinfo.

Fixes: 829ae9d61165 ("net-timestamp: allow reading recv cmsg on errqueue with origin tstamp")

Reported-by: Andrey Konovalov 
Signed-off-by: Willem de Bruijn 
---
 net/core/skbuff.c  |  1 +
 net/ipv4/ip_sockglue.c |  9 -
 net/ipv6/datagram.c| 10 +-
 3 files changed, 6 insertions(+), 14 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 9f781092fda9..35c1e2460206 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3807,6 +3807,7 @@ static void __skb_complete_tx_timestamp(struct sk_buff *skb,
serr->ee.ee_origin = SO_EE_ORIGIN_TIMESTAMPING;
serr->ee.ee_info = tstype;
serr->opt_stats = opt_stats;
+   serr->header.h4.iif = skb->dev ? skb->dev->ifindex : 0;
if (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID) {
serr->ee.ee_data = skb_shinfo(skb)->tskey;
if (sk->sk_protocol == IPPROTO_TCP &&
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index ebd953bc5607..35076792caa5 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -488,16 +488,15 @@ static bool ipv4_datagram_support_cmsg(const struct sock *sk,
return false;
 
/* Support IP_PKTINFO on tstamp packets if requested, to correlate
-* timestamp with egress dev. Not possible for packets without dev
+* timestamp with egress dev. Not possible for packets without iif
 * or without payload (SOF_TIMESTAMPING_OPT_TSONLY).
 */
-   if ((!(sk->sk_tsflags & SOF_TIMESTAMPING_OPT_CMSG)) ||
-   (!skb->dev))
+   info = PKTINFO_SKB_CB(skb);
+   if (!(sk->sk_tsflags & SOF_TIMESTAMPING_OPT_CMSG) ||
+   !info->ipi_ifindex)
return false;
 
-   info = PKTINFO_SKB_CB(skb);
info->ipi_spec_dst.s_addr = ip_hdr(skb)->saddr;
-   info->ipi_ifindex = skb->dev->ifindex;
return true;
 }
 
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index eec27f87efac..e011122ebd43 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -405,9 +405,6 @@ static inline bool ipv6_datagram_support_addr(struct sock_exterr_skb *serr)
  * At one point, excluding local errors was a quick test to identify icmp/icmp6
  * errors. This is no longer true, but the test remained, so the v6 stack,
  * unlike v4, also honors cmsg requests on all wifi and timestamp errors.
- *
- * Timestamp code paths do not initialize the fields expected by cmsg:
- * the PKTINFO fields in skb->cb[]. Fill those in here.
  */
 static bool ip6_datagram_support_cmsg(struct sk_buff *skb,
  struct sock_exterr_skb *serr)
@@ -419,14 +416,9 @@ static bool ip6_datagram_support_cmsg(struct sk_buff *skb,
if (serr->ee.ee_origin == SO_EE_ORIGIN_LOCAL)
return false;
 
-   if (!skb->dev)
+   if (!IP6CB(skb)->iif)
return false;
 
-   if (skb->protocol == htons(ETH_P_IPV6))
-   IP6CB(skb)->iif = skb->dev->ifindex;
-   else
-   PKTINFO_SKB_CB(skb)->ipi_ifindex = skb->dev->ifindex;
-
return true;
 }
 
-- 
2.12.2.715.g7642488e1d-goog

Re: [PATCH net-next v2] net: allow configuring default qdisc

2017-04-12 Thread Stephen Hemminger

On Tue, 11 Apr 2017 22:09:59 -0400 (EDT)
David Miller  wrote:

> From: Stephen Hemminger 
> Date: Sat,  8 Apr 2017 15:39:46 -0400
> 
> > Since 3.12 it has been possible to configure the default queuing
> > discipline via sysctl. This patch adds ability to configure the
> > default queue discipline in kernel configuration. This is useful for
> > environments where configuring the value from userspace is difficult
> > to manage.
> > 
> > The default is still the same as before (pfifo_fast) and it is
> > possible to change after kernel init with sysctl. This is similar
> > to how TCP congestion control works.
> > 
> > Signed-off-by: Stephen Hemminger 
> > ---
> > v2 - rearrange order of menu items
> >  use pfifo_fast not pfifo  
> 
> Stephen I'm still thinking about this.
> 
> Normal users typing "make oldconfig" shouldn't be asked a question
> like this.  They have no idea.  This is just like how we hide all the
> advanced ipv4 routing choices behind the Kconfig symbol
> IP_ADVANCED_ROUTER and TCP congestion control with TCP_CONG_ADVANCED.
> 
> And from experience I know that if it shows up when Linus types 'make'
> after pulling this in, he will complain :-)


I understand your concern about less clueful users not knowing what
to answer. But there have been many recent cases where things have
been added with the adage "just hit return and take the default".

The most recent example was in the block layer.

commit d34849913819a5e0cbfbe724dbe79df89278c524
Author: Jens Axboe 
Date:   Fri Jan 13 14:43:58 2017 -0700

blk-mq-sched: allow setting of default IO scheduler

Add Kconfig entries to manage what devices get assigned an MQ
scheduler, and add a blk-mq flag for drivers to opt out of scheduling.
The latter is useful for admin type queues that still allocate a blk-mq
queue and tag set, but aren't used for normal IO.

Signed-off-by: Jens Axboe 
Reviewed-by: Bart Van Assche 
Reviewed-by: Omar Sandoval 

Maybe Linus understands IO better than networking :=)

[PATCH v2] net: phy: micrel: fix crash when statistic requested for KSZ9031 phy

2017-04-12 Thread Grygorii Strashko

Now the command:
ethtool --phy-statistics eth0
will cause a system crash with the message "Unable to handle kernel NULL pointer
dereference at virtual address 0010" from:

 (kszphy_get_stats) from [] (ethtool_get_phy_stats+0xd8/0x210)
 (ethtool_get_phy_stats) from [] (dev_ethtool+0x5b8/0x228c)
 (dev_ethtool) from [] (dev_ioctl+0x3fc/0x964)
 (dev_ioctl) from [] (sock_ioctl+0x170/0x2c0)
 (sock_ioctl) from [] (do_vfs_ioctl+0xa8/0x95c)
 (do_vfs_ioctl) from [] (SyS_ioctl+0x3c/0x64)
 (SyS_ioctl) from [] (ret_fast_syscall+0x0/0x44)

The reason: the phy_driver structure for the KSZ9031 phy has no .probe() callback
defined. As a result, the struct phy_device *phydev->priv pointer is never
initialized (it stays NULL).
This issue also affects the following phys:
 KSZ8795, KSZ886X, KSZ8873MLL, KSZ9031, KSZ9021, KSZ8061, KS8737

Fix it by:
- adding a .probe() = kszphy_probe() callback to the KSZ9031 and KSZ9021
phys. kszphy_probe() can be reused as it doesn't do any phy-specific
settings.
- removing the statistics callbacks from the other phys (KSZ8795, KSZ886X,
KSZ8873MLL, KSZ8061, KS8737) as they don't have the corresponding
statistics counters.

Fixes: 2b2427d06426 ("phy: micrel: Add ethtool statistics counters")
Signed-off-by: Grygorii Strashko 
---
changes in v2:
 - probe callback added to KSZ9031, KSZ9021
 - statistic callback removed from KSZ8795, KSZ886X, KSZ8873MLL, KSZ8061, KS8737

Link to v1:
 https://lkml.org/lkml/2017/4/10/1183

 drivers/net/phy/micrel.c | 18 ++
 1 file changed, 2 insertions(+), 16 deletions(-)

diff --git a/drivers/net/phy/micrel.c b/drivers/net/phy/micrel.c
index 6742070..6f207e6 100644
--- a/drivers/net/phy/micrel.c
+++ b/drivers/net/phy/micrel.c
@@ -574,7 +574,6 @@ static int ksz9031_config_init(struct phy_device *phydev)
MII_KSZ9031RN_TX_DATA_PAD_SKEW, 4,
tx_data_skews, 4);
}
-
return ksz9031_center_flp_timing(phydev);
 }
 
@@ -798,9 +797,6 @@ static struct phy_driver ksphy_driver[] = {
.read_status= genphy_read_status,
.ack_interrupt  = kszphy_ack_interrupt,
.config_intr= kszphy_config_intr,
-   .get_sset_count = kszphy_get_sset_count,
-   .get_strings= kszphy_get_strings,
-   .get_stats  = kszphy_get_stats,
.suspend= genphy_suspend,
.resume = genphy_resume,
 }, {
@@ -940,9 +936,6 @@ static struct phy_driver ksphy_driver[] = {
.read_status= genphy_read_status,
.ack_interrupt  = kszphy_ack_interrupt,
.config_intr= kszphy_config_intr,
-   .get_sset_count = kszphy_get_sset_count,
-   .get_strings= kszphy_get_strings,
-   .get_stats  = kszphy_get_stats,
.suspend= genphy_suspend,
.resume = genphy_resume,
 }, {
@@ -952,6 +945,7 @@ static struct phy_driver ksphy_driver[] = {
.features   = PHY_GBIT_FEATURES,
.flags  = PHY_HAS_MAGICANEG | PHY_HAS_INTERRUPT,
.driver_data= &ksz9021_type,
+   .probe  = kszphy_probe,
.config_init= ksz9021_config_init,
.config_aneg= genphy_config_aneg,
.read_status= genphy_read_status,
@@ -971,6 +965,7 @@ static struct phy_driver ksphy_driver[] = {
.features   = PHY_GBIT_FEATURES,
.flags  = PHY_HAS_MAGICANEG | PHY_HAS_INTERRUPT,
.driver_data= &ksz9021_type,
+   .probe  = kszphy_probe,
.config_init= ksz9031_config_init,
.config_aneg= genphy_config_aneg,
.read_status= ksz9031_read_status,
@@ -989,9 +984,6 @@ static struct phy_driver ksphy_driver[] = {
.config_init= kszphy_config_init,
.config_aneg= ksz8873mll_config_aneg,
.read_status= ksz8873mll_read_status,
-   .get_sset_count = kszphy_get_sset_count,
-   .get_strings= kszphy_get_strings,
-   .get_stats  = kszphy_get_stats,
.suspend= genphy_suspend,
.resume = genphy_resume,
 }, {
@@ -1003,9 +995,6 @@ static struct phy_driver ksphy_driver[] = {
.config_init= kszphy_config_init,
.config_aneg= genphy_config_aneg,
.read_status= genphy_read_status,
-   .get_sset_count = kszphy_get_sset_count,
-   .get_strings= kszphy_get_strings,
-   .get_stats  = kszphy_get_stats,
.suspend= genphy_suspend,
.resume = genphy_resume,
 }, {
@@ -1017,9 +1006,6 @@ static struct phy_driver ksphy_driver[] = {
.config_init= kszphy_config_init,
.config_aneg= ksz8873mll_config_aneg,
.read_status= ksz8873mll_read_status,
-   .get_sset_count = kszphy_get_sset_count,
-   .get_strings= kszphy_get_strings,
-   .get_stats  = kszphy_get_stats,
.suspend= genphy_suspend,
.resume = genphy_resume,
 } };
-- 
2.10.1
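
The failure mode described in the commit message can be modeled outside
the kernel. This is a sketch under stated assumptions, not driver code:
fake_phy, fake_priv and the three helpers are hypothetical names, and the
NULL guard in fake_get_stat() exists only so the sketch can report the
condition instead of crashing.

```c
#include <stdlib.h>

/* Hypothetical, heavily simplified private data of a phy driver. */
struct fake_priv {
	unsigned long stats[2];
};

/* Hypothetical stand-in for a phy device; priv stays NULL unless probe runs. */
struct fake_phy {
	struct fake_priv *priv;
};

/* Plays the role of a .probe() callback: allocates zeroed private data. */
static int fake_probe(struct fake_phy *phy)
{
	phy->priv = calloc(1, sizeof(*phy->priv));
	return phy->priv ? 0 : -1;
}

/* Plays the role of a statistics callback that relies on priv being set. */
static int fake_get_stat(const struct fake_phy *phy, int i, unsigned long *out)
{
	if (!phy->priv)
		return -1;	/* an unguarded callback would dereference NULL here */
	*out = phy->priv->stats[i];
	return 0;
}

/* Releases what fake_probe() allocated. */
static void fake_remove(struct fake_phy *phy)
{
	free(phy->priv);
	phy->priv = NULL;
}
```

A driver entry that wires up the statistics callback without a probe
callback corresponds to calling fake_get_stat() before fake_probe().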

Re: [PATCH nf-next] ip_vs_sync: change comparison on sync_refresh_period

2017-04-12 Thread Simon Horman

On Wed, Apr 12, 2017 at 04:38:12PM -0400, Aaron Conole wrote:
> The sync_refresh_period variable is unsigned, so it can never be < 0.
> 
> Signed-off-by: Aaron Conole 

Thanks Aaron,

I have applied this to ipvs-next after updating the prefix to "ipvs:".

[PATCH 8/8] ftgmac100: Document device-tree binding

Signed-off-by: Benjamin Herrenschmidt 
---
 .../devicetree/bindings/net/ftgmac100.txt  | 36 ++
 1 file changed, 36 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/ftgmac100.txt

diff --git a/Documentation/devicetree/bindings/net/ftgmac100.txt 
b/Documentation/devicetree/bindings/net/ftgmac100.txt
new file mode 100644
index 000..68a694a
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/ftgmac100.txt
@@ -0,0 +1,36 @@
+* Faraday Technology FTGMAC100 gigabit ethernet controller
+
+Required properties:
+- compatible: "faraday,ftgmac100"
+
+  Must also contain one of these if used as part of an Aspeed AST2400
+  or 2500 family SoC as they have some subtle tweaks to the
+  implementation:
+
+ - "aspeed,ast2400-mac"
+ - "aspeed,ast2500-mac"
+
+- reg: Address and length of the register set for the device
+- interrupts: Should contain ethernet controller interrupt
+
+Optional properties:
+- phy-mode: See ethernet.txt file in the same directory. If the property is
+  absent, "rgmii" is assumed. Supported values are "rgmii" and "rmii"
+- use-ncsi: Use the NC-SI stack instead of an MDIO PHY. Currently assumes
+  rmii (100bT) but kept as a separate property in case NC-SI grows support
+  for a gigabit link.
+- no-hw-checksum: Used to disable HW checksum support. Here for backward
+  compatibility as the driver now should have correct defaults based on
+  the SoC.
+
+Example:
+
+   mac0: ethernet@1e66 {
+   compatible = "aspeed,ast2500-mac", "faraday,ftgmac100";
+   reg = <0x1e66 0x180>;
+   interrupts = <2>;
+   status = "okay";
+   use-ncsi;
+   };
+
+
-- 
2.9.3

[PATCH 5/8] ftgmac100: Add netpoll support

Just call the interrupt handler with interrupts locally disabled

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index ded3447..e71b9c4 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -1593,6 +1593,17 @@ static int ftgmac100_set_features(struct net_device 
*netdev,
return 0;
 }
 
+#ifdef CONFIG_NET_POLL_CONTROLLER
+static void ftgmac100_poll_controller(struct net_device *netdev)
+{
+   unsigned long flags;
+
+   local_irq_save(flags);
+   ftgmac100_interrupt(netdev->irq, netdev);
+   local_irq_restore(flags);
+}
+#endif
+
 static const struct net_device_ops ftgmac100_netdev_ops = {
.ndo_open   = ftgmac100_open,
.ndo_stop   = ftgmac100_stop,
@@ -1603,6 +1614,9 @@ static const struct net_device_ops ftgmac100_netdev_ops = 
{
.ndo_tx_timeout = ftgmac100_tx_timeout,
.ndo_set_rx_mode= ftgmac100_set_rx_mode,
.ndo_set_features   = ftgmac100_set_features,
+#ifdef CONFIG_NET_POLL_CONTROLLER
+   .ndo_poll_controller= ftgmac100_poll_controller,
+#endif
 };
 
 static int ftgmac100_setup_mdio(struct net_device *netdev)
-- 
2.9.3

[PATCH 6/8] ftgmac100: Allow configuration of phy interface via device-tree

This uses the standard phy-mode property

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 42 +---
 1 file changed, 39 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index e71b9c4..c1afda8 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -1049,7 +1050,7 @@ static void ftgmac100_adjust_link(struct net_device 
*netdev)
schedule_work(&priv->reset_task);
 }
 
-static int ftgmac100_mii_probe(struct ftgmac100 *priv)
+static int ftgmac100_mii_probe(struct ftgmac100 *priv, phy_interface_t intf)
 {
struct net_device *netdev = priv->netdev;
struct phy_device *phydev;
@@ -1061,7 +1062,7 @@ static int ftgmac100_mii_probe(struct ftgmac100 *priv)
}
 
phydev = phy_connect(netdev, phydev_name(phydev),
-&ftgmac100_adjust_link, PHY_INTERFACE_MODE_GMII);
+&ftgmac100_adjust_link, intf);
 
if (IS_ERR(phydev)) {
netdev_err(netdev, "%s: Could not attach to PHY\n", 
netdev->name);
@@ -1623,6 +1624,8 @@ static int ftgmac100_setup_mdio(struct net_device *netdev)
 {
struct ftgmac100 *priv = netdev_priv(netdev);
struct platform_device *pdev = to_platform_device(priv->dev);
+   int phy_intf = PHY_INTERFACE_MODE_RGMII;
+   struct device_node *np = pdev->dev.of_node;
int i, err = 0;
u32 reg;
 
@@ -1638,6 +1641,39 @@ static int ftgmac100_setup_mdio(struct net_device 
*netdev)
iowrite32(reg, priv->base + FTGMAC100_OFFSET_REVR);
};
 
+   /* Get PHY mode from device-tree */
+   if (np) {
+   /* Default to RGMII. It's a gigabit part after all */
+   phy_intf = of_get_phy_mode(np);
+   if (phy_intf < 0)
+   phy_intf = PHY_INTERFACE_MODE_RGMII;
+
+   /* Aspeed only supports these. I don't know about other IP
+* block vendors so I'm going to just let them through for
+* now. Note that this is only a warning if for some obscure
+* reason the DT really means to lie about it or it's a newer
+* part we don't know about.
+*
+* On the Aspeed SoC there are additionally straps and SCU
+* control bits that could tell us what the interface is
+* (or allow us to configure it while the IP block is held
+* in reset). For now I chose to keep this driver away from
+* those SoC specific bits and assume the device-tree is
+* right and the SCU has been configured properly by pinmux
+* or the firmware.
+*/
+   if (priv->is_aspeed &&
+   phy_intf != PHY_INTERFACE_MODE_RMII &&
+   phy_intf != PHY_INTERFACE_MODE_RGMII &&
+   phy_intf != PHY_INTERFACE_MODE_RGMII_ID &&
+   phy_intf != PHY_INTERFACE_MODE_RGMII_RXID &&
+   phy_intf != PHY_INTERFACE_MODE_RGMII_TXID) {
+   netdev_warn(netdev,
+  "Unsupported PHY mode %s !\n",
+  phy_modes(phy_intf));
+   }
+   }
+
priv->mii_bus->name = "ftgmac100_mdio";
snprintf(priv->mii_bus->id, MII_BUS_ID_SIZE, "%s-%d",
 pdev->name, pdev->id);
@@ -1654,7 +1690,7 @@ static int ftgmac100_setup_mdio(struct net_device *netdev)
goto err_register_mdiobus;
}
 
-   err = ftgmac100_mii_probe(priv);
+   err = ftgmac100_mii_probe(priv, phy_intf);
if (err) {
dev_err(priv->dev, "MII Probe failed!\n");
goto err_mii_probe;
-- 
2.9.3

[PATCH 4/8] ftgmac100: Add vlan HW offload

The chip supports HW vlan tag insertion and extraction. Add support
for it.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 46 +++-
 1 file changed, 45 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index d86b27f..ded3447 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -333,6 +334,10 @@ static void ftgmac100_start_hw(struct ftgmac100 *priv)
else if (netdev_mc_count(priv->netdev))
maccr |= FTGMAC100_MACCR_HT_MULTI_EN;
 
+   /* Vlan filtering enabled */
+   if (priv->netdev->features & NETIF_F_HW_VLAN_CTAG_RX)
+   maccr |= FTGMAC100_MACCR_RM_VLAN;
+
/* Hit the HW */
iowrite32(maccr, priv->base + FTGMAC100_OFFSET_MACCR);
 }
@@ -528,6 +533,12 @@ static bool ftgmac100_rx_packet(struct ftgmac100 *priv, 
int *processed)
/* Transfer received size to skb */
skb_put(skb, size);
 
+   /* Extract vlan tag */
+   if ((netdev->features & NETIF_F_HW_VLAN_CTAG_RX) &&
+   (csum_vlan & FTGMAC100_RXDES1_VLANTAG_AVAIL))
+   __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q),
+  csum_vlan & 0x);
+
/* Tear down DMA mapping, do necessary cache management */
map = le32_to_cpu(rxdes->rxdes3);
 
@@ -752,6 +763,13 @@ static int ftgmac100_hard_start_xmit(struct sk_buff *skb,
if (skb->ip_summed == CHECKSUM_PARTIAL &&
!ftgmac100_prep_tx_csum(skb, &csum_vlan))
goto drop;
+
+   /* Add VLAN tag */
+   if (skb_vlan_tag_present(skb)) {
+   csum_vlan |= FTGMAC100_TXDES1_INS_VLANTAG;
+   csum_vlan |= skb_vlan_tag_get(skb) & 0x;
+   }
+
txdes->txdes1 = cpu_to_le32(csum_vlan);
 
/* Next descriptor */
@@ -1551,6 +1569,30 @@ static void ftgmac100_tx_timeout(struct net_device 
*netdev)
schedule_work(&priv->reset_task);
 }
 
+static int ftgmac100_set_features(struct net_device *netdev,
+ netdev_features_t features)
+{
+   struct ftgmac100 *priv = netdev_priv(netdev);
+   netdev_features_t changed = netdev->features ^ features;
+
+   if (!netif_running(netdev))
+   return 0;
+
+   /* Update the vlan filtering bit */
+   if (changed & NETIF_F_HW_VLAN_CTAG_RX) {
+   u32 maccr;
+
+   maccr = ioread32(priv->base + FTGMAC100_OFFSET_MACCR);
+   if (priv->netdev->features & NETIF_F_HW_VLAN_CTAG_RX)
+   maccr |= FTGMAC100_MACCR_RM_VLAN;
+   else
+   maccr &= ~FTGMAC100_MACCR_RM_VLAN;
+   iowrite32(maccr, priv->base + FTGMAC100_OFFSET_MACCR);
+   }
+
+   return 0;
+}
+
 static const struct net_device_ops ftgmac100_netdev_ops = {
.ndo_open   = ftgmac100_open,
.ndo_stop   = ftgmac100_stop,
@@ -1560,6 +1602,7 @@ static const struct net_device_ops ftgmac100_netdev_ops = 
{
.ndo_do_ioctl   = ftgmac100_do_ioctl,
.ndo_tx_timeout = ftgmac100_tx_timeout,
.ndo_set_rx_mode= ftgmac100_set_rx_mode,
+   .ndo_set_features   = ftgmac100_set_features,
 };
 
 static int ftgmac100_setup_mdio(struct net_device *netdev)
@@ -1735,7 +1778,8 @@ static int ftgmac100_probe(struct platform_device *pdev)
 
/* Base feature set */
netdev->hw_features = NETIF_F_RXCSUM | NETIF_F_HW_CSUM |
-   NETIF_F_GRO | NETIF_F_SG;
+   NETIF_F_GRO | NETIF_F_SG | NETIF_F_HW_VLAN_CTAG_RX |
+   NETIF_F_HW_VLAN_CTAG_TX;
 
/* AST2400  doesn't have working HW checksum generation */
if (np && (of_device_is_compatible(np, "aspeed,ast2400-mac")))
-- 
2.9.3
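
The set_features hook in the patch above finds the toggled feature bits by XOR-ing the current mask with the requested one. Below is a minimal standalone sketch of that pattern. The FEAT_* and REG_RM_VLAN values are hypothetical stand-ins, not the kernel's NETIF_F_* or FTGMAC100_MACCR_* definitions, and update_vlan_bit() is an illustrative helper that does not exist in the driver. The sketch derives the register bit from the requested mask; that is a choice made here for illustration.

```c
#include <stdint.h>

/* Hypothetical stand-ins for NETIF_F_HW_VLAN_CTAG_RX, NETIF_F_SG and
 * FTGMAC100_MACCR_RM_VLAN; the real values differ.
 */
#define FEAT_VLAN_RX	(1u << 0)
#define FEAT_SG		(1u << 1)
#define REG_RM_VLAN	(1u << 18)

/* Return the register value after applying a feature change.
 * Only a toggle of FEAT_VLAN_RX touches the register.
 */
static uint32_t update_vlan_bit(uint32_t reg, uint32_t old_feat,
				uint32_t new_feat)
{
	uint32_t changed = old_feat ^ new_feat;

	if (!(changed & FEAT_VLAN_RX))
		return reg;

	if (new_feat & FEAT_VLAN_RX)
		reg |= REG_RM_VLAN;
	else
		reg &= ~REG_RM_VLAN;

	return reg;
}
```

Toggling an unrelated bit such as FEAT_SG leaves the register value unchanged, which is why the XOR test comes first.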

[PATCH 7/8] ftgmac100: Display the discovered PHY device info

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index c1afda8..d04ad31 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -1075,6 +1075,9 @@ static int ftgmac100_mii_probe(struct ftgmac100 *priv, 
phy_interface_t intf)
phydev->supported |= SUPPORTED_Pause | SUPPORTED_Asym_Pause;
phydev->advertising = phydev->supported;
 
+   /* Display what we found */
+   phy_attached_info(phydev);
+
return 0;
 }
 
-- 
2.9.3

[PATCH 2/8] ftgmac100: Add pause frames configuration and support

Hopefully my understanding of how the hardware works is correct,
as the documentation isn't completely clear. So far I have seen
no obvious issues. Pause also seems to work with NC-SI.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 96 +++-
 drivers/net/ethernet/faraday/ftgmac100.h |  7 +++
 2 files changed, 102 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index bbeb8e7..4f3ec2c 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -97,6 +97,11 @@ struct ftgmac100 {
int cur_duplex;
bool use_ncsi;
 
+   /* Flow control settings */
+   bool tx_pause;
+   bool rx_pause;
+   bool aneg_pause;
+
/* Misc */
bool need_mac_restart;
bool is_aspeed;
@@ -217,6 +222,23 @@ static int ftgmac100_set_mac_addr(struct net_device *dev, 
void *p)
return 0;
 }
 
+static void ftgmac100_config_pause(struct ftgmac100 *priv)
+{
+   u32 fcr = FTGMAC100_FCR_PAUSE_TIME(16);
+
+   /* Throttle tx queue when receiving pause frames */
+   if (priv->rx_pause)
+   fcr |= FTGMAC100_FCR_FC_EN;
+
+   /* Enables sending pause frames when the RX queue is past a
+* certain threshold.
+*/
+   if (priv->tx_pause)
+   fcr |= FTGMAC100_FCR_FCTHR_EN;
+
+   iowrite32(fcr, priv->base + FTGMAC100_OFFSET_FCR);
+}
+
 static void ftgmac100_init_hw(struct ftgmac100 *priv)
 {
u32 reg, rfifo_sz, tfifo_sz;
@@ -910,6 +932,7 @@ static void ftgmac100_adjust_link(struct net_device *netdev)
 {
struct ftgmac100 *priv = netdev_priv(netdev);
struct phy_device *phydev = netdev->phydev;
+   bool tx_pause, rx_pause;
int new_speed;
 
/* We store "no link" as speed 0 */
@@ -918,8 +941,21 @@ static void ftgmac100_adjust_link(struct net_device 
*netdev)
else
new_speed = phydev->speed;
 
+   /* Grab pause settings from PHY if configured to do so */
+   if (priv->aneg_pause) {
+   rx_pause = tx_pause = phydev->pause;
+   if (phydev->asym_pause)
+   tx_pause = !rx_pause;
+   } else {
+   rx_pause = priv->rx_pause;
+   tx_pause = priv->tx_pause;
+   }
+
+   /* Link hasn't changed, do nothing */
if (phydev->speed == priv->cur_speed &&
-   phydev->duplex == priv->cur_duplex)
+   phydev->duplex == priv->cur_duplex &&
+   rx_pause == priv->rx_pause &&
+   tx_pause == priv->tx_pause)
return;
 
/* Print status if we have a link or we had one and just lost it,
@@ -930,6 +966,8 @@ static void ftgmac100_adjust_link(struct net_device *netdev)
 
priv->cur_speed = new_speed;
priv->cur_duplex = phydev->duplex;
+   priv->rx_pause = rx_pause;
+   priv->tx_pause = tx_pause;
 
/* Link is down, do nothing else */
if (!new_speed)
@@ -961,6 +999,12 @@ static int ftgmac100_mii_probe(struct ftgmac100 *priv)
return PTR_ERR(phydev);
}
 
+   /* Indicate that we support PAUSE frames (see comment in
+* Documentation/networking/phy.txt)
+*/
+   phydev->supported |= SUPPORTED_Pause | SUPPORTED_Asym_Pause;
+   phydev->advertising = phydev->supported;
+
return 0;
 }
 
@@ -1083,6 +1127,48 @@ static int ftgmac100_set_ringparam(struct net_device 
*netdev,
return 0;
 }
 
+static void ftgmac100_get_pauseparam(struct net_device *netdev,
+struct ethtool_pauseparam *pause)
+{
+   struct ftgmac100 *priv = netdev_priv(netdev);
+
+   pause->autoneg = priv->aneg_pause;
+   pause->tx_pause = priv->tx_pause;
+   pause->rx_pause = priv->rx_pause;
+}
+
+static int ftgmac100_set_pauseparam(struct net_device *netdev,
+   struct ethtool_pauseparam *pause)
+{
+   struct ftgmac100 *priv = netdev_priv(netdev);
+   struct phy_device *phydev = netdev->phydev;
+
+   priv->aneg_pause = pause->autoneg;
+   priv->tx_pause = pause->tx_pause;
+   priv->rx_pause = pause->rx_pause;
+
+   if (phydev) {
+   phydev->advertising &= ~ADVERTISED_Pause;
+   phydev->advertising &= ~ADVERTISED_Asym_Pause;
+
+   if (pause->rx_pause) {
+   phydev->advertising |= ADVERTISED_Pause;
+   phydev->advertising |= ADVERTISED_Asym_Pause;
+   }
+
+   if (pause->tx_pause)
+   phydev->advertising ^= ADVERTISED_Asym_Pause;
+   }
+   if (netif_running(netdev)) {
+   if (phydev && priv->aneg_pause)
+   phy_start_aneg(phydev);
+   else
+   ftgmac100_config_pause(priv);
+   }
+
+   return 0;
+}
+
 static const struct ethtool
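
The set_pauseparam hunk above maps the requested rx/tx pause settings onto the Pause and Asym_Pause advertisement bits with an OR followed by an XOR. The sketch below reproduces only that bit manipulation, so the four resulting combinations are easy to check. ADV_PAUSE and ADV_ASYM_PAUSE are hypothetical bit values, not the kernel's ADVERTISED_* constants, and pause_to_advertising() is an illustrative helper.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values; the kernel's ADVERTISED_Pause and
 * ADVERTISED_Asym_Pause are different constants.
 */
#define ADV_PAUSE	(1u << 0)
#define ADV_ASYM_PAUSE	(1u << 1)

/* Same sequence as in the patch: rx_pause sets both bits,
 * then tx_pause flips the asymmetric bit.
 */
static uint32_t pause_to_advertising(bool rx_pause, bool tx_pause)
{
	uint32_t adv = 0;

	if (rx_pause)
		adv |= ADV_PAUSE | ADV_ASYM_PAUSE;
	if (tx_pause)
		adv ^= ADV_ASYM_PAUSE;

	return adv;
}
```

With this mapping, rx only advertises Pause and Asym_Pause, tx only advertises Asym_Pause, and rx plus tx advertises Pause alone.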

[PATCH 3/8] ftgmac100: Add ndo_set_rx_mode() and support for multicast & promisc

This adds the ndo_set_rx_mode() callback to configure the
multicast filters, promisc and allmulti options.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 52 
 1 file changed, 52 insertions(+)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index 4f3ec2c..d86b27f 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -97,6 +98,10 @@ struct ftgmac100 {
int cur_duplex;
bool use_ncsi;
 
+   /* Multicast filter settings */
+   u32 maht0;
+   u32 maht1;
+
/* Flow control settings */
bool tx_pause;
bool rx_pause;
@@ -264,6 +269,10 @@ static void ftgmac100_init_hw(struct ftgmac100 *priv)
/* Write MAC address */
ftgmac100_write_mac_addr(priv, priv->netdev->dev_addr);
 
+   /* Write multicast filter */
+   iowrite32(priv->maht0, priv->base + FTGMAC100_OFFSET_MAHT0);
+   iowrite32(priv->maht1, priv->base + FTGMAC100_OFFSET_MAHT1);
+
/* Configure descriptor sizes and increase burst sizes according
 * to values in Aspeed SDK. The FIFO arbitration is enabled and
 * the thresholds set based on the recommended values in the
@@ -317,6 +326,12 @@ static void ftgmac100_start_hw(struct ftgmac100 *priv)
/* Add other bits as needed */
if (priv->cur_duplex == DUPLEX_FULL)
maccr |= FTGMAC100_MACCR_FULLDUP;
+   if (priv->netdev->flags & IFF_PROMISC)
+   maccr |= FTGMAC100_MACCR_RX_ALL;
+   if (priv->netdev->flags & IFF_ALLMULTI)
+   maccr |= FTGMAC100_MACCR_RX_MULTIPKT;
+   else if (netdev_mc_count(priv->netdev))
+   maccr |= FTGMAC100_MACCR_HT_MULTI_EN;
 
/* Hit the HW */
iowrite32(maccr, priv->base + FTGMAC100_OFFSET_MACCR);
@@ -327,6 +342,42 @@ static void ftgmac100_stop_hw(struct ftgmac100 *priv)
iowrite32(0, priv->base + FTGMAC100_OFFSET_MACCR);
 }
 
+static void ftgmac100_calc_mc_hash(struct ftgmac100 *priv)
+{
+   struct netdev_hw_addr *ha;
+
+   priv->maht1 = 0;
+   priv->maht0 = 0;
+   netdev_for_each_mc_addr(ha, priv->netdev) {
+   u32 crc_val = ether_crc_le(ETH_ALEN, ha->addr);
+
+   crc_val = (~(crc_val >> 2)) & 0x3f;
+   if (crc_val >= 32)
+   priv->maht1 |= 1ul << (crc_val - 32);
+   else
+   priv->maht0 |= 1ul << (crc_val);
+   }
+}
+
+static void ftgmac100_set_rx_mode(struct net_device *netdev)
+{
+   struct ftgmac100 *priv = netdev_priv(netdev);
+
+   /* Setup the hash filter */
+   ftgmac100_calc_mc_hash(priv);
+
+   /* Interface down ? that's all there is to do */
+   if (!netif_running(netdev))
+   return;
+
+   /* Update the HW */
+   iowrite32(priv->maht0, priv->base + FTGMAC100_OFFSET_MAHT0);
+   iowrite32(priv->maht1, priv->base + FTGMAC100_OFFSET_MAHT1);
+
+   /* Reconfigure MACCR */
+   ftgmac100_start_hw(priv);
+}
+
 static int ftgmac100_alloc_rx_buf(struct ftgmac100 *priv, unsigned int entry,
  struct ftgmac100_rxdes *rxdes, gfp_t gfp)
 {
@@ -1508,6 +1559,7 @@ static const struct net_device_ops ftgmac100_netdev_ops = 
{
.ndo_validate_addr  = eth_validate_addr,
.ndo_do_ioctl   = ftgmac100_do_ioctl,
.ndo_tx_timeout = ftgmac100_tx_timeout,
+   .ndo_set_rx_mode= ftgmac100_set_rx_mode,
 };
 
 static int ftgmac100_setup_mdio(struct net_device *netdev)
-- 
2.9.3
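
The multicast filter in the patch above picks one of 64 hash-table bits per address from the inverted CRC of the MAC address. The following userspace sketch shows the same index computation. It assumes that the kernel's ether_crc_le() is the reflected CRC-32 (polynomial 0xedb88320) seeded with ~0 and without a final inversion. crc32_le_bitwise() and mc_hash_add() are illustrative names, not driver functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise little-endian CRC-32: polynomial 0xedb88320, initial value ~0,
 * no final inversion. Assumed to match the kernel's ether_crc_le().
 */
static uint32_t crc32_le_bitwise(size_t len, const uint8_t *data)
{
	uint32_t crc = 0xffffffffu;

	for (size_t i = 0; i < len; i++) {
		crc ^= data[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ ((crc & 1) ? 0xedb88320u : 0);
	}

	return crc;
}

/* Set the hash-table bit for one 6-byte address, as in the patch:
 * index = ~(crc >> 2) & 0x3f, upper half in maht1, lower half in maht0.
 */
static void mc_hash_add(const uint8_t addr[6], uint32_t *maht0,
			uint32_t *maht1)
{
	uint32_t idx = (~(crc32_le_bitwise(6, addr) >> 2)) & 0x3f;

	if (idx >= 32)
		*maht1 |= UINT32_C(1) << (idx - 32);
	else
		*maht0 |= UINT32_C(1) << idx;
}
```

Each address sets exactly one bit across the two 32-bit words, so two addresses may collide on the same bit; the hash filter is a coarse pre-filter, not an exact match.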

[PATCH 1/8] ftgmac100: Add ethtool n-way reset call

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index 796b37e..bbeb8e7 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -1043,11 +1043,11 @@ static void ftgmac100_get_drvinfo(struct net_device 
*netdev,
strlcpy(info->bus_info, dev_name(&netdev->dev), sizeof(info->bus_info));
 }
 
-static int ftgmac100_nway_reset(struct net_device *ndev)
+static int ftgmac100_nway_reset(struct net_device *netdev)
 {
-   if (!ndev->phydev)
+   if (!netdev->phydev)
return -ENXIO;
-   return phy_start_aneg(ndev->phydev);
+   return phy_start_aneg(netdev->phydev);
 }
 
 static void ftgmac100_get_ringparam(struct net_device *netdev,
@@ -1088,6 +1088,7 @@ static const struct ethtool_ops ftgmac100_ethtool_ops = {
.get_link   = ethtool_op_get_link,
.get_link_ksettings = phy_ethtool_get_link_ksettings,
.set_link_ksettings = phy_ethtool_set_link_ksettings,
+   .nway_reset = ftgmac100_nway_reset,
.get_ringparam  = ftgmac100_get_ringparam,
.set_ringparam  = ftgmac100_set_ringparam,
 };
-- 
2.9.3

[PATCH 0/8] ftgmac100: Rework batch 5 - Features

This is the fifth and last batch of updates to the ftgmac100 driver.

This contains a few additional "features" such as:

 - Support for ethtool n-way reset
 - Multicast filtering & promisc support
 - Vlan offload
 - netpoll

And a couple of misc bits. This also adds the device-tree binding
documentation.

Re: [PATCH] netpoll: Check for skb->queue_mapping

2017-04-12 Thread tndave




On 04/06/2017 12:14 PM, Eric Dumazet wrote:

On Thu, 2017-04-06 at 12:07 -0700, tndave wrote:


+   q_index = q_index % dev->real_num_tx_queues;

cpu interrupted here and dev->real_num_tx_queues has reduced!

+   skb_set_queue_mapping(skb, q_index);
+   }
+   txq = netdev_get_tx_queue(dev, q_index);

or cpu interrupted here and dev->real_num_tx_queues has reduced!


If dev->real_num_tx_queues can be changed while this code is running we
are in deep deep trouble.

Better make sure that when control path does this change, device (and/or
netpoll) is frozen and no packet can be sent.

When control path is making change to real_num_tx_queues, underlying
device is disabled; also netdev tx queues are stopped/disabled so
certainly no transmit is happening.


The corner case I was referring to is when netpoll's queue_process() code is
interrupted and, while it is not running, the control path changes
dev->real_num_tx_queues and exits. Later on, the interrupted queue_process()
resumes execution and can end up with a wrong skb->queue_mapping and txq.
We can prevent this case with below change:

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 9424673..29be246 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -105,15 +105,21 @@ static void queue_process(struct work_struct *work)
while ((skb = skb_dequeue(&npinfo->txq))) {
struct net_device *dev = skb->dev;
struct netdev_queue *txq;
+   unsigned int q_index;

if (!netif_device_present(dev) || !netif_running(dev)) {
kfree_skb(skb);
continue;
}

-   txq = skb_get_tx_queue(dev, skb);
-
local_irq_save(flags);
+   /* check if skb->queue_mapping is still valid */
+   q_index = skb_get_queue_mapping(skb);
+   if (unlikely(q_index >= dev->real_num_tx_queues)) {
+   q_index = q_index % dev->real_num_tx_queues;
+   skb_set_queue_mapping(skb, q_index);
+   }
+   txq = netdev_get_tx_queue(dev, q_index);
HARD_TX_LOCK(dev, txq, smp_processor_id());
if (netif_xmit_frozen_or_stopped(txq) ||
netpoll_start_xmit(skb, dev, txq) != NETDEV_TX_OK) {


Thanks for your patience.

-Tushar
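
The check proposed in the diff above folds a stale queue index back into the valid range with a modulo. A minimal standalone sketch of just that arithmetic follows. clamp_queue_index() is a hypothetical helper, not a kernel function, and it assumes real_num_tx_queues is never zero.

```c
/* Hypothetical helper: fold a possibly stale queue index into
 * [0, real_num_tx_queues). The caller must guarantee that
 * real_num_tx_queues is non-zero.
 */
static unsigned int clamp_queue_index(unsigned int q_index,
				      unsigned int real_num_tx_queues)
{
	if (q_index >= real_num_tx_queues)
		q_index %= real_num_tx_queues;

	return q_index;
}
```

An index that is already in range is returned unchanged, so the common case costs one comparison.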

Re: net/ipv4: use-after-free in ipv4_datagram_support_cmsg

2017-04-12 Thread Willem de Bruijn

On Wed, Apr 12, 2017 at 4:47 PM, Eric Dumazet  wrote:
> On Wed, 2017-04-12 at 13:07 -0700, Cong Wang wrote:
>> On Wed, Apr 12, 2017 at 8:39 AM, Willem de Bruijn
>>  wrote:
>> > ===
>> >> BUG: KASAN: use-after-free in ipv4_datagram_support_cmsg
>> >> net/ipv4/ip_sockglue.c:500 [inline] at addr 880059be0128
>> >
>> > Thanks for the report. This is accessing skb->dev from within recvmsg() at 
>> > line
>> >
>> > info->ipi_ifindex = skb->dev->ifindex;
>> >
>> > Introduced in 829ae9d61165 ("net-timestamp: allow reading recv cmsg on
>> > errqueue with origin tstamp"). At this time the device may indeed have
>> > gone away. I'm having a look at a way to read this in the receive BH
>> > and store the ifindex.
>>
>> Why not use skb_iif?

This code is called from the error path for transmit timestamps.

We can make use of the fact that SKB_EXT_ERR used on enqueue has iif as
the first field in its control block. This also holds for the PKTINFO_SKB_CB
struct to which skb->cb is cast on dequeue when it copies pktinfo to userspace.
So if set on enqueue in __skb_complete_tx_timestamp, no conversion operation
is even needed on dequeue, let alone the currently buggy line that touches
skb->dev.

This iif cast was added for this purpose in the receive path in 0b922b7a829c
("net: original ingress device index in PKTINFO").

The device pointer is valid on enqueue for all paths called from device drivers,
as well as from dev_queue_xmit for SCM_TSTAMP_SCHED generation in
__dev_queue_xmit. The exception is SCM_TSTAMP_ACK generation, but
there skb->dev is NULL.

The v6 path does need a conversion, but already does this in
ip6_datagram_recv_common_ctl. There, too, we can remove the buggy
logic to set it from skb->dev->ifindex in ip6_datagram_support_cmsg.

I will send a patch.
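
The idea above relies on two control-block layouts sharing the same first field, so a value written through one view on enqueue can be read through the other on dequeue. The sketch below illustrates that with simplified, hypothetical structs. They are not the kernel's sock_exterr_skb or in_pktinfo-related definitions, and fake_skb only models the 48-byte cb[] scratch area.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified layouts. Both start with the interface index. */
struct enqueue_cb {
	int iif;
	int err_info;
};

struct dequeue_cb {
	int iif;
	unsigned int flags;
};

/* Models only the per-packet scratch area. */
struct fake_skb {
	char cb[48];
};

/* Enqueue side: store the ifindex through the first layout. */
static void enqueue_set_iif(struct fake_skb *skb, int ifindex)
{
	struct enqueue_cb cb;

	memcpy(&cb, skb->cb, sizeof(cb));
	cb.iif = ifindex;
	memcpy(skb->cb, &cb, sizeof(cb));
}

/* Dequeue side: read it back through the second layout, with no
 * conversion and without touching any device pointer.
 */
static int dequeue_get_iif(const struct fake_skb *skb)
{
	struct dequeue_cb cb;

	memcpy(&cb, skb->cb, sizeof(cb));
	return cb.iif;
}
```

The approach only works while both layouts keep iif at offset zero, which is worth asserting at build time in real code.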

RE: [PATCH net 1/1] net: tcp: Increase TCPABORTONLINGER when send RST by linger2 in keepalive timer

2017-04-12 Thread Gao Feng

Hi David,

> -Original Message-
> From: David Miller [mailto:da...@davemloft.net]
> Sent: Thursday, April 13, 2017 1:21 AM
> To: gfree.w...@foxmail.com
> Cc: kuz...@ms2.inr.ac.ru; jmor...@namei.org; ka...@trash.net;
> ncardw...@google.com; netdev@vger.kernel.org; f...@ikuai8.com
> Subject: Re: [PATCH net 1/1] net: tcp: Increase TCPABORTONLINGER when send
> RST by linger2 in keepalive timer
> 
> From: gfree.w...@foxmail.com
> Date: Sun,  9 Apr 2017 20:44:41 +0800
> 
> > From: Gao Feng 
> >
> > It should increase TCPABORTONLINGER counter when send RST caused by
> > linger2 in keepalive timer.
> >
> > Signed-off-by: Gao Feng 
> > ---
> >  net/ipv4/tcp_timer.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c index
> > b2ab411..5c01f21 100644
> > --- a/net/ipv4/tcp_timer.c
> > +++ b/net/ipv4/tcp_timer.c
> > @@ -650,6 +650,8 @@ static void tcp_keepalive_timer (unsigned long data)
> > tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
> > goto out;
> > }
> > +   } else {
> > +   NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONLINGER);
> > }
> > tcp_send_active_reset(sk, GFP_ATOMIC);
> > goto death;
> 
> I think this else clause is completely unnecessary.  Just do it right
> above the tcp_send_active_reset() call and at the same indentation level.
>
> Alternatively, if you are trying to only bump the counter when
> tp->linger2 is >= 0, then you attached the else clause to the wrong
> if() test.
> 
> Thank you.

Actually, I only want to increase TCPABORTONLINGER when linger2 is negative,
which means the FIN_WAIT2 timeout is 0. In that case we need to move the
socket to the closed state and send one RST to the peer.

Because tmo may also be zero, I increase the counter in the else block so
that this case is not counted.

Best Regards
Feng
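
The control flow under discussion can be modelled as a small decision function. The sketch below follows the quoted hunk: with linger2 >= 0 and a positive remaining timeout the socket goes to time-wait, with linger2 >= 0 and no remaining timeout a reset is sent without bumping the counter, and with negative linger2 the reset is sent and the counter is bumped. The enum and function names are hypothetical, and tmo stands for the value the kernel computes as tcp_fin_time(sk) - TCP_TIMEWAIT_LEN.

```c
/* Hypothetical outcome labels for the FIN_WAIT2 branch of the
 * keepalive timer, as modified by the quoted patch.
 */
enum fin_wait2_action {
	ACT_TIME_WAIT,		/* hand over to time-wait, no reset */
	ACT_RESET,		/* send RST, counter untouched */
	ACT_RESET_COUNTED,	/* send RST and bump TCPABORTONLINGER */
};

static enum fin_wait2_action fin_wait2_decide(int linger2, int tmo)
{
	if (linger2 >= 0) {
		if (tmo > 0)
			return ACT_TIME_WAIT;
		return ACT_RESET;
	}

	return ACT_RESET_COUNTED;
}
```

Moving the increment to just above the reset call, as suggested in the review, would merge the last two outcomes into one.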

RE: [PATCH v3] smsc95xx: Add comments to the registers definition

2017-04-12 Thread Woojung.Huh

> I based my comments on the datasheet. For the LED_GPIO_CFG register, the
> datasheet says:
> > This register configures the external GPIO[2:0] pins.
> 
> QFN package description also indicates GPIOs 0, 1 & 2.
> As an example for the LAN9514, pin 22 of the QFN indicates:
> > nSPD_LED/GPIO2
> 
> In LED_GPIO_CFG register, GPCTL2 description indicates:
> > The value of this field determines the function of the external GPIO2
> > pin as follows
> 
> Do you confirm it's actually GPIO 10, 9 and 8?
> I think I may have misunderstood something.
Sorry, I forgot that you are referring to the RPi, which uses the LAN9514.
Because these LEDs' GPIOs can vary per chip (LAN9500, 9514, ...), it would be
better not to put the GPIO number. On the LAN9500 they are GPIO 10/9/8 as described.

> While we are here, could you indicate the meaning of the bit 2 of
> HW_CFG register (it's named HW_CFG_PSEL_)? It's the only bit I didn't
> succeed to comment because I didn't find it in the datasheet.
> I will then add it to the patch!
It indicates whether the internal or the external PHY is used; PSEL means
PHY Select. You can find it in the LAN9500 documentation at
http://ww1.microchip.com/downloads/en/DeviceDoc/1875C.pdf.

> I'm also wondering what the meaning of STRAP_STATUS is. I could also
> comment it if you or Steve can provide the information.
It is marked as reserved in the LAN9500 manual above.
You may be able to guess it from the configuration straps in the manual and the define names.

Woojung

Re: [RFC net-next] of: mdio: Honor hints from MDIO bus drivers

2017-04-12 Thread Andrew Lunn

> >>> To give some more background and rationale for this change.
> >>>
> >>> On a platform where we have a parent MDIO bus, backed by the
> >>> mdio-bcm-unimac.c driver, we also register a slave MII bus (through
> >>> net/dsa/dsa2.c) which is parented to this UniMAC MDIO bus through an
> >>> assignment of of_node. This slave MII bus is created in order to
> >>> intercept reads/writes to problematic addresses (e.g: that clashes with
> >>> another piece of hardware).
> >>>
> >>> This means that the slave DSA MII bus inherits all child nodes from the
> >>> originating master MII bus. This also means that when the slave MII bus
> >>> is probed via of_mdiobus_register(), we probe the same devices twice:
> >>> once through the master, another time through the slave.
> >>
> >> Ah, O.K. This makes more sense. On the hardware i have, we get three
> >> deep in MDIO busses. We have the FEC mdio bus. On top of that we have
> >> a gpio-mux-mdio, and on top of that we have the mv88e6xxx mdio
> >> bus. And i've never seen issues.
> >>
> >> So your real problem here is you have two mdio busses using the same
> >> device tree properties. I would actually say that is just plain
> >> broken.
> > 
> > From a Device Tree/HW representation perspective, we do have the
> > external BCM53125 switch physically attached to the 7445/7278
> > SWITCH_MDIO bus (backed by mdio-bcm-unimac) so in that regard the
> > representation is correct. There is also an integrated Gigabit PHY
> > (bcm7xxx) which is attached to that bus.

This is made harder by you talking about a board which does not appear
to have its DT file in mainline. So i'm having to guess what it looks
like.

So what i think we are talking about is this bit of code:

static int bcm_sf2_mdio_register(struct dsa_switch *ds)
{
struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
struct device_node *dn;
static int index;
int err;

/* Find our integrated MDIO bus node */
dn = of_find_compatible_node(NULL, NULL, "brcm,unimac-mdio");
priv->master_mii_bus = of_mdio_find_bus(dn);
if (!priv->master_mii_bus)
return -EPROBE_DEFER;

get_device(&priv->master_mii_bus->dev);
priv->master_mii_dn = dn;

priv->slave_mii_bus = devm_mdiobus_alloc(ds->dev);
if (!priv->slave_mii_bus)
return -ENOMEM;

priv->slave_mii_bus->priv = priv;
priv->slave_mii_bus->name = "sf2 slave mii";
priv->slave_mii_bus->read = bcm_sf2_sw_mdio_read;
priv->slave_mii_bus->write = bcm_sf2_sw_mdio_write;
snprintf(priv->slave_mii_bus->id, MII_BUS_ID_SIZE, "sf2-%d",
 index++);
priv->slave_mii_bus->dev.of_node = dn;

If i get you right, your switch is hanging off the MDIO bus
"brcm,unimac-mdio" you find the dn for. You then register another MDIO
bus using the exact same node? How does that make any sense? Isn't it
a physically separate MDIO bus? So it should have its own set of nodes
in the device tree. This is how we do it for the Marvell switches. See
Documentation/devicetree/bindings/net/dsa/marvell.txt and
arch/arm/boot/dts/vf610-zii-dev-rev-b.dts. That DT blob uses
phy-handle to link the switch ports to the phys on the mdio bus.

   Andrew

Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-12 Thread Eric Dumazet

On Wed, 2017-04-12 at 14:30 -0700, Stephen Hemminger wrote:
> On Wed, 12 Apr 2017 14:54:15 -0400 (EDT)
> David Miller  wrote:
> 
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index b0aa089..071a58b 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -1891,9 +1891,17 @@ struct net_device {
> > struct lock_class_key   *qdisc_tx_busylock;
> > struct lock_class_key   *qdisc_running_key;
> > boolproto_down;
> > +   struct bpf_prog __rcu   *xdp_prog;
> 
> It would be good if all devices could reuse this for the xdp_prog pointer.
> It would allow for introspection utility functions in the future.

Problem is that some xdp usages were envisioning a per RX queue xdp
program.

Re: [RFC net-next] of: mdio: Honor hints from MDIO bus drivers

2017-04-12 Thread Florian Fainelli

On 04/11/2017 04:23 PM, Florian Fainelli wrote:
> On 04/11/2017 04:14 PM, Andrew Lunn wrote:
>>> To give some more background and rationale for this change.
>>>
>>> On a platform where we have a parent MDIO bus, backed by the
>>> mdio-bcm-unimac.c driver, we also register a slave MII bus (through
>>> net/dsa/dsa2.c) which is parented to this UniMAC MDIO bus through an
>>> assignment of of_node. This slave MII bus is created in order to
>>> intercept reads/writes to problematic addresses (e.g: that clashes with
>>> another piece of hardware).
>>>
>>> This means that the slave DSA MII bus inherits all child nodes from the
>>> originating master MII bus. This also means that when the slave MII bus
>>> is probed via of_mdiobus_register(), we probe the same devices twice:
>>> once through the master, another time through the slave.
>>
>> Ah, O.K. This makes more sense. On the hardware i have, we get three
>> deep in MDIO busses. We have the FEC mdio bus. On top of that we have
>> a gpio-mux-mdio, and on top of that we have the mv88e6xxx mdio
>> bus. And i've never seen issues.
>>
>> So your real problem here is you have two mdio busses using the same
>> device tree properties. I would actually say that is just plain
>> broken.
> 
> From a Device Tree/HW representation perspective, we do have the
> external BCM53125 switch physically attached to the 7445/7278
> SWITCH_MDIO bus (backed by mdio-bcm-unimac) so in that regard the
> representation is correct. There is also an integrated Gigabit PHY
> (bcm7xxx) which is attached to that bus.
> 
> From a SW perspective though, we want to talk to the integrated Gigabit
> PHY using mdio-bcm-unimac but talk to the external BCM53125 switch using
> the slave MII bus created by the bcm_sf2 driver in order to create an
> isolation. We need to inherit some of the parent (mdio-bcm-unimac) child
> DT nodes (such as the BCM53125), but not the GPHY. The easiest solution
> I found was to use this patch.
> 
> Using mdiobus_register() instead of of_mdiobus_register() was
> considered, but then, the child BCM53125 has no more "visibility" into
> the OF world at all, and it matters, because this switch is also driven
> via a DSA switch driver and its Ethernet data-path is connected to one
> port of the bcm_sf2 switch.
> 
> Thankfully the HW bug was fixed eventually ;)

In fact, all I need is to flag the internal Gigabit PHY for the slave
MII bus node with something that makes it appear as "disabled" which I
can presumably do with of_update_property() and putting a status =
"disabled" property in there. Let me do something like that and see how
big of a hack this becomes.
-- 
Florian
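
The idea of hiding the internal PHY from the slave bus by marking it "disabled" can be sketched in user space as follows. This is only a model under stated assumptions: struct node_model, node_is_available() and bus_probe() are hypothetical names, each bus is given its own copy of the child list, and the availability rule follows the usual device tree convention (a missing status, "okay" or "ok" means available). It does not call of_update_property() or any other kernel API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a child node on an MDIO bus. */
struct node_model {
	const char *name;
	const char *status;	/* NULL means the status property is absent */
};

/* A node is available if status is absent, "okay" or "ok". */
bool node_is_available(const struct node_model *n)
{
	if (!n->status)
		return true;
	return strcmp(n->status, "okay") == 0 || strcmp(n->status, "ok") == 0;
}

/* Count the children a bus scan would probe, skipping unavailable ones. */
size_t bus_probe(const struct node_model *children, size_t count)
{
	size_t probed = 0;

	for (size_t i = 0; i < count; i++)
		if (node_is_available(&children[i]))
			probed++;
	return probed;
}
```

In this model, a master bus whose two children both lack a status property probes both of them, while a slave bus whose copy of the GPHY node carries status = "disabled" probes only the switch.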

Re: [PATCH v2 00/10] ftgmac100: Rework batch 4 - Misc

On Wed, 2017-04-12 at 10:19 -0400, David Miller wrote:
> 
> > v2 Fixes patch 1/10 (NETIF_F_HW_CSUM conversion)
> > 
> > The next (and last) batch will add a few more "features" such
> > as netpoll, multicast/promisc, vlan offload...
> > 
> 
> Series applied, thanks Benjamin.
> 
> I really like how you use the reset task to implement ring resizing.

Thanks :-)

I didn't see the point of doing it synchronously as it's not "urgent"
and this keeps the code a lot simpler.

Cheers,
Ben.
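
The "resize through the reset task" pattern discussed above can be sketched in user space like this. It is not the ftgmac100 code: struct nic_model, request_ring_size() and reset_task() are hypothetical names, and the reset is called directly here where a driver would defer it to a worker.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of a NIC with a resizable descriptor ring. */
struct nic_model {
	unsigned int ring_size;		/* size of the ring currently in use */
	unsigned int new_ring_size;	/* requested size, applied on reset */
	int *ring;			/* stand-in for the descriptor ring */
	bool reset_pending;
};

/* Request path: record the new size and schedule a reset, nothing else. */
int request_ring_size(struct nic_model *nic, unsigned int entries)
{
	if (entries == 0)
		return -1;
	nic->new_ring_size = entries;
	nic->reset_pending = true;
	return 0;
}

/*
 * Deferred reset: allocate the new ring first, then replace the old one.
 * On allocation failure the old ring stays in place and the reset
 * remains pending. Returns 0 on success, -1 on failure.
 */
int reset_task(struct nic_model *nic)
{
	int *ring;

	if (!nic->reset_pending)
		return 0;
	ring = calloc(nic->new_ring_size, sizeof(*ring));
	if (!ring)
		return -1;
	free(nic->ring);
	nic->ring = ring;
	nic->ring_size = nic->new_ring_size;
	nic->reset_pending = false;
	return 0;
}
```

The request path never touches the live ring, so all teardown and reallocation logic lives in one place, the reset path, which is what keeps the code simpler.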

Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-12 Thread Stephen Hemminger

On Wed, 12 Apr 2017 14:54:15 -0400 (EDT)
David Miller wrote:

> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index b0aa089..071a58b 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1891,9 +1891,17 @@ struct net_device {
>   struct lock_class_key   *qdisc_tx_busylock;
>   struct lock_class_key   *qdisc_running_key;
>   bool                    proto_down;
> + struct bpf_prog __rcu   *xdp_prog;

It would be good if all devices could reuse this for the xdp_prog pointer.
That would allow it to be used by introspection utility functions in the future.

[Patch] nsfs: mark dentry with DCACHE_RCUACCESS