date:20160828

Has the fix for softirq landed?

2016-08-28 Thread Jesper Dangaard Brouer

Hi Eric and Paolo,

Back in May you discovered (IMHO) a serious bug in softirq.  It even got
covered by LWN.net[1]. I've looked at the most recent git trees (linus
and net-next), and cannot see any changes that fix this.
What is the progress in this area?

LWN.net: "Threadable NAPI polling, softirqs, and proper fixes"
 [1] http://lwn.net/Articles/687617/

Eric's proposed fix: http://lwn.net/Articles/687631/

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

Re: [PATCH net iproute2] devlink: Add e-switch support

2016-08-28 Thread Jiri Pirko

Sun, Aug 28, 2016 at 03:35:21PM CEST, ogerl...@mellanox.com wrote:
>Implement kernel devlink e-switch interface. Currently we allow
>to get and set the device e-switch mode.
>
>Signed-off-by: Or Gerlitz 
>Signed-off-by: Roi Dayan 

Acked-by: Jiri Pirko

[GIT] Networking

2016-08-28 Thread David Miller


1) Segregate namespaces properly in conntrack dumps, from Liping
   Zhang.

2) tcp listener refcount fix in netfilter tproxy, from Eric
   Dumazet.

3) Fix timeouts in qed driver due to xmit_more, from Yuval Mintz.

4) Fix use-after-free in tcp_xmit_retransmit_queue().

5) Userspace header fixups (use of __u32, missing includes, etc.)
   from Mikko Rapeli.

6) Further refinements to fragmentation wrt. gso and tunnels, from
   Shmulik Ladkani.

7) Trigger poll correctly for zero length UDP packets, from Eric
   Dumazet.

8) TCP window scaling fix, also from Eric Dumazet.

9) SLAB_DESTROY_BY_RCU is not relevant any more for UDP sockets.

10) Module refcount leak in qdisc_create_dflt(), from Eric Dumazet.

11) Fix deadlock in cp_rx_poll() of 8139cp driver, from Gao Feng.

12) Memory leak in rhashtable's alloc_bucket_locks(), from Eric
    Dumazet.

13) Add new device ID to alx driver, from Owen Lin.

Please pull, thanks a lot!

The following changes since commit 184ca823481c99dadd7d946e5afd4bb921eab30d:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2016-08-17 17:26:58 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 

for you to fetch changes up to b99b43bb4bdf1d361f7487cf03d803082bbf9101:

  Add Killer E2500 device ID in alx driver. (2016-08-29 00:23:50 -0400)


Alexander Duyck (1):
  ixgbe: Do not clear RAR entry when clearing VMDq for SAN MAC

Amir Vadai (1):
  net/mlx5: Update last-use statistics for flow rules

Andrew Rybchenko (1):
  sfc: fix potential stack corruption from running past stat bitmask

Anjali Singhai Jain (1):
  i40e: Change some init flow for the client

Colin Ian King (2):
  net: tehuti: fix typo: "eneble" -> "enable"
  net: hns: dereference ppe_cb->ppe_common_cb if it is non-null

Daniel Borkmann (1):
  Bluetooth: split sk_filter in l2cap_sock_recv_cb

Daniel Romell (1):
  net: xilinx: emaclite: Fallback to random MAC address.

David Ahern (1):
  net: diag: Fix refcnt leak in error path destroying socket

David Daney (1):
  net: thunderx: Fix OOPs with ethtool --register-dump

David S. Miller (5):
  Merge git://git.kernel.org/.../pablo/nf
  Merge branch 'kaweth-oopses'
  Merge branch 'mlx5-fixes'
  Merge branch 'for-upstream' of git://git.kernel.org/.../bluetooth/bluetooth
  Merge branch 'mlx5-series'

Eran Ben Elisha (2):
  net/mlx5e: Fix ethtool -g/G rx ring parameter report with striding RQ
  net/mlx5: Add error prints when validate ETS failed

Eric Dumazet (7):
  netfilter: tproxy: properly refcount tcp listeners
  tcp: fix use after free in tcp_xmit_retransmit_queue()
  udp: fix poll() issue with zero sized packets
  tcp: properly scale window in tcp_v[46]_reqsk_send_ack()
  udp: get rid of SLAB_DESTROY_BY_RCU allocations
  qdisc: fix a module refcount leak in qdisc_create_dflt()
  rhashtable: fix a memory leak in alloc_bucket_locks()

Fabio Estevam (1):
  net: lpc_eth: Check clk_prepare_enable() error

Florian Fainelli (2):
  net: dsa: bcm_sf2: Fix race condition while unmasking interrupts
  Documentation: networking: dsa: Remove platform device TODO

Frederic Dalleau (1):
  Bluetooth: Fix memory leak at end of hci requests

Gao Feng (2):
  l2tp: Fix the connect status check in pppol2tp_getname
  8139cp: Fix one possible deadloop in cp_rx_poll

Hadar Hen Zion (2):
  net/mlx5e: Use correct flow dissector key on flower offloading
  net/mlx5e: Retrieve the switchdev id from the firmware only once

Hariprasad Shenai (1):
  cxgb4: Fixes resource allocation for ULD's in kdump kernel

Ido Schimmel (1):
  mlxsw: spectrum: Add missing flood to router port

Jamal Hadi Salim (1):
  net sched: fix encoding to use real length

Jamie Lentin (1):
  net: mv88e6xxx: Fix ingress rate removal for mv6131 chips

Jiri Pirko (2):
  mlxsw: spectrum_buffers: Fix pool value handling in mlxsw_sp_sb_tc_pool_bind_set
  team: loadbalance: push lacpdus to exact delivery

Kamal Heib (1):
  net/mlx5e: Fix memory leak if refreshing TIRs fails

Lance Richardson (1):
  sctp: fix overrun in sctp_diag_dump_one()

Liping Zhang (5):
  netfilter: conntrack: do not dump other netns's conntrack entries via proc
  netfilter: nfnetlink_log: add "nf-logger-3-1" module alias name
  netfilter: nfnetlink_acct: report overquota to the right netns
  netfilter: nfnetlink_acct: fix race between nfacct del and xt_nfacct destroy
  netfilter: cttimeout: fix use after free error when delete netns

Luiz Augusto von Dentz (2):
  Bluetooth: Fix bt_sock_recvmsg when MSG_TRUNC is not set
  Bluetooth: Fix hci_sock_recvmsg when MSG_TRUNC is not set

Maor Gottlieb (1):
  net/mlx5: Increase number of ethtool steering priorities

Marcelo Ricardo Leitner (1):
  sctp: linearize early if it's not GSO

Mike M

Re: [net-next PATCH] e1000: add initial XDP support

2016-08-28 Thread John Fastabend

On 16-08-28 08:56 AM, William Tu wrote:
> Hi,
> 
> Reading through the patch, I found some minor typos below.
> 
> On Sat, Aug 27, 2016 at 12:11 AM, John Fastabend
>  wrote:
>> From: Alexei Starovoitov 
>>
>> This patch adds initial support for XDP on e1000 driver. Note e1000
>> driver does not support page recycling in general which could be
>> added as a further improvement. However for XDP_DROP and XDP_XMIT
> 
> I think you mean XDP_PASS instead of XDP_XMIT?
> 

I really meant XDP_TX but see Or's note and next revision will have
XDP_DROP only here.

>> the xdp code paths will recycle pages.
>>
>> This patch includes the rcu_read_lock/rcu_read_unlock pair noted by
>> Brenden Blanco in another pending patch.
>>
>>   net/mlx4_en: protect ring->xdp_prog with rcu_read_lock
>>
>> CC: William Tu 
>> Signed-off-by: Alexei Starovoitov 
>> Signed-off-by: John Fastabend 
>> ---
>>  drivers/net/ethernet/intel/e1000/e1000.h  |1
>>  drivers/net/ethernet/intel/e1000/e1000_main.c |  168 
>> -
>>  2 files changed, 165 insertions(+), 4 deletions(-)
>>
>> +static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
>> +unsigned int len,
>> +struct net_device *netdev,
>> +struct e1000_adapter *adapter)
>> +{
>> +   struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
>> +   struct e1000_hw *hw = &adapter->hw;
>> +   struct e1000_tx_ring *tx_ring;
>> +
>> +   if (len > E1000_MAX_DATA_PER_TXD)
>> +   return;
>> +
>> +   /* e1000 only support a single txq at the moment so the queue is 
>> being
>> +* shared with stack. To support this requires locking to ensure the
>> +* stack and XPD are not running at the same time. Devices would
>> +* multiple queues should allocate a separate queue space.
>> +*/
> 
> XPD --> XDP
> Devices would --> with?

Yep typo.

> 
>> +   HARD_TX_LOCK(netdev, txq, smp_processor_id());
>> +
>> +   tx_ring = adapter->tx_ring;
>> +
>> +   if (E1000_DESC_UNUSED(tx_ring) < 2)
>> +   return;
>> +
>> +   e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len);
>> +
>> +   e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1);
>> +
>> +   writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt);
>> +   mmiowb();
>> +
>> +   HARD_TX_UNLOCK(netdev, txq);
>> +}
>> +
>>  #define NUM_REGS 38 /* 1 based count */
>>  static void e1000_regdump(struct e1000_adapter *adapter)
>>  {
>> @@ -4142,6 +4240,22 @@ static struct sk_buff *e1000_alloc_rx_skb(struct e1000_adapter *adapter,
>> return skb;
>>  }
>>
>> +static inline int e1000_call_bpf(struct bpf_prog *prog, void *data,
>> +unsigned int length)
>> +{
>> +   struct xdp_buff xdp;
>> +   int ret;
>> +
>> +   xdp.data = data;
>> +   xdp.data_end = data + length;
>> +
>> +   rcu_read_lock();
>> +   ret = BPF_PROG_RUN(prog, (void *)&xdp);
>> +   rcu_read_unlock();
>> +
>> +   return ret;
>> +}
>> +
>>  /**
>>   * e1000_clean_jumbo_rx_irq - Send received data up the network stack; 
>> legacy
>>   * @adapter: board private structure
>> @@ -4160,12 +4274,15 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
>> struct pci_dev *pdev = adapter->pdev;
>> struct e1000_rx_desc *rx_desc, *next_rxd;
>> struct e1000_rx_buffer *buffer_info, *next_buffer;
>> +   struct bpf_prog *prog;
>> u32 length;
>> unsigned int i;
>> int cleaned_count = 0;
>> bool cleaned = false;
>> unsigned int total_rx_bytes = 0, total_rx_packets = 0;
>>
>> +   rcu_read_lock(); /* rcu lock needed here to protect xdp programs */
>> +   prog = READ_ONCE(adapter->prog);
> 
> If having rcu_read_lock() here, do we still need another in e1000_call_bpf()?

nope good catch. Thanks for the review!

Re: [net-next PATCH] e1000: add initial XDP support

2016-08-28 Thread John Fastabend

On 16-08-27 10:55 PM, Or Gerlitz wrote:
> On Sat, Aug 27, 2016 at 10:11 AM, John Fastabend
>  wrote:
>> From: Alexei Starovoitov 
> 
>> This patch adds initial support for XDP on e1000 driver. Note e1000
>> driver does not support page recycling in general which could be
>> added as a further improvement. However for XDP_DROP and XDP_XMIT
>> the xdp code paths will recycle pages.
> 
>> @@ -4188,15 +4305,57 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
>> prefetch(next_rxd);
>>
>> next_buffer = &rx_ring->buffer_info[i];
>> -
> 
> nit, better to avoid random cleanups in a patch adding new (&& cool)
> functionality
> 

Yep thanks.

[...]

>> +   case XDP_TX:
>> +   dma_sync_single_for_device(&pdev->dev,
>> +  dma,
>> +  length,
>> +  DMA_TO_DEVICE);
>> +   e1000_xmit_raw_frame(buffer_info, length,
>> +netdev, adapter);
>> +   /* Fallthrough to re-use mappedg page after xmit */
> 
> Did you want to say "mapped"? wasn't sure what's the role of "g" @ the end

Yep but see below...

> 
>> +   case XDP_DROP:
>> +   default:
>> +   /* re-use mapped page. keep buffer_info->dma
>> +* as-is, so that 
>> e1000_alloc_jumbo_rx_buffers
>> +* only needs to put it back into rx ring
>> +*/
> 
> if we're on the XDP_TX pass, don't we need to actually see that frame
> has been xmitted
> before re using the page?
> 

Agreed this seems to be too ambitious in the XDP_TX case. Thanks for
the help. Unless Alexei has some reason why it works I'll go ahead and
consume the buffer here.

I think setting

+   bi->rxbuf.page = NULL;

at the end of the XDP_TX case should fix it but I'll test it again,

Thanks again I guess this is what I get for trying to push patches out
on Friday night.



>> +   total_rx_bytes += length;
>> +   total_rx_packets++;
>> +   goto next_desc;
>> +   }
>> +   }
>> +
>> dma_unmap_page(&pdev->dev, buffer_info->dma,
>>adapter->rx_buffer_len, DMA_FROM_DEVICE);
>> buffer_info->dma = 0;
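
The ownership question raised in this review can be modelled outside the kernel. The following is a minimal userspace sketch, not the e1000 code: every `demo_` name is hypothetical, and it only illustrates the idea that a page handed to the transmit path must stop being treated as reusable by the receive slot, while the page of a dropped frame can be recycled as-is.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the driver's types; not the e1000 structures. */
enum demo_verdict { DEMO_DROP, DEMO_TX };

struct demo_rx_buffer {
	void *page;		/* page currently owned by the RX slot, or NULL */
};

struct demo_tx_ring {
	void *in_flight;	/* page handed over to the transmit side */
};

/* Returns 1 if the RX slot may recycle its page, 0 if it must be refilled. */
int demo_handle_verdict(enum demo_verdict v,
			struct demo_rx_buffer *rx,
			struct demo_tx_ring *tx)
{
	if (v == DEMO_TX) {
		/* Transmit owns the page until the frame has gone out. */
		tx->in_flight = rx->page;
		rx->page = NULL;	/* the refill path must allocate a new page */
		return 0;
	}

	/* DROP: nobody else references the page, so keep it mapped. */
	return 1;
}
```

In this model the receive slot loses its page on the transmit verdict, which corresponds to the "consume the buffer" option mentioned in the reply; recycling after transmit completion would need extra bookkeeping that the sketch deliberately leaves out.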

Re: [PATCH v3 net-next 1/1] net_sched: Introduce skbmod action

2016-08-28 Thread Cong Wang

On Sun, Aug 28, 2016 at 9:07 AM, Eric Dumazet  wrote:
>
> Adding an action with a spinlock held in fast path in 2016 is
> a way to tell people : It is a toy, do not use it for real.
>
> Sorry guys. Friends do not let friends do that anymore.
>

Please stop joking, this is not funny at all:

% git grep tcf_lock -- net/sched

[PATCH net v2 6/9] net: ethernet: mediatek: fix issue of driver removal with interface is up

2016-08-28 Thread sean.wang

From: Sean Wang 

mtk_stop() must be called first on module removal, to free the DMA
resources acquired by mtk_open() and to restore the state it changed.

Signed-off-by: Sean Wang 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 17dd2f8..1001317 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -1903,6 +1903,14 @@ err_free_dev:
 static int mtk_remove(struct platform_device *pdev)
 {
struct mtk_eth *eth = platform_get_drvdata(pdev);
+   int i;
+
+   /* stop all devices to make sure that dma is properly shut down */
+   for (i = 0; i < MTK_MAC_COUNT; i++) {
+   if (!eth->netdev[i])
+   continue;
+   mtk_stop(eth->netdev[i]);
+   }
 
clk_disable_unprepare(eth->clks[MTK_CLK_GP1]);
clk_disable_unprepare(eth->clks[MTK_CLK_GP2]);
-- 
1.9.1

[PATCH net v2 8/9] net: ethernet: mediatek: use devm_mdiobus_alloc instead of mdiobus_alloc inside mtk_mdio_init

2016-08-28 Thread sean.wang

From: Sean Wang 

Many parts of the driver already use devm_* APIs to benefit from
device resource management, so use devm_mdiobus_alloc instead of
mdiobus_alloc to get a cleaner code flow.

Using the common code provided by devm_* helps to
1) simplify the code flow, as [1] says
2) reduce the risk of incorrect manual error handling
3) promote the API: only a few drivers have used it since it was
proposed for Linux 3.16.

Signed-off-by: Sean Wang 

---
ref.
[1] https://patchwork.ozlabs.org/patch/344093/
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 23 ++-
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 85a527a..f741c6a 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -295,7 +295,7 @@ err_phy:
 static int mtk_mdio_init(struct mtk_eth *eth)
 {
struct device_node *mii_np;
-   int err;
+   int ret;
 
mii_np = of_get_child_by_name(eth->dev->of_node, "mdio-bus");
if (!mii_np) {
@@ -304,13 +304,13 @@ static int mtk_mdio_init(struct mtk_eth *eth)
}
 
if (!of_device_is_available(mii_np)) {
-   err = 0;
+   ret = 0;
goto err_put_node;
}
 
-   eth->mii_bus = mdiobus_alloc();
+   eth->mii_bus = devm_mdiobus_alloc(eth->dev);
if (!eth->mii_bus) {
-   err = -ENOMEM;
+   ret = -ENOMEM;
goto err_put_node;
}
 
@@ -321,20 +321,11 @@ static int mtk_mdio_init(struct mtk_eth *eth)
eth->mii_bus->parent = eth->dev;
 
snprintf(eth->mii_bus->id, MII_BUS_ID_SIZE, "%s", mii_np->name);
-   err = of_mdiobus_register(eth->mii_bus, mii_np);
-   if (err)
-   goto err_free_bus;
-   of_node_put(mii_np);
-
-   return 0;
-
-err_free_bus:
-   mdiobus_free(eth->mii_bus);
+   ret = of_mdiobus_register(eth->mii_bus, mii_np);
 
 err_put_node:
of_node_put(mii_np);
-   eth->mii_bus = NULL;
-   return err;
+   return ret;
 }
 
 static void mtk_mdio_cleanup(struct mtk_eth *eth)
@@ -343,8 +334,6 @@ static void mtk_mdio_cleanup(struct mtk_eth *eth)
return;
 
mdiobus_unregister(eth->mii_bus);
-   of_node_put(eth->mii_bus->dev.of_node);
-   mdiobus_free(eth->mii_bus);
 }
 
 static inline void mtk_irq_disable(struct mtk_eth *eth, u32 mask)
-- 
1.9.1
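
The single-exit shape that this patch gives mtk_mdio_init() can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the driver code: the `demo_` types and helpers are hypothetical, the bus allocation is left out because a managed (devm-style) allocation needs no explicit free, and returning -ENODEV for an unavailable node is an assumption chosen for the example.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted device-tree node. */
struct demo_node {
	int refs;
	int available;
};

struct demo_node *demo_node_get(struct demo_node *n)
{
	if (n)
		n->refs++;
	return n;
}

void demo_node_put(struct demo_node *n)
{
	n->refs--;
}

/* register_rc models the result of the bus registration step. */
int demo_mdio_init(struct demo_node *child, int register_rc)
{
	struct demo_node *np;
	int ret;

	np = demo_node_get(child);
	if (!np)
		return -ENODEV;		/* nothing was taken, nothing to drop */

	if (!np->available) {
		ret = -ENODEV;
		goto out_put_node;
	}

	ret = register_rc;		/* success and failure share one exit */

out_put_node:
	demo_node_put(np);
	return ret;
}
```

Because every path after the lookup leaves through the same label, the node reference is dropped exactly once whether registration succeeds or fails.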

[PATCH net v2 5/9] net: ethernet: mediatek: fix logic unbalance between probe and remove

2016-08-28 Thread sean.wang

From: Sean Wang 

The original mdio_cleanup call is not placed symmetrically to
mdio_init, so move it to the matching place in the remove path.

Signed-off-by: Sean Wang 
Acked-by: John Crispin 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 2c5754e..17dd2f8 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -1508,7 +1508,6 @@ static void mtk_uninit(struct net_device *dev)
struct mtk_eth *eth = mac->hw;
 
phy_disconnect(mac->phy_dev);
-   mtk_mdio_cleanup(eth);
mtk_irq_disable(eth, ~0);
 }
 
@@ -1913,6 +1912,7 @@ static int mtk_remove(struct platform_device *pdev)
netif_napi_del(&eth->tx_napi);
netif_napi_del(&eth->rx_napi);
mtk_cleanup(eth);
+   mtk_mdio_cleanup(eth);
 
return 0;
 }
-- 
1.9.1

[PATCH net v2 2/9] net: ethernet: mediatek: fix incorrect return value of devm_clk_get with EPROBE_DEFER

2016-08-28 Thread sean.wang

From: Sean Wang 

1) If the return value of devm_clk_get is EPROBE_DEFER, we should
defer probing the driver. The change is verified and works based
on 4.8-rc1 staying with the latest clk-next code for MT7623.
2) Use a loop over the clock names to check that all clocks are
available.

Signed-off-by: Sean Wang 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 39 -
 drivers/net/ethernet/mediatek/mtk_eth_soc.h | 22 ++--
 2 files changed, 36 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 6e4a6ca..ad4865c 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -50,6 +50,10 @@ static const struct mtk_ethtool_stats {
MTK_ETHTOOL_STAT(rx_flow_control_packets),
 };
 
+static const char * const mtk_clks_source_name[] = {
+   "ethif", "esw", "gp1", "gp2"
+};
+
 void mtk_w32(struct mtk_eth *eth, u32 val, unsigned reg)
 {
__raw_writel(val, eth->base + reg);
@@ -1811,6 +1815,7 @@ static int mtk_probe(struct platform_device *pdev)
if (!eth)
return -ENOMEM;
 
+   eth->dev = &pdev->dev;
eth->base = devm_ioremap_resource(&pdev->dev, res);
if (IS_ERR(eth->base))
return PTR_ERR(eth->base);
@@ -1845,21 +1850,21 @@ static int mtk_probe(struct platform_device *pdev)
return -ENXIO;
}
}
+   for (i = 0; i < ARRAY_SIZE(eth->clks); i++) {
+   eth->clks[i] = devm_clk_get(eth->dev,
+   mtk_clks_source_name[i]);
+   if (IS_ERR(eth->clks[i])) {
+   if (PTR_ERR(eth->clks[i]) == -EPROBE_DEFER)
+   return -EPROBE_DEFER;
+   return -ENODEV;
+   }
+   }
 
-   eth->clk_ethif = devm_clk_get(&pdev->dev, "ethif");
-   eth->clk_esw = devm_clk_get(&pdev->dev, "esw");
-   eth->clk_gp1 = devm_clk_get(&pdev->dev, "gp1");
-   eth->clk_gp2 = devm_clk_get(&pdev->dev, "gp2");
-   if (IS_ERR(eth->clk_esw) || IS_ERR(eth->clk_gp1) ||
-   IS_ERR(eth->clk_gp2) || IS_ERR(eth->clk_ethif))
-   return -ENODEV;
-
-   clk_prepare_enable(eth->clk_ethif);
-   clk_prepare_enable(eth->clk_esw);
-   clk_prepare_enable(eth->clk_gp1);
-   clk_prepare_enable(eth->clk_gp2);
+   clk_prepare_enable(eth->clks[MTK_CLK_ETHIF]);
+   clk_prepare_enable(eth->clks[MTK_CLK_ESW]);
+   clk_prepare_enable(eth->clks[MTK_CLK_GP1]);
+   clk_prepare_enable(eth->clks[MTK_CLK_GP2]);
 
-   eth->dev = &pdev->dev;
eth->msg_enable = netif_msg_init(mtk_msg_level, MTK_DEFAULT_MSG_ENABLE);
INIT_WORK(&eth->pending_work, mtk_pending_work);
 
@@ -1902,10 +1907,10 @@ static int mtk_remove(struct platform_device *pdev)
 {
struct mtk_eth *eth = platform_get_drvdata(pdev);
 
-   clk_disable_unprepare(eth->clk_ethif);
-   clk_disable_unprepare(eth->clk_esw);
-   clk_disable_unprepare(eth->clk_gp1);
-   clk_disable_unprepare(eth->clk_gp2);
+   clk_disable_unprepare(eth->clks[MTK_CLK_GP1]);
+   clk_disable_unprepare(eth->clks[MTK_CLK_GP2]);
+   clk_disable_unprepare(eth->clks[MTK_CLK_ESW]);
+   clk_disable_unprepare(eth->clks[MTK_CLK_ETHIF]);
 
netif_napi_del(&eth->tx_napi);
netif_napi_del(&eth->rx_napi);
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.h b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
index f82e3ac..6e1ade7 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.h
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
@@ -290,6 +290,17 @@ enum mtk_tx_flags {
MTK_TX_FLAGS_PAGE0  = 0x02,
 };
 
+/* This enum allows us to identify how the clock is defined on the array of the
+ * clock in the order
+ */
+enum mtk_clks_map {
+   MTK_CLK_ETHIF,
+   MTK_CLK_ESW,
+   MTK_CLK_GP1,
+   MTK_CLK_GP2,
+   MTK_CLK_MAX
+};
+
 /* struct mtk_tx_buf - This struct holds the pointers to the memory pointed at
  * by the TX descriptors
  * @skb:   The SKB pointer of the packet being sent
@@ -370,10 +381,7 @@ struct mtk_rx_ring {
  * @scratch_ring:  Newer SoCs need memory for a second HW managed TX ring
  * @phy_scratch_ring:  physical address of scratch_ring
  * @scratch_head:  The scratch memory that scratch_ring points to.
- * @clk_ethif: The ethif clock
- * @clk_esw:   The switch clock
- * @clk_gp1:   The gmac1 clock
- * @clk_gp2:   The gmac2 clock
+ * @clks:  clock array for all clocks required
  * @mii_bus:   If there is a bus we need to create an instance for it
  * @pending_work:  The workqueue used to reset the dma ring
  */
@@ -400,10 +408,8 @@ struct mtk_eth {
struct mtk_tx_dma   *scratch_ring;
dma_addr_t  phy_scratch_ring;
voi
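
The clock lookup loop from this patch can be modelled in userspace as follows. This is a hedged sketch rather than the driver code: the lookup callback and all `demo_` names are hypothetical, and only the control flow is taken from the patch, namely that a deferral is passed through unchanged while any other failure becomes -ENODEV.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define DEMO_EPROBE_DEFER 517	/* mirrors the kernel's EPROBE_DEFER value */
#define DEMO_CLK_MAX 4

static const char * const demo_clk_names[DEMO_CLK_MAX] = {
	"ethif", "esw", "gp1", "gp2"
};

/* Hypothetical lookup callback: returns 0 on success or a negative error. */
typedef int (*demo_clk_get_fn)(const char *name);

int demo_get_clocks(demo_clk_get_fn get)
{
	size_t i;

	for (i = 0; i < DEMO_CLK_MAX; i++) {
		int err = get(demo_clk_names[i]);

		if (err == -DEMO_EPROBE_DEFER)
			return -DEMO_EPROBE_DEFER;	/* ask for a later retry */
		if (err)
			return -ENODEV;			/* anything else is fatal */
	}
	return 0;
}

/* Three fake providers used to exercise the loop. */
int demo_all_ready(const char *name)
{
	(void)name;
	return 0;
}

int demo_gp1_deferred(const char *name)
{
	return strcmp(name, "gp1") == 0 ? -DEMO_EPROBE_DEFER : 0;
}

int demo_esw_missing(const char *name)
{
	return strcmp(name, "esw") == 0 ? -ENOENT : 0;
}
```

Calling `demo_get_clocks()` with each fake provider shows the three outcomes: success, a deferral that should trigger a later retry, and a hard failure.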

[PATCH net v2 3/9] net: ethernet: mediatek: fix API usage with skb_free_frag

2016-08-28 Thread sean.wang

From: Sean Wang 

use skb_free_frag() instead of legacy put_page()

Signed-off-by: Sean Wang 
Acked-by: John Crispin 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index ad4865c..518d987 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -868,7 +868,7 @@ static int mtk_poll_rx(struct napi_struct *napi, int budget,
/* receive data */
skb = build_skb(data, ring->frag_size);
if (unlikely(!skb)) {
-   put_page(virt_to_head_page(new_data));
+   skb_free_frag(new_data);
netdev->stats.rx_dropped++;
goto release_desc;
}
-- 
1.9.1

[PATCH net v2 7/9] net: ethernet: mediatek: fix the missing of_node_put() after node is used done inside mtk_mdio_init

2016-08-28 Thread sean.wang

From: Sean Wang 

This patch adds the missing of_node_put() after finishing the usage
of of_get_child_by_name.

Signed-off-by: Sean Wang 
Acked-by: John Crispin 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 1001317..85a527a 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -324,6 +324,7 @@ static int mtk_mdio_init(struct mtk_eth *eth)
err = of_mdiobus_register(eth->mii_bus, mii_np);
if (err)
goto err_free_bus;
+   of_node_put(mii_np);
 
return 0;
 
-- 
1.9.1

[PATCH net v2 4/9] net: ethernet: mediatek: remove redundant free_irq for devm_request_irq allocated irq

2016-08-28 Thread sean.wang

From: Sean Wang 

These IRQs are not shared and are disabled when the Ethernet device
stops. An IRQ requested with devm_request_irq is freed automatically
on driver detach, so the explicit free_irq calls are redundant.

Signed-off-by: Sean Wang 
Acked-by: John Crispin 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 518d987..2c5754e 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -1510,8 +1510,6 @@ static void mtk_uninit(struct net_device *dev)
phy_disconnect(mac->phy_dev);
mtk_mdio_cleanup(eth);
mtk_irq_disable(eth, ~0);
-   free_irq(eth->irq[1], dev);
-   free_irq(eth->irq[2], dev);
 }
 
 static int mtk_do_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
-- 
1.9.1

[PATCH net v2 0/9] net: ethernet: mediatek: a couple of fixes

2016-08-28 Thread sean.wang

From: Sean Wang 

A couple of fixes that came out of integrating with linux-4.8-rc1.
All of them are verified and work on linux-4.8-rc1.

changes since v1:
- use loops to check whether all required clocks are ready instead
of repetitive code
- remove redundant pinctrl setup that is already done by the core driver;
thanks to Andrew Lunn for the careful and patient review
- split distinct changes into separate patches
- rename the variable err to ret for readability

Sean Wang (9):
  net: ethernet: mediatek: fix fails from TX housekeeping due to
incorrect port setup
  net: ethernet: mediatek: fix incorrect return value of devm_clk_get
with EPROBE_DEFER
  net: ethernet: mediatek: fix API usage with skb_free_frag
  net: ethernet: mediatek: remove redundant free_irq for
devm_request_irq allocated irq
  net: ethernet: mediatek: fix logic unbalance between probe and remove
  net: ethernet: mediatek: fix issue of driver removal with interface is
up
  net: ethernet: mediatek: fix the missing of_node_put() after node is
used done inside mtk_mdio_init
  net: ethernet: mediatek: use devm_mdiobus_alloc instead of
mdiobus_alloc inside mtk_mdio_init
  net: ethernet: mediatek: fix error handling inside mtk_mdio_init

 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 82 +++--
 drivers/net/ethernet/mediatek/mtk_eth_soc.h | 22 +---
 2 files changed, 56 insertions(+), 48 deletions(-)

-- 
1.9.1

Re: [PATCH net-next V3 4/4] net/sched: Introduce act_tunnel_key

2016-08-28 Thread Cong Wang

On Fri, Aug 26, 2016 at 12:16 PM, Eric Dumazet  wrote:
> On Fri, 2016-08-26 at 11:26 -0700, Cong Wang wrote:
>> 1) Currently there are only a few actions using lockless, and they are
>> questionable, as we already discussed before, there could be some
>> race condition when you modify an existing action.
>
> There is no fundamental issue with a race condition.

For mirred action, maybe. As we already discussed, the more
complex an action is, the harder to make it lockless in your
way (that is, not using RCU)

>
> Sure, there are races, but they have no serious effect.
>
> Feel free to send a fix if you really have time to spare.

It's because the code is written by you?

I am surprised how you try to hide your own problem in
such a way...

>
>>
>> 2) We need to change the tc action API in order to fully support RCU,
>> which is what I have been working on these days. I should come up
>> with something next Monday (if not this weekend).
>>
>> So for this patchset, using spinlock is fine, just as many other actions.
>> I will take care of it later.
>
> This is _not_ fine.

OK, so where are your patches to make the rest actions
lockless?

>
> We are in 2016, not in 1995 anymore.
>

Fair enough, sounds like all actions are already lockless in
fast path now in 2016, you know this is not true...

> We are not adding a spinlock in a hot path unless absolutely needed.

If it is bug-free, yes, I am totally with you. I care about correctness
more than any performance.

>
> With multi queue NIC, this spinlock is going to hurt performance so much
> that this action wont be used by any serious user.

We have used mirred action even before you make it lockless.

>
> Here, it is absolutely trivial to use RCU and/or percpu counters.

Sounds like we don't need any API change, why not go ahead
and try it? Please do teach me how to modify an existing
action in a lockless way without changing any API (and of course
needs to be bug-free), I am very happy to learn your "trivial" way
to fix this, since I don't have any trivial fix.

Please, stop bullsh*t, show me your trivial code.

[PATCH net v2 1/9] net: ethernet: mediatek: fix fails from TX housekeeping due to incorrect port setup

2016-08-28 Thread sean.wang

From: Sean Wang 

Which net device a completed SKB belongs to is determined by the
forward port in txd4 of the corresponding TX descriptor, but that
field is not set for SKB fragments, which leads to a watchdog timeout
from the upper layer. Fix it by setting the port on fragments as well.

Signed-off-by: Sean Wang 
Acked-by: John Crispin 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 1801fd8..6e4a6ca 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -587,14 +587,15 @@ static int mtk_tx_map(struct sk_buff *skb, struct net_device *dev,
dma_addr_t mapped_addr;
unsigned int nr_frags;
int i, n_desc = 1;
-   u32 txd4 = 0;
+   u32 txd4 = 0, fport;
 
itxd = ring->next_free;
if (itxd == ring->last_free)
return -ENOMEM;
 
/* set the forward port */
-   txd4 |= (mac->id + 1) << TX_DMA_FPORT_SHIFT;
+   fport = (mac->id + 1) << TX_DMA_FPORT_SHIFT;
+   txd4 |= fport;
 
tx_buf = mtk_desc_to_tx_buf(ring, itxd);
memset(tx_buf, 0, sizeof(*tx_buf));
@@ -652,7 +653,7 @@ static int mtk_tx_map(struct sk_buff *skb, struct net_device *dev,
WRITE_ONCE(txd->txd3, (TX_DMA_SWC |
   TX_DMA_PLEN0(frag_map_size) |
   last_frag * TX_DMA_LS0));
-   WRITE_ONCE(txd->txd4, 0);
+   WRITE_ONCE(txd->txd4, fport);
 
tx_buf->skb = (struct sk_buff *)MTK_DMA_DUMMY_DESC;
tx_buf = mtk_desc_to_tx_buf(ring, txd);
-- 
1.9.1
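
The effect of this change can be shown with a small userspace model of the descriptor words. It is only a sketch: the `demo_` names are hypothetical, and the shift of 25 and the 3-bit mask are assumptions made for illustration, not values taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_FPORT_SHIFT 25	/* assumed bit position, illustration only */
#define DEMO_FPORT_MASK  0x7u	/* assumed field width, illustration only */

struct demo_txd {
	uint32_t txd4;
};

/* Stamp the forward port on the head descriptor and on every fragment. */
void demo_fill_descriptors(struct demo_txd *ring, size_t n_desc,
			   unsigned int mac_id, uint32_t head_flags)
{
	uint32_t fport = (uint32_t)(mac_id + 1) << DEMO_FPORT_SHIFT;
	size_t i;

	if (n_desc == 0)
		return;

	ring[0].txd4 = head_flags | fport;	/* head may carry extra flags */
	for (i = 1; i < n_desc; i++)
		ring[i].txd4 = fport;		/* the old code wrote 0 here */
}

/* Recover the port from any descriptor, as a completion path would. */
unsigned int demo_port_of(const struct demo_txd *d)
{
	return (d->txd4 >> DEMO_FPORT_SHIFT) & DEMO_FPORT_MASK;
}
```

With the port present in every descriptor, a completion handler that looks at a fragment descriptor resolves the same device as one that looks at the head.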

[PATCH net v2 9/9] net: ethernet: mediatek: fix error handling inside mtk_mdio_init

2016-08-28 Thread sean.wang

From: Sean Wang 

return -ENODEV if no child is found in MDIO bus.

Signed-off-by: Sean Wang 
Acked-by: John Crispin 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index f741c6a..e48b2a4 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -304,7 +304,7 @@ static int mtk_mdio_init(struct mtk_eth *eth)
}
 
if (!of_device_is_available(mii_np)) {
-   ret = 0;
+   ret = -ENODEV;
goto err_put_node;
}
 
-- 
1.9.1

Re: [PATCH 4/5] net_sched: fix use of uninitialized ethertype variable in cls_flower

2016-08-28 Thread David Miller

From: Arnd Bergmann 
Date: Fri, 26 Aug 2016 17:25:45 +0200

> The addition of VLAN support caused a possible use of uninitialized
> data if we encounter a zero TCA_FLOWER_KEY_ETH_TYPE key, as pointed
> out by "gcc -Wmaybe-uninitialized":
> 
> net/sched/cls_flower.c: In function 'fl_change':
> net/sched/cls_flower.c:366:22: error: 'ethertype' may be used uninitialized in this function [-Werror=maybe-uninitialized]
> 
> This changes the code to only set the ethertype field if it
> was nonzero, as before the patch.
> 
> Signed-off-by: Arnd Bergmann 
> Fixes: 9399ae9a6cb2 ("net_sched: flower: Add vlan support")

Applied.

Re: [PATCH 5/5] net/xgene: fix error handling during reset

2016-08-28 Thread David Miller

From: Arnd Bergmann 
Date: Fri, 26 Aug 2016 17:25:46 +0200

> The newly added reset logic uses helper functions for the MMIO that
> may fail. However, when the read operation fails, we end up writing
> back uninitialized data to the register, as gcc warns:
> 
> drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c: In function 'xgene_enet_link_state':
> drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c:213:2: error: 'data' may be used uninitialized in this function [-Werror=maybe-uninitialized]
> drivers/net/ethernet/apm/xgene/xgene_enet_xgmac.c:209:6: note: 'data' was declared here
>   u32 data;
> 
> We already print a warning to the console log if that happens,
> the best alternative that I can see is skip the rest of the reset
> sequence if the register value cannot be read: Most likely the
> write would fail as well, and if it succeeded, worse things could
> happen.
> 
> Signed-off-by: Arnd Bergmann 
> Fixes: 3eb7cb9dc946 ("drivers: net: xgene: XFI PCS reset when link is down")

Applied.
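
The approach described in the quoted commit message, skipping the write when the read fails, can be sketched in userspace like this. The accessors and the `demo_dev` structure are hypothetical stand-ins and do not reflect the xgene helper signatures.

```c
#include <stdint.h>

/* Hypothetical device model; a failed read leaves *val untouched. */
struct demo_dev {
	uint32_t reg;
	int read_fails;
	int writes;
};

int demo_read(struct demo_dev *d, uint32_t *val)
{
	if (d->read_fails)
		return -1;
	*val = d->reg;
	return 0;
}

void demo_write(struct demo_dev *d, uint32_t val)
{
	d->reg = val;
	d->writes++;
}

/* Read-modify-write that gives up when the read fails. */
int demo_set_reset_bit(struct demo_dev *d, uint32_t bit)
{
	uint32_t data;

	if (demo_read(d, &data))
		return -1;	/* data is uninitialized, so do not write it back */

	demo_write(d, data | bit);
	return 0;
}
```

The early return keeps an uninitialized value from ever reaching the register, which is the situation the compiler warning points at.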

[Patch net] kcm: fix a socket double free

2016-08-28 Thread Cong Wang

Dmitry reported a double free on kcm socket, which could
be easily reproduced by:

#include <unistd.h>
#include <sys/syscall.h>

int main()
{
  int fd = syscall(SYS_socket, 0x29ul, 0x5ul, 0x0ul, 0, 0, 0);
  syscall(SYS_ioctl, fd, 0x89e2ul, 0x20a98000ul, 0, 0, 0);
  return 0;
}

This is because on the error path, after we install
the new socket file, we call sock_release() to clean
up the socket, which leaves the fd pointing to a freed
socket. Fix this by calling sys_close() on that fd
directly.

Fixes: ab7ac4eb9832 ("kcm: Kernel Connection Multiplexor module")
Reported-by: Dmitry Vyukov 
Cc: Tom Herbert 
Signed-off-by: Cong Wang 
---
 net/kcm/kcmsock.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index cb39e05..4116932 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2029,7 +2030,7 @@ static int kcm_ioctl(struct socket *sock, unsigned int 
cmd, unsigned long arg)
if (copy_to_user((void __user *)arg, &info,
 sizeof(info))) {
err = -EFAULT;
-   sock_release(newsock);
+   sys_close(info.fd);
}
}
 
-- 
1.8.4.5
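
The ownership problem described above can be modelled with a toy fd table.
This is a hedged user-space sketch, not kernel code: toy_sock, toy_file,
toy_fd_install, toy_close and toy_exit_cleanup are hypothetical names, and
the release counter merely stands in for a free.

```c
#include <stddef.h>

/* Toy model: hypothetical types, not the kernel's struct socket/file. */
struct toy_sock { int release_calls; };
struct toy_file { struct toy_sock *sock; };

#define TOY_MAX_FDS 4
static struct toy_file *toy_fd_table[TOY_MAX_FDS];

/* Stand-in for releasing the socket; a second call models a double free. */
static void toy_sock_release(struct toy_sock *s)
{
	s->release_calls++;
}

/* Install the file in the fd table; from now on the table owns it. */
static int toy_fd_install(struct toy_file *f)
{
	for (int fd = 0; fd < TOY_MAX_FDS; fd++) {
		if (!toy_fd_table[fd]) {
			toy_fd_table[fd] = f;
			return fd;
		}
	}
	return -1;
}

/* Closing the fd clears the slot and releases the socket exactly once. */
static int toy_close(int fd)
{
	if (fd < 0 || fd >= TOY_MAX_FDS || !toy_fd_table[fd])
		return -1;
	toy_sock_release(toy_fd_table[fd]->sock);
	toy_fd_table[fd] = NULL;
	return 0;
}

/* What process exit does in this model: close whatever is still installed. */
static void toy_exit_cleanup(void)
{
	for (int fd = 0; fd < TOY_MAX_FDS; fd++)
		if (toy_fd_table[fd])
			toy_close(fd);
}
```

Releasing the socket directly after toy_fd_install() leaves the slot
populated, so toy_exit_cleanup() releases it a second time; going through
toy_close() releases it once and leaves nothing behind for exit.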

Re: [PATCH net-next] net: ethtool: add support for 1000BaseX and missing 10G link modes

2016-08-28 Thread David Miller

From: Vidya Sagar Ravipati 
Date: Fri, 26 Aug 2016 01:25:50 -0700

> From: Vidya Sagar Ravipati 
> 
> This patch enhances ethtool link mode bitmap to include
> missing interface modes for 1G/10G speeds
 ...
> Signed-off-by: Vidya Sagar Ravipati 

Applied.

Re: [RESEND PATCH net 06/10] net: ethernet: mediatek: fix the loss

2016-08-28 Thread Sean Wang

Date: Fri, 26 Aug 2016 16:17:59 +0200, Andrew Lunn wrote:
>> Hi Andrew,
>> 
>> Here pinctrl is used to setup what function the group of the pins is
>> for.
>
>Agreed.
> 
>> The group of the pins could be configured for the function provided 
>> by the SoC, such as general purpose I/O or specific function such as
>> ethernet depending on what products or boards you design for various 
>> customers or vendors. Thanks for device tree introducing, it is easy 
>> to find what resources the board needs including the pins usage is 
>> also defined here.
>
>All clear. However, if the ethernet driver has loaded, it means the
>device tree says the ethernet should be loaded, unless it happens to
>be on some discoverable bus. And so the device tree node for the
>ethernet should also contain the needed pinctrl properties.  The core
>driver code should have seen these properties and already enabled the
>correct pinctrl state before the driver probes.
>
>This is how every other driver works. Like i said, i don't think i've
>seen any other driver do its own pinctrl. So i just need a simple
>description, what is different here, why does this driver need to do
>it, when no other does?
>
>Andrew
>

You are right: everything I need from pinctrl is already done by the core
driver, as you said, so my patch is redundant and I will remove it from
the patch set.

Thanks for your patient and careful review; it also helps me become more
familiar with how the driver core handles pinctrl :)

Sean

Re: [PATCH net-next] amd-xgbe: Reset running devices after resume from hibernate

2016-08-28 Thread David Miller

From: James Morse 
Date: Fri, 26 Aug 2016 09:21:23 +0100

> After resume from hibernate on arm64, any amd-xgbe devices that were
> running when we hibernated are reported as down, even when they are not.
> 
> Re-plugging the cables does not cause the interface to come back; the
> link must be marked as down then up via 'ip link set' using the serial
> console.
> 
> This happens because the device has been power-cycled and possibly
> re-initialised by firmware, whereas the driver's memory structures have
> been restored from the hibernate image and the two do not agree.
> 
> Schedule a restart of the device after powerup in case the world changed
> while we were asleep.
> 
> Signed-off-by: James Morse 

Applied.

Re: [PATCH 1/1] net: add killer E2500 device id

2016-08-28 Thread David Miller

From: Owen Lin 
Date: Fri, 26 Aug 2016 05:55:45 +

> From 5a40989933c7dcd904bebd3c64eaf84445fad1fd Mon Sep 17 00:00:00 2001
> From: Owen Lin 
> Date: Fri, 26 Aug 2016 13:49:09 +0800
> Subject: [PATCH] Add Killer E2500 device ID in alx driver.

Applied, thanks.

Re: [PATCH net-next] tcp: add tcp_add_backlog()

2016-08-28 Thread David Miller

From: Eric Dumazet 
Date: Sat, 27 Aug 2016 07:37:54 -0700

> From: Eric Dumazet 
> 
> When TCP operates in lossy environments (between 1 and 10 % packet
> losses), many SACK blocks can be exchanged, and I noticed we could
> drop them on busy senders, if these SACK blocks have to be queued
> into the socket backlog.
> 
> While the main cause is the poor performance of RACK/SACK processing,
> we can try to avoid these drops of valuable information that can lead to
> spurious timeouts and retransmits.
> 
> The cause of the drops is the skb->truesize overestimation caused by:
> 
> - drivers allocating ~2048 (or more) bytes as a fragment to hold an
>   Ethernet frame.
> 
> - various pskb_may_pull() calls bringing the headers into skb->head
>   might have pulled all the frame content, but skb->truesize could
>   not be lowered, as the stack has no idea of each fragment truesize.
> 
> The backlog drops are also more visible on bidirectional flows, since
> their sk_rmem_alloc can be quite big.
> 
> Let's add some room for the backlog, as only the socket owner
> can selectively take action to lower memory needs, like collapsing
> receive queues or partial ofo pruning.
> 
> Signed-off-by: Eric Dumazet 

Really nice change, thanks Eric.
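
The idea of giving the backlog some extra room can be sketched as below.
This is a simplified model under assumptions: toy_sk and toy_add_backlog
are hypothetical, and the limit formula (receive buffer plus a
caller-supplied headroom) is illustrative only; the quoted message does
not state the exact limit used by the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-socket accounting, loosely modelled on the text above. */
struct toy_sk {
	uint32_t rcvbuf;      /* receive buffer limit, in bytes */
	uint32_t rmem_alloc;  /* bytes charged to the receive queue */
	uint32_t backlog_len; /* bytes charged to the backlog */
};

/*
 * Try to queue a packet of the given truesize on the backlog.
 * Returns true when the packet is dropped because the queues are
 * already over the limit, false when it was accepted.
 */
static bool toy_add_backlog(struct toy_sk *sk, uint32_t truesize,
			    uint32_t headroom)
{
	uint64_t limit  = (uint64_t)sk->rcvbuf + headroom;
	uint64_t queued = (uint64_t)sk->rmem_alloc + sk->backlog_len;

	if (queued > limit)
		return true; /* dropped */

	sk->backlog_len += truesize;
	return false;
}
```

With an overestimated truesize (say ~2048 bytes charged for a small SACK
packet), a socket whose receive queue is nearly full drops the packet when
headroom is 0 but accepts it once some headroom is added.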

Re: [PATCH v2] net: smc91x: fix SMC accesses

2016-08-28 Thread David Miller

From: Russell King 
Date: Sat, 27 Aug 2016 17:33:03 +0100

> Commit b70661c70830 ("net: smc91x: use run-time configuration on all ARM
> machines") broke some ARM platforms through several mistakes.  Firstly,
> the access size must correspond to the following rule:
> 
> (a) at least one of 16-bit or 8-bit access size must be supported
> (b) 32-bit accesses are optional, and may be enabled in addition to
> the above.
> 
> Secondly, it provides no emulation of 16-bit accesses, instead blindly
> making 16-bit accesses even when the platform specifies that only 8-bit
> is supported.
> 
> Reorganise smc91x.h so we can make use of the existing 16-bit access
> emulation already provided - if 16-bit accesses are supported, use
> 16-bit accesses directly, otherwise if 8-bit accesses are supported,
> use the provided 16-bit access emulation.  If neither, BUG().  This
> exactly reflects the driver behaviour prior to the commit being fixed.
> 
> Since the conversion incorrectly cut down the available access sizes on
> several platforms, we also need to go through every platform and fix up
> the overly-restrictive access size: Arnd assumed that if a platform can
> perform 32-bit, 16-bit and 8-bit accesses, then only a 32-bit access
> size needed to be specified - not so, all available access sizes must
> be specified.
> 
> This likely fixes some performance regressions in doing this: if a
> platform does not support 8-bit accesses, 8-bit accesses have been
> emulated by performing a 16-bit read-modify-write access.
> 
> Tested on the Intel Assabet/Neponset platform, which supports only 8-bit
> accesses, which was broken by the original commit.
> 
> Fixes: b70661c70830 ("net: smc91x: use run-time configuration on all ARM 
> machines")
> Signed-off-by: Russell King 

Applied and queued up for -stable, thanks Russell.
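
The access-size rule and the 16-bit emulation described in the quoted
message can be sketched as follows. The flag names, the little-endian byte
order, and the memory-backed register window are assumptions made for the
illustration; the real driver has its own constants and I/O accessors.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability flags; the real driver uses its own constants. */
#define TOY_USE_8BIT  0x1u
#define TOY_USE_16BIT 0x2u
#define TOY_USE_32BIT 0x4u

/* Rule (a): 8-bit or 16-bit must be present; rule (b): 32-bit is optional. */
static bool toy_access_flags_valid(unsigned flags)
{
	return (flags & (TOY_USE_8BIT | TOY_USE_16BIT)) != 0;
}

/* Register window backed by plain memory for the demo (little-endian). */
static uint8_t toy_read8(const uint8_t *base, unsigned off)
{
	return base[off];
}

static uint16_t toy_read16_native(const uint8_t *base, unsigned off)
{
	return (uint16_t)(base[off] | (base[off + 1] << 8));
}

/*
 * 16-bit read: direct if 16-bit accesses are supported, otherwise
 * emulated with two 8-bit reads. Returns false when neither size is
 * available (the point where the driver would BUG()).
 */
static bool toy_read16(const uint8_t *base, unsigned off, unsigned flags,
		       uint16_t *out)
{
	if (flags & TOY_USE_16BIT) {
		*out = toy_read16_native(base, off);
		return true;
	}
	if (flags & TOY_USE_8BIT) {
		*out = (uint16_t)(toy_read8(base, off) |
				  (toy_read8(base, off + 1) << 8));
		return true;
	}
	return false;
}
```

A platform that advertises only TOY_USE_32BIT fails the validity check,
which mirrors the point that all available access sizes must be specified
rather than only the widest one.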

Re: [PATCH] Documentation: networking: dsa: Remove platform device TODO

2016-08-28 Thread David Miller

From: Florian Fainelli 
Date: Sat, 27 Aug 2016 15:34:20 -0700

> Since commit 83c0afaec7b7 ("net: dsa: Add new binding implementation"),
> the shortcomings of the dsa platform device have been addressed, remove
> that TODO item.
> 
> Signed-off-by: Florian Fainelli 

Applied.

Re: [PATCH] wan/fsl_ucc_hdlc: fix spelling mistake "prameter" -> "parameter"

2016-08-28 Thread David Miller

From: Colin King 
Date: Sun, 28 Aug 2016 11:40:41 +0100

> From: Colin Ian King 
> 
> Trivial fix to spelling mistake in dev_err message.
> 
> Signed-off-by: Colin Ian King 

Applied.

Re: [PATCH] cxgb4/cxgb4vf: fix spelling mistake "provissioned" -> "provisioned"

2016-08-28 Thread David Miller

From: Colin King 
Date: Sun, 28 Aug 2016 12:07:02 +0100

> From: Colin Ian King 
> 
> Trivial fix to spelling mistake in dev_warn message.
> 
> Signed-off-by: Colin Ian King 

Applied.

Re: [PATCH] net: ucc_geth: fix spelling mistake "propperty" -> "property"

2016-08-28 Thread David Miller

From: Colin King 
Date: Sun, 28 Aug 2016 12:03:27 +0100

> From: Colin Ian King 
> 
> Trivial fix to spelling mistake in dev_warn message.
> 
> Signed-off-by: Colin Ian King 

Applied.

Re: [PATCH v3 0/5] meson: Meson8b and GXBB DWMAC glue driver

2016-08-28 Thread David Miller

From: Martin Blumenstingl 
Date: Sun, 28 Aug 2016 18:16:32 +0200

> This adds a DWMAC glue driver for the PRG_ETHERNET registers found in
> Meson8b and GXBB SoCs. Compared to the "old" meson6b-dwmac glue driver,
> the register layout is completely different, so I introduced a separate
> driver.
> 
> Changes since v2:
> - fixed unloading the glue driver when built as module. This pulls in a
>   patch from Joachim Eastwood (thanks) to get our private data structure
>   (bsp_priv).

This doesn't apply cleanly at all to the net-next tree, so I have
no idea where you expect these changes to be applied.

Re: [PATCH net-next 0/3] strp: Generalize stream parser to work with other socket types

2016-08-28 Thread David Miller

From: Tom Herbert 
Date: Sun, 28 Aug 2016 14:43:16 -0700

> Add a read_sock protocol operation function that allows something like
> tcp_read_sock to be called for other protocol types.
> 
> Specific changes in this patch set:
>   - Add read_sock function to proto_ops. This has the same signature as
> tcp_read_sock. sk_read_actor_t is also defined in net.h.
>   - Set peek_len and read_sock proto_op functions for TCPv4 and TCPv6
> stream ops.
>   - Remove references to tcp in strparser.
>   - Call peek_len and read_sock operations from strparser instead of
> calling TCP specific functions.

I'll apply this, but I want you to shore up these new ops.

A check has to happen somewhere to make sure the proto_ops in
question have a non-NULL read_sock and peek_len method before
starting to use it.
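
The kind of check being asked for might look roughly like this. It is a
user-space sketch with hypothetical names (toy_proto_ops, toy_strp_init);
where such a check belongs in the real code, and which error it returns,
is not specified in this thread.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical subset of a proto_ops-like table of optional methods. */
struct toy_proto_ops {
	int (*read_sock)(void *sk);
	int (*peek_len)(void *sk);
};

/*
 * Refuse to attach to a socket type that lacks either method, so the
 * parser never calls through a NULL function pointer later on.
 */
static int toy_strp_init(const struct toy_proto_ops *ops)
{
	if (!ops || !ops->read_sock || !ops->peek_len)
		return -EOPNOTSUPP;
	return 0;
}

/* Trivial method implementations used to build a complete ops table. */
static int toy_read_sock(void *sk) { (void)sk; return 0; }
static int toy_peek_len(void *sk)  { (void)sk; return 0; }
```

Doing the check once at setup time keeps the hot path free of NULL tests.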

Re: [PATCH net 0/9] Mellanox 100G mlx5 fixes 2016-08-29

2016-08-28 Thread David Miller

From: Saeed Mahameed 
Date: Mon, 29 Aug 2016 01:13:41 +0300

> This series contains some bug fixes for the mlx5 core and mlx5
> ethernet driver.
> 
> From Saeed, Fix UMR to consider hardware translation table field
> size limitation when calculating the maximum number of MTTs required
> by the driver.  Three patches to speed-up netdevice close time by
> serializing channel (SQs & RQs) destruction rather than issuing and
> waiting for hardware interrupts to free them.
> 
> From Eran, Fix ethtool ring parameter reporting for striding RQ layout.
> Add error prints on ETS validation failure.
> 
> From Kamal, Fix memory leak on error flow.
> 
> From Maor, Fix ethtool steering priorities number.

Series applied.

> For -stable of 4.7.y:
>   net/mlx5e: Limit UMR length to the device's limitation
>   net/mlx5e: Don't wait for RQ completions on close 
>   net/mlx5e: Don't post fragmented MPWQE when RQ is disabled
>   net/mlx5e: Don't wait for SQ completions on close
>   net/mlx5e: Add ethtool counter for TX xmit_more

Queued up, thanks.

Re: kcm: use-after-free in fput of kcm socket

2016-08-28 Thread Cong Wang

On Sun, Aug 28, 2016 at 3:10 AM, Dmitry Vyukov  wrote:
> Hello,
>
> The following program triggers use-after-free:
>
> // autogenerated by syzkaller (http://github.com/google/syzkaller)
> #include <unistd.h>
> #include <sys/syscall.h>
>
> int main()
> {
>   int fd = syscall(SYS_socket, 0x29ul, 0x5ul, 0x0ul, 0, 0, 0);
>   syscall(SYS_ioctl, fd, 0x89e2ul, 0x20a98000ul, 0, 0, 0);
>   return 0;
> }
>
>
> [  367.240184] 
> ==
> [  367.240784] BUG: KASAN: use-after-free in __fput+0x65a/0x780 at
> addr 880069bc4b30
> [  367.241034] Read of size 2 by task a.out/4045
> [  367.241034] CPU: 3 PID: 4045 Comm: a.out Not tainted 4.8.0-rc3+ #34
> [  367.241034] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS Bochs 01/01/2011
> [  367.241034]  884b8280 880038fb7bc0 82d1b1d9
> 00622e00
> [  367.241034]  fbfff1097050 88003e198900 880069bc4b00
> 880069bc4ec0
> [  367.241034]  880069bc4b30 859e90a0 880038fb7be8
> 817da1fc
> [  367.241034] Call Trace:
> [  367.241034]  [] dump_stack+0x12e/0x185
> [  367.241034]  [] ? sock_release+0x1d0/0x1d0
> [  367.241034]  [] kasan_object_err+0x1c/0x70
> [  367.241034]  [] kasan_report_error+0x1ae/0x490
> [  367.241034]  [] ? sock_release+0x1d0/0x1d0
> [  367.241034]  [] __asan_report_load2_noabort+0x3e/0x40
> [  367.241034]  [] ? __fput+0x65a/0x780
> [  367.241034]  [] __fput+0x65a/0x780
> [  367.241034]  [] fput+0x15/0x20
> [  367.241034]  [] task_work_run+0xf3/0x170
> [  367.241034]  [] do_exit+0x868/0x2c10
> [  367.241034]  [] ? sock_ioctl+0x1db/0x3d0
> [  367.241034]  [] ? sock_do_ioctl+0xb0/0xb0
> [  367.241034]  [] ? do_vfs_ioctl+0x430/0x1080
> [  367.241034]  [] ? mm_update_next_owner+0x640/0x640
> [  367.241034]  [] ? ioctl_preallocate+0x210/0x210
> [  367.241034]  [] ? bad_area+0x69/0x80
> [  367.241034]  [] ? exit_to_usermode_loop+0x3e/0x210
> [  367.241034]  [] ? entry_SYSCALL_64_fastpath+0x5/0xc1
> [  367.241034]  [] do_group_exit+0x108/0x330
> [  367.241034]  [] SyS_exit_group+0x1d/0x20
> [  367.241034]  [] entry_SYSCALL_64_fastpath+0x23/0xc1


Hmm, we have a double free here. I have a patch to fix it, will send it out
very soon.

Thanks!


> [  367.241034] Object at 880069bc4b00, in cache sock_inode_cache size: 960
> [  367.241034] Allocated:
> [  367.241034] PID = 4045
> [  367.241034]  [] save_stack_trace+0x26/0x50
> [  367.241034]  [] save_stack+0x46/0xd0
> [  367.241034]  [] kasan_kmalloc+0xad/0xe0
> [  367.241034]  [] kasan_slab_alloc+0x12/0x20
> [  367.241034]  [] kmem_cache_alloc+0x12b/0x710
> [  367.241034]  [] sock_alloc_inode+0x1d/0x250
> [  367.241034]  [] alloc_inode+0x61/0x180
> [  367.241034]  [] new_inode_pseudo+0x17/0xe0
> [  367.241034]  [] sock_alloc+0x41/0x280
> [  367.241034]  [] kcm_ioctl+0x9b3/0x13e0
> [  367.241034]  [] sock_do_ioctl+0x65/0xb0
> [  367.241034]  [] sock_ioctl+0x2d2/0x3d0
> [  367.241034]  [] do_vfs_ioctl+0x18c/0x1080
> [  367.241034]  [] SyS_ioctl+0x8f/0xc0
> [  367.241034]  [] entry_SYSCALL_64_fastpath+0x23/0xc1
> [  367.241034] Freed:
> [  367.241034] PID = 4045
> [  367.241034]  [] save_stack_trace+0x26/0x50
> [  367.241034]  [] save_stack+0x46/0xd0
> [  367.241034]  [] kasan_slab_free+0x72/0xc0
> [  367.241034]  [] kmem_cache_free+0x76/0x300
> [  367.241034]  [] sock_destroy_inode+0x56/0x70
> [  367.241034]  [] destroy_inode+0xc7/0x130
> [  367.241034]  [] evict+0x329/0x500
> [  367.241034]  [] iput+0x495/0x930
> [  367.241034]  [] sock_release+0x164/0x1d0
> [  367.241034]  [] sock_close+0x16/0x20
> [  367.241034]  [] __fput+0x236/0x780
> [  367.241034]  [] fput+0x15/0x20
> [  367.241034]  [] task_work_run+0xf3/0x170
> [  367.241034]  [] do_exit+0x868/0x2c10
> [  367.241034]  [] do_group_exit+0x108/0x330
> [  367.241034]  [] SyS_exit_group+0x1d/0x20
> [  367.241034]  [] entry_SYSCALL_64_fastpath+0x23/0xc1
> [  367.241034] Memory state around the buggy address:
> [  367.241034]  880069bc4a00: fc fc fc fc fc fc fc fc fc fc fc fc
> fc fc fc fc
> [  367.241034]  880069bc4a80: fc fc fc fc fc fc fc fc fc fc fc fc
> fc fc fc fc
> [  367.241034] >880069bc4b00: fb fb fb fb fb fb fb fb fb fb fb fb
> fb fb fb fb
> [  367.241034]  ^
> [  367.241034]  880069bc4b80: fb fb fb fb fb fb fb fb fb fb fb fb
> fb fb fb fb
> [  367.241034]  880069bc4c00: fb fb fb fb fb fb fb fb fb fb fb fb
> fb fb fb fb
> [  367.241034] 
> ==
>
>
> It is then followed by a bunch of other bugs, full log is here:
> https://gist.githubusercontent.com/dvyukov/b9884388bee40b792ae7900928358484/raw/ace2fa242468d584fa61bf753a5891faa71b0932/gistfile1.txt
>
>
> On commit 61c04572de404e52a655a36752e696bbcb483cf5 (Aug 25).

Re: [PATCH nf-next] netfilter: log: Check param to avoid overflow in nf_log_set

2016-08-28 Thread Feng Gao

On Sun, Aug 28, 2016 at 10:30 PM,   wrote:
> From: Gao Feng 
>
> nf_log_set() is an interface function, so it should strictly sanity-check
> its parameters. Add a sanity check for pf: it must be less than
> NFPROTO_NUMPROTO. Print an error when pf is invalid.
>
> Signed-off-by: Gao Feng 
> ---
>  net/netfilter/nf_log.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/net/netfilter/nf_log.c b/net/netfilter/nf_log.c
> index aa5847a..02ce0b9 100644
> --- a/net/netfilter/nf_log.c
> +++ b/net/netfilter/nf_log.c
> @@ -43,8 +43,10 @@ void nf_log_set(struct net *net, u_int8_t pf, const struct 
> nf_logger *logger)
>  {
> const struct nf_logger *log;
>
> -   if (pf == NFPROTO_UNSPEC)
> +   if (pf == NFPROTO_UNSPEC || pf >= NFPROTO_NUMPROTO) {
> +   pr_err("Wrong pf(%d) for nf log", pf);
> return;
> +   }
>
> mutex_lock(&nf_log_mutex);
> log = nft_log_dereference(net->nf.nf_loggers[pf]);
> --
> 1.9.1
>
>

BTW, another similar interface function, "nf_log_register", already checks
the sanity of param "pf".
So I think nf_log_set also needs to check whether "pf" exceeds the valid range.
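
The bounds check under discussion amounts to validating an index before
using it on a fixed-size table. A minimal sketch, assuming made-up
constants (TOY_PROTO_UNSPEC, TOY_PROTO_NUM) and a hypothetical
toy_log_set(); the kernel's NFPROTO_* values and logger types differ.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical values for illustration; the kernel defines its own. */
enum { TOY_PROTO_UNSPEC = 0, TOY_PROTO_NUM = 13 };

/* One logger slot per protocol family. */
static const char *toy_loggers[TOY_PROTO_NUM];

/*
 * Reject the "unspecified" family and anything at or beyond the table
 * size before indexing, so an out-of-range pf cannot overflow the array.
 */
static bool toy_log_set(uint8_t pf, const char *logger)
{
	if (pf == TOY_PROTO_UNSPEC || pf >= TOY_PROTO_NUM)
		return false;

	toy_loggers[pf] = logger;
	return true;
}
```

Because pf is an 8-bit value, it can be as large as 255, well past the end
of a 13-entry table, which is why the upper bound matters.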

[PATCH net 7/9] net/mlx5e: Fix memory leak if refreshing TIRs fails

2016-08-28 Thread Saeed Mahameed

From: Kamal Heib 

Free 'in' command object also when mlx5_core_modify_tir fails.

Fixes: 724b2aa15126 ("net/mlx5e: TIRs management refactoring")
Signed-off-by: Kamal Heib 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_common.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_common.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_common.c
index 673043c..9cce153 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_common.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_common.c
@@ -139,7 +139,7 @@ int mlx5e_refresh_tirs_self_loopback_enable(struct 
mlx5_core_dev *mdev)
struct mlx5e_tir *tir;
void *in;
int inlen;
-   int err;
+   int err = 0;
 
inlen = MLX5_ST_SZ_BYTES(modify_tir_in);
in = mlx5_vzalloc(inlen);
@@ -151,10 +151,11 @@ int mlx5e_refresh_tirs_self_loopback_enable(struct 
mlx5_core_dev *mdev)
list_for_each_entry(tir, &mdev->mlx5e_res.td.tirs_list, list) {
err = mlx5_core_modify_tir(mdev, tir->tirn, in, inlen);
if (err)
-   return err;
+   goto out;
}
 
+out:
kvfree(in);
 
-   return 0;
+   return err;
 }
-- 
2.7.4
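
The fix above is an instance of the single-exit cleanup idiom: every path
after the allocation leaves through one label that frees the buffer. A
user-space sketch of the same shape, with hypothetical helpers (toy_alloc,
toy_free, toy_modify, toy_refresh_all) and an allocation counter added
only so the leak can be observed:

```c
#include <errno.h>
#include <stdlib.h>

/* Number of buffers currently allocated; non-zero at the end means a leak. */
static int toy_live_allocs;

static void *toy_alloc(size_t n)
{
	void *p = calloc(1, n);

	if (p)
		toy_live_allocs++;
	return p;
}

static void toy_free(void *p)
{
	if (p) {
		toy_live_allocs--;
		free(p);
	}
}

/* Hypothetical per-item operation: fails on the item equal to fail_at. */
static int toy_modify(int item, int fail_at)
{
	return item == fail_at ? -EIO : 0;
}

/* Apply toy_modify() to every item; the buffer is freed on all paths. */
static int toy_refresh_all(int nitems, int fail_at)
{
	void *in;
	int err = 0;

	in = toy_alloc(64);
	if (!in)
		return -ENOMEM;

	for (int i = 0; i < nitems; i++) {
		err = toy_modify(i, fail_at);
		if (err)
			goto out; /* do not return directly: 'in' must be freed */
	}

out:
	toy_free(in);
	return err;
}
```

Initialising err to 0 matters for the case where the loop body never runs.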

[PATCH net 3/9] net/mlx5e: Don't post fragmented MPWQE when RQ is disabled

2016-08-28 Thread Saeed Mahameed

ICO (Internal control operations) SQ (Send Queue) is closed/disabled
after RQ (Receive Queue).  After RQ is closed an ICO SQ completion
might post a fragmented MPWQE (Multi Packet Work Queue Element) into
that RQ.

As on regular RQ post, check if we are allowed to post to that
RQ (RQ is enabled). Cleanup in-progress UMR MPWQE on mlx5e_free_rx_descs
if needed.

Fixes: bc77b240b3c5 ('net/mlx5e: Add fragmented memory support for RX multi 
packet WQE')
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 4 
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   | 6 ++
 2 files changed, 10 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 2463eba..e259eaa 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -534,6 +534,10 @@ static void mlx5e_free_rx_descs(struct mlx5e_rq *rq)
__be16 wqe_ix_be;
u16 wqe_ix;
 
+   /* UMR WQE (if in progress) is always at wq->head */
+   if (test_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state))
+   mlx5e_free_rx_fragmented_mpwqe(rq, &rq->wqe_info[wq->head]);
+
while (!mlx5_wq_ll_is_empty(wq)) {
wqe_ix_be = *wq->tail_next;
wqe_ix= be16_to_cpu(wqe_ix_be);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index fee1e47..b6f8ebb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -506,6 +506,12 @@ void mlx5e_post_rx_fragmented_mpwqe(struct mlx5e_rq *rq)
struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(wq, wq->head);
 
clear_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state);
+
+   if (unlikely(test_bit(MLX5E_RQ_STATE_FLUSH, &rq->state))) {
+   mlx5e_free_rx_fragmented_mpwqe(rq, &rq->wqe_info[wq->head]);
+   return;
+   }
+
mlx5_wq_ll_push(wq, be16_to_cpu(wqe->next.next_wqe_index));
rq->stats.mpwqe_frag++;
 
-- 
2.7.4
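
The check added in this patch follows a common shape: a late completion
handler consults a "flushing" flag and frees its resource instead of
posting it to a queue that is being torn down. A simplified sketch with
hypothetical names (toy_rq, TOY_RQ_FLUSH, toy_post_fragmented); counters
stand in for the real post and free operations.

```c
/* Hypothetical state bits, modelled on the flags named in the patch. */
#define TOY_RQ_FLUSH           (1u << 0)
#define TOY_RQ_UMR_IN_PROGRESS (1u << 1)

struct toy_rq {
	unsigned state;  /* bitmask of TOY_RQ_* flags */
	unsigned posted; /* elements posted to the queue */
	unsigned freed;  /* elements freed instead of posted */
};

/*
 * Completion handler for the in-progress element: clear the
 * in-progress bit, then either post the element or, if the queue is
 * being flushed, free it and return without posting.
 */
static void toy_post_fragmented(struct toy_rq *rq)
{
	rq->state &= ~TOY_RQ_UMR_IN_PROGRESS;

	if (rq->state & TOY_RQ_FLUSH) {
		rq->freed++;
		return;
	}

	rq->posted++;
}
```

In this single-threaded model plain bit operations are enough; the driver
uses atomic bit helpers because the flag is set from another context.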

[PATCH net 6/9] net/mlx5e: Add ethtool counter for TX xmit_more

2016-08-28 Thread Saeed Mahameed

From: Tariq Toukan 

Add a counter in ethtool for the number of times that
TX xmit_more was used.

Signed-off-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 1 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h | 4 
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c| 1 +
 3 files changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 297781a..2459c7f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -155,6 +155,7 @@ static void mlx5e_update_sw_counters(struct mlx5e_priv 
*priv)
s->tx_queue_stopped += sq_stats->stopped;
s->tx_queue_wake+= sq_stats->wake;
s->tx_queue_dropped += sq_stats->dropped;
+   s->tx_xmit_more += sq_stats->xmit_more;
s->tx_csum_partial_inner += 
sq_stats->csum_partial_inner;
tx_offload_none += sq_stats->csum_none;
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
index 7b9d8a9..499487c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
@@ -70,6 +70,7 @@ struct mlx5e_sw_stats {
u64 tx_queue_stopped;
u64 tx_queue_wake;
u64 tx_queue_dropped;
+   u64 tx_xmit_more;
u64 rx_wqe_err;
u64 rx_mpwqe_filler;
u64 rx_mpwqe_frag;
@@ -101,6 +102,7 @@ static const struct counter_desc sw_stats_desc[] = {
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_queue_stopped) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_queue_wake) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_queue_dropped) },
+   { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_xmit_more) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_wqe_err) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_mpwqe_filler) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_mpwqe_frag) },
@@ -298,6 +300,7 @@ struct mlx5e_sq_stats {
/* commonly accessed in data path */
u64 packets;
u64 bytes;
+   u64 xmit_more;
u64 tso_packets;
u64 tso_bytes;
u64 tso_inner_packets;
@@ -324,6 +327,7 @@ static const struct counter_desc sq_stats_desc[] = {
{ MLX5E_DECLARE_TX_STAT(struct mlx5e_sq_stats, stopped) },
{ MLX5E_DECLARE_TX_STAT(struct mlx5e_sq_stats, wake) },
{ MLX5E_DECLARE_TX_STAT(struct mlx5e_sq_stats, dropped) },
+   { MLX5E_DECLARE_TX_STAT(struct mlx5e_sq_stats, xmit_more) },
 };
 
 #define NUM_SW_COUNTERSARRAY_SIZE(sw_stats_desc)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 5f209ad..988eca9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -375,6 +375,7 @@ static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_sq *sq, 
struct sk_buff *skb)
 
sq->stats.packets++;
sq->stats.bytes += num_bytes;
+   sq->stats.xmit_more += skb->xmit_more;
return NETDEV_TX_OK;
 
 dma_unmap_wqe_err:
-- 
2.7.4

[PATCH net 2/9] net/mlx5e: Don't wait for RQ completions on close

2016-08-28 Thread Saeed Mahameed

This will significantly reduce receive queue flush time on interface
down.

Instead of asking the firmware to flush the RQ (Receive Queue) via
asynchronous completions when moved to error, we handle the RQ flush
manually (mlx5e_free_rx_descs), the same as we did when the RQ flush
timed out.

This will reduce RQ flush time and speed up the interface down procedure
(ifconfig down) from 6 sec to 0.3 sec on a 48-core system.

Moved mlx5e_free_rx_descs to en_main.c, where it is needed, to keep
en_rx.c free from non-critical data path code, for better code locality.

Fixes: 6cd392a082de ('net/mlx5e: Handle RQ flush in error cases')
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  4 +--
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 37 +++
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   | 23 ++
 3 files changed, 22 insertions(+), 42 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index d63a1b8..26a7ec7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -223,9 +223,8 @@ struct mlx5e_tstamp {
 };
 
 enum {
-   MLX5E_RQ_STATE_POST_WQES_ENABLE,
+   MLX5E_RQ_STATE_FLUSH,
MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS,
-   MLX5E_RQ_STATE_FLUSH_TIMEOUT,
MLX5E_RQ_STATE_AM,
 };
 
@@ -703,7 +702,6 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget);
 bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget);
 int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget);
 void mlx5e_free_tx_descs(struct mlx5e_sq *sq);
-void mlx5e_free_rx_descs(struct mlx5e_rq *rq);
 
 void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
 void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 65360b1..2463eba 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -431,7 +431,6 @@ static int mlx5e_enable_rq(struct mlx5e_rq *rq, struct 
mlx5e_rq_param *param)
 
MLX5_SET(rqc,  rqc, cqn,rq->cq.mcq.cqn);
MLX5_SET(rqc,  rqc, state,  MLX5_RQC_STATE_RST);
-   MLX5_SET(rqc,  rqc, flush_in_error_en,  1);
MLX5_SET(rqc,  rqc, vsd, priv->params.vlan_strip_disable);
MLX5_SET(wq,   wq,  log_wq_pg_sz,   rq->wq_ctrl.buf.page_shift -
MLX5_ADAPTER_PAGE_SHIFT);
@@ -528,6 +527,23 @@ static int mlx5e_wait_for_min_rx_wqes(struct mlx5e_rq *rq)
return -ETIMEDOUT;
 }
 
+static void mlx5e_free_rx_descs(struct mlx5e_rq *rq)
+{
+   struct mlx5_wq_ll *wq = &rq->wq;
+   struct mlx5e_rx_wqe *wqe;
+   __be16 wqe_ix_be;
+   u16 wqe_ix;
+
+   while (!mlx5_wq_ll_is_empty(wq)) {
+   wqe_ix_be = *wq->tail_next;
+   wqe_ix= be16_to_cpu(wqe_ix_be);
+   wqe   = mlx5_wq_ll_get_wqe(&rq->wq, wqe_ix);
+   rq->dealloc_wqe(rq, wqe_ix);
+   mlx5_wq_ll_pop(&rq->wq, wqe_ix_be,
+  &wqe->next.next_wqe_index);
+   }
+}
+
 static int mlx5e_open_rq(struct mlx5e_channel *c,
 struct mlx5e_rq_param *param,
 struct mlx5e_rq *rq)
@@ -551,8 +567,6 @@ static int mlx5e_open_rq(struct mlx5e_channel *c,
if (param->am_enabled)
set_bit(MLX5E_RQ_STATE_AM, &c->rq.state);
 
-   set_bit(MLX5E_RQ_STATE_POST_WQES_ENABLE, &rq->state);
-
sq->ico_wqe_info[pi].opcode = MLX5_OPCODE_NOP;
sq->ico_wqe_info[pi].num_wqebbs = 1;
mlx5e_send_nop(sq, true); /* trigger mlx5e_post_rx_wqes() */
@@ -569,23 +583,8 @@ err_destroy_rq:
 
 static void mlx5e_close_rq(struct mlx5e_rq *rq)
 {
-   int tout = 0;
-   int err;
-
-   clear_bit(MLX5E_RQ_STATE_POST_WQES_ENABLE, &rq->state);
+   set_bit(MLX5E_RQ_STATE_FLUSH, &rq->state);
napi_synchronize(&rq->channel->napi); /* prevent mlx5e_post_rx_wqes */
-
-   err = mlx5e_modify_rq_state(rq, MLX5_RQC_STATE_RDY, MLX5_RQC_STATE_ERR);
-   while (!mlx5_wq_ll_is_empty(&rq->wq) && !err &&
-  tout++ < MLX5_EN_QP_FLUSH_MAX_ITER)
-   msleep(MLX5_EN_QP_FLUSH_MSLEEP_QUANT);
-
-   if (err || tout == MLX5_EN_QP_FLUSH_MAX_ITER)
-   set_bit(MLX5E_RQ_STATE_FLUSH_TIMEOUT, &rq->state);
-
-   /* avoid destroying rq before mlx5e_poll_rx_cq() is done with it */
-   napi_synchronize(&rq->channel->napi);
-
cancel_work_sync(&rq->am.work);
 
mlx5e_disable_rq(rq);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index bdc9e33..fee1e47 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -595,26 +595,9 @@ void mlx5e_deall
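
The manual flush described in this patch boils down to walking the queue
and releasing each outstanding element in software rather than waiting for
the hardware to complete them. A toy version of that loop, using a plain
singly linked list and hypothetical names (toy_wqe, toy_wq,
toy_free_rx_descs); the driver's work queue layout and dealloc callback
are different.

```c
#include <stddef.h>

/* Hypothetical queue element; 'allocated' stands in for an RX buffer. */
struct toy_wqe {
	struct toy_wqe *next;
	int allocated;
};

struct toy_wq {
	struct toy_wqe *head;
};

/*
 * Pop every element still on the queue and release its buffer.
 * Returns the number of elements that were released.
 */
static unsigned toy_free_rx_descs(struct toy_wq *wq)
{
	unsigned n = 0;

	while (wq->head) {
		struct toy_wqe *wqe = wq->head;

		wq->head = wqe->next; /* pop from the queue */
		wqe->next = NULL;
		wqe->allocated = 0;   /* release the buffer */
		n++;
	}
	return n;
}
```

The loop finishes in time proportional to the queue length, with no
sleeping or polling for completions.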

[PATCH net 4/9] net/mlx5e: Don't wait for SQ completions on close

2016-08-28 Thread Saeed Mahameed

Instead of asking the firmware to flush the SQ (Send Queue) via
asynchronous completions when moved to error, we handle the SQ flush
manually (mlx5e_free_tx_descs), the same as we did when the SQ flush
timed out or on tx_timeout.

This will reduce SQ flush time and speed up the interface down procedure.

Moved mlx5e_free_tx_descs to the end of en_tx.c for tx
critical code locality.

Fixes: 29429f3300a3 ('net/mlx5e: Timeout if SQ doesn't flush during close')
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  3 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 38 ++---
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c   | 67 +++
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c |  6 +-
 4 files changed, 44 insertions(+), 70 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 26a7ec7..bf722aa 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -369,9 +369,8 @@ struct mlx5e_sq_dma {
 };
 
 enum {
-   MLX5E_SQ_STATE_WAKE_TXQ_ENABLE,
+   MLX5E_SQ_STATE_FLUSH,
MLX5E_SQ_STATE_BF_ENABLE,
-   MLX5E_SQ_STATE_TX_TIMEOUT,
 };
 
 struct mlx5e_ico_wqe_info {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index e259eaa..297781a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -39,13 +39,6 @@
 #include "eswitch.h"
 #include "vxlan.h"
 
-enum {
-   MLX5_EN_QP_FLUSH_TIMEOUT_MS = 5000,
-   MLX5_EN_QP_FLUSH_MSLEEP_QUANT   = 20,
-   MLX5_EN_QP_FLUSH_MAX_ITER   = MLX5_EN_QP_FLUSH_TIMEOUT_MS /
- MLX5_EN_QP_FLUSH_MSLEEP_QUANT,
-};
-
 struct mlx5e_rq_param {
u32 rqc[MLX5_ST_SZ_DW(rqc)];
struct mlx5_wq_paramwq;
@@ -827,7 +820,6 @@ static int mlx5e_open_sq(struct mlx5e_channel *c,
goto err_disable_sq;
 
if (sq->txq) {
-   set_bit(MLX5E_SQ_STATE_WAKE_TXQ_ENABLE, &sq->state);
netdev_tx_reset_queue(sq->txq);
netif_tx_start_queue(sq->txq);
}
@@ -851,38 +843,20 @@ static inline void netif_tx_disable_queue(struct 
netdev_queue *txq)
 
 static void mlx5e_close_sq(struct mlx5e_sq *sq)
 {
-   int tout = 0;
-   int err;
+   set_bit(MLX5E_SQ_STATE_FLUSH, &sq->state);
+   /* prevent netif_tx_wake_queue */
+   napi_synchronize(&sq->channel->napi);
 
if (sq->txq) {
-   clear_bit(MLX5E_SQ_STATE_WAKE_TXQ_ENABLE, &sq->state);
-   /* prevent netif_tx_wake_queue */
-   napi_synchronize(&sq->channel->napi);
netif_tx_disable_queue(sq->txq);
 
-   /* ensure hw is notified of all pending wqes */
+   /* last doorbell out, godspeed .. */
if (mlx5e_sq_has_room_for(sq, 1))
mlx5e_send_nop(sq, true);
-
-   err = mlx5e_modify_sq(sq, MLX5_SQC_STATE_RDY,
- MLX5_SQC_STATE_ERR, false, 0);
-   if (err)
-   set_bit(MLX5E_SQ_STATE_TX_TIMEOUT, &sq->state);
}
 
-   /* wait till sq is empty, unless a TX timeout occurred on this SQ */
-   while (sq->cc != sq->pc &&
-  !test_bit(MLX5E_SQ_STATE_TX_TIMEOUT, &sq->state)) {
-   msleep(MLX5_EN_QP_FLUSH_MSLEEP_QUANT);
-   if (tout++ > MLX5_EN_QP_FLUSH_MAX_ITER)
-   set_bit(MLX5E_SQ_STATE_TX_TIMEOUT, &sq->state);
-   }
-
-   /* avoid destroying sq before mlx5e_poll_tx_cq() is done with it */
-   napi_synchronize(&sq->channel->napi);
-
-   mlx5e_free_tx_descs(sq);
mlx5e_disable_sq(sq);
+   mlx5e_free_tx_descs(sq);
mlx5e_destroy_sq(sq);
 }
 
@@ -2802,7 +2776,7 @@ static void mlx5e_tx_timeout(struct net_device *dev)
if (!netif_xmit_stopped(netdev_get_tx_queue(dev, i)))
continue;
sched_work = true;
-   set_bit(MLX5E_SQ_STATE_TX_TIMEOUT, &sq->state);
+   set_bit(MLX5E_SQ_STATE_FLUSH, &sq->state);
netdev_err(dev, "TX timeout on queue: %d, SQ: 0x%x, CQ: 0x%x, 
SQ Cons: 0x%x SQ Prod: 0x%x\n",
   i, sq->sqn, sq->cq.mcq.cqn, sq->cc, sq->pc);
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index e073bf59..5f209ad 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -394,35 +394,6 @@ netdev_tx_t mlx5e_xmit(struct sk_buff *skb, struct 
net_device *dev)
return mlx5e_sq_xmit(sq, skb);
 }
 
-void mlx5e_free_tx_descs(struct mlx5e_sq *sq)
-{
-   struct mlx5e_tx_wqe_info *wi;
-   struct sk_buff *skb;
-   u16 ci;
-   int i;
-
-

[PATCH net 1/9] net/mlx5e: Limit UMR length to the device's limitation

2016-08-28 Thread Saeed Mahameed

The ConnectX-4 UMR (User Memory Region) MTT translation table offset in the
WQE is limited to U16_MAX. Before this patch we ignored that limitation and
requested the maximum possible UMR translation length that the netdev
might need (MAX channels * MAX pages per channel).
On a system with #cores > 32, when linear WQE allocation fails, falling
back to UMR WQEs causes the RQ (Receive Queue) to get stuck.

Here we limit the UMR length to min(U16_MAX, max required pages) (while
considering the required alignments) on driver load. By default U16_MAX is
sufficient, since the default RX rings value guarantees that we are in
range. Dynamically (on set_ringparam/set_channels) we check whether the
new required UMR length (num mtts) is still in range and, if not, fail the
request.
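
A minimal userspace sketch of the range check described above. It mirrors the arithmetic of the MLX5_MTT_OCTW, MLX5E_REQUIRED_MTTS and MLX5E_VALID_NUM_MTTS macros from the diff below. PAGES_PER_WQE is an assumed value picked for illustration, the function names are hypothetical, and 64-bit arithmetic is used here only so the sketch cannot overflow.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch only: pages per multi-packet WQE. */
#define PAGES_PER_WQE 32u

/* Round n up to a multiple of 8. */
static uint64_t align8(uint64_t n)
{
    return (n + 7) & ~(uint64_t)7;
}

/* Same arithmetic as MLX5_MTT_OCTW: align to 8 entries, then halve. */
static uint64_t mtt_octw(uint64_t npages)
{
    return align8(npages) / 2;
}

/* Same arithmetic as MLX5E_REQUIRED_MTTS: channels * WQEs * aligned pages. */
static uint64_t required_mtts(uint32_t rqs, uint32_t wqes)
{
    return (uint64_t)rqs * wqes * align8(PAGES_PER_WQE);
}

/* Same test as MLX5E_VALID_NUM_MTTS: the octword count must fit in 16 bits. */
static bool valid_num_mtts(uint64_t num_mtts)
{
    return mtt_octw(num_mtts) <= UINT16_MAX;
}
```

With these assumed numbers, 8 channels of 16 WQEs stay in range, while 64 channels of 1024 WQEs do not, which is the case the patch rejects with -EINVAL.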

Fixes: bc77b240b3c5 ('net/mlx5e: Add fragmented memory support for RX multi packet WQE')
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h | 14 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c | 19 +++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c| 11 ---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c  | 12 ++--
 4 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 1b495ef..d63a1b8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -73,8 +73,12 @@
 #define MLX5_MPWRQ_PAGES_PER_WQE   BIT(MLX5_MPWRQ_WQE_PAGE_ORDER)
 #define MLX5_MPWRQ_STRIDES_PER_PAGE(MLX5_MPWRQ_NUM_STRIDES >> \
 MLX5_MPWRQ_WQE_PAGE_ORDER)
-#define MLX5_CHANNEL_MAX_NUM_MTTS (ALIGN(MLX5_MPWRQ_PAGES_PER_WQE, 8) * \
-  BIT(MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW))
+
+#define MLX5_MTT_OCTW(npages) (ALIGN(npages, 8) / 2)
+#define MLX5E_REQUIRED_MTTS(rqs, wqes)\
+   (rqs * wqes * ALIGN(MLX5_MPWRQ_PAGES_PER_WQE, 8))
+#define MLX5E_VALID_NUM_MTTS(num_mtts) (MLX5_MTT_OCTW(num_mtts) <= U16_MAX)
+
 #define MLX5_UMR_ALIGN (2048)
 #define MLX5_MPWRQ_SMALL_PACKET_THRESHOLD  (128)
 
@@ -304,6 +308,7 @@ struct mlx5e_rq {
 
unsigned long  state;
intix;
+   u32mpwqe_mtt_offset;
 
struct mlx5e_rx_am am; /* Adaptive Moderation */
 
@@ -814,11 +819,6 @@ static inline int mlx5e_get_max_num_channels(struct 
mlx5_core_dev *mdev)
 MLX5E_MAX_NUM_CHANNELS);
 }
 
-static inline int mlx5e_get_mtt_octw(int npages)
-{
-   return ALIGN(npages, 8) / 2;
-}
-
 extern const struct ethtool_ops mlx5e_ethtool_ops;
 #ifdef CONFIG_MLX5_CORE_EN_DCB
 extern const struct dcbnl_rtnl_ops mlx5e_dcbnl_ops;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 4a3757e..9cfe408 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -373,6 +373,7 @@ static int mlx5e_set_ringparam(struct net_device *dev,
u16 min_rx_wqes;
u8 log_rq_size;
u8 log_sq_size;
+   u32 num_mtts;
int err = 0;
 
if (param->rx_jumbo_pending) {
@@ -397,6 +398,15 @@ static int mlx5e_set_ringparam(struct net_device *dev,
1 << mlx5_max_log_rq_size(rq_wq_type));
return -EINVAL;
}
+
+   num_mtts = MLX5E_REQUIRED_MTTS(priv->params.num_channels, 
param->rx_pending);
+   if (priv->params.rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ &&
+   !MLX5E_VALID_NUM_MTTS(num_mtts)) {
+   netdev_info(dev, "%s: rx_pending (%d) request can't be 
satisfied, try to reduce.\n",
+   __func__, param->rx_pending);
+   return -EINVAL;
+   }
+
if (param->tx_pending < (1 << MLX5E_PARAMS_MINIMUM_LOG_SQ_SIZE)) {
netdev_info(dev, "%s: tx_pending (%d) < min (%d)\n",
__func__, param->tx_pending,
@@ -454,6 +464,7 @@ static int mlx5e_set_channels(struct net_device *dev,
unsigned int count = ch->combined_count;
bool arfs_enabled;
bool was_opened;
+   u32 num_mtts;
int err = 0;
 
if (!count) {
@@ -472,6 +483,14 @@ static int mlx5e_set_channels(struct net_device *dev,
return -EINVAL;
}
 
+   num_mtts = MLX5E_REQUIRED_MTTS(count, BIT(priv->params.log_rq_size));
+   if (priv->params.rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ &&
+   !MLX5E_VALID_NUM_MTTS(num_mtts)) {
+   netdev_info(dev, "%s: rx count (%d) request can't be satisfied, 
try to reduce.\n",
+   __func__, count);
+   return -EINVAL;
+   }
+
if (priv->params.num_channels == count)
return 0;
 
diff --g

[PATCH net 9/9] net/mlx5: Increase number of ethtool steering priorities

2016-08-28 Thread Saeed Mahameed

From: Maor Gottlieb 

Ethtool has 11 flow tables, each flow table has its own priority.
Increase the number of priorities to be aligned with the number of flow
tables.

Fixes: 1174fce8d141 ('net/mlx5e: Support l3/l4 flow type specs in ethtool flow steering')
Signed-off-by: Maor Gottlieb 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index 75bb8c8..3d6c1f6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -80,7 +80,7 @@
   LEFTOVERS_NUM_PRIOS)
 
 #define ETHTOOL_PRIO_NUM_LEVELS 1
-#define ETHTOOL_NUM_PRIOS 10
+#define ETHTOOL_NUM_PRIOS 11
 #define ETHTOOL_MIN_LEVEL (KERNEL_MIN_LEVEL + ETHTOOL_NUM_PRIOS)
 /* Vlan, mac, ttc, aRFS */
 #define KERNEL_NIC_PRIO_NUM_LEVELS 4
-- 
2.7.4

[PATCH net 5/9] net/mlx5e: Fix ethtool -g/G rx ring parameter report with striding RQ

2016-08-28 Thread Saeed Mahameed

From: Eran Ben Elisha 

The driver RQ has two possible configurations: striding RQ and
non-striding RQ.  Until this patch, the driver always reported the
number of hardware WQEs (ring descriptors). For the non-striding RQ
configuration this was OK, since there is one WQE per pending packet.
For striding RQ, multiple packets can fit into one WQE. For a better
user experience we normalize the rx_pending parameter (size of wqe/mtu)
as the average ring size in case of striding RQ.
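
A worked userspace sketch of the normalization, following the arithmetic of mlx5e_rx_wqes_to_packets() and mlx5e_packets_to_rx_wqes() in the diff below. The stride parameters used in the usage note (64-byte strides, 2048 strides per WQE) are assumptions chosen for illustration, and the function names are hypothetical.

```c
#include <stdint.h>

#define ETH_DATA_LEN 1500u /* standard Ethernet payload size */

/* Smallest n with (1 << n) >= x, like the kernel's order_base_2() for x >= 1. */
static unsigned int order_base_2(uint32_t x)
{
    unsigned int n = 0;

    while (n < 32 && ((uint64_t)1 << n) < x)
        n++;
    return n;
}

/* Round x up to a multiple of a. */
static uint32_t align_up(uint32_t x, uint32_t a)
{
    return (x + a - 1) / a * a;
}

/* How many MTU-sized packets fit into one striding WQE. */
static uint32_t packets_per_wqe(unsigned int log_stride_sz,
                                unsigned int log_num_strides)
{
    uint32_t stride_size = 1u << log_stride_sz;
    uint32_t wqe_size = stride_size << log_num_strides;

    return wqe_size / align_up(ETH_DATA_LEN, stride_size);
}

/* WQE count to reported ring size; the product must be greater than 1. */
static uint32_t wqes_to_packets(uint32_t num_wqe, unsigned int lss,
                                unsigned int lns)
{
    uint32_t total = num_wqe * packets_per_wqe(lss, lns);

    return 1u << (order_base_2(total) - 1);
}

/* Requested ring size back to a power-of-two WQE count. */
static uint32_t packets_to_wqes(uint32_t num_packets, unsigned int lss,
                                unsigned int lns)
{
    uint32_t rounded = 1u << order_base_2(num_packets);
    uint32_t ppw = packets_per_wqe(lss, lns);
    uint32_t num_wqes = (rounded + ppw - 1) / ppw;

    return 1u << order_base_2(num_wqes);
}
```

Under the assumed parameters one WQE holds 85 packets, so 16 WQEs are reported as a ring of 1024, and a request for 1024 (or 1000) packets maps back to 16 WQEs.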

Fixes: 461017cb006a ('net/mlx5e: Support RX multi-packet WQE ...')
Signed-off-by: Eran Ben Elisha 
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   | 76 +++---
 1 file changed, 67 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 9cfe408..d0cf8fa 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -352,15 +352,61 @@ static void mlx5e_get_ethtool_stats(struct net_device 
*dev,
   
sq_stats_desc, j);
 }
 
+static u32 mlx5e_rx_wqes_to_packets(struct mlx5e_priv *priv, int rq_wq_type,
+   int num_wqe)
+{
+   int packets_per_wqe;
+   int stride_size;
+   int num_strides;
+   int wqe_size;
+
+   if (rq_wq_type != MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ)
+   return num_wqe;
+
+   stride_size = 1 << priv->params.mpwqe_log_stride_sz;
+   num_strides = 1 << priv->params.mpwqe_log_num_strides;
+   wqe_size = stride_size * num_strides;
+
+   packets_per_wqe = wqe_size /
+ ALIGN(ETH_DATA_LEN, stride_size);
+   return (1 << (order_base_2(num_wqe * packets_per_wqe) - 1));
+}
+
+static u32 mlx5e_packets_to_rx_wqes(struct mlx5e_priv *priv, int rq_wq_type,
+   int num_packets)
+{
+   int packets_per_wqe;
+   int stride_size;
+   int num_strides;
+   int wqe_size;
+   int num_wqes;
+
+   if (rq_wq_type != MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ)
+   return num_packets;
+
+   stride_size = 1 << priv->params.mpwqe_log_stride_sz;
+   num_strides = 1 << priv->params.mpwqe_log_num_strides;
+   wqe_size = stride_size * num_strides;
+
+   num_packets = (1 << order_base_2(num_packets));
+
+   packets_per_wqe = wqe_size /
+ ALIGN(ETH_DATA_LEN, stride_size);
+   num_wqes = DIV_ROUND_UP(num_packets, packets_per_wqe);
+   return 1 << (order_base_2(num_wqes));
+}
+
 static void mlx5e_get_ringparam(struct net_device *dev,
struct ethtool_ringparam *param)
 {
struct mlx5e_priv *priv = netdev_priv(dev);
int rq_wq_type = priv->params.rq_wq_type;
 
-   param->rx_max_pending = 1 << mlx5_max_log_rq_size(rq_wq_type);
+   param->rx_max_pending = mlx5e_rx_wqes_to_packets(priv, rq_wq_type,
+1 << 
mlx5_max_log_rq_size(rq_wq_type));
param->tx_max_pending = 1 << MLX5E_PARAMS_MAXIMUM_LOG_SQ_SIZE;
-   param->rx_pending = 1 << priv->params.log_rq_size;
+   param->rx_pending = mlx5e_rx_wqes_to_packets(priv, rq_wq_type,
+1 << 
priv->params.log_rq_size);
param->tx_pending = 1 << priv->params.log_sq_size;
 }
 
@@ -370,6 +416,9 @@ static int mlx5e_set_ringparam(struct net_device *dev,
struct mlx5e_priv *priv = netdev_priv(dev);
bool was_opened;
int rq_wq_type = priv->params.rq_wq_type;
+   u32 rx_pending_wqes;
+   u32 min_rq_size;
+   u32 max_rq_size;
u16 min_rx_wqes;
u8 log_rq_size;
u8 log_sq_size;
@@ -386,20 +435,29 @@ static int mlx5e_set_ringparam(struct net_device *dev,
__func__);
return -EINVAL;
}
-   if (param->rx_pending < (1 << mlx5_min_log_rq_size(rq_wq_type))) {
+
+   min_rq_size = mlx5e_rx_wqes_to_packets(priv, rq_wq_type,
+  1 << 
mlx5_min_log_rq_size(rq_wq_type));
+   max_rq_size = mlx5e_rx_wqes_to_packets(priv, rq_wq_type,
+  1 << 
mlx5_max_log_rq_size(rq_wq_type));
+   rx_pending_wqes = mlx5e_packets_to_rx_wqes(priv, rq_wq_type,
+  param->rx_pending);
+
+   if (param->rx_pending < min_rq_size) {
netdev_info(dev, "%s: rx_pending (%d) < min (%d)\n",
__func__, param->rx_pending,
-   1 << mlx5_min_log_rq_size(rq_wq_type));
+   min_rq_size);
return -EINVAL;
}
-   if (param->rx_pending > (1 << mlx5_max_log_rq_size(rq_wq_type))) {
+   if (param->rx_pending > max_rq_size) {
netdev

[PATCH net 8/9] net/mlx5: Add error prints when validate ETS failed

2016-08-28 Thread Saeed Mahameed

From: Eran Ben Elisha 

When setting ETS fails because of invalid user input, print an error
that tells the user exactly which check failed.
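
The three checks that now report an error can be sketched in userspace as follows. This is an illustration of the validation order in the diff below, not driver code: the struct, the enum and the function names are hypothetical, and the constants are assumed values standing in for IEEE_8021QAZ_MAX_TCS, MLX5E_MAX_PRIORITY and IEEE_8021QAZ_TSA_ETS.

```c
#include <stdint.h>

#define MAX_TCS      8 /* assumed stand-in for IEEE_8021QAZ_MAX_TCS */
#define MAX_PRIORITY 8 /* assumed stand-in for MLX5E_MAX_PRIORITY */
#define TSA_ETS      2 /* assumed stand-in for IEEE_8021QAZ_TSA_ETS */

struct ets_sketch {
    uint8_t prio_tc[MAX_TCS];  /* priority to traffic class map */
    uint8_t tc_tsa[MAX_TCS];   /* transmission selection algorithm per TC */
    uint8_t tc_tx_bw[MAX_TCS]; /* bandwidth percentage per TC */
};

enum ets_err { ETS_OK, ETS_ERR_PRIORITY, ETS_ERR_ZERO_BW, ETS_ERR_BW_SUM };

/* Return the first failed check so the caller can print a specific message. */
static enum ets_err validate_ets(const struct ets_sketch *ets)
{
    int bw_sum = 0;
    int i;

    for (i = 0; i < MAX_TCS; i++)
        if (ets->prio_tc[i] >= MAX_PRIORITY)
            return ETS_ERR_PRIORITY;

    for (i = 0; i < MAX_TCS; i++) {
        if (ets->tc_tsa[i] != TSA_ETS)
            continue;
        if (!ets->tc_tx_bw[i])
            return ETS_ERR_ZERO_BW;
        bw_sum += ets->tc_tx_bw[i];
    }

    if (bw_sum != 0 && bw_sum != 100)
        return ETS_ERR_BW_SUM;
    return ETS_OK;
}

/* Messages modeled on the strings added by the patch. */
static const char *ets_err_str(enum ets_err err)
{
    switch (err) {
    case ETS_ERR_PRIORITY: return "priority value greater than max";
    case ETS_ERR_ZERO_BW:  return "BW 0 is illegal";
    case ETS_ERR_BW_SUM:   return "BW sum is illegal";
    default:               return "ok";
    }
}
```

A 60/40 split across two ETS traffic classes passes, while 50/40 fails the sum check and a zero share on an ETS class fails earlier.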

Fixes: cdcf11212b22 ('net/mlx5e: Validate BW weight values of ETS')
Signed-off-by: Eran Ben Elisha 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c | 21 -
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c
index caa9a3c..762af16 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c
@@ -127,29 +127,40 @@ int mlx5e_dcbnl_ieee_setets_core(struct mlx5e_priv *priv, 
struct ieee_ets *ets)
return mlx5_set_port_tc_bw_alloc(mdev, tc_tx_bw);
 }
 
-static int mlx5e_dbcnl_validate_ets(struct ieee_ets *ets)
+static int mlx5e_dbcnl_validate_ets(struct net_device *netdev,
+   struct ieee_ets *ets)
 {
int bw_sum = 0;
int i;
 
/* Validate Priority */
for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++) {
-   if (ets->prio_tc[i] >= MLX5E_MAX_PRIORITY)
+   if (ets->prio_tc[i] >= MLX5E_MAX_PRIORITY) {
+   netdev_err(netdev,
+  "Failed to validate ETS: priority value 
greater than max(%d)\n",
+   MLX5E_MAX_PRIORITY);
return -EINVAL;
+   }
}
 
/* Validate Bandwidth Sum */
for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++) {
if (ets->tc_tsa[i] == IEEE_8021QAZ_TSA_ETS) {
-   if (!ets->tc_tx_bw[i])
+   if (!ets->tc_tx_bw[i]) {
+   netdev_err(netdev,
+  "Failed to validate ETS: BW 0 is 
illegal\n");
return -EINVAL;
+   }
 
bw_sum += ets->tc_tx_bw[i];
}
}
 
-   if (bw_sum != 0 && bw_sum != 100)
+   if (bw_sum != 0 && bw_sum != 100) {
+   netdev_err(netdev,
+  "Failed to validate ETS: BW sum is illegal\n");
return -EINVAL;
+   }
return 0;
 }
 
@@ -159,7 +170,7 @@ static int mlx5e_dcbnl_ieee_setets(struct net_device 
*netdev,
struct mlx5e_priv *priv = netdev_priv(netdev);
int err;
 
-   err = mlx5e_dbcnl_validate_ets(ets);
+   err = mlx5e_dbcnl_validate_ets(netdev, ets);
if (err)
return err;
 
-- 
2.7.4

[PATCH net 0/9] Mellanox 100G mlx5 fixes 2016-08-29

2016-08-28 Thread Saeed Mahameed

Hi Dave,

This series contains some bug fixes for the mlx5 core and mlx5
ethernet driver.

From Saeed, Fix UMR to consider hardware translation table field
size limitation when calculating the maximum number of MTTs required
by the driver.  Three patches to speed-up netdevice close time by
serializing channel (SQs & RQs) destruction rather than issuing and
waiting for hardware interrupts to free them.

From Eran, Fix ethtool ring parameter reporting for striding RQ layout.
Add error prints on ETS validation failure.

From Kamal, Fix memory leak on error flow.

From Maor, Fix ethtool steering priorities number.

For -stable of 4.7.y:
  net/mlx5e: Limit UMR length to the device's limitation
  net/mlx5e: Don't wait for RQ completions on close 
  net/mlx5e: Don't post fragmented MPWQE when RQ is disabled
  net/mlx5e: Don't wait for SQ completions on close
  net/mlx5e: Add ethtool counter for TX xmit_more

Thanks,
Saeed.

Eran Ben Elisha (2):
  net/mlx5e: Fix ethtool -g/G rx ring parameter report with striding RQ
  net/mlx5: Add error prints when validate ETS failed

Kamal Heib (1):
  net/mlx5e: Fix memory leak if refreshing TIRs fails

Maor Gottlieb (1):
  net/mlx5: Increase number of ethtool steering priorities

Saeed Mahameed (4):
  net/mlx5e: Limit UMR length to the device's limitation
  net/mlx5e: Don't wait for RQ completions on close
  net/mlx5e: Don't post fragmented MPWQE when RQ is disabled
  net/mlx5e: Don't wait for SQ completions on close

Tariq Toukan (1):
  net/mlx5e: Add ethtool counter for TX xmit_more

 drivers/net/ethernet/mellanox/mlx5/core/en.h   | 21 +++--
 .../net/ethernet/mellanox/mlx5/core/en_common.c|  7 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c | 21 +++--
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   | 93 --
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 91 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c| 41 --
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |  4 +
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c| 68 
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  |  6 +-
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c  |  2 +-
 10 files changed, 209 insertions(+), 145 deletions(-)

-- 
2.7.4

[PATCH net-next 0/3] strp: Generalize stream parser to work with other socket types

2016-08-28 Thread Tom Herbert

Add a read_sock protocol operation function that allows something like
tcp_read_sock to be called for other protocol types.

Specific changes in this patch set:
  - Add read_sock function to proto_ops. This has the same signature as
tcp_read_sock. sk_read_actor_t is also defined in net.h.
  - Set peek_len and read_sock proto_op functions for TCPv4 and TCPv6
stream ops.
  - Remove references to tcp in strparser.
  - Call peek_len and read_sock operations from strparser instead of
calling TCP specific functions.
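
The changes listed above amount to dispatching through an ops table instead of calling TCP functions directly. A minimal userspace sketch of that pattern follows. Every type and function name in it is hypothetical, the in-memory "socket" only stands in for a real stream socket, and the -EOPNOTSUPP return for missing operations is an assumption rather than something taken from the patches.

```c
#include <errno.h>
#include <stddef.h>

struct sock_sketch;

/* Callback that consumes data and returns the number of bytes used. */
typedef int (*read_actor_t)(void *ctx, const char *data, size_t len);

/* The two operations a stream parser needs from any socket type. */
struct stream_ops {
    int (*peek_len)(struct sock_sketch *sk);
    int (*read_sock)(struct sock_sketch *sk, void *ctx, read_actor_t actor);
};

struct sock_sketch {
    const struct stream_ops *ops;
    const char *buf; /* pending bytes */
    size_t len;
};

/* Refuse sockets that lack either operation (error code is an assumption). */
static int parser_init(const struct sock_sketch *sk)
{
    if (!sk->ops || !sk->ops->peek_len || !sk->ops->read_sock)
        return -EOPNOTSUPP;
    return 0;
}

/* Protocol-independent data-ready path: only the ops table is used. */
static int parser_data_ready(struct sock_sketch *sk, void *ctx,
                             read_actor_t actor)
{
    if (sk->ops->peek_len(sk) <= 0)
        return 0;
    return sk->ops->read_sock(sk, ctx, actor);
}

/* One possible backend: an in-memory byte stream. */
static int mem_peek_len(struct sock_sketch *sk)
{
    return (int)sk->len;
}

static int mem_read_sock(struct sock_sketch *sk, void *ctx, read_actor_t actor)
{
    int used = actor(ctx, sk->buf, sk->len);

    if (used > 0) {
        sk->buf += used;
        sk->len -= (size_t)used;
    }
    return used;
}

static const struct stream_ops mem_stream_ops = {
    .peek_len = mem_peek_len,
    .read_sock = mem_read_sock,
};

/* Example actor: count the bytes and consume all of them. */
static int count_actor(void *ctx, const char *data, size_t len)
{
    (void)data;
    *(size_t *)ctx += len;
    return (int)len;
}
```

The parser never names the backend, so a second socket type only needs its own ops table.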

Tom Herbert (3):
  net: Add read_sock proto_op
  tcp: Set read_sock and peek_len proto_ops
  kcm: Remove TCP specific references from kcm and strparser

 include/linux/net.h   |  6 ++
 include/net/strparser.h   |  2 +-
 include/net/tcp.h |  4 ++--
 net/ipv4/af_inet.c|  2 ++
 net/ipv4/tcp.c|  6 ++
 net/ipv6/af_inet6.c   |  2 ++
 net/kcm/kcmsock.c | 30 +
 net/strparser/strparser.c | 48 ---
 8 files changed, 61 insertions(+), 39 deletions(-)

-- 
2.8.0.rc2

[PATCH net-next 3/3] kcm: Remove TCP specific references from kcm and strparser

2016-08-28 Thread Tom Herbert

kcm and strparser need to work with any type of stream socket, not just
TCP. Eliminate references to TCP and call the generic proto_ops functions
read_sock and peek_len. Also, in strp_init, check whether the socket supports
the proto_ops read_sock and peek_len.

Signed-off-by: Tom Herbert 
---
 include/net/strparser.h   |  2 +-
 net/kcm/kcmsock.c | 30 +
 net/strparser/strparser.c | 48 ---
 3 files changed, 43 insertions(+), 37 deletions(-)

diff --git a/include/net/strparser.h b/include/net/strparser.h
index 91fa0b9..0c28ad9 100644
--- a/include/net/strparser.h
+++ b/include/net/strparser.h
@@ -137,6 +137,6 @@ void strp_stop(struct strparser *strp);
 void strp_check_rcv(struct strparser *strp);
 int strp_init(struct strparser *strp, struct sock *csk,
  struct strp_callbacks *cb);
-void strp_tcp_data_ready(struct strparser *strp);
+void strp_data_ready(struct strparser *strp);
 
 #endif /* __NET_STRPARSER_H_ */
diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index eb731ca..2632ac7 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -26,7 +26,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 unsigned int kcm_net_id;
@@ -340,7 +339,7 @@ static void unreserve_rx_kcm(struct kcm_psock *psock,
 }
 
 /* Lower sock lock held */
-static void psock_tcp_data_ready(struct sock *sk)
+static void psock_data_ready(struct sock *sk)
 {
struct kcm_psock *psock;
 
@@ -348,7 +347,7 @@ static void psock_tcp_data_ready(struct sock *sk)
 
psock = (struct kcm_psock *)sk->sk_user_data;
if (likely(psock))
-   strp_tcp_data_ready(&psock->strp);
+   strp_data_ready(&psock->strp);
 
read_unlock_bh(&sk->sk_callback_lock);
 }
@@ -392,7 +391,7 @@ static int kcm_read_sock_done(struct strparser *strp, int 
err)
return err;
 }
 
-static void psock_tcp_state_change(struct sock *sk)
+static void psock_state_change(struct sock *sk)
 {
/* TCP only does a POLLIN for a half close. Do a POLLHUP here
 * since application will normally not poll with POLLIN
@@ -402,7 +401,7 @@ static void psock_tcp_state_change(struct sock *sk)
report_csk_error(sk, EPIPE);
 }
 
-static void psock_tcp_write_space(struct sock *sk)
+static void psock_write_space(struct sock *sk)
 {
struct kcm_psock *psock;
struct kcm_mux *mux;
@@ -1383,19 +1382,12 @@ static int kcm_attach(struct socket *sock, struct 
socket *csock,
struct list_head *head;
int index = 0;
struct strp_callbacks cb;
-
-   if (csock->ops->family != PF_INET &&
-   csock->ops->family != PF_INET6)
-   return -EINVAL;
+   int err;
 
csk = csock->sk;
if (!csk)
return -EINVAL;
 
-   /* Only support TCP for now */
-   if (csk->sk_protocol != IPPROTO_TCP)
-   return -EINVAL;
-
psock = kmem_cache_zalloc(kcm_psockp, GFP_KERNEL);
if (!psock)
return -ENOMEM;
@@ -1409,7 +1401,11 @@ static int kcm_attach(struct socket *sock, struct socket 
*csock,
cb.parse_msg = kcm_parse_func_strparser;
cb.read_sock_done = kcm_read_sock_done;
 
-   strp_init(&psock->strp, csk, &cb);
+   err = strp_init(&psock->strp, csk, &cb);
+   if (err) {
+   kmem_cache_free(kcm_psockp, psock);
+   return err;
+   }
 
sock_hold(csk);
 
@@ -1418,9 +1414,9 @@ static int kcm_attach(struct socket *sock, struct socket 
*csock,
psock->save_write_space = csk->sk_write_space;
psock->save_state_change = csk->sk_state_change;
csk->sk_user_data = psock;
-   csk->sk_data_ready = psock_tcp_data_ready;
-   csk->sk_write_space = psock_tcp_write_space;
-   csk->sk_state_change = psock_tcp_state_change;
+   csk->sk_data_ready = psock_data_ready;
+   csk->sk_write_space = psock_write_space;
+   csk->sk_state_change = psock_state_change;
write_unlock_bh(&csk->sk_callback_lock);
 
/* Finished initialization, now add the psock to the MUX. */
diff --git a/net/strparser/strparser.c b/net/strparser/strparser.c
index 4ecfc10..5c7549b 100644
--- a/net/strparser/strparser.c
+++ b/net/strparser/strparser.c
@@ -26,7 +26,6 @@
 #include 
 #include 
 #include 
-#include 
 
 static struct workqueue_struct *strp_wq;
 
@@ -80,9 +79,16 @@ static void strp_parser_err(struct strparser *strp, int err,
strp->cb.abort_parser(strp, err);
 }
 
+static inline int strp_peek_len(struct strparser *strp)
+{
+   struct socket *sock = strp->sk->sk_socket;
+
+   return sock->ops->peek_len(sock);
+}
+
 /* Lower socket lock held */
-static int strp_tcp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb,
-unsigned int orig_offset, size_t orig_len)
+static int strp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb,
+unsigned int orig_offset, size_t orig_len)
 {
struct

[PATCH net-next 1/3] net: Add read_sock proto_op

2016-08-28 Thread Tom Herbert

Add a new function to the proto_ops structure. This includes moving the
typedef for sk_read_actor_t into net.h and removing the definition from
tcp.h.

Signed-off-by: Tom Herbert 
---
 include/linux/net.h | 6 ++
 include/net/tcp.h   | 2 --
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index b9f0ff4..cd0c8bd 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -128,6 +129,9 @@ struct page;
 struct sockaddr;
 struct msghdr;
 struct module;
+struct sk_buff;
+typedef int (*sk_read_actor_t)(read_descriptor_t *, struct sk_buff *,
+  unsigned int, size_t);
 
 struct proto_ops {
int family;
@@ -186,6 +190,8 @@ struct proto_ops {
   struct pipe_inode_info *pipe, size_t 
len, unsigned int flags);
int (*set_peek_off)(struct sock *sk, int val);
int (*peek_len)(struct socket *sock);
+   int (*read_sock)(struct sock *sk, read_descriptor_t *desc,
+sk_read_actor_t recv_actor);
 };
 
 #define DECLARE_SOCKADDR(type, dst, src)   \
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 25d64f6..d5a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -603,8 +603,6 @@ static inline int tcp_bound_to_half_wnd(struct tcp_sock 
*tp, int pktsize)
 void tcp_get_info(struct sock *, struct tcp_info *);
 
 /* Read 'sendfile()'-style from a TCP socket */
-typedef int (*sk_read_actor_t)(read_descriptor_t *, struct sk_buff *,
-   unsigned int, size_t);
 int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
  sk_read_actor_t recv_actor);
 
-- 
2.8.0.rc2

[PATCH net-next 2/3] tcp: Set read_sock and peek_len proto_ops

2016-08-28 Thread Tom Herbert

In inet_stream_ops we set read_sock to tcp_read_sock and peek_len to
tcp_peek_len (which is just a stub function that calls tcp_inq).

Signed-off-by: Tom Herbert 
---
 include/net/tcp.h   | 2 ++
 net/ipv4/af_inet.c  | 2 ++
 net/ipv4/tcp.c  | 6 ++
 net/ipv6/af_inet6.c | 2 ++
 4 files changed, 12 insertions(+)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index d5a..a5af6be 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1848,6 +1848,8 @@ static inline int tcp_inq(struct sock *sk)
return answ;
 }
 
+int tcp_peek_len(struct socket *sock);
+
 static inline void tcp_segs_in(struct tcp_sock *tp, const struct sk_buff *skb)
 {
u16 segs_in;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 989a362..e94b47b 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -916,6 +916,8 @@ const struct proto_ops inet_stream_ops = {
.mmap  = sock_no_mmap,
.sendpage  = inet_sendpage,
.splice_read   = tcp_splice_read,
+   .read_sock = tcp_read_sock,
+   .peek_len  = tcp_peek_len,
 #ifdef CONFIG_COMPAT
.compat_setsockopt = compat_sock_common_setsockopt,
.compat_getsockopt = compat_sock_common_getsockopt,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f1a9a0a..60a4388 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1570,6 +1570,12 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t 
*desc,
 }
 EXPORT_SYMBOL(tcp_read_sock);
 
+int tcp_peek_len(struct socket *sock)
+{
+   return tcp_inq(sock->sk);
+}
+EXPORT_SYMBOL(tcp_peek_len);
+
 /*
  * This routine copies from a sock struct into the user buffer.
  *
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index b454055..46ad699 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -545,6 +545,8 @@ const struct proto_ops inet6_stream_ops = {
.mmap  = sock_no_mmap,
.sendpage  = inet_sendpage,
.splice_read   = tcp_splice_read,
+   .read_sock = tcp_read_sock,
+   .peek_len  = tcp_peek_len,
 #ifdef CONFIG_COMPAT
.compat_setsockopt = compat_sock_common_setsockopt,
.compat_getsockopt = compat_sock_common_getsockopt,
-- 
2.8.0.rc2

Re: [PATCH v2] ipv6: Use inbound ifaddr as source addresses for ICMPv6 errors

2016-08-28 Thread Eli Cooper

Hello,

On 2016/8/29 1:18, Guillaume Nault wrote:
> On Sun, Aug 28, 2016 at 11:34:06AM +0800, Eli Cooper wrote:
>> According to RFC 1885 2.2(c), the source address of ICMPv6
>> errors in response to forwarded packets should be set to the
>> unicast address of the forwarding interface in order to be helpful
>> in diagnosis.
>>
> FWIW, this behaviour was deprecated ten years ago by RFC 4443:
> "The address SHOULD be chosen according to the rules that would be used
>  to select the source address for any other packet originated by the
>  node, given the destination address of the packet."
>
> The door is left open for other address selection algorithms but, IMHO,
> changing kernel's behaviour is better justified by real use cases
> than by obsolete RFCs.

I agree, sorry for citing the obsolete RFC. This is actually motivated by a
real use case: say a Linux box is acting as a router that forwards
packets with policy routing from two local networks to two uplinks,
respectively. An outside host is performing a traceroute to a host on
one of the LANs. If the kernel's default route is via the other LAN's
uplink, it will send ICMPv6 packets with a source address that has
nothing to do with the network in question, yet the message will probably
reach the outside host.

Here using the address of the inbound or exiting interface as source address
is evidently "a more informative choice." I surmise this is the reason
why the comment reads "Force OUTPUT device used as source address" when
dealing with hop limit exceeded packets in ip6_forward(), although not
effectively so. The current behaviour not only confuses diagnosis, but
also might be undesirable if the addresses of the networks are best kept
secret from each other.

Thanks,
Eli

Re: [PATCH] net: ethernet: renesas: sh_eth: do not access POST registers if not exist

2016-08-28 Thread Sergei Shtylyov


Hello.

   Oh, and I'll have to correct your language and terminology. :-/
Should be "if they don't exist" in the subject.

On 08/26/2016 11:01 PM, Chris Brandt wrote:


The RZ/A1 has a TSU, but since it only has one Ethernet port, it does not
have POST registers. Therefore, if you try to write to register index
TSU_POST1 (which will be  because it does not exist),


   It's not a register index which is 0x but the register offset (fetched 
from a layout table using the index).



it will either panic or corrupt memory elsewhere.

Reported-by: Daniel Palmer 
Signed-off-by: Chris Brandt 

[...]

MBR, Sergei

Re: [PATCH] net: ethernet: renesas: sh_eth: do not access POST registers if not exist

2016-08-28 Thread Sergei Shtylyov


Hello.

On 08/26/2016 11:01 PM, Chris Brandt wrote:


The RZ/A1 has a TSU, but since it only has one Ethernet port, it does not
have POST registers.


   I'm not sure the reason is having one port... do you have the old SH 
manuals somewhere? :-)



Therefore, if you try to write to register index
TSU_POST1 (which will be  because it does not exist), it will either
panic or corrupt memory elsewhere.


   The true reason for that is that Ben Hutchings wasn't consistent in 
handling SH_ETH_OFFSET_INVALID: he didn't add WARN_ON() to 
sh_eth_tsu_{read|write}() and friends. Maybe you can do this?



Reported-by: Daniel Palmer 
Signed-off-by: Chris Brandt 
---
 drivers/net/ethernet/renesas/sh_eth.c | 7 +++
 drivers/net/ethernet/renesas/sh_eth.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/renesas/sh_eth.c 
b/drivers/net/ethernet/renesas/sh_eth.c
index 1f8240a..850a13c 100644
--- a/drivers/net/ethernet/renesas/sh_eth.c
+++ b/drivers/net/ethernet/renesas/sh_eth.c
@@ -532,6 +532,7 @@ static struct sh_eth_cpu_data r7s72100_data = {
.no_ade = 1,
.hw_crc = 1,
.tsu= 1,
+   .tsu_no_post= 1,


   The rest of the code seems to use sh_eth_is_rz_fast_ether() to 
differentiate the limited TSU implementation in the RZ/A1 SoC -- see 
sh_eth_tsu_init(). I'd prefer that you follow suit. Either that or give 
this bitfield a different name.




.shift_rd0  = 1,
 };

@@ -2460,6 +2461,9 @@ static void sh_eth_tsu_enable_cam_entry_post(struct 
net_device *ndev,
u32 tmp;
void *reg_offset;

+   if (mdp->cd->tsu_no_post)
+   return;
+
reg_offset = sh_eth_tsu_get_post_reg_offset(mdp, entry);


   I'd check for SH_ETH_OFFSET_INVALID in the above function and return 
NULL if so; then we can check for NULL here...



tmp = ioread32(reg_offset);
iowrite32(tmp | sh_eth_tsu_get_post_bit(mdp, entry), reg_offset);
@@ -2472,6 +2476,9 @@ static bool sh_eth_tsu_disable_cam_entry_post(struct 
net_device *ndev,
u32 post_mask, ref_mask, tmp;
void *reg_offset;

+   if (mdp->cd->tsu_no_post)
+   return false;
+
reg_offset = sh_eth_tsu_get_post_reg_offset(mdp, entry);


   ... and here.


post_mask = sh_eth_tsu_get_post_mask(entry);
ref_mask = sh_eth_tsu_get_post_bit(mdp, entry) & ~post_mask;

[...]

MBR, Sergei
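
The suggestion in the review above (return NULL from the offset helper when the register does not exist, then test for NULL in the callers) could look roughly like the userspace sketch below. All names are hypothetical, the sentinel value is chosen for this sketch only, and the layout of eight entries per 32-bit POST register is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Sentinel chosen for this sketch; it marks "register not present". */
#define OFFSET_INVALID UINT16_MAX

/* Hypothetical register indices into a per-SoC layout table. */
enum { REG_TSU_CTRST, REG_TSU_POST1, REG_COUNT };

struct chip_sketch {
    uint8_t *tsu_base;          /* start of the (fake) register window */
    const uint16_t *reg_offset; /* per-SoC layout table */
};

/* Return the POST register address for an entry, or NULL if the SoC has none. */
static void *post_reg_addr(const struct chip_sketch *c, int entry)
{
    uint16_t off = c->reg_offset[REG_TSU_POST1];

    if (off == OFFSET_INVALID)
        return NULL;
    /* Assumption: eight entries share one 32-bit register. */
    return c->tsu_base + off + (entry / 8) * 4;
}

/* Caller checks for NULL instead of touching an invalid address. */
static int enable_cam_entry_post(const struct chip_sketch *c, int entry)
{
    uint8_t *reg = post_reg_addr(c, entry);

    if (!reg)
        return -1; /* nothing to program on this SoC */
    reg[0] |= (uint8_t)(1u << (entry % 8));
    return 0;
}
```

With this shape the "no POST registers" case is derived from the layout table itself, so no extra per-SoC flag is needed.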

[PATCH] i40e: avoid potential null pointer dereference when assigning len

2016-08-28 Thread Colin King

From: Colin Ian King 

There is a sanity check for desc being null in the first line of
function i40evf_debug_aq.  However, before that, aq_desc is cast from
desc, and aq_desc is being dereferenced on the assignment of len, so
this could be a potential null pointer dereference.  Fix this by moving
the initialization of len into the code block where len is used; at
that point we know it is OK to dereference aq_desc.
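
The pattern can be shown in a small userspace sketch: the cast at the top is harmless, and the field is read only after the null check. The struct and function names are hypothetical, and the byte-order conversion done by le16_to_cpu() in the driver is left out for brevity.

```c
#include <stddef.h>
#include <stdint.h>

struct aq_desc_sketch {
    uint16_t datalen;
};

/* Return how many buffer bytes a debug dump would print. */
static uint16_t debug_dump_len(const void *desc, const void *buffer,
                               uint16_t buf_len)
{
    /* Converting the pointer does not dereference it. */
    const struct aq_desc_sketch *aq = desc;

    if (!desc)
        return 0;

    if (buffer != NULL && aq->datalen != 0) {
        /* First read of aq->datalen happens after the null check. */
        uint16_t len = aq->datalen;

        if (buf_len < len)
            len = buf_len;
        return len;
    }
    return 0;
}
```

Had len been initialized from aq->datalen at the declaration, the call with a null descriptor would read through a null pointer before the check runs.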

Signed-off-by: Colin Ian King 
---
 drivers/net/ethernet/intel/i40evf/i40e_common.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40evf/i40e_common.c 
b/drivers/net/ethernet/intel/i40evf/i40e_common.c
index 4db0c03..7953c13 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_common.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_common.c
@@ -302,7 +302,6 @@ void i40evf_debug_aq(struct i40e_hw *hw, enum 
i40e_debug_mask mask, void *desc,
   void *buffer, u16 buf_len)
 {
struct i40e_aq_desc *aq_desc = (struct i40e_aq_desc *)desc;
-   u16 len = le16_to_cpu(aq_desc->datalen);
u8 *buf = (u8 *)buffer;
u16 i = 0;
 
@@ -326,6 +325,8 @@ void i40evf_debug_aq(struct i40e_hw *hw, enum 
i40e_debug_mask mask, void *desc,
   le32_to_cpu(aq_desc->params.external.addr_low));
 
if ((buffer != NULL) && (aq_desc->datalen != 0)) {
+   u16 len = le16_to_cpu(aq_desc->datalen);
+
i40e_debug(hw, mask, "AQ CMD Buffer:\n");
if (buf_len < len)
len = buf_len;
-- 
2.9.3

Re: [PATCH v2] ipv6: Use inbound ifaddr as source addresses for ICMPv6 errors

2016-08-28 Thread Guillaume Nault

On Sun, Aug 28, 2016 at 11:34:06AM +0800, Eli Cooper wrote:
> According to RFC 1885 2.2(c), the source address of ICMPv6
> errors in response to forwarded packets should be set to the
> unicast address of the forwarding interface in order to be helpful
> in diagnosis.
> 
FWIW, this behaviour was deprecated ten years ago by RFC 4443:
"The address SHOULD be chosen according to the rules that would be used
 to select the source address for any other packet originated by the
 node, given the destination address of the packet."

The door is left open for other address selection algorithms but, IMHO,
changing kernel's behaviour is better justified by real use cases
than by obsolete RFCs.

[PATCH v3 2/5] clk: gxbb: expose MPLL2 clock for use by DT

2016-08-28 Thread Martin Blumenstingl

This exposes the MPLL2 clock as this is one of the input clocks of the
ethernet controller's internal mux.

Signed-off-by: Martin Blumenstingl 
---
 drivers/clk/meson/gxbb.h  | 2 +-
 include/dt-bindings/clock/gxbb-clkc.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/clk/meson/gxbb.h b/drivers/clk/meson/gxbb.h
index 217df51..3606e875 100644
--- a/drivers/clk/meson/gxbb.h
+++ b/drivers/clk/meson/gxbb.h
@@ -183,7 +183,7 @@
 /* CLKID_CLK81 */
 #define CLKID_MPLL0  13
 #define CLKID_MPLL1  14
-#define CLKID_MPLL2  15
+/* CLKID_MPLL2 */
 #define CLKID_DDR16
 #define CLKID_DOS17
 #define CLKID_ISA18
diff --git a/include/dt-bindings/clock/gxbb-clkc.h b/include/dt-bindings/clock/gxbb-clkc.h
index 7d41864..244ea6e 100644
--- a/include/dt-bindings/clock/gxbb-clkc.h
+++ b/include/dt-bindings/clock/gxbb-clkc.h
@@ -8,6 +8,7 @@
 #define CLKID_CPUCLK   1
 #define CLKID_FCLK_DIV24
 #define CLKID_CLK8112
+#define CLKID_MPLL215
 #define CLKID_ETH  36
 #define CLKID_SD_EMMC_A94
 #define CLKID_SD_EMMC_B95
-- 
2.9.3

[PATCH v3 3/5] stmmac: introduce get_stmmac_bsp_priv() helper

2016-08-28 Thread Martin Blumenstingl

From: Joachim Eastwood 

Create a helper to retrieve dwmac private data from a dev
pointer. This is useful in PM callbacks and driver remove.

Signed-off-by: Joachim Eastwood 
Tested-by: Martin Blumenstingl 
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_platform.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.h b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.h
index ffeb8d9..64e147f 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.h
@@ -30,4 +30,12 @@ int stmmac_get_platform_resources(struct platform_device *pdev,
 int stmmac_pltfr_remove(struct platform_device *pdev);
 extern const struct dev_pm_ops stmmac_pltfr_pm_ops;
 
+static inline void *get_stmmac_bsp_priv(struct device *dev)
+{
+   struct net_device *ndev = dev_get_drvdata(dev);
+   struct stmmac_priv *priv = netdev_priv(ndev);
+
+   return priv->plat->bsp_priv;
+}
+
 #endif /* __STMMAC_PLATFORM_H__ */
-- 
2.9.3
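
For readers who do not have the stmmac structures in their head, the helper simply walks a pointer chain from the device to the glue driver's private data. A simplified userspace model of that chain follows; all struct and function names are hypothetical stand-ins, and only the fields the helper touches are modelled:

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures. */
struct mock_plat_data { void *bsp_priv; };              /* platform data   */
struct mock_priv      { struct mock_plat_data *plat; }; /* like stmmac_priv */
struct mock_netdev    { struct mock_priv priv; };       /* like net_device  */
struct mock_device    { void *drvdata; };               /* like device      */

/* Same walk as get_stmmac_bsp_priv():
 * device -> drvdata (net device) -> private area -> platform data. */
static void *mock_get_bsp_priv(struct mock_device *dev)
{
    struct mock_netdev *ndev = dev->drvdata;
    struct mock_priv *priv = &ndev->priv;

    return priv->plat->bsp_priv;
}
```

A PM callback or remove function that only receives the device pointer can recover its glue data this way without keeping a second global reference to it.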

[PATCH v3 5/5] ARM64: dts: meson-gxbb: use the new GXBB DWMAC glue driver

2016-08-28 Thread Martin Blumenstingl

The Amlogic reference driver uses the "mc_val" devicetree property to
configure the PRG_ETHERNET_ADDR0 register. Unfortunately it uses magic
values for this configuration.
According to the datasheet the PRG_ETHERNET_ADDR0 register is at address
0xc8834108. However, the reference driver uses 0xc8834540 instead.
According to my tests, the value from the reference driver is correct.

No changes are required to the board dts files because the only
required configuration option is the phy-mode, which had to be
configured correctly before as well.

Signed-off-by: Martin Blumenstingl 
---
 arch/arm64/boot/dts/amlogic/meson-gxbb.dtsi | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/boot/dts/amlogic/meson-gxbb.dtsi b/arch/arm64/boot/dts/amlogic/meson-gxbb.dtsi
index 4f42316..ab817d3 100644
--- a/arch/arm64/boot/dts/amlogic/meson-gxbb.dtsi
+++ b/arch/arm64/boot/dts/amlogic/meson-gxbb.dtsi
@@ -373,13 +373,15 @@
};
 
ethmac: ethernet@c941 {
-   compatible = "amlogic,meson6-dwmac", "snps,dwmac";
+   compatible = "amlogic,meson-gxbb-dwmac", "snps,dwmac";
reg = <0x0 0xc941 0x0 0x1
   0x0 0xc8834540 0x0 0x4>;
interrupts = <0 8 1>;
interrupt-names = "macirq";
-   clocks = <&clkc CLKID_ETH>;
-   clock-names = "stmmaceth";
+   clocks = <&clkc CLKID_ETH>,
+<&clkc CLKID_FCLK_DIV2>,
+<&clkc CLKID_MPLL2>;
+   clock-names = "stmmaceth", "clkin0", "clkin1";
phy-mode = "rgmii";
status = "disabled";
};
-- 
2.9.3

[PATCH v3 0/5] meson: Meson8b and GXBB DWMAC glue driver

2016-08-28 Thread Martin Blumenstingl

This adds a DWMAC glue driver for the PRG_ETHERNET registers found in
Meson8b and GXBB SoCs. Compared to the "old" meson6b-dwmac glue driver,
the register layout is completely different, so I introduced a
separate driver.


Changes since v2:
- fixed unloading the glue driver when built as module. This pulls in a
  patch from Joachim Eastwood (thanks) to get our private data structure
  (bsp_priv).

Joachim Eastwood (1):
  stmmac: introduce get_stmmac_bsp_priv() helper

Martin Blumenstingl (4):
  net: dt-bindings: Document the new Meson8b and GXBB DWMAC bindings
  clk: gxbb: expose MPLL2 clock for use by DT
  net: stmmac: add a glue driver for the Amlogic Meson 8b / GXBB DWMAC
  ARM64: dts: meson-gxbb: use the new GXBB DWMAC glue driver

 .../devicetree/bindings/net/meson-dwmac.txt|  45 ++-
 arch/arm64/boot/dts/amlogic/meson-gxbb.dtsi|   8 +-
 drivers/clk/meson/gxbb.h   |   2 +-
 drivers/net/ethernet/stmicro/stmmac/Makefile   |   2 +-
 .../net/ethernet/stmicro/stmmac/dwmac-meson8b.c| 327 +
 .../net/ethernet/stmicro/stmmac/stmmac_platform.h  |   8 +
 include/dt-bindings/clock/gxbb-clkc.h  |   1 +
 7 files changed, 380 insertions(+), 13 deletions(-)
 create mode 100644 drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c

-- 
2.9.3

[PATCH v3 4/5] net: stmmac: add a glue driver for the Amlogic Meson 8b / GXBB DWMAC

2016-08-28 Thread Martin Blumenstingl

The Ethernet controller available in Meson8b and GXBB SoCs is a Synopsys
DesignWare MAC IP core which is already supported by the stmmac driver.

In addition to the standard stmmac driver some Meson8b / GXBB specific
registers have to be configured for the PHY clocks. These SoC specific
registers are called PRG_ETHERNET_ADDR0 and PRG_ETHERNET_ADDR1 in the
datasheet.
These registers are not backwards compatible with those on Meson 6b,
which is why a new glue driver is introduced. The meson6-dwmac driver
worked for many boards because the bootloader programs the PRG_ETHERNET
registers correctly. Additionally, the meson6-dwmac driver only sets
bit 1 of PRG_ETHERNET_ADDR0, which (according to the datasheet) is only
used during reset.

Currently all configuration values can be determined automatically,
based on the configured phy-mode (which is mandatory for the stmmac
driver). If required the tx-delay and the mux clock (so it supports
the MPLL2 clock as well) can be made configurable in the future.

Signed-off-by: Martin Blumenstingl 
Tested-by: Kevin Hilman 
---
 drivers/net/ethernet/stmicro/stmmac/Makefile   |   2 +-
 .../net/ethernet/stmicro/stmmac/dwmac-meson8b.c| 327 +
 2 files changed, 328 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c

diff --git a/drivers/net/ethernet/stmicro/stmmac/Makefile b/drivers/net/ethernet/stmicro/stmmac/Makefile
index 44b630c..f77edb9 100644
--- a/drivers/net/ethernet/stmicro/stmmac/Makefile
+++ b/drivers/net/ethernet/stmicro/stmmac/Makefile
@@ -9,7 +9,7 @@ stmmac-objs:= stmmac_main.o stmmac_ethtool.o stmmac_mdio.o ring_mode.o  \
 obj-$(CONFIG_STMMAC_PLATFORM)  += stmmac-platform.o
 obj-$(CONFIG_DWMAC_IPQ806X)+= dwmac-ipq806x.o
 obj-$(CONFIG_DWMAC_LPC18XX)+= dwmac-lpc18xx.o
-obj-$(CONFIG_DWMAC_MESON)  += dwmac-meson.o
+obj-$(CONFIG_DWMAC_MESON)  += dwmac-meson.o dwmac-meson8b.o
 obj-$(CONFIG_DWMAC_ROCKCHIP)   += dwmac-rk.o
 obj-$(CONFIG_DWMAC_SOCFPGA)+= dwmac-altr-socfpga.o
 obj-$(CONFIG_DWMAC_STI)+= dwmac-sti.o
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
new file mode 100644
index 000..0f185e4
--- /dev/null
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
@@ -0,0 +1,327 @@
+/*
+ * Amlogic Meson S805/S905 DWMAC glue layer
+ *
+ * Copyright (C) 2016 Martin Blumenstingl 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see .
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "stmmac_platform.h"
+
+#define PRG_ETH0   0x0
+
+#define PRG_ETH0_RGMII_MODEBIT(0)
+
+/* mux to choose between fclk_div2 (bit unset) and mpll2 (bit set) */
+#define PRG_ETH0_CLK_M250_SEL_SHIFT4
+#define PRG_ETH0_CLK_M250_SEL_MASK GENMASK(4, 4)
+
+#define PRG_ETH0_TXDLY_SHIFT   5
+#define PRG_ETH0_TXDLY_MASKGENMASK(6, 5)
+#define PRG_ETH0_TXDLY_OFF (0x0 << PRG_ETH0_TXDLY_SHIFT)
+#define PRG_ETH0_TXDLY_QUARTER (0x1 << PRG_ETH0_TXDLY_SHIFT)
+#define PRG_ETH0_TXDLY_HALF(0x2 << PRG_ETH0_TXDLY_SHIFT)
+#define PRG_ETH0_TXDLY_THREE_QUARTERS  (0x3 << PRG_ETH0_TXDLY_SHIFT)
+
+/* divider for the result of m250_sel */
+#define PRG_ETH0_CLK_M250_DIV_SHIFT7
+#define PRG_ETH0_CLK_M250_DIV_WIDTH3
+
+/* divides the result of m25_sel by either 5 (bit unset) or 10 (bit set) */
+#define PRG_ETH0_CLK_M25_DIV_SHIFT 10
+#define PRG_ETH0_CLK_M25_DIV_WIDTH 1
+
+#define PRG_ETH0_INVERTED_RMII_CLK BIT(11)
+#define PRG_ETH0_TX_AND_PHY_REF_CLKBIT(12)
+
+#define MUX_CLK_NUM_PARENTS2
+
+struct meson8b_dwmac {
+   struct platform_device  *pdev;
+
+   void __iomem*regs;
+
+   phy_interface_t phy_mode;
+
+   struct clk_mux  m250_mux;
+   struct clk  *m250_mux_clk;
+   struct clk  *m250_mux_parent[MUX_CLK_NUM_PARENTS];
+
+   struct clk_divider  m250_div;
+   struct clk  *m250_div_clk;
+
+   struct clk_divider  m25_div;
+   struct clk  *m25_div_clk;
+};
+
+static void meson8b_dwmac_mask_bits(struct meson8b_dwmac *dwmac, u32 reg,
+   u32 mask, u32 value)
+{
+   u32 data;
+
+   data = readl(dwmac->regs + reg);
+   data &= ~mask;
+   data |= (value & mask);
+
+   writel(data, dwmac->regs + reg);
+}
+
+static int meson8b_init_clk(struct meson8b_dwmac *dwmac)
+{
+   struct clk_init_data init;
+   int i, ret;
+   struct device *dev = &dwmac->pdev->dev;
+   char clk_name[32];
+
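
The patch text is cut off here in the archive. As an illustration of how the field defines and meson8b_dwmac_mask_bits() fit together, here is a userspace sketch that applies the same read-modify-write to a plain variable instead of an ioremapped register. The function name `mask_bits` and the open-coded masks (in place of BIT()/GENMASK()) are assumptions of this sketch, not part of the driver:

```c
#include <stdint.h>

/* Field layout taken from the defines in the patch. */
#define PRG_ETH0_RGMII_MODE     (1u << 0)
#define PRG_ETH0_TXDLY_SHIFT    5
#define PRG_ETH0_TXDLY_MASK     (0x3u << PRG_ETH0_TXDLY_SHIFT) /* bits 6:5 */
#define PRG_ETH0_TXDLY_QUARTER  (0x1u << PRG_ETH0_TXDLY_SHIFT)

/* Clear the bits selected by mask, then set them from value.
 * Bits of value outside mask are ignored, so a caller cannot
 * disturb neighbouring fields by accident. */
static uint32_t mask_bits(uint32_t data, uint32_t mask, uint32_t value)
{
    data &= ~mask;
    data |= (value & mask);
    return data;
}
```

Selecting a quarter-cycle TX delay on a register that reads 0xffffffff leaves every bit outside the 6:5 field untouched and gives 0xffffffbf.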

[PATCH v3 1/5] net: dt-bindings: Document the new Meson8b and GXBB DWMAC bindings

2016-08-28 Thread Martin Blumenstingl

This patch adds the documentation for the DWMAC ethernet controller
found in Amlogic Meson 8b (S805) and GXBB (S905) SoCs.
The main difference from the Meson6 glue is that different registers
(with a different layout) are used.

Signed-off-by: Martin Blumenstingl 
Acked-by: Rob Herring 
---
 .../devicetree/bindings/net/meson-dwmac.txt| 45 ++
 1 file changed, 37 insertions(+), 8 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/meson-dwmac.txt b/Documentation/devicetree/bindings/net/meson-dwmac.txt
index ec633d7..89e62dd 100644
--- a/Documentation/devicetree/bindings/net/meson-dwmac.txt
+++ b/Documentation/devicetree/bindings/net/meson-dwmac.txt
@@ -1,18 +1,32 @@
 * Amlogic Meson DWMAC Ethernet controller
 
 The device inherits all the properties of the dwmac/stmmac devices
-described in the file net/stmmac.txt with the following changes.
+described in the file stmmac.txt in the current directory with the
+following changes.
 
-Required properties:
+Required properties on all platforms:
 
-- compatible: should be "amlogic,meson6-dwmac" along with "snps,dwmac"
- and any applicable more detailed version number
- described in net/stmmac.txt
+- compatible:  Depending on the platform this should be one of:
+   - "amlogic,meson6-dwmac"
+   - "amlogic,meson8b-dwmac"
+   - "amlogic,meson-gxbb-dwmac"
+   Additionally "snps,dwmac" and any applicable more
+   detailed version number described in net/stmmac.txt
+   should be used.
 
-- reg: should contain a register range for the dwmac controller and
-   another one for the Amlogic specific configuration
+- reg: The first register range should be the one of the DWMAC
+   controller. The second range is for the Amlogic specific
+   configuration (for example the PRG_ETHERNET register range
+   on Meson8b and newer)
 
-Example:
+Required properties on Meson8b and newer:
+- clock-names: Should contain the following:
+   - "stmmaceth" - see stmmac.txt
+   - "clkin0" - first parent clock of the internal mux
+   - "clkin1" - second parent clock of the internal mux
+
+
+Example for Meson6:
 
ethmac: ethernet@c941 {
compatible = "amlogic,meson6-dwmac", "snps,dwmac";
@@ -23,3 +37,18 @@ Example:
clocks = <&clk81>;
clock-names = "stmmaceth";
}
+
+Example for GXBB:
+   ethmac: ethernet@c941 {
+   compatible = "amlogic,meson-gxbb-dwmac", "snps,dwmac";
+   reg = <0x0 0xc941 0x0 0x1>,
+   <0x0 0xc8834540 0x0 0x8>;
+   interrupts = <0 8 1>;
+   interrupt-names = "macirq";
+   clocks = <&clkc CLKID_ETH>,
+   <&clkc CLKID_FCLK_DIV2>,
+   <&clkc CLKID_MPLL2>;
+   clock-names = "stmmaceth", "clkin0", "clkin1";
+   phy-mode = "rgmii";
+   status = "disabled";
+   };
-- 
2.9.3

Re: [PATCH v2 1/4] net: dt-bindings: Document the new Meson8b and GXBB DWMAC bindings

2016-08-28 Thread Martin Blumenstingl

On Mon, Aug 22, 2016 at 5:25 PM, Arnd Bergmann  wrote:
> It really depends on the kind of SoC. Some may have a suboptimal
> binding, on some others there may be a distinct register area that
> just contains a few additional registers for the dwmac.
the dwmac PHY configuration registers (2x32bit) on the GXBB SoC are
part of the "periphs" region/module. This is already defined as
"simple-bus" in meson-gxbb.dtsi, see [0]
On Meson8b this is slightly different: there is no specific "periphs"
region - there the dwmac PHY configuration registers are directly
located in the cbus region at a slightly different offset than on the
GXBB SoCs.

In the future we might need a third memory region because the latest
reference kernel contains some more PHY configuration registers on
newer SoCs (GXL = S905X).

Please let me know if you're OK with the dts definition in its
current state - or let me know how you would like to change it.

PS: I will re-send the patches in a v3 in a few minutes because that
fixes a bug during module unload.

Regards,
Martin

[0] 
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/boot/dts/amlogic/meson-gxbb.dtsi#n217

Re: [PATCH v3 net-next 1/1] net_sched: Introduce skbmod action

2016-08-28 Thread Eric Dumazet

On Sun, 2016-08-28 at 08:19 -0400, Jamal Hadi Salim wrote:
> From: Jamal Hadi Salim 

...

> +static int tcf_skbmod_run(struct sk_buff *skb, const struct tc_action *a,
> +   struct tcf_result *res)
> +{
> + struct tcf_skbmod *d = to_skbmod(a);
> +
> + spin_lock(&d->tcf_lock);
> + tcf_lastuse_update(&d->tcf_tm);
> + bstats_update(&d->tcf_bstats, skb);
> +
> + if (d->flags & SKBMOD_F_DMAC)
> + ether_addr_copy(eth_hdr(skb)->h_dest, d->eth_dst);
> + if (d->flags & SKBMOD_F_SMAC)
> + ether_addr_copy(eth_hdr(skb)->h_source, d->eth_src);
> + if (d->flags & SKBMOD_F_ETYPE)
> + eth_hdr(skb)->h_proto = d->eth_type;
> + if (d->flags & SKBMOD_F_SWAPMAC) {
> + u8 tmpaddr[ETH_ALEN];
> + /*XXX: I am sure we can come up with something more efficient */
> + ether_addr_copy(tmpaddr, eth_hdr(skb)->h_dest);
> + ether_addr_copy(eth_hdr(skb)->h_dest, eth_hdr(skb)->h_source);
> + ether_addr_copy(eth_hdr(skb)->h_source, tmpaddr);
> + }
> +
> + spin_unlock(&d->tcf_lock);
> + return d->tcf_action;
> +}


Adding an action with a spinlock held in fast path in 2016 is
a way to tell people : It is a toy, do not use it for real.

Sorry guys. Friends do not let friends do that anymore.
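
One possible way to get the lock out of the fast path (an illustration only, not something proposed in this thread) is to make the parameters immutable once published and let the control path swap a pointer. The sketch below is a userspace analogy using C11 atomics; in the kernel the equivalent roles would be played by RCU and per-CPU statistics. Every name in it is hypothetical, and freeing the old parameter block after readers are done is deliberately left out:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN  6
#define F_DMAC    0x1u
#define F_SMAC    0x2u
#define F_SWAPMAC 0x8u

struct eth_hdr {
    uint8_t  dst[ETH_ALEN];
    uint8_t  src[ETH_ALEN];
    uint16_t proto;
};

/* Never modified after publication: an update builds a new copy. */
struct mod_params {
    uint32_t flags;
    uint8_t  dst[ETH_ALEN];
    uint8_t  src[ETH_ALEN];
};

struct mod_action {
    _Atomic(struct mod_params *) params;
};

/* Control path: publish a fully initialised block and return the old one.
 * Reclaiming the old block safely is out of scope for this sketch. */
static struct mod_params *mod_publish(struct mod_action *a,
                                      struct mod_params *p)
{
    return atomic_exchange_explicit(&a->params, p, memory_order_acq_rel);
}

/* Fast path: one acquire load, no lock. */
static void mod_run(struct mod_action *a, struct eth_hdr *h)
{
    const struct mod_params *p =
        atomic_load_explicit(&a->params, memory_order_acquire);

    if (!p)
        return;
    if (p->flags & F_DMAC)
        memcpy(h->dst, p->dst, ETH_ALEN);
    if (p->flags & F_SMAC)
        memcpy(h->src, p->src, ETH_ALEN);
    if (p->flags & F_SWAPMAC) {
        uint8_t tmp[ETH_ALEN];

        memcpy(tmp, h->dst, ETH_ALEN);
        memcpy(h->dst, h->src, ETH_ALEN);
        memcpy(h->src, tmp, ETH_ALEN);
    }
}
```

The design choice is that readers never see a half-updated parameter set: they either use the old block or the new one, which is what the spinlock was providing at a much higher cost per packet.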

Re: [net-next PATCH] e1000: add initial XDP support

2016-08-28 Thread William Tu

Hi,

Reading through the patch, I found some minor typos below.

On Sat, Aug 27, 2016 at 12:11 AM, John Fastabend wrote:
> From: Alexei Starovoitov 
>
> This patch adds initial support for XDP on e1000 driver. Note e1000
> driver does not support page recycling in general which could be
> added as a further improvement. However for XDP_DROP and XDP_XMIT

I think you mean XDP_PASS instead of XDP_XMIT?

> the xdp code paths will recycle pages.
>
> This patch includes the rcu_read_lock/rcu_read_unlock pair noted by
> Brenden Blanco in another pending patch.
>
>   net/mlx4_en: protect ring->xdp_prog with rcu_read_lock
>
> CC: William Tu 
> Signed-off-by: Alexei Starovoitov 
> Signed-off-by: John Fastabend 
> ---
>  drivers/net/ethernet/intel/e1000/e1000.h  |1
>  drivers/net/ethernet/intel/e1000/e1000_main.c |  168 
> -
>  2 files changed, 165 insertions(+), 4 deletions(-)
>
> +static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
> +unsigned int len,
> +struct net_device *netdev,
> +struct e1000_adapter *adapter)
> +{
> +   struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
> +   struct e1000_hw *hw = &adapter->hw;
> +   struct e1000_tx_ring *tx_ring;
> +
> +   if (len > E1000_MAX_DATA_PER_TXD)
> +   return;
> +
> +   /* e1000 only support a single txq at the moment so the queue is being
> +* shared with stack. To support this requires locking to ensure the
> +* stack and XPD are not running at the same time. Devices would
> +* multiple queues should allocate a separate queue space.
> +*/

XPD --> XDP
Devices would --> with?

> +   HARD_TX_LOCK(netdev, txq, smp_processor_id());
> +
> +   tx_ring = adapter->tx_ring;
> +
> +   if (E1000_DESC_UNUSED(tx_ring) < 2)
> +   return;
> +
> +   e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len);
> +
> +   e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1);
> +
> +   writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt);
> +   mmiowb();
> +
> +   HARD_TX_UNLOCK(netdev, txq);
> +}
> +
>  #define NUM_REGS 38 /* 1 based count */
>  static void e1000_regdump(struct e1000_adapter *adapter)
>  {
> @@ -4142,6 +4240,22 @@ static struct sk_buff *e1000_alloc_rx_skb(struct e1000_adapter *adapter,
> return skb;
>  }
>
> +static inline int e1000_call_bpf(struct bpf_prog *prog, void *data,
> +unsigned int length)
> +{
> +   struct xdp_buff xdp;
> +   int ret;
> +
> +   xdp.data = data;
> +   xdp.data_end = data + length;
> +
> +   rcu_read_lock();
> +   ret = BPF_PROG_RUN(prog, (void *)&xdp);
> +   rcu_read_unlock();
> +
> +   return ret;
> +}
> +
>  /**
>   * e1000_clean_jumbo_rx_irq - Send received data up the network stack; legacy
>   * @adapter: board private structure
> @@ -4160,12 +4274,15 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
> struct pci_dev *pdev = adapter->pdev;
> struct e1000_rx_desc *rx_desc, *next_rxd;
> struct e1000_rx_buffer *buffer_info, *next_buffer;
> +   struct bpf_prog *prog;
> u32 length;
> unsigned int i;
> int cleaned_count = 0;
> bool cleaned = false;
> unsigned int total_rx_bytes = 0, total_rx_packets = 0;
>
> +   rcu_read_lock(); /* rcu lock needed here to protect xdp programs */
> +   prog = READ_ONCE(adapter->prog);

If having rcu_read_lock() here, do we still need another in e1000_call_bpf()?


> i = rx_ring->next_to_clean;
> rx_desc = E1000_RX_DESC(*rx_ring, i);
> buffer_info = &rx_ring->buffer_info[i];
> @@ -4188,15 +4305,57 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
> prefetch(next_rxd);
>
> next_buffer = &rx_ring->buffer_info[i];
> -
> cleaned = true;
> cleaned_count++;
> +   length = le16_to_cpu(rx_desc->length);
> +
> +   if (prog) {
> +   struct page *p = buffer_info->rxbuf.page;
> +   dma_addr_t dma = buffer_info->dma;
> +   int act;
> +
> +   if (unlikely(!(status & E1000_RXD_STAT_EOP))) {
> +   /* attached bpf disallows larger than page
> +* packets, so this is hw error or corruption
> +*/
> +   pr_info_once("%s buggy !eop\n", netdev->name);
> +   break;
> +   }
> +   if (unlikely(rx_ring->rx_skb_top)) {
> +   pr_info_once("%s ring resizing bug\n",
> +netdev->name);
> +   break;
>

Re: [PATCH] Documentation: networking: dsa: Remove platform device TODO

2016-08-28 Thread Andrew Lunn

On Sat, Aug 27, 2016 at 03:34:20PM -0700, Florian Fainelli wrote:
> Since commit 83c0afaec7b7 ("net: dsa: Add new binding implementation"),
> the shortcomings of the dsa platform device have been addressed, remove
> that TODO item.
> 
> Signed-off-by: Florian Fainelli 

Acked-by: Andrew Lunn 

Thanks :-)

Andrew

Re: [PATCH v3 net-next 1/1] net_sched: Introduce skbmod action

2016-08-28 Thread Alexei Starovoitov

On Sun, Aug 28, 2016 at 08:19:16AM -0400, Jamal Hadi Salim wrote:
> --- /dev/null
> +++ b/include/uapi/linux/tc_act/tc_skbmod.h
> @@ -0,0 +1,49 @@
> +/*
> + * Copyright (c) 2016, Jamal Hadi Salim
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along with
> + * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
> + * Place - Suite 330, Boston, MA 02111-1307 USA.

the address is incorrect.
That's why it's recommended to skip this section in all headers.
First paragraph in the above 'This program is ... GPL' would have been enough.

[PATCH nf-next] netfilter: log: Check param to avoid overflow in nf_log_set

2016-08-28 Thread fgao

From: Gao Feng 

nf_log_set() is an interface function, so it should strictly sanity-check
its parameters. Add a check that pf is below NFPROTO_NUMPROTO, and print
an error when pf is invalid.

Signed-off-by: Gao Feng 
---
 net/netfilter/nf_log.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/nf_log.c b/net/netfilter/nf_log.c
index aa5847a..02ce0b9 100644
--- a/net/netfilter/nf_log.c
+++ b/net/netfilter/nf_log.c
@@ -43,8 +43,10 @@ void nf_log_set(struct net *net, u_int8_t pf, const struct nf_logger *logger)
 {
const struct nf_logger *log;
 
-   if (pf == NFPROTO_UNSPEC)
+   if (pf == NFPROTO_UNSPEC || pf >= NFPROTO_NUMPROTO) {
+   pr_err("Wrong pf(%d) for nf log\n", pf);
return;
+   }
 
mutex_lock(&nf_log_mutex);
log = nft_log_dereference(net->nf.nf_loggers[pf]);
-- 
1.9.1
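
The reason the upper bound matters is that pf is used directly as an array index a few lines later. A small userspace model of that pattern follows; the names are hypothetical, the constant 13 is only a placeholder for the table size, and unlike the real nf_log_set() (which returns void) the sketch returns a status so the rejection is observable:

```c
#include <stddef.h>

/* Placeholder numbering for the sketch; the real values live in the
 * kernel's netfilter headers. */
enum { SKETCH_PF_UNSPEC = 0, SKETCH_PF_NUMPROTO = 13 };

struct logger { const char *name; };

/* Table indexed by protocol family, like net->nf.nf_loggers[]. */
static const struct logger *loggers[SKETCH_PF_NUMPROTO];

/* Returns 0 on success, -1 if pf is rejected. Validating pf before it is
 * used as an index prevents a write past the end of the table. */
static int log_set(unsigned char pf, const struct logger *l)
{
    if (pf == SKETCH_PF_UNSPEC || pf >= SKETCH_PF_NUMPROTO)
        return -1;
    loggers[pf] = l;
    return 0;
}
```

Both the boundary value (13 here) and anything above it are rejected; only indices 1 through 12 reach the table.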

[PATCH net iproute2] devlink: Add e-switch support

2016-08-28 Thread Or Gerlitz

Implement the kernel devlink e-switch interface. Currently this allows
getting and setting the device e-switch mode.

Signed-off-by: Or Gerlitz 
Signed-off-by: Roi Dayan 
---

Hi Stephen,

The patch is rebased over the net-next branch of iproute2 which has the 
4.8 UAPI headers devlink update, but is targeted to 4.8 (net) as the 
relevant functionality was merged in 4.8-rc1

Roi && Or.

 devlink/devlink.c  | 122 +
 man/man8/devlink-dev.8 |  34 ++
 2 files changed, 156 insertions(+)

diff --git a/devlink/devlink.c b/devlink/devlink.c
index 84fa51e..d69fc6b 100644
--- a/devlink/devlink.c
+++ b/devlink/devlink.c
@@ -26,6 +26,9 @@
 #include "mnlg.h"
 #include "json_writer.h"
 
+#define ESWITCH_MODE_LEGACY "legacy"
+#define ESWITCH_MODE_SWITCHDEV "switchdev"
+
 #define pr_err(args...) fprintf(stderr, ##args)
 #define pr_out(args...) fprintf(stdout, ##args)
 #define pr_out_sp(num, args...)\
@@ -128,6 +131,7 @@ static void ifname_map_free(struct ifname_map *ifname_map)
 #define DL_OPT_SB_THTYPE   BIT(8)
 #define DL_OPT_SB_TH   BIT(9)
 #define DL_OPT_SB_TC   BIT(10)
+#define DL_OPT_ESWITCH_MODEBIT(11)
 
 struct dl_opts {
uint32_t present; /* flags of present items */
@@ -143,6 +147,7 @@ struct dl_opts {
enum devlink_sb_threshold_type sb_pool_thtype;
uint32_t sb_threshold;
uint16_t sb_tc_index;
+   enum devlink_eswitch_mode eswitch_mode;
 };
 
 struct dl {
@@ -297,6 +302,9 @@ static int attr_cb(const struct nlattr *attr, void *data)
if (type == DEVLINK_ATTR_SB_OCC_MAX &&
mnl_attr_validate(attr, MNL_TYPE_U32) < 0)
return MNL_CB_ERROR;
+   if (type == DEVLINK_ATTR_ESWITCH_MODE &&
+   mnl_attr_validate(attr, MNL_TYPE_U16) < 0)
+   return MNL_CB_ERROR;
tb[type] = attr;
return MNL_CB_OK;
 }
@@ -661,6 +669,19 @@ static int threshold_type_get(const char *typestr,
return 0;
 }
 
+static int eswitch_mode_get(const char *typestr, enum devlink_eswitch_mode *p_mode)
+{
+   if (strcmp(typestr, ESWITCH_MODE_LEGACY) == 0) {
+   *p_mode = DEVLINK_ESWITCH_MODE_LEGACY;
+   } else if (strcmp(typestr, ESWITCH_MODE_SWITCHDEV) == 0) {
+   *p_mode = DEVLINK_ESWITCH_MODE_SWITCHDEV;
+   } else {
+   pr_err("Unknown eswitch mode \"%s\"\n", typestr);
+   return -EINVAL;
+   }
+   return 0;
+}
+
 static int dl_argv_parse(struct dl *dl, uint32_t o_required,
 uint32_t o_optional)
 {
@@ -770,6 +791,17 @@ static int dl_argv_parse(struct dl *dl, uint32_t o_required,
if (err)
return err;
o_found |= DL_OPT_SB_TC;
+   } else if (dl_argv_match(dl, "mode") &&
+  (o_all & DL_OPT_ESWITCH_MODE)) {
+   const char *typestr;
+   dl_arg_inc(dl);
+   err = dl_argv_str(dl, &typestr);
+   if (err)
+   return err;
+   err = eswitch_mode_get(typestr, &opts->eswitch_mode);
+   if (err)
+   return err;
+   o_found |= DL_OPT_ESWITCH_MODE;
} else {
pr_err("Unknown option \"%s\"\n", dl_argv(dl));
return -EINVAL;
@@ -823,6 +855,12 @@ static int dl_argv_parse(struct dl *dl, uint32_t o_required,
pr_err("TC index option expected.\n");
return -EINVAL;
}
+
+   if ((o_required & DL_OPT_ESWITCH_MODE) && !(o_found & DL_OPT_ESWITCH_MODE)) {
+   pr_err("E-Switch mode option expected.\n");
+   return -EINVAL;
+   }
+
return 0;
 }
 
@@ -866,6 +904,9 @@ static void dl_opts_put(struct nlmsghdr *nlh, struct dl *dl)
if (opts->present & DL_OPT_SB_TC)
mnl_attr_put_u16(nlh, DEVLINK_ATTR_SB_TC_INDEX,
 opts->sb_tc_index);
+   if (opts->present & DL_OPT_ESWITCH_MODE)
+   mnl_attr_put_u16(nlh, DEVLINK_ATTR_ESWITCH_MODE,
+opts->eswitch_mode);
 }
 
 static int dl_argv_parse_put(struct nlmsghdr *nlh, struct dl *dl,
@@ -1149,6 +1190,84 @@ static void pr_out_section_end(struct dl *dl)
}
 }
 
+static const char *eswitch_mode_name(uint32_t mode)
+{
+   switch (mode) {
+   case DEVLINK_ESWITCH_MODE_LEGACY: return ESWITCH_MODE_LEGACY;
+   case DEVLINK_ESWITCH_MODE_SWITCHDEV: return ESWITCH_MODE_SWITCHDEV;
+   default: return "";
+   }
+}
+
+static void pr_out_eswitch(struct dl *dl, struct nlattr **tb)
+{
+   __pr_out_handle_start(dl, tb, true, false);
+
+   if (tb[DEVLINK_ATTR_ESWITCH_MODE])
+   pr_out_str(dl, "mode",
+  
eswitch_mode_name(mnl_attr_get_u16
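
The patch text is cut off here in the archive. The two directions of the mode mapping (string to enum when parsing "mode ...", enum to string when printing) can be summarised in a standalone sketch. This is a variation that drives both directions from a single table rather than the if/else chain and switch used in the patch; the enum is a local stand-in for the devlink UAPI one, and all names are hypothetical:

```c
#include <string.h>

/* Local stand-in for enum devlink_eswitch_mode. */
enum esw_mode { ESW_LEGACY, ESW_SWITCHDEV };

static const struct {
    enum esw_mode mode;
    const char *name;
} esw_modes[] = {
    { ESW_LEGACY,    "legacy" },
    { ESW_SWITCHDEV, "switchdev" },
};

#define N_ESW_MODES (sizeof(esw_modes) / sizeof(esw_modes[0]))

/* String -> enum; returns 0 on success, -1 for an unknown string. */
static int esw_mode_get(const char *s, enum esw_mode *out)
{
    size_t i;

    for (i = 0; i < N_ESW_MODES; i++) {
        if (strcmp(s, esw_modes[i].name) == 0) {
            *out = esw_modes[i].mode;
            return 0;
        }
    }
    return -1;
}

/* Enum -> string; unknown values map to a fixed placeholder. */
static const char *esw_mode_name(enum esw_mode mode)
{
    size_t i;

    for (i = 0; i < N_ESW_MODES; i++) {
        if (esw_modes[i].mode == mode)
            return esw_modes[i].name;
    }
    return "unknown";
}
```

Keeping one table means a newly added mode cannot be parseable but unprintable, or the other way round.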

[iproute2 3/3] police: bug fix man page

2016-08-28 Thread Jamal Hadi Salim

From: Roman Mashak 

Signed-off-by: Roman Mashak 
Signed-off-by: Jamal Hadi Salim 
---
 man/man8/tc-police.8 | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/man/man8/tc-police.8 b/man/man8/tc-police.8
index 5c5a632..620c288 100644
--- a/man/man8/tc-police.8
+++ b/man/man8/tc-police.8
@@ -26,10 +26,10 @@ police - policing action
 
 .ti -8
 .IR CONTROL " :="
-.BI conform-exceed " EXCEEDACT\fR[\fB/\fIEXCEEDACT"
+.BI conform-exceed " EXCEEDACT\fR[\fB/\fINOTEXCEEDACT"
 
 .ti -8
-.IR EXCEEDACT " := { "
+.IR EXCEEDACT/NOTEXCEEDACT " := { "
 .BR pipe " | " ok " | " reclassify " | " drop " | " continue " }"
 .SH DESCRIPTION
 The
@@ -94,11 +94,9 @@ Fine-tune the in-kernel packet rate estimator.
 are time values and control the frequency in which samples are taken and over
 what timespan an average is built.
 .TP
-.BI conform-exceed " EXCEEDACT\fR[\fB/\fIEXCEEDACT\fR]"
-Define how to handle packets which exceed (and, if the second
-.I EXCEEDACT
-is given, also those who don't), the configured bandwidth limit. Possible values
-are:
+.BI conform-exceed " EXCEEDACT\fR[\fB/\fINOTEXCEEDACT\fR]"
+Define how to handle packets which exceed or conform to the
+configured bandwidth limit. Possible values are:
 .RS
 .IP continue
 Don't do anything, just continue with the next action in line.
-- 
1.9.1
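
To illustrate the EXCEEDACT[/NOTEXCEEDACT] syntax the man page describes, here is a small standalone sketch that splits such an argument and validates both halves against the documented action names. It is not tc's actual parser; the function names are hypothetical:

```c
#include <string.h>

/* Action keywords listed in the man page. */
static const char *const police_actions[] = {
    "pipe", "ok", "reclassify", "drop", "continue",
};

static int action_valid(const char *s)
{
    size_t i;

    for (i = 0; i < sizeof(police_actions) / sizeof(police_actions[0]); i++) {
        if (strcmp(s, police_actions[i]) == 0)
            return 1;
    }
    return 0;
}

/* Split "EXCEEDACT[/NOTEXCEEDACT]" in place (arg is modified).
 * *notexceed is set to NULL when no second action was given.
 * Returns 0 on success, -1 if either action is not a known keyword. */
static int parse_conform_exceed(char *arg, const char **exceed,
                                const char **notexceed)
{
    char *slash = strchr(arg, '/');

    *notexceed = NULL;
    if (slash) {
        *slash = '\0';
        *notexceed = slash + 1;
    }
    *exceed = arg;

    if (!action_valid(*exceed))
        return -1;
    if (*notexceed && !action_valid(*notexceed))
        return -1;
    return 0;
}
```

For "drop/pipe" the first action applies to packets exceeding the limit and the second to packets that conform; for a bare "ok" only the exceed action is set.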

[iproute2 2/3] police: improve usage message

2016-08-28 Thread Jamal Hadi Salim

From: Roman Mashak 

Signed-off-by: Roman Mashak 
Signed-off-by: Jamal Hadi Salim 
---
 tc/m_police.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/tc/m_police.c b/tc/m_police.c
index d7fa8f6..226e20e 100644
--- a/tc/m_police.c
+++ b/tc/m_police.c
@@ -36,11 +36,12 @@ static void usage(void)
 {
fprintf(stderr, "Usage: ... police rate BPS burst BYTES[/BYTES] [ mtu BYTES[/BYTES] ]\n");
fprintf(stderr, "[ peakrate BPS ] [ avrate BPS ] [ overhead BYTES ]\n");
-   fprintf(stderr, "[ linklayer TYPE ] [ ACTIONTERM ]\n");
+   fprintf(stderr, "[ linklayer TYPE ] [ CONTROL ]\n");
 
-   fprintf(stderr, "New Syntax ACTIONTERM := conform-exceed [/NOTEXCEEDACT]\n");
-   fprintf(stderr, "Where: *EXCEEDACT := pipe | ok | reclassify | drop | continue\n");
-   fprintf(stderr, "Where:  pipe is only valid for new syntax\n");
+   fprintf(stderr, "Where: CONTROL := conform-exceed [/NOTEXCEEDACT]\n");
+   fprintf(stderr, "  Define how to handle packets which exceed ()\n");
+   fprintf(stderr, "  or conform () the configured bandwidth limit.\n");
+   fprintf(stderr, "   EXCEEDACT/NOTEXCEEDACT := { pipe | ok | reclassify | drop | continue }\n");
exit(-1);
 }
 
-- 
1.9.1

[iproute2 1/3] police: add extra space to improve police result printing

2016-08-28 Thread Jamal Hadi Salim

From: Roman Mashak 

Signed-off-by: Roman Mashak 
Signed-off-by: Jamal Hadi Salim 
---
 tc/m_police.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tc/m_police.c b/tc/m_police.c
index f0b179f..d7fa8f6 100644
--- a/tc/m_police.c
+++ b/tc/m_police.c
@@ -322,7 +322,7 @@ int print_police(struct action_util *a, FILE *f, struct rtattr *arg)
if (tb[TCA_POLICE_RESULT]) {
__u32 action = rta_getattr_u32(tb[TCA_POLICE_RESULT]);
 
-   fprintf(f, "/%s", action_n2a(action));
+   fprintf(f, "/%s ", action_n2a(action));
} else
fprintf(f, " ");
 
-- 
1.9.1

Re: [net-next PATCH] e1000: add initial XDP support

2016-08-28 Thread Jamal Hadi Salim


On 16-08-27 03:11 AM, John Fastabend wrote:

From: Alexei Starovoitov 

This patch adds initial support for XDP on e1000 driver. Note e1000
driver does not support page recycling in general which could be
added as a further improvement. However for XDP_DROP and XDP_XMIT
the xdp code paths will recycle pages.

This patch includes the rcu_read_lock/rcu_read_unlock pair noted by
Brenden Blanco in another pending patch.

  net/mlx4_en: protect ring->xdp_prog with rcu_read_lock


Do you have any perf numbers of drops of this vs tc drop at ingress?
single or multiple cpus.

cheers,
jamal

[PATCH v3 net-next 1/1] net_sched: Introduce skbmod action

2016-08-28 Thread Jamal Hadi Salim

From: Jamal Hadi Salim 

This action is intended to be an upgrade over pedit from a usability
perspective (as well as operational debuggability).
Compare this:

sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action pedit munge offset -14 u8 set 0x02 \
munge offset -13 u8 set 0x15 \
munge offset -12 u8 set 0x15 \
munge offset -11 u8 set 0x15 \
munge offset -10 u16 set 0x1515 \
pipe

to:

sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action skbmod dmac 02:15:15:15:15:15

Also try to do a MAC address swap with pedit or, worse,
try to debug a policy with destination MAC, source MAC and
ethertype. Then make a few rules out of those and you'll get my point.
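
To make the comparison concrete: as far as I understand pedit, its offsets are relative to the start of the network header, so -14 through -9 land on the destination MAC of an untagged Ethernet frame. The userspace sketch below (hypothetical helper names, not tc code) replays the five munge operations from the pedit command and shows that they write the same six bytes as a single dmac assignment:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define ETH_HLEN 14
#define ETH_ALEN 6

/* Parse "aa:bb:cc:dd:ee:ff" into six bytes; returns 0 on success. */
static int parse_mac(const char *s, uint8_t mac[ETH_ALEN])
{
    unsigned int b[ETH_ALEN];
    char extra;
    int i;

    /* The trailing %c only matches if there is junk after the sixth
     * octet, in which case sscanf returns 7 and the input is rejected. */
    if (sscanf(s, "%2x:%2x:%2x:%2x:%2x:%2x%c",
               &b[0], &b[1], &b[2], &b[3], &b[4], &b[5], &extra) != ETH_ALEN)
        return -1;
    for (i = 0; i < ETH_ALEN; i++)
        mac[i] = (uint8_t)b[i];
    return 0;
}

/* pedit style: one byte at an offset relative to the network header,
 * which starts ETH_HLEN bytes into the frame. */
static void munge_u8(uint8_t *frame, int off, uint8_t val)
{
    frame[ETH_HLEN + off] = val;
}

/* pedit style 16-bit write, most significant byte first. */
static void munge_u16(uint8_t *frame, int off, uint16_t val)
{
    frame[ETH_HLEN + off]     = (uint8_t)(val >> 8);
    frame[ETH_HLEN + off + 1] = (uint8_t)(val & 0xff);
}

/* skbmod style: the destination MAC is the first six bytes of the frame. */
static int set_dmac(uint8_t *frame, const char *mac_str)
{
    uint8_t mac[ETH_ALEN];

    if (parse_mac(mac_str, mac))
        return -1;
    memcpy(frame, mac, ETH_ALEN);
    return 0;
}
```

Calling munge_u8() for offsets -14 to -11 and munge_u16() for offset -10 on one zeroed header, and set_dmac(..., "02:15:15:15:15:15") on another, leaves the two buffers byte-for-byte identical; the difference is only in how much the operator has to know about header layout.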

In the future, common pedit use cases can be migrated to this action
(for example, different fields in IPv4/IPv6 and transports like
TCP/UDP/SCTP). For this first cut, it allows modifying the basic
Ethernet header.

Signed-off-by: Jamal Hadi Salim 
---
 include/net/tc_act/tc_skbmod.h|  34 +
 include/uapi/linux/tc_act/tc_skbmod.h |  49 +++
 net/sched/Kconfig |  11 ++
 net/sched/Makefile|   1 +
 net/sched/act_skbmod.c| 257 ++
 5 files changed, 352 insertions(+)
 create mode 100644 include/net/tc_act/tc_skbmod.h
 create mode 100644 include/uapi/linux/tc_act/tc_skbmod.h
 create mode 100644 net/sched/act_skbmod.c

diff --git a/include/net/tc_act/tc_skbmod.h b/include/net/tc_act/tc_skbmod.h
new file mode 100644
index 000..8ea0b25
--- /dev/null
+++ b/include/net/tc_act/tc_skbmod.h
@@ -0,0 +1,34 @@
+/*
+ * Copyright (c) 2016, Jamal Hadi Salim
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, see .
+ *
+ * Author: Jamal Hadi Salim 
+ */
+
+#ifndef __NET_TC_SKBMOD_H
+#define __NET_TC_SKBMOD_H
+
+#include 
+#include 
+
+struct tcf_skbmod {
+   struct tc_action common;
+   u64 flags; /*up to 64 types of operations; extend if needed */
+   u8  eth_dst[ETH_ALEN];
+   u16 eth_type;
+   u8  eth_src[ETH_ALEN];
+};
+#define to_skbmod(a) ((struct tcf_skbmod *)a)
+
+#endif /* __NET_TC_SKBMOD_H */
diff --git a/include/uapi/linux/tc_act/tc_skbmod.h b/include/uapi/linux/tc_act/tc_skbmod.h
new file mode 100644
index 000..a1fdff5
--- /dev/null
+++ b/include/uapi/linux/tc_act/tc_skbmod.h
@@ -0,0 +1,49 @@
+/*
+ * Copyright (c) 2016, Jamal Hadi Salim
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ * Author: Jamal Hadi Salim
+ */
+
+#ifndef __LINUX_TC_SKBMOD_H
+#define __LINUX_TC_SKBMOD_H
+
+#include 
+
+#define TCA_ACT_SKBMOD 15
+
+#define SKBMOD_F_DMAC  0x1
+#define SKBMOD_F_SMAC  0x2
+#define SKBMOD_F_ETYPE 0x4
+#define SKBMOD_F_SWAPMAC 0x8
+
+struct tc_skbmod {
+   tc_gen;
+   __u64 flags;
+};
+
+enum {
+   TCA_SKBMOD_UNSPEC,
+   TCA_SKBMOD_TM,
+   TCA_SKBMOD_PARMS,
+   TCA_SKBMOD_DMAC,
+   TCA_SKBMOD_SMAC,
+   TCA_SKBMOD_ETYPE,
+   TCA_SKBMOD_PAD,
+   __TCA_SKBMOD_MAX
+};
+#define TCA_SKBMOD_MAX (__TCA_SKBMOD_MAX - 1)
+
+#endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index ccf931b..34b556d 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -749,6 +749,17 @@ config NET_ACT_CONNMARK
  To compile this code as a module, choose M here: the
  module will be called act_connmark.
 
+config NET_ACT_SKBMOD
+tristate "skb data modification action"
+depends on NET_CLS_ACT
+---help---
+ Say Y here to allow modification of skb data
+
+ If unsure, say N.
+
+ To compile this code as a module, choose M here: the
+ module will be called act_skbmod.
+
 config NET_ACT_IFE
 tristate "Inter-FE action based on IETF ForCES InterFE LFB"
 depends

Re: [PATCH net 0/2] ppp: fix deadlock upon recursive xmit

2016-08-28 Thread Feng Gao

On Sun, Aug 28, 2016 at 4:20 AM, Guillaume Nault  wrote:
> This series fixes the issue reported by Feng where packets looping
> through a ppp device makes the module deadlock:
> https://marc.info/?l=linux-netdev&m=147134567319038&w=2
>
> The problem can occur on virtual interfaces (e.g. PPP over L2TP, or
> PPPoE on vxlan devices), when a PPP packet is routed back to the PPP
> interface.
>
> PPP's xmit path isn't reentrant, so patch #1 uses a per-cpu variable
> to detect and break recursion. Patch #2 sets the NETIF_F_LLTX flag to
> avoid lock inversion issues between ppp and txqueue locks.
>
> There are multiple entry points to the PPP xmit path. This series has
> been tested with lockdep and should address recursion issues no matter
> how the packet entered the path.
>
>
> A similar issue in L2TP is not covered by this series:
> l2tp_xmit_skb() also isn't reentrant, and it can be called as part of
> PPP's xmit path (pppol2tp_xmit()), or directly from the L2TP socket
> (l2tp_ppp_sendmsg()). If a packet is sent by l2tp_ppp_sendmsg() and
> routed to the parent PPP interface, then it's going to hit
> l2tp_xmit_skb() again.
>
> Breaking recursion as done in ppp_generic is not enough, because we'd
> still have a lock inversion issue (locking in l2tp_xmit_skb() can
> happen before or after locking in ppp_generic). The best approach would
> be to use the ip_tunnel functions and remove the socket locking in
> l2tp_xmit_skb(). But that'd be something for net-next.
>
>
> BTW, I hope the commit messages aren't too long. Just let me know if I
> should trim something.
>
>
> Guillaume Nault (2):
>   ppp: avoid dealock on recursive xmit
>   ppp: declare PPP devices as LLTX
>
>  drivers/net/ppp/ppp_generic.c | 54 +--
>  1 file changed, 42 insertions(+), 12 deletions(-)
>
> --
> 2.9.3
>

I am studying your code. It is better than my solution :))

Best Regards
Feng
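
For readers following the thread, the recursion-breaking idea from patch #1 can be sketched in userspace roughly as below. This is only an illustration under stated assumptions, not the code from ppp_generic.c: a thread-local flag stands in for the kernel's per-cpu variable (which additionally relies on the xmit path not migrating between CPUs), and guarded_xmit(), lower_layer and the counters are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace analog of a per-cpu marker: one flag per thread. */
static _Thread_local bool xmit_in_progress;

static int xmit_calls;		/* transmissions that ran to completion */
static int xmit_dropped;	/* re-entrant attempts that were refused */

/* Hypothetical lower layer that may route the packet back to us. */
static void (*lower_layer)(int pkt);

/* Transmit a packet, refusing to re-enter the path on the same thread. */
static bool guarded_xmit(int pkt)
{
	if (xmit_in_progress) {
		/* Already inside the xmit path: drop instead of deadlocking. */
		xmit_dropped++;
		return false;
	}

	xmit_in_progress = true;
	if (lower_layer)
		lower_layer(pkt);	/* may call guarded_xmit() again */
	xmit_calls++;
	xmit_in_progress = false;
	return true;
}

/* A lower layer that loops every packet straight back into the xmit path. */
static void loopback_lower(int pkt)
{
	guarded_xmit(pkt);
}
```

With lower_layer set to loopback_lower, a single guarded_xmit() call completes once and records one dropped re-entrant attempt, instead of recursing on a lock that is already held.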

Re: [PATCH v2 net-next 1/1] net_sched: Introduce skbmod action

2016-08-28 Thread Jamal Hadi Salim


On 16-08-28 08:13 AM, Jamal Hadi Salim wrote:



+#if 0
+   if (lflags & SKBMOD_F_SWAPMAC)
+   if ((lflags & SKBMOD_F_DMAC) || (lflags & SKBMOD_F_SMAC))
+   return -EINVAL;
+#endif



Sorry - left over from earlier. Will send v3 shortly.

cheers,
jamal

[PATCH v2 net-next 1/1] net_sched: Introduce skbmod action

2016-08-28 Thread Jamal Hadi Salim

From: Jamal Hadi Salim 

This action is intended to be an upgrade over pedit from a usability
perspective (as well as for operational debuggability).
Compare this:

sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action pedit munge offset -14 u8 set 0x02 \
munge offset -13 u8 set 0x15 \
munge offset -12 u8 set 0x15 \
munge offset -11 u8 set 0x15 \
munge offset -10 u16 set 0x1515 \
pipe

to:

sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action skbmod dmac 02:15:15:15:15:15

Also, try to do a MAC address swap with pedit or, worse,
try to debug a policy with destination MAC, source MAC and
ethertype where you made a mistake in a few odd octets.
Then make a few rules out of those and you'll get my point.

In the future, common pedit use cases can be migrated to this action
(for example, different fields in IPv4/IPv6, or transports like
TCP/UDP/SCTP). For this first cut, it allows modifying the basic
Ethernet header.

Signed-off-by: Jamal Hadi Salim 
---
 include/net/tc_act/tc_skbmod.h|  34 +
 include/uapi/linux/tc_act/tc_skbmod.h |  49 +++
 net/sched/Kconfig |  11 ++
 net/sched/Makefile|   1 +
 net/sched/act_skbmod.c| 263 ++
 5 files changed, 358 insertions(+)
 create mode 100644 include/net/tc_act/tc_skbmod.h
 create mode 100644 include/uapi/linux/tc_act/tc_skbmod.h
 create mode 100644 net/sched/act_skbmod.c

diff --git a/include/net/tc_act/tc_skbmod.h b/include/net/tc_act/tc_skbmod.h
new file mode 100644
index 000..8ea0b25
--- /dev/null
+++ b/include/net/tc_act/tc_skbmod.h
@@ -0,0 +1,34 @@
+/*
+ * Copyright (c) 2016, Jamal Hadi Salim
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, see .
+ *
+ * Author: Jamal Hadi Salim 
+ */
+
+#ifndef __NET_TC_SKBMOD_H
+#define __NET_TC_SKBMOD_H
+
+#include 
+#include 
+
+struct tcf_skbmod {
+   struct tc_action common;
+   u64 flags; /*up to 64 types of operations; extend if needed */
+   u8  eth_dst[ETH_ALEN];
+   u16 eth_type;
+   u8  eth_src[ETH_ALEN];
+};
+#define to_skbmod(a) ((struct tcf_skbmod *)a)
+
+#endif /* __NET_TC_SKBMOD_H */
diff --git a/include/uapi/linux/tc_act/tc_skbmod.h b/include/uapi/linux/tc_act/tc_skbmod.h
new file mode 100644
index 000..a1fdff5
--- /dev/null
+++ b/include/uapi/linux/tc_act/tc_skbmod.h
@@ -0,0 +1,49 @@
+/*
+ * Copyright (c) 2016, Jamal Hadi Salim
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ * Author: Jamal Hadi Salim
+ */
+
+#ifndef __LINUX_TC_SKBMOD_H
+#define __LINUX_TC_SKBMOD_H
+
+#include 
+
+#define TCA_ACT_SKBMOD 15
+
+#define SKBMOD_F_DMAC  0x1
+#define SKBMOD_F_SMAC  0x2
+#define SKBMOD_F_ETYPE 0x4
+#define SKBMOD_F_SWAPMAC 0x8
+
+struct tc_skbmod {
+   tc_gen;
+   __u64 flags;
+};
+
+enum {
+   TCA_SKBMOD_UNSPEC,
+   TCA_SKBMOD_TM,
+   TCA_SKBMOD_PARMS,
+   TCA_SKBMOD_DMAC,
+   TCA_SKBMOD_SMAC,
+   TCA_SKBMOD_ETYPE,
+   TCA_SKBMOD_PAD,
+   __TCA_SKBMOD_MAX
+};
+#define TCA_SKBMOD_MAX (__TCA_SKBMOD_MAX - 1)
+
+#endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index ccf931b..34b556d 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -749,6 +749,17 @@ config NET_ACT_CONNMARK
  To compile this code as a module, choose M here: the
  module will be called act_connmark.
 
+config NET_ACT_SKBMOD
+tristate "skb data modification action"
+depends on NET_CLS_ACT
+---help---
+ Say Y here to allow modification of skb data
+
+ If unsure, say N.
+
+ To compile this code as a module, choose M here: the
+ module will be called act_skbmod.
+
 config NET_ACT_IFE
 tristate "Inter-FE action based o

[PATCH] cxgb4/cxgb4vf: fix spelling mistake "provissioned" -> "provisioned"

2016-08-28 Thread Colin King

From: Colin Ian King 

Trivial fix to spelling mistake in dev_warn message.

Signed-off-by: Colin Ian King 
---
 drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
index f2951bf..100b2cc 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
@@ -2378,7 +2378,7 @@ static void size_nports_qsets(struct adapter *adapter)
 */
pmask_nports = hweight32(adapter->params.vfres.pmask);
if (pmask_nports < adapter->params.nports) {
-   dev_warn(adapter->pdev_dev, "only using %d of %d provissioned"
+   dev_warn(adapter->pdev_dev, "only using %d of %d provisioned"
 " virtual interfaces; limited by Port Access Rights"
 " mask %#x\n", pmask_nports, adapter->params.nports,
 adapter->params.vfres.pmask);
-- 
2.9.3

[PATCH] net: ucc_geth: fix spelling mistake "propperty" -> "property"

2016-08-28 Thread Colin King

From: Colin Ian King 

Trivial fix to spelling mistake in dev_warn message.

Signed-off-by: Colin Ian King 
---
 drivers/net/ethernet/freescale/ucc_geth.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/freescale/ucc_geth.c b/drivers/net/ethernet/freescale/ucc_geth.c
index 5bf1ade..186ef8f 100644
--- a/drivers/net/ethernet/freescale/ucc_geth.c
+++ b/drivers/net/ethernet/freescale/ucc_geth.c
@@ -3756,7 +3756,7 @@ static int ucc_geth_probe(struct platform_device* ofdev)
return -EINVAL;
}
if ((*prop < QE_CLK_NONE) || (*prop > QE_CLK24)) {
-   pr_err("invalid rx-clock propperty\n");
+   pr_err("invalid rx-clock property\n");
return -EINVAL;
}
ug_info->uf_info.rx_clock = *prop;
-- 
2.9.3

[PATCH] wan/fsl_ucc_hdlc: fix spelling mistake "prameter" -> "parameter"

2016-08-28 Thread Colin King

From: Colin Ian King 

Trivial fix to spelling mistake in dev_err message.

Signed-off-by: Colin Ian King 
---
 drivers/net/wan/fsl_ucc_hdlc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wan/fsl_ucc_hdlc.c b/drivers/net/wan/fsl_ucc_hdlc.c
index 6f04445..5fbf83d 100644
--- a/drivers/net/wan/fsl_ucc_hdlc.c
+++ b/drivers/net/wan/fsl_ucc_hdlc.c
@@ -162,7 +162,7 @@ static int uhdlc_init(struct ucc_hdlc_private *priv)
ALIGNMENT_OF_UCC_HDLC_PRAM);
 
if (priv->ucc_pram_offset < 0) {
-   dev_err(priv->dev, "Can not allocate MURAM for hdlc prameter.\n");
+   dev_err(priv->dev, "Can not allocate MURAM for hdlc parameter.\n");
ret = -ENOMEM;
goto free_tx_bd;
}
-- 
2.9.3

kcm: use-after-free in fput of kcm socket

2016-08-28 Thread Dmitry Vyukov

Hello,

The following program triggers use-after-free:

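// autogenerated by syzkaller (http://github.com/google/syzkaller)
#include <sys/syscall.h>
#include <unistd.h>

int main()
{
  // 0x29 = AF_KCM, 0x5 = SOCK_SEQPACKET
  int fd = syscall(SYS_socket, 0x29ul, 0x5ul, 0x0ul, 0, 0, 0);
  // 0x89e2 = SIOCKCMCLONE (SIOCPROTOPRIVATE + 2)
  syscall(SYS_ioctl, fd, 0x89e2ul, 0x20a98000ul, 0, 0, 0);
  return 0;
}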


[  367.240184] ==
[  367.240784] BUG: KASAN: use-after-free in __fput+0x65a/0x780 at addr 880069bc4b30
[  367.241034] Read of size 2 by task a.out/4045
[  367.241034] CPU: 3 PID: 4045 Comm: a.out Not tainted 4.8.0-rc3+ #34
[  367.241034] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[  367.241034]  884b8280 880038fb7bc0 82d1b1d9 00622e00
[  367.241034]  fbfff1097050 88003e198900 880069bc4b00 880069bc4ec0
[  367.241034]  880069bc4b30 859e90a0 880038fb7be8 817da1fc
[  367.241034] Call Trace:
[  367.241034]  [] dump_stack+0x12e/0x185
[  367.241034]  [] ? sock_release+0x1d0/0x1d0
[  367.241034]  [] kasan_object_err+0x1c/0x70
[  367.241034]  [] kasan_report_error+0x1ae/0x490
[  367.241034]  [] ? sock_release+0x1d0/0x1d0
[  367.241034]  [] __asan_report_load2_noabort+0x3e/0x40
[  367.241034]  [] ? __fput+0x65a/0x780
[  367.241034]  [] __fput+0x65a/0x780
[  367.241034]  [] fput+0x15/0x20
[  367.241034]  [] task_work_run+0xf3/0x170
[  367.241034]  [] do_exit+0x868/0x2c10
[  367.241034]  [] ? sock_ioctl+0x1db/0x3d0
[  367.241034]  [] ? sock_do_ioctl+0xb0/0xb0
[  367.241034]  [] ? do_vfs_ioctl+0x430/0x1080
[  367.241034]  [] ? mm_update_next_owner+0x640/0x640
[  367.241034]  [] ? ioctl_preallocate+0x210/0x210
[  367.241034]  [] ? bad_area+0x69/0x80
[  367.241034]  [] ? exit_to_usermode_loop+0x3e/0x210
[  367.241034]  [] ? entry_SYSCALL_64_fastpath+0x5/0xc1
[  367.241034]  [] do_group_exit+0x108/0x330
[  367.241034]  [] SyS_exit_group+0x1d/0x20
[  367.241034]  [] entry_SYSCALL_64_fastpath+0x23/0xc1
[  367.241034] Object at 880069bc4b00, in cache sock_inode_cache size: 960
[  367.241034] Allocated:
[  367.241034] PID = 4045
[  367.241034]  [] save_stack_trace+0x26/0x50
[  367.241034]  [] save_stack+0x46/0xd0
[  367.241034]  [] kasan_kmalloc+0xad/0xe0
[  367.241034]  [] kasan_slab_alloc+0x12/0x20
[  367.241034]  [] kmem_cache_alloc+0x12b/0x710
[  367.241034]  [] sock_alloc_inode+0x1d/0x250
[  367.241034]  [] alloc_inode+0x61/0x180
[  367.241034]  [] new_inode_pseudo+0x17/0xe0
[  367.241034]  [] sock_alloc+0x41/0x280
[  367.241034]  [] kcm_ioctl+0x9b3/0x13e0
[  367.241034]  [] sock_do_ioctl+0x65/0xb0
[  367.241034]  [] sock_ioctl+0x2d2/0x3d0
[  367.241034]  [] do_vfs_ioctl+0x18c/0x1080
[  367.241034]  [] SyS_ioctl+0x8f/0xc0
[  367.241034]  [] entry_SYSCALL_64_fastpath+0x23/0xc1
[  367.241034] Freed:
[  367.241034] PID = 4045
[  367.241034]  [] save_stack_trace+0x26/0x50
[  367.241034]  [] save_stack+0x46/0xd0
[  367.241034]  [] kasan_slab_free+0x72/0xc0
[  367.241034]  [] kmem_cache_free+0x76/0x300
[  367.241034]  [] sock_destroy_inode+0x56/0x70
[  367.241034]  [] destroy_inode+0xc7/0x130
[  367.241034]  [] evict+0x329/0x500
[  367.241034]  [] iput+0x495/0x930
[  367.241034]  [] sock_release+0x164/0x1d0
[  367.241034]  [] sock_close+0x16/0x20
[  367.241034]  [] __fput+0x236/0x780
[  367.241034]  [] fput+0x15/0x20
[  367.241034]  [] task_work_run+0xf3/0x170
[  367.241034]  [] do_exit+0x868/0x2c10
[  367.241034]  [] do_group_exit+0x108/0x330
[  367.241034]  [] SyS_exit_group+0x1d/0x20
[  367.241034]  [] entry_SYSCALL_64_fastpath+0x23/0xc1
[  367.241034] Memory state around the buggy address:
[  367.241034]  880069bc4a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  367.241034]  880069bc4a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  367.241034] >880069bc4b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  367.241034]  ^
[  367.241034]  880069bc4b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  367.241034]  880069bc4c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  367.241034] ==


It is then followed by a bunch of other bugs, full log is here:
https://gist.githubusercontent.com/dvyukov/b9884388bee40b792ae7900928358484/raw/ace2fa242468d584fa61bf753a5891faa71b0932/gistfile1.txt


On commit 61c04572de404e52a655a36752e696bbcb483cf5 (Aug 25).

Re: [RFC v2 09/10] landlock: Handle cgroups (performance)

2016-08-28 Thread Mickaël Salaün



On 28/08/2016 10:13, Andy Lutomirski wrote:
> On Aug 27, 2016 11:14 PM, "Mickaël Salaün"  wrote:
>>
>>
>> On 27/08/2016 22:43, Alexei Starovoitov wrote:
>>> On Sat, Aug 27, 2016 at 09:35:14PM +0200, Mickaël Salaün wrote:
>>>> On 27/08/2016 20:06, Alexei Starovoitov wrote:
>>>>> On Sat, Aug 27, 2016 at 04:06:38PM +0200, Mickaël Salaün wrote:
>>>>>> As said above, Landlock will not run an eBPF program when not strictly
>>>>>> needed. Attaching to a cgroup will have the same performance impact as
>>>>>> attaching to a process hierarchy.
>>>>>
>>>>> Having a prog per cgroup per lsm_hook is the only scalable way I
>>>>> could come up with. If you see another way, please propose.
>>>>> current->seccomp.landlock_prog is not the answer.
>>>>
>>>> Hum, I don't see the difference from a performance point of view between
>>>> a cgroup-based or a process hierarchy-based system.
>>>>
>>>> Maybe a better option should be to use an array of pointers with N
>>>> entries, one for each supported hook, instead of a unique pointer list?
>>>
>>> yes, clearly array dereference is faster than link list walk.
>>> Now the question is where to keep this prog_array[num_lsm_hooks] ?
>>> Since we cannot keep it inside task_struct, we have to allocate it.
>>> Every time the task is created then. What to do on the fork? That
>>> will require changes all over. Then the obvious optimization would be
>>> to share this allocated array of prog pointers across multiple tasks...
>>> and little by little this new facility will look like cgroup.
>>> Hence the suggestion to put this array into cgroup from the start.
>>
>> I see your point :)
>>
>>>
>>>> Anyway, being able to attach an LSM hook program to a cgroup thanks to
>>>> the new BPF_PROG_ATTACH seems a good idea (while keeping the possibility
>>>> to use a process hierarchy). The downside will be to handle an LSM hook
>>>> program which is not triggered by a seccomp-filter, but this should be
>>>> needed anyway to handle interruptions.
>>>
>>> what do you mean 'not triggered by seccomp' ?
>>> You're not suggesting that this lsm has to enable seccomp to be functional?
>>> imo that's non starter due to overhead.
>>
>> Yes, for now, it is triggered by a new seccomp filter return value
>> RET_LANDLOCK, which can take a 16-bit value called cookie. This must not
>> be needed but could be useful to bind a seccomp filter security policy
>> with a Landlock one. Waiting for Kees's point of view…
>>
> 
> I'm not Kees, but I'd be okay with that.  I still think that doing
> this by process hierarchy a la seccomp will be easier to use and to
> understand (which is quite important for this kind of work) than doing
> it by cgroup.
> 
> A feature I've wanted to add for a while is to have an fd that
> represents a seccomp layer, the idea being that you would set up your
> seccomp layer (with syscall filter, landlock hooks, etc) and then you
> would have a syscall to install that layer.  Then an unprivileged
> sandbox manager could set up its layer and still be able to inject new
> processes into it later on, no cgroups needed.

A nice thing I didn't highlight about Landlock is that a process can
prepare a layer of rules (arraymap of handles + Landlock programs) and
pass the file descriptors of the Landlock programs to another process.
This process could then apply these programs to get sandboxed. However,
for now, because a Landlock program is only triggered by a seccomp
filter (which does not follow the Landlock programs as an FD), the FDs
alone will be useless.

The FD referring to an arraymap of handles can also be used to update a
map and change the behavior of a Landlock program. A master process can
then add or remove restrictions to another process hierarchy on the fly.

However, I think it would make more sense to use cgroups if we want to
move an existing (unwilling) unsandboxed process into a sandboxed
environment. Of course, some more no_new_privs checks would be needed.
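
The FD hand-off mentioned above is not spelled out in this message. One standard way to pass a file descriptor between processes is SCM_RIGHTS ancillary data on an AF_UNIX socket, and the sketch below assumes that mechanism. send_fd() and recv_fd() are hypothetical helper names, and nothing here is Landlock-specific: the descriptor could refer to any open file.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>	/* callers will want close() for received descriptors */

/* Send one file descriptor over a connected AF_UNIX socket; 0 on success. */
static int send_fd(int sock, int fd)
{
	char byte = 0;	/* at least one byte of real data must accompany the fd */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces correct alignment of buf */
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&msg, 0, sizeof(msg));
	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor; returns the new fd, or -1 on failure. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&msg, 0, sizeof(msg));
	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver gets its own descriptor number referring to the same open file description, which is what would let a second process hold a reference to a program it did not load itself.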




Re: [PATCH 3/5] rxrpc: fix last_call processing

2016-08-28 Thread David Howells

This is fixed by:

commit 2266ffdef5737fdfa96005204fc5606dbd559956
subject: rxrpc: Fix conn-based retransmit

which is in net-next.

David

Re: [RFC v2 09/10] landlock: Handle cgroups

2016-08-28 Thread Andy Lutomirski

On Aug 27, 2016 8:12 PM, "Alexei Starovoitov"
 wrote:
>
> On Sat, Aug 27, 2016 at 12:30:36AM -0700, Andy Lutomirski wrote:
> > > cgroup is the common way to group multiple tasks.
> > > Without cgroup only parent<->child relationship will be possible,
> > > which will limit usability of such lsm to a master task that controls
> > > its children. Such api restriction would have been ok, if we could
> > > extend it in the future, but unfortunately task-centric won't allow it
> > > without creating a parallel lsm that is cgroup based.
> > > Therefore I think we have to go with cgroup-centric api and your
> > > application has to use cgroups from the start though only parent-child
> > > would have been enough.
> > > Also I don't think the kernel can afford two bpf based lsm. One task
> > > based and another cgroup based, so we have to find common ground
> > > that suits both use cases.
> > > Having unprivileged access is a subset. There is no strong reason why
> > > cgroup+lsm+bpf should be limited to root only always.
> > > When we can guarantee no pointer leaks, we can allow unpriv.
> >
> > I don't really understand what you mean.  In the context of landlock,
> > which is a *sandbox*, can one of you explain a use case that
> > materially benefits from this type of cgroup usage?  I haven't thought
> > of one.
>
> In case of seccomp-like sandbox where parent controls child processes
> cgroup is not needed. It's needed when container management software
> needs to control a set of applications. If we can have one bpf-based lsm
> that works via cgroup and without, I'd be fine with it. Right now
> I haven't seen a plausible proposal to do that. Therefore cgroup based
> api is a common api that works for sandbox as well, though requiring
> parent to create a cgroup just to control a single child is cumbersome.
>

I don't believe that a common API can work to accomplish your goal.
For privileged container management, the manager is trusted.  For
unprivileged sandboxing, the manager is emphatically not trusted,
which means you need special rules like NO_NEW_PRIVS, and, unless you
want to start restricting setuid and such in some cgroups, you really
do need a different interface for joining the sandbox than whatever
the container manager is using.

What could make sense is to have one BPF-based LSM that supports both
a seccomp-like unprivileged interface and a cgroup-based privileged
interface.  Most of the code for it is the BPF part anyway -- all that
the cgroup or seccomp part needs to do is to figure out which BPF
program(s) to call.

Also, for container management software, you don't really need
everything tied to cgroup -- you just need a way to cleanly add new
processes to the same security context.
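
To illustrate the "figure out which BPF program(s) to call" part, here is a rough userspace sketch of a single hook dispatcher that consults both a task-attached set and a cgroup-attached set of programs. Everything in it is hypothetical: the hook names, struct prog_set, struct task_sketch and run_hook() are made up for illustration, plain function pointers stand in for BPF programs, and nothing here reflects actual Landlock or cgroup code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hook identifiers; a real LSM has many more. */
enum hook_id { HOOK_FILE_OPEN, HOOK_FILE_PERMISSION, NUM_HOOKS };

/* Stands in for a BPF program: returns 0 to allow, non-zero to deny. */
typedef int (*prog_fn)(const void *ctx);

/* One program slot per hook; NULL means nothing is attached. */
struct prog_set {
	prog_fn prog[NUM_HOOKS];
};

/* Hypothetical per-task view of the two attachment points. */
struct task_sketch {
	const struct prog_set *sandbox;	/* seccomp-like, unprivileged */
	const struct prog_set *cgroup;	/* set by a privileged manager */
};

/* Run every program attached to this hook; the first denial wins. */
static int run_hook(const struct task_sketch *t, enum hook_id id,
		    const void *ctx)
{
	const struct prog_set *sets[2] = { t->sandbox, t->cgroup };
	size_t i;

	assert(t && id < NUM_HOOKS);

	for (i = 0; i < 2; i++) {
		if (sets[i] && sets[i]->prog[id]) {
			int ret = sets[i]->prog[id](ctx);

			if (ret)
				return ret;
		}
	}
	return 0;	/* nothing attached, or everything allowed */
}

/* Example program that denies unconditionally. */
static int deny_prog(const void *ctx)
{
	(void)ctx;
	return -1;
}
```

The shared part is run_hook() and the program representation; the two interfaces differ only in how a prog_set pointer ends up on the task, which is where the different trust rules (NO_NEW_PRIVS and so on) would apply.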

Re: [RFC v2 09/10] landlock: Handle cgroups (performance)

2016-08-28 Thread Andy Lutomirski

On Aug 27, 2016 11:14 PM, "Mickaël Salaün"  wrote:
>
>
> On 27/08/2016 22:43, Alexei Starovoitov wrote:
> > On Sat, Aug 27, 2016 at 09:35:14PM +0200, Mickaël Salaün wrote:
> >> On 27/08/2016 20:06, Alexei Starovoitov wrote:
> >>> On Sat, Aug 27, 2016 at 04:06:38PM +0200, Mickaël Salaün wrote:
> >>>> As said above, Landlock will not run an eBPF program when not strictly
> >>>> needed. Attaching to a cgroup will have the same performance impact as
> >>>> attaching to a process hierarchy.
> >>>
> >>> Having a prog per cgroup per lsm_hook is the only scalable way I
> >>> could come up with. If you see another way, please propose.
> >>> current->seccomp.landlock_prog is not the answer.
> >>
> >> Hum, I don't see the difference from a performance point of view between
> >> a cgroup-based or a process hierarchy-based system.
> >>
> >> Maybe a better option should be to use an array of pointers with N
> >> entries, one for each supported hook, instead of a unique pointer list?
> >
> > yes, clearly array dereference is faster than link list walk.
> > Now the question is where to keep this prog_array[num_lsm_hooks] ?
> > Since we cannot keep it inside task_struct, we have to allocate it.
> > Every time the task is created then. What to do on the fork? That
> > will require changes all over. Then the obvious optimization would be
> > to share this allocated array of prog pointers across multiple tasks...
> > and little by little this new facility will look like cgroup.
> > Hence the suggestion to put this array into cgroup from the start.
>
> I see your point :)
>
> >
> >> Anyway, being able to attach an LSM hook program to a cgroup thanks to
> >> the new BPF_PROG_ATTACH seems a good idea (while keeping the possibility
> >> to use a process hierarchy). The downside will be to handle an LSM hook
> >> program which is not triggered by a seccomp-filter, but this should be
> >> needed anyway to handle interruptions.
> >
> > what do you mean 'not triggered by seccomp' ?
> > You're not suggesting that this lsm has to enable seccomp to be functional?
> > imo that's non starter due to overhead.
>
> Yes, for now, it is triggered by a new seccomp filter return value
> RET_LANDLOCK, which can take a 16-bit value called cookie. This must not
> be needed but could be useful to bind a seccomp filter security policy
> with a Landlock one. Waiting for Kees's point of view…
>

I'm not Kees, but I'd be okay with that.  I still think that doing
this by process hierarchy a la seccomp will be easier to use and to
understand (which is quite important for this kind of work) than doing
it by cgroup.

A feature I've wanted to add for a while is to have an fd that
represents a seccomp layer, the idea being that you would set up your
seccomp layer (with syscall filter, landlock hooks, etc) and then you
would have a syscall to install that layer.  Then an unprivileged
sandbox manager could set up its layer and still be able to inject new
processes into it later on, no cgroups needed.

--Andy
