[PATCH] net: hns: dereference ppe_cb->ppe_common_cb if it is non-null

2016-08-24 Thread Colin King
From: Colin Ian King 

ppe_cb->ppe_common_cb is being dereferenced before a null check is
made on it.  If ppe_cb->ppe_common_cb is null then we end up with a
null pointer dereference when assigning dsaf_dev.  Fix this by moving
the initialisation of dsaf_dev to after the point where we know
ppe_cb->ppe_common_cb is safe to dereference.

Signed-off-by: Colin Ian King 
---
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c 
b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c
index ff8b6a4..6ea8722 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c
@@ -328,9 +328,10 @@ static void hns_ppe_init_hw(struct hns_ppe_cb *ppe_cb)
 static void hns_ppe_uninit_hw(struct hns_ppe_cb *ppe_cb)
 {
u32 port;
-   struct dsaf_device *dsaf_dev = ppe_cb->ppe_common_cb->dsaf_dev;
 
if (ppe_cb->ppe_common_cb) {
+   struct dsaf_device *dsaf_dev = ppe_cb->ppe_common_cb->dsaf_dev;
+
port = ppe_cb->index;
dsaf_dev->misc_op->ppe_srst(dsaf_dev, port, 0);
}
-- 
2.9.3



Re: [PATCH 0/5] Networking cgroup controller

2016-08-24 Thread महेश बंडेवार
On Tue, Aug 23, 2016 at 1:49 AM, Parav Pandit  wrote:
> Hi Anoop,
>
> Regardless of usecase, I think this functionality is best handled as
> LSM functionality instead of cgroup.
>
I'm not so sure about that. Cgroup APIs are useful and this is just an
extension to it.


> Tasks which are proposed in this patch are related to access control checks.
> LSM already has required hooks for socket operations such as bind(),
> listen() as few small examples.
>
> Refer to security_socket_listen() which invokes LSM specific hooks.
> This is invoked in source/net/socket.c as part of listen() system call.
> LSM hook callback can check whether a given a process can listen to
> requested UDP port or not.
>
This has administrative overhead that is not addressed. The underlying
cgroup infrastructure takes care of it in this (current)
implementation.

> Parav
>
>
[...]


[PATCH] softirq: fix tasklet_kill() and its users

2016-08-24 Thread Santosh Shilimkar
Semantically, the expectation from the tasklet init/kill API
should be as below:

tasklet_init() == Init and enable scheduling
tasklet_kill() == Disable scheduling and destroy

tasklet_init() exhibits the above behaviour, but tasklet_kill()
does not: the tasklet handler can still get scheduled and run
even after tasklet_kill().

There are a few places where drivers work around this issue by
calling tasklet_disable(), which adds a usecount and thereby
avoids the handler being called.

tasklet_enable()/tasklet_disable() are a paired API and are
expected to be used together. Using tasklet_disable() *just* to
work around tasklet scheduling after a kill is probably not the
correct and intended use of the API. We also happened to see a
similar issue where, in the shutdown path, the tasklet handler
was getting called even after tasklet_kill().

Fix this by making sure tasklet_kill() does the right thing,
thereby ensuring with a very simple change that the tasklet
handler won't run after tasklet_kill(). The patch fixes the
tasklet code and also removes the workarounds from a few drivers.

Cc: Greg Kroah-Hartman 
Cc: Andrew Morton 
Cc: Thomas Gleixner 
Cc: Tadeusz Struk 
Cc: Herbert Xu 
Cc: "David S. Miller" 
Cc: Paul Bolle 
Cc: Giovanni Cabiddu 
Cc: Salvatore Benedetto 
Cc: Karsten Keil 
Cc: "Peter Zijlstra (Intel)" 

Signed-off-by: Santosh Shilimkar 
---
Removed RFC tag from last post and dropped atmel serial
driver which seems to have been fixed in 4.8

https://lkml.org/lkml/2016/8/7/7

 drivers/crypto/qat/qat_common/adf_isr.c| 1 -
 drivers/crypto/qat/qat_common/adf_sriov.c  | 1 -
 drivers/crypto/qat/qat_common/adf_vf_isr.c | 2 --
 drivers/isdn/gigaset/interface.c   | 1 -
 kernel/softirq.c   | 7 ++++---
 5 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/drivers/crypto/qat/qat_common/adf_isr.c 
b/drivers/crypto/qat/qat_common/adf_isr.c
index 06d4901..fd5e900 100644
--- a/drivers/crypto/qat/qat_common/adf_isr.c
+++ b/drivers/crypto/qat/qat_common/adf_isr.c
@@ -296,7 +296,6 @@ static void adf_cleanup_bh(struct adf_accel_dev *accel_dev)
int i;
 
for (i = 0; i < hw_data->num_banks; i++) {
-   tasklet_disable(&priv_data->banks[i].resp_handler);
tasklet_kill(&priv_data->banks[i].resp_handler);
}
 }
diff --git a/drivers/crypto/qat/qat_common/adf_sriov.c 
b/drivers/crypto/qat/qat_common/adf_sriov.c
index 9320ae1..bc7c2fa 100644
--- a/drivers/crypto/qat/qat_common/adf_sriov.c
+++ b/drivers/crypto/qat/qat_common/adf_sriov.c
@@ -204,7 +204,6 @@ void adf_disable_sriov(struct adf_accel_dev *accel_dev)
}
 
for (i = 0, vf = accel_dev->pf.vf_info; i < totalvfs; i++, vf++) {
-   tasklet_disable(&vf->vf2pf_bh_tasklet);
tasklet_kill(&vf->vf2pf_bh_tasklet);
mutex_destroy(&vf->pf2vf_lock);
}
diff --git a/drivers/crypto/qat/qat_common/adf_vf_isr.c 
b/drivers/crypto/qat/qat_common/adf_vf_isr.c
index bf99e11..6e38bff 100644
--- a/drivers/crypto/qat/qat_common/adf_vf_isr.c
+++ b/drivers/crypto/qat/qat_common/adf_vf_isr.c
@@ -191,7 +191,6 @@ static int adf_setup_pf2vf_bh(struct adf_accel_dev 
*accel_dev)
 
 static void adf_cleanup_pf2vf_bh(struct adf_accel_dev *accel_dev)
 {
-   tasklet_disable(&accel_dev->vf.pf2vf_bh_tasklet);
tasklet_kill(&accel_dev->vf.pf2vf_bh_tasklet);
mutex_destroy(&accel_dev->vf.vf2pf_lock);
 }
@@ -268,7 +267,6 @@ static void adf_cleanup_bh(struct adf_accel_dev *accel_dev)
 {
struct adf_etr_data *priv_data = accel_dev->transport;
 
-   tasklet_disable(&priv_data->banks[0].resp_handler);
tasklet_kill(&priv_data->banks[0].resp_handler);
 }
 
diff --git a/drivers/isdn/gigaset/interface.c b/drivers/isdn/gigaset/interface.c
index 600c79b..2ce63b6 100644
--- a/drivers/isdn/gigaset/interface.c
+++ b/drivers/isdn/gigaset/interface.c
@@ -524,7 +524,6 @@ void gigaset_if_free(struct cardstate *cs)
if (!drv->have_tty)
return;
 
-   tasklet_disable(&cs->if_wake_tasklet);
tasklet_kill(&cs->if_wake_tasklet);
cs->tty_dev = NULL;
tty_unregister_device(drv->tty, cs->minor_index);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b..21397eb 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -498,7 +498,7 @@ static void tasklet_action(struct softirq_action *a)
list = list->next;
 
if (tasklet_trylock(t)) {
-   if (!atomic_read(&t->count)) {
+   if (atomic_read(&t->count) == 1) {
if (!test_and_clear_bit(TASKLET_STATE_SCHED,
&t->state))
BUG();
@@ -534,7 +534,7 @@ static void tasklet_hi_action(struct softirq_action *a)
list = list->next;
 
if (tasklet_trylock(t)) {
-   if (!atomic_read(&t->count)) {
+   if (atomic_read(&t->count) == 1) {

[PATCH net-next 5/6] net: dsa: bcm_sf2: Utilize core B53 driver when possible

2016-08-24 Thread Florian Fainelli
The Broadcom Starfighter2 is almost entirely register compatible with
B53, yet for historical reasons came up first in the tree and is now
being updated to utilize b53_common.c to the fullest extent possible. A
few things need to be adjusted to allow that:

- the switch "core" registers currently operate on a 32-bit address,
  whereas b53 passes a page + reg pair to offset from, so we need to
  convert that; thankfully there is a generic formula to do so

- the link management is not self-contained within the B53/CORE register
  set, but instead is in the SWITCH_REG block, which is part of the
  integration glue logic, so we keep that entirely custom here because
  it really is part of the existing bcm_sf2 implementation

- there are additional power management constraints on the port's
  memories that make us keep the port_enable/disable callbacks custom
  for now; also, we support tagging, whereas b53_common does not support
  that yet

All the VLAN and bridge code is entirely identical though, so avoid
duplicating it. Other things will be migrated in the future, like EEE and
possibly Wake-on-LAN.

Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/Kconfig   |   1 +
 drivers/net/dsa/bcm_sf2.c | 230 --
 drivers/net/dsa/bcm_sf2.h |  11 +++
 3 files changed, 195 insertions(+), 47 deletions(-)

diff --git a/drivers/net/dsa/Kconfig b/drivers/net/dsa/Kconfig
index 8f4544394f44..de6d04429a70 100644
--- a/drivers/net/dsa/Kconfig
+++ b/drivers/net/dsa/Kconfig
@@ -16,6 +16,7 @@ config NET_DSA_BCM_SF2
select FIXED_PHY
select BCM7XXX_PHY
select MDIO_BCM_UNIMAC
+   select B53
---help---
  This enables support for the Broadcom Starfighter 2 Ethernet
  switch chips.
diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index b47a74b37a42..56e898f01c0f 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -29,9 +29,12 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "bcm_sf2.h"
 #include "bcm_sf2_regs.h"
+#include "b53/b53_priv.h"
+#include "b53/b53_regs.h"
 
 /* String, offset, and register size in bytes if different from 4 bytes */
 static const struct bcm_sf2_hw_stats bcm_sf2_mib[] = {
@@ -106,7 +109,7 @@ static void bcm_sf2_sw_get_strings(struct dsa_switch *ds,
 static void bcm_sf2_sw_get_ethtool_stats(struct dsa_switch *ds,
 int port, uint64_t *data)
 {
-   struct bcm_sf2_priv *priv = ds_to_priv(ds);
+   struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
const struct bcm_sf2_hw_stats *s;
unsigned int i;
u64 val = 0;
@@ -143,7 +146,7 @@ static enum dsa_tag_protocol 
bcm_sf2_sw_get_tag_protocol(struct dsa_switch *ds)
 
 static void bcm_sf2_imp_vlan_setup(struct dsa_switch *ds, int cpu_port)
 {
-   struct bcm_sf2_priv *priv = ds_to_priv(ds);
+   struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
unsigned int i;
u32 reg;
 
@@ -163,7 +166,7 @@ static void bcm_sf2_imp_vlan_setup(struct dsa_switch *ds, 
int cpu_port)
 
 static void bcm_sf2_imp_setup(struct dsa_switch *ds, int port)
 {
-   struct bcm_sf2_priv *priv = ds_to_priv(ds);
+   struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
u32 reg, val;
 
/* Enable the port memories */
@@ -228,7 +231,7 @@ static void bcm_sf2_imp_setup(struct dsa_switch *ds, int 
port)
 
 static void bcm_sf2_eee_enable_set(struct dsa_switch *ds, int port, bool 
enable)
 {
-   struct bcm_sf2_priv *priv = ds_to_priv(ds);
+   struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
u32 reg;
 
reg = core_readl(priv, CORE_EEE_EN_CTRL);
@@ -241,7 +244,7 @@ static void bcm_sf2_eee_enable_set(struct dsa_switch *ds, 
int port, bool enable)
 
 static void bcm_sf2_gphy_enable_set(struct dsa_switch *ds, bool enable)
 {
-   struct bcm_sf2_priv *priv = ds_to_priv(ds);
+   struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
u32 reg;
 
reg = reg_readl(priv, REG_SPHY_CNTRL);
@@ -315,7 +318,7 @@ static inline void bcm_sf2_port_intr_disable(struct 
bcm_sf2_priv *priv,
 static int bcm_sf2_port_setup(struct dsa_switch *ds, int port,
  struct phy_device *phy)
 {
-   struct bcm_sf2_priv *priv = ds_to_priv(ds);
+   struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
s8 cpu_port = ds->dst[ds->index].cpu_port;
u32 reg;
 
@@ -371,7 +374,7 @@ static int bcm_sf2_port_setup(struct dsa_switch *ds, int 
port,
 static void bcm_sf2_port_disable(struct dsa_switch *ds, int port,
 struct phy_device *phy)
 {
-   struct bcm_sf2_priv *priv = ds_to_priv(ds);
+   struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
u32 off, reg;
 
if (priv->wol_ports_mask & (1 << port))
@@ -403,7 +406,7 @@ static void bcm_sf2_port_disable(struct dsa_switch *ds, int 
port,
 static int bcm_sf2_eee_init(struct dsa_switch *ds, int port,
    struct phy_device *phy)

[PATCH net-next 3/6] net: dsa: b53: Define SF2 MIB layout

2016-08-24 Thread Florian Fainelli
The 58xx and 7445 chips use the Starfighter2 code, define its MIB layout
and introduce a helper function: is58xx() which checks for both of these
IDs for now.

Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/b53/b53_common.c | 63 
 drivers/net/dsa/b53/b53_priv.h   |  6 
 2 files changed, 69 insertions(+)

diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index 0e6b8125a8ea..e59d799880e4 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -167,6 +167,65 @@ static const struct b53_mib_desc b53_mibs[] = {
 
 #define B53_MIBS_SIZE  ARRAY_SIZE(b53_mibs)
 
+static const struct b53_mib_desc b53_mibs_58xx[] = {
+   { 8, 0x00, "TxOctets" },
+   { 4, 0x08, "TxDropPkts" },
+   { 4, 0x0c, "TxQPKTQ0" },
+   { 4, 0x10, "TxBroadcastPkts" },
+   { 4, 0x14, "TxMulticastPkts" },
+   { 4, 0x18, "TxUnicastPKts" },
+   { 4, 0x1c, "TxCollisions" },
+   { 4, 0x20, "TxSingleCollision" },
+   { 4, 0x24, "TxMultipleCollision" },
+   { 4, 0x28, "TxDeferredCollision" },
+   { 4, 0x2c, "TxLateCollision" },
+   { 4, 0x30, "TxExcessiveCollision" },
+   { 4, 0x34, "TxFrameInDisc" },
+   { 4, 0x38, "TxPausePkts" },
+   { 4, 0x3c, "TxQPKTQ1" },
+   { 4, 0x40, "TxQPKTQ2" },
+   { 4, 0x44, "TxQPKTQ3" },
+   { 4, 0x48, "TxQPKTQ4" },
+   { 4, 0x4c, "TxQPKTQ5" },
+   { 8, 0x50, "RxOctets" },
+   { 4, 0x58, "RxUndersizePkts" },
+   { 4, 0x5c, "RxPausePkts" },
+   { 4, 0x60, "RxPkts64Octets" },
+   { 4, 0x64, "RxPkts65to127Octets" },
+   { 4, 0x68, "RxPkts128to255Octets" },
+   { 4, 0x6c, "RxPkts256to511Octets" },
+   { 4, 0x70, "RxPkts512to1023Octets" },
+   { 4, 0x74, "RxPkts1024toMaxPktsOctets" },
+   { 4, 0x78, "RxOversizePkts" },
+   { 4, 0x7c, "RxJabbers" },
+   { 4, 0x80, "RxAlignmentErrors" },
+   { 4, 0x84, "RxFCSErrors" },
+   { 8, 0x88, "RxGoodOctets" },
+   { 4, 0x90, "RxDropPkts" },
+   { 4, 0x94, "RxUnicastPkts" },
+   { 4, 0x98, "RxMulticastPkts" },
+   { 4, 0x9c, "RxBroadcastPkts" },
+   { 4, 0xa0, "RxSAChanges" },
+   { 4, 0xa4, "RxFragments" },
+   { 4, 0xa8, "RxJumboPkt" },
+   { 4, 0xac, "RxSymblErr" },
+   { 4, 0xb0, "InRangeErrCount" },
+   { 4, 0xb4, "OutRangeErrCount" },
+   { 4, 0xb8, "EEELpiEvent" },
+   { 4, 0xbc, "EEELpiDuration" },
+   { 4, 0xc0, "RxDiscard" },
+   { 4, 0xc8, "TxQPKTQ6" },
+   { 4, 0xcc, "TxQPKTQ7" },
+   { 4, 0xd0, "TxPkts64Octets" },
+   { 4, 0xd4, "TxPkts65to127Octets" },
+   { 4, 0xd8, "TxPkts128to255Octets" },
+   { 4, 0xdc, "TxPkts256to511Ocets" },
+   { 4, 0xe0, "TxPkts512to1023Ocets" },
+   { 4, 0xe4, "TxPkts1024toMaxPktOcets" },
+};
+
+#define B53_MIBS_58XX_SIZE ARRAY_SIZE(b53_mibs_58xx)
+
 static int b53_do_vlan_op(struct b53_device *dev, u8 op)
 {
unsigned int i;
@@ -635,6 +694,8 @@ static const struct b53_mib_desc *b53_get_mib(struct 
b53_device *dev)
return b53_mibs_65;
else if (is63xx(dev))
return b53_mibs_63xx;
+   else if (is58xx(dev))
+   return b53_mibs_58xx;
else
return b53_mibs;
 }
@@ -645,6 +706,8 @@ static unsigned int b53_get_mib_size(struct b53_device *dev)
return B53_MIBS_65_SIZE;
else if (is63xx(dev))
return B53_MIBS_63XX_SIZE;
+   else if (is58xx(dev))
+   return B53_MIBS_58XX_SIZE;
else
return B53_MIBS_SIZE;
 }
diff --git a/drivers/net/dsa/b53/b53_priv.h b/drivers/net/dsa/b53/b53_priv.h
index cf2ff2cbc8ab..76672dae412d 100644
--- a/drivers/net/dsa/b53/b53_priv.h
+++ b/drivers/net/dsa/b53/b53_priv.h
@@ -175,6 +175,12 @@ static inline int is5301x(struct b53_device *dev)
dev->chip_id == BCM53019_DEVICE_ID;
 }
 
+static inline int is58xx(struct b53_device *dev)
+{
+   return dev->chip_id == BCM58XX_DEVICE_ID ||
+   dev->chip_id == BCM7445_DEVICE_ID;
+}
+
 #define B53_CPU_PORT_25    5
 #define B53_CPU_PORT   8
 
-- 
2.7.4



[PATCH net-next 1/6] net: dsa: b53: Initialize ds->drv in b53_switch_alloc

2016-08-24 Thread Florian Fainelli
In order to allow drivers to override specific dsa_switch_driver
callbacks, initialize ds->drv to b53_switch_ops earlier, which avoids
having to expose this structure to glue drivers.

Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/b53/b53_common.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index 65ecb51f99e5..30377ceb1928 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -1602,7 +1602,6 @@ static const struct b53_chip_data b53_switch_chips[] = {
 
 static int b53_switch_init(struct b53_device *dev)
 {
-   struct dsa_switch *ds = dev->ds;
unsigned int i;
int ret;
 
@@ -1618,7 +1617,6 @@ static int b53_switch_init(struct b53_device *dev)
dev->vta_regs[1] = chip->vta_regs[1];
dev->vta_regs[2] = chip->vta_regs[2];
dev->jumbo_pm_reg = chip->jumbo_pm_reg;
-   ds->drv = &b53_switch_ops;
dev->cpu_port = chip->cpu_port;
dev->num_vlans = chip->vlans;
dev->num_arl_entries = chip->arl_entries;
@@ -1706,6 +1704,7 @@ struct b53_device *b53_switch_alloc(struct device *base,
dev->ds = ds;
dev->priv = priv;
dev->ops = ops;
+   ds->drv = &b53_switch_ops;
mutex_init(&dev->reg_mutex);
mutex_init(&dev->stats_mutex);
 
-- 
2.7.4



Re: [net] i40e: Change some init flow for the client

2016-08-24 Thread Jeff Kirsher
On Wed, 2016-08-24 at 17:51 -0700, Jeff Kirsher wrote:
> From: Anjali Singhai Jain 
> 
> This change makes a common flow for Client instance open during init
> and reset path. The Client subtask can handle both the cases instead of
> making a separate notify_client_of_open call.
> Also it may fix a bug during reset where the service task was leaking
> some memory and causing issues.
> 
> Change-Id: I7232a32fd52b82e863abb54266fa83122f80a0cd
> Signed-off-by: Anjali Singhai Jain 
> Tested-by: Andrew Bowers 
> Signed-off-by: Jeff Kirsher 
> ---
>  drivers/net/ethernet/intel/i40e/i40e_client.c | 41 -
> --
>  drivers/net/ethernet/intel/i40e/i40e_main.c   |  1 -
>  2 files changed, 30 insertions(+), 12 deletions(-)

While the original patch description did not call this out clearly, this
patch fixes an issue with the RDMA/iWARP driver i40iw, which would randomly
crash or hang without these changes.

signature.asc
Description: This is a digitally signed message part


Improving OCTEON II 10G Ethernet performance

2016-08-24 Thread Ed Swierk
I'm trying to migrate from the Octeon SDK to a vanilla Linux 4.4
kernel for a Cavium OCTEON II (CN6880) board running in 64-bit
little-endian mode. So far I've gotten most of the hardware features I
need working, including XAUI/RXAUI, USB, boot bus and I2C, with a
fairly small set of patches.
https://github.com/skyportsystems/linux/compare/master...octeon2

The biggest remaining hurdle is improving 10G Ethernet performance:
iperf -P 10 on the SDK kernel gets close to 10 Gbit/sec throughput,
while on my 4.4 kernel, it tops out around 1 Gbit/sec.

Comparing the octeon-ethernet driver in the SDK
(http://git.yoctoproject.org/cgit/cgit.cgi/linux-yocto-contrib/tree/drivers/net/ethernet/octeon?h=apaliwal/octeon)
against the one in 4.4, the latter appears to utilize only a single
CPU core for the rx path. It's not clear to me if there is a similar
issue on the tx side, or other bottlenecks.

I started trying to port multi-CPU rx from the SDK octeon-ethernet
driver, but had trouble teasing out just the necessary bits without
following a maze of dependencies on unrelated functions. (Dragging
major parts of the SDK wholesale into 4.4 defeats the purpose of
switching to a vanilla kernel, and doesn't bring us closer to getting
octeon-ethernet out of staging.)

Has there been any work on the octeon-ethernet driver since this patch
set? https://www.linux-mips.org/archives/linux-mips/2015-08/msg00338.html

Any hints on what to pick out of the SDK code to improve 10G
performance would be appreciated.

--Ed


[PATCH net] veth: sctp: add NETIF_F_SCTP_CRC to device features

2016-08-24 Thread Xin Long
Commit b17c706987fa ("loopback: sctp: add NETIF_F_SCTP_CSUM to device
features") added NETIF_F_SCTP_CRC to device features for lo device to
improve the performance of sctp over lo.

This patch is to add NETIF_F_SCTP_CRC to device features for veth to
improve the performance of sctp over veth.

Before this patch:
  ip netns exec cs_client netperf -H 10.167.12.2 -t SCTP_STREAM -- -m 10K
  Recv   Send    Send
  Socket Socket  Message  Elapsed
  Size   Size    Size     Time     Throughput
  bytes  bytes   bytes    secs.    10^6bits/sec

  212992 212992  10240    10.00    1117.16

After this patch:
  ip netns exec cs_client netperf -H 10.167.12.2 -t SCTP_STREAM -- -m 10K
  Recv   Send    Send
  Socket Socket  Message  Elapsed
  Size   Size    Size     Time     Throughput
  bytes  bytes   bytes    secs.    10^6bits/sec

  212992 212992  10240    10.20    1415.22

Tested-by: Li Shuang 
Signed-off-by: Xin Long 
---
 drivers/net/veth.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index f37a6e6..4bda502 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -313,7 +313,7 @@ static const struct net_device_ops veth_netdev_ops = {
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
-  NETIF_F_RXCSUM | NETIF_F_HIGHDMA | \
+  NETIF_F_RXCSUM | NETIF_F_SCTP_CRC | NETIF_F_HIGHDMA | \
   NETIF_F_GSO_SOFTWARE | NETIF_F_GSO_ENCAP_ALL | \
   NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX | \
   NETIF_F_HW_VLAN_STAG_TX | NETIF_F_HW_VLAN_STAG_RX )
-- 
2.1.0



[PATCH net-next 0/6] net: dsa: Make bcm_sf2 utilize b53_common

2016-08-24 Thread Florian Fainelli
Hi all,

This patch series makes the bcm_sf2 driver utilize a large number of the core
functions offered by the b53_common driver since the SWITCH_CORE registers are
mostly register compatible with the switches driven by b53_common.

In order to accomplish that, we just override the dsa_driver_ops callbacks that
we need to. There is still integration-specific logic from bcm_sf2 that we
cannot absorb into b53_common because it is just not there, mostly in the area
of link management and power management, but most of the features are within
b53_common now: VLAN, FDB, bridge.

Along the process, we also improve support for the BCM58xx SoCs, since those
also have the same version of the switching IP that 7445 has (for which bcm_sf2
was developed).

Florian Fainelli (6):
  net: dsa: b53: Initialize ds->drv in b53_switch_alloc
  net: dsa: b53: Prepare to support 7445 switch
  net: dsa: b53: Define SF2 MIB layout
  net: dsa: b53: Add JOIN_ALL_VLAN support
  net: dsa: bcm_sf2: Utilize core B53 driver when possible
  net: dsa: bcm_sf2: Remove duplicate code

 drivers/net/dsa/Kconfig  |   1 +
 drivers/net/dsa/b53/b53_common.c | 108 -
 drivers/net/dsa/b53/b53_priv.h   |   7 +
 drivers/net/dsa/b53/b53_regs.h   |   3 +
 drivers/net/dsa/bcm_sf2.c| 932 +++
 drivers/net/dsa/bcm_sf2.h|  82 +---
 drivers/net/dsa/bcm_sf2_regs.h   | 122 -
 7 files changed, 288 insertions(+), 967 deletions(-)

-- 
2.7.4



[PATCH net-next 2/6] net: dsa: b53: Prepare to support 7445 switch

2016-08-24 Thread Florian Fainelli
Allocate a device entry for the Broadcom BCM7445 integrated switch
currently backed by bcm_sf2.c. Since this is the latest generation, it
has 4 ARL entries, 4K VLANs and uses Port 8 for the CPU/IMP port.

Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/b53/b53_common.c | 12 
 drivers/net/dsa/b53/b53_priv.h   |  1 +
 2 files changed, 13 insertions(+)

diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index 30377ceb1928..0e6b8125a8ea 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -1598,6 +1598,18 @@ static const struct b53_chip_data b53_switch_chips[] = {
.jumbo_pm_reg = B53_JUMBO_PORT_MASK,
.jumbo_size_reg = B53_JUMBO_MAX_SIZE,
},
+   {
+   .chip_id = BCM7445_DEVICE_ID,
+   .dev_name = "BCM7445",
+   .vlans  = 4096,
+   .enabled_ports = 0x1ff,
+   .arl_entries = 4,
+   .cpu_port = B53_CPU_PORT,
+   .vta_regs = B53_VTA_REGS,
+   .duplex_reg = B53_DUPLEX_STAT_GE,
+   .jumbo_pm_reg = B53_JUMBO_PORT_MASK,
+   .jumbo_size_reg = B53_JUMBO_MAX_SIZE,
+   },
 };
 
 static int b53_switch_init(struct b53_device *dev)
diff --git a/drivers/net/dsa/b53/b53_priv.h b/drivers/net/dsa/b53/b53_priv.h
index d268493a5fec..cf2ff2cbc8ab 100644
--- a/drivers/net/dsa/b53/b53_priv.h
+++ b/drivers/net/dsa/b53/b53_priv.h
@@ -60,6 +60,7 @@ enum {
BCM53018_DEVICE_ID = 0x53018,
BCM53019_DEVICE_ID = 0x53019,
BCM58XX_DEVICE_ID = 0x5800,
+   BCM7445_DEVICE_ID = 0x7445,
 };
 
 #define B53_N_PORTS9
-- 
2.7.4



Re: [PATCH net 2/2] sctp: not copying duplicate addrs to the assoc's bind address list

2016-08-24 Thread Xin Long
> Or add a refcnt to its members. 
> NETDEV_UP, it gets a ++ if it's already there
> NETDEV_DOWN, it gets a -- and cleans it up if it reaches 0
> And the rest probably could stay the same.
>
Yes, that could also avoid the issue of large numbers of duplicate addrs,
as could adding a NIC index variable to its members.

But I still prefer the current patch:

1. This issue only happens when the server binds 'ANY' addresses, and
   this way we don't need to add any new members to struct
   sctp_sockaddr_entry; since it's really a corner issue, we fix it as
   an improvement.

2. There are actually two issues here: the duplicate addrs may come from
   a) different local NICs, or
   b) the same NIC.
   It may be surprising to filter them in the NETDEV_UP/DOWN events.

3. We check it only when sctp really binds the address, just like
   sctp_do_bind.

What do you think?


[PATCH net-next v4 2/3] net: mpls: Fixups for GSO

2016-08-24 Thread David Ahern
As reported by Lennert the MPLS GSO code is failing to properly segment
large packets. There are a couple of problems:

1. the inner protocol is not set so the gso segment functions for inner
   protocol layers are not getting run, and

2. MPLS labels for packets that use the "native" (non-OVS) MPLS code
   are not properly accounted for in mpls_gso_segment.

The MPLS GSO code was added for OVS. It is re-using skb_mac_gso_segment
to call the gso segment functions for the higher layer protocols. That
means skb_mac_gso_segment is called twice -- once with the network
protocol set to MPLS and again with the network protocol set to the
inner protocol.

This patch sets the inner skb protocol addressing item 1 above and sets
the network_header and inner_network_header to mark where the MPLS labels
start and end. The MPLS code in OVS is also updated to set the two
network markers.

From there the MPLS GSO code uses the difference between the network
header and the inner network header to know the size of the MPLS header
that was pushed. It then pulls the MPLS header, resets the mac_len and
protocol for the inner protocol and then calls skb_mac_gso_segment
to segment the skb.

After the inner protocol segmentation is done, the skb protocol
is set to mpls for each segment and the network and mac headers
are restored.

Reported-by: Lennert Buytenhek 
Signed-off-by: David Ahern 
---
 net/mpls/mpls_gso.c   | 40 +---
 net/mpls/mpls_iptunnel.c  |  4 
 net/openvswitch/actions.c |  9 +++--
 3 files changed, 40 insertions(+), 13 deletions(-)

diff --git a/net/mpls/mpls_gso.c b/net/mpls/mpls_gso.c
index 2055e57ed1c3..b4da6d8e8632 100644
--- a/net/mpls/mpls_gso.c
+++ b/net/mpls/mpls_gso.c
@@ -23,32 +23,50 @@ static struct sk_buff *mpls_gso_segment(struct sk_buff *skb,
   netdev_features_t features)
 {
struct sk_buff *segs = ERR_PTR(-EINVAL);
+   u16 mac_offset = skb->mac_header;
netdev_features_t mpls_features;
+   u16 mac_len = skb->mac_len;
__be16 mpls_protocol;
+   unsigned int mpls_hlen;
+
+   skb_reset_network_header(skb);
+   mpls_hlen = skb_inner_network_header(skb) - skb_network_header(skb);
+   if (unlikely(!pskb_may_pull(skb, mpls_hlen)))
+   goto out;
 
/* Setup inner SKB. */
mpls_protocol = skb->protocol;
skb->protocol = skb->inner_protocol;
 
-   /* Push back the mac header that skb_mac_gso_segment() has pulled.
-* It will be re-pulled by the call to skb_mac_gso_segment() below
-*/
-   __skb_push(skb, skb->mac_len);
+   __skb_pull(skb, mpls_hlen);
+
+   skb->mac_len = 0;
+   skb_reset_mac_header(skb);
 
/* Segment inner packet. */
mpls_features = skb->dev->mpls_features & features;
segs = skb_mac_gso_segment(skb, mpls_features);
+   if (IS_ERR_OR_NULL(segs)) {
+   skb_gso_error_unwind(skb, mpls_protocol, mpls_hlen, mac_offset,
+mac_len);
+   goto out;
+   }
+   skb = segs;
+
+   mpls_hlen += mac_len;
+   do {
+   skb->mac_len = mac_len;
+   skb->protocol = mpls_protocol;
 
+   skb_reset_inner_network_header(skb);
 
-   /* Restore outer protocol. */
-   skb->protocol = mpls_protocol;
+   __skb_push(skb, mpls_hlen);
 
-   /* Re-pull the mac header that the call to skb_mac_gso_segment()
-* above pulled.  It will be re-pushed after returning
-* skb_mac_gso_segment(), an indirect caller of this function.
-*/
-   __skb_pull(skb, skb->data - skb_mac_header(skb));
+   skb_reset_mac_header(skb);
+   skb_set_network_header(skb, mac_len);
+   } while ((skb = skb->next));
 
+out:
return segs;
 }
 
diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c
index aed872cc05a6..cf52cf30ac4b 100644
--- a/net/mpls/mpls_iptunnel.c
+++ b/net/mpls/mpls_iptunnel.c
@@ -90,7 +90,11 @@ static int mpls_xmit(struct sk_buff *skb)
if (skb_cow(skb, hh_len + new_header_size))
goto drop;
 
+   skb_set_inner_protocol(skb, skb->protocol);
+   skb_reset_inner_network_header(skb);
+
skb_push(skb, new_header_size);
+
skb_reset_network_header(skb);
 
skb->dev = out_dev;
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 1ecbd7715f6d..ca91fc33f8a9 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -162,10 +162,16 @@ static int push_mpls(struct sk_buff *skb, struct 
sw_flow_key *key,
if (skb_cow_head(skb, MPLS_HLEN) < 0)
return -ENOMEM;
 
+   if (!skb->inner_protocol) {
+   skb_set_inner_network_header(skb, skb->mac_len);
+   skb_set_inner_protocol(skb, skb->protocol);
+   }
+
skb_push(skb, MPLS_HLEN);
memmove(skb_mac_header(skb) - MPLS_HLEN, skb_mac_header(skb),
skb->mac_len);

RE: [RFC PATCH 3/5] bnx2x: Add support for segmentation of tunnels with outer checksums

2016-08-24 Thread Yuval Mintz
> >> This patch assumes that the bnx2x hardware will ignore existing
> >> IPv4/v6 header fields for length and checksum as well as the length
> >> and checksum fields for outer UDP and GRE headers.
> >>
> >> I have no means of testing this as I do not have any bnx2x hardware
> >> but thought I would submit it as an RFC to see if anyone out there
> >> wants to test this and see if this does in fact enable this
> >> functionality allowing us to to segment tunneled frames that have an outer
> checksum.
> >>
> >> Signed-off-by: Alexander Duyck 
> >
> > So it took me some [well, a lot of] time to get to this, but I've
> > finally given it a try.
> > I saw a performance boost with the partial support - Throughput for
> > vxlan tunnels with and without udpcsum were almost identical after
> > this, whereas without this patch the udpcsum prevented GSO and a
> > TCP/IPv4 connection on top of it got roughly half the throughput.
> >
> > However, I did encounter one oddity I couldn't explain - after I
> > disabled tx-udp_tnl-segmentation via ethtool on the base interface, I
> > was left with:
> >tx-gso-partial: on
> >tx-udp_tnl-segmentation: off
> >tx-udp_tnl-csum-segmentation: on
> >
> > When I ran traffic over both vxlan tunnels, the one with the udpcsum
> > was still passing GSO aggregations to the base device to transmit [and
> > the throughput was the same as before], whereas the tunnel without the
> > udpcsum showed only MTU-sized packets reaching the base interface for
> > transmission [which is what I expected]
> >
> > Any idea why that happened?
> 
> So the way they are implemented tx-udp_tnl-segmentation and tx-udp_tnl-
> csum-segmentation are treated as two separate features.
> The kernel currently gives them the same treatment as NETIF_F_TSO and
> NETIF_F_TSO6.  You can disable one and the other still functions.
> 
> Now if you disable tx-gso-partial you should expect to see tx-udp_tnl-csum-
> segmentation be disabled because it is dependent on the partial GSO offload.
> 
> - Alex

O.k., thanks.
Then I'll run some more testing scenarios, and assuming everything works
fine I'll re-send this. Alex - should I put you in the 'From' field?


[PATCH net-next v4 0/3] net: mpls: fragmentation and gso fixes for locally originated traffic

2016-08-24 Thread David Ahern
This series fixes mtu and fragmentation for tunnels using lwtunnel
output redirect, and fixes GSO for MPLS for locally originated traffic
reported by Lennert Buytenhek.

A follow on series will address fragmentation and GSO for forwarded
MPLS traffic. Hardware offload of GSO with MPLS also needs to be
addressed.

Simon: Can you verify this works with OVS for single and multiple
   labels?

v4
- more updates to mpls_gso_segment per Alex's comments (thanks, Alex)
- updates to teaching OVS about marking MPLS labels as the network header
 
v3
- updates to mpls_gso_segment per Alex's comments
- dropped skb->encapsulation = 1 from mpls_xmit per Alex's comment

v2
- consistent use of network_header in skb to fix GSO for MPLS
- update MPLS code in OVS to network_header and inner_network_header


David Ahern (2):
  net: mpls: Fixups for GSO
  net: veth: Set features for MPLS

Roopa Prabhu (1):
  net: lwtunnel: Handle fragmentation

 drivers/net/veth.c|  1 +
 include/net/lwtunnel.h| 44 
 net/core/lwtunnel.c   | 35 +++
 net/ipv4/ip_output.c  |  8 
 net/ipv4/route.c  |  4 +++-
 net/ipv6/ip6_output.c |  8 
 net/ipv6/route.c  |  4 +++-
 net/mpls/mpls_gso.c   | 40 +---
 net/mpls/mpls_iptunnel.c  | 13 +
 net/openvswitch/actions.c |  9 +++--
 10 files changed, 147 insertions(+), 19 deletions(-)

-- 
2.1.4



[PATCH net-next v4 1/3] net: lwtunnel: Handle fragmentation

2016-08-24 Thread David Ahern
From: Roopa Prabhu 

Today mpls iptunnel lwtunnel_output redirect expects the tunnel
output function to handle fragmentation. This is OK, but can be
avoided if we did not do the mpls output redirect too early, i.e. we
could wait until IP fragmentation is done and then call mpls output
for each IP fragment.

To make this work we will need:
1) the lwtunnel state to carry the encap headroom, and
2) to do the redirect to the encap output handler on the IP fragment
(essentially do the output redirect after fragmentation).

This patch adds tunnel headroom in lwtstate to make sure we
account for tunnel data in mtu calculations during fragmentation
and adds new xmit redirect handler to redirect to lwtunnel xmit func
after ip fragmentation.

This includes IPV6 and some mtu fixes and testing from David Ahern.

Signed-off-by: Roopa Prabhu 
Signed-off-by: David Ahern 
---
 include/net/lwtunnel.h   | 44 
 net/core/lwtunnel.c  | 35 +++
 net/ipv4/ip_output.c |  8 
 net/ipv4/route.c |  4 +++-
 net/ipv6/ip6_output.c|  8 
 net/ipv6/route.c |  4 +++-
 net/mpls/mpls_iptunnel.c |  9 +
 7 files changed, 106 insertions(+), 6 deletions(-)

diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h
index e9f116e29c22..ea3f80f58fd6 100644
--- a/include/net/lwtunnel.h
+++ b/include/net/lwtunnel.h
@@ -13,6 +13,13 @@
 /* lw tunnel state flags */
 #define LWTUNNEL_STATE_OUTPUT_REDIRECT BIT(0)
 #define LWTUNNEL_STATE_INPUT_REDIRECT  BIT(1)
+#define LWTUNNEL_STATE_XMIT_REDIRECT   BIT(2)
+
+enum {
+   LWTUNNEL_XMIT_DONE,
+   LWTUNNEL_XMIT_CONTINUE,
+};
+
 
 struct lwtunnel_state {
__u16   type;
@@ -21,6 +28,7 @@ struct lwtunnel_state {
int (*orig_output)(struct net *net, struct sock *sk, struct 
sk_buff *skb);
int (*orig_input)(struct sk_buff *);
int len;
+   __u16   headroom;
__u8data[0];
 };
 
@@ -34,6 +42,7 @@ struct lwtunnel_encap_ops {
  struct lwtunnel_state *lwtstate);
int (*get_encap_size)(struct lwtunnel_state *lwtstate);
int (*cmp_encap)(struct lwtunnel_state *a, struct lwtunnel_state *b);
+   int (*xmit)(struct sk_buff *skb);
 };
 
 #ifdef CONFIG_LWTUNNEL
@@ -75,6 +84,24 @@ static inline bool lwtunnel_input_redirect(struct 
lwtunnel_state *lwtstate)
 
return false;
 }
+
+static inline bool lwtunnel_xmit_redirect(struct lwtunnel_state *lwtstate)
+{
+   if (lwtstate && (lwtstate->flags & LWTUNNEL_STATE_XMIT_REDIRECT))
+   return true;
+
+   return false;
+}
+
+static inline unsigned int lwtunnel_headroom(struct lwtunnel_state *lwtstate,
+unsigned int mtu)
+{
+   if (lwtunnel_xmit_redirect(lwtstate) && lwtstate->headroom < mtu)
+   return lwtstate->headroom;
+
+   return 0;
+}
+
 int lwtunnel_encap_add_ops(const struct lwtunnel_encap_ops *op,
   unsigned int num);
 int lwtunnel_encap_del_ops(const struct lwtunnel_encap_ops *op,
@@ -90,6 +117,7 @@ struct lwtunnel_state *lwtunnel_state_alloc(int hdr_len);
 int lwtunnel_cmp_encap(struct lwtunnel_state *a, struct lwtunnel_state *b);
 int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb);
 int lwtunnel_input(struct sk_buff *skb);
+int lwtunnel_xmit(struct sk_buff *skb);
 
 #else
 
@@ -117,6 +145,17 @@ static inline bool lwtunnel_input_redirect(struct 
lwtunnel_state *lwtstate)
return false;
 }
 
+static inline bool lwtunnel_xmit_redirect(struct lwtunnel_state *lwtstate)
+{
+   return false;
+}
+
+static inline unsigned int lwtunnel_headroom(struct lwtunnel_state *lwtstate,
+unsigned int mtu)
+{
+   return 0;
+}
+
 static inline int lwtunnel_encap_add_ops(const struct lwtunnel_encap_ops *op,
 unsigned int num)
 {
@@ -170,6 +209,11 @@ static inline int lwtunnel_input(struct sk_buff *skb)
return -EOPNOTSUPP;
 }
 
+static inline int lwtunnel_xmit(struct sk_buff *skb)
+{
+   return -EOPNOTSUPP;
+}
+
 #endif /* CONFIG_LWTUNNEL */
 
 #define MODULE_ALIAS_RTNL_LWT(encap_type) MODULE_ALIAS("rtnl-lwt-" 
__stringify(encap_type))
diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index 669ecc9f884e..e5f84c26ba1a 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -251,6 +251,41 @@ int lwtunnel_output(struct net *net, struct sock *sk, 
struct sk_buff *skb)
 }
 EXPORT_SYMBOL(lwtunnel_output);
 
+int lwtunnel_xmit(struct sk_buff *skb)
+{
+   struct dst_entry *dst = skb_dst(skb);
+   const struct lwtunnel_encap_ops *ops;
+   struct lwtunnel_state *lwtstate;
+   int ret = -EINVAL;
+
+   if (!dst)
+   goto drop;
+
+   lwtstate = dst->lwtstate;
+
+   if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
+   lwtstate->type > LWTUNNEL_ENCAP_MAX)

Re: A second case of XPS considerably reducing single-stream performance

2016-08-24 Thread Rick Jones
Also, while it doesn't seem to have the same massive effect on 
throughput, I can also see out-of-order behaviour happening when the 
sending VM is on a node with a ConnectX-3 Pro NIC.  Its driver also 
appears to enable XPS.  I'm not *certain*, but looking at the traces 
it appears that with the ConnectX-3 Pro there is more interleaving of 
the out-of-order traffic than there is with the Skyhawk.  The ConnectX-3 
Pro happens to be in a newer generation server with a newer processor 
than the other systems where I've seen this.


I do not see the out-of-order behaviour when the NIC at the sending end 
is a BCM57840.  It does not appear that the bnx2x driver in the 4.4 
kernel is enabling XPS.


So, it would seem that there are three cases of enabling XPS resulting 
in out-of-order traffic, two of which result in a non-trivial loss of 
performance.


happy benchmarking,

rick jones


Re: [PATCH net-next v2 1/2] net: diag: slightly refactor the inet_diag_bc_audit error checks.

2016-08-24 Thread David Miller
From: Lorenzo Colitti 
Date: Wed, 24 Aug 2016 15:46:25 +0900

> This simplifies the code a bit and also allows inet_diag_bc_audit
> to send to userspace an error that isn't EINVAL.
> 
> Signed-off-by: Lorenzo Colitti 

Applied.


Re: [PATCH net-next v2 2/2] net: diag: allow socket bytecode filters to match socket marks

2016-08-24 Thread David Miller
From: Lorenzo Colitti 
Date: Wed, 24 Aug 2016 15:46:26 +0900

> This allows a privileged process to filter by socket mark when
> dumping sockets via INET_DIAG_BY_FAMILY. This is useful on
> systems that use mark-based routing such as Android.
> 
> The ability to filter socket marks requires CAP_NET_ADMIN, which
> is consistent with other privileged operations allowed by the
> SOCK_DIAG interface such as the ability to destroy sockets and
> the ability to inspect BPF filters attached to packet sockets.
> 
> Tested: https://android-review.googlesource.com/261350
> Signed-off-by: Lorenzo Colitti 

Applied.


Re: [PATCH for-next 0/2] {IB,net}/hns: Add support of ACPI to the Hisilicon RoCE Driver

2016-08-24 Thread David Miller
From: Salil Mehta 
Date: Wed, 24 Aug 2016 04:44:48 +0800

> This patch is meant to add support of ACPI to the Hisilicon RoCE driver.
> Following changes have been made in the driver(s):
> 
> Patch 1/2: HNS Ethernet Driver: changes to support ACPI have been done in
>the RoCE reset function part of the HNS ethernet driver. Earlier it only
>supported DT/syscon.
> 
> Patch 2/2. HNS RoCE driver: changes done in RoCE driver are meant to detect
>the type and then either use DT specific or ACPI spcific functions. Where
>ever possible, this patch tries to make use of "Unified Device Property
>Interface" APIs to support both DT and ACPI through single interface.
> 
> NOTE 1: ACPI changes done in both of the drivers depend upon the ACPI Table
>  (DSDT and IORT tables) changes part of UEFI/BIOS. These changes are NOT
>  part of this patch-set.
> NOTE 2: Reset function in Patch 1/2 depends upon the reset function added in
>  ACPI tables(basically DSDT table) part of the UEFI/BIOS. Again, this
>  change is NOT reflected in this patch-set.

I can't apply this series to my tree because the hns infiniband driver
doesn't exist in it.


Re: [PATCH net-next 2/3] net: mpls: Fixups for GSO

2016-08-24 Thread David Ahern
On 8/24/16 12:53 PM, David Ahern wrote:
> What change is needed in pop_mpls? It already resets the mac_header and if 
> MPLS labels are removed there is no need to set network_header. I take it you 
> mean if the protocol is still MPLS and there are still labels then the 
> network header needs to be set and that means finding the bottom label. Does 
> OVS set the bottom of stack bit? From what I can tell OVS is not parsing the 
> MPLS label so no requirement that BOS is set. Without that there is no way to 
> tell when the labels are done short of guessing.

I was confusing the inner network layer with the mpls network header. Just sent 
a v4. can you verify it works for single and multiple labels with OVS? 


[PATCH net-next v4 3/3] net: veth: Set features for MPLS

2016-08-24 Thread David Ahern
veth does not really transmit packets, it only moves the skb from one
netdev to another, so GSO and checksumming are not really needed. Add
the features to mpls_features to get the same benefit and performance
with MPLS as without it.

Reported-by: Lennert Buytenhek 
Signed-off-by: David Ahern 
---
 drivers/net/veth.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index f37a6e61d4ad..5db320a4d5cf 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -340,6 +340,7 @@ static void veth_setup(struct net_device *dev)
 
dev->hw_features = VETH_FEATURES;
dev->hw_enc_features = VETH_FEATURES;
+   dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
 }
 
 /*
-- 
2.1.4



[PATCH v1 1/1 net-next] 8139cp: Fix one possible deadloop in cp_rx_poll

2016-08-24 Thread fgao
From: Gao Feng 

When cp_rx_poll does not get enough packets, it checks the rx
interrupt status again. If it is still set, it jumps back to
rx_status_loop. But the goto also resets the rx variable to zero.

As a result, this can cause an endless loop. Assume this case:
rx_status_loop only gets a packet count which is less than the budget,
and the (cpr16(IntrStatus) & cp_rx_intr_mask) condition is always true.
Then the endless loop occurs and the system is blocked.

Signed-off-by: Gao Feng 
---
 drivers/net/ethernet/realtek/8139cp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/realtek/8139cp.c 
b/drivers/net/ethernet/realtek/8139cp.c
index deae10d..5297bf7 100644
--- a/drivers/net/ethernet/realtek/8139cp.c
+++ b/drivers/net/ethernet/realtek/8139cp.c
@@ -467,8 +467,8 @@ static int cp_rx_poll(struct napi_struct *napi, int budget)
unsigned int rx_tail = cp->rx_tail;
int rx;
 
-rx_status_loop:
rx = 0;
+rx_status_loop:
cpw16(IntrStatus, cp_rx_intr_mask);
 
while (rx < budget) {
-- 
1.9.1




Re: [PATCH net-next] net: dsa: rename switch operations structure

2016-08-24 Thread David Miller
From: Vivien Didelot 
Date: Tue, 23 Aug 2016 12:38:56 -0400

> Now that the dsa_switch_driver structure contains only function pointers
> as it is supposed to, rename it to the more appropriate dsa_switch_ops,
> uniformly to any other operations structure in the kernel.
> 
> No functional changes here, basically just the result of something like:
> s/dsa_switch_driver *drv/dsa_switch_ops *ops/g
> 
> However keep the {un,}register_switch_driver functions and their
> dsa_switch_drivers list as is, since they represent the -- likely to be
> deprecated soon -- legacy DSA registration framework.
> 
> In the meantime, also fix the following checks from checkpatch.pl to
> make it happy with this patch:
 ...
> Signed-off-by: Vivien Didelot 

Applied, thanks Vivien.


[PATCH net-next v3 1/2] net: ethernet: mediatek: modify to use the PDMA instead of the QDMA for Ethernet RX

2016-08-24 Thread Nelson Chang
Because the PDMA has richer features than the QDMA for Ethernet RX
(such as multiple RX rings, HW LRO, etc.), the patch modifies the
driver to use the PDMA to handle Ethernet RX.

Signed-off-by: Nelson Chang 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 76 +
 drivers/net/ethernet/mediatek/mtk_eth_soc.h | 31 +++-
 2 files changed, 74 insertions(+), 33 deletions(-)
 mode change 100644 => 100755 drivers/net/ethernet/mediatek/mtk_eth_soc.c
 mode change 100644 => 100755 drivers/net/ethernet/mediatek/mtk_eth_soc.h

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
old mode 100644
new mode 100755
index 1801fd8..cbeb793
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -342,25 +342,27 @@ static void mtk_mdio_cleanup(struct mtk_eth *eth)
mdiobus_free(eth->mii_bus);
 }
 
-static inline void mtk_irq_disable(struct mtk_eth *eth, u32 mask)
+static inline void mtk_irq_disable(struct mtk_eth *eth,
+  unsigned reg, u32 mask)
 {
unsigned long flags;
u32 val;
 
	spin_lock_irqsave(&eth->irq_lock, flags);
-   val = mtk_r32(eth, MTK_QDMA_INT_MASK);
-   mtk_w32(eth, val & ~mask, MTK_QDMA_INT_MASK);
+   val = mtk_r32(eth, reg);
+   mtk_w32(eth, val & ~mask, reg);
	spin_unlock_irqrestore(&eth->irq_lock, flags);
 }
 
-static inline void mtk_irq_enable(struct mtk_eth *eth, u32 mask)
+static inline void mtk_irq_enable(struct mtk_eth *eth,
+ unsigned reg, u32 mask)
 {
unsigned long flags;
u32 val;
 
	spin_lock_irqsave(&eth->irq_lock, flags);
-   val = mtk_r32(eth, MTK_QDMA_INT_MASK);
-   mtk_w32(eth, val | mask, MTK_QDMA_INT_MASK);
+   val = mtk_r32(eth, reg);
+   mtk_w32(eth, val | mask, reg);
	spin_unlock_irqrestore(&eth->irq_lock, flags);
 }
 
@@ -897,12 +899,12 @@ release_desc:
 * we continue
 */
wmb();
-   mtk_w32(eth, ring->calc_idx, MTK_QRX_CRX_IDX0);
+   mtk_w32(eth, ring->calc_idx, MTK_PRX_CRX_IDX0);
done++;
}
 
if (done < budget)
-   mtk_w32(eth, MTK_RX_DONE_INT, MTK_QMTK_INT_STATUS);
+   mtk_w32(eth, MTK_RX_DONE_INT, MTK_PDMA_INT_STATUS);
 
return done;
 }
@@ -1012,7 +1014,7 @@ static int mtk_napi_tx(struct napi_struct *napi, int 
budget)
return budget;
 
napi_complete(napi);
-   mtk_irq_enable(eth, MTK_TX_DONE_INT);
+   mtk_irq_enable(eth, MTK_QDMA_INT_MASK, MTK_TX_DONE_INT);
 
return tx_done;
 }
@@ -1024,12 +1026,12 @@ static int mtk_napi_rx(struct napi_struct *napi, int 
budget)
int rx_done = 0;
 
mtk_handle_status_irq(eth);
-   mtk_w32(eth, MTK_RX_DONE_INT, MTK_QMTK_INT_STATUS);
+   mtk_w32(eth, MTK_RX_DONE_INT, MTK_PDMA_INT_STATUS);
rx_done = mtk_poll_rx(napi, budget, eth);
 
if (unlikely(netif_msg_intr(eth))) {
-   status = mtk_r32(eth, MTK_QMTK_INT_STATUS);
-   mask = mtk_r32(eth, MTK_QDMA_INT_MASK);
+   status = mtk_r32(eth, MTK_PDMA_INT_STATUS);
+   mask = mtk_r32(eth, MTK_PDMA_INT_MASK);
dev_info(eth->dev,
 "done rx %d, intr 0x%08x/0x%x\n",
 rx_done, status, mask);
@@ -1038,12 +1040,12 @@ static int mtk_napi_rx(struct napi_struct *napi, int 
budget)
if (rx_done == budget)
return budget;
 
-   status = mtk_r32(eth, MTK_QMTK_INT_STATUS);
+   status = mtk_r32(eth, MTK_PDMA_INT_STATUS);
if (status & MTK_RX_DONE_INT)
return budget;
 
napi_complete(napi);
-   mtk_irq_enable(eth, MTK_RX_DONE_INT);
+   mtk_irq_enable(eth, MTK_PDMA_INT_MASK, MTK_RX_DONE_INT);
 
return rx_done;
 }
@@ -1092,6 +1094,7 @@ static int mtk_tx_alloc(struct mtk_eth *eth)
mtk_w32(eth,
ring->phys + ((MTK_DMA_SIZE - 1) * sz),
MTK_QTX_DRX_PTR);
+   mtk_w32(eth, (QDMA_RES_THRES << 8) | QDMA_RES_THRES, MTK_QTX_CFG(0));
 
return 0;
 
@@ -1162,11 +1165,10 @@ static int mtk_rx_alloc(struct mtk_eth *eth)
 */
wmb();
 
-   mtk_w32(eth, eth->rx_ring.phys, MTK_QRX_BASE_PTR0);
-   mtk_w32(eth, MTK_DMA_SIZE, MTK_QRX_MAX_CNT0);
-   mtk_w32(eth, eth->rx_ring.calc_idx, MTK_QRX_CRX_IDX0);
-   mtk_w32(eth, MTK_PST_DRX_IDX0, MTK_QDMA_RST_IDX);
-   mtk_w32(eth, (QDMA_RES_THRES << 8) | QDMA_RES_THRES, MTK_QTX_CFG(0));
+   mtk_w32(eth, eth->rx_ring.phys, MTK_PRX_BASE_PTR0);
+   mtk_w32(eth, MTK_DMA_SIZE, MTK_PRX_MAX_CNT0);
+   mtk_w32(eth, eth->rx_ring.calc_idx, MTK_PRX_CRX_IDX0);
+   mtk_w32(eth, MTK_PST_DRX_IDX0, MTK_PDMA_RST_IDX);
 
return 0;
 }
@@ -1285,7 +1287,7 @@ static irqreturn_t mtk_handle_irq_rx(int irq, void *_eth)
 
	if (likely(napi_schedule_prep(&eth->rx_napi)

[PATCH net-next v3 0/2] net: ethernet: mediatek: modify to use the PDMA for Ethernet RX

2016-08-24 Thread Nelson Chang
This patch set fixes the following issues

v1 -> v2: Fix the bugs of PDMA cpu index and interrupt settings in mtk_poll_rx()

v2 -> v3: Add GDM hardware settings to send packets to PDMA for RX

Nelson Chang (2):
  net: ethernet: mediatek: modify to use the PDMA instead of the QDMA
for Ethernet RX
  net: ethernet: mediatek: modify GDM to send packets to the PDMA for RX

 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 80 +
 drivers/net/ethernet/mediatek/mtk_eth_soc.h | 31 ++-
 2 files changed, 76 insertions(+), 35 deletions(-)

-- 
1.9.1



[net] i40e: Change some init flow for the client

2016-08-24 Thread Jeff Kirsher
From: Anjali Singhai Jain 

This change makes a common flow for Client instance open during init
and reset path. The Client subtask can handle both the cases instead of
making a separate notify_client_of_open call.
Also it may fix a bug during reset where the service task was leaking
some memory and causing issues.

Change-Id: I7232a32fd52b82e863abb54266fa83122f80a0cd
Signed-off-by: Anjali Singhai Jain 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_client.c | 41 ---
 drivers/net/ethernet/intel/i40e/i40e_main.c   |  1 -
 2 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_client.c 
b/drivers/net/ethernet/intel/i40e/i40e_client.c
index e1370c5..618f184 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_client.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_client.c
@@ -199,6 +199,7 @@ void i40e_notify_client_of_l2_param_changes(struct i40e_vsi 
*vsi)
 void i40e_notify_client_of_netdev_open(struct i40e_vsi *vsi)
 {
struct i40e_client_instance *cdev;
+   int ret = 0;
 
if (!vsi)
return;
@@ -211,7 +212,14 @@ void i40e_notify_client_of_netdev_open(struct i40e_vsi 
*vsi)
"Cannot locate client instance open 
routine\n");
continue;
}
-   cdev->client->ops->open(&cdev->lan_info, cdev->client);
+   if (!(test_bit(__I40E_CLIENT_INSTANCE_OPENED,
+  &cdev->state))) {
+   ret = cdev->client->ops->open(&cdev->lan_info,
+ cdev->client);
+   if (!ret)
+   set_bit(__I40E_CLIENT_INSTANCE_OPENED,
+   &cdev->state);
+   }
}
}
mutex_unlock(&i40e_client_instance_mutex);
@@ -407,12 +415,14 @@ struct i40e_vsi *i40e_vsi_lookup(struct i40e_pf *pf,
  * i40e_client_add_instance - add a client instance struct to the instance list
  * @pf: pointer to the board struct
  * @client: pointer to a client struct in the client list.
+ * @existing: if there was already an existing instance
  *
- * Returns cdev ptr on success, NULL on failure
+ * Returns cdev ptr on success or if already exists, NULL on failure
  **/
 static
 struct i40e_client_instance *i40e_client_add_instance(struct i40e_pf *pf,
- struct i40e_client 
*client)
+struct i40e_client *client,
+bool *existing)
 {
struct i40e_client_instance *cdev;
struct netdev_hw_addr *mac = NULL;
@@ -421,7 +431,7 @@ struct i40e_client_instance 
*i40e_client_add_instance(struct i40e_pf *pf,
mutex_lock(&i40e_client_instance_mutex);
list_for_each_entry(cdev, &i40e_client_instances, list) {
if ((cdev->lan_info.pf == pf) && (cdev->client == client)) {
-   cdev = NULL;
+   *existing = true;
goto out;
}
}
@@ -505,6 +515,7 @@ void i40e_client_subtask(struct i40e_pf *pf)
 {
struct i40e_client_instance *cdev;
struct i40e_client *client;
+   bool existing = false;
int ret = 0;
 
if (!(pf->flags & I40E_FLAG_SERVICE_CLIENT_REQUESTED))
@@ -528,18 +539,25 @@ void i40e_client_subtask(struct i40e_pf *pf)
/* check if L2 VSI is up, if not we are not ready */
if (test_bit(__I40E_DOWN, &pf->vsi[pf->lan_vsi]->state))
continue;
+   } else {
+   dev_warn(&pf->pdev->dev, "This client %s is being 
instanciated at probe\n",
+client->name);
}
 
/* Add the client instance to the instance list */
-   cdev = i40e_client_add_instance(pf, client);
+   cdev = i40e_client_add_instance(pf, client, &existing);
if (!cdev)
continue;
 
-   /* Also up the ref_cnt of no. of instances of this client */
-   atomic_inc(&client->ref_cnt);
-   dev_info(&pf->pdev->dev, "Added instance of Client %s to PF%d 
bus=0x%02x func=0x%02x\n",
-client->name, pf->hw.pf_id,
-pf->hw.bus.device, pf->hw.bus.func);
+   if (!existing) {
+   /* Also up the ref_cnt for no. of instances of this
+* client.
+*/
+   atomic_inc(&client->ref_cnt);
+   dev_info(&pf->pdev->dev, "Added instance of Client %s 
to PF%d bus=0x%02x func=0x%02x\n",
+  

Re: [PATCH net-next 2/3] net: mpls: Fixups for GSO

2016-08-24 Thread pravin shelar
On Wed, Aug 24, 2016 at 11:53 AM, David Ahern  wrote:
> On 8/24/16 11:41 AM, pravin shelar wrote:
>> You also need to change pop_mpls().
>
> What change is needed in pop_mpls? It already resets the mac_header and if 
> MPLS labels are removed there is no need to set network_header. I take it you 
> mean if the protocol is still MPLS and there are still labels then the 
> network header needs to be set and that means finding the bottom label. Does 
> OVS set the bottom of stack bit? From what I can tell OVS is not parsing the 
> MPLS label so no requirement that BOS is set. Without that there is no way to 
> tell when the labels are done short of guessing.
>

OVS mpls push and pop action works on outer most mpls label. So
according to new mpls offsets tracking scheme on mpls_pop action you
need to adjust skb network offset.


Re: [PATCH net-next 1/6] net: dsa: b53: Initialize ds->drv in b53_switch_alloc

2016-08-24 Thread Florian Fainelli
On 24/08/2016 at 18:33, Florian Fainelli wrote:
> In order to allow drivers to override specific dsa_switch_driver
> callbacks, initialize ds->drv to b53_switch_ops earlier, which avoids
> having to expose this structure to glue drivers.
> 
> Signed-off-by: Florian Fainelli 

This will need some refactoring after Vivien's "net: dsa: rename switch
operations structure" patch.

> ---
>  drivers/net/dsa/b53/b53_common.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/net/dsa/b53/b53_common.c 
> b/drivers/net/dsa/b53/b53_common.c
> index 65ecb51f99e5..30377ceb1928 100644
> --- a/drivers/net/dsa/b53/b53_common.c
> +++ b/drivers/net/dsa/b53/b53_common.c
> @@ -1602,7 +1602,6 @@ static const struct b53_chip_data b53_switch_chips[] = {
>  
>  static int b53_switch_init(struct b53_device *dev)
>  {
> - struct dsa_switch *ds = dev->ds;
>   unsigned int i;
>   int ret;
>  
> @@ -1618,7 +1617,6 @@ static int b53_switch_init(struct b53_device *dev)
>   dev->vta_regs[1] = chip->vta_regs[1];
>   dev->vta_regs[2] = chip->vta_regs[2];
>   dev->jumbo_pm_reg = chip->jumbo_pm_reg;
> - ds->drv = &b53_switch_ops;
>   dev->cpu_port = chip->cpu_port;
>   dev->num_vlans = chip->vlans;
>   dev->num_arl_entries = chip->arl_entries;
> @@ -1706,6 +1704,7 @@ struct b53_device *b53_switch_alloc(struct device *base,
>   dev->ds = ds;
>   dev->priv = priv;
>   dev->ops = ops;
> + ds->drv = &b53_switch_ops;
>   mutex_init(&dev->reg_mutex);
>   mutex_init(&dev->stats_mutex);
>  
> 


-- 
Florian


Re: kernel BUG at net/unix/garbage.c:149!"

2016-08-24 Thread Nikolay Borisov
On Thu, Aug 25, 2016 at 12:40 AM, Hannes Frederic Sowa
 wrote:
> On 24.08.2016 16:24, Nikolay Borisov wrote:
[SNIP]
>
> One commit which could have to do with that is
>
> commit fc64869c48494a401b1fb627c9ecc4e6c1d74b0d
> Author: Andrey Ryabinin 
> Date:   Wed May 18 19:19:27 2016 +0300
>
> net: sock: move ->sk_shutdown out of bitfields.
>
> but that is only a wild guess.
>
> Which unix_sock did you extract specifically in the url you provided? In
> unix_notinflight we are specifically checking an unix domain socket that
> is itself being transferred over another af_unix domain socket and not
> the unix domain socket being released at this point.

So this is the state of the socket that is being passed to
unix_notinflight. I have a complete crashdump so if you need more info
to diagnose it I'm happy to provide it. I'm not too familiar with the
code in question so I will need a bit of time to grasp what actually
is happening.

>
> Can you reproduce this and maybe also with a newer kernel?

Unfortunately I cannot reproduce this since it happened on a
production server, nor can I change the kernel. But clearly there is
something wrong, and given that this is a stable kernel and no
relevant changes have gone into the latest stable, I believe the problem
(albeit hardly reproducible) would still persist.

>
> Thanks for the report,
> Hannes
>


[PATCH v2] net: macb: Increase DMA TX buffer size

2016-08-24 Thread Xander Huff
From: Nathan Sullivan 

In recent testing with the RT patchset, we have seen cases where the
transmit ring can fill even with up to 200 txbds in the ring. Increase the
size of the DMA TX ring to avoid overruns.

Signed-off-by: Xander Huff 
Signed-off-by: Nathan Sullivan 
---
 drivers/net/ethernet/cadence/macb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/cadence/macb.c 
b/drivers/net/ethernet/cadence/macb.c
index 3256839..3efddb7 100644
--- a/drivers/net/ethernet/cadence/macb.c
+++ b/drivers/net/ethernet/cadence/macb.c
@@ -40,7 +40,7 @@
 #define RX_RING_SIZE   512 /* must be power of 2 */
 #define RX_RING_BYTES  (sizeof(struct macb_dma_desc) * RX_RING_SIZE)
 
-#define TX_RING_SIZE   128 /* must be power of 2 */
+#define TX_RING_SIZE   512 /* must be power of 2 */
 #define TX_RING_BYTES  (sizeof(struct macb_dma_desc) * TX_RING_SIZE)
 
 /* level of occupied TX descriptors under which we wake up TX process */
-- 
1.9.1



Re: [PATCH v2 2/6] cgroup: add support for eBPF programs

2016-08-24 Thread Daniel Mack
Hi Tejun,

On 08/24/2016 11:54 PM, Tejun Heo wrote:
> On Wed, Aug 24, 2016 at 10:24:19PM +0200, Daniel Mack wrote:
>> +void cgroup_bpf_free(struct cgroup *cgrp)
>> +{
>> +unsigned int type;
>> +
>> +rcu_read_lock();
>> +
>> +for (type = 0; type < __MAX_BPF_ATTACH_TYPE; type++) {
>> +if (!cgrp->bpf.prog[type])
>> +continue;
>> +
>> +bpf_prog_put(cgrp->bpf.prog[type]);
>> +static_branch_dec(&cgroup_bpf_enabled_key);
>> +}
>> +
>> +rcu_read_unlock();
> 
> These rcu locking seem suspicious to me.  RCU locking on writer side
> is usually bogus.  We sometimes do it to work around locking
> assertions in accessors but it's a better idea to make the assertions
> better in those cases - e.g. sth like assert_mylock_or_rcu_locked().

Right, in this case, it is unnecessary, as the bpf.prog[] is not under RCU.

>> +void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent)
>> +{
>> +unsigned int type;
>> +
>> +rcu_read_lock();
> 
> Ditto.
> 
>> +for (type = 0; type < __MAX_BPF_ATTACH_TYPE; type++)
>> +rcu_assign_pointer(cgrp->bpf.prog_effective[type],
>> +rcu_dereference(parent->bpf.prog_effective[type]));

Okay, yes. We're under cgroup_mutex write-path protection here, so
that's unnecessary too.

>> +void __cgroup_bpf_update(struct cgroup *cgrp,
>> + struct cgroup *parent,
>> + struct bpf_prog *prog,
>> + enum bpf_attach_type type)
>> +{
>> +struct bpf_prog *old_prog, *effective;
>> +struct cgroup_subsys_state *pos;
>> +
>> +rcu_read_lock();
> 
> Ditto.

Yes, agreed, as above.

>> +old_prog = xchg(cgrp->bpf.prog + type, prog);
>> +if (old_prog) {
>> +bpf_prog_put(old_prog);
>> +static_branch_dec(&cgroup_bpf_enabled_key);
>> +}
>> +
>> +if (prog)
>> +static_branch_inc(&cgroup_bpf_enabled_key);
> 
> Minor but probably better to inc first and then dec so that you can
> avoid unnecessary enabled -> disabled -> enabled sequence.

Good point. Will fix.

>> +rcu_read_unlock();
>> +
>> +css_for_each_descendant_pre(pos, &cgrp->self) {
> 
> On the other hand, this walk actually requires rcu read locking unless
> you're holding cgroup_mutex.

I am - this function is always called with cgroup_mutex held through the
wrapper in kernel/cgroup.c.

Thanks a lot - will put all that changes in v3.


Daniel


Continue a discussion about the netlink interface

2016-08-24 Thread Andrei Vagin
Hello,

I want to return to the discussion about the netlink interface and how
to use it outside the network subsystem.

I'm developing a new interface to get information about processes
(task_diag). task_diag is like socket_diag but for processes. [0]

In the first two versions [1] [2], I used the netlink interface to
communicate with kernel. There was a discussion [4], that the netlink
interface is not suitable for this task and it has a few known issues
about security, so probably it should not be used for task_diag.

Then, in a third version [3], I used a proc transaction file
instead of the netlink interface. But it was not accepted either, because
we already have the netlink interface [5] and it's a bad idea to add one
more similar, less generic interface.

Then Andy Lutomirski suggested reworking netlink [6], but nobody
responded to his suggestion.

Can we continue this discussion and find a final solution?

Maybe we need to schedule a face-to-face meeting on one of conferences?
It may be Linux Plumbers, for example.

Here is Andy's idea of how the netlink interface could be reworked:

On Wed, May 04, 2016 at 08:39:51PM -0700, Andy Lutomirski wrote:
> Netlink had, and possibly still has, tons of serious security bugs
> involving code checking send() callers' creds.  I found and fixed a
> few a couple years ago.  To reiterate once again, send() CANNOT use
> caller creds safely.  (I feel like I say this once every few weeks.
> It's getting old.)
>
> I realize that it's convenient to use a socket as a context to keep
> state between syscalls, but it has some annoying side effects:
>
>  - It makes people want to rely on send()'s caller's creds.
>
>  - It's miserable in combination with seccomp.
>
>  - It doesn't play nicely with namespaces.
>
>  - It makes me wonder why things like task_diag, which have nothing to
> do with networking, seem to get tangled up with networking.
>
>
> Would it be worth considering adding a parallel interface, using it
> for new things, and slowly migrating old use cases over?
>
> int issue_kernel_command(int ns, int command, const struct iovec *iov,
> int iovcnt, int flags);
>
> ns is an actual namespace fd or:
>
> KERNEL_COMMAND_CURRENT_NETNS
> KERNEL_COMMAND_CURRENT_PIDNS
> etc, or a special one:
> KERNEL_COMMAND_GLOBAL.  KERNEL_COMMAND_GLOBAL can't be used in a
> non-root namespace.
>
> KERNEL_COMMAND_GLOBAL works even for namespaced things, if the
> relevant current ns is the init namespace.  (This feature is optional,
> but it would allow gradually namespacing global things.)
> command is an enumerated command.  Each command implies a namespace
> type, and, if you feed this thing the wrong namespace type, you get
> EINVAL.  The high bit of command indicates whether it's read-only
> command.
>
> iov gives a command in the format expected, which, for the most part,
> would be a netlink message.
>
> The return value is an fd that you can call read/readv on to read the
> response.  It's not a socket (or at least you can't do normal socket
> operations on it if it is a socket behind the scenes).  The
> implementation of read() promises *not* to look at caller creds.  The
> returned fd is unconditionally cloexec -- it's 2016 already.  Sheesh.
>
> When you've read all the data, all you can do is close the fd.  You
> can't issue another command on the same fd.  You also can't call
> write() or send() on the fd unless someone has a good reason why you
> should be able to and why it's safe.
>
>
> I imagine that the implementation could re-use a bunch of netlink code
> under the hood.

[6] https://www.mail-archive.com/netdev@vger.kernel.org/msg109212.html
[5] https://lkml.org/lkml/2016/5/4/785
[4] https://lkml.org/lkml/2015/7/6/708
[3] https://lwn.net/Articles/683371/
[2] https://lkml.org/lkml/2015/7/6/142
[1] https://lwn.net/Articles/633622/
[0] https://criu.org/Task-diag

Thanks,
Andrei


Re: [PATCH v2 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

2016-08-24 Thread Tejun Heo
Hello,

On Wed, Aug 24, 2016 at 10:24:20PM +0200, Daniel Mack wrote:
>  SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, 
> size)
>  {
>   union bpf_attr attr = {};
> @@ -888,6 +957,16 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, 
> uattr, unsigned int, siz
>   case BPF_OBJ_GET:
>   err = bpf_obj_get(&attr);
>   break;
> +
> +#ifdef CONFIG_CGROUP_BPF
> + case BPF_PROG_ATTACH:
> + err = bpf_prog_attach(&attr);
> + break;
> + case BPF_PROG_DETACH:
> + err = bpf_prog_detach(&attr);
> + break;
> +#endif

So, this is one thing I haven't realized while pushing for "just embed
it in cgroup".  Breaking it out to a separate controller allows using
its own locking instead of having to piggyback on cgroup_mutex.  That
said, as long as cgroup_mutex is not nested inside some inner mutex,
this shouldn't be a problem.  I still think the embedding is fine and
whether we make it an implicit controller or not doesn't affect
userland API at all, so it's an implementation detail that we can
change later if necessary.

Thanks.

-- 
tejun


[PATCH v2] Revert "phy: IRQ cannot be shared"

2016-08-24 Thread Xander Huff
This reverts:
  commit 33c133cc7598 ("phy: IRQ cannot be shared")

On hardware with multiple PHY devices hooked up to the same IRQ line, allow
them to share it.

Sergei Shtylyov says:
  "I'm not sure now what was the reason I concluded that the IRQ sharing
  was impossible... most probably I thought that the kernel IRQ handling
  code exited the loop over the IRQ actions once IRQ_HANDLED was returned
  -- which is obviously not so in reality..."

Signed-off-by: Xander Huff 
Signed-off-by: Nathan Sullivan 
---
Note: this reverted code fails "CHECK: Alignment should match open
parenthesis"
---
 drivers/net/phy/phy.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index c5dc2c36..c6f6683 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -722,8 +722,10 @@ phy_err:
 int phy_start_interrupts(struct phy_device *phydev)
 {
atomic_set(&phydev->irq_disable, 0);
-   if (request_irq(phydev->irq, phy_interrupt, 0, "phy_interrupt",
-   phydev) < 0) {
+   if (request_irq(phydev->irq, phy_interrupt,
+   IRQF_SHARED,
+   "phy_interrupt",
+   phydev) < 0) {
pr_warn("%s: Can't get IRQ %d (PHY)\n",
phydev->mdio.bus->name, phydev->irq);
phydev->irq = PHY_POLL;
-- 
1.9.1



Re: [PATCH] phy: request shared IRQ

2016-08-24 Thread Xander Huff

On 8/24/2016 1:41 PM, Sergei Shtylyov wrote:

Hello.

On 08/24/2016 08:53 PM, Xander Huff wrote:


From: Nathan Sullivan 

On hardware with multiple PHY devices hooked up to the same IRQ line, allow
them to share it.


Note that it had been allowed until my (erroneous?) commit
33c133cc7598e60976a069344910d63e56cc4401 ("phy: IRQ cannot be shared"), so I'd
like this commit just reverted instead...
I'm not sure now what was the reason I concluded that the IRQ sharing was
impossible... most probably I thought that the kernel IRQ handling code exited
the loop over the IRQ actions once IRQ_HANDLED was returned -- which is
obviously not so in reality...

MBR, Sergei


Thanks for the suggestion, Sergei. I'll do just that.

--
Xander Huff
Staff Software Engineer
National Instruments


Re: [PATCH v2 2/6] cgroup: add support for eBPF programs

2016-08-24 Thread Tejun Heo
Hello, Daniel.

On Wed, Aug 24, 2016 at 10:24:19PM +0200, Daniel Mack wrote:
> +void cgroup_bpf_free(struct cgroup *cgrp)
> +{
> + unsigned int type;
> +
> + rcu_read_lock();
> +
> + for (type = 0; type < __MAX_BPF_ATTACH_TYPE; type++) {
> + if (!cgrp->bpf.prog[type])
> + continue;
> +
> + bpf_prog_put(cgrp->bpf.prog[type]);
> + static_branch_dec(&cgroup_bpf_enabled_key);
> + }
> +
> + rcu_read_unlock();

This RCU locking seems suspicious to me.  RCU locking on the writer side
is usually bogus.  We sometimes do it to work around locking
assertions in accessors but it's a better idea to make the assertions
better in those cases - e.g. sth like assert_mylock_or_rcu_locked().

> +void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent)
> +{
> + unsigned int type;
> +
> + rcu_read_lock();

Ditto.

> + for (type = 0; type < __MAX_BPF_ATTACH_TYPE; type++)
> + rcu_assign_pointer(cgrp->bpf.prog_effective[type],
> + rcu_dereference(parent->bpf.prog_effective[type]));
> +
> + rcu_read_unlock();
> +}
...
> +void __cgroup_bpf_update(struct cgroup *cgrp,
> +  struct cgroup *parent,
> +  struct bpf_prog *prog,
> +  enum bpf_attach_type type)
> +{
> + struct bpf_prog *old_prog, *effective;
> + struct cgroup_subsys_state *pos;
> +
> + rcu_read_lock();

Ditto.

> + old_prog = xchg(cgrp->bpf.prog + type, prog);
> + if (old_prog) {
> + bpf_prog_put(old_prog);
> + static_branch_dec(&cgroup_bpf_enabled_key);
> + }
> +
> + if (prog)
> + static_branch_inc(&cgroup_bpf_enabled_key);

Minor but probably better to inc first and then dec so that you can
avoid unnecessary enabled -> disabled -> enabled sequence.

> + effective = (!prog && parent) ?
> + rcu_dereference(parent->bpf.prog_effective[type]) : prog;

If this is what's triggering rcu warnings, there's an accessor to use
in these situations.

> + rcu_read_unlock();
> +
> + css_for_each_descendant_pre(pos, &cgrp->self) {

On the other hand, this walk actually requires rcu read locking unless
you're holding cgroup_mutex.

Thanks.

-- 
tejun


Re: kernel BUG at net/unix/garbage.c:149!"

2016-08-24 Thread Hannes Frederic Sowa
On 24.08.2016 16:24, Nikolay Borisov wrote:
> Hello, 
> 
> I hit the following BUG: 
> 
> [1851513.239831] [ cut here ]
> [1851513.240079] kernel BUG at net/unix/garbage.c:149!
> [1851513.240313] invalid opcode:  [#1] SMP 
> [1851513.248320] CPU: 37 PID: 11683 Comm: nginx Tainted: G   O
> 4.4.14-clouder3 #26
> [1851513.248719] Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015
> [1851513.248966] task: 883b0f6f ti: 880189cf task.ti: 
> 880189cf
> [1851513.249361] RIP: 0010:[]  [] 
> unix_notinflight+0x8d/0x90
> [1851513.249846] RSP: 0018:880189cf3cf8  EFLAGS: 00010246
> [1851513.250082] RAX: 883b05491968 RBX: 883b05491680 RCX: 
> 8807f9967330
> [1851513.250476] RDX: 0001 RSI: 882e6d8bae00 RDI: 
> 82073f10
> [1851513.250886] RBP: 880189cf3d08 R08: 880cbc70e200 R09: 
> 00018021
> [1851513.251280] R10: 883fff3b9dc0 R11: ea0032f1c380 R12: 
> 883fbaf5
> [1851513.251674] R13: 815f6354 R14: 881a7c77b140 R15: 
> 881a7c7792c0
> [1851513.252083] FS:  7f4f19573720() GS:883fff3a() 
> knlGS:
> [1851513.252481] CS:  0010 DS:  ES:  CR0: 80050033
> [1851513.252717] CR2: 013062d8 CR3: 001712f32000 CR4: 
> 001406e0
> [1851513.253116] Stack:
> [1851513.253345]   880189cf3d40 880189cf3d28 
> 815f4383
> [1851513.254022]  8839ee11a800 8839ee11a800 880189cf3d60 
> 815f53b8
> [1851513.254685]   883406788de0  
> 
> [1851513.255360] Call Trace:
> [1851513.255594]  [] unix_detach_fds.isra.19+0x43/0x50
> [1851513.255851]  [] unix_destruct_scm+0x48/0x80
> [1851513.256090]  [] skb_release_head_state+0x4f/0xb0
> [1851513.256328]  [] skb_release_all+0x12/0x30
> [1851513.256564]  [] kfree_skb+0x32/0xa0
> [1851513.256810]  [] unix_release_sock+0x1e4/0x2c0
> [1851513.257046]  [] unix_release+0x20/0x30
> [1851513.257284]  [] sock_release+0x1f/0x80
> [1851513.257521]  [] sock_close+0x12/0x20
> [1851513.257769]  [] __fput+0xea/0x1f0
> [1851513.258005]  [] fput+0xe/0x10
> [1851513.258244]  [] task_work_run+0x7f/0xb0
> [1851513.258488]  [] exit_to_usermode_loop+0xc0/0xd0
> [1851513.258728]  [] syscall_return_slowpath+0x80/0xf0
> [1851513.258983]  [] int_ret_from_sys_call+0x25/0x9f
> [1851513.259222] Code: 7e 5b 41 5c 5d c3 48 8b 8b e8 02 00 00 48 8b 93 f0 02 
> 00 00 48 89 51 08 48 89 0a 48 89 83 e8 02 00 00 48 89 83 f0 02 00 00 eb b8 
> <0f> 0b 90 0f 1f 44 00 00 55 48 c7 c7 10 3f 07 82 48 89 e5 41 54 
> [1851513.268473] RIP  [] unix_notinflight+0x8d/0x90
> [1851513.268793]  RSP 
> 
> That's essentially BUG_ON(list_empty(&u->link));
> 
> I see that all the code involving the ->link member hasn't really been 
> touched since it was introduced in 2007. So this must be a latent bug. 
> This is the first time I've observed it. The state 
> of the struct unix_sock can be found here http://sprunge.us/WCMW . Evidently, 
> there are no inflight sockets. 

One commit which could have to do with that is

commit fc64869c48494a401b1fb627c9ecc4e6c1d74b0d
Author: Andrey Ryabinin 
Date:   Wed May 18 19:19:27 2016 +0300

net: sock: move ->sk_shutdown out of bitfields.

but that is only a wild guess.

Which unix_sock did you extract specifically in the url you provided? In
unix_notinflight we are specifically checking an unix domain socket that
is itself being transferred over another af_unix domain socket and not
the unix domain socket being released at this point.

Can you reproduce this and maybe also with a newer kernel?

Thanks for the report,
Hannes



[PATCH] net: systemport: Fix ordering in intrl2_*_mask_clear macro

2016-08-24 Thread Florian Fainelli
Since we keep shadow copies of which interrupt sources are enabled
through the intrl2_*_mask_{set,clear} macros, make sure that the
ordering in which we do these two operations: update the copy, then
unmask the register is correct.

This is not currently a problem because we actually do not use them, but
we will in a subsequent patch optimizing register accesses, so better be
safe here.

Fixes: 80105befdb4b ("net: systemport: add Broadcom SYSTEMPORT Ethernet MAC 
driver")
Signed-off-by: Florian Fainelli 
---
David,

This is intentionally targetting the "net-next" tree since it is
not yet a problem, yet this is still technically a bugfix. No need
to backport this to -stable or anything.

Thanks!

 drivers/net/ethernet/broadcom/bcmsysport.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
b/drivers/net/ethernet/broadcom/bcmsysport.c
index b2d30863caeb..2059911014db 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -58,8 +58,8 @@ BCM_SYSPORT_IO_MACRO(topctrl, SYS_PORT_TOPCTRL_OFFSET);
 static inline void intrl2_##which##_mask_clear(struct bcm_sysport_priv *priv, \
u32 mask)   \
 {  \
-   intrl2_##which##_writel(priv, mask, INTRL2_CPU_MASK_CLEAR); \
priv->irq##which##_mask &= ~(mask); \
+   intrl2_##which##_writel(priv, mask, INTRL2_CPU_MASK_CLEAR); \
 }  \
 static inline void intrl2_##which##_mask_set(struct bcm_sysport_priv *priv, \
u32 mask)   \
-- 
2.7.4



[PATCH V2] dt: net: enhance DWC EQoS binding to support Tegra186

2016-08-24 Thread Stephen Warren
From: Stephen Warren 

The Synopsys DWC EQoS is a configurable IP block which supports multiple
options for bus type, clocking and reset structure, and feature list.
Extend the DT binding to define a "compatible value" for the configuration
contained in NVIDIA's Tegra186 SoC, and define some new properties and
list property entries required by that configuration.

Signed-off-by: Stephen Warren 
---
v2:
* Add an explicit compatible value for the Axis SoC's version of the EQOS
  IP; this allows the driver to handle any SoC-specific integration quirks
  that are required, rather than only knowing about the IP block in
  isolation. This is good general DT practice. The existing value is still
  documented to support existing DTs.
* Reworked the list of clocks the binding requires:
  - Combined "tx" and "phy_ref_clk"; for GMII/RGMII configurations, these
are the same thing.
  - Added extra description to the "rx" and "tx" clocks, to make it clear
exactly which HW clock they represent.
  - Made the new "tx" and "slave_bus" names more prominent than the
original "phy_ref_clk" and "apb_pclk". The new names are more generic
and should work for any enhanced version of the binding (e.g. to
support additional PHY types). New compatible values will hopefully
choose to require the new names.
* Added a couple extra clocks to the list that may need to be supported in
  future binding revisions.
* Fixed a typo; "clocks" -> "resets".
---
 .../bindings/net/snps,dwc-qos-ethernet.txt | 75 --
 1 file changed, 71 insertions(+), 4 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/snps,dwc-qos-ethernet.txt 
b/Documentation/devicetree/bindings/net/snps,dwc-qos-ethernet.txt
index 51f8d2eba8d8..1d028259824a 100644
--- a/Documentation/devicetree/bindings/net/snps,dwc-qos-ethernet.txt
+++ b/Documentation/devicetree/bindings/net/snps,dwc-qos-ethernet.txt
@@ -1,21 +1,87 @@
 * Synopsys DWC Ethernet QoS IP version 4.10 driver (GMAC)
 
+This binding supports the Synopsys Designware Ethernet QoS (Quality Of Service)
+IP block. The IP supports multiple options for bus type, clocking and reset
+structure, and feature list. Consequently, a number of properties and list
+entries in properties are marked as optional, or only required in specific HW
+configurations.
 
 Required properties:
-- compatible: Should be "snps,dwc-qos-ethernet-4.10"
+- compatible: One of:
+  - "axis,artpec6-eqos", "snps,dwc-qos-ethernet-4.10"
+Represents the IP core when integrated into the Axis ARTPEC-6 SoC.
+  - "nvidia,tegra186-eqos", "snps,dwc-qos-ethernet-4.10"
+Represents the IP core when integrated into the NVIDIA Tegra186 SoC.
+  - "snps,dwc-qos-ethernet-4.10"
+This combination is deprecated. It should be treated as equivalent to
+"axis,artpec6-eqos", "snps,dwc-qos-ethernet-4.10". It is supported to be
+compatible with earlier revisions of this binding.
 - reg: Address and length of the register set for the device
-- clocks: Phandles to the reference clock and the bus clock
-- clock-names: Should be "phy_ref_clk" for the reference clock and "apb_pclk"
-  for the bus clock.
+- clocks: Phandle and clock specifiers for each entry in clock-names, in the
+  same order. See ../clock/clock-bindings.txt.
+- clock-names: May contain any/all of the following depending on the IP
+  configuration, in any order:
+  - "tx"
+(Alternate name "phy_ref_clk"; only one alternate must appear.)
+The EQOS transmit path clock. The HW signal name is clk_tx_i.
+In some configurations (e.g. GMII/RGMII), this clock also drives the PHY TX
+path. In other configurations, other clocks (such as tx_125, rmii) may
+drive the PHY TX path.
+  - "rx"
+The EQOS receive path clock. The HW signal name is clk_rx_i.
+In some configurations (e.g. GMII/RGMII), this clock also drives the PHY RX
+path. In other configurations, other clocks (such as rx_125, pmarx_0,
+pmarx_1, rmii) may drive the PHY RX path.
+  - "slave_bus"
+(Alternate name "apb_pclk"; only one alternate must appear.)
+The CPU/slave-bus (CSR) interface clock. Despite the name, this applies to
+any bus type; APB, AHB, AXI, etc. The HW signal name is hclk_i (AHB) or
+clk_csr_i (other buses).
+  - "master_bus"
+The master bus interface clock. Only required in configurations that use a
+separate clock for the master and slave bus interfaces. The HW signal name
+is hclk_i (AHB) or aclk_i (AXI).
+  - "ptp_ref"
+The PTP reference clock. The HW signal name is clk_ptp_ref_i.
+
+  Note: Support for additional IP configurations may require adding the
+  following clocks to this list in the future: clk_rx_125_i, clk_tx_125_i,
+  clk_pmarx_0_i, clk_pmarx1_i, clk_rmii_i, clk_revmii_rx_i, clk_revmii_tx_i.
+
+  The following compatible values require the following set of clocks:
+  - "nvidia,tegra186-eqos", "snps,dwc-qos-ethernet-4.10":
+- "slave_bus"
+- "master_bus"
+- "rx"
+- "tx"
+-

Re: [PATCH net-next] net: dsa: rename switch operations structure

2016-08-24 Thread Florian Fainelli
On 08/23/2016 09:38 AM, Vivien Didelot wrote:
> Now that the dsa_switch_driver structure contains only function pointers
> as it is supposed to, rename it to the more appropriate dsa_switch_ops,
> uniformly to any other operations structure in the kernel.
> 
> No functional changes here, basically just the result of something like:
> s/dsa_switch_driver *drv/dsa_switch_ops *ops/g
> 
> However keep the {un,}register_switch_driver functions and their
> dsa_switch_drivers list as is, since they represent the -- likely to be
> deprecated soon -- legacy DSA registration framework.
> 
> In the meantime, also fix the following checks from checkpatch.pl to
> make it happy with this patch:
> 
> CHECK: Comparison to NULL could be written "!ops"
> #403: FILE: net/dsa/dsa.c:470:
> + if (ops == NULL) {
> 
> CHECK: Comparison to NULL could be written "ds->ops->get_strings"
> #773: FILE: net/dsa/slave.c:697:
> + if (ds->ops->get_strings != NULL)
> 
> CHECK: Comparison to NULL could be written "ds->ops->get_ethtool_stats"
> #824: FILE: net/dsa/slave.c:785:
> + if (ds->ops->get_ethtool_stats != NULL)
> 
> CHECK: Comparison to NULL could be written "ds->ops->get_sset_count"
> #835: FILE: net/dsa/slave.c:798:
> + if (ds->ops->get_sset_count != NULL)
> 
> total: 0 errors, 0 warnings, 4 checks, 784 lines checked
> 
> Signed-off-by: Vivien Didelot 

Acked-by: Florian Fainelli 

Thanks!
-- 
Florian


Re: [PATCH 0/5] Networking cgroup controller

2016-08-24 Thread Tejun Heo
Hello, Anoop.

On Wed, Aug 10, 2016 at 05:53:13PM -0700, Anoop Naravaram wrote:
> This patchset introduces a cgroup controller for the networking subsystem as a
> whole. As of now, this controller will be used for:
> 
> * Limiting the specific ports that a process in a cgroup is allowed to bind
>   to or listen on. For example, you can say that all the processes in a
>   cgroup can only bind to ports 1000-2000, and listen on ports 1000-1100, 
> which
>   guarantees that the remaining ports will be available for other processes.
> 
> * Restricting which DSCP values processes can use with their sockets. For
>   example, you can say that all the processes in a cgroup can only send
>   packets with a DSCP tag between 48 and 63 (corresponding to TOS values of
>   192 to 255).
> 
> * Limiting the total number of udp ports that can be used by a process in a
>   cgroup. For example, you can say that all the processes in one cgroup are
>   allowed to use a total of up to 100 udp ports. Since the total number of udp
>   ports that can be used by all processes is limited, this is useful for
>   rationing out the ports to different process groups.
> 
> In the future, more networking-related properties may be added to this
> controller.

Thanks for working on this; however, I share the sentiment expressed
by others that this looks like too piecemeal an approach.  If there
are no alternatives, we surely should consider this but it at least
*looks* like bpf should be able to cover the same functionalities
without having to revise and extend in-kernel capabilities constantly.

Thanks.

-- 
tejun


Re: [net] openvswitch: Allow deferred action fifo to expand during run time

2016-08-24 Thread Lance Richardson
> From: "David Miller" 
> To: az...@ovn.org
> Cc: d...@openvswitch.com, netdev@vger.kernel.org
> Sent: Friday, March 18, 2016 5:19:09 PM
> Subject: Re: [net] openvswitch: Allow deferred action fifo to expand during 
> run time
> 
> From: Andy Zhou 
> Date: Thu, 17 Mar 2016 21:32:13 -0700
> 
> > Current openvswitch implementation allows up to 10 recirculation actions
> > for each packet. This limit was sufficient for most use cases in the
> > past, but with more new features, such as supporting connection
> > tracking, and testing in larger scale network environment,
> > This limit may be too restrictive.
>  ...
> 
> Actions that need to recirculate that many times are extremely poorly
> designed, and will have significant performance problems.
> 
> I think the way rules are put together and processed should be redone
> before we do insane stuff like this.
> 
> There is no way I'm applying a patch like this, sorry.
> 

Apologies for coming into this thread so late, I happened on it after finding
out that this is actually an issue in some production networks.

The need to buffer so many deferred actions seems to be mostly due to having
relatively simple rules (that have, say, one or two recirculations) that get
multiplied per packet by the number of egress ports.

For example, a configuration with 11 or more OVS bond ports in balance-tcp
mode (which needs one recirculation) will exceed the deferred action fifo limit
of 10 every time a broadcast (or multicast or unknown unicast) is forwarded by
the OVS bridge because one entry will be consumed by each egress port. Since
the order in which egress ports are handled is deterministic, this means
e.g. broadcast ARP requests will only ever make it out the first 10 
bond ports in this scenario.

Note that bonding isn't necessary to have this issue, it just makes for a
relatively straightforward example.

Andy's patch certainly seems to be an improvement on this situation,
but maybe there another/better way.

Regards,

   Lance


Re: [ethtool PATCH v4 0/4] Add support for QSFP+/QSFP28 Diagnostics and 25G/50G/100G port speeds

2016-08-24 Thread John W. Linville
On Wed, Aug 24, 2016 at 10:33:04AM -0400, John W. Linville wrote:
> On Wed, Aug 24, 2016 at 04:29:22AM +, Yuval Mintz wrote:
> > > This patch series provides following support
> > > a) Reorganized fields based out of SFF-8024 fields i.e. Identifier/
> > >Encoding/Connector types which are common across SFP/SFP+ (SFF-8472)
> > >and QSFP+/QSFP28 (SFF-8436/SFF-8636) modules into sff-common files.
> > > b) Support for diagnostics information for QSFP Plus/QSFP28 modules
> > >based on SFF-8436/SFF-8636
> > > c) Supporting 25G/50G/100G speeds in supported/advertising fields
> > > d) Tested across various QSFP+/QSFP28 Copper/Optical modules
> > > 
> > > Standards for QSFP+/QSFP28
> > > a) QSFP+/QSFP28 - SFF 8636 Rev 2.7 dated January 26,2016
> > > b) SFF-8024 Rev 4.0 dated May 31, 2016
> > > 
> > > v4:
> > >   Sync ethtool-copy.h to kernel commit
> > > 89da45b8b5b2187734a11038b8593714f964ffd1
> > >   which includes support for 50G base SR2
> > 
> > What about the man-page?
> 
> I can just apply your man page patch on top.

And, I did.

-- 
John W. LinvilleSomeday the world will need a hero, and you
linvi...@tuxdriver.com  might be all we have.  Be ready.


Re: [ethtool PATCH v4 0/4] Add support for QSFP+/QSFP28 Diagnostics and 25G/50G/100G port speeds

2016-08-24 Thread John W. Linville
I have pushed this series. I did modify patches 3 and 4 a bit,
to properly update Makefile.am in order to keep "make distcheck"
from failing -- please be more careful in the future.

John

P.S. I have not yet tagged this as an official release, so please test!

On Tue, Aug 23, 2016 at 06:30:29AM -0700, Vidya Sagar Ravipati wrote:
> From: Vidya Sagar Ravipati 
> 
> This patch series provides the following support
> a) Reorganized fields based out of SFF-8024 fields i.e. Identifier/
>Encoding/Connector types which are common across SFP/SFP+ (SFF-8472)
>and QSFP+/QSFP28 (SFF-8436/SFF-8636) modules into sff-common files.
> b) Support for diagnostics information for QSFP Plus/QSFP28 modules
>based on SFF-8436/SFF-8636
> c) Supporting 25G/50G/100G speeds in supported/advertising fields
> d) Tested across various QSFP+/QSFP28 Copper/Optical modules
> 
> Standards for QSFP+/QSFP28
> a) QSFP+/QSFP28 - SFF 8636 Rev 2.7 dated January 26,2016
> b) SFF-8024 Rev 4.0 dated May 31, 2016
> 
> v4:
>   Sync ethtool-copy.h to kernel commit 
> 89da45b8b5b2187734a11038b8593714f964ffd1
>   which includes support for 50G base SR2
> 
> v3:
>  Review comments from Ben Hutchings:
>Make sff diags structure common across sfpdiag.c and
>qsfp.c and use common function to print common threshold
>values.
>  Review comments from Rami Rosen:
>Cleanup description messages.
> 
> v2:
>   Included support for 25G/50G/100G speeds in supported/
>   advertised speed modes
>   Review comments from Ben Hutchings:
> Split the sff-8024 reorganizing patch and QSFP+/QSFP28
> patch
> Fixed all checkpatch warnings (except couple of over 80 character)
> 
> v1:
>   Support for SFF-8636 Rev 2.7
>   Review comments from Ben Hutchings:
>Updating copyright holders information for QSFP
>Reusing the common functions and macros across sfpid and qsfp
> 
> Vidya Sagar Ravipati (4):
>   ethtool-copy.h:sync with net
>   ethtool:Reorganizing  SFF-8024 fields for SFP/QSFP
>   ethtool:QSFP Plus/QSFP28 Diagnostics Information Support
>   ethtool: Enhancing link mode bits to support 25G/50G/100G
> 
>  Makefile.am|   2 +-
>  ethtool-copy.h |  18 +-
>  ethtool.c  |  35 +++
>  internal.h |   3 +
>  qsfp.c | 788 
> +
>  qsfp.h | 595 +++
>  sff-common.c   | 304 ++
>  sff-common.h   | 189 ++
>  sfpdiag.c  | 105 +---
>  sfpid.c| 103 +---
>  10 files changed, 1945 insertions(+), 197 deletions(-)
>  create mode 100644 qsfp.c
>  create mode 100644 qsfp.h
>  create mode 100644 sff-common.c
>  create mode 100644 sff-common.h
> 
> -- 
> 2.1.4
> 
> 

-- 
John W. LinvilleSomeday the world will need a hero, and you
linvi...@tuxdriver.com  might be all we have.  Be ready.


Re: [PATCH] dt: net: enhance DWC EQoS binding to support Tegra186

2016-08-24 Thread Stephen Warren

On 08/24/2016 02:10 AM, Lars Persson wrote:

On 08/23/2016 10:47 PM, Stephen Warren wrote:

The Synopsys DWC EQoS is a configurable IP block which supports multiple
options for bus type, clocking and reset structure, and feature list.
Extend the DT binding to define a "compatible value" for the
configuration
contained in NVIDIA's Tegra186 SoC, and define some new properties and
list property entries required by that configuration.



diff --git
a/Documentation/devicetree/bindings/net/snps,dwc-qos-ethernet.txt
b/Documentation/devicetree/bindings/net/snps,dwc-qos-ethernet.txt



 Optional properties:



+- phy-reset-gpios: Phandle and specifier for any GPIO used to reset
the PHY.
+  See ../gpio/gpio.txt.


IMHO the phy reset gpio belongs in the binding for the PHY. I notice
some other ethernet drivers have this, but the PHY should be managed
entirely through the phylib and any special handling for reset can be
hidden in phy specific drivers.


I can see that argument; this GPIO certainly does control the PHY so 
seems part of it. However, presumably this GPIO must be manipulated 
before being able to communicate with the PHY at all, and hence 
instantiate any driver that might control the PHY. As such, this seems 
more like a property of the MDIO bus than the PHY itself, even if it 
electrically is part of the PHY. Also, 
Documentation/devicetree/bindings/net/phy.txt doesn't contain any 
phy-reset-gpios property or similar, so we'd have to add that if we 
wanted to rely upon it.


For now I'll post V2 without changing this, but I can always post V3 if 
needed.


[PATCH] iproute: disallow ip rule del without parameters

2016-08-24 Thread Andrey Jr. Melnikov
Disallow running `ip rule del` without any parameters, to avoid
accidentally deleting the first matching rule from the table.

Signed-off-by: Andrey Jr. Melnikov 
---

diff --git a/ip/iprule.c b/ip/iprule.c
index 8f24206..70562c5 100644
--- a/ip/iprule.c
+++ b/ip/iprule.c
@@ -346,6 +346,11 @@ static int iprule_modify(int cmd, int argc, char **argv)
req.r.rtm_type = RTN_UNICAST;
}
 
+   if (cmd == RTM_DELRULE && argc == 0) {
+   fprintf(stderr, "\"ip rule del\" requires arguments.\n");
+   return -1;
+   }
+
while (argc > 0) {
if (strcmp(*argv, "not") == 0) {
req.r.rtm_flags |= FIB_RULE_INVERT;


Re: [ethtool PATCH v4 0/4] Add support for QSFP+/QSFP28 Diagnostics and 25G/50G/100G port speeds

2016-08-24 Thread Vidya Sagar Ravipati
On Wed, Aug 24, 2016 at 1:01 PM, John W. Linville
 wrote:
> I have pushed this series. I did modify patches 3 and 4 a bit,
> to properly update Makefile.am in order to keep "make distcheck"
> from failing -- please be more careful in the future.
>
Thanks for pushing the patches. I wasn't aware of "make distcheck" and
will be careful going forward.
Quickly validated the build on SFP+/QSFP+/QSFP28  and everything seems fine

> John
>
> P.S. I have not yet tagged this as an official release, so please test!
>
> On Tue, Aug 23, 2016 at 06:30:29AM -0700, Vidya Sagar Ravipati wrote:
>> From: Vidya Sagar Ravipati 
>>
>> This patch series provides the following support
>> a) Reorganized fields based out of SFF-8024 fields i.e. Identifier/
>>Encoding/Connector types which are common across SFP/SFP+ (SFF-8472)
>>and QSFP+/QSFP28 (SFF-8436/SFF-8636) modules into sff-common files.
>> b) Support for diagnostics information for QSFP Plus/QSFP28 modules
>>based on SFF-8436/SFF-8636
>> c) Supporting 25G/50G/100G speeds in supported/advertising fields
>> d) Tested across various QSFP+/QSFP28 Copper/Optical modules
>>
>> Standards for QSFP+/QSFP28
>> a) QSFP+/QSFP28 - SFF 8636 Rev 2.7 dated January 26,2016
>> b) SFF-8024 Rev 4.0 dated May 31, 2016
>>
>> v4:
>>   Sync ethtool-copy.h to kernel commit 
>> 89da45b8b5b2187734a11038b8593714f964ffd1
>>   which includes support for 50G base SR2
>>
>> v3:
>>  Review comments from Ben Hutchings:
>>Make sff diags structure common across sfpdiag.c and
>>qsfp.c and use common function to print common threshold
>>values.
>>  Review comments from Rami Rosen:
>>Cleanup description messages.
>>
>> v2:
>>   Included support for 25G/50G/100G speeds in supported/
>>   advertised speed modes
>>   Review comments from Ben Hutchings:
>> Split the sff-8024 reorganizing patch and QSFP+/QSFP28
>> patch
>> Fixed all checkpatch warnings (except couple of over 80 character)
>>
>> v1:
>>   Support for SFF-8636 Rev 2.7
>>   Review comments from Ben Hutchings:
>>Updating copyright holders information for QSFP
>>Reusing the common functions and macros across sfpid and qsfp
>>
>> Vidya Sagar Ravipati (4):
>>   ethtool-copy.h:sync with net
>>   ethtool:Reorganizing  SFF-8024 fields for SFP/QSFP
>>   ethtool:QSFP Plus/QSFP28 Diagnostics Information Support
>>   ethtool: Enhancing link mode bits to support 25G/50G/100G
>>
>>  Makefile.am|   2 +-
>>  ethtool-copy.h |  18 +-
>>  ethtool.c  |  35 +++
>>  internal.h |   3 +
>>  qsfp.c | 788 
>> +
>>  qsfp.h | 595 +++
>>  sff-common.c   | 304 ++
>>  sff-common.h   | 189 ++
>>  sfpdiag.c  | 105 +---
>>  sfpid.c| 103 +---
>>  10 files changed, 1945 insertions(+), 197 deletions(-)
>>  create mode 100644 qsfp.c
>>  create mode 100644 qsfp.h
>>  create mode 100644 sff-common.c
>>  create mode 100644 sff-common.h
>>
>> --
>> 2.1.4
>>
>>
>
> --
> John W. LinvilleSomeday the world will need a hero, and you
> linvi...@tuxdriver.com  might be all we have.  Be ready.


Re: CVE-2014-9900 fix is not upstream

2016-08-24 Thread Hannes Frederic Sowa
On 24.08.2016 16:03, Lennart Sorensen wrote:
> On Tue, Aug 23, 2016 at 10:25:45PM +0100, Al Viro wrote:
>> Sadly, sizeof is what we use when copying that sucker to userland.  So these
>> padding bits in the end would've leaked, true enough, and the case is 
>> somewhat
>> weaker.  And any normal architecture will have those, but then any such
>> architecture will have no more trouble zeroing a 32bit value than 16bit one.
> 
> Hmm, good point.  Too bad I don't see a compiler option of "zero all
> padding in structs".  Certainly generating the code should not really
> be that different.
> 
> I see someone did request it 2 years ago:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63479

I don't think this is sufficient. Basically, if you write one field in a
struct after a memset, the compiler is allowed by the standard to
rewrite the padding bytes, leaving them undefined again.

If we want to go down this route, probably the only option is to add
__attribute__((packed)) to those structs so they have no padding at all,
thus breaking uapi.

E.g. the x11 protocol implementation specifies padding bytes in their
binary representation of the wire protocol to limit the leaking:

https://cgit.freedesktop.org/xorg/proto/xproto/tree/Xproto.h

... which would be another option.

Bye,
Hannes



[PATCH v2 1/6] bpf: add new prog type for cgroup socket filtering

2016-08-24 Thread Daniel Mack
For now, this program type is equivalent to BPF_PROG_TYPE_SOCKET_FILTER in
terms of checks during the verification process. It may access the skb as
well.

Programs of this type will be attached to cgroups for network filtering
and accounting.

Signed-off-by: Daniel Mack 
---
 include/uapi/linux/bpf.h | 7 +++
 kernel/bpf/verifier.c| 1 +
 net/core/filter.c| 6 ++
 3 files changed, 14 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e4c5a1b..1d5db42 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -95,6 +95,13 @@ enum bpf_prog_type {
BPF_PROG_TYPE_SCHED_ACT,
BPF_PROG_TYPE_TRACEPOINT,
BPF_PROG_TYPE_XDP,
+   BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
+};
+
+enum bpf_attach_type {
+   BPF_ATTACH_TYPE_CGROUP_INET_INGRESS,
+   BPF_ATTACH_TYPE_CGROUP_INET_EGRESS,
+   __MAX_BPF_ATTACH_TYPE
 };
 
 #define BPF_PSEUDO_MAP_FD  1
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index abb61f3..12ca880 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1805,6 +1805,7 @@ static bool may_access_skb(enum bpf_prog_type type)
case BPF_PROG_TYPE_SOCKET_FILTER:
case BPF_PROG_TYPE_SCHED_CLS:
case BPF_PROG_TYPE_SCHED_ACT:
+   case BPF_PROG_TYPE_CGROUP_SOCKET_FILTER:
return true;
default:
return false;
diff --git a/net/core/filter.c b/net/core/filter.c
index a83766b..bc04e5c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2848,12 +2848,18 @@ static struct bpf_prog_type_list xdp_type __read_mostly = {
.type   = BPF_PROG_TYPE_XDP,
 };
 
+static struct bpf_prog_type_list cg_sk_filter_type __read_mostly = {
+   .ops= &sk_filter_ops,
+   .type   = BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
+};
+
 static int __init register_sk_filter_ops(void)
 {
bpf_register_prog_type(&sk_filter_type);
bpf_register_prog_type(&sched_cls_type);
bpf_register_prog_type(&sched_act_type);
bpf_register_prog_type(&xdp_type);
+   bpf_register_prog_type(&cg_sk_filter_type);
 
return 0;
 }
-- 
2.5.5



[PATCH v2 4/6] net: filter: run cgroup eBPF ingress programs

2016-08-24 Thread Daniel Mack
If the cgroup associated with the receiving socket has eBPF
programs installed, run them from sk_filter_trim_cap().

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop them. The programs have access to
the full skb, including the MAC headers.

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack 
---
 net/core/filter.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index bc04e5c..163f75b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -78,6 +78,11 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
return -ENOMEM;
 
+   err = cgroup_bpf_run_filter(sk, skb,
+   BPF_ATTACH_TYPE_CGROUP_INET_INGRESS);
+   if (err)
+   return err;
+
err = security_sock_rcv_skb(sk, skb);
if (err)
return err;
-- 
2.5.5



[PATCH v2 5/6] net: core: run cgroup eBPF egress programs

2016-08-24 Thread Daniel Mack
If the cgroup associated with the sending socket has eBPF
programs installed, run them from __dev_queue_xmit().

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop them. The programs have access to
the full skb, including the MAC headers.

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack 
---
 net/core/dev.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index a75df86..17484e6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -141,6 +141,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "net-sysfs.h"
 
@@ -3329,6 +3330,11 @@ static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
__skb_tstamp_tx(skb, NULL, skb->sk, SCM_TSTAMP_SCHED);
 
+   rc = cgroup_bpf_run_filter(skb->sk, skb,
+  BPF_ATTACH_TYPE_CGROUP_INET_EGRESS);
+   if (rc)
+   return rc;
+
/* Disable soft irqs for various locks below. Also
 * stops preemption for RCU.
 */
-- 
2.5.5



[PATCH v2 0/6] Add eBPF hooks for cgroups

2016-08-24 Thread Daniel Mack
This is v2 of the patch set to allow eBPF programs for network
filtering and accounting to be attached to cgroups, so that they apply
to all sockets of all tasks placed in that cgroup. The logic can also
be extended to other cgroup-based eBPF use-cases.

Changes from v1:

* Moved all bpf specific cgroup code into its own file, and stub
  out related functions for !CONFIG_CGROUP_BPF as static inline nops.
  This way, the call sites are not cluttered with #ifdef guards while
  the feature remains compile-time configurable.

* Implemented the new scheme proposed by Tejun. Per cgroup, store one
  set of pointers that are pinned to the cgroup, and one for the
  programs that are effective. When a program is attached or detached,
  the change is propagated to all the cgroup's descendants. If a
  subcgroup has its own pinned program, skip the whole subbranch in
  order to allow delegation models.

* The hookup for egress packets is now done from __dev_queue_xmit().

* A static key is now used in both the ingress and egress fast paths
  to keep performance penalties close to zero if the feature is
  not in use.

* Overall cleanup to make the accessors use the program arrays.
  This should make it much easier to add new program types, which
  will then automatically follow the pinned vs. effective logic.

* Fixed locking issues, as pointed out by Eric Dumazet and Alexei
  Starovoitov. Changes to the program array are now done with
  xchg() and are protected by cgroup_mutex.

* eBPF programs are now expected to return 1 to let the packet pass,
  not >= 0. Pointed out by Alexei.

* Operation is now limited to INET sockets, so local AF_UNIX sockets
  are not affected. The enum members are renamed accordingly. In case
  other socket families should be supported, this can be extended in
  the future.

* The sample program learned to support both ingress and egress, and
  can now optionally make the eBPF program drop packets by making it
  return 0.


As always, feedback is much appreciated.

Thanks,
Daniel

Daniel Mack (6):
  bpf: add new prog type for cgroup socket filtering
  cgroup: add support for eBPF programs
  bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
  net: filter: run cgroup eBPF ingress programs
  net: core: run cgroup eBPF egress programs
  samples: bpf: add userspace example for attaching eBPF programs to
cgroups

 include/linux/bpf-cgroup.h  |  70 ++
 include/linux/cgroup-defs.h |   4 +
 include/uapi/linux/bpf.h|  16 
 init/Kconfig|  12 +++
 kernel/bpf/Makefile |   1 +
 kernel/bpf/cgroup.c | 159 
 kernel/bpf/syscall.c|  79 
 kernel/bpf/verifier.c   |   1 +
 kernel/cgroup.c |  18 +
 net/core/dev.c  |   6 ++
 net/core/filter.c   |  11 +++
 samples/bpf/Makefile|   2 +
 samples/bpf/libbpf.c|  23 ++
 samples/bpf/libbpf.h|   3 +
 samples/bpf/test_cgrp2_attach.c | 147 +
 15 files changed, 552 insertions(+)
 create mode 100644 include/linux/bpf-cgroup.h
 create mode 100644 kernel/bpf/cgroup.c
 create mode 100644 samples/bpf/test_cgrp2_attach.c

-- 
2.5.5



[PATCH v2 6/6] samples: bpf: add userspace example for attaching eBPF programs to cgroups

2016-08-24 Thread Daniel Mack
Add a simple userspace program to demonstrate the new API to attach eBPF
programs to cgroups. This is what it does:

 * Create arraymap in kernel with 4 byte keys and 8 byte values

 * Load eBPF program

   The eBPF program accesses the map passed in to store two pieces of
   information. The number of invocations of the program, which maps
   to the number of packets received, is stored to key 0. Key 1 is
   incremented on each iteration by the number of bytes stored in
   the skb.

 * Detach any eBPF program previously attached to the cgroup

 * Attach the new program to the cgroup using BPF_PROG_ATTACH

 * Once a second, read map[0] and map[1] to see how many bytes and
   packets were seen on any socket of tasks in the given cgroup.

The program takes a cgroup path as 1st argument, and either "ingress"
or "egress" as 2nd. Optionally, "drop" can be passed as 3rd argument,
which will make the generated eBPF program return 0 instead of 1, so
the kernel will drop the packet.

libbpf gained two new wrappers for the new syscall commands.

Signed-off-by: Daniel Mack 
---
 samples/bpf/Makefile|   2 +
 samples/bpf/libbpf.c|  23 +++
 samples/bpf/libbpf.h|   3 +
 samples/bpf/test_cgrp2_attach.c | 147 
 4 files changed, 175 insertions(+)
 create mode 100644 samples/bpf/test_cgrp2_attach.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index db3cb06..5c752f5 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -22,6 +22,7 @@ hostprogs-y += spintest
 hostprogs-y += map_perf_test
 hostprogs-y += test_overhead
 hostprogs-y += test_cgrp2_array_pin
+hostprogs-y += test_cgrp2_attach
 hostprogs-y += xdp1
 hostprogs-y += xdp2
 hostprogs-y += test_current_task_under_cgroup
@@ -47,6 +48,7 @@ spintest-objs := bpf_load.o libbpf.o spintest_user.o
 map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
 test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
 test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
+test_cgrp2_attach-objs := libbpf.o test_cgrp2_attach.o
 xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
 # reuse xdp1 source intentionally
 xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
index 9969e35..95e196e 100644
--- a/samples/bpf/libbpf.c
+++ b/samples/bpf/libbpf.c
@@ -104,6 +104,29 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
 }
 
+int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type)
+{
+   union bpf_attr attr = {
+   .target_fd = target_fd,
+   .attach_bpf_fd = prog_fd,
+   .attach_type = type,
+   .attach_flags = 0,
+   };
+
+   return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
+}
+
+int bpf_prog_detach(int target_fd, enum bpf_attach_type type)
+{
+   union bpf_attr attr = {
+   .target_fd = target_fd,
+   .attach_type = type,
+   .attach_flags = 0,
+   };
+
+   return syscall(__NR_bpf, BPF_PROG_DETACH, &attr, sizeof(attr));
+}
+
 int bpf_obj_pin(int fd, const char *pathname)
 {
union bpf_attr attr = {
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index 364582b..f973241 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -15,6 +15,9 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
  const struct bpf_insn *insns, int insn_len,
  const char *license, int kern_version);
 
+int bpf_prog_attach(int prog_fd, int attachable_fd, enum bpf_attach_type type);
+int bpf_prog_detach(int attachable_fd, enum bpf_attach_type type);
+
 int bpf_obj_pin(int fd, const char *pathname);
 int bpf_obj_get(const char *pathname);
 
diff --git a/samples/bpf/test_cgrp2_attach.c b/samples/bpf/test_cgrp2_attach.c
new file mode 100644
index 000..0a44c3d
--- /dev/null
+++ b/samples/bpf/test_cgrp2_attach.c
@@ -0,0 +1,147 @@
+/* eBPF example program:
+ *
+ * - Creates arraymap in kernel with 4 bytes keys and 8 byte values
+ *
+ * - Loads eBPF program
+ *
+ *   The eBPF program accesses the map passed in to store two pieces of
+ *   information. The number of invocations of the program, which maps
+ *   to the number of packets received, is stored to key 0. Key 1 is
+ *   incremented on each iteration by the number of bytes stored in
+ *   the skb.
+ *
+ * - Detaches any eBPF program previously attached to the cgroup
+ *
+ * - Attaches the new program to a cgroup using BPF_PROG_ATTACH
+ *
+ * - Every second, reads map[0] and map[1] to see how many bytes and
+ *   packets were seen on any socket of tasks in the given cgroup.
+ */
+
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#include "libbpf.h"
+
+enum {
+   MAP_KEY_PACKETS,
+   MAP_KEY_BYTES,
+};
+
+static int prog_load(int map_fd, int verdict)
+{
+ 

[PATCH v2 2/6] cgroup: add support for eBPF programs

2016-08-24 Thread Daniel Mack
This patch adds two sets of eBPF program pointers to struct cgroup.
One for such that are directly pinned to a cgroup, and one for such
that are effective for it.

To illustrate the logic behind that, assume the following example
cgroup hierarchy.

  A - B - C
       \ D - E

If only B has a program attached, it will be effective for B, C, D
and E. If D then attaches a program itself, that will be effective for
both D and E, and the program in B will only affect B and C. Only one
program of a given type is effective for a cgroup.

Attaching and detaching programs will be done through the bpf(2)
syscall. For now, ingress and egress inet socket filtering are the
only supported use-cases.

Signed-off-by: Daniel Mack 
---
 include/linux/bpf-cgroup.h  |  70 +++
 include/linux/cgroup-defs.h |   4 ++
 init/Kconfig|  12 
 kernel/bpf/Makefile |   1 +
 kernel/bpf/cgroup.c | 159 
 kernel/cgroup.c |  18 +
 6 files changed, 264 insertions(+)
 create mode 100644 include/linux/bpf-cgroup.h
 create mode 100644 kernel/bpf/cgroup.c

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
new file mode 100644
index 000..d85d50f
--- /dev/null
+++ b/include/linux/bpf-cgroup.h
@@ -0,0 +1,70 @@
+#ifndef _BPF_CGROUP_H
+#define _BPF_CGROUP_H
+
+#include 
+#include 
+
+struct sock;
+struct cgroup;
+struct sk_buff;
+
+#ifdef CONFIG_CGROUP_BPF
+
+extern struct static_key_false cgroup_bpf_enabled_key;
+#define cgroup_bpf_enabled static_branch_unlikely(&cgroup_bpf_enabled_key)
+
+struct cgroup_bpf {
+   /*
+* Store two sets of bpf_prog pointers, one for programs that are
+* pinned directly to this cgroup, and one for those that are effective
+* when this cgroup is accessed.
+*/
+   struct bpf_prog *prog[__MAX_BPF_ATTACH_TYPE];
+   struct bpf_prog *prog_effective[__MAX_BPF_ATTACH_TYPE];
+};
+
+void cgroup_bpf_free(struct cgroup *cgrp);
+void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent);
+
+void __cgroup_bpf_update(struct cgroup *cgrp,
+struct cgroup *parent,
+struct bpf_prog *prog,
+enum bpf_attach_type type);
+
+/* Wrapper for __cgroup_bpf_update() protected by cgroup_mutex */
+void cgroup_bpf_update(struct cgroup *cgrp,
+  struct bpf_prog *prog,
+  enum bpf_attach_type type);
+
+int __cgroup_bpf_run_filter(struct sock *sk,
+   struct sk_buff *skb,
+   enum bpf_attach_type type);
+
+/* Wrapper for __cgroup_bpf_run_filter() guarded by cgroup_bpf_enabled */
+static inline int cgroup_bpf_run_filter(struct sock *sk,
+   struct sk_buff *skb,
+   enum bpf_attach_type type)
+{
+   if (cgroup_bpf_enabled)
+   return __cgroup_bpf_run_filter(sk, skb, type);
+
+   return 0;
+}
+
+#else
+
+struct cgroup_bpf {};
+static inline void cgroup_bpf_free(struct cgroup *cgrp) {}
+static inline void cgroup_bpf_inherit(struct cgroup *cgrp,
+ struct cgroup *parent) {}
+
+static inline int cgroup_bpf_run_filter(struct sock *sk,
+   struct sk_buff *skb,
+   enum bpf_attach_type type)
+{
+   return 0;
+}
+
+#endif /* CONFIG_CGROUP_BPF */
+
+#endif /* _BPF_CGROUP_H */
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 5b17de6..861b467 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_CGROUPS
 
@@ -300,6 +301,9 @@ struct cgroup {
/* used to schedule release agent */
struct work_struct release_agent_work;
 
+   /* used to store eBPF programs */
+   struct cgroup_bpf bpf;
+
/* ids of the ancestors at each level including self */
int ancestor_ids[];
 };
diff --git a/init/Kconfig b/init/Kconfig
index cac3f09..5a89c83 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1144,6 +1144,18 @@ config CGROUP_PERF
 
  Say N if unsure.
 
+config CGROUP_BPF
+   bool "Support for eBPF programs attached to cgroups"
+   depends on BPF_SYSCALL && SOCK_CGROUP_DATA
+   help
+ Allow attaching eBPF programs to a cgroup using the bpf(2)
+ syscall command BPF_PROG_ATTACH.
+
+ In which context these programs are accessed depends on the type
+ of attachment. For instance, programs that are attached using
+ BPF_ATTACH_TYPE_CGROUP_INET_INGRESS will be executed on the
+ ingress path of inet sockets.
+
 config CGROUP_DEBUG
bool "Example controller"
default n
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index eed911d..b22256b 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -5,3 +5

[PATCH v2 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

2016-08-24 Thread Daniel Mack
Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and
BPF_PROG_DETACH which allow attaching and detaching eBPF programs
to a target.

On the API level, the target could be anything that has an fd in
userspace, hence the name of the field in union bpf_attr is called
'target_fd'.

When called with BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS, the target is
expected to be a valid file descriptor of a cgroup v2 directory which
has the bpf controller enabled. These are the only use-cases
implemented by this patch at this point, but more can be added.

If a program of the given type already exists in the given cgroup,
the program is swapped atomically, so userspace does not have to drop
an existing program first before installing a new one, which would
otherwise leave a gap in which no program is attached.

For more information on the propagation logic to subcgroups, please
refer to the bpf cgroup controller implementation.

The API is guarded by CAP_NET_ADMIN.

Signed-off-by: Daniel Mack 
---
 include/uapi/linux/bpf.h |  9 ++
 kernel/bpf/syscall.c | 79 
 2 files changed, 88 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1d5db42..4cc2dcf 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -73,6 +73,8 @@ enum bpf_cmd {
BPF_PROG_LOAD,
BPF_OBJ_PIN,
BPF_OBJ_GET,
+   BPF_PROG_ATTACH,
+   BPF_PROG_DETACH,
 };
 
 enum bpf_map_type {
@@ -147,6 +149,13 @@ union bpf_attr {
__aligned_u64   pathname;
__u32   bpf_fd;
};
+
+   struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
+   __u32   target_fd;  /* container object to attach to */
+   __u32   attach_bpf_fd;  /* eBPF program to attach */
+   __u32   attach_type;/* BPF_ATTACH_TYPE_* */
+   __u64   attach_flags;
+   };
 } __attribute__((aligned(8)));
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 228f962..208cba2 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -822,6 +822,75 @@ static int bpf_obj_get(const union bpf_attr *attr)
return bpf_obj_get_user(u64_to_ptr(attr->pathname));
 }
 
+#ifdef CONFIG_CGROUP_BPF
+static int bpf_prog_attach(const union bpf_attr *attr)
+{
+   struct bpf_prog *prog;
+
+   if (!capable(CAP_NET_ADMIN))
+   return -EPERM;
+
+   /* Flags are unused for now */
+   if (attr->attach_flags != 0)
+   return -EINVAL;
+
+   switch (attr->attach_type) {
+   case BPF_ATTACH_TYPE_CGROUP_INET_INGRESS:
+   case BPF_ATTACH_TYPE_CGROUP_INET_EGRESS: {
+   struct cgroup *cgrp;
+
+   prog = bpf_prog_get_type(attr->attach_bpf_fd,
+BPF_PROG_TYPE_CGROUP_SOCKET_FILTER);
+   if (IS_ERR(prog))
+   return PTR_ERR(prog);
+
+   cgrp = cgroup_get_from_fd(attr->target_fd);
+   if (IS_ERR(cgrp)) {
+   bpf_prog_put(prog);
+   return PTR_ERR(cgrp);
+   }
+
+   cgroup_bpf_update(cgrp, prog, attr->attach_type);
+   cgroup_put(cgrp);
+
+   break;
+   }
+
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+static int bpf_prog_detach(const union bpf_attr *attr)
+{
+   if (!capable(CAP_NET_ADMIN))
+   return -EPERM;
+
+   switch (attr->attach_type) {
+   case BPF_ATTACH_TYPE_CGROUP_INET_INGRESS:
+   case BPF_ATTACH_TYPE_CGROUP_INET_EGRESS: {
+   struct cgroup *cgrp;
+
+   cgrp = cgroup_get_from_fd(attr->target_fd);
+   if (IS_ERR(cgrp))
+   return PTR_ERR(cgrp);
+
+   cgroup_bpf_update(cgrp, NULL, attr->attach_type);
+   cgroup_put(cgrp);
+
+   break;
+   }
+
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+#endif /* CONFIG_CGROUP_BPF */
+
SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
 {
union bpf_attr attr = {};
@@ -888,6 +957,16 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
case BPF_OBJ_GET:
err = bpf_obj_get(&attr);
break;
+
+#ifdef CONFIG_CGROUP_BPF
+   case BPF_PROG_ATTACH:
+   err = bpf_prog_attach(&attr);
+   break;
+   case BPF_PROG_DETACH:
+   err = bpf_prog_detach(&attr);
+   break;
+#endif
+
default:
err = -EINVAL;
break;
-- 
2.5.5



Re: [PATCH net-next V2 2/4] net/dst: Utility functions to build dst_metadata without supplying an skb

2016-08-24 Thread Shmulik Ladkani
Hi,

On Wed, 24 Aug 2016 15:27:08 +0300 Amir Vadai  wrote:
> Extract _ip_tun_rx_dst() and _ipv6_tun_rx_dst() out of ip_tun_rx_dst()
> and ipv6_tun_rx_dst(), to be used without supplying an skb.

Additional thing.

In subsequent patches the newly introduced '_ip_tun_rx_dst' and
'_ipv6_tun_rx_dst' are used in a non "rx" context (e.g. for constructing
an IP_TUNNEL_INFO_TX in act_tunnel_key), so the names are misleading.

Consider renaming.


Re: [PATCH net-next V2 2/4] net/dst: Utility functions to build dst_metadata without supplying an skb

2016-08-24 Thread Shmulik Ladkani
Hi,

On Wed, 24 Aug 2016 15:27:08 +0300 Amir Vadai  wrote:
> +static inline struct metadata_dst *
> +_ipv6_tun_rx_dst(struct in6_addr saddr, struct in6_addr daddr,
> +  __u8 tos, __u8 ttl, __be32 label,
> +  __be16 flags, __be64 tunnel_id, int md_size)
> +{

Prefer 'const struct in6_addr *saddr' parameter (daddr too).

This is aligned with almost all functions having an 'in6_addr' as a
parameter, to prevent the costly argument copy.


Re: [PATCH net-next] net: minor optimization in qdisc_qstats_cpu_drop()

2016-08-24 Thread Eric Dumazet
On Wed, 2016-08-24 at 11:04 -0700, Rick Jones wrote:
> On 08/24/2016 10:23 AM, Eric Dumazet wrote:
> > From: Eric Dumazet 
> >
> > per_cpu_inc() is faster (at least on x86) than per_cpu_ptr(xxx)++;
> 
> Is it possible it is non-trivially slower on other architectures?

No, in the worst case, compiler would emit the same code.




Re: [PATCH] net: macb: Increase DMA buffer size

2016-08-24 Thread Nicolas Ferre
On 24/08/2016 20:25, Xander Huff wrote:
> From: Nathan Sullivan 
> 
> In recent testing with the RT patchset, we have seen cases where the
> transmit ring can fill even with up to 200 txbds in the ring.  Increase
> the size of the DMA rings to avoid overruns.
> 
> Signed-off-by: Nathan Sullivan 
> Acked-by: Ben Shelton 
> Acked-by: Jaeden Amero 
> Natinst-ReviewBoard-ID: 83662
> ---
>  drivers/net/ethernet/cadence/macb.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/cadence/macb.c b/drivers/net/ethernet/cadence/macb.c
> index 3256839..86a8e20 100644
> --- a/drivers/net/ethernet/cadence/macb.c
> +++ b/drivers/net/ethernet/cadence/macb.c
> @@ -35,12 +35,12 @@
>  
>  #include "macb.h"
>  
> -#define MACB_RX_BUFFER_SIZE  128
> +#define MACB_RX_BUFFER_SIZE  1536

This change seems not covered by the commit message. Can you  please
separate the changes in 2 patches or elaborate a bit more the reason for
this RX buffer size change.

Bye,

>  #define RX_BUFFER_MULTIPLE   64  /* bytes */
>  #define RX_RING_SIZE 512 /* must be power of 2 */
>  #define RX_RING_BYTES(sizeof(struct macb_dma_desc) * RX_RING_SIZE)
>  
> -#define TX_RING_SIZE 128 /* must be power of 2 */
> +#define TX_RING_SIZE 512 /* must be power of 2 */
>  #define TX_RING_BYTES(sizeof(struct macb_dma_desc) * TX_RING_SIZE)
>  
>  /* level of occupied TX descriptors under which we wake up TX process */
> 


-- 
Nicolas Ferre


Re: [PATCH net-next] net: minor optimization in qdisc_qstats_cpu_drop()

2016-08-24 Thread Eric Dumazet
On Wed, 2016-08-24 at 11:00 -0700, John Fastabend wrote:

> Looks good to me. I guess we can also do the same for overlimit qstats.
> 
> Acked-by: John Fastabend 

Not sure about overlimit, although I could probably change these:

net/sched/act_bpf.c:85: qstats_drop_inc(this_cpu_ptr(prog->common.cpu_qstats));
net/sched/act_gact.c:145: qstats_drop_inc(this_cpu_ptr(gact->common.cpu_qstats));




Re: [net-next PATCH 05/15] net: sched: a dflt qdisc may be used with per cpu stats

2016-08-24 Thread Eric Dumazet
On Wed, 2016-08-24 at 10:50 -0700, John Fastabend wrote:
> On 16-08-24 10:26 AM, Eric Dumazet wrote:
> > On Wed, 2016-08-24 at 10:13 -0700, John Fastabend wrote:
> > 
> >>>
> >>
> >> I could fully allocate it in qdisc_alloc() but we don't know if the
> >> qdisc needs per cpu data structures until after the init call
> > 
> > Should not we have a flag to advertise the need of per spu stats on
> > qdisc ?
> > 
> > This is not clear why ->init() can know this, and not its caller.
> > 
> 
> sure we could add a static_flags field in the ops structure. What do
> you think about doing that?

This is what I was suggesting yes.

> 
> We would still need some flags to be set at init though like the bypass
> bit it looks like some qdiscs set that based on user input.
> 




Re: [PATCH net-next V2 4/4] net/sched: Introduce act_tunnel_key

2016-08-24 Thread Shmulik Ladkani
Hi,

On Wed, 24 Aug 2016 15:27:10 +0300 Amir Vadai  wrote:
> +config NET_ACT_TUNNEL_KEY
> +tristate "IP tunnel metadata manipulation"
> +depends on NET_CLS_ACT
> +---help---
> +   Say Y here to set/release ip tunnel metadata.
> +
> +   If unsure, say N.
> +
> +   To compile this code as a module, choose M here: the
> +   module will be called act_tunnel.

actually looks like it's called "act_tunnel_key" ;)

> +static int tunnel_key_act(struct sk_buff *skb, const struct tc_action *a,
> +   struct tcf_result *res)
> +{
> + struct tcf_tunnel_key *t = to_tunnel_key(a);
> + int action;
> +
> + spin_lock(&t->tcf_lock);
> + tcf_lastuse_update(&t->tcf_tm);
> + bstats_update(&t->tcf_bstats, skb);
> + action = t->tcf_action;
> +
> + switch (t->tcft_action) {
> + case TCA_TUNNEL_KEY_ACT_RELEASE:
> + skb_dst_set_noref(skb, NULL);
> + break;
> + case TCA_TUNNEL_KEY_ACT_SET:
> + skb_dst_set_noref(skb, &t->tcft_enc_metadata->dst);
> +
> + break;

nit: empty line unneeded here.

> +static int tunnel_key_init(struct net *net, struct nlattr *nla,
> +struct nlattr *est, struct tc_action **a,
> +int ovr, int bind)
> +{
> + struct tc_action_net *tn = net_generic(net, tunnel_key_net_id);
> + struct nlattr *tb[TCA_TUNNEL_KEY_MAX + 1];
> + struct metadata_dst *metadata = NULL;
> + struct tc_tunnel_key *parm;
> + struct tcf_tunnel_key *t;
> + __be64 key_id;
> + int encapdecap;
> + bool exists = false;
> + int ret = 0;
> + int err;
> +
> + if (!nla)
> + return -EINVAL;
> +
> + err = nla_parse_nested(tb, TCA_TUNNEL_KEY_MAX, nla, tunnel_key_policy);
> + if (err < 0)
> + return err;
> +
> + if (!tb[TCA_TUNNEL_KEY_PARMS])
> + return -EINVAL;
> +
> + parm = nla_data(tb[TCA_TUNNEL_KEY_PARMS]);
> + exists = tcf_hash_check(tn, parm->index, a, bind);
> + if (exists && bind)
> + return 0;
> +
> + encapdecap = parm->t_action;
> +
> + switch (encapdecap) {

As we no longer have "encapdecap" actions, either rename or just use
parm->t_action explicitly (only needed twice).

> +static int tunnel_key_dump_addresses(struct sk_buff *skb,
> +  const struct ip_tunnel_info *info)
> +{
> + unsigned short family = ip_tunnel_info_af(info);
> +
> + if (family == AF_INET) {
> + __be32 saddr = info->key.u.ipv4.src;
> + __be32 daddr = info->key.u.ipv4.dst;
> +
> + if (!nla_put_be32(skb, TCA_TUNNEL_KEY_ENC_IPV4_SRC, saddr) &&
> + !nla_put_be32(skb, TCA_TUNNEL_KEY_ENC_IPV4_DST, daddr))
> + return 0;
> + }
> +
> + if (family == AF_INET6) {
> + struct in6_addr saddr6 = info->key.u.ipv6.src;
> + struct in6_addr daddr6 = info->key.u.ipv6.dst;

Why the in6_addr copy? Point to the things, then pass the pointers to
nla_put_in6_addr().

Also, there are few lines too long.

Regards,
Shmulik


Re: [PATCH net-next 2/3] net: mpls: Fixups for GSO

2016-08-24 Thread David Ahern
On 8/24/16 11:41 AM, pravin shelar wrote:
> You also need to change pop_mpls().

What change is needed in pop_mpls? It already resets the mac_header and if MPLS 
labels are removed there is no need to set network_header. I take it you mean 
if the protocol is still MPLS and there are still labels then the network 
header needs to be set and that means finding the bottom label. Does OVS set 
the bottom of stack bit? From what I can tell OVS is not parsing the MPLS label 
so no requirement that BOS is set. Without that there is no way to tell when 
the labels are done short of guessing.

> 
> Anyways I was thinking about the neigh output functions skb pull
> issue, where it is using network-header offset. Can we use mac_len?
> this way we would not use any inner offsets for MPLS skb and current
> scheme used by OVS datapath works.

neigh_resolve_output and neigh_connected_output both do an __skb_pull to the 
network offset. When these functions are called there may or may not be a mac 
header set in the skb making the mac_header unreliable for how you want to use 
it. e.g. I tried this:

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 2ae929f9bd06..9f20a0b8e6be 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1292,12 +1292,16 @@ int neigh_resolve_output(struct neighbour *neigh, struct sk_buff *skb)
int err;
struct net_device *dev = neigh->dev;
unsigned int seq;
+   unsigned int offset = skb_network_offset(skb);
+
+   if (unlikely(skb_mac_header_was_set(skb)))
+   offset = skb_mac_header(skb) - skb->data;

if (dev->header_ops->cache && !neigh->hh.hh_len)
neigh_hh_init(neigh);

do {
-   __skb_pull(skb, skb_network_offset(skb));
+   __skb_pull(skb, offset);
seq = read_seqbegin(&neigh->ha_lock);
err = dev_hard_header(skb, dev, ntohs(skb->protocol),
  neigh->ha, NULL, skb->len);


It does not work. The MPLS packet goes down the stack fine, but when the packet 
is forwarded from one namespace to another you can get a panic since it hits 
neigh_resolve_output with a mac header and the pull above will do the wrong 
thing.

[   18.254133] BUG: unable to handle kernel paging request at 88023860404a
[   18.255566] IP: [] eth_header+0x40/0xaf
[   18.256649] PGD 1c40067 PUD 0
[   18.257277] Oops: 0002 [#1] SMP
[   18.257872] Modules linked in: veth 8021q garp mrp stp llc vrf
[   18.259168] CPU: 2 PID: 868 Comm: ping Not tainted 4.8.0-rc2+ #81
[   18.260308] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.7.5-20140531_083030-gandalf 04/01/2014
[   18.262184] task: 88013ab61040 task.stack: 88013509
[   18.263285] RIP: 0010:[]  [] 
eth_header+0x40/0xaf
[   18.264762] RSP: 0018:88013fd03c80  EFLAGS: 00010216
[   18.265791] RAX: 88023860403e RBX: 0008 RCX: 88013a5c18a0
[   18.267040] RDX: 88023860403e RSI: 000e RDI: 88013ab0a200
[   18.268307] RBP: 88013fd03ca8 R08:  R09: 0058
[   18.269556] R10: 88023860403e R11:  R12: 88013a5c18a0
[   18.270807] R13: 880135b0b000 R14: 880135b0b000 R15: 88013a5c1828
[   18.272064] FS:  7fbc44b66700() GS:88013fd0() 
knlGS:
[   18.273477] CS:  0010 DS:  ES:  CR0: 80050033
[   18.274492] CR2: 88023860404a CR3: 0001350c8000 CR4: 000406e0
[   18.275746] Stack:
[   18.276125]   00580246 88013ab0a200 
0002
[   18.277519]  88013a5c1800 88013fd03cb8 813d5912 
88013fd03d00
[   18.278904]  813d73ea 88013a5c18a0 fffc01000246 
88013a5c1838
[   18.280295] Call Trace:
[   18.280712]  
[   18.281049]  [] dev_hard_header.constprop.42+0x26/0x28
[   18.282204]  [] neigh_resolve_output+0x1b9/0x270
[   18.283228]  [] neigh_update+0x372/0x497
[   18.284160]  [] arp_process+0x520/0x572
[   18.285061]  [] arp_rcv+0x10e/0x17d
[   18.285909]  [] __netif_receive_skb_core+0x3ea/0x4b8
[   18.286995]  [] __netif_receive_skb+0x16/0x66
[   18.287993]  [] process_backlog+0xa4/0x132
[   18.288935]  [] net_rx_action+0xd1/0x242
[   18.289854]  [] __do_softirq+0x100/0x26d
[   18.290764]  [] do_softirq_own_stack+0x1c/0x30
[   18.291775]  
[   18.292100]  [] do_softirq+0x30/0x3b
[   18.292968]  [] __local_bh_enable_ip+0x69/0x73
[   18.293919]  [] local_bh_enable+0x15/0x17
[   18.294798]  [] neigh_xmit+0x93/0xe3
[   18.295626]  [] mpls_xmit+0x379/0x3c0
[   18.296464]  [] lwtunnel_xmit+0x48/0x63



Generically though this approach just feels wrong. You want to lump the MPLS 
labels with the ethernet header but not formally, just by playing games with 
skb markers. The core networking stack is resisting this approach.



 


[PATCH] net: macb: Increase DMA buffer size

2016-08-24 Thread Xander Huff
From: Nathan Sullivan 

In recent testing with the RT patchset, we have seen cases where the
transmit ring can fill even with up to 200 txbds in the ring.  Increase
the size of the DMA rings to avoid overruns.

Signed-off-by: Nathan Sullivan 
Acked-by: Ben Shelton 
Acked-by: Jaeden Amero 
Natinst-ReviewBoard-ID: 83662
---
 drivers/net/ethernet/cadence/macb.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb.c 
b/drivers/net/ethernet/cadence/macb.c
index 3256839..86a8e20 100644
--- a/drivers/net/ethernet/cadence/macb.c
+++ b/drivers/net/ethernet/cadence/macb.c
@@ -35,12 +35,12 @@
 
 #include "macb.h"
 
-#define MACB_RX_BUFFER_SIZE128
+#define MACB_RX_BUFFER_SIZE1536
 #define RX_BUFFER_MULTIPLE 64  /* bytes */
 #define RX_RING_SIZE   512 /* must be power of 2 */
 #define RX_RING_BYTES  (sizeof(struct macb_dma_desc) * RX_RING_SIZE)
 
-#define TX_RING_SIZE   128 /* must be power of 2 */
+#define TX_RING_SIZE   512 /* must be power of 2 */
 #define TX_RING_BYTES  (sizeof(struct macb_dma_desc) * TX_RING_SIZE)
 
 /* level of occupied TX descriptors under which we wake up TX process */
-- 
1.9.1



Re: [PATCH] phy: request shared IRQ

2016-08-24 Thread Sergei Shtylyov

Hello.

On 08/24/2016 08:53 PM, Xander Huff wrote:


From: Nathan Sullivan 

On hardware with multiple PHY devices hooked up to the same IRQ line, allow
them to share it.


   Note that it had been allowed until my (erroneous?) commit 
33c133cc7598e60976a069344910d63e56cc4401 ("phy: IRQ cannot be shared"), so I'd 
like this commit just reverted instead...
   I'm not sure now what was the reason I concluded that the IRQ sharing was 
impossible... most probably I thought that the kernel IRQ handling code exited 
the loop over the IRQ actions once IRQ_HANDLED was returned -- which is 
obviously not so in reality...



Signed-off-by: Nathan Sullivan 
Signed-off-by: Xander Huff 
Acked-by: Ben Shelton 
Acked-by: Jaeden Amero 

[...]

MBR, Sergei



[PATCH iproute] iptuntap: show processes using tuntap interface

2016-08-24 Thread Hannes Frederic Sowa
Show which processes are using which tun/tap devices, e.g.:

$ ip -d tuntap
tun0: tun
Attached to processes: vpnc(9531)
vnet0: tap vnet_hdr
Attached to processes: qemu-system-x86(10442)
virbr0-nic: tap UNKNOWN_FLAGS:800
Attached to processes:

Signed-off-by: Hannes Frederic Sowa 
---
 ip/iptuntap.c | 109 ++
 1 file changed, 109 insertions(+)

diff --git a/ip/iptuntap.c b/ip/iptuntap.c
index 43774f96e335ef..b5aa0542c1f8f2 100644
--- a/ip/iptuntap.c
+++ b/ip/iptuntap.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "rt_names.h"
 #include "utils.h"
@@ -273,6 +274,109 @@ static void print_flags(long flags)
printf(" UNKNOWN_FLAGS:%lx", flags);
 }
 
+static char *pid_name(pid_t pid)
+{
+   char *comm;
+   FILE *f;
+   int err;
+
+   err = asprintf(&comm, "/proc/%d/comm", pid);
+   if (err < 0)
+   return NULL;
+
+   f = fopen(comm, "r");
+   free(comm);
+   if (!f) {
+   perror("fopen");
+   return NULL;
+   }
+
+   if (fscanf(f, "%ms\n", &comm) != 1) {
+   perror("fscanf");
+   comm = NULL;
+   }
+
+
+   if (fclose(f))
+   perror("fclose");
+
+   return comm;
+}
+
+static void show_processes(const char *name)
+{
+   glob_t globbuf = { };
+   char **fd_path;
+   int err;
+
+   err = glob("/proc/[0-9]*/fd/[0-9]*", GLOB_NOSORT,
+  NULL, &globbuf);
+   if (err)
+   return;
+
+   fd_path = globbuf.gl_pathv;
+   while (*fd_path) {
+   const char *dev_net_tun = "/dev/net/tun";
+   const size_t linkbuf_len = strlen(dev_net_tun) + 2;
+   char linkbuf[linkbuf_len], *fdinfo;
+   int pid, fd;
+   FILE *f;
+
+   if (sscanf(*fd_path, "/proc/%d/fd/%d", &pid, &fd) != 2)
+   goto next;
+
+   if (pid == getpid())
+   goto next;
+
+   err = readlink(*fd_path, linkbuf, linkbuf_len - 1);
+   if (err < 0) {
+   perror("readlink");
+   goto next;
+   }
+   linkbuf[err] = '\0';
+   if (strcmp(dev_net_tun, linkbuf))
+   goto next;
+
+   if (asprintf(&fdinfo, "/proc/%d/fdinfo/%d", pid, fd) < 0)
+   goto next;
+
+   f = fopen(fdinfo, "r");
+   free(fdinfo);
+   if (!f) {
+   perror("fopen");
+   goto next;
+   }
+
+   while (!feof(f)) {
+   char *key = NULL, *value = NULL;
+
+   err = fscanf(f, "%m[^:]: %ms\n", &key, &value);
+   if (err == EOF) {
+   if (ferror(f))
+   perror("fscanf");
+   break;
+   } else if (err == 2 &&
+  !strcmp("iff", key) && !strcmp(name, value)) 
{
+   char *pname = pid_name(pid);
+   printf(" %s(%d)", pname ? pname : "", 
pid);
+   free(pname);
+   }
+
+   free(key);
+   free(value);
+   }
+   if (fclose(f))
+   perror("fclose");
+
+next:
+   ++fd_path;
+   }
+
+   globfree(&globbuf);
+   return;
+}
+
+
 static int do_show(int argc, char **argv)
 {
DIR *dir;
@@ -302,6 +406,11 @@ static int do_show(int argc, char **argv)
if (group != -1)
printf(" group %ld", group);
printf("\n");
+   if (show_details) {
+   printf("\tAttached to processes:");
+   show_processes(d->d_name);
+   printf("\n");
+   }
}
closedir(dir);
return 0;
-- 
2.7.4



Re: [PATCH net-next V2 1/4] net/ip_tunnels: Introduce tunnel_id_to_key32() and key32_to_tunnel_id()

2016-08-24 Thread Shmulik Ladkani
On Wed, 24 Aug 2016 15:27:07 +0300 Amir Vadai  wrote:
> Add utility functions to convert a 32 bits key into a 64 bits tunnel and
> vice versa.
> These functions will be used instead of cloning code in GRE and VXLAN,
> and in tc act_iptunnel which will be introduced in a following patch in
> this patchset.
> 
> Signed-off-by: Amir Vadai 

Reviewed-by: Shmulik Ladkani 


Re: [net-next PATCH 05/15] net: sched: a dflt qdisc may be used with per cpu stats

2016-08-24 Thread John Fastabend
On 16-08-24 09:41 AM, Eric Dumazet wrote:
> On Tue, 2016-08-23 at 13:24 -0700, John Fastabend wrote:
>> Enable dflt qdisc support for per cpu stats before this patch a
>> dflt qdisc was required to use the global statistics qstats and
>> bstats.
>>
>> Signed-off-by: John Fastabend 
>> ---
>>  net/sched/sch_generic.c |   24 
>>  1 file changed, 20 insertions(+), 4 deletions(-)
>>
>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>> index 80544c2..910b4d15 100644
>> --- a/net/sched/sch_generic.c
>> +++ b/net/sched/sch_generic.c
>> @@ -646,18 +646,34 @@ struct Qdisc *qdisc_create_dflt(struct netdev_queue 
>> *dev_queue,
>>  struct Qdisc *sch;
>>  
>>  if (!try_module_get(ops->owner))
>> -goto errout;
>> +return NULL;
>>  
>>  sch = qdisc_alloc(dev_queue, ops);
>>  if (IS_ERR(sch))
>> -goto errout;
>> +return NULL;
>>  sch->parent = parentid;
>>  
>> -if (!ops->init || ops->init(sch, NULL) == 0)
>> +if (!ops->init)
>>  return sch;
>>  
>> -qdisc_destroy(sch);
>> +if (ops->init(sch, NULL))
>> +goto errout;
>> +
>> +/* init() may have set percpu flags so init data structures */
>> +if (qdisc_is_percpu_stats(sch)) {
>> +sch->cpu_bstats =
>> +netdev_alloc_pcpu_stats(struct gnet_stats_basic_cpu);
>> +if (!sch->cpu_bstats)
>> +goto errout;
>> +
>> +sch->cpu_qstats = alloc_percpu(struct gnet_stats_queue);
>> +if (!sch->cpu_qstats)
>> +goto errout;
>> +}
>> +
> 
> Why are you attempting these allocations here instead of qdisc_alloc()
> 
> This looks weird, I would expect base qdisc being fully allocated before
> ops->init() is attempted.
> 
> 
> 

I could fully allocate it in qdisc_alloc() but we don't know if the
qdisc needs per cpu data structures until after the init call. So it
would sit unused in those cases if done from qdisc_alloc(). It seems
best to me at least to just avoid the allocation in qdisc_alloc() and
do it after init like I did here.

Perhaps it would be nice to pull these into a function call
post_init_qdisc_alloc() that does all this allocation?

.John



Re: [PATCH net-next] net: minor optimization in qdisc_qstats_cpu_drop()

2016-08-24 Thread Rick Jones

On 08/24/2016 10:23 AM, Eric Dumazet wrote:

From: Eric Dumazet 

per_cpu_inc() is faster (at least on x86) than per_cpu_ptr(xxx)++;


Is it possible it is non-trivially slower on other architectures?

rick jones



Signed-off-by: Eric Dumazet 
---
 include/net/sch_generic.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 
0d501779cc68f9426e58da6d039dd64adc937c20..52a2015667b49c8315edbb26513a98d4c677fee5
 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -592,7 +592,7 @@ static inline void qdisc_qstats_drop(struct Qdisc *sch)

 static inline void qdisc_qstats_cpu_drop(struct Qdisc *sch)
 {
-   qstats_drop_inc(this_cpu_ptr(sch->cpu_qstats));
+   this_cpu_inc(sch->cpu_qstats->drops);
 }

 static inline void qdisc_qstats_overlimit(struct Qdisc *sch)





[PATCH net] net: dsa: bcm_sf2: Fix race condition while unmasking interrupts

2016-08-24 Thread Florian Fainelli
We kept shadow copies of which interrupt sources we have enabled and
disabled, but due to an order bug in how intrl2_mask_clear was defined,
we could run into the following scenario:

CPU0                                    CPU1
intrl2_1_mask_clear(..)
sets INTRL2_CPU_MASK_CLEAR
                                        bcm_sf2_switch_1_isr
                                        read INTRL2_CPU_STATUS and masks
                                        with stale irq1_mask value
updates irq1_mask value

Which would make us loop again and again trying to process an interrupt
we are not clearing, since our copy of whether it was enabled before
still indicates it was not. Fix this by updating the shadow copy first,
and then unmasking at the HW level.

Fixes: 246d7f773c13 ("net: dsa: add Broadcom SF2 switch driver")
Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/bcm_sf2.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/dsa/bcm_sf2.h b/drivers/net/dsa/bcm_sf2.h
index 463bed8cbe4c..dd446e466699 100644
--- a/drivers/net/dsa/bcm_sf2.h
+++ b/drivers/net/dsa/bcm_sf2.h
@@ -205,8 +205,8 @@ static inline void name##_writeq(struct bcm_sf2_priv *priv, 
u64 val,\
 static inline void intrl2_##which##_mask_clear(struct bcm_sf2_priv *priv, \
u32 mask)   \
 {  \
-   intrl2_##which##_writel(priv, mask, INTRL2_CPU_MASK_CLEAR); \
priv->irq##which##_mask &= ~(mask); \
+   intrl2_##which##_writel(priv, mask, INTRL2_CPU_MASK_CLEAR); \
 }  \
 static inline void intrl2_##which##_mask_set(struct bcm_sf2_priv *priv, \
u32 mask)   \
-- 
2.7.4



Re: [PATCH net-next] tcp: md5: add LINUX_MIB_TCPMD5FAILURE counter

2016-08-24 Thread Stephen Hemminger
On Wed, 24 Aug 2016 09:01:23 -0700
Eric Dumazet  wrote:

> From: Eric Dumazet 
> 
> Adds SNMP counter for drops caused by MD5 mismatches.
> 
> The current syslog might help, but a counter is more precise and helps
> monitoring.
> 
> Signed-off-by: Eric Dumazet 
> ---
>  include/uapi/linux/snmp.h |1 +
>  net/ipv4/proc.c   |1 +
>  net/ipv4/tcp_ipv4.c   |1 +
>  net/ipv6/tcp_ipv6.c   |1 +
>  4 files changed, 4 insertions(+)
> 
> diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
> index 
> 25a9ad8bcef1240915f2553a8acade447186d869..e7a31f8306903f53bc5881ae4c271f85cad2e361
>  100644
> --- a/include/uapi/linux/snmp.h
> +++ b/include/uapi/linux/snmp.h
> @@ -235,6 +235,7 @@ enum
>   LINUX_MIB_TCPSPURIOUSRTOS,  /* TCPSpuriousRTOs */
>   LINUX_MIB_TCPMD5NOTFOUND,   /* TCPMD5NotFound */
>   LINUX_MIB_TCPMD5UNEXPECTED, /* TCPMD5Unexpected */
> + LINUX_MIB_TCPMD5FAILURE,/* TCPMD5Failure */
>   LINUX_MIB_SACKSHIFTED,
>   LINUX_MIB_SACKMERGED,
>   LINUX_MIB_SACKSHIFTFALLBACK,

You can't add value in middle of user API enum without breaking
binary compatibility.


Re: [PATCH net-next] net: minor optimization in qdisc_qstats_cpu_drop()

2016-08-24 Thread John Fastabend
On 16-08-24 10:23 AM, Eric Dumazet wrote:
> From: Eric Dumazet 
> 
> per_cpu_inc() is faster (at least on x86) than per_cpu_ptr(xxx)++;
> 
> Signed-off-by: Eric Dumazet 
> ---
>  include/net/sch_generic.h |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 
> 0d501779cc68f9426e58da6d039dd64adc937c20..52a2015667b49c8315edbb26513a98d4c677fee5
>  100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -592,7 +592,7 @@ static inline void qdisc_qstats_drop(struct Qdisc *sch)
>  
>  static inline void qdisc_qstats_cpu_drop(struct Qdisc *sch)
>  {
> - qstats_drop_inc(this_cpu_ptr(sch->cpu_qstats));
> + this_cpu_inc(sch->cpu_qstats->drops);
>  }
>  
>  static inline void qdisc_qstats_overlimit(struct Qdisc *sch)
> 
> 

Looks good to me. I guess we can also do the same for overlimit qstats.

Acked-by: John Fastabend 


[PATCH] phy: request shared IRQ

2016-08-24 Thread Xander Huff
From: Nathan Sullivan 

On hardware with multiple PHY devices hooked up to the same IRQ line, allow
them to share it.

Signed-off-by: Nathan Sullivan 
Signed-off-by: Xander Huff 
Acked-by: Ben Shelton 
Acked-by: Jaeden Amero 
---
 drivers/net/phy/phy.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index c5dc2c36..0050531 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -722,8 +722,8 @@ phy_err:
 int phy_start_interrupts(struct phy_device *phydev)
 {
atomic_set(&phydev->irq_disable, 0);
-   if (request_irq(phydev->irq, phy_interrupt, 0, "phy_interrupt",
-   phydev) < 0) {
+   if (request_irq(phydev->irq, phy_interrupt, IRQF_SHARED,
+   "phy_interrupt", phydev) < 0) {
pr_warn("%s: Can't get IRQ %d (PHY)\n",
phydev->mdio.bus->name, phydev->irq);
phydev->irq = PHY_POLL;
-- 
1.9.1



Re: [PATCH net-next] tcp: md5: add LINUX_MIB_TCPMD5FAILURE counter

2016-08-24 Thread Eric Dumazet
On Wed, 2016-08-24 at 10:35 -0700, Stephen Hemminger wrote:

> You can't add value in middle of user API enum without breaking
> binary compatibility.

There is no binary compatibility here.

/proc/net/netstat is a text file with a defined format.

First line contains the headers.

If 'binary compatibility' was an issue, we would not have added anything
in this file.

Programs need to be able to properly parse these TcpExt: lines.
nstat is doing the right thing.

I could put LINUX_MIB_TCPMD5FAILURE at the end, but 'nstat' would have
these MD5 counters in different places.

So for the few people (i.e. not programs) looking at nstat, it seems
better to place this MIB at this point.





Re: [net-next PATCH 05/15] net: sched: a dflt qdisc may be used with per cpu stats

2016-08-24 Thread John Fastabend
On 16-08-24 10:26 AM, Eric Dumazet wrote:
> On Wed, 2016-08-24 at 10:13 -0700, John Fastabend wrote:
> 
>>>
>>
>> I could fully allocate it in qdisc_alloc() but we don't know if the
>> qdisc needs per cpu data structures until after the init call
> 
> Shouldn't we have a flag to advertise the need of per cpu stats on
> a qdisc?
> 
> It is not clear why ->init() can know this, and not its caller.
> 

Sure, we could add a static_flags field in the ops structure. What do
you think about doing that?

We would still need some flags to be set at init though, like the bypass
bit; it looks like some qdiscs set that based on user input.


>> . So it
>> would sit unused in those cases if done from qdisc_alloc(). It seems
>> best to me at least to just avoid the allocation in qdisc_alloc() and
>> do it after init like I did here.
>>
>> Perhaps it would be nice to pull these into a function call
>> post_init_qdisc_alloc() that does all this allocation?
>>
>> .John
>>
> 
> 



Re: [PATCH net-next 2/3] net: mpls: Fixups for GSO

2016-08-24 Thread pravin shelar
On Wed, Aug 24, 2016 at 9:37 AM, David Ahern  wrote:
> On 8/24/16 10:28 AM, pravin shelar wrote:
>>> How do you feel about implementing the do_output() idea I suggested above?
>>> I'm happy to provide testing and review.
>>
>> I am not sure about changing do_output(). why not just use same scheme
>> to track mpls header in OVS datapath as done in mpls device?
>>
>
> was just replying with the same.
>
> Something like this should be able to handle multiple labels. The inner 
> network header is set once and the outer one pointing to MPLS is adjusted 
> each time a label is pushed:
>
> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> index 1ecbd7715f6d..0f37b17e3a73 100644
> --- a/net/openvswitch/actions.c
> +++ b/net/openvswitch/actions.c
> @@ -162,10 +162,16 @@ static int push_mpls(struct sk_buff *skb, struct 
> sw_flow_key *key,
> if (skb_cow_head(skb, MPLS_HLEN) < 0)
> return -ENOMEM;
>
> +   if (!skb->inner_protocol) {
> +   skb_set_inner_network_header(skb, skb->mac_len);
> +   skb_set_inner_protocol(skb, skb->protocol);
> +   }
> +
> skb_push(skb, MPLS_HLEN);
> memmove(skb_mac_header(skb) - MPLS_HLEN, skb_mac_header(skb),
> skb->mac_len);
> skb_reset_mac_header(skb);
> +   skb_set_network_header(skb, skb->mac_len);
>
> new_mpls_lse = (__be32 *)skb_mpls_header(skb);
> *new_mpls_lse = mpls->mpls_lse;
> @@ -173,8 +179,7 @@ static int push_mpls(struct sk_buff *skb, struct 
> sw_flow_key *key,
> skb_postpush_rcsum(skb, new_mpls_lse, MPLS_HLEN);
>
> update_ethertype(skb, eth_hdr(skb), mpls->mpls_ethertype);
> -   if (!skb->inner_protocol)
> -   skb_set_inner_protocol(skb, skb->protocol);
> +
> skb->protocol = mpls->mpls_ethertype;
>
> invalidate_flow_key(key);
>
>
>
>
> If it does, what else needs to be changed in OVS to handle the network layer 
> now pointing to the MPLS labels?
>
You also need to change pop_mpls().

Anyways I was thinking about the neigh output functions skb pull
issue, where it is using network-header offset. Can we use mac_len?
this way we would not use any inner offsets for MPLS skb and current
scheme used by OVS datapath works.


Re: [PATCH net-next v1] gso: Support partial splitting at the frag_list pointer

2016-08-24 Thread Marcelo Ricardo Leitner

Em 24-08-2016 13:27, Alexander Duyck escreveu:

On Wed, Aug 24, 2016 at 2:32 AM, Steffen Klassert
 wrote:

On Tue, Aug 23, 2016 at 07:47:32AM -0700, Alexander Duyck wrote:

On Mon, Aug 22, 2016 at 10:20 PM, Steffen Klassert
 wrote:

Since commit 8a29111c7 ("net: gro: allow to build full sized skb")
gro may build buffers with a frag_list. This can hurt forwarding
because most NICs can't offload such packets, they need to be
segmented in software. This patch splits buffers with a frag_list
at the frag_list pointer into buffers that can be TSO offloaded.

Signed-off-by: Steffen Klassert 
---
 net/core/skbuff.c  | 89 +-
 net/ipv4/af_inet.c |  7 ++--
 net/ipv4/gre_offload.c |  7 +++-
 net/ipv4/tcp_offload.c |  3 ++
 net/ipv4/udp_offload.c |  9 +++--
 net/ipv6/ip6_offload.c |  6 +++-
 6 files changed, 114 insertions(+), 7 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3864b4b6..a614e9d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3078,6 +3078,92 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
sg = !!(features & NETIF_F_SG);
csum = !!can_checksum_protocol(features, proto);

+   headroom = skb_headroom(head_skb);
+
+   if (list_skb && net_gso_ok(features, skb_shinfo(head_skb)->gso_type) &&
+   csum && sg && (mss != GSO_BY_FRAGS) &&
+   !(features & NETIF_F_GSO_PARTIAL)) {


Does this really need to be mutually exclusive with
NETIF_F_GSO_PARTIAL and GSO_BY_FRAGS?


It should be possible to extend this to NETIF_F_GSO_PARTIAL but
I have no test for this. Regarding GSO_BY_FRAGS, this is rather
new and just used for sctp. I don't know what sctp does with
GSO_BY_FRAGS.


I'm adding Marcelo as he could probably explain the GSO_BY_FRAGS
functionality better than I could since he is the original author.

If I recall GSO_BY_FRAGS does something similar to what you are doing,
although I believe it doesn't carry any data in the first buffer other
than just a header.  I believe the idea behind GSO_BY_FRAGS was to
allow for segmenting a frame at the frag_list level instead of having
it done just based on MSS.  That was the only reason why I brought it
up.



That's exactly it.

On this no-data-in-the-first-buffer limitation, we probably can allow it 
to have some data in there. It was done this way just because sctp is using 
skb_gro_receive() to build such skbs and this was the way I found to get 
such frag_list skbs generated by it, thus preserving frame boundaries.


Using GSO_BY_FRAGS in gso_size is how skb_is_gso() returns true, 
but it's similar to the SKB_GSO_PARTIAL rationale here. We can make 
sctp also flag it as SKB_GSO_PARTIAL if needed, I guess, in case you need 
to maintain the gso_size value.


  Marcelo


In your case though we may be able to make this easier.  If I am not
mistaken, the main skb and all in the chain excluding the last should
contain the same amount of data.  That being the case, we should be
able to determine the size you would need to segment at by taking
skb->len and removing the length of all the skbuffs hanging off of
frag_list.  At that point you just use that as your MSS for
segmentation and it should break things up so that you have a series
of equal sized segments split at the frag_list buffer boundaries.

After that all that is left is to update the gso info for the buffers.
For GSO_PARTIAL I was handling that on the first segment only.  For
this change you would need to update that code to address the fact
that you would have to determine the number of segments on the first
frame and the last, since the last could be less than the first, but
all of the others in-between should have the same number of segments.

- Alex



[PATCH -next] ibmvnic: convert to use simple_open()

2016-08-24 Thread Wei Yongjun
From: Wei Yongjun 

Remove an open coded simple_open() function and replace file
operations references to the function with simple_open()
instead.

Generated by: scripts/coccinelle/api/simple_open.cocci

Signed-off-by: Wei Yongjun 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 18 ++
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index b942108..e862530 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -2779,12 +2779,6 @@ static void handle_control_ras_rsp(union ibmvnic_crq 
*crq,
}
 }
 
-static int ibmvnic_fw_comp_open(struct inode *inode, struct file *file)
-{
-   file->private_data = inode->i_private;
-   return 0;
-}
-
 static ssize_t trace_read(struct file *file, char __user *user_buf, size_t len,
  loff_t *ppos)
 {
@@ -2836,7 +2830,7 @@ static ssize_t trace_read(struct file *file, char __user 
*user_buf, size_t len,
 
 static const struct file_operations trace_ops = {
.owner  = THIS_MODULE,
-   .open   = ibmvnic_fw_comp_open,
+   .open   = simple_open,
.read   = trace_read,
 };
 
@@ -2886,7 +2880,7 @@ static ssize_t paused_write(struct file *file, const char 
__user *user_buf,
 
 static const struct file_operations paused_ops = {
.owner  = THIS_MODULE,
-   .open   = ibmvnic_fw_comp_open,
+   .open   = simple_open,
.read   = paused_read,
.write  = paused_write,
 };
@@ -2934,7 +2928,7 @@ static ssize_t tracing_write(struct file *file, const 
char __user *user_buf,
 
 static const struct file_operations tracing_ops = {
.owner  = THIS_MODULE,
-   .open   = ibmvnic_fw_comp_open,
+   .open   = simple_open,
.read   = tracing_read,
.write  = tracing_write,
 };
@@ -2987,7 +2981,7 @@ static ssize_t error_level_write(struct file *file, const 
char __user *user_buf,
 
 static const struct file_operations error_level_ops = {
.owner  = THIS_MODULE,
-   .open   = ibmvnic_fw_comp_open,
+   .open   = simple_open,
.read   = error_level_read,
.write  = error_level_write,
 };
@@ -3038,7 +3032,7 @@ static ssize_t trace_level_write(struct file *file, const 
char __user *user_buf,
 
 static const struct file_operations trace_level_ops = {
.owner  = THIS_MODULE,
-   .open   = ibmvnic_fw_comp_open,
+   .open   = simple_open,
.read   = trace_level_read,
.write  = trace_level_write,
 };
@@ -3091,7 +3085,7 @@ static ssize_t trace_buff_size_write(struct file *file,
 
 static const struct file_operations trace_size_ops = {
.owner  = THIS_MODULE,
-   .open   = ibmvnic_fw_comp_open,
+   .open   = simple_open,
.read   = trace_buff_size_read,
.write  = trace_buff_size_write,
 };





Re: [net-next PATCH 05/15] net: sched: a dflt qdisc may be used with per cpu stats

2016-08-24 Thread Eric Dumazet
On Wed, 2016-08-24 at 10:13 -0700, John Fastabend wrote:

> > 
> 
> I could fully allocate it in qdisc_alloc() but we don't know if the
> qdisc needs per cpu data structures until after the init call

Shouldn't we have a flag to advertise the need of per cpu stats on
a qdisc?

It is not clear why ->init() can know this, and not its caller.

> . So it
> would sit unused in those cases if done from qdisc_alloc(). It seems
> best to me at least to just avoid the allocation in qdisc_alloc() and
> do it after init like I did here.
> 
> Perhaps it would be nice to pull these into a function call
> post_init_qdisc_alloc() that does all this allocation?
> 
> .John
> 




Re: [PATCH net] qdisc: fix a module refcount leak in qdisc_create_dflt()

2016-08-24 Thread John Fastabend
On 16-08-24 09:39 AM, Eric Dumazet wrote:
> From: Eric Dumazet 
> 
> Should qdisc_alloc() fail, we must release the module refcount
> we got right before.
> 
> Fixes: 6da7c8fcbcbd ("qdisc: allow setting default queuing discipline")
> Signed-off-by: Eric Dumazet 
> ---
>  net/sched/sch_generic.c |9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index e95b67cd5718..657c13362b19 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -643,18 +643,19 @@ struct Qdisc *qdisc_create_dflt(struct netdev_queue 
> *dev_queue,
>   struct Qdisc *sch;
>  
>   if (!try_module_get(ops->owner))
> - goto errout;
> + return NULL;
>  
>   sch = qdisc_alloc(dev_queue, ops);
> - if (IS_ERR(sch))
> - goto errout;
> + if (IS_ERR(sch)) {
> + module_put(ops->owner);
> + return NULL;
> + }
>   sch->parent = parentid;
>  
>   if (!ops->init || ops->init(sch, NULL) == 0)
>   return sch;
>  
>   qdisc_destroy(sch);
> -errout:
>   return NULL;
>  }
>  EXPORT_SYMBOL(qdisc_create_dflt);
> 
> 

Thanks!

Acked-by: John Fastabend 


[PATCH net-next] net: minor optimization in qdisc_qstats_cpu_drop()

2016-08-24 Thread Eric Dumazet
From: Eric Dumazet 

per_cpu_inc() is faster (at least on x86) than per_cpu_ptr(xxx)++;

Signed-off-by: Eric Dumazet 
---
 include/net/sch_generic.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 
0d501779cc68f9426e58da6d039dd64adc937c20..52a2015667b49c8315edbb26513a98d4c677fee5
 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -592,7 +592,7 @@ static inline void qdisc_qstats_drop(struct Qdisc *sch)
 
 static inline void qdisc_qstats_cpu_drop(struct Qdisc *sch)
 {
-   qstats_drop_inc(this_cpu_ptr(sch->cpu_qstats));
+   this_cpu_inc(sch->cpu_qstats->drops);
 }
 
 static inline void qdisc_qstats_overlimit(struct Qdisc *sch)




Re: [PATCH 0/4] net: phy: Register header file for Microsemi PHYs.

2016-08-24 Thread Florian Fainelli
On 08/24/2016 04:58 AM, Raju Lakkaraju wrote:
> From: Nagaraju Lakkaraju 
> 
> This is Microsemi's VSC 85xx PHY register definitions header file.

Please keep these register definitions local to the code using them
unless they are shared between multiple drivers.
-- 
Florian


Re: [PATCH 3/3] net: fs_enet: make rx_copybreak value configurable

2016-08-24 Thread Florian Fainelli
On 08/24/2016 03:36 AM, Christophe Leroy wrote:
> Measurement shows that on a MPC8xx running at 132MHz, the optimal
> limit is 112:
> * 114 bytes packets are processed in 147 TB ticks with higher copybreak
> * 114 bytes packets are processed in 148 TB ticks with lower copybreak
> * 128 bytes packets are processed in 154 TB ticks with higher copybreak
> * 128 bytes packets are processed in 148 TB ticks with lower copybreak
> * 238 bytes packets are processed in 172 TB ticks with higher copybreak
> * 238 bytes packets are processed in 148 TB ticks with lower copybreak
> 
> However it might be different on other processors
> and/or frequencies. So it is useful to make it configurable.
> 
> Signed-off-by: Christophe Leroy 
> ---
>  drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c | 8 +---
>  include/linux/fs_enet_pd.h| 1 -
>  2 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c 
> b/drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c
> index addcae6..b59bbf8 100644
> --- a/drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c
> +++ b/drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c
> @@ -60,6 +60,10 @@ module_param(fs_enet_debug, int, 0);
>  MODULE_PARM_DESC(fs_enet_debug,
>"Freescale bitmapped debugging message enable value");
>  
> +static int rx_copybreak = 240;
> +module_param(rx_copybreak, int, S_IRUGO | S_IWUSR);
> +MODULE_PARM_DESC(rx_copybreak, "Receive copy threshold");

There is an ethtool tunable knob for copybreak now, which you should
prefer over a module parameter, see
drivers/net/ethernet/cisco/enic/enic_ethtool.c
-- 
Florian




Re: [PATCH net-next 0/2] rxrpc: More fixes

2016-08-24 Thread David Miller
From: David Howells 
Date: Wed, 24 Aug 2016 15:59:46 +0100

> Tagged thusly:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
>   rxrpc-rewrite-20160824-1

Both -1 and -2 pulled, thanks David!


[PATCH net] qdisc: fix a module refcount leak in qdisc_create_dflt()

2016-08-24 Thread Eric Dumazet
From: Eric Dumazet 

Should qdisc_alloc() fail, we must release the module refcount
we got right before.

Fixes: 6da7c8fcbcbd ("qdisc: allow setting default queuing discipline")
Signed-off-by: Eric Dumazet 
---
 net/sched/sch_generic.c |9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index e95b67cd5718..657c13362b19 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -643,18 +643,19 @@ struct Qdisc *qdisc_create_dflt(struct netdev_queue *dev_queue,
struct Qdisc *sch;
 
if (!try_module_get(ops->owner))
-   goto errout;
+   return NULL;
 
sch = qdisc_alloc(dev_queue, ops);
-   if (IS_ERR(sch))
-   goto errout;
+   if (IS_ERR(sch)) {
+   module_put(ops->owner);
+   return NULL;
+   }
sch->parent = parentid;
 
if (!ops->init || ops->init(sch, NULL) == 0)
return sch;
 
qdisc_destroy(sch);
-errout:
return NULL;
 }
 EXPORT_SYMBOL(qdisc_create_dflt);




Re: [for-next 00/15][PULL request] Mellanox mlx5 core driver updates 2016-08-24

2016-08-24 Thread David Miller
From: Saeed Mahameed 
Date: Wed, 24 Aug 2016 13:38:59 +0300

> This series contains some low level and API updates for mlx5 core
> driver interface and mlx5_ifc.h, plus mlx5 LAG core driver support,
> to be shared as base code for net-next and rdma mlx5 4.9 submissions.

Pulled, thanks.


Re: [patch net 0/2] mlxsw: couple of fixes

2016-08-24 Thread David Miller
From: Jiri Pirko 
Date: Wed, 24 Aug 2016 11:18:50 +0200

> Ido Schimmel (1):
>   mlxsw: spectrum: Add missing flood to router port
> 
> Yotam Gigi (1):
>   mlxsw: router: Enable neighbors to be created on stacked devices

Both applied, thanks Jiri.


Re: [PATCH net-next] bnx2x: Don't flush multicast MACs

2016-08-24 Thread David Miller
From: Yuval Mintz 
Date: Wed, 24 Aug 2016 13:27:19 +0300

> When ndo_set_rx_mode() is called for bnx2x, as part of the process of
> configuring the new MAC address filters [both unicast & multicast]
> driver begins by flushing the existing configuration and then iterating
> over the network device's list of addresses and configures those instead.
> 
> This has the side-effect of creating a short gap where traffic wouldn't
> be properly classified, as no filters are configured in HW.
> While for unicasts this is rather insignificant [as unicast MACs don't
> frequently change while interface is actually running],
> for multicast traffic it does pose an issue as there are multicast-based
> networks where new multicast groups would constantly be removed and
> added.
> 
> This patch tries to remedy this [at least for the newer adapters] -
> Instead of flushing & reconfiguring all existing multicast filters,
> the driver would instead create the approximate hash match that would
> result from the required filters. It would then compare it against the
> currently configured approximate hash match, and only add and remove the
> delta between those.
> 
> Signed-off-by: Yuval Mintz 

Applied.
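
The add/remove-delta idea in the commit message can be sketched in a few lines. This is a toy model only: the bin count and hash function (CRC32 here) are invented for illustration and are not the bnx2x hardware's approximate-match function:

```python
import zlib

BIN_COUNT = 256  # illustrative bin count, not the hardware value

def hash_bins(macs):
    """Map each MAC string to an approximate-match bin (toy CRC32 hash)."""
    return {zlib.crc32(mac.encode()) % BIN_COUNT for mac in macs}

def mcast_delta(cur_macs, new_macs):
    """Return (bins_to_add, bins_to_remove) so only the difference is
    reprogrammed, avoiding a window in which no filters are configured."""
    cur, new = hash_bins(cur_macs), hash_bins(new_macs)
    return new - cur, cur - new
```

When the multicast membership is unchanged both deltas are empty, so nothing is flushed and classification never pauses.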


Re: [PATCH net-next 2/3] net: mpls: Fixups for GSO

2016-08-24 Thread David Ahern
On 8/24/16 10:28 AM, pravin shelar wrote:
>> How do you feel about implementing the do_output() idea I suggested above?
>> I'm happy to provide testing and review.
> 
> I am not sure about changing do_output(). why not just use same scheme
> to track mpls header in OVS datapath as done in mpls device?
> 

I was just replying with the same.

Something like this should be able to handle multiple labels. The inner network 
header is set once and the outer one pointing to MPLS is adjusted each time a 
label is pushed:

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 1ecbd7715f6d..0f37b17e3a73 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -162,10 +162,16 @@ static int push_mpls(struct sk_buff *skb, struct sw_flow_key *key,
if (skb_cow_head(skb, MPLS_HLEN) < 0)
return -ENOMEM;

+   if (!skb->inner_protocol) {
+   skb_set_inner_network_header(skb, skb->mac_len);
+   skb_set_inner_protocol(skb, skb->protocol);
+   }
+
skb_push(skb, MPLS_HLEN);
memmove(skb_mac_header(skb) - MPLS_HLEN, skb_mac_header(skb),
skb->mac_len);
skb_reset_mac_header(skb);
+   skb_set_network_header(skb, skb->mac_len);

new_mpls_lse = (__be32 *)skb_mpls_header(skb);
*new_mpls_lse = mpls->mpls_lse;
@@ -173,8 +179,7 @@ static int push_mpls(struct sk_buff *skb, struct sw_flow_key *key,
skb_postpush_rcsum(skb, new_mpls_lse, MPLS_HLEN);

update_ethertype(skb, eth_hdr(skb), mpls->mpls_ethertype);
-   if (!skb->inner_protocol)
-   skb_set_inner_protocol(skb, skb->protocol);
+
skb->protocol = mpls->mpls_ethertype;

invalidate_flow_key(key);




If it does, what else needs to be changed in OVS to handle the network layer 
now pointing to the MPLS labels?



Re: [net-next PATCH 05/15] net: sched: a dflt qdisc may be used with per cpu stats

2016-08-24 Thread Eric Dumazet
On Tue, 2016-08-23 at 13:24 -0700, John Fastabend wrote:
> Enable dflt qdisc support for per cpu stats. Before this patch a
> dflt qdisc was required to use the global statistics qstats and
> bstats.
> 
> Signed-off-by: John Fastabend 
> ---
>  net/sched/sch_generic.c |   24 
>  1 file changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 80544c2..910b4d15 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -646,18 +646,34 @@ struct Qdisc *qdisc_create_dflt(struct netdev_queue *dev_queue,
>   struct Qdisc *sch;
>  
>   if (!try_module_get(ops->owner))
> - goto errout;
> + return NULL;
>  
>   sch = qdisc_alloc(dev_queue, ops);
>   if (IS_ERR(sch))
> - goto errout;
> + return NULL;
>   sch->parent = parentid;
>  
> - if (!ops->init || ops->init(sch, NULL) == 0)
> + if (!ops->init)
>   return sch;
>  
> - qdisc_destroy(sch);
> + if (ops->init(sch, NULL))
> + goto errout;
> +
> + /* init() may have set percpu flags so init data structures */
> + if (qdisc_is_percpu_stats(sch)) {
> + sch->cpu_bstats =
> + netdev_alloc_pcpu_stats(struct gnet_stats_basic_cpu);
> + if (!sch->cpu_bstats)
> + goto errout;
> +
> + sch->cpu_qstats = alloc_percpu(struct gnet_stats_queue);
> + if (!sch->cpu_qstats)
> + goto errout;
> + }
> +

Why are you attempting these allocations here instead of in qdisc_alloc()?

This looks weird, I would expect base qdisc being fully allocated before
ops->init() is attempted.





Re: [patch net-next 0/7] mlxsw: Offload FDB learning configuration

2016-08-24 Thread David Miller
From: Jiri Pirko 
Date: Wed, 24 Aug 2016 12:00:22 +0200

> From: Jiri Pirko 
> 
> Ido says:
> This patchset addresses two long standing issues in the mlxsw driver
> concerning FDB learning.
> 
> Patch 1 limits the number of FDB records processed by the driver in a
> single session. This is useful in situations in which many new records
> need to be processed, thereby causing the RTNL mutex to be held for
> long periods of time.
> 
> Patches 2-6 offload the learning configuration (on / off) of bridge
> ports to the device instead of having the driver decide whether a
> record needs to be learned or not.
> 
> The last patch is fallout and removes configuration no longer necessary
> after the first patches are applied.

Looks good, series applied, thanks!


Re: [PATCH net-next 2/3] net: mpls: Fixups for GSO

2016-08-24 Thread pravin shelar
On Wed, Aug 24, 2016 at 12:20 AM, Simon Horman
 wrote:
> Hi David,
>
> On Tue, Aug 23, 2016 at 01:24:51PM -0600, David Ahern wrote:
>> On 8/22/16 8:51 AM, Simon Horman wrote:
>> >
>> > The scheme that OvS uses so far is that mac_len denotes the number of bytes
>> > from the start of the MAC header until its end. In the absence of MPLS that
>> > will be the beginning of the network header. And in the presence of MPLS it
>> > will be the beginning of the MPLS label stack. The network header is... the
>> > network header. This allows the MAC header, MPLS label stack and network
>> > header to be tracked.
>>
>> The neigh output functions do '__skb_pull(skb, skb_network_offset(skb))' so 
>> if mpls_xmit does not reset the network header the labels get dropped. To me 
>> this says MPLS labels can not be lumped with the mac header which leaves the 
>> only option as the outer network header.
>>
>> >
>> > Pravin (CCed) may have different ideas but I wonder if the above scheme can
>> > be preserved while also meeting the needs of your new MPLS GSO scheme if
>> > you set skb_set_network_header() and skb_set_inner_network_header() in
>> > net/openvswitch/actions.c:do_output().
>> >
>> > It may also be possible to teach OvS to use skb_set_network_header to
>> > denote the beginning of the MPLS LSE and skb_set_inner_network_header to
>> > denote the network header in the presence of MPLS. Which is my current
>> > understanding of what you are trying to achieve. But I think its likely
>> > that I misunderstand things as it seems strange to me to pretend that an
>> > MPLS LSE is a network header and the outer most network header is an inner
>> > network header
>> >
>>
>> This is the only option I can see working, but open to patches showing an
>> alternative.
>
> On reflection I came to a similar conclusion.
>
>> I would like to get it resolved this week so I can move on to gso in the
>> mpls forward case.
>
> How do you feel about implementing the do_output() idea I suggested above?
> I'm happy to provide testing and review.

I am not sure about changing do_output(). why not just use same scheme
to track mpls header in OVS datapath as done in mpls device?


Re: [net-next PATCH 05/15] net: sched: a dflt qdisc may be used with per cpu stats

2016-08-24 Thread Eric Dumazet
On Tue, 2016-08-23 at 13:24 -0700, John Fastabend wrote:
> Enable dflt qdisc support for per cpu stats. Before this patch a
> dflt qdisc was required to use the global statistics qstats and
> bstats.
> 
> Signed-off-by: John Fastabend 
> ---
>  net/sched/sch_generic.c |   24 
>  1 file changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 80544c2..910b4d15 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -646,18 +646,34 @@ struct Qdisc *qdisc_create_dflt(struct netdev_queue *dev_queue,
>   struct Qdisc *sch;
>  
>   if (!try_module_get(ops->owner))
> - goto errout;
> + return NULL;
>  
>   sch = qdisc_alloc(dev_queue, ops);
>   if (IS_ERR(sch))
> - goto errout;
> + return NULL;
>   sch->parent = parentid;
>  
> - if (!ops->init || ops->init(sch, NULL) == 0)
> + if (!ops->init)
>   return sch;
>  
> - qdisc_destroy(sch);
> + if (ops->init(sch, NULL))
> + goto errout;
> +
> + /* init() may have set percpu flags so init data structures */
> + if (qdisc_is_percpu_stats(sch)) {
> + sch->cpu_bstats =
> + netdev_alloc_pcpu_stats(struct gnet_stats_basic_cpu);
> + if (!sch->cpu_bstats)
> + goto errout;
> +
> + sch->cpu_qstats = alloc_percpu(struct gnet_stats_queue);
> + if (!sch->cpu_qstats)
> + goto errout;
> + }
> +
> + return sch;
>  errout:
> + qdisc_destroy(sch);
>   return NULL;
>  }
>  EXPORT_SYMBOL(qdisc_create_dflt);
> 

Hmm... apparently we have a bug here, added in
6da7c8fcbcbdb50ec ("qdisc: allow setting default queuing discipline")

We do not undo the try_module_get() in case of an error.

I will send a fix.




Re: [PATCH net-next v1] gso: Support partial splitting at the frag_list pointer

2016-08-24 Thread Alexander Duyck
On Wed, Aug 24, 2016 at 2:32 AM, Steffen Klassert
 wrote:
> On Tue, Aug 23, 2016 at 07:47:32AM -0700, Alexander Duyck wrote:
>> On Mon, Aug 22, 2016 at 10:20 PM, Steffen Klassert
>>  wrote:
>> > Since commit 8a29111c7 ("net: gro: allow to build full sized skb")
>> > gro may build buffers with a frag_list. This can hurt forwarding
>> > because most NICs can't offload such packets, they need to be
>> > segmented in software. This patch splits buffers with a frag_list
>> > at the frag_list pointer into buffers that can be TSO offloaded.
>> >
>> > Signed-off-by: Steffen Klassert 
>> > ---
>> >  net/core/skbuff.c  | 89 +-
>> >  net/ipv4/af_inet.c |  7 ++--
>> >  net/ipv4/gre_offload.c |  7 +++-
>> >  net/ipv4/tcp_offload.c |  3 ++
>> >  net/ipv4/udp_offload.c |  9 +++--
>> >  net/ipv6/ip6_offload.c |  6 +++-
>> >  6 files changed, 114 insertions(+), 7 deletions(-)
>> >
>> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> > index 3864b4b6..a614e9d 100644
>> > --- a/net/core/skbuff.c
>> > +++ b/net/core/skbuff.c
>> > @@ -3078,6 +3078,92 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>> > sg = !!(features & NETIF_F_SG);
>> > csum = !!can_checksum_protocol(features, proto);
>> >
>> > +   headroom = skb_headroom(head_skb);
>> > +
>> > +   if (list_skb && net_gso_ok(features, skb_shinfo(head_skb)->gso_type) &&
>> > +   csum && sg && (mss != GSO_BY_FRAGS) &&
>> > +   !(features & NETIF_F_GSO_PARTIAL)) {
>>
>> Does this really need to be mutually exclusive with
>> NETIF_F_GSO_PARTIAL and GSO_BY_FRAGS?
>
> It should be possible to extend this to NETIF_F_GSO_PARTIAL but
> I have no test for this. Regarding GSO_BY_FRAGS, this is rather
> new and just used for sctp. I don't know what sctp does with
> GSO_BY_FRAGS.

I'm adding Marcelo as he could probably explain the GSO_BY_FRAGS
functionality better than I could since he is the original author.

If I recall GSO_BY_FRAGS does something similar to what you are doing,
although I believe it doesn't carry any data in the first buffer other
than just a header.  I believe the idea behind GSO_BY_FRAGS was to
allow for segmenting a frame at the frag_list level instead of having
it done just based on MSS.  That was the only reason why I brought it
up.

In your case though we may be able to make this easier.  If I am not
mistaken I believe we should have the main skb, and any in the chain
excluding the last containing the same amount of data.  That being the
case we should be able to determine the size that you would need to
segment at by taking skb->len, and removing the length of all the
skbuffs hanging off of frag_list.  At that point you just use that as
your MSS for segmentation and it should break things up so that you
have a series of equal-sized segments split at the frag_list buffer
boundaries.

After that all that is left is to update the gso info for the buffers.
For GSO_PARTIAL I was handling that on the first segment only.  For
this change you would need to update that code to address the fact
that you would have to determine the number of segments on the first
frame and the last since the last could be less than the first, but
all of the others in-between should have the same number of segments.

- Alex
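
Alex's size derivation can be sketched numerically. The helper below is hypothetical and just models the arithmetic: the head length falls out of subtracting the frag_list lengths from the total, and only the last buffer may represent fewer gso_size-sized segments than the others:

```python
def fraglist_split_mss(total_len, frag_list_lens, gso_size):
    """Derive the size to segment at so splits land on frag_list
    boundaries: head length = total length minus everything hanging
    off frag_list (per the suggestion above).  Also report how many
    gso_size-sized segments each resulting buffer represents."""
    head_len = total_len - sum(frag_list_lens)
    # ceil-divide each buffer length by gso_size for its segment count
    seg_counts = [-(-n // gso_size) for n in [head_len] + frag_list_lens]
    return head_len, seg_counts
```

For example, a 9000-byte skb with two 3000-byte frag_list buffers and gso_size 1500 splits at 3000 bytes, each piece carrying two segments' worth of data.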


[PATCH net-next] tcp: md5: add LINUX_MIB_TCPMD5FAILURE counter

2016-08-24 Thread Eric Dumazet
From: Eric Dumazet 

Adds SNMP counter for drops caused by MD5 mismatches.

The current syslog might help, but a counter is more precise and helps
monitoring.

Signed-off-by: Eric Dumazet 
---
 include/uapi/linux/snmp.h |1 +
 net/ipv4/proc.c   |1 +
 net/ipv4/tcp_ipv4.c   |1 +
 net/ipv6/tcp_ipv6.c   |1 +
 4 files changed, 4 insertions(+)

diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index 25a9ad8bcef1240915f2553a8acade447186d869..e7a31f8306903f53bc5881ae4c271f85cad2e361 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -235,6 +235,7 @@ enum
LINUX_MIB_TCPSPURIOUSRTOS,  /* TCPSpuriousRTOs */
LINUX_MIB_TCPMD5NOTFOUND,   /* TCPMD5NotFound */
LINUX_MIB_TCPMD5UNEXPECTED, /* TCPMD5Unexpected */
+   LINUX_MIB_TCPMD5FAILURE,/* TCPMD5Failure */
LINUX_MIB_SACKSHIFTED,
LINUX_MIB_SACKMERGED,
LINUX_MIB_SACKSHIFTFALLBACK,
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 9f665b63a927202b9aaf2b6b3d42205058a2ae59..1ed015e4bc792acdd520a5df95ffac33ebefc4db 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -257,6 +257,7 @@ static const struct snmp_mib snmp4_net_list[] = {
SNMP_MIB_ITEM("TCPSpuriousRTOs", LINUX_MIB_TCPSPURIOUSRTOS),
SNMP_MIB_ITEM("TCPMD5NotFound", LINUX_MIB_TCPMD5NOTFOUND),
SNMP_MIB_ITEM("TCPMD5Unexpected", LINUX_MIB_TCPMD5UNEXPECTED),
+   SNMP_MIB_ITEM("TCPMD5Failure", LINUX_MIB_TCPMD5FAILURE),
SNMP_MIB_ITEM("TCPSackShifted", LINUX_MIB_SACKSHIFTED),
SNMP_MIB_ITEM("TCPSackMerged", LINUX_MIB_SACKMERGED),
SNMP_MIB_ITEM("TCPSackShiftFallback", LINUX_MIB_SACKSHIFTFALLBACK),
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 32b048e524d6773538918eca175b3f422f9c2aa7..45aac7ada13592c6f1c9f28aea4426b40520e0c8 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1169,6 +1169,7 @@ static bool tcp_v4_inbound_md5_hash(const struct sock *sk,
  NULL, skb);
 
if (genhash || memcmp(hash_location, newhash, 16) != 0) {
+   NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMD5FAILURE);
net_info_ratelimited("MD5 Hash failed for (%pI4, %d)->(%pI4, %d)%s\n",
 &iph->saddr, ntohs(th->source),
 &iph->daddr, ntohs(th->dest),
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index e0f46439e391f2a8b2fac2e13b6f61a11c082715..60a65d058349c93fb66275434f6fe162a621782e 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -671,6 +671,7 @@ static bool tcp_v6_inbound_md5_hash(const struct sock *sk,
  NULL, skb);
 
if (genhash || memcmp(hash_location, newhash, 16) != 0) {
+   NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMD5FAILURE);
net_info_ratelimited("MD5 Hash %s for [%pI6c]:%u->[%pI6c]:%u\n",
 genhash ? "failed" : "mismatch",
 &ip6h->saddr, ntohs(th->source),




Re: wan-cosa: Use memdup_user() rather than duplicating its implementation

2016-08-24 Thread SF Markus Elfring
>   What about the GFP_DMA attribute, which your patch deletes?
> The buffer in question has to be ISA DMA-able.

Thanks for your constructive feedback.

Would you be interested in using a variant of the function "memdup_…"
with which the corresponding memory allocation option can be preserved?

Regards,
Markus


Re: [ethtool PATCH v4 0/4] Add support for QSFP+/QSFP28 Diagnostics and 25G/50G/100G port speeds

2016-08-24 Thread John W. Linville
On Wed, Aug 24, 2016 at 04:29:22AM +, Yuval Mintz wrote:
> > This patch series provides following support
> > a) Reorganized fields based out of SFF-8024 fields i.e. Identifier/
> >Encoding/Connector types which are common across SFP/SFP+ (SFF-8472)
> >and QSFP+/QSFP28 (SFF-8436/SFF-8636) modules into sff-common files.
> > b) Support for diagnostics information for QSFP Plus/QSFP28 modules
> >based on SFF-8436/SFF-8636
> > c) Supporting 25G/50G/100G speeds in supported/advertising fields
> > d) Tested across various QSFP+/QSFP28 Copper/Optical modules
> > 
> > Standards for QSFP+/QSFP28
> > a) QSFP+/QSFP28 - SFF 8636 Rev 2.7 dated January 26,2016
> > b) SFF-8024 Rev 4.0 dated May 31, 2016
> > 
> > v4:
> >   Sync ethtool-copy.h to kernel commit
> > 89da45b8b5b2187734a11038b8593714f964ffd1
> >   which includes support for 50G base SR2
> 
> What about the man-page?

I can just apply your man page patch on top.

John
-- 
John W. Linville        Someday the world will need a hero, and you
linvi...@tuxdriver.com  might be all we have.  Be ready.


Re: [PATCH 1/3 v2] net: smsc911x: augment device tree bindings

2016-08-24 Thread Arnd Bergmann
On Wednesday, August 24, 2016 2:59:40 PM CEST Linus Walleij wrote:
> +- interrupts : Should contain the SMSC LAN
> +  interrupt line as cell 0, cell 1 is an OPTIONAL PME (power
> +  management event) interrupt that is able to wake up the host
> +  system with a 50ms pulse on network activity
> +  For generic bindings for interrupt controller parents, refer to
> +  interrupt-controller/interrupts.txt

I think you should (slightly) reword this to avoid using the
term "cell", which refers to a 32-bit word in the property,
not the interrupt specifier that is often made up of two or
three cells.

Maybe something like

- interrupts: one or two interrupt specifiers:
- The first interrupt is the SMSC LAN interrupt line.
- The second interrupt (if present) is the power management
  event ...

Arnd
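
For what it's worth, a node using the reworded binding might look like the following (chip-select address, interrupt parent and interrupt numbers are invented for illustration):

```
ethernet@3,0 {
        compatible = "smsc,lan9220", "smsc,lan9115";
        reg = <3 0x0 0x1000>;
        interrupt-parent = <&gpio3>;
        /* first specifier: SMSC LAN interrupt line;
         * second (optional): PME wake-up interrupt */
        interrupts = <1 IRQ_TYPE_LEVEL_LOW>, <2 IRQ_TYPE_EDGE_RISING>;
};
```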


