[PATCH RFC 1/2] cdc_ncm: add the currently processed NDP frame to global driver data

2015-06-02 Thread Enrico Mioso
This is useful to split up the cdc_ncm_ndp function later on.
The resulting code will be stateful anyway.

Signed-off-by: Enrico Mioso <mrkiko...@gmail.com>
---
 include/linux/usb/cdc_ncm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/usb/cdc_ncm.h b/include/linux/usb/cdc_ncm.h
index 7c9b484..9172256 100644
--- a/include/linux/usb/cdc_ncm.h
+++ b/include/linux/usb/cdc_ncm.h
@@ -100,6 +100,7 @@ struct cdc_ncm_ctx {
 	struct sk_buff *tx_curr_skb;
 	struct sk_buff *tx_rem_skb;
 	__le32 tx_rem_sign;
+	struct usb_cdc_ncm_ndp16 *tx_curr_ndp16;
 
 	spinlock_t mtx;
 	atomic_t stop;
-- 
2.4.2

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC 0/2] cdc_ncm refactoring

2015-06-02 Thread Enrico Mioso
I changed my mind and decided to try a different approach.
This series splits the cdc_ncm_ndp function in two parts:
- one that finds NDP blocks already present in the SKB being sent out
- one that pushes new ones, starting from where the _find function left off.

After this split it seems easier to modify the location where the NDP is
placed.
What do you think about this?
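
The find/push split described above can be sketched in userspace C. This is only an illustration under stated assumptions: struct ndp_hdr, struct tx_state, ndp_find and ndp_push are hypothetical stand-ins, not the kernel's usb_cdc_ncm_ndp16 or the functions in this series. Offset 0 is taken to mean "end of chain", which assumes another header occupies the start of the buffer. The walk advances using the header it has just read, so it does not depend on a match having been found.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified NDP header (not the kernel structure). */
struct ndp_hdr {
	uint32_t signature;	/* identifies the NDP type */
	uint16_t next;		/* offset of the next NDP, 0 = end of chain */
	uint16_t pad;
};

/* Hypothetical driver state: offset of the last NDP visited by ndp_find(). */
struct tx_state {
	size_t last_ndp_off;	/* 0 = nothing visited yet */
};

static void read_hdr(const uint8_t *buf, size_t off, struct ndp_hdr *out)
{
	memcpy(out, buf + off, sizeof(*out));
}

static void write_hdr(uint8_t *buf, size_t off, const struct ndp_hdr *in)
{
	memcpy(buf + off, in, sizeof(*in));
}

/* Walk the chain from first_off; return the offset of the match, or 0. */
size_t ndp_find(struct tx_state *st, const uint8_t *buf, size_t first_off,
		uint32_t sign)
{
	size_t off = first_off;

	while (off) {
		struct ndp_hdr h;

		read_hdr(buf, off, &h);
		st->last_ndp_off = off;		/* remember where we stopped */
		if (h.signature == sign)
			return off;
		off = h.next;
	}
	return 0;
}

/* Write a new NDP at new_off and link it after the last visited one. */
size_t ndp_push(struct tx_state *st, uint8_t *buf, size_t new_off,
		uint32_t sign)
{
	struct ndp_hdr h = { .signature = sign, .next = 0, .pad = 0 };

	write_hdr(buf, new_off, &h);
	if (st->last_ndp_off) {
		struct ndp_hdr prev;

		read_hdr(buf, st->last_ndp_off, &prev);
		prev.next = (uint16_t)new_off;
		write_hdr(buf, st->last_ndp_off, &prev);
	}
	st->last_ndp_off = new_off;
	return new_off;
}
```

The state carried between the two calls is what makes the pair stateful: ndp_push is only correct when it runs right after an ndp_find on the same buffer.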

From here on I need a little help: I think we should work on the
cdc_ncm_ndp16_push function, but I am open to any suggestions.

Let me know if you like this.
Enrico

Enrico Mioso (2):
  cdc_ncm: add the currently processed NDP frame to global driver data
  cdc_ncm: split the cdc_ncm_ndp function

 drivers/net/usb/cdc_ncm.c   | 30 +-
 include/linux/usb/cdc_ncm.h |  1 +
 2 files changed, 22 insertions(+), 9 deletions(-)

-- 
2.4.2



Re: [PATCH net-next 5/5] rocker: remove support for legacy VLAN ndo ops

2015-06-02 Thread Scott Feldman
On Mon, Jun 1, 2015 at 10:24 PM, David Miller da...@davemloft.net wrote:
 From: Toshiaki Makita makita.toshi...@lab.ntt.co.jp
 Date: Tue, 02 Jun 2015 13:51:06 +0900

 On 2015/06/02 3:39, sfel...@gmail.com wrote:
 From: Scott Feldman sfel...@gmail.com

 Remove support for legacy ndo ops
 .ndo_vlan_rx_add_vid/.ndo_vlan_rx_kill_vid.  Rocker will use
 bridge_setlink/dellink exclusively for VLAN add/del operations.

 The legacy ops are needed if using 8021q driver module to setup VLANs on
 the port.  But an alternative exists in using bridge_setlink/dellink to
 setup VLANs, which doesn't depend on 8021q module.  So rocker will switch
 to the newer setlink/dellink ops.  VLANs can be added/deleted from the port,
 regardless of whether the port is bridged or not, using the bridge commands:

  bridge vlan [add|del] vid VID dev DEV self

 Hi Scott,

 This doesn't look transparent with bridge.

 Before this patch, I was able to add vid in the same way as software bridge:

   ip link set DEV master br0
   bridge vlan add vid VID dev DEV

 Now I need to add self, which is different from software bridge...

 I'm already not liking the looks of this

Actually, we're now consistent with bridge man page which says master
is the default.

What we want, I believe, is to adjust what the man page says (and the
bridge vlan command itself), by making the default master and self.
The kernel and driver are fine, it's the default in the bridge command
that needs adjusting.  Once we do this, we'll be back to transparent
with software-only bridge.

How did we get here?   So the RTM_SETLINK for PF_BRIDGE calls
rtnl_bridge_setlink().  rtnl_bridge_setlink() calls ndo_bridge_setlink
for the master (the bridge side of the port) and self (the device side
of the port), depending on if MASTER and/or SELF flags are set.  Since
the default from the iproute2 bridge vlan cmd is to only set MASTER,
only the bridge's ndo_bridge_setlink is called.  But if you dig down
into the bridge's ndo_bridge_setlink, you'll see it will call into the
port driver's ndo_vlan_rx_add_vid() to add the vlan to the device side
of the port.  So we have a MASTER cmd that is doing some SELF work.
My guess is this was done to avoid having to update all the NIC drivers
from ndo_vlan_rx_add_vid to ndo_bridge_setlink.  When you remove
ndo_vlan_rx_add_vid() from the port driver, the cmd needs to target
MASTER and SELF for both sides of the port to be called.  But the
current cmd only sets MASTER.  This is why you (currently) need to add
SELF for cmd to target the device side of the port.

On top of all of this, you can use RTM_SETLINK for PF_BRIDGE on
non-bridged ports, in which case only SELF is used to program the VLAN
on the device, using the device's ndo_bridge_setlink.  This is the
confusing part where you can set VLANs on non-bridged ports using the
bridge cmd.

To summarize, pseudo code for rtnl_bridge_setlink() is:

rtnl_bridge_setlink()
    if MASTER
        call bridge's ndo_bridge_setlink()
            if bridge port implements ndo_vlan_rx_add_vid()
                call ndo_vlan_rx_add_vid() on port device to set vlan
    if SELF
        call port device's ndo_bridge_setlink()
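
The pseudo code above can be modelled as a small userspace C function that records which callbacks would run for a given flag combination. This is a sketch only: FLAG_MASTER, FLAG_SELF, struct setlink_calls and the function names are hypothetical and the flag values are illustrative; nothing here is the kernel's rtnl_bridge_setlink().

```c
#include <stdbool.h>

/* Illustrative flag values for this model only. */
#define FLAG_MASTER 0x1
#define FLAG_SELF   0x2

/* Which callbacks the model would invoke. */
struct setlink_calls {
	bool bridge_setlink;	/* bridge's ndo_bridge_setlink() */
	bool port_vlan_rx_add;	/* port's ndo_vlan_rx_add_vid(), via the bridge */
	bool port_setlink;	/* port device's ndo_bridge_setlink() */
};

struct setlink_calls model_setlink(unsigned int flags,
				   bool port_has_vlan_rx_add)
{
	struct setlink_calls c = { false, false, false };

	if (flags & FLAG_MASTER) {
		c.bridge_setlink = true;
		/* the bridge forwards to the port only if the legacy op exists */
		if (port_has_vlan_rx_add)
			c.port_vlan_rx_add = true;
	}
	if (flags & FLAG_SELF)
		c.port_setlink = true;
	return c;
}

/* True if the VLAN is programmed on the device side of the port. */
bool vlan_reaches_device(unsigned int flags, bool port_has_vlan_rx_add)
{
	struct setlink_calls c = model_setlink(flags, port_has_vlan_rx_add);

	return c.port_vlan_rx_add || c.port_setlink;
}
```

With the legacy op removed from the port driver, vlan_reaches_device(FLAG_MASTER, false) is false in this model, which is the situation described in the thread; setting both flags makes it true again.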

If DEV is bridged, today we have:

bridge vlan add vid VID dev DEV               sets MASTER (default)
bridge vlan add vid VID dev DEV master        sets MASTER
bridge vlan add vid VID dev DEV self          sets SELF
bridge vlan add vid VID dev DEV master self   sets MASTER and SELF

If DEV is not bridged, today we have:

bridge vlan add vid VID dev DEV               // fails (no master device)
bridge vlan add vid VID dev DEV self          sets SELF

What I propose is we change the bridge vlan cmd for the DEV bridged
case as such:

bridge vlan add vid VID dev DEV               sets MASTER and SELF (default)
bridge vlan add vid VID dev DEV master        sets MASTER
bridge vlan add vid VID dev DEV self          sets SELF
bridge vlan add vid VID dev DEV master self   sets MASTER and SELF

For existing users of ndo_vlan_rx_add_vid/ndo_vlan_rx_kill_vid,
nothing really changes.  If they also have an ndo_bridge_setlink,
it'll get called but they're not doing any vlan stuff there today
anyway, so it's ignored.

For rocker, we're switching to doing all vlan stuff in
ndo_bridge_setlink.  Switching to ndo_bridge_setlink for switchdev
gives us support for stacked drivers with the transaction model,
something we don't get with ndo_vlan_rx_add_vid.

If this makes sense, I'll post the follow up bridge vlan cmd change to
default to master and self.

-scott


Re: [PATCH 7/7] mac80211: Switch to new AEAD interface

2015-06-02 Thread Jouni Malinen
On Mon, Jun 01, 2015 at 05:36:58PM +0200, Stephan Mueller wrote:
 Am Montag, 1. Juni 2015, 16:35:26 schrieb Johannes Berg:
 IOW, I think something like this would make sense:
 
 That looks definitely cleaner :-)

Indeed. That AAD length-in-the-buffer design came from code more than ten
years old that was optimized to cover the CCM construction with the
same buffer, and it was not cleaned up when this was converted to use
cryptoapi a couple of years ago.

 Though, my main concern was just to ensure that the aad length value is not 
 zero.

It won't be in IEEE 802.11 use cases. The exact length depends on the
IEEE 802.11 frame type, but AAD is constructed in a way that it is
normally a bit over 20 octets while allowing CCM to fit the related
operations into two AES blocks.
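
As a worked check of the "two AES blocks" remark, the block count for the AAD can be computed directly. The sketch assumes the two-octet length prefix that CCM (RFC 3610) uses when the AAD is shorter than 0xFF00 octets, followed by zero padding to a 16-octet boundary. The function name is hypothetical, and the 22 and 30 octet inputs in the note below are illustrative sizes, not figures taken from this thread.

```c
#include <stddef.h>

#define AES_BLOCK_LEN		16	/* AES block size in octets */
#define CCM_AAD_LEN_FIELD	2	/* two-octet length prefix for short AAD */

/*
 * Number of AES blocks CCM authenticates for the AAD, valid for
 * 0 < aad_len < 0xFF00. A zero-length AAD contributes no blocks.
 */
size_t ccm_aad_blocks(size_t aad_len)
{
	if (aad_len == 0)
		return 0;
	return (CCM_AAD_LEN_FIELD + aad_len + AES_BLOCK_LEN - 1) /
	       AES_BLOCK_LEN;
}
```

Under these assumptions any AAD from 15 to 30 octets needs exactly two blocks, so an AAD of 22 or 30 octets both fit, while 31 octets would spill into a third.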
 
-- 
Jouni Malinen                                            PGP id EFC895FA


Re: [RFC] [PATCH] net: socket: Fix the wrong returns for recvmsg and sendmsg

2015-06-02 Thread Junling Zheng
On 2015/6/2 14:52, Willy Tarreau wrote:
 On Tue, Jun 02, 2015 at 02:43:54PM +0800, Junling Zheng wrote:
 On 2015/6/2 14:27, Greg KH wrote:
 On Mon, Jun 01, 2015 at 10:23:57PM -0700, David Miller wrote:
 From: Junling Zheng zhengjunl...@huawei.com
 Date: Tue, 2 Jun 2015 12:05:32 +0800

 So, the problem commit is 281c9c36 (net: compat: Update
 get_compat_msghdr() to match copy_msghdr_from_user() behaviour),
 which fixes db31c55a6fb2 and brings the get_compat_msghdr() in line
 with copy_msghdr_from_user().

 Upstream this got fixed by:

 08adb7dabd4874cc5666b4490653b26534702ce0

 So the part that makes us not unconditionally return -EFAULT needs
 to be backported, and that's probably equivalent to the patch
 you proposed which therefore should be applied.

 Ok, thanks, now applied.


 Maybe other stable versions also need this fix :)
 
 Yes, from what I'm seeing, at least 3.2 and 2.6.32 need it as well.
 

Yeah, all other stable versions *except 3.19 and 4.0* may need this fix :)

 Thanks,
 Willy
 
 
 




Re: [RFC 3/9] net: dsa: mv88e6xxx: add support for VTU ops

2015-06-02 Thread Guenter Roeck

Vivien,

On 06/01/2015 06:27 PM, Vivien Didelot wrote:

This commit implements the port_vlan_add and port_vlan_del functions in
the dsa_switch_driver structure for Marvell 88E6xxx compatible switches.

This allows access to a switch's VLAN Table Unit, so that VLANs can be defined
from standard userspace commands such as bridge vlan.

Signed-off-by: Vivien Didelot vivien.dide...@savoirfairelinux.com
---


[ ... ]


+
+int mv88e6xxx_port_vlan_add(struct dsa_switch *ds, int port, u16 vid,
+			    u16 bridge_flags)
+{
+	struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
+	struct mv88e6xxx_vtu_entry entry = { 0 };
+	int prev_vid = vid ? vid - 1 : 4095;
+	int i, ret;
+
+	/* Bringing an interface up adds it to the VLAN 0. Ignore this. */
+	if (!vid)
+		return 0;
+


Me puzzled ;-). I brought this and the fid question up before.
No idea if my e-mail got lost or what happened.

Can you explain why we don't need a configuration for vlan 0 ?


+	/* The DSA port-based VLAN setup reserves FID 0 to DSA_MAX_PORTS;
+	 * we will use the next FIDs for 802.1q;
+	 * thus, forbid the last DSA_MAX_PORTS VLANs.
+	 */
+	if (vid > 4095 - DSA_MAX_PORTS)
+		return -EINVAL;
+
+	mutex_lock(&ps->smi_mutex);
+	ret = _mv88e6xxx_vtu_getnext(ds, prev_vid, &entry);
+	if (ret < 0)
+		goto unlock;
+
+	/* If the VLAN does not exist, re-initialize the entry for addition */
+	if (entry.vid != vid || !entry.valid) {
+		memset(&entry, 0, sizeof(entry));
+		entry.valid = true;
+		entry.vid = vid;
+		entry.fid = DSA_MAX_PORTS + vid;


I brought this up before. No idea if my e-mail got lost or what happened.

We use a fid per port, and a fid per bridge group. With VLANs, this is completely
ignored, and there is only a single fid per vlan for the entire switch.

Either per-port fids are unnecessary as well, or something is wrong here,
or I am missing something. Can you explain why we only need a single fid
per vlan, even if we have multiple bridge groups and the same vlan is
configured in all of them ?
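
The VID to FID mapping in the quoted hunk can be modelled in a few lines of userspace C. This is a sketch under assumptions: MODEL_MAX_PORTS is a hypothetical stand-in for DSA_MAX_PORTS with an illustrative value of 12, the 4095 bound is the one used in the quoted check, and model_vid_to_fid is not a driver function. The model returns -1 for VLAN 0, whereas the quoted code silently ignores it.

```c
#include <stdint.h>

#define MODEL_MAX_PORTS	12	/* hypothetical stand-in for DSA_MAX_PORTS */
#define MODEL_MAX_ID	4095	/* upper bound used by the quoted check */

/*
 * Map a VLAN ID to a FID the way the quoted hunk does: FIDs below
 * MODEL_MAX_PORTS are left to the port-based setup, so the last
 * MODEL_MAX_PORTS VLAN IDs cannot be mapped. Returns -1 when no FID
 * is assigned.
 */
int model_vid_to_fid(uint16_t vid)
{
	if (vid == 0)
		return -1;
	if (vid > MODEL_MAX_ID - MODEL_MAX_PORTS)
		return -1;
	return MODEL_MAX_PORTS + vid;
}
```

The mapping takes no bridge group as input, so the same VLAN configured in two bridge groups lands on the same FID in this model, which is the concern raised above.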

Thanks,
Guenter



Re: [RFC 2/9] net: dsa: add basic support for VLAN operations

2015-06-02 Thread Guenter Roeck

On 06/01/2015 06:27 PM, Vivien Didelot wrote:

This patch adds the glue between DSA and switchdev to add and delete
SWITCHDEV_OBJ_PORT_VLAN objects.

This will allow the DSA switch drivers implementing the port_vlan_add
and port_vlan_del functions to access the switch VLAN database through
userspace commands such as bridge vlan.

Signed-off-by: Vivien Didelot vivien.dide...@savoirfairelinux.com
---
  include/net/dsa.h |  7 +++
  net/dsa/slave.c   | 61 +--
  2 files changed, 66 insertions(+), 2 deletions(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index fbca63b..726357b 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -302,6 +302,13 @@ struct dsa_switch_driver {
   const unsigned char *addr, u16 vid);
int (*fdb_getnext)(struct dsa_switch *ds, int port,
   unsigned char *addr, bool *is_static);
+
+	/*
+	 * VLAN support
+	 */
+	int (*port_vlan_add)(struct dsa_switch *ds, int port, u16 vid,
+			     u16 bridge_flags);
+	int (*port_vlan_del)(struct dsa_switch *ds, int port, u16 vid);
  };

  void register_switch_driver(struct dsa_switch_driver *type);
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index cbda00a..52ba5a1 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -363,6 +363,25 @@ static int dsa_slave_port_attr_set(struct net_device *dev,
return ret;
  }

+static int dsa_slave_port_vlans_add(struct net_device *dev,
+				    struct switchdev_obj_vlan *vlan)
+{
+	struct dsa_slave_priv *p = netdev_priv(dev);
+	struct dsa_switch *ds = p->parent;
+	int vid, err = 0;
+
+	if (!ds->drv->port_vlan_add)
+		return -ENOTSUPP;
+
+	for (vid = vlan->vid_start; vid <= vlan->vid_end; ++vid) {
+		err = ds->drv->port_vlan_add(ds, p->port, vid, vlan->flags);
+		if (err)
+			break;
+	}
+
+	return err;
+}
+
  static int dsa_slave_port_obj_add(struct net_device *dev,
  struct switchdev_obj *obj)
  {
@@ -378,6 +397,9 @@ static int dsa_slave_port_obj_add(struct net_device *dev,
return 0;

  	switch (obj->id) {
+	case SWITCHDEV_OBJ_PORT_VLAN:
+		err = dsa_slave_port_vlans_add(dev, &obj->u.vlan);
+		break;
  	default:
  		err = -ENOTSUPP;
  		break;
@@ -386,12 +408,34 @@ static int dsa_slave_port_obj_add(struct net_device *dev,
return err;
  }

+static int dsa_slave_port_vlans_del(struct net_device *dev,
+				    struct switchdev_obj_vlan *vlan)
+{
+	struct dsa_slave_priv *p = netdev_priv(dev);
+	struct dsa_switch *ds = p->parent;
+	int vid, err = 0;
+
+	if (!ds->drv->port_vlan_del)
+		return -ENOTSUPP;
+
+	for (vid = vlan->vid_start; vid <= vlan->vid_end; ++vid) {
+		err = ds->drv->port_vlan_del(ds, p->port, vid);
+		if (err)
+			break;
+	}
+
+	return err;
+}
+
  static int dsa_slave_port_obj_del(struct net_device *dev,
  struct switchdev_obj *obj)
  {
int err;

  	switch (obj->id) {
+	case SWITCHDEV_OBJ_PORT_VLAN:
+		err = dsa_slave_port_vlans_del(dev, &obj->u.vlan);
+		break;
  	default:
  		err = -EOPNOTSUPP;
  		break;
@@ -473,6 +517,15 @@ static netdev_tx_t dsa_slave_notag_xmit(struct sk_buff *skb,
  	return NETDEV_TX_OK;
  }

+static int dsa_slave_vlan_noop(struct net_device *dev, __be16 proto, u16 vid)
+{
+	/* NETIF_F_HW_VLAN_CTAG_FILTER requires ndo_vlan_rx_add_vid and
+	 * ndo_vlan_rx_kill_vid, otherwise the VLAN acceleration is considered
+	 * buggy (see net/core/dev.c).
+	 */


As Scott mentioned, just don't set NETIF_F_HW_VLAN_CTAG_FILTER.

I don't entirely understand why we would not want to filter VLANs in the switch.
Can you explain ?

Thanks,
Guenter



Re: [PATCH net-next v2 01/14] sfc: Add code to export port_num in netdev->dev_port

2015-06-02 Thread Shradha Shah


On 01/06/15 20:01, David Miller wrote:
 From: Shradha Shah ss...@solarflare.com
 Date: Mon, 1 Jun 2015 14:00:12 +0100
 
 In the case where we have multiple functions (PFs and VFs), this
 sysfs entry is useful to identify the physical port corresponding
 to the function we are interested in.

 Signed-off-by: Shradha Shah ss...@solarflare.com
 
 This is a low effort change.
 
 You retained all of the error handling changes that were only necessary when
 you added the new sysfs file, but are completely unnecessary if you're
 just reporting it via netdev->dev_port.

With the addition of the sysfs change in my previous version, the error handling
code required the addition of a fail4 tag to deal with the sysfs file on the
error path.

Without the sysfs file in my current version v2, there is no extra fail4 tag; I
have reverted to using fail3.

The changes that are seen in the patch are stylistic changes following the rule
that every branch of an if statement should have braces if one of the
branches uses braces.

The previous version of the patch touched this bit of code so the style change
was relevant.

I think my mistake here is that I left it in v2 as a style change, but I can
assure you that I looked at the error path before submitting the patch and
also that this patch does not affect the error path.

Maybe I should have separated the style change to go as a different patch.

I will do so now and submit a v3.

Thanks.

 
 This is extremely disappointing, because you expect me to put a good effort
 into reviewing your changes yet you aren't putting that level of effort into
 the submission itself.
 

-- 
Many Thanks,
Regards,
Shradha Shah


linux-next: manual merge of the scsi tree with the net-next tree

2015-06-02 Thread Stephen Rothwell
Hi James,

Today's linux-next merge of the scsi tree got a conflict in
drivers/target/target_core_user.c between commit 5538d294dd66
("treewide: Add missing vmalloc.h inclusion") from the net-next tree
and commit 7ad09a15e76b ("target: Minimize SCSI header #include
directives") from the scsi tree.

I fixed it up (see below) and can carry the fix as necessary (no action
is required).

-- 
Cheers,
Stephen Rothwell                    s...@canb.auug.org.au

diff --cc drivers/target/target_core_user.c
index edc98250,21b438ec4700..
--- a/drivers/target/target_core_user.c
+++ b/drivers/target/target_core_user.c
@@@ -19,13 -19,13 +19,14 @@@
  #include <linux/spinlock.h>
  #include <linux/module.h>
  #include <linux/idr.h>
+ #include <linux/kernel.h>
  #include <linux/timer.h>
  #include <linux/parser.h>
 +#include <linux/vmalloc.h>
- #include <scsi/scsi.h>
- #include <scsi/scsi_host.h>
  #include <linux/uio_driver.h>
  #include <net/genetlink.h>
+ #include <scsi/scsi_common.h>
+ #include <scsi/scsi_proto.h>
  #include <target/target_core_base.h>
  #include <target/target_core_fabric.h>
  #include <target/target_core_backend.h>




Re: [RFC 3/9] net: dsa: mv88e6xxx: add support for VTU ops

2015-06-02 Thread Scott Feldman
On Mon, Jun 1, 2015 at 11:50 PM, Guenter Roeck li...@roeck-us.net wrote:

[cut]

 I brought this up before. No idea if my e-mail got lost or what happened.

 We use a fid per port, and a fid per bridge group. With VLANs, this is
 completely ignored, and there is only a single fid per vlan for the entire switch.

 Either per-port fids are unnecessary as well, or something is wrong here,
 or I am missing something. Can you explain why we only need a single fid
 per vlan, even if we have multiple bridge groups and the same vlan is
 configured in all of them ?

That brings up an interesting point about having multiple bridges with
the same vlan configured.  I struggled with that problem with rocker
also and I don't have an answer other than don't do that.  Or,
better put, if you have multiple bridges on the same vlan, just use one
bridge for that vlan.  Otherwise, I don't know how at the device level
to partition the vlan between the bridges.  Maybe that's what Vivien
is facing also?  I can see how this works for software-only bridges,
because they should be isolated from each other and independent.  But
when offloading to a device which sees VLAN XXX global across the
entire switch, I don't see how we can preserve the bridge boundaries.

I hope I'm not misunderstanding the issue here; if I am, I apologize.

-scott


[BUG] be2net breaks when dma_alloc_coherent memory is not zeroed out

2015-06-02 Thread Joerg Roedel
Hi,

yesterday I bisected an issue with one of my be2net adapters and AMD
IOMMU enabled. In 4.1-rc it suddenly broke and didn't initialize
anymore. It turned out that the be2net driver breaks when the memory
returned from dma_alloc_coherent is not zeroed out. I introduced that
change to the AMD IOMMU driver for v4.1, other DMA-API implementations
for x86 still zero out the memory.

The bug shows up like this in dmesg:

 be2net :02:00.0: FW config: function_mode=0x10003, function_caps=0x7
 be2net :02:00.0: FW not responding
 be2net :02:00.0: Unrecoverable Error detected in the adapter
 be2net :02:00.0: Please reboot server to recover
 be2net :02:00.0: UE: MPU bit set

or sometimes as:

 be2net :02:00.1: Waiting for POST, 52s elapsed
 be2net :02:00.1: Waiting for POST, 54s elapsed
 be2net :02:00.1: Waiting for POST, 56s elapsed
 be2net :02:00.1: Waiting for POST, 58s elapsed

But the result is always:

 be2net :02:00.1: Emulex OneConnect(be3) initialization failed
 be2net: probe of :02:00.1 failed with error -110

When the memory returned by dma_alloc_coherent is zeroed out everything
works fine.

But strictly speaking dma_alloc_coherent is not required to zero out the
memory; drivers need to call dma_zalloc_coherent when they need this. So
the behavior of the AMD IOMMU driver is correct.

Can you guys please have a look and remove the assumption that
dma_alloc_coherent returns initialized memory in the be2net driver? In
the future I'd like to optimize out this needless zeroing out of memory
from all IOMMU drivers.
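
A userspace analogy of the request above, as a sketch only: malloc() plays the role of an allocator that does not promise zeroed memory, and the explicit memset() plays the role of the zeroing the driver has to ask for itself. struct cmd_ring and ring_alloc_zeroed are hypothetical names; this is not be2net or DMA-API code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical control structure a device would expect to start zeroed. */
struct cmd_ring {
	unsigned int head;
	unsigned int tail;
	unsigned char slots[64];
};

/*
 * Allocate a ring without relying on the allocator to zero it:
 * malloc() returns indeterminate contents, so clear them explicitly.
 * Returns NULL on allocation failure; the caller frees with free().
 */
struct cmd_ring *ring_alloc_zeroed(void)
{
	struct cmd_ring *ring = malloc(sizeof(*ring));

	if (!ring)
		return NULL;
	memset(ring, 0, sizeof(*ring));
	return ring;
}
```

Code written this way keeps working if the underlying allocator stops zeroing memory as an optimization.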

Please let me know if you need further information or if I can help with
testing or anything.

Thanks,

Joerg



[PATCH net-next] net/mlx4_core: Fix build failure introduced by the EQ pool changes

2015-06-02 Thread Or Gerlitz
When CONFIG_RFS_ACCEL or CONFIG_SMP isn't set, the build fails. Fix it.

Also, avoid a build warning about an unused function in that setup.

Fixes: c66fa19c405a ('net/mlx4: Add EQ pool')
Reported-by: Michael Ellerman <m...@ellerman.id.au>
Signed-off-by: Matan Barak <mat...@mellanox.com>
Signed-off-by: Or Gerlitz <ogerl...@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/eq.c |4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/eq.c b/drivers/net/ethernet/mellanox/mlx4/eq.c
index 1116882..aae13ad 100644
--- a/drivers/net/ethernet/mellanox/mlx4/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/eq.c
@@ -221,6 +221,7 @@ static void mlx4_slave_event(struct mlx4_dev *dev, int slave,
 	slave_event(dev, slave, eqe);
 }
 
+#if defined(CONFIG_SMP)
 static void mlx4_set_eq_affinity_hint(struct mlx4_priv *priv, int vec)
 {
 	int hint_err;
@@ -234,6 +235,7 @@ static void mlx4_set_eq_affinity_hint(struct mlx4_priv *priv, int vec)
 	if (hint_err)
 		mlx4_warn(dev, "irq_set_affinity_hint failed, err %d\n", hint_err);
 }
+#endif
 
 int mlx4_gen_pkey_eqe(struct mlx4_dev *dev, int slave, u8 port)
 {
@@ -1207,8 +1209,8 @@ int mlx4_init_eq_table(struct mlx4_dev *dev)
 					     MLX4_NUM_ASYNC_EQE + MLX4_NUM_SPARE_EQE,
 					     0, &priv->eq_table.eq[MLX4_EQ_ASYNC]);
 		} else {
-#ifdef CONFIG_RFS_ACCEL
 			struct mlx4_eq  *eq = &priv->eq_table.eq[i];
+#ifdef CONFIG_RFS_ACCEL
 			int port = find_first_bit(eq->actv_ports.ports,
 						  dev->caps.num_ports) + 1;
 
 
-- 
1.7.1



Re: [PATCH 1/2] Bluetooth: Add reset_resume function

2015-06-02 Thread Oliver Neukum
On Mon, 2015-06-01 at 18:14 -0700, Laura Abbott wrote:
 Bluetooth devices off of some buses such as USB may lose power across
 suspend/resume. When this happens, drivers may need to have the setup
 function called again and behave differently than a cold power on.

Yes, but what is the point? We use reset_resume() to retain
some features of a device across a loss of power.
If power is lost, all settings are gone and all connections
are broken. So what is the difference compared to a plug out/in
cycle?

Regards
Oliver





[PATCH RFC 2/2] cdc_ncm: split the cdc_ncm_ndp function

2015-06-02 Thread Enrico Mioso
Split this function into two new ones:
- cdc_ncm_ndp16_find: finds an NDP block in the chain matching a supplied
  signature; a pointer to it is returned in case of success;
- cdc_ncm_ndp16_push: creates a new NDP block and adds it to the skb.

cdc_ncm_ndp16_push refers to the last NDP visited by cdc_ncm_ndp16_find, hence
this code is stateful.

Signed-off-by: Enrico Mioso <mrkiko...@gmail.com>
---
 drivers/net/usb/cdc_ncm.c | 30 +-
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/drivers/net/usb/cdc_ncm.c b/drivers/net/usb/cdc_ncm.c
index 8067b8f..3c837d6 100644
--- a/drivers/net/usb/cdc_ncm.c
+++ b/drivers/net/usb/cdc_ncm.c
@@ -980,7 +980,7 @@ static void cdc_ncm_align_tail(struct sk_buff *skb, size_t modulus, size_t remai
 /* return a pointer to a valid struct usb_cdc_ncm_ndp16 of type sign, possibly
  * allocating a new one within skb
  */
-static struct usb_cdc_ncm_ndp16 *cdc_ncm_ndp(struct cdc_ncm_ctx *ctx, struct sk_buff *skb, __le32 sign, size_t reserve)
+static struct usb_cdc_ncm_ndp16 *cdc_ncm_ndp16_find(struct cdc_ncm_ctx *ctx, struct sk_buff *skb, __le32 sign)
 {
 	struct usb_cdc_ncm_ndp16 *ndp16 = NULL;
 	struct usb_cdc_ncm_nth16 *nth16 = (void *)skb->data;
@@ -988,12 +988,20 @@ static struct usb_cdc_ncm_ndp16 *cdc_ncm_ndp(struct cdc_ncm_ctx *ctx, struct sk_
 
 	/* follow the chain of NDPs, looking for a match */
 	while (ndpoffset) {
-		ndp16 = (struct usb_cdc_ncm_ndp16 *)(skb->data + ndpoffset);
-		if  (ndp16->dwSignature == sign)
-			return ndp16;
+		ctx->tx_curr_ndp16 = (struct usb_cdc_ncm_ndp16 *)(skb->data + ndpoffset);
+		if (ctx->tx_curr_ndp16->dwSignature == sign)
+			ndp16 = ctx->tx_curr_ndp16;
 		ndpoffset = le16_to_cpu(ndp16->wNextNdpIndex);
 	}
+
+	return ndp16;
+}
 
+static struct usb_cdc_ncm_ndp16 *cdc_ncm_ndp16_push(struct cdc_ncm_ctx *ctx, struct sk_buff *skb, __le32 sign, size_t reserve)
+{
+	struct usb_cdc_ncm_ndp16 *ndp16 = ctx->tx_curr_ndp16;
+	struct usb_cdc_ncm_nth16 *nth16 = (void *)skb->data;
+
 	/* align new NDP */
 	cdc_ncm_align_tail(skb, ctx->tx_ndp_modulus, 0, ctx->tx_max);
 
@@ -1070,11 +1078,15 @@ cdc_ncm_fill_tx_frame(struct usbnet *dev, struct sk_buff *skb, __le32 sign)
 			break;
 		}
 
-		/* get the appropriate NDP for this skb */
-		ndp16 = cdc_ncm_ndp(ctx, skb_out, sign, skb->len + ctx->tx_modulus + ctx->tx_remainder);
-
-		/* align beginning of next frame */
-		cdc_ncm_align_tail(skb_out,  ctx->tx_modulus, ctx->tx_remainder, ctx->tx_max);
+		/* search for the appropriate NDP for this skb */
+		ndp16 = cdc_ncm_ndp16_find(ctx, skb_out, sign);
+
+		if (ndp16 == NULL)
+		{
+			ndp16 = cdc_ncm_ndp16_push(ctx, skb_out, sign, skb->len + ctx->tx_modulus + ctx->tx_remainder);
+		}
+		else
+			cdc_ncm_align_tail(skb_out,  ctx->tx_modulus, ctx->tx_remainder, ctx->tx_max);
 
 		/* check if we had enough room left for both NDP and frame */
 		if (!ndp16 || skb_out->len + skb->len > ctx->tx_max) {
-- 
2.4.2



[PATCH] ipv4: inet_bind: check the addr_len first

2015-06-02 Thread Denis Kirjanov
Perform the address length check first, before calling
the proto specific bind() function

Signed-off-by: Denis Kirjanov <k...@linux-powerpc.org>
---
 net/ipv4/af_inet.c |7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 6ad0f7a..333e2fa 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -426,14 +426,15 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 	int chk_addr_ret;
 	int err;
 
+	err = -EINVAL;
+	if (addr_len < sizeof(struct sockaddr_in))
+		goto out;
+
 	/* If the socket has its own bind function then use it. (RAW) */
 	if (sk->sk_prot->bind) {
 		err = sk->sk_prot->bind(sk, uaddr, addr_len);
 		goto out;
 	}
-	err = -EINVAL;
-	if (addr_len < sizeof(struct sockaddr_in))
-		goto out;
 
 	if (addr->sin_family != AF_INET) {
 		/* Compatibility games : accept AF_UNSPEC (mapped to AF_INET)
-- 
1.7.10.4
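
The reordering in the patch above, validating the length before handing the address to a protocol-specific handler, can be sketched in userspace C. Everything here is hypothetical (struct model_addr, bind_fn, model_bind, model_raw_bind); it only illustrates the ordering and is not the kernel's inet_bind().

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical fixed-size address, standing in for struct sockaddr_in. */
struct model_addr {
	unsigned short family;
	unsigned short port;
	unsigned int addr;
};

/* Hypothetical protocol-specific bind hook. */
typedef int (*bind_fn)(const void *addr, size_t len);

/* Example hook: returns the length it was given so a caller can see it ran. */
int model_raw_bind(const void *addr, size_t len)
{
	(void)addr;
	return (int)len;
}

/*
 * Check the length first, so that no handler ever sees a buffer
 * shorter than the address structure it expects.
 */
int model_bind(bind_fn proto_bind, const void *addr, size_t addr_len)
{
	if (addr_len < sizeof(struct model_addr))
		return -EINVAL;

	if (proto_bind)
		return proto_bind(addr, addr_len);

	return 0;	/* generic path, omitted in this sketch */
}
```

With this ordering a too-short buffer is rejected with -EINVAL whether or not a protocol hook is installed.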



Re: [RFC] [PATCH] net: socket: Fix the wrong returns for recvmsg and sendmsg

2015-06-02 Thread Willy Tarreau
On Tue, Jun 02, 2015 at 02:43:54PM +0800, Junling Zheng wrote:
 On 2015/6/2 14:27, Greg KH wrote:
  On Mon, Jun 01, 2015 at 10:23:57PM -0700, David Miller wrote:
  From: Junling Zheng zhengjunl...@huawei.com
  Date: Tue, 2 Jun 2015 12:05:32 +0800
 
  So, the problem commit is 281c9c36 (net: compat: Update
  get_compat_msghdr() to match copy_msghdr_from_user() behaviour),
  which fixes db31c55a6fb2 and brings the get_compat_msghdr() in line
  with copy_msghdr_from_user().
 
  Upstream this got fixed by:
 
  08adb7dabd4874cc5666b4490653b26534702ce0
 
  So the part that makes us not unconditionally return -EFAULT needs
  to be backported, and that's probably equivalent to the patch
  you proposed which therefore should be applied.
  
  Ok, thanks, now applied.
  
 
 Maybe other stable versions also need this fix :)

Yes, from what I'm seeing, at least 3.2 and 2.6.32 need it as well.

Thanks,
Willy



Re: [PATCH net] net/mlx4: need to call close fw if alloc icm is called twice

2015-06-02 Thread Or Gerlitz
On Mon, Jun 1, 2015 at 5:41 PM,  cls...@linux.vnet.ibm.com wrote:
 --- a/drivers/net/ethernet/mellanox/mlx4/main.c
 +++ b/drivers/net/ethernet/mellanox/mlx4/main.c
 @@ -2837,6 +2837,7 @@ slave_start:
   
 existing_vfs,
   reset_flow);

 +   mlx4_close_fw(dev);
 mlx4_cmd_cleanup(dev, MLX4_CMD_CLEANUP_ALL);
 dev->flags = dev_flags;
 if (!SRIOV_VALID_STATE(dev->flags)) {


Acked-by: Or Gerlitz ogerl...@mellanox.com


Re: [PATCH net] net/mlx4: double free of dev_vfs

2015-06-02 Thread Or Gerlitz
On Mon, Jun 1, 2015 at 5:41 PM,  cls...@linux.vnet.ibm.com wrote:
 --- a/drivers/net/ethernet/mellanox/mlx4/main.c
 +++ b/drivers/net/ethernet/mellanox/mlx4/main.c
 @@ -2685,6 +2685,7 @@ disable_sriov:
  free_mem:
 dev->persist->num_vfs = 0;
 kfree(dev->dev_vfs);
 +   dev->dev_vfs = NULL;
 return dev_flags & ~MLX4_FLAG_MASTER;
  }

Acked-by: Or Gerlitz ogerl...@mellanox.com
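
The fix quoted above follows a common pattern: clear the pointer right after freeing it, so that a second pass through the cleanup path frees NULL, which is a no-op. A userspace sketch with hypothetical names (struct model_dev, model_free_vfs), using free() in place of kfree():

```c
#include <stdlib.h>

/* Hypothetical device state owning a dynamically allocated VF table. */
struct model_dev {
	int num_vfs;
	int *dev_vfs;
};

/*
 * Release the VF table. Resetting the pointer makes the function safe
 * to call again: free(NULL) does nothing, so there is no double free.
 */
void model_free_vfs(struct model_dev *dev)
{
	dev->num_vfs = 0;
	free(dev->dev_vfs);
	dev->dev_vfs = NULL;
}
```

Without the final assignment, a second call would pass the stale pointer to free() again.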


[PATCH 00/10] net: thunderx: fix problems reported by static check tools

2015-06-02 Thread Aleksey Makarov
These are fixes for the problems that were reported by static check tools.

Aleksey Makarov (9):
  net: thunderx: fix constants
  net: thunderx: introduce a function for mailbox access
  net: thunderx: rework mac address handling
  net: thunderx: delete unused variables
  net: thunderx: add static
  net: thunderx: fix nicvf_set_rxfh()
  net: thunderx: remove unneeded type conversions
  net: thunderx: check if memory allocation was successful
  net: thunderx: use GFP_KERNEL in thread context

Robert Richter (1):
  net: thunderx: Cleanup duplicate NODE_ID macros, add nic_get_node_id()

 drivers/net/ethernet/cavium/thunder/nic.h  | 16 +++--
 drivers/net/ethernet/cavium/thunder/nic_main.c | 12 +---
 .../net/ethernet/cavium/thunder/nicvf_ethtool.c|  3 +-
 drivers/net/ethernet/cavium/thunder/nicvf_main.c   | 73 +++---
 drivers/net/ethernet/cavium/thunder/nicvf_queues.c |  9 +--
 drivers/net/ethernet/cavium/thunder/thunder_bgx.c  | 18 +++---
 drivers/net/ethernet/cavium/thunder/thunder_bgx.h  |  7 +--
 7 files changed, 67 insertions(+), 71 deletions(-)

-- 
2.4.1



Re: [RFC net-next 0/3] IP imposition of per-nh MPLS encap

2015-06-02 Thread Eric W. Biederman
Robert Shearman rshea...@brocade.com writes:

 In order to be able to function as a Label Edge Router in an MPLS
 network, it is necessary to be able to take IP packets and impose an
 MPLS encap and forward them out. The traditional approach of setting
 up an interface for each tunnel endpoint doesn't scale for the
 common MPLS use-cases where each IP route tends to be assigned a
 different label as encap.

 The solution suggested here for further discussion is to provide the
 facility to define encap data on a per-nexthop basis using a new
 netlink attribute, RTA_ENCAP, which would be opaque to the IPv4/IPv6
 forwarding code, but interpreted by the virtual interface assigned to
 the nexthop.

 A new ipmpls interface type is defined to show the use of this
 facility to allow IP packets to be imposed with an MPLS
 encap. However, the facility is designed to be general enough to be
 used by any encapsulation/tunneling mechanism that has similar
 requirements of high-scale, high-variation-of-encap.

I am still digging into the details but adding a new network device to
make this possible is very undesirable.

It is a pain point.  Those network devices get to be a major source of
memory consumption when there are 4K network namespaces in existence.

It is conceptually wrong.  The network device will never be used as an
ordinary network device.  All the network device gives you is the
ability to avoid creating an enumeration of different kinds of
encapsulation.

Eric


Re: [PATCH net-next 5/5] rocker: remove support for legacy VLAN ndo ops

2015-06-02 Thread Scott Feldman
On Tue, Jun 2, 2015 at 9:58 AM, roopa ro...@cumulusnetworks.com wrote:
 On 6/2/15, 7:30 AM, Scott Feldman wrote:

 On Tue, Jun 2, 2015 at 4:43 AM, Jamal Hadi Salim j...@mojatatu.com wrote:

 On 06/02/15 03:10, Scott Feldman wrote:


 Actually, we're now consistent with bridge man page which says master
 is the default.

 What we want, I believe, is to adjust what the man page says (and the
 bridge vlan command itself), by making the default master and self.
 The kernel and driver are fine, it's the default in the bridge command
 that needs adjusting.  Once we do this, we'll be back to transparent
 with software-only bridge.

 Question to ask when looking at something of this nature:
 Will it work with no surprises if you used today's unmodified app?
 The default behavior shouldn't change and unfortunately it does here.

 The default behavior does change, yes, but there shouldn't be any
 surprises even if using today's unmodified app.  The reason why is no
 in-kernel driver is using ndo_bridge_setlink for VLAN setup.  The
 three drivers that have ndo_bridge_setlink use it to set hwmode to
 VEPA|VEB.  For VLAN setup, they use the (default master) bridge's
 ndo_bridge_setlink->ndo_vlan_rx_add_vid.  If the default changes from
 master to master|self, the bridge's
 ndo_bridge_setlink->ndo_vlan_rx_add_vid is still called for those
 drivers using ndo_vlan_rx_add_vid, and if they implement
 ndo_bridge_setlink, they'll get called a second time but will noop
 because there will be no IFLA_BRIDGE_MODE (hwmode) attr to process.

 So it comes down to two choices:

 1) break ABI, which is inconsequential for in-kernel drivers and
 preserve (iproute2) command transparency, or

 2) embrace existing behavior which is consistent with man pages but
 breaks command transparency for any driver implementing
 ndo_bridge_setlink for VLAN setup, which currently is just rocker.  I
 can see the DSA going down this path also based on another concurrent
 thread.

 We're at option 2) right now.

 It is not just iproute2 - since this is breaking ABI expectations.
 Looking at some app i wrote a while back based on analyzing kernel
 expectations at the time, I see the following logic:

 user can set master or self on command line.
 ...
 
 if (user DID NOT set master_on || user set self on)
 then set self to on

 iow, current behavior:
   01: master is only set if user explicitly asked.
   11: master|self when user explicitly sets both
   10: self is on by default when the user doesnt specify anything
   00: and the last option is to have none set which is not
   possible since we have defaults.

 cheers,
 jamal


 So this is very similar to iproute2 - if nothing is set
 it defaults to self.

 Ha, you're giving the behavior for bridge fdb command, where self is
 the default.


 Oh...i did not realize this was the case either. Thats unfortunate.


 For bridge link and bridge vlan, the default is master.  The user
 must explicitly specify self to act on the device side of the port.

 It's unfortunate the iproute2 defaults aren't consistent between
 commands.  Maybe someone knows the history here and can explain.



 scott, this brings back the discussion you and i had over the revert of my
 patches.. (commit id's at the end of this email)...
 which used to seamlessly offload to switchdev from bridge driver if the port
 was a switch port (similar to stp state offload).

Your patch tried to do the same thing that the bridge's
ndo_bridge_setlink/dellink is doing which is using the handler for
MASTER to also set SELF stuff, when SELF was not specified.  I don't
feel we should be overriding the application defaults in the kernel;
instead, we should change the application if we want different
behavior.  The kernel should treat the two sides of the port
independently (that's the basic algo in rtnetlink.c handlers for
MASTER/SELF things).  When you start doing kernel SELF things in the
MASTER path, the application has lost the ability to address each side
of the port independently.

 'self' used to exist before switchdev infra came in. My suggestion was to
 use it where required...but not build the switchdev api on the presence of
 'self'. switchdev layer should be consistent across...all fib/fdb/neigh
 layers.

I don't understand why you're bringing up fib/neigh because there is
no master|self form for those.

The master|self objects are bridge fdb, settings, and vlans.  To be
clear, they are PF_BRIDGE handlers for:

PF_BRIDGE:RTM_NEWNEIGH: add fdb entry
PF_BRIDGE:RTM_DELNEIGH: del fdb entry
PF_BRIDGE:RTM_SETLINK: set bridge setting or add VLAN
PF_BRIDGE:RTM_DELLINK: del VLAN

The net/core/rtnetlink.c code for these _is_ consistent right now.
They all perform this same basic algorithm:

handler()
    if (!flags || flags & MASTER)
        if (master && master->op->foo)
            master->op->foo();
    if (flags & SELF)
        if (port->op->foo)
            port->op->foo();

This lets the application set MASTER and/or SELF 
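
As a rough illustration of the handler algorithm above, here is a minimal
userspace sketch in C. It is not the rtnetlink.c code: the flag names, the
struct layouts, and the count_call() helper are all hypothetical stand-ins,
and error handling is left out. It only shows that no flags means MASTER,
and that each side of the port is called independently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits standing in for the MASTER/SELF request flags. */
enum { FLAG_MASTER = 1 << 0, FLAG_SELF = 1 << 1 };

/* Hypothetical ops table: one optional callback per device. */
struct dev_ops {
	int (*foo)(void *ctx);
};

struct device {
	const struct dev_ops *ops;
	struct device *master;	/* bridge the port belongs to, or NULL */
	void *ctx;
};

/* Example callback: counts how often a device was called. */
static int count_call(void *ctx)
{
	++*(int *)ctx;
	return 0;
}

/*
 * No flags is treated as MASTER. Each side is called only when its flag
 * selects it and the callback exists. Returns the number of callbacks run.
 */
static int dispatch(struct device *port, unsigned int flags)
{
	int calls = 0;

	assert(port != NULL && port->ops != NULL);

	if (!flags || (flags & FLAG_MASTER)) {
		struct device *m = port->master;

		if (m && m->ops && m->ops->foo) {
			m->ops->foo(m->ctx);
			calls++;
		}
	}
	if (flags & FLAG_SELF) {
		if (port->ops->foo) {
			port->ops->foo(port->ctx);
			calls++;
		}
	}
	return calls;
}
```

With a port enslaved to a bridge, dispatch(&port, 0) reaches only the
bridge, FLAG_SELF reaches only the port, and FLAG_MASTER | FLAG_SELF
reaches both.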

Re: [PATCH v4 00/25] Convert the posix_clock_operations and k_clock structure to ready for 2038

2015-06-02 Thread Thomas Gleixner
On Mon, 1 Jun 2015, Baolin Wang wrote:

You failed to thread the patch series again 


[PATCH 04/10] net: thunderx: rework mac address handling

2015-06-02 Thread Aleksey Makarov
This fixes sparse message:

drivers/net/ethernet/cavium/thunder/nicvf_main.c:385:40: sparse: cast to
restricted __le64

Reported-by: kbuild test robot fengguang...@intel.com
Signed-off-by: Aleksey Makarov aleksey.maka...@caviumnetworks.com
---
 drivers/net/ethernet/cavium/thunder/nic.h | 4 ++--
 drivers/net/ethernet/cavium/thunder/nic_main.c| 8 +---
 drivers/net/ethernet/cavium/thunder/nicvf_main.c  | 8 ++--
 drivers/net/ethernet/cavium/thunder/thunder_bgx.c | 4 ++--
 drivers/net/ethernet/cavium/thunder/thunder_bgx.h | 4 ++--
 5 files changed, 9 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nic.h 
b/drivers/net/ethernet/cavium/thunder/nic.h
index 4f426db..6479ce2 100644
--- a/drivers/net/ethernet/cavium/thunder/nic.h
+++ b/drivers/net/ethernet/cavium/thunder/nic.h
@@ -301,7 +301,7 @@ struct nic_cfg_msg {
u8vf_id;
u8tns_mode;
u8node_id;
-   u64   mac_addr;
+   u8mac_addr[ETH_ALEN];
 };
 
 /* Qset configuration */
@@ -331,7 +331,7 @@ struct sq_cfg_msg {
 struct set_mac_msg {
u8msg;
u8vf_id;
-   u64   addr;
+   u8mac_addr[ETH_ALEN];
 };
 
 /* Set Maximum frame size */
diff --git a/drivers/net/ethernet/cavium/thunder/nic_main.c 
b/drivers/net/ethernet/cavium/thunder/nic_main.c
index 3ca7ad8..6e0c031 100644
--- a/drivers/net/ethernet/cavium/thunder/nic_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nic_main.c
@@ -492,7 +492,6 @@ static void nic_handle_mbx_intr(struct nicpf *nic, int vf)
u64 *mbx_data;
u64 mbx_addr;
u64 reg_addr;
-   u64 mac_addr;
int bgx, lmac;
int i;
int ret = 0;
@@ -555,12 +554,7 @@ static void nic_handle_mbx_intr(struct nicpf *nic, int vf)
lmac = mbx.mac.vf_id;
bgx = NIC_GET_BGX_FROM_VF_LMAC_MAP(nic->vf_lmac_map[lmac]);
lmac = NIC_GET_LMAC_FROM_VF_LMAC_MAP(nic->vf_lmac_map[lmac]);
-#ifdef __BIG_ENDIAN
-   mac_addr = cpu_to_be64(mbx.nic_cfg.mac_addr)  16;
-#else
-   mac_addr = cpu_to_be64(mbx.nic_cfg.mac_addr)  16;
-#endif
-   bgx_set_lmac_mac(nic->node, bgx, lmac, (u8 *)&mac_addr);
+   bgx_set_lmac_mac(nic->node, bgx, lmac, mbx.mac.mac_addr);
break;
case NIC_MBOX_MSG_SET_MAX_FRS:
ret = nic_update_hw_frs(nic, mbx.frs.max_frs,
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c 
b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index 989f005..54bba86 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -197,8 +197,7 @@ static void  nicvf_handle_mbx_intr(struct nicvf *nic)
nic->vf_id = mbx.nic_cfg.vf_id & 0x7F;
nic->tns_mode = mbx.nic_cfg.tns_mode & 0x7F;
nic->node = mbx.nic_cfg.node_id;
-   ether_addr_copy(nic->netdev->dev_addr,
-   (u8 *)&mbx.nic_cfg.mac_addr);
+   ether_addr_copy(nic->netdev->dev_addr, mbx.nic_cfg.mac_addr);
nic->link_up = false;
nic->duplex = 0;
nic->speed = 0;
@@ -248,13 +247,10 @@ static void  nicvf_handle_mbx_intr(struct nicvf *nic)
 static int nicvf_hw_set_mac_addr(struct nicvf *nic, struct net_device *netdev)
 {
union nic_mbx mbx = {};
-   int i;
 
mbx.mac.msg = NIC_MBOX_MSG_SET_MAC;
mbx.mac.vf_id = nic->vf_id;
-   for (i = 0; i < ETH_ALEN; i++)
-   mbx.mac.addr = (mbx.mac.addr << 8) |
-netdev->dev_addr[i];
+   ether_addr_copy(mbx.mac.mac_addr, netdev->dev_addr);
 
return nicvf_send_msg_to_pf(nic, &mbx);
 }
diff --git a/drivers/net/ethernet/cavium/thunder/thunder_bgx.c 
b/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
index cde604a..a58924c 100644
--- a/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
+++ b/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
@@ -163,7 +163,7 @@ void bgx_get_lmac_link_state(int node, int bgx_idx, int 
lmacid, void *status)
 }
 EXPORT_SYMBOL(bgx_get_lmac_link_state);
 
-const char *bgx_get_lmac_mac(int node, int bgx_idx, int lmacid)
+const u8 *bgx_get_lmac_mac(int node, int bgx_idx, int lmacid)
 {
struct bgx *bgx = bgx_vnic[(node * MAX_BGX_PER_CN88XX) + bgx_idx];
 
@@ -174,7 +174,7 @@ const char *bgx_get_lmac_mac(int node, int bgx_idx, int 
lmacid)
 }
 EXPORT_SYMBOL(bgx_get_lmac_mac);
 
-void bgx_set_lmac_mac(int node, int bgx_idx, int lmacid, const char *mac)
+void bgx_set_lmac_mac(int node, int bgx_idx, int lmacid, const u8 *mac)
 {
struct bgx *bgx = bgx_vnic[(node * MAX_BGX_PER_CN88XX) + bgx_idx];
 
diff --git a/drivers/net/ethernet/cavium/thunder/thunder_bgx.h 
b/drivers/net/ethernet/cavium/thunder/thunder_bgx.h
index f9e2170..ba4f53b 100644
--- a/drivers/net/ethernet/cavium/thunder/thunder_bgx.h
+++ b/drivers/net/ethernet/cavium/thunder/thunder_bgx.h
@@ -183,8 +183,8 @@ enum MCAST_MODE {
 void 

[PATCH 07/10] net: thunderx: fix nicvf_set_rxfh()

2015-06-02 Thread Aleksey Makarov
This fixes a copypaste bug that was discovered by a static analysis
tool:

The patch 4863dea3fab0: net: Adding support for Cavium ThunderX
network controller from May 26, 2015, leads to the following static
checker warning:

drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c:517
nicvf_set_rxfh()
warn: we tested 'hkey' before and it was 'false'

drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c
   506  /* We do not allow change in unsupported parameters */
   507  if (hkey ||

We return here.

   508  (hfunc != ETH_RSS_HASH_NO_CHANGE && hfunc !=
ETH_RSS_HASH_TOP))
   509  return -EOPNOTSUPP;
   510
   511  rss->enable = true;
   512  if (indir) {
   513  for (idx = 0; idx < rss->rss_size; idx++)
   514  rss->ind_tbl[idx] = indir[idx];
   515  }
   516
   517  if (hkey) {

So this is dead code.

   518  memcpy(rss->key, hkey, RSS_HASH_KEY_SIZE *
sizeof(u64));
   519  nicvf_set_rss_key(nic);
   520  }
   521
   522  nicvf_config_rss(nic);
   523  return 0;
   524  }

regards,
dan carpenter

Reported-by: Dan Carpenter dan.carpen...@oracle.com
Signed-off-by: Aleksey Makarov aleksey.maka...@caviumnetworks.com
---
 drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c 
b/drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c
index 0fc4a53..16bd2d7 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c
@@ -504,8 +504,7 @@ static int nicvf_set_rxfh(struct net_device *dev, const u32 
*indir,
}
 
/* We do not allow change in unsupported parameters */
-   if (hkey ||
-   (hfunc != ETH_RSS_HASH_NO_CHANGE && hfunc != ETH_RSS_HASH_TOP))
+   if (hfunc != ETH_RSS_HASH_NO_CHANGE && hfunc != ETH_RSS_HASH_TOP)
return -EOPNOTSUPP;
 
rss->enable = true;
-- 
2.4.1
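
To make the control flow after this fix easier to see, here is a small
userspace sketch in C. It is only a model, not the driver code: the
rss_state struct, the HASH_* values and KEY_WORDS are hypothetical
stand-ins for the driver's types and the ETH_RSS_HASH_* constants. The
point is that once the early return no longer tests hkey, the later
"if (hkey)" branch becomes reachable.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KEY_WORDS 5	/* hypothetical key length in 64-bit words */

/* Hypothetical stand-ins for the ETH_RSS_HASH_* values. */
enum { HASH_NO_CHANGE = 0, HASH_TOP = 1, HASH_XOR = 2 };

struct rss_state {
	uint64_t key[KEY_WORDS];
	int key_set;
};

static int set_rxfh(struct rss_state *rss, const uint8_t *hkey, uint8_t hfunc)
{
	assert(rss != NULL);

	/* Reject only unsupported hash functions; a key alone is allowed. */
	if (hfunc != HASH_NO_CHANGE && hfunc != HASH_TOP)
		return -EOPNOTSUPP;

	/* Reachable now that the early return does not test hkey. */
	if (hkey) {
		memcpy(rss->key, hkey, sizeof(rss->key));
		rss->key_set = 1;
	}
	return 0;
}
```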



[PATCH 03/10] net: thunderx: introduce a function for mailbox access

2015-06-02 Thread Aleksey Makarov
This fixes sparse message:

drivers/net/ethernet/cavium/thunder/nicvf_main.c:153:25: sparse: cast to
restricted __le64

Reported-by: kbuild test robot fengguang...@intel.com
Signed-off-by: Aleksey Makarov aleksey.maka...@caviumnetworks.com
---
 drivers/net/ethernet/cavium/thunder/nicvf_main.c | 27 +++-
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c 
b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index f81182c..989f005 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -110,17 +110,23 @@ u64 nicvf_queue_reg_read(struct nicvf *nic, u64 offset, 
u64 qidx)
 
 /* VF - PF mailbox communication */
 
+static void nicvf_write_to_mbx(struct nicvf *nic, union nic_mbx *mbx)
+{
+   u64 *msg = (u64 *)mbx;
+
+   nicvf_reg_write(nic, NIC_VF_PF_MAILBOX_0_1 + 0, msg[0]);
+   nicvf_reg_write(nic, NIC_VF_PF_MAILBOX_0_1 + 8, msg[1]);
+}
+
 int nicvf_send_msg_to_pf(struct nicvf *nic, union nic_mbx *mbx)
 {
int timeout = NIC_MBOX_MSG_TIMEOUT;
int sleep = 10;
-   u64 *msg = (u64 *)mbx;
 
nic->pf_acked = false;
nic->pf_nacked = false;
 
-   nicvf_reg_write(nic, NIC_VF_PF_MAILBOX_0_1 + 0, msg[0]);
-   nicvf_reg_write(nic, NIC_VF_PF_MAILBOX_0_1 + 8, msg[1]);
+   nicvf_write_to_mbx(nic, mbx);
 
/* Wait for previous message to be acked, timeout 2sec */
while (!nic->pf_acked) {
@@ -146,12 +152,13 @@ int nicvf_send_msg_to_pf(struct nicvf *nic, union nic_mbx 
*mbx)
 static int nicvf_check_pf_ready(struct nicvf *nic)
 {
int timeout = 5000, sleep = 20;
+   union nic_mbx mbx = {};
+
+   mbx.msg.msg = NIC_MBOX_MSG_READY;
 
nic->pf_ready_to_rcv_msg = false;
 
-   nicvf_reg_write(nic, NIC_VF_PF_MAILBOX_0_1 + 0,
-   le64_to_cpu(NIC_MBOX_MSG_READY));
-   nicvf_reg_write(nic, NIC_VF_PF_MAILBOX_0_1 + 8, 1ULL);
+   nicvf_write_to_mbx(nic, &mbx);
 
while (!nic->pf_ready_to_rcv_msg) {
msleep(sleep);
@@ -368,7 +375,9 @@ int nicvf_set_real_num_queues(struct net_device *netdev,
 static int nicvf_init_resources(struct nicvf *nic)
 {
int err;
-   u64 mbx_addr = NIC_VF_PF_MAILBOX_0_1;
+   union nic_mbx mbx = {};
+
+   mbx.msg.msg = NIC_MBOX_MSG_CFG_DONE;
 
/* Enable Qset */
nicvf_qset_config(nic, true);
@@ -382,9 +391,7 @@ static int nicvf_init_resources(struct nicvf *nic)
}
 
/* Send VF config done msg to PF */
-   nicvf_reg_write(nic, mbx_addr, le64_to_cpu(NIC_MBOX_MSG_CFG_DONE));
-   mbx_addr += (NIC_PF_VF_MAILBOX_SIZE - 1) * 8;
-   nicvf_reg_write(nic, mbx_addr, 1ULL);
+   nicvf_write_to_mbx(nic, &mbx);
 
return 0;
 }
-- 
2.4.1
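
The helper introduced by this patch can be modelled in userspace roughly
as below. This is a sketch under assumptions, not the driver code: the
union layout, the fake_mailbox array and reg_write() are hypothetical
stand-ins for union nic_mbx and nicvf_reg_write(). It copies the 16-byte
message into two 64-bit words with memcpy() and writes them at offsets 0
and 8, so every caller goes through one code path.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte mailbox message; only the type byte is named. */
union mbx {
	struct {
		uint8_t msg;
	} hdr;
	uint8_t raw[16];
};

/* Fake register file standing in for the two 64-bit mailbox registers. */
static uint64_t fake_mailbox[2];

static void reg_write(unsigned int offset, uint64_t val)
{
	assert(offset == 0 || offset == 8);
	fake_mailbox[offset / 8] = val;
}

/* Single place where a message is pushed into the mailbox registers. */
static void write_to_mbx(const union mbx *mbx)
{
	uint64_t words[2];

	assert(mbx != NULL);
	memcpy(words, mbx->raw, sizeof(words));
	reg_write(0, words[0]);
	reg_write(8, words[1]);
}
```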



Re: [Intel-wired-lan] [PATCH 1/2] pci: Add dev_flags bit to access VPD through function 0

2015-06-02 Thread Rustad, Mark D
 On Jun 2, 2015, at 10:48 AM, Alexander Duyck alexander.h.du...@redhat.com 
 wrote:
 
 I'm pretty sure these could cause some serious errors if you direct assign 
 the device into a VM since you then end up with multiple devices sharing a 
 bus.  Also it would likely have side-effects on a LOM (Lan On Motherboard) as 
 it also shares the bus with multiple non-Ethernet devices.
 
 I believe you still need to add something like a check for 
 !pci_is_root_bus(dev->bus) before you attempt to grab function 0.  It 
 probably also wouldn't hurt to check the dev->multifunction bit before 
 running this code since it wouldn't make sense to go chasing down the VPD on 
 another function if the device doesn't have one.  You could probably do that 
 either as a part of this code, or perhaps put it in the quirk.

I'll look into those. I think you are right about more checks being needed. 
Thanks for the comments.
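
For reference, the two checks suggested above could be combined along
these lines. This is a userspace sketch only: struct bus, struct dev and
should_use_func0_vpd() are hypothetical simplifications, not the kernel's
struct pci_bus / struct pci_dev or any existing helper. It assumes a root
bus is one without a parent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the PCI bus and device structs. */
struct bus {
	struct bus *parent;	/* NULL parent means root bus */
};

struct dev {
	struct bus *bus;
	unsigned int func;	/* PCI function number */
	bool multifunction;
};

static bool is_root_bus(const struct bus *b)
{
	return b->parent == NULL;
}

/* Redirect VPD access to function 0 only when that is meaningful. */
static bool should_use_func0_vpd(const struct dev *d)
{
	assert(d != NULL && d->bus != NULL);

	if (is_root_bus(d->bus))	/* bus may be shared with other devices */
		return false;
	if (!d->multifunction)		/* no sibling function to ask */
		return false;
	return d->func != 0;		/* function 0 reads its own VPD */
}
```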

--
Mark Rustad, Networking Division, Intel Corporation



signature.asc
Description: Message signed with OpenPGP using GPGMail


Re: [RFC net-next 0/3] IP imposition of per-nh MPLS encap

2015-06-02 Thread Eric W. Biederman
roopa ro...@cumulusnetworks.com writes:

 On 6/1/15, 9:46 AM, Robert Shearman wrote:
 In order to be able to function as a Label Edge Router in an MPLS
 network, it is necessary to be able to take IP packets and impose an
 MPLS encap and forward them out. The traditional approach of setting
 up an interface for each tunnel endpoint doesn't scale for the
 common MPLS use-cases where each IP route tends to be assigned a
 different label as encap.

 The solution suggested here for further discussion is to provide the
 facility to define encap data on a per-nexthop basis using a new
 netlink attribute, RTA_ENCAP, which would be opaque to the IPv4/IPv6
 forwarding code, but interpreted by the virtual interface assigned to
 the nexthop.

 A new ipmpls interface type is defined to show the use of this
 facility to allow IP packets to be imposed with an MPLS
 encap. However, the facility is designed to be general enough to be
 used by any encapsulation/tunneling mechanism that has similar
 requirements of high-scale, high-variation-of-encap.

 RFC because:
   - IPv6 side not implemented
   - struct rtable shouldn't be bloated by pointer+uint
   - Hasn't been thoroughly tested yet

 Robert Shearman (3):
net: infra for per-nexthop encap data
ipv4: storing and retrieval of per-nexthop encap
mpls: new ipmpls device for encapsulating IP packets as mpls


 Glad to see these patches!.
 I have a similar series i have been working on...but no netdevice.
 A set of ops similar to iptun_encaps and I store encap data in fib_nh
 and in ip_route_output_slow i point the dst.output to the output func provided
 by one of the encap ops.

 I see the advantages of using a netdevice...and i see this align with patches
 from thomas.

roopa, I think I would prefer your patches.  I think using a netdevice
the way Robert is proposing is quite possibly a mess, from a scalability
standpoint.

Do you mean ip_route_input_slow?  There is no ip_route_output_slow.

Eric



[PATCH 05/10] net: thunderx: delete unused variables

2015-06-02 Thread Aleksey Makarov
They were left over from the development stage.

Reported-by: kbuild test robot fengguang...@intel.com
Signed-off-by: Aleksey Makarov aleksey.maka...@caviumnetworks.com
---
 drivers/net/ethernet/cavium/thunder/thunder_bgx.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/thunder_bgx.c 
b/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
index a58924c..83476f0 100644
--- a/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
+++ b/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
@@ -38,7 +38,7 @@ struct lmac {
boolis_sgmii;
struct delayed_work dwork;
struct workqueue_struct *check_link;
-} lmac;
+};
 
 struct bgx {
u8  bgx_id;
@@ -50,7 +50,7 @@ struct bgx {
int use_training;
void __iomem*reg_base;
struct pci_dev  *pdev;
-} bgx;
+};
 
 struct bgx *bgx_vnic[MAX_BGX_THUNDER];
 static int lmac_count; /* Total no of LMACs in system */
-- 
2.4.1



[PATCH 09/10] net: thunderx: check if memory allocation was successful

2015-06-02 Thread Aleksey Makarov
This fixes a coccinelle warning:

coccinelle warnings: (new ones prefixed by >>)

>> drivers/net/ethernet/cavium/thunder/nicvf_queues.c:360:1-11: alloc
   with no test, possible model on line 367

vim +360 drivers/net/ethernet/cavium/thunder/nicvf_queues.c

   354  err = nicvf_alloc_q_desc_mem(nic, &sq->dmem, q_len,
SND_QUEUE_DESC_SIZE,
   355   NICVF_SQ_BASE_ALIGN_BYTES);
   356  if (err)
   357  return err;
   358
   359  sq->desc = sq->dmem.base;
 > 360  sq->skbuff = kcalloc(q_len, sizeof(u64), GFP_ATOMIC);
   361  sq->head = 0;
   362  sq->tail = 0;
   363  atomic_set(&sq->free_cnt, q_len - 1);
   364  sq->thresh = SND_QUEUE_THRESH;
   365
   366  /* Preallocate memory for TSO segment's header */
 > 367  sq->tso_hdrs = dma_alloc_coherent(&nic->pdev->dev,
   368q_len *
TSO_HEADER_SIZE,
   369&sq->tso_hdrs_phys,
GFP_KERNEL);
   370  if (!sq->tso_hdrs)

Reported-by: kbuild test robot fengguang...@intel.com
Signed-off-by: Aleksey Makarov aleksey.maka...@caviumnetworks.com
---
 drivers/net/ethernet/cavium/thunder/nicvf_queues.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c 
b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
index 8929029..2ed7d1b 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
@@ -357,6 +357,8 @@ static int nicvf_init_snd_queue(struct nicvf *nic,
 
sq->desc = sq->dmem.base;
sq->skbuff = kcalloc(q_len, sizeof(u64), GFP_ATOMIC);
+   if (!sq->skbuff)
+   return -ENOMEM;
sq->head = 0;
sq->tail = 0;
atomic_set(&sq->free_cnt, q_len - 1);
-- 
2.4.1



[PATCH 10/10] net: thunderx: use GFP_KERNEL in thread context

2015-06-02 Thread Aleksey Makarov
GFP_KERNEL should be used in thread context, where the allocation is
allowed to sleep.

Signed-off-by: Aleksey Makarov aleksey.maka...@caviumnetworks.com
---
 drivers/net/ethernet/cavium/thunder/nicvf_queues.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c 
b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
index 2ed7d1b..d69d228 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
@@ -356,7 +356,7 @@ static int nicvf_init_snd_queue(struct nicvf *nic,
return err;
 
sq->desc = sq->dmem.base;
-   sq->skbuff = kcalloc(q_len, sizeof(u64), GFP_ATOMIC);
+   sq->skbuff = kcalloc(q_len, sizeof(u64), GFP_KERNEL);
if (!sq->skbuff)
return -ENOMEM;
sq->head = 0;
-- 
2.4.1



[PATCH 08/10] net: thunderx: remove unneeded type conversions

2015-06-02 Thread Aleksey Makarov
No need to cast void* to u8*: pointer arithmetic
works the same way for both.

Signed-off-by: Aleksey Makarov aleksey.maka...@caviumnetworks.com
---
 drivers/net/ethernet/cavium/thunder/nicvf_queues.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c 
b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
index 7f0e108..8929029 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
@@ -62,8 +62,7 @@ static int nicvf_alloc_q_desc_mem(struct nicvf *nic, struct 
q_desc_mem *dmem,
 
/* Align memory address for 'align_bytes' */
dmem->phys_base = NICVF_ALIGNED_ADDR((u64)dmem->dma, align_bytes);
-   dmem->base = (void *)((u8 *)dmem->unalign_base +
- (dmem->phys_base - dmem->dma));
+   dmem->base = dmem->unalign_base + (dmem->phys_base - dmem->dma);
return 0;
 }
 
-- 
2.4.1



[PATCH 06/10] net: thunderx: add static

2015-06-02 Thread Aleksey Makarov
This fixes sparse messages like this:

drivers/net/ethernet/cavium/thunder/nicvf_main.c:1141:26: sparse: symbol
'nicvf_get_stats64' was not declared. Should it be static?

Also remove unused declarations

Reported-by: kbuild test robot fengguang...@intel.com
Signed-off-by: Aleksey Makarov aleksey.maka...@caviumnetworks.com
---
 drivers/net/ethernet/cavium/thunder/nic.h  |  2 --
 drivers/net/ethernet/cavium/thunder/nicvf_main.c   | 28 ++
 drivers/net/ethernet/cavium/thunder/nicvf_queues.c |  2 +-
 drivers/net/ethernet/cavium/thunder/thunder_bgx.c  |  6 ++---
 4 files changed, 16 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nic.h 
b/drivers/net/ethernet/cavium/thunder/nic.h
index 6479ce2..a3b43e5 100644
--- a/drivers/net/ethernet/cavium/thunder/nic.h
+++ b/drivers/net/ethernet/cavium/thunder/nic.h
@@ -413,10 +413,8 @@ int nicvf_set_real_num_queues(struct net_device *netdev,
 int nicvf_open(struct net_device *netdev);
 int nicvf_stop(struct net_device *netdev);
 int nicvf_send_msg_to_pf(struct nicvf *vf, union nic_mbx *mbx);
-void nicvf_config_cpi(struct nicvf *nic);
 void nicvf_config_rss(struct nicvf *nic);
 void nicvf_set_rss_key(struct nicvf *nic);
-void nicvf_free_skb(struct nicvf *nic, struct sk_buff *skb);
 void nicvf_set_ethtool_ops(struct net_device *netdev);
 void nicvf_update_stats(struct nicvf *nic);
 void nicvf_update_lmac_stats(struct nicvf *nic);
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c 
b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index 54bba86..02da802 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -50,10 +50,6 @@ module_param(cpi_alg, int, S_IRUGO);
 MODULE_PARM_DESC(cpi_alg,
 "PFC algorithm (0=none, 1=VLAN, 2=VLAN16, 3=IP Diffserv)");
 
-static int nicvf_enable_msix(struct nicvf *nic);
-static netdev_tx_t nicvf_xmit(struct sk_buff *skb, struct net_device *netdev);
-static void nicvf_read_bgx_stats(struct nicvf *nic, struct bgx_stats_msg *bgx);
-
 static inline void nicvf_set_rx_frame_cnt(struct nicvf *nic,
  struct sk_buff *skb)
 {
@@ -174,6 +170,14 @@ static int nicvf_check_pf_ready(struct nicvf *nic)
return 1;
 }
 
+static void nicvf_read_bgx_stats(struct nicvf *nic, struct bgx_stats_msg *bgx)
+{
+   if (bgx->rx)
+   nic->bgx_stats.rx_stats[bgx->idx] = bgx->stats;
+   else
+   nic->bgx_stats.tx_stats[bgx->idx] = bgx->stats;
+}
+
 static void  nicvf_handle_mbx_intr(struct nicvf *nic)
 {
union nic_mbx mbx = {};
@@ -255,7 +259,7 @@ static int nicvf_hw_set_mac_addr(struct nicvf *nic, struct 
net_device *netdev)
return nicvf_send_msg_to_pf(nic, &mbx);
 }
 
-void nicvf_config_cpi(struct nicvf *nic)
+static void nicvf_config_cpi(struct nicvf *nic)
 {
union nic_mbx mbx = {};
 
@@ -267,7 +271,7 @@ void nicvf_config_cpi(struct nicvf *nic)
nicvf_send_msg_to_pf(nic, &mbx);
 }
 
-void nicvf_get_rss_size(struct nicvf *nic)
+static void nicvf_get_rss_size(struct nicvf *nic)
 {
union nic_mbx mbx = {};
 
@@ -575,7 +579,7 @@ static int nicvf_poll(struct napi_struct *napi, int budget)
  *
  * As of now only CQ errors are handled
  */
-void nicvf_handle_qs_err(unsigned long data)
+static void nicvf_handle_qs_err(unsigned long data)
 {
struct nicvf *nic = (struct nicvf *)data;
struct queue_set *qs = nic->qs;
@@ -1043,14 +1047,6 @@ static int nicvf_set_mac_address(struct net_device 
*netdev, void *p)
return 0;
 }
 
-static void nicvf_read_bgx_stats(struct nicvf *nic, struct bgx_stats_msg *bgx)
-{
-   if (bgx->rx)
-   nic->bgx_stats.rx_stats[bgx->idx] = bgx->stats;
-   else
-   nic->bgx_stats.tx_stats[bgx->idx] = bgx->stats;
-}
-
 void nicvf_update_lmac_stats(struct nicvf *nic)
 {
int stat = 0;
@@ -1141,7 +1137,7 @@ void nicvf_update_stats(struct nicvf *nic)
nicvf_update_sq_stats(nic, qidx);
 }
 
-struct rtnl_link_stats64 *nicvf_get_stats64(struct net_device *netdev,
+static struct rtnl_link_stats64 *nicvf_get_stats64(struct net_device *netdev,
struct rtnl_link_stats64 *stats)
 {
struct nicvf *nic = netdev_priv(netdev);
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c 
b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
index 1962466..7f0e108 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
@@ -228,7 +228,7 @@ static void nicvf_free_rbdr(struct nicvf *nic, struct rbdr 
*rbdr)
 
 /* Refill receive buffer descriptors with new buffers.
  */
-void nicvf_refill_rbdr(struct nicvf *nic, gfp_t gfp)
+static void nicvf_refill_rbdr(struct nicvf *nic, gfp_t gfp)
 {
struct queue_set *qs = nic->qs;
int rbdr_idx = qs->rbdr_cnt;
diff --git a/drivers/net/ethernet/cavium/thunder/thunder_bgx.c 

Re: [ovs-dev] [net-next RFC 00/14] Convert OVS tunnel vports to use regular net_devices

2015-06-02 Thread Flavio Leitner

It seems patch 01 didn't make it to ovs dev mailing list,
but it is available on netdev mailing list.
fbl



[PATCH 02/10] net: thunderx: fix constants

2015-06-02 Thread Aleksey Makarov
This fixes sparse messages like this:

drivers/net/ethernet/cavium/thunder/thunder_bgx.c:897:24: sparse:
constant 0x3000 is so big it is long

Reported-by: kbuild test robot fengguang...@intel.com
Signed-off-by: Aleksey Makarov aleksey.maka...@caviumnetworks.com
---
 drivers/net/ethernet/cavium/thunder/nicvf_main.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c 
b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index abd446e6..f81182c 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -326,11 +326,11 @@ static int nicvf_rss_init(struct nicvf *nic)
rss->enable = true;
 
/* Using the HW reset value for now */
-   rss->key[0] = 0xFEED0BADFEED0BAD;
-   rss->key[1] = 0xFEED0BADFEED0BAD;
-   rss->key[2] = 0xFEED0BADFEED0BAD;
-   rss->key[3] = 0xFEED0BADFEED0BAD;
-   rss->key[4] = 0xFEED0BADFEED0BAD;
+   rss->key[0] = 0xFEED0BADFEED0BADULL;
+   rss->key[1] = 0xFEED0BADFEED0BADULL;
+   rss->key[2] = 0xFEED0BADFEED0BADULL;
+   rss->key[3] = 0xFEED0BADFEED0BADULL;
+   rss->key[4] = 0xFEED0BADFEED0BADULL;
 
nicvf_set_rss_key(nic);
 
-- 
2.4.1
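
For readers following the archive, here is a minimal user-space sketch of what the suffix buys. Without a suffix, the type of a large hexadecimal constant is chosen from whichever integer types happen to fit on the target ABI, which is what sparse complains about; the ULL suffix pins it to unsigned long long everywhere. The names fill_rss_key and RSS_KEY_WORDS are hypothetical stand-ins, not driver symbols.

```c
#include <stdint.h>

/* Hypothetical stand-in for the driver's five 64-bit RSS key words. */
#define RSS_KEY_WORDS 5

static void fill_rss_key(uint64_t key[RSS_KEY_WORDS])
{
	int i;

	/* The ULL suffix makes the constant unsigned long long on every ABI. */
	for (i = 0; i < RSS_KEY_WORDS; i++)
		key[i] = 0xFEED0BADFEED0BADULL;
}
```

The value stored is the same with or without the suffix on a 64-bit build; the suffix only makes the intended type explicit.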



Re: [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls

2015-06-02 Thread Eric W. Biederman
Robert Shearman rshea...@brocade.com writes:

 Allow creating an mpls device for the purposes of encapsulating IP
 packets with:

   ip link add type ipmpls

 This device defines its per-nexthop encapsulation data as a stack of
 labels, in the same format as for RTA_NEWDST. It uses the encap data
 which will have been stored in the IP route to encapsulate the packet
 with that stack of labels, with the last label corresponding to a
 local label that defines how the packet will be sent out. The device
 sends packets over loopback to the local MPLS forwarding logic which
 performs all of the work.

 Stats are implemented, although any error in the sending via the real
 interface will be handled by the main mpls forwarding code and so not
 accounted by the interface.

Eeek stats!  Lots of unnecessary overhead.  If stats were ok we could
have simply reduced the cost of struct net_device to the point where it
would not matter.

This is really a bad hack for not getting in and being able to set
dst_output the way the xfrm infrastructure does.

What we really want here is xfrm-lite.  By lite I mean the tunnel
selection criteria is simple enough that it fits into the normal
routing table instead of having to do weird flow based magic that
is rarely needed.

I believe what we want is the xfrm stacking of dst entries.

Eric


 This implementation is based on an alternative earlier implementation
 by Eric W. Biederman.

 Signed-off-by: Robert Shearman rshea...@brocade.com
 ---
  include/uapi/linux/if_arp.h |   1 +
  net/mpls/Kconfig|   5 +
  net/mpls/Makefile   |   1 +
  net/mpls/af_mpls.c  |   2 +
  net/mpls/ipmpls.c   | 284 
 
  5 files changed, 293 insertions(+)
  create mode 100644 net/mpls/ipmpls.c

 diff --git a/include/uapi/linux/if_arp.h b/include/uapi/linux/if_arp.h
 index 4d024d75d64b..17d669fd1781 100644
 --- a/include/uapi/linux/if_arp.h
 +++ b/include/uapi/linux/if_arp.h
 @@ -88,6 +88,7 @@
  #define ARPHRD_IEEE80211_RADIOTAP 803/* IEEE 802.11 + radiotap 
 header */
  #define ARPHRD_IEEE802154  804
  #define ARPHRD_IEEE802154_MONITOR 805/* IEEE 802.15.4 network 
 monitor */
 +#define ARPHRD_MPLS  806 /* IP and IPv6 over MPLS tunnels */
  
  #define ARPHRD_PHONET820 /* PhoNet media type
 */
  #define ARPHRD_PHONET_PIPE 821   /* PhoNet pipe header   
 */
 diff --git a/net/mpls/Kconfig b/net/mpls/Kconfig
 index 17bde799c854..5264da94733a 100644
 --- a/net/mpls/Kconfig
 +++ b/net/mpls/Kconfig
 @@ -27,4 +27,9 @@ config MPLS_ROUTING
   help
Add support for forwarding of mpls packets.
  
 +config MPLS_IPTUNNEL
 + tristate MPLS: IP over MPLS tunnel support
 + help
 +  A network device that encapsulates ip packets as mpls
 +
  endif # MPLS
 diff --git a/net/mpls/Makefile b/net/mpls/Makefile
 index 65bbe68c72e6..3a93c14b23c5 100644
 --- a/net/mpls/Makefile
 +++ b/net/mpls/Makefile
 @@ -3,5 +3,6 @@
  #
  obj-$(CONFIG_NET_MPLS_GSO) += mpls_gso.o
  obj-$(CONFIG_MPLS_ROUTING) += mpls_router.o
 +obj-$(CONFIG_MPLS_IPTUNNEL) += ipmpls.o
  
  mpls_router-y := af_mpls.o
 diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
 index 7b3f732269e4..68bdfbdddfaf 100644
 --- a/net/mpls/af_mpls.c
 +++ b/net/mpls/af_mpls.c
 @@ -615,6 +615,7 @@ int nla_put_labels(struct sk_buff *skb, int attrtype,
  
   return 0;
  }
 +EXPORT_SYMBOL(nla_put_labels);
  
  int nla_get_labels(const struct nlattr *nla,
  u32 max_labels, u32 *labels, u32 label[])
 @@ -660,6 +661,7 @@ int nla_get_labels(const struct nlattr *nla,
   *labels = nla_labels;
   return 0;
  }
 +EXPORT_SYMBOL(nla_get_labels);
  
  static int rtm_to_route_config(struct sk_buff *skb,  struct nlmsghdr *nlh,
  struct mpls_route_config *cfg)
 diff --git a/net/mpls/ipmpls.c b/net/mpls/ipmpls.c
 new file mode 100644
 index ..cf6894ae0c61
 --- /dev/null
 +++ b/net/mpls/ipmpls.c
 @@ -0,0 +1,284 @@
 +#include <linux/types.h>
 +#include <linux/netdevice.h>
 +#include <linux/if_vlan.h>
 +#include <linux/if_arp.h>
 +#include <linux/ip.h>
 +#include <linux/ipv6.h>
 +#include <linux/module.h>
 +#include <linux/mpls.h>
 +#include "internal.h"
 +
 +static LIST_HEAD(ipmpls_dev_list);
 +
 +#define MAX_NEW_LABELS 2
 +
 +struct ipmpls_dev_priv {
 + struct net_device *out_dev;
 + struct list_head list;
 + struct net_device *dev;
 +};
 +
 +static netdev_tx_t ipmpls_dev_xmit(struct sk_buff *skb, struct net_device 
 *dev)
 +{
 + struct ipmpls_dev_priv *priv = netdev_priv(dev);
 + struct net_device *out_dev = priv->out_dev;
 + struct mpls_shim_hdr *hdr;
 + bool bottom_of_stack = true;
 + int len = skb->len;
 + const void *encap;
 + int num_labels;
 + unsigned ttl;
 + const u32 *labels;
 + int ret;
 + int i;
 +
 + num_labels = dst_get_encap(skb, &encap) / 4;
 + if (!num_labels)
 + goto drop;
 +
 +

[PATCH] drivers/net/ethernet/dec/tulip/uli526x.c: fix misleading indentation in uli526x_timer

2015-06-02 Thread David Malcolm
This code in drivers/net/ethernet/dec/tulip/uli526x.c
function uli526x_timer:

  1086  } else
  1087  if ((tmp_cr12 & 0x3) && db->link_failed) {
  [...snip...]
  1109  }
  1110  else if(!(tmp_cr12 & 0x3) && db->link_failed)
    {
  [...snip...]
  1117  }
  1118  db->init=0;

is misleadingly indented: the
  db->init=0
is indented as if part of the else clause at line 1086, but it is
independent of it (no braces before the if at line 1087).

This patch fixes the indentation to reflect the actual meaning of the code,
though is it actually meant to be part of the else clause?  (I'm a
compiler developer, not a kernel person).  It also adds spaces around
the assignment, to placate checkpatch.pl.

Seen via an experimental new gcc warning I'm working on for gcc 6,
-Wmisleading-indentation, using gcc r223098 adding
-Werror=misleading-indentation to KBUILD_CFLAGS in Makefile.
The experimental GCC emits this warning (as an error), rightly IMHO:

drivers/net/ethernet/dec/tulip/uli526x.c: In function ‘uli526x_timer’:
drivers/net/ethernet/dec/tulip/uli526x.c:1118:3: error: statement is
indented as if it were guarded by... [-Werror=misleading-indentation]
   db->init=0;
^
drivers/net/ethernet/dec/tulip/uli526x.c:1086:4: note: ...this ‘else’
clause, but it is not
  } else
 ^

Hope this is helpful
Dave

Signed-off-by: David Malcolm dmalc...@redhat.com
---
 drivers/net/ethernet/dec/tulip/uli526x.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/dec/tulip/uli526x.c 
b/drivers/net/ethernet/dec/tulip/uli526x.c
index 2c30c0c..447d092 100644
--- a/drivers/net/ethernet/dec/tulip/uli526x.c
+++ b/drivers/net/ethernet/dec/tulip/uli526x.c
@@ -1115,7 +1115,7 @@ static void uli526x_timer(unsigned long data)
netif_carrier_off(dev);
}
}
-   db->init=0;
+   db->init = 0;
 
/* Timer active again */
db->timer.expires = ULI526X_TIMER_WUT;
-- 
1.8.5.3
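
As an illustration of the point above, here is a small stand-alone sketch, written with the corrected indentation. A brace-less else governs only the single if/else-if statement that follows it, so the assignment after the chain runs on every call. The function next_init and its outer_taken parameter are hypothetical; only the two inner conditions mirror the quoted driver code.

```c
/*
 * Simplified model of the control flow discussed above.
 * Returns the value "init" ends up with after one pass.
 */
static int next_init(int outer_taken, unsigned int tmp_cr12, int link_failed)
{
	int init = 1;

	if (outer_taken) {
		/* first branch: nothing to do in this sketch */
	} else
		if ((tmp_cr12 & 0x3) && link_failed) {
			/* link recovered */
		} else if (!(tmp_cr12 & 0x3) && link_failed) {
			/* link still down */
		}
	init = 0;	/* not part of the else: runs on every call */

	return init;
}
```

Whichever branch is taken, the function returns 0, which matches the reading that the assignment is independent of the else clause.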



[PATCH net-next V10 1/4] openvswitch: 802.1ad uapi changes.

2015-06-02 Thread Thomas F Herbert
openvswitch: Add support for 802.1AD

Change the description of the VLAN tpid field.

Signed-off-by: Thomas F Herbert thomasfherb...@gmail.com
---
 include/uapi/linux/openvswitch.h | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index bbd49a0..f2ccdef 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -559,13 +559,13 @@ struct ovs_action_push_mpls {
  * @vlan_tci: Tag control identifier (TCI) to push.  The CFI bit must be set
  * (but it will not be set in the 802.1Q header that is pushed).
  *
- * The @vlan_tpid value is typically %ETH_P_8021Q.  The only acceptable TPID
- * values are those that the kernel module also parses as 802.1Q headers, to
- * prevent %OVS_ACTION_ATTR_PUSH_VLAN followed by %OVS_ACTION_ATTR_POP_VLAN
- * from having surprising results.
+ * The @vlan_tpid value is typically %ETH_P_8021Q or %ETH_P_8021AD.
+ * The only acceptable TPID values are those that the kernel module also parses
+ * as 802.1Q or 802.1AD headers, to prevent %OVS_ACTION_ATTR_PUSH_VLAN followed
+ * by %OVS_ACTION_ATTR_POP_VLAN from having surprising results.
  */
 struct ovs_action_push_vlan {
-   __be16 vlan_tpid;   /* 802.1Q TPID. */
+   __be16 vlan_tpid;   /* 802.1Q or 802.1ad TPID. */
__be16 vlan_tci;/* 802.1Q TCI (VLAN ID and priority). */
 };
 
@@ -605,9 +605,10 @@ struct ovs_action_hash {
  * is copied from the value to the packet header field, rest of the bits are
  * left unchanged.  The non-masked value bits must be passed in as zeroes.
  * Masking is not supported for the %OVS_KEY_ATTR_TUNNEL attribute.
- * @OVS_ACTION_ATTR_PUSH_VLAN: Push a new outermost 802.1Q header onto the
- * packet.
- * @OVS_ACTION_ATTR_POP_VLAN: Pop the outermost 802.1Q header off the packet.
+ * @OVS_ACTION_ATTR_PUSH_VLAN: Push a new outermost 802.1Q or 802.1ad header
+ * onto the packet.
+ * @OVS_ACTION_ATTR_POP_VLAN: Pop the outermost 802.1Q or 802.1ad header
+ * from the packet.
  * @OVS_ACTION_ATTR_SAMPLE: Probabilitically executes actions, as specified in
  * the nested %OVS_SAMPLE_ATTR_* attributes.
  * @OVS_ACTION_ATTR_PUSH_MPLS: Push a new MPLS label stack entry onto the
-- 
2.1.0



[PATCH 01/10] net: thunderx: Cleanup duplicate NODE_ID macros, add nic_get_node_id()

2015-06-02 Thread Aleksey Makarov
From: Robert Richter rrich...@cavium.com

There are duplicate NODE_ID macro definitions. Move all of them to
nic.h for usage in nic and bgx driver and introduce nic_get_node_id()
helper function.

This patch also fixes 64bit mask which should have been ULL by
reworking the node calculation.

Signed-off-by: Robert Richter rrich...@cavium.com
---
 drivers/net/ethernet/cavium/thunder/nic.h | 10 ++
 drivers/net/ethernet/cavium/thunder/nic_main.c|  4 +---
 drivers/net/ethernet/cavium/thunder/thunder_bgx.c |  4 ++--
 drivers/net/ethernet/cavium/thunder/thunder_bgx.h |  3 ---
 4 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nic.h 
b/drivers/net/ethernet/cavium/thunder/nic.h
index 9b0be52..4f426db 100644
--- a/drivers/net/ethernet/cavium/thunder/nic.h
+++ b/drivers/net/ethernet/cavium/thunder/nic.h
@@ -11,6 +11,7 @@
 
 #include <linux/netdevice.h>
 #include <linux/interrupt.h>
+#include <linux/pci.h>
 #include "thunder_bgx.h"
 
 /* PCI device IDs */
@@ -398,6 +399,15 @@ union nic_mbx {
struct bgx_link_status  link_status;
 };
 
+#define NIC_NODE_ID_MASK   0x03
+#define NIC_NODE_ID_SHIFT  44
+
+static inline int nic_get_node_id(struct pci_dev *pdev)
+{
+   u64 addr = pci_resource_start(pdev, PCI_CFG_REG_BAR_NUM);
+   return ((addr >> NIC_NODE_ID_SHIFT) & NIC_NODE_ID_MASK);
+}
+
 int nicvf_set_real_num_queues(struct net_device *netdev,
  int tx_queues, int rx_queues);
 int nicvf_open(struct net_device *netdev);
diff --git a/drivers/net/ethernet/cavium/thunder/nic_main.c 
b/drivers/net/ethernet/cavium/thunder/nic_main.c
index 0f1f58b..3ca7ad8 100644
--- a/drivers/net/ethernet/cavium/thunder/nic_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nic_main.c
@@ -23,8 +23,6 @@
 struct nicpf {
struct pci_dev  *pdev;
u8  rev_id;
-#define NIC_NODE_ID_MASK   0x3000
-#define NIC_NODE_ID(x) ((x & NODE_ID_MASK) >> 44)
u8  node;
unsigned intflags;
u8  num_vf_en;  /* No of VF enabled */
@@ -851,7 +849,7 @@ static int nic_probe(struct pci_dev *pdev, const struct 
pci_device_id *ent)
 
pci_read_config_byte(pdev, PCI_REVISION_ID, &nic->rev_id);
 
-   nic->node = NIC_NODE_ID(pci_resource_start(pdev, PCI_CFG_REG_BAR_NUM));
+   nic->node = nic_get_node_id(pdev);
 
nic_set_lmac_vf_mapping(nic);
 
diff --git a/drivers/net/ethernet/cavium/thunder/thunder_bgx.c 
b/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
index 020e11c..cde604a 100644
--- a/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
+++ b/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
@@ -894,8 +894,8 @@ static int bgx_probe(struct pci_dev *pdev, const struct 
pci_device_id *ent)
goto err_release_regions;
}
bgx->bgx_id = (pci_resource_start(pdev, PCI_CFG_REG_BAR_NUM) >> 24) & 1;
-   bgx->bgx_id += NODE_ID(pci_resource_start(pdev, PCI_CFG_REG_BAR_NUM))
-   * MAX_BGX_PER_CN88XX;
+   bgx->bgx_id += nic_get_node_id(pdev) * MAX_BGX_PER_CN88XX;
+
bgx_vnic[bgx->bgx_id] = bgx;
bgx_get_qlm_mode(bgx);
 
diff --git a/drivers/net/ethernet/cavium/thunder/thunder_bgx.h 
b/drivers/net/ethernet/cavium/thunder/thunder_bgx.h
index 9d91ce4..f9e2170 100644
--- a/drivers/net/ethernet/cavium/thunder/thunder_bgx.h
+++ b/drivers/net/ethernet/cavium/thunder/thunder_bgx.h
@@ -20,9 +20,6 @@
 
 #define MAX_LMAC (MAX_BGX_PER_CN88XX * MAX_LMAC_PER_BGX)
 
-#define NODE_ID_MASK 0x3000
-#define NODE_ID(x)  ((x & NODE_ID_MASK) >> 44)
-
 /* Registers */
 #define BGX_CMRX_CFG   0x00
 #define  CMR_PKT_TX_EN BIT_ULL(13)
-- 
2.4.1
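
A minimal user-space sketch of the reworked node calculation, for illustration only. It takes the BAR start address as a plain 64-bit value instead of reading it from a PCI device; the helper name node_id_from_addr is hypothetical. Shifting first and masking with a small constant avoids needing a 64-bit mask literal at all.

```c
#include <stdint.h>

#define NIC_NODE_ID_MASK	0x03
#define NIC_NODE_ID_SHIFT	44

/* The node ID lives in bits 45:44 of the BAR start address. */
static inline int node_id_from_addr(uint64_t addr)
{
	return (int)((addr >> NIC_NODE_ID_SHIFT) & NIC_NODE_ID_MASK);
}
```

For example, an address with only bit 44 set yields node 1, and an address with bits 44 and 45 set yields node 3.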



Re: [net-next RFC 00/14] Convert OVS tunnel vports to use regular net_devices

2015-06-02 Thread Eric W. Biederman
Thomas Graf tg...@suug.ch writes:

 This is the first series in a greater effort to bring the scalability
 and programmability advantages of OVS to the rest of the network
 stack and to get rid of as much OVS specific code as possible.

 This first series focuses on getting rid of OVS tunnel vports and use
 regular tunnel net_devices instead. As part of this effort, the
 routing subsystem is extended with support for flow based tunneling.
 In this new tunneling mode, the route is able to match on tunnel
 information as well as set tunnel encapsulation parameters per route.
 This allows to perform L3 forwarding for a large number of tunnel
 endpoints and virtual networks using a single tunnel net_device.

This is a different direction than I was imagining things evolving when
I was looking at mpls.  However there is a lot of overlap.

I get the impression there are two directions you are looking at:
- Allowing more configurable keys in route based lookup.
- Reducing the costs of the tunnels.

We already have a similar subsystem xfrm.

If we are going to use more flexible keys when looking up routes, if it
is reasonably possible (while maintaining performance) I suggest we use
the xfrm data structure or more likely rework xfrm on top of the new
data structures.  That way there is less code to maintain overall.

Certainly any work that introduces a new way to do tunnels in the
kernel needs to answer the question: why not xfrm?  xfrm already
exists to do exactly that job.

I think a clumsy api and excess flexibility start to be an answer for
mpls ingress.  Just using the existing routing table can result in
cleaner faster code with a better userspace API.  But I still think the
mpls case where we attach labels needs to answer that case.

If you are using flow based flexibility from openvswitch I think why not
use xfrm becomes a more challenging question to answer.

Eric


How do I avoid recvmsg races with IP_RECVERR?

2015-06-02 Thread Andy Lutomirski
As far as I can tell, enabling IP_RECVERR causes the presence of a
queued error to cause recvmsg, etc to return an error (once).  It's
worse, though: a new error can be queued asynchronously at any time,
thus setting sk_err to a nonzero value.  How do I sensibly distinguish
recvmsg failures due to genuine errors receiving messages from recvmsg
failures because there's a queued error?

The only way I can see to get reliable error handling is to literally
call recvmsg in a loop:

while (true /* or while POLLIN is set */) {
  int ret = recvmsg(..., MSG_ERRQUEUE not set);
  if (ret < 0 && /* what goes here? */) {
    whoops!  this might be a harmless asynchronous error!
    take no action!
  }

  /* if POLLERR (or maybe unconditionally), recvmsg(..., MSG_ERRQUEUE); */
}

The problem is that, if I'm screwing something up (thus causing EINVAL
or something similar), this will just spin forever.

Am I missing something here?  Would it make sense to add
MSG_IGNORE_ERROR to suppress the sock_error check or IP_RECVERR=2 to
stop setting sk_err?


Thanks,
Andy


Re: [PATCH 00/10] net: thunderx: fix problems reported by static check tools

2015-06-02 Thread David Miller
From: Aleksey Makarov aleksey.maka...@caviumnetworks.com
Date: Tue, 2 Jun 2015 11:00:17 -0700

 These are fixes for the problems that were reported by static check tools.

Series applied, thanks.


[PATCH net-next V10 0/4] openvswitch: Add support for 802.1AD

2015-06-02 Thread Thomas F Herbert
Add support for 802.1AD to the openvswitch kernel module.

V10: Implement reviewer comments: Consolidate vlan parsing functions.
Splits netlink parsing and flow conversion into a separate patch. Uses
double encap attribute encapsulation for 802.1ad.  Netlink attributes
now look like this:

eth_type(0x88a8),vlan(vid=100),encap(eth_type(0x8100), vlan(vid=200),
encap(eth_type(0x0800), ...))

The double encap attributes in this version of the patch are incompatible with
old versions of the user level 802.1ad patch. A new user level patch is
being submitted simultaneously to the openvswitch dev mailing list.

V9:  Includes changes suggested by reviewers

V8:  Includes changes suggested by reviewers

V7:  Includes changes suggested by reviewers

V6:  Rebased to net-next

V5:  Use encapsulated attributes

Although the OpenFlow specification specifies support for 802.1AD (QinQ)
as well as push and pop of vlan headers, so far Open vSwitch has only
supported a single tag header.

This patch accompanies version 10 of the user level openvswitch patch 
submitted to openvswitch dev list.
For discussion, history  and previous versions of the kernel module 
patch and the user code patch see the OVS dev mailing list, 
openvswitch.org/pipermail/dev/..

Thomas F Herbert (4):
  openvswitch: 802.1ad uapi changes.
  Check for vlan ethernet types for 8021.q or 802.1ad
  8021AD: Flow handling actions and parsing
  8021AD: Flow key parsing and netlink attributes.

 include/linux/if_vlan.h  |   9 ++
 include/uapi/linux/openvswitch.h |  17 ++--
 net/openvswitch/flow.c   |  82 ++---
 net/openvswitch/flow.h   |   3 +
 net/openvswitch/flow_netlink.c   | 186 +--
 5 files changed, 248 insertions(+), 49 deletions(-)

-- 
2.1.0
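
To make the double-tagged layout concrete, here is a user-space sketch that walks a raw Ethernet frame and extracts the outer (802.1ad) and inner (802.1Q) VLAN IDs, mirroring the eth_type/vlan/encap nesting shown above. It is not the kernel's parser; struct vlan_tags, rd16 and parse_vlan_tags are hypothetical names used only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_P_8021Q	0x8100
#define ETH_P_8021AD	0x88A8
#define ETH_ALEN	6
#define VLAN_VID_MASK	0x0fff

struct vlan_tags {
	int count;		/* number of VLAN tags found: 0, 1 or 2 */
	uint16_t outer_vid;
	uint16_t inner_vid;
	uint16_t eth_type;	/* EtherType following the tags */
};

/* Read a big-endian 16-bit value. */
static uint16_t rd16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/* Returns 0 on success, -1 if the frame is too short. */
static int parse_vlan_tags(const uint8_t *frame, size_t len,
			   struct vlan_tags *out)
{
	size_t off = 2 * ETH_ALEN;	/* skip destination and source MAC */
	uint16_t type;

	out->count = 0;
	out->outer_vid = 0;
	out->inner_vid = 0;
	out->eth_type = 0;

	if (len < off + 2)
		return -1;
	type = rd16(frame + off);

	if (type == ETH_P_8021AD || type == ETH_P_8021Q) {
		int outer_is_ad = (type == ETH_P_8021AD);

		/* TPID + TCI + following EtherType */
		if (len < off + 6)
			return -1;
		out->outer_vid = rd16(frame + off + 2) & VLAN_VID_MASK;
		out->count = 1;
		off += 4;
		type = rd16(frame + off);

		if (outer_is_ad && type == ETH_P_8021Q) {
			if (len < off + 6)
				return -1;
			out->inner_vid = rd16(frame + off + 2) & VLAN_VID_MASK;
			out->count = 2;
			off += 4;
			type = rd16(frame + off);
		}
	}

	out->eth_type = type;
	return 0;
}
```

A frame carrying TPID 0x88a8 with VID 100, then TPID 0x8100 with VID 200, then EtherType 0x0800 parses to two tags, matching the attribute example in the cover letter.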



[PATCH net-next V10 2/4] General check for vlan ethernet types

2015-06-02 Thread Thomas F Herbert
This patch adds a function to check for vlan ethernet types. There is a
use case in openvswitch and it should be useful elsewhere.

Signed-off-by: Thomas F Herbert thomasfherb...@gmail.com
---
 include/linux/if_vlan.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
index 920e445..3713454 100644
--- a/include/linux/if_vlan.h
+++ b/include/linux/if_vlan.h
@@ -627,5 +627,14 @@ static inline netdev_features_t vlan_features_check(const 
struct sk_buff *skb,
 
return features;
 }
+/**
+ * Check for legal valid vlan ether type.
+ */
+static inline bool eth_type_vlan(__be16 ethertype)
+{
+   if (ethertype == htons(ETH_P_8021Q) || ethertype == htons(ETH_P_8021AD))
+   return true;
+   return false;
+}
 
 #endif /* !(_LINUX_IF_VLAN_H_) */
-- 
2.1.0
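
A user-space equivalent of the helper, as a sketch for experimenting outside the kernel. It uses uint16_t in place of __be16 and takes htons() from arpa/inet.h; the two EtherType values are the standard ones for 802.1Q and 802.1ad. The boolean expression is returned directly instead of going through an if/return pair.

```c
#include <stdbool.h>
#include <stdint.h>
#include <arpa/inet.h>

#define ETH_P_8021Q	0x8100	/* 802.1Q VLAN tag */
#define ETH_P_8021AD	0x88A8	/* 802.1ad service VLAN tag */

/* ethertype_be is in network byte order, as it appears on the wire. */
static inline bool eth_type_vlan(uint16_t ethertype_be)
{
	return ethertype_be == htons(ETH_P_8021Q) ||
	       ethertype_be == htons(ETH_P_8021AD);
}
```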



[PATCH net-next V10 4/4] 8021AD: Flow key parsing and netlink attributes.

2015-06-02 Thread Thomas F Herbert
Add support for 802.1ad to netlink parsing and flow conversion. Uses
double nested encap attributes to represent double tagged vlan.

Signed-off-by: Thomas F Herbert thomasfherb...@gmail.com
---
 net/openvswitch/flow_netlink.c | 186 ++---
 1 file changed, 157 insertions(+), 29 deletions(-)

diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index c691b1a..8fd4f63 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -771,6 +771,28 @@ static int metadata_from_nlattrs(struct sw_flow_match 
*match,  u64 *attrs,
return 0;
 }
 
+static int cust_vlan_from_nlattrs(struct sw_flow_match *match, u64 attrs,
+ const struct nlattr **a, bool is_mask,
+ bool log)
+{
+   /* This should be nested inner or customer tci */
+   if (attrs & (1 << OVS_KEY_ATTR_VLAN)) {
+   __be16 ctci;
+
+   ctci = nla_get_be16(a[OVS_KEY_ATTR_VLAN]);
+   if (!(ctci & htons(VLAN_TAG_PRESENT))) {
+   if (is_mask)
+   OVS_NLERR(log, "VLAN CTCI mask does not have exact match for VLAN_TAG_PRESENT bit.");
+   else
+   OVS_NLERR(log, "VLAN CTCI does not have VLAN_TAG_PRESENT bit set.");
+
+   return -EINVAL;
+   }
+   SW_FLOW_KEY_PUT(match, eth.ctci, ctci, is_mask);
+   }
+   return 0;
+}
+
 static int ovs_key_from_nlattrs(struct sw_flow_match *match, u64 attrs,
const struct nlattr **a, bool is_mask,
bool log)
@@ -1024,6 +1046,105 @@ static void mask_set_nlattr(struct nlattr *attr, u8 val)
nlattr_set(attr, val, ovs_key_lens);
 }
 
+static int parse_vlan_from_nlattrs(const struct nlattr *nla,
+  struct sw_flow_match *match,
+  u64 *key_attrs, bool *ie_valid,
+  const struct nlattr **a, bool is_mask,
+  bool log)
+{
+   int err;
+   __be16 tci;
+   const struct nlattr *encap;
+
+   if (!is_mask) {
+   u64 v_attrs = 0;
+
+   tci = nla_get_be16(a[OVS_KEY_ATTR_VLAN]);
+
+   if (tci & htons(VLAN_TAG_PRESENT)) {
+   if (unlikely((nla_get_be16(a[OVS_KEY_ATTR_ETHERTYPE]) ==
+ htons(ETH_P_8021AD)))) {
+   err = parse_flow_nlattrs(nla, a, &v_attrs, log);
+   if (err)
+   return err;
+   if (!v_attrs)
+   return -EINVAL;
+
+   if (!((v_attrs &
+  (1ULL << OVS_KEY_ATTR_VLAN)) &&
+ (v_attrs &
+  (1ULL << OVS_KEY_ATTR_ENCAP)))) {
+   OVS_NLERR(log, "Invalid Vlan frame.");
+   return -EINVAL;
+   }
+   v_attrs &= ~(1 << OVS_KEY_ATTR_ETHERTYPE);
+   encap = a[OVS_KEY_ATTR_ENCAP];
+   v_attrs &= ~(1 << OVS_KEY_ATTR_ENCAP);
+   *ie_valid = true;
+
+   err = cust_vlan_from_nlattrs(match, v_attrs,
+encap, is_mask,
+log);
+   if (err)
+   return err;
+   /* Insure that tci key attribute isn't
+* overwritten by encapsulated customer tci.
+*/
+   v_attrs &= ~(1 << OVS_KEY_ATTR_VLAN);
+   *key_attrs |= v_attrs;
+   } else {
+   *key_attrs &= ~(1 << OVS_KEY_ATTR_VLAN);
+   err = parse_flow_nlattrs(nla, a, key_attrs,
+log);
+   if (err)
+   return err;
+   }
+   } else if (!tci) {
+   /* Corner case for truncated 802.1Q header. */
+   if (nla_len(nla)) {
+   OVS_NLERR(log, "Truncated 802.1Q header has non-zero encap attribute.");
+   return -EINVAL;
+   }
+   } else {
+   OVS_NLERR(log, "Encap attr is set for non-VLAN frame");
+   return  -EINVAL;
+   }
+   }
+
+   } else {
+   u64 mask_v_attrs = 0;

[PATCH net-next V10 3/4] 802.1AD: Flow handling, actions and vlan parsing

2015-06-02 Thread Thomas F Herbert
Add support for 802.1ad including the ability to push and pop double
tagged vlans.

Signed-off-by: Thomas F Herbert thomasfherb...@gmail.com
---
 net/openvswitch/flow.c | 82 ++
 net/openvswitch/flow.h |  3 ++
 2 files changed, 73 insertions(+), 12 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 2dacc7b..9c73a2e 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -298,21 +298,78 @@ static bool icmp6hdr_ok(struct sk_buff *skb)
 static int parse_vlan(struct sk_buff *skb, struct sw_flow_key *key)
 {
struct qtag_prefix {
-   __be16 eth_type; /* ETH_P_8021Q */
+   __be16 eth_type; /* ETH_P_8021Q  or ETH_P_8021AD */
__be16 tci;
};
-   struct qtag_prefix *qp;
+   struct qtag_prefix *qp = (struct qtag_prefix *)skb->data;
 
-   if (unlikely(skb->len < sizeof(struct qtag_prefix) + sizeof(__be16)))
+   struct qinqtag_prefix {
+   __be16 eth_type; /* ETH_P_8021Q  or ETH_P_8021AD */
+   __be16 tci;
+   __be16 inner_tpid; /* ETH_P_8021Q */
+   __be16 ctci;
+   };
+
+   if (likely(skb_vlan_tag_present(skb))) {
+   key->eth.tci = htons(skb->vlan_tci);
+
+   /* Case where upstream
+* processing has already stripped the outer vlan tag.
+*/
+   if (unlikely(skb->vlan_proto == htons(ETH_P_8021AD))) {
+   if (unlikely(skb->len < sizeof(struct qtag_prefix) +
+   sizeof(__be16))) {
+   key->eth.tci = 0;
+   return 0;
+   }
+
+   if (unlikely(!pskb_may_pull(skb,
+   sizeof(struct qtag_prefix) +
+   sizeof(__be16)))) {
+   return -ENOMEM;
+   }
+
+   if (likely(qp->eth_type == htons(ETH_P_8021Q))) {
+   key->eth.ctci = qp->tci |
+   htons(VLAN_TAG_PRESENT);
+   __skb_pull(skb, sizeof(struct qtag_prefix));
+   }
+   }
return 0;
+   }
 
-   if (unlikely(!pskb_may_pull(skb, sizeof(struct qtag_prefix) +
-sizeof(__be16))))
-   return -ENOMEM;
 
-   qp = (struct qtag_prefix *) skb->data;
-   key->eth.tci = qp->tci | htons(VLAN_TAG_PRESENT);
-   __skb_pull(skb, sizeof(struct qtag_prefix));
+   if (qp->eth_type == htons(ETH_P_8021AD)) {
+   struct qinqtag_prefix *qinqp =
+   (struct qinqtag_prefix *)skb->data;
+
+   if (unlikely(skb->len < sizeof(struct qinqtag_prefix) +
+   sizeof(__be16)))
+   return 0;
+
+   if (unlikely(!pskb_may_pull(skb, sizeof(struct qinqtag_prefix) +
+   sizeof(__be16)))) {
+   return -ENOMEM;
+   }
+   key->eth.tci = qinqp->tci | htons(VLAN_TAG_PRESENT);
+   key->eth.ctci = qinqp->ctci | htons(VLAN_TAG_PRESENT);
+
+   __skb_pull(skb, sizeof(struct qinqtag_prefix));
+
+   return 0;
+   }
+   if (qp->eth_type == htons(ETH_P_8021Q)) {
+   if (unlikely(skb->len < sizeof(struct qtag_prefix) +
+   sizeof(__be16)))
+   return -ENOMEM;
+
+   if (unlikely(!pskb_may_pull(skb, sizeof(struct qtag_prefix) +
+   sizeof(__be16))))
+   return 0;
+   key->eth.tci = qp->tci | htons(VLAN_TAG_PRESENT);
+
+   __skb_pull(skb, sizeof(struct qtag_prefix));
+   }
 
return 0;
 }
@@ -474,9 +531,10 @@ static int key_extract(struct sk_buff *skb, struct 
sw_flow_key *key)
 */
 
key->eth.tci = 0;
-   if (skb_vlan_tag_present(skb))
-   key->eth.tci = htons(skb->vlan_tci);
-   else if (eth->h_proto == htons(ETH_P_8021Q))
+   key->eth.ctci = 0;
+   if ((skb_vlan_tag_present(skb)) ||
+   (eth->h_proto == htons(ETH_P_8021Q)) ||
+   (eth->h_proto == htons(ETH_P_8021AD)))
if (unlikely(parse_vlan(skb, key)))
return -ENOMEM;
 
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index a076e44..fa83c61 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -134,6 +134,9 @@ struct sw_flow_key {
u8 src[ETH_ALEN];   /* Ethernet source address. */
u8 dst[ETH_ALEN];   /* Ethernet destination address. */
__be16 tci; /* 0 if no VLAN, VLAN_TAG_PRESENT set 
otherwise. */
+   __be16 

Re: [RFC net-next 1/3] net: infra for per-nexthop encap data

2015-06-02 Thread Eric W. Biederman
Robert Shearman rshea...@brocade.com writes:

 Having to add a new interface to apply encap onto a packet is a
 mechanism that works well today, allowing the setup of the encap to be
 done separately from the routes out of them, meaning that routing
 protocols and other user-space apps don't need to do anything special
 to add routes out of a new type of interface. However, the overhead of
 creating an interface is high, especially in terms of
 memory. Therefore, the traditional method won't work very well for
 large numbers of routes applying encap where there is a low degree of
 sharing of the encap.

 The solution is to introduce a way of defining encap on a per-nexthop
 basis (i.e. per-route if only one nexthop) through the addition of a
 new netlink attribute, RTA_ENCAP. The semantics of this attribute is
 that the data is interpreted according to the output interface type
 (RTA_OIF) and is opaque to the normal forwarding path. The output
 interface doesn't have to be defined per-nexthop, but instead
 represents the way of encapsulating the packet. There could be as few
 as one per namespace, but more could be created, particularly if they
 are used to define parameters which are shared by a large number of
 routes. However, the split of what goes in the encap data and what
 might be specified via interface attributes is entirely up to the
 encap-type implementation.

 New rtnetlink operations are defined to assist with the management of
 this data:
 - parse_encap for parsing the attribute given through rtnl and either
   sizing the in-memory version (if encap ptr is NULL) or filling in the
   in-memory version.  RTA_ENCAP work for IPv4. This operations allows
   the interface to reject invalid encap specified by user-space and the
   sizing allows the kernel to have a different in memory implementation
   to the netlink API (which might be optimised for extensibility rather
   than speed of packet forwarding).
 - fill_encap for taking the in-memory version of the encap and filling
   in an RTA_ENCAP attribute in a netlink message.
 - match_encap for comparing an in-memory version of encap with an
   RTA_ENCAP version, returning 0 if matching or 1 if different.

 A new dst operation is also defined to allow encap-type interfaces to
 retrieve the encap data from their xmit functions and use it for
 encapsulating the packet and for further forwarding.

This bit of infrastructure should be more like rtnl_register, where
we register an encap type and the operations to go with it.

Just like rtnl_register we can have a small array with the operations for
each supported encapsulation.

Eric
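
One way to picture the registration approach suggested above is a small table of operations indexed by encap type. The following is a user-space sketch under that assumption; every name in it (encap_type, encap_ops, encap_register, encap_lookup) is hypothetical and not an existing kernel interface, and the callback signatures are simplified stand-ins for parse_encap, fill_encap and match_encap.

```c
#include <stddef.h>
#include <errno.h>

/* Hypothetical encap type identifiers; not part of any kernel API. */
enum encap_type {
	ENCAP_TYPE_UNSPEC,
	ENCAP_TYPE_MPLS,
	ENCAP_TYPE_MAX
};

/* Per-type operations, loosely modelled on the three ops in the patch. */
struct encap_ops {
	int (*parse)(const void *attr, size_t attr_len, void *encap);
	int (*fill)(void *buf, size_t buf_len,
		    const void *encap, size_t encap_len);
	int (*match)(const void *attr, size_t attr_len,
		     const void *encap, size_t encap_len);
};

/* Small fixed-size table: one slot per supported encapsulation. */
static const struct encap_ops *encap_ops_table[ENCAP_TYPE_MAX];

static int encap_register(int type, const struct encap_ops *ops)
{
	if (type <= ENCAP_TYPE_UNSPEC || type >= ENCAP_TYPE_MAX || !ops)
		return -EINVAL;
	if (encap_ops_table[type])
		return -EEXIST;	/* slot already taken */
	encap_ops_table[type] = ops;
	return 0;
}

static const struct encap_ops *encap_lookup(int type)
{
	if (type <= ENCAP_TYPE_UNSPEC || type >= ENCAP_TYPE_MAX)
		return NULL;
	return encap_ops_table[type];
}
```

With a table like this, the route code would look up the operations by encap type rather than through the output device, which is the difference from the quoted patch.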

 Suggested-by: Eric W. Biederman ebied...@xmission.com
 Signed-off-by: Robert Shearman rshea...@brocade.com
 ---
  include/linux/rtnetlink.h  |  7 +++
  include/net/dst.h  | 11 +++
  include/net/dst_ops.h  |  2 ++
  include/net/rtnetlink.h| 11 +++
  include/uapi/linux/rtnetlink.h |  1 +
  net/core/rtnetlink.c   | 36 
  6 files changed, 68 insertions(+)

 diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
 index a2324fb45cf4..470d822ddd61 100644
 --- a/include/linux/rtnetlink.h
 +++ b/include/linux/rtnetlink.h
 @@ -22,6 +22,13 @@ struct sk_buff *rtmsg_ifinfo_build_skb(int type, struct 
 net_device *dev,
  void rtmsg_ifinfo_send(struct sk_buff *skb, struct net_device *dev,
  gfp_t flags);
  
 +int rtnl_parse_encap(const struct net_device *dev, const struct nlattr *nla,
 +  void *encap);
 +int rtnl_fill_encap(const struct net_device *dev, struct sk_buff *skb,
 + int encap_len, const void *encap);
 +int rtnl_match_encap(const struct net_device *dev, const struct nlattr *nla,
 +  int encap_len, const void *encap);
 +
  
  /* RTNL is used as a global lock for all changes to network configuration  */
  extern void rtnl_lock(void);
 diff --git a/include/net/dst.h b/include/net/dst.h
 index 2bc73f8a00a9..df0e6ec18eca 100644
 --- a/include/net/dst.h
 +++ b/include/net/dst.h
 @@ -506,4 +506,15 @@ static inline struct xfrm_state *dst_xfrm(const struct 
 dst_entry *dst)
  }
  #endif
  
 +/* Get encap data for destination */
 +static inline int dst_get_encap(struct sk_buff *skb, const void **encap)
 +{
 + const struct dst_entry *dst = skb_dst(skb);
 +
 + if (!dst || !dst->ops->get_encap)
 + return 0;
 +
 + return dst->ops->get_encap(dst, encap);
 +}
 +
  #endif /* _NET_DST_H */
 diff --git a/include/net/dst_ops.h b/include/net/dst_ops.h
 index d64253914a6a..97f48cf8ef7d 100644
 --- a/include/net/dst_ops.h
 +++ b/include/net/dst_ops.h
 @@ -32,6 +32,8 @@ struct dst_ops {
   struct neighbour *  (*neigh_lookup)(const struct dst_entry *dst,
   struct sk_buff *skb,
   const void *daddr);
 + int (*get_encap)(const struct 

Re: [RFC net-next 0/3] IP imposition of per-nh MPLS encap

2015-06-02 Thread roopa

On 6/2/15, 11:30 AM, Eric W. Biederman wrote:

roopa ro...@cumulusnetworks.com writes:


On 6/1/15, 9:46 AM, Robert Shearman wrote:

In order to be able to function as a Label Edge Router in an MPLS
network, it is necessary to be able to take IP packets and impose an
MPLS encap and forward them out. The traditional approach of setting
up an interface for each tunnel endpoint doesn't scale for the
common MPLS use-cases where each IP route tends to be assigned a
different label as encap.

The solution suggested here for further discussion is to provide the
facility to define encap data on a per-nexthop basis using a new
netlink attribute, RTA_ENCAP, which would be opaque to the IPv4/IPv6
forwarding code, but interpreted by the virtual interface assigned to
the nexthop.

A new ipmpls interface type is defined to show the use of this
facility to allow IP packets to be imposed with an MPLS
encap. However, the facility is designed to be general enough to be
used by any encapsulation/tunneling mechanism that has similar
requirements of high-scale, high-variation-of-encap.

RFC because:
   - IPv6 side not implemented
   - struct rtable shouldn't be bloated by pointer+uint
   - Hasn't been thoroughly tested yet

Robert Shearman (3):
net: infra for per-nexthop encap data
ipv4: storing and retrieval of per-nexthop encap
mpls: new ipmpls device for encapsulating IP packets as mpls



Glad to see these patches!
I have a similar series I have been working on, but with no netdevice.
It uses a set of ops similar to iptun_encaps; I store encap data in fib_nh,
and in ip_route_output_slow I point dst.output to the output func provided
by one of the encap ops.

I see the advantages of using a netdevice, and I see this aligns with the
patches from Thomas.

roopa, I think I would prefer your patches.  I think using a netdevice
the way Robert is proposing is quite possibly a mess, from a scalability
standpoint.

Do you mean ip_route_input_slow?  There is no ip_route_output_slow.
Yes, correct, sorry. I mean ip_route_input_slow. The patches need work, but I
will try to get them out today to add more context to the discussion.


thanks,
Roopa




Re: [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls

2015-06-02 Thread roopa

On 6/2/15, 9:33 AM, Robert Shearman wrote:

On 02/06/15 17:15, roopa wrote:

On 6/1/15, 9:46 AM, Robert Shearman wrote:

Allow creating an mpls device for the purposes of encapsulating IP
packets with:

   ip link add type ipmpls

This device defines its per-nexthop encapsulation data as a stack of
labels, in the same format as for RTA_NEWDST. It uses the encap data
which will have been stored in the IP route to encapsulate the packet
with that stack of labels, with the last label corresponding to a
local label that defines how the packet will be sent out. The device
sends packets over loopback to the local MPLS forwarding logic which
performs all of the work.
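
For readers unfamiliar with the label stack format, the sketch below shows how a stack of labels is laid out on the wire according to RFC 3032, with the bottom-of-stack bit set only on the last (local) label. It is a userspace illustration, not code from the patch: lse_encode() and label_stack_write() are hypothetical helpers, and the TC and TTL values a real implementation places in each entry are assumptions here.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * MPLS label stack entry layout (RFC 3032), most significant bit first:
 * label: 20 bits, traffic class: 3 bits, bottom-of-stack: 1 bit, TTL: 8 bits.
 */
#define LSE_LABEL_SHIFT 12
#define LSE_TC_SHIFT    9
#define LSE_BOS_SHIFT   8

/* Pack one label stack entry into a host-order 32-bit value. */
static uint32_t lse_encode(uint32_t label, unsigned tc, int bos, unsigned ttl)
{
    return ((label & 0xFFFFF) << LSE_LABEL_SHIFT) |
           ((uint32_t)(tc & 0x7) << LSE_TC_SHIFT) |
           ((uint32_t)(bos ? 1 : 0) << LSE_BOS_SHIFT) |
           (uint32_t)(ttl & 0xFF);
}

/*
 * Write n labels into out in network byte order, 4 bytes per entry.
 * Only the last entry gets the bottom-of-stack bit.
 * Returns the number of bytes written, or 0 if n is 0 or out is too small.
 */
static size_t label_stack_write(uint8_t *out, size_t out_len,
                                const uint32_t *labels, size_t n,
                                unsigned ttl)
{
    size_t i;

    if (n == 0 || out_len < 4 * n)
        return 0;

    for (i = 0; i < n; i++) {
        uint32_t lse = lse_encode(labels[i], 0, i == n - 1, ttl);

        out[4 * i + 0] = (uint8_t)(lse >> 24);
        out[4 * i + 1] = (uint8_t)(lse >> 16);
        out[4 * i + 2] = (uint8_t)(lse >> 8);
        out[4 * i + 3] = (uint8_t)lse;
    }
    return 4 * n;
}
```

In the scheme described above, the final entry would carry the locally allocated label that the MPLS forwarding logic looks up after the packet is looped back.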



Maybe a silly question, but when you loop the packet back, what does the
local MPLS forwarding logic look up with? It probably assumes there is an
mpls route with that label and nexthop.
Will this need any internal labels (thinking same label stack, different
tunnel device, etc.)?


Yes, it requires that local/internal labels have been allocated and 
label routes installed in the label table for them.

This is our only concern.


It is entirely possible to put the outgoing interface into the encap 
data to avoid having to allocate extra labels, 
but I did it this way in order to support PIC Core for MPLS-VPN routes.


Hmm, is a netdevice a must in this case? Can you please elaborate on
this?




Note: I have two extra patches which avoid using the loopback device 
(which causes the TTL to end up being one less than it should on 
output), but I haven't posted them here because they were dependent on 
other mpls changes in my tree.


ok, thanks.



Re: [PATCH v4 00/25] Convert the posix_clock_operations and k_clock structure to ready for 2038

2015-06-02 Thread Thomas Gleixner
On Mon, 1 Jun 2015, Baolin Wang wrote:

 This patch series changes the 32-bit time types (timespec/itimerspec) to
 the 64-bit types (timespec64/itimerspec64), since 32-bit time types will
 break in the year 2038.

That's only true for 32bit systems.

All in all the patch series looks rather reasonable now, except for
the subject lines and the changelogs.

The only technical objection I have is the macro conversion magic in
patch #6. This can be done in a less cryptic and more efficient way.

See the comments to the various patches and please apply them to all
of the series.

Thanks,

tglx




Re: [RFC net-next 0/3] IP imposition of per-nh MPLS encap

2015-06-02 Thread Robert Shearman

On 02/06/15 19:11, Eric W. Biederman wrote:

Robert Shearman rshea...@brocade.com writes:


In order to be able to function as a Label Edge Router in an MPLS
network, it is necessary to be able to take IP packets and impose an
MPLS encap and forward them out. The traditional approach of setting
up an interface for each tunnel endpoint doesn't scale for the
common MPLS use-cases where each IP route tends to be assigned a
different label as encap.

The solution suggested here for further discussion is to provide the
facility to define encap data on a per-nexthop basis using a new
netlink attribute, RTA_ENCAP, which would be opaque to the IPv4/IPv6
forwarding code, but interpreted by the virtual interface assigned to
the nexthop.

A new ipmpls interface type is defined to show the use of this
facility to allow IP packets to be imposed with an MPLS
encap. However, the facility is designed to be general enough to be
used by any encapsulation/tunneling mechanism that has similar
requirements of high-scale, high-variation-of-encap.


I am still digging into the details but adding a new network device to
make this possible is very undesirable.

It is a pain point.  Those network devices get to be a major source of
memory consumption when there are 4K network namespaces in existence.

It is conceptually wrong.  The network device will never be used as an
ordinary network device.  All the network device gives you is the
ability to avoid creating an enumeration of different kinds of
encapsulation.


This isn't true. The network device also gives some of the things you 
take for granted. Things like fragmentation through specifying the mtu 
on the shared tunnel device, being able to specify rules using the 
shared tunnel output device, IP stats, and the ability to specify a 
different destination namespace.


Thanks,
Rob


Re: [RFC net-next 0/3] IP imposition of per-nh MPLS encap

2015-06-02 Thread Eric W. Biederman
Robert Shearman rshea...@brocade.com writes:

 On 02/06/15 19:11, Eric W. Biederman wrote:
 Robert Shearman rshea...@brocade.com writes:

 In order to be able to function as a Label Edge Router in an MPLS
 network, it is necessary to be able to take IP packets and impose an
 MPLS encap and forward them out. The traditional approach of setting
 up an interface for each tunnel endpoint doesn't scale for the
 common MPLS use-cases where each IP route tends to be assigned a
 different label as encap.

 The solution suggested here for further discussion is to provide the
 facility to define encap data on a per-nexthop basis using a new
 netlink attribute, RTA_ENCAP, which would be opaque to the IPv4/IPv6
 forwarding code, but interpreted by the virtual interface assigned to
 the nexthop.

 A new ipmpls interface type is defined to show the use of this
 facility to allow IP packets to be imposed with an MPLS
 encap. However, the facility is designed to be general enough to be
 used by any encapsulation/tunneling mechanism that has similar
 requirements of high-scale, high-variation-of-encap.

 I am still digging into the details but adding a new network device to
 make this possible is very undesirable.

 It is a pain point.  Those network devices get to be a major source of
 memory consumption when there are 4K network namespaces in existence.

 It is conceptually wrong.  The network device will never be used as an
 ordinary network device.  All the network device gives you is the
 ability to avoid creating an enumeration of different kinds of
 encapsulation.

 This isn't true. The network device also gives some of the things you
 take for granted. Things like fragmentation through specifying the mtu
 on the shared tunnel device, being able to specify rules using the
 shared tunnel output device, IP stats, and the ability to specify a
 different destination namespace.

Granted you get a few more things.  It is still conceptually wrong as
the network device will never be used as an ordinary network device.

Fragmentation is already silly because we are talking about multiple
tunnels with different properties.  You need per-route mtu to handle
that case.

Further I am not saying you don't need an output device (which is what
is needed to specify a different destination namespace) I am saying that
having a funny mpls device is wrong as far as I can see.  Certainly it
is a lot of bloody unnecessary overhead.

If we are going to design for maximum scaling (and 1 million+ routes
sounds like maximum scaling), we should see how far we can go without
dragging in the horrible heaviness of additional network devices.  35K
apiece last I measured it.  Just a small handful of them are already
scaling issues for network namespaces.
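
As a rough worked example of that cost, the sketch below multiplies out the figures mentioned in this thread: 4K network namespaces and about 35K per device. It assumes "35K" means 35 KiB and "4K" means 4096, and that each namespace gets exactly one extra device; netdev_overhead_bytes() is a hypothetical helper used only for the arithmetic.

```c
#include <stdint.h>

/* Total memory consumed by extra network devices across all namespaces. */
static uint64_t netdev_overhead_bytes(uint64_t namespaces,
                                      uint64_t devs_per_ns,
                                      uint64_t bytes_per_dev)
{
    return namespaces * devs_per_ns * bytes_per_dev;
}

/* Convert a byte count to whole MiB, rounding down. */
static uint64_t bytes_to_mib(uint64_t bytes)
{
    return bytes / (1024 * 1024);
}
```

Under those assumptions, netdev_overhead_bytes(4096, 1, 35 * 1024) is 146800640 bytes, or 140 MiB, for a single extra device per namespace.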

Eric



Re: ray_cs: Change 1 to true for bool type variable.

2015-06-02 Thread Kalle Valo

 The variable translate is bool type. So assigning true instead of 1.
 
 Signed-off-by: Shailendra Verma shailendra.capric...@gmail.com

Thanks, applied to wireless-drivers-next.git.

Kalle Valo


[PATCH v5 1/2] Renesas Ethernet AVB driver proper

2015-06-02 Thread Sergei Shtylyov
Ethernet AVB includes a Gigabit Ethernet controller (E-MAC) that is basically
compatible with SuperH Gigabit Ethernet E-MAC. Ethernet AVB has a dedicated
direct memory access controller (AVB-DMAC) that is a new design compared to the
SuperH E-DMAC. The AVB-DMAC is compliant with 3 standards formulated for IEEE
802.1BA: IEEE 802.1AS timing and synchronization protocol, IEEE 802.1Qav
real-time transfer, and the IEEE 802.1Qat stream reservation protocol.

The driver only supports device tree probing, so the binding document is
included in this patch.

Based on the original patches by Mitsuhiro Kimura.

Signed-off-by: Mitsuhiro Kimura mitsuhiro.kimura...@renesas.com
Signed-off-by: Sergei Shtylyov sergei.shtyl...@cogentembedded.com

---
This patch is against David Miller's 'net-next.git' repo.

Changes in version 5:
- switched to calling multiqueue APIs, implementing ndo_select_queue() method;
- fixed the ring-full check in ravb_start_xmit() and turned it into a sanity
  check, adjusting the error message and promoting dev_warn() to dev_err();
- fixed skb_put_padto() failure path to drop the packet;
- moved the 'priv->cur_tx' increment after enqueuing the packet in
  ravb_start_xmit();
- added the ring-full check after the packet gets enqueued in ravb_start_xmit(),
  moved ravb_tx_free() call from the sanity check to this new code; 
- moved mmiowb() call after the 'exit' label in ravb_start_xmit();
- removed superfluous netif_tx_disable() call from ravb_set_ringparam().

Changes in version 4:
- switched from the bit fields in the descriptor structures to normal fields
  and masks;
- declared 16/32-bit descriptor fields as '__le{16/32}' and started using
  cpu_to_le{16|32}() and le{16|32}_to_cpu() when accessing them;
- started registering/deregistering the PTP driver each time the AVB-DMAC
  enters/leaves the operation mode instead of ravb_{probe|remove}(), and thus
  removed ravb_ptp_is_config() and also the checks in ravb_ptp_tcr_request(),
  ravb_ptp_time_write(), and ravb_ptp_update_addend();
- folded ravb_free_dma_buffer() into ravb_ring_free(), clarified the comment to
  the latter function; was then able to simplify the error cleanup path in
  ravb_ring_init();
- fixed totally brain damaged ravb_tx_timeout() by first stopping DMA and then
  calling ravb_ring_free() and moving most of the code into work-queue function;
- started calling ravb_ring_init() from ravb_dmac_init();
- started allocating the TX buffers with GFP_KERNEL instead of GFP_ATOMIC;
- started checking the result of ravb_wait() where it was previously ignored;
- propagated errors from ravb_wait() calls outside ravb_ptp_tcr_request();
- propagated errors from ravb_ptp_tcr_request() calls outside
  ravb_ptp_time_{read|write}();
- propagated errors from ravb_ptp_time_{read|write}() outside
  ravb_ptp_adjtime() and ravb_ptp_{get|set}time64();
- switched from using ravb_wait() after setting a bit in the GCCR register to
  just checking if a bit is zero before setting it in ravb_ptp_time_write(),
  ravb_ptp_select_counter(), and ravb_ptp_update_addend();
- added check for ravb_ptp_update_compare() failure to ravb_ptp_perout() and
  propagate the error outside this function;
- added mmiowb() calls before releasing the spinlock;
- merged the spinlock release code from different *if* statement branches in
  ravb_ptp_perout();
- fixed the 'reg' parameter type in ravb_wait();
- fixed the result type for ravb_start_xmit();
- fixed the 'request' parameter type in ravb_ptp_tcr_request();
- added the 'size' local variable in ravb_tx_free();
- fixed TX ring cleanup threshold in ravb_start_xmit();
- fixed kmalloc() error cleanup in ravb_start_xmit();
- fixed ravb_start_xmit() racing with the interrupt handler and NAPI poller by
  holding spinlock till the end of ravb_start_xmit();
- factored out the GIC interrupt handling into ravb_ptp_interrupt();
- added the 'dma_addr' local variable in ravb_start_xmit();
- switched from '%=' operator to '=' in ravb_start_xmit();
- changed the format specifier in ravb_tx_timeout();
- removed useless type cast from ravb_rx();
- acquired the spinlock earlier in ravb_poll();
- made ravb_ptp_init() return *void* since its result isn't checked anyway;
- removed 'ravb_private::rx_buffer_size' since the RX buffer size should be
  constant;
- removed unused 'ravb_private::edmac_endian';
- expanded update_mac_address() inline at its only call site;
- expanded ravb_ptp_cnt_{read|write}() inline at their single call sites;
- expanded ravb_ptp_cnt_select_counter() inline at its only call site;
- expanded ravb_ptp_update_addend() at its only call site;
- added 'ravb_' prefix to read_mac_address();
- renamed ravb_wait_stop_dma() to ravb_stop_dma();
- also disabled E-MAC TX in ravb_stop_dma() by calling ravb_rcv_snd_disable();
- moved the ravb_config() call from ravb_stop_dma() callers to this function;
- removed redundant register reads/writes in ravb_set_duplex();
- converted the 'new_state' local variable from *int* to 

[PATCH v5 2/2] Renesas Ethernet AVB PTP clock driver

2015-06-02 Thread Sergei Shtylyov
Ethernet AVB device includes the gPTP timer, so we can implement a PTP clock
driver.  We're doing that in a separate file, with the main Ethernet driver
calling the PTP driver's [de]initialization and interrupt handler functions.
Unfortunately, the clock seems tightly coupled with the AVB-DMAC, so when that
one leaves the operation mode, we have to unregister the PTP clock... :-(

Based on the original patches by Masaru Nagai.

Signed-off-by: Masaru Nagai masaru.nagai...@renesas.com
Signed-off-by: Sergei Shtylyov sergei.shtyl...@cogentembedded.com

---
This patch is against David Miller's 'net-next.git' repo.

Changes in version 5:
- resolved rejects, refreshed the patch.

Changes in version 4:
- new patch, split from the main Ethernet driver patch.

 drivers/net/ethernet/renesas/Makefile   |2 
 drivers/net/ethernet/renesas/ravb.c |   33 ++
 drivers/net/ethernet/renesas/ravb.h |   26 ++
 drivers/net/ethernet/renesas/ravb_ptp.c |  357 
 4 files changed, 412 insertions(+), 6 deletions(-)

Index: net-next/drivers/net/ethernet/renesas/Makefile
===
--- net-next.orig/drivers/net/ethernet/renesas/Makefile
+++ net-next/drivers/net/ethernet/renesas/Makefile
@@ -3,4 +3,4 @@
 #
 
 obj-$(CONFIG_SH_ETH) += sh_eth.o
-obj-$(CONFIG_RAVB) += ravb.o
+obj-$(CONFIG_RAVB) += ravb.o ravb_ptp.o
Index: net-next/drivers/net/ethernet/renesas/ravb.c
===
--- net-next.orig/drivers/net/ethernet/renesas/ravb.c
+++ net-next/drivers/net/ethernet/renesas/ravb.c
@@ -28,7 +28,6 @@
 #include <linux/of_irq.h>
 #include <linux/of_mdio.h>
 #include <linux/of_net.h>
-#include <linux/platform_device.h>
 #include <linux/pm_runtime.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
@@ -41,8 +40,7 @@
 NETIF_MSG_RX_ERR | \
 NETIF_MSG_TX_ERR)
 
-static int ravb_wait(struct net_device *ndev, enum ravb_reg reg, u32 mask,
-u32 value)
+int ravb_wait(struct net_device *ndev, enum ravb_reg reg, u32 mask, u32 value)
 {
int i;
 
@@ -785,6 +783,9 @@ static irqreturn_t ravb_interrupt(int ir
result = IRQ_HANDLED;
}
 
+   if (iss & ISS_CGIS)
+   result = ravb_ptp_interrupt(ndev);
+
mmiowb();
spin_unlock(&priv->lock);
return result;
@@ -1124,6 +1125,8 @@ static int ravb_set_ringparam(struct net
 
if (netif_running(ndev)) {
netif_device_detach(ndev);
+   /* Stop PTP Clock driver */
+   ravb_ptp_stop(ndev);
/* Wait for DMA stopping */
error = ravb_stop_dma(ndev);
if (error) {
@@ -1153,6 +1156,9 @@ static int ravb_set_ringparam(struct net
 
ravb_emac_init(ndev);
 
+   /* Initialise PTP Clock driver */
+   ravb_ptp_init(ndev, priv->pdev);
+
netif_device_attach(ndev);
}
 
@@ -1162,6 +1168,8 @@ static int ravb_set_ringparam(struct net
 static int ravb_get_ts_info(struct net_device *ndev,
struct ethtool_ts_info *info)
 {
+   struct ravb_private *priv = netdev_priv(ndev);
+
info->so_timestamping =
SOF_TIMESTAMPING_TX_SOFTWARE |
SOF_TIMESTAMPING_RX_SOFTWARE |
@@ -1174,7 +1182,7 @@ static int ravb_get_ts_info(struct net_d
(1 << HWTSTAMP_FILTER_NONE) |
(1 << HWTSTAMP_FILTER_PTP_V2_L2_EVENT) |
(1 << HWTSTAMP_FILTER_ALL);
-   info->phc_index = -1;
+   info->phc_index = ptp_clock_index(priv->ptp.clock);
 
return 0;
 }
@@ -1215,15 +1223,21 @@ static int ravb_open(struct net_device *
goto out_free_irq;
ravb_emac_init(ndev);
 
+   /* Initialise PTP Clock driver */
+   ravb_ptp_init(ndev, priv->pdev);
+
netif_tx_start_all_queues(ndev);
 
/* PHY control start */
error = ravb_phy_start(ndev);
if (error)
-   goto out_free_irq;
+   goto out_ptp_stop;
 
return 0;
 
+out_ptp_stop:
+   /* Stop PTP Clock driver */
+   ravb_ptp_stop(ndev);
 out_free_irq:
free_irq(ndev->irq, ndev);
 out_napi_off:
@@ -1254,6 +1268,9 @@ static void ravb_tx_timeout_work(struct
 
netif_tx_stop_all_queues(ndev);
 
+   /* Stop PTP Clock driver */
+   ravb_ptp_stop(ndev);
+
/* Wait for DMA stopping */
ravb_stop_dma(ndev);
 
@@ -1264,6 +1281,9 @@ static void ravb_tx_timeout_work(struct
ravb_dmac_init(ndev);
ravb_emac_init(ndev);
 
+   /* Initialise PTP Clock driver */
+   ravb_ptp_init(ndev, priv->pdev);
+
netif_tx_start_all_queues(ndev);
 }
 
@@ -1428,6 +1448,9 @@ static int ravb_close(struct net_device
ravb_write(ndev, 0, RIC2);
ravb_write(ndev, 0, TIC);
 
+   /* Stop PTP Clock driver */
+   ravb_ptp_stop(ndev);
+
/* Set the config mode to stop the 

[PATCH net-next 3/3] net/mlx4_core: fix typo in mlx4_set_vf_mac

2015-06-02 Thread clsoto
From: Carol Soto cls...@linux.vnet.ibm.com

fix typo in mlx4_set_vf_mac

Acked-by: Or Gerlitz ogerl...@mellanox.com
Signed-off-by: Carol L Soto cls...@linux.vnet.ibm.com
---
 drivers/net/ethernet/mellanox/mlx4/cmd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c 
b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index 91d8344..68ae765 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -2917,7 +2917,7 @@ int mlx4_set_vf_mac(struct mlx4_dev *dev, int port, int 
vf, u64 mac)
port = mlx4_slaves_closest_port(dev, slave, port);
s_info = &priv->mfunc.master.vf_admin[slave].vport[port];
s_info->mac = mac;
-   mlx4_info(dev, "default mac on vf %d port %d to %llX will take afect 
only after vf restart\n",
+   mlx4_info(dev, "default mac on vf %d port %d to %llX will take effect 
only after vf restart\n",
  vf, port, s_info->mac);
return 0;
 }
-- 
1.8.3.1



[PATCH net-next 2/3] net/mlx4_core: need to call close fw if alloc icm is called twice

2015-06-02 Thread clsoto
From: Carol Soto cls...@linux.vnet.ibm.com

If mlx4_enable_sriov is called for an adapter without the
MLX4_DEV_CAP_FLAG2_SYS_EQS feature, then on this path the alloc icm function
is called twice without freeing the structures from the first call.

Acked-by: Or Gerlitz ogerl...@mellanox.com
Signed-off-by: Carol L Soto cls...@linux.vnet.ibm.com
---
 drivers/net/ethernet/mellanox/mlx4/main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c 
b/drivers/net/ethernet/mellanox/mlx4/main.c
index 9485cbe..7d5 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -2976,6 +2976,7 @@ slave_start:
  existing_vfs,
  reset_flow);
 
+   mlx4_close_fw(dev);
mlx4_cmd_cleanup(dev, MLX4_CMD_CLEANUP_ALL);
dev->flags = dev_flags;
if (!SRIOV_VALID_STATE(dev->flags)) {
-- 
1.8.3.1



[PATCH net-next 1/3] net/mlx4_core: double free of dev_vfs

2015-06-02 Thread clsoto
From: Carol L Soto cls...@linux.vnet.ibm.com

If the user loads mlx4_core with num_vfs greater than
supported, then the variable dev->dev_vfs is freed twice after unloading the
driver.

Acked-by: Or Gerlitz ogerl...@mellanox.com
Signed-off-by: Carol L Soto cls...@linux.vnet.ibm.com
---
 drivers/net/ethernet/mellanox/mlx4/main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c 
b/drivers/net/ethernet/mellanox/mlx4/main.c
index 0dbd704..9485cbe 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -2824,6 +2824,7 @@ disable_sriov:
 free_mem:
dev->persist->num_vfs = 0;
kfree(dev->dev_vfs);
+dev->dev_vfs = NULL;
return dev_flags & ~MLX4_FLAG_MASTER;
 }
 
-- 
1.8.3.1



Re: [PATCH net-next v3 00/15] sfc: ndo_get_phys_port_id, vadaptor stats and PF unload when Vf's assigned to guest

2015-06-02 Thread David Miller
From: Shradha Shah ss...@solarflare.com
Date: Tue, 2 Jun 2015 11:36:00 +0100

 This is the third and last instalment of SRIOV for EF10 patches.
 
 This patch set includes implementation of ndo_get_phys_port_id
 and changes to the MAC statistics code in order to support
 vadaptor statistics.
 
 It also includes code to deal with PF unload when Vf's are still
 assigned to the guest.
 
 The first couple of patches create sysfs files for physical port
 and link control flags which are particularly useful when we have
 enabled a large number of VF's.
 
 These patches have been tested with and without CONFIG_SFC_SRIOV.
 The creation and content of the sysfs files has been tested.
 The statistics are tested using ethtool for monitoring.

Series applied, thanks.


Re: [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls

2015-06-02 Thread Robert Shearman

On 02/06/15 19:57, roopa wrote:

On 6/2/15, 9:33 AM, Robert Shearman wrote:

On 02/06/15 17:15, roopa wrote:

On 6/1/15, 9:46 AM, Robert Shearman wrote:

Allow creating an mpls device for the purposes of encapsulating IP
packets with:

   ip link add type ipmpls

This device defines its per-nexthop encapsulation data as a stack of
labels, in the same format as for RTA_NEWDST. It uses the encap data
which will have been stored in the IP route to encapsulate the packet
with that stack of labels, with the last label corresponding to a
local label that defines how the packet will be sent out. The device
sends packets over loopback to the local MPLS forwarding logic which
performs all of the work.



Maybe a silly question, but when you loop the packet back, what does the
local MPLS forwarding logic look up with? It probably assumes there is an
mpls route with that label and nexthop.
Will this need any internal labels (thinking same label stack, different
tunnel device, etc.)?


Yes, it requires that local/internal labels have been allocated and
label routes installed in the label table for them.

This is our only concern.


It is entirely possible to put the outgoing interface into the encap
data to avoid having to allocate extra labels, but I did it this way
in order to support PIC Core for MPLS-VPN routes.


Hmm, is a netdevice a must in this case? Can you please elaborate on
this?


Yes, the ipmpls device would still be used to perform the encapsulation, 
transitioning from the IP forwarding path to the MPLS forwarding path.


Thanks,
Rob


Re: How do I avoid recvmsg races with IP_RECVERR?

2015-06-02 Thread Hannes Frederic Sowa
On Tue, Jun 2, 2015, at 21:40, Andy Lutomirski wrote:
 As far as I can tell, enabling IP_RECVERR causes the presence of a
 queued error to cause recvmsg, etc. to return an error (once).  It's
 worse, though: a new error can be queued asynchronously at any time,
 thus setting sk_err to a nonzero value.  How do I sensibly distinguish
 recvmsg failures due to genuine errors receiving messages from recvmsg
 failures because there's a queued error?
 
 The only way I can see to get reliable error handling is to literally
 call recvmsg in a loop:
 
 while (true /* or while POLLIN is set */) {
   int ret = recvmsg(..., MSG_ERRQUEUE not set);
   if (ret < 0 && /* what goes here? */) {
 whoops!  this might be a harmless asynchronous error!
 take no action!
   }

I see two possibilities:

We export the icmp_err_convert tables along with the udp_lib_err error
conversions to user space and spice them up with flags to mark if they
are transient (icmp_err_convert already has a fatal flag).

Otherwise you should be able to call recvmsg with MSG_ERRQUEUE set after
you got a ret < 0 when calling without MSG_ERRQUEUE and inspect the
sock_extended_err, no?

Bye,
Hannes


Re: How do I avoid recvmsg races with IP_RECVERR?

2015-06-02 Thread Hannes Frederic Sowa
On Wed, Jun 3, 2015, at 02:03, Andy Lutomirski wrote:
 On Tue, Jun 2, 2015 at 2:50 PM, Hannes Frederic Sowa
 han...@stressinduktion.org wrote:
  My proposal would be to make the error conversion lazy:
 
  Keeping duplicate data is not a good idea in general: So we shouldn't
  use sk->sk_err if IP_RECVERR is set at all but let sock_error just use
  the sk_error_queue and extract the error code from there.
 
  Only if IP_RECVERR was not set, we use sk->sk_err logic.
 
  What do you think?
 
  I just noticed that this will probably break existing user space
  applications which require that icmp errors are transient even with
  IP_RECVERR. We can mark that with a bit in the sk_error_queue pointer
  and xchg the pointer, hmmm
 
 Do you mean to fix the race like this but to otherwise leave the
 semantics
 alone?  That would be an improvement, but it might be nice to also add
 a non-crappy API for this, too.

Yes, keep current semantics but fix the race you reported.

I currently don't have good proposals for a decent API to handle this
besides adding some ancillary cmsg data to msg_control. This still would
not solve the problem fundamentally, as a -EFAULT/-EINVAL return value
could also mean that msg_control should not be touched, thus we end up
again relying on errno checking. :/ Thus checking the error queue after
receiving an error indication is my best hunch so far.

Your proposal with MSG_IGNORE_ERROR seems reasonable so far for ping or
udp, but I haven't fully grasped the TCP semantics of sk->sk_err, yet.

Bye,
Hannes


Re: [RFC 2/9] net: dsa: add basic support for VLAN operations

2015-06-02 Thread Vivien Didelot
Hi Guenter,

On Jun 2, 2015, at 10:42 AM, Guenter Roeck li...@roeck-us.net wrote:
On 06/01/2015 06:27 PM, Vivien Didelot wrote:
 This patch adds the glue between DSA and switchdev to add and delete
 SWITCHDEV_OBJ_PORT_VLAN objects.

 This will allow the DSA switch drivers implementing the port_vlan_add
 and port_vlan_del functions to access the switch VLAN database through
 userspace commands such as bridge vlan.

 Signed-off-by: Vivien Didelot vivien.dide...@savoirfairelinux.com
 ---
   include/net/dsa.h |  7 +++
   net/dsa/slave.c   | 61 
 +--
   2 files changed, 66 insertions(+), 2 deletions(-)

 diff --git a/include/net/dsa.h b/include/net/dsa.h
 index fbca63b..726357b 100644
 --- a/include/net/dsa.h
 +++ b/include/net/dsa.h
 @@ -302,6 +302,13 @@ struct dsa_switch_driver {
 const unsigned char *addr, u16 vid);
  int (*fdb_getnext)(struct dsa_switch *ds, int port,
 unsigned char *addr, bool *is_static);
 +
 +/*
 + * VLAN support
 + */
 +int (*port_vlan_add)(struct dsa_switch *ds, int port, u16 vid,
 + u16 bridge_flags);
 +int (*port_vlan_del)(struct dsa_switch *ds, int port, u16 vid);
   };

   void register_switch_driver(struct dsa_switch_driver *type);
 diff --git a/net/dsa/slave.c b/net/dsa/slave.c
 index cbda00a..52ba5a1 100644
 --- a/net/dsa/slave.c
 +++ b/net/dsa/slave.c
 @@ -363,6 +363,25 @@ static int dsa_slave_port_attr_set(struct net_device 
 *dev,
  return ret;
   }

 +static int dsa_slave_port_vlans_add(struct net_device *dev,
 +struct switchdev_obj_vlan *vlan)
 +{
 +struct dsa_slave_priv *p = netdev_priv(dev);
 +struct dsa_switch *ds = p->parent;
 +int vid, err = 0;
 +
 +if (!ds->drv->port_vlan_add)
 +return -ENOTSUPP;
 +
 +for (vid = vlan->vid_start; vid <= vlan->vid_end; ++vid) {
 +err = ds->drv->port_vlan_add(ds, p->port, vid, vlan->flags);
 +if (err)
 +break;
 +}
 +
 +return err;
 +}
 +
   static int dsa_slave_port_obj_add(struct net_device *dev,
struct switchdev_obj *obj)
   {
 @@ -378,6 +397,9 @@ static int dsa_slave_port_obj_add(struct net_device *dev,
  return 0;

  switch (obj->id) {
 +case SWITCHDEV_OBJ_PORT_VLAN:
 +err = dsa_slave_port_vlans_add(dev, &obj->u.vlan);
 +break;
  default:
  err = -ENOTSUPP;
  break;
 @@ -386,12 +408,34 @@ static int dsa_slave_port_obj_add(struct net_device 
 *dev,
  return err;
   }

 +static int dsa_slave_port_vlans_del(struct net_device *dev,
 +struct switchdev_obj_vlan *vlan)
 +{
 +struct dsa_slave_priv *p = netdev_priv(dev);
 +struct dsa_switch *ds = p->parent;
 +int vid, err = 0;
 +
 +if (!ds->drv->port_vlan_del)
 +return -ENOTSUPP;
 +
 +for (vid = vlan->vid_start; vid <= vlan->vid_end; ++vid) {
 +err = ds->drv->port_vlan_del(ds, p->port, vid);
 +if (err)
 +break;
 +}
 +
 +return err;
 +}
 +
   static int dsa_slave_port_obj_del(struct net_device *dev,
struct switchdev_obj *obj)
   {
  int err;

  switch (obj->id) {
 +case SWITCHDEV_OBJ_PORT_VLAN:
 +err = dsa_slave_port_vlans_del(dev, &obj->u.vlan);
 +break;
  default:
  err = -EOPNOTSUPP;
  break;
 @@ -473,6 +517,15 @@ static netdev_tx_t dsa_slave_notag_xmit(struct sk_buff
 *skb,
  return NETDEV_TX_OK;
   }

 +static int dsa_slave_vlan_noop(struct net_device *dev, __be16 proto, u16 
 vid)
 +{
 +/* NETIF_F_HW_VLAN_CTAG_FILTER requires ndo_vlan_rx_add_vid and
 + * ndo_vlan_rx_kill_vid, otherwise the VLAN acceleration is considered
 + * buggy (see net/core/dev.c).
 + */
 +return 0;
 +}
 +

   /* ethtool operations 
 ***/
   static int
 @@ -734,6 +787,10 @@ static const struct net_device_ops dsa_slave_netdev_ops 
 = {
  .ndo_fdb_dump   = dsa_slave_fdb_dump,
  .ndo_do_ioctl   = dsa_slave_ioctl,
  .ndo_get_iflink = dsa_slave_get_iflink,
 +.ndo_vlan_rx_add_vid= dsa_slave_vlan_noop,
 +.ndo_vlan_rx_kill_vid   = dsa_slave_vlan_noop,
 +.ndo_bridge_setlink = switchdev_port_bridge_setlink,
 +.ndo_bridge_dellink = switchdev_port_bridge_dellink,
   };

   static const struct switchdev_ops dsa_slave_switchdev_ops = {
 @@ -924,7 +981,7 @@ int dsa_slave_create(struct dsa_switch *ds, struct device
 *parent,
  if (slave_dev == NULL)
  return -ENOMEM;

 -slave_dev->features = master->vlan_features;
 +slave_dev->features = master->vlan_features | NETIF_F_VLAN_FEATURES;
 
 Hi Vivien,
 
 NETIF_F_VLAN_FEATURES declares that the device supports receive and transmit
 tagging offload. We do this on transmit, by calling vlan_hwaccel_push_inside()
 with patch 9, but not on the receive side.
 
 I think you may need to add matching code on the receive side to remove
 the VLAN 

Re: [PATCH v2 net-next] vlan: Add GRO support for non hardware accelerated vlan

2015-06-02 Thread Simon Horman
On Mon, Jun 01, 2015 at 02:56:25PM -0700, David Miller wrote:
 From: Eric Dumazet eric.duma...@gmail.com
 Date: Mon, 01 Jun 2015 07:12:37 -0700
 
  Can we ensure offload_base contains a sensible order of expected
  types ?
 
 This seemed easy enough to kill, so I pushed the following into net-next:
 
 
 [PATCH] net: Add priority to packet_offload objects.
 
 When we scan a packet for GRO processing, we want to see the most
 common packet types in the front of the offload_base list.
 
 So add a priority field so we can handle this properly.
 
 IPv4/IPv6 get the highest priority with the implicit zero priority
 field.
 
 Next comes ethernet with a priority of 10, and then we have the MPLS
 types with a priority of 15.

FWIW I have no objections to the priority assigned to MPLS.

 Suggested-by: Eric Dumazet eric.duma...@gmail.com
 Suggested-by: Toshiaki Makita makita.toshi...@lab.ntt.co.jp
 Signed-off-by: David S. Miller da...@davemloft.net
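
A minimal sketch of the ordering described in the commit message above,
written as plain user-space C and not as kernel code. The struct and
function names are hypothetical stand-ins for the kernel's packet_offload
list handling; only the priority values (0, 10, 15) come from the message.
Lower values are scanned first, and entries with equal priority keep their
registration order.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct packet_offload. */
struct offload {
	const char *name;
	int priority;		/* lower value = scanned earlier */
	struct offload *next;
};

/*
 * Insert 'po' so the list stays sorted by ascending priority.
 * Entries with the same priority keep their registration order.
 */
static void offload_add(struct offload **head, struct offload *po)
{
	struct offload **pos = head;

	/* Skip every entry that should be scanned before the new one. */
	while (*pos && (*pos)->priority <= po->priority)
		pos = &(*pos)->next;

	po->next = *pos;
	*pos = po;
}
```

With this ordering, registering MPLS (15), Ethernet (10), IPv4 (0) and
IPv6 (0) in that sequence still yields a scan order of IPv4, IPv6,
Ethernet, MPLS.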


Re: [RFC 0/3] DSA and Marvell 88E6352 802.1q support

2015-06-02 Thread Vivien Didelot
Hi Scott,

On Jun 2, 2015, at 2:18 AM, Scott Feldman sfel...@gmail.com wrote:
 On Mon, Jun 1, 2015 at 5:18 PM, Vivien Didelot
 vivien.dide...@savoirfairelinux.com wrote:

 On May 29, 2015, at 1:02 AM, Scott Feldman sfel...@gmail.com wrote:
 On Thu, May 28, 2015 at 2:37 PM, Vivien Didelot
 vivien.dide...@savoirfairelinux.com wrote:
 This RFC is based on v4.1-rc3.

 It is meant to get a glance to the commits responsible to implement the
 necessary NDOs between DSA and the Marvell 88E6352 switch driver.

 With this support, I am able to create VLANs with (un)tagged ports, setting
 their default VID, from a bridge.

 To create a bridge containing all switch ports, with a VLAN ID 400, swp2 
 and
 swp3 untagged (pvid), and swp4 tagged, the userspace commands look like 
 this:

 ip link add name br0 type bridge
 [...]
 ip link set dev swp2 up master br0
 [...]
 bridge vlan add vid 400 pvid untagged dev swp2
 bridge vlan add vid 400 pvid untagged dev swp3
 bridge vlan add vid 400 dev swp4
 [...]
 ip link add link br0 name br0.400 type vlan id 400
 [...]
 bridge vlan add dev br0 vid 400 self

 The code is currently being rebased to the latest net-next/master.

 Seems like the way to go now is through switchdev attr getter/setter...

 Indeed, for dsa_slave you should be able to port this to switchdev and
 set your ndo_bridge_setlink/dellink handlers to
 switchdev_port_bridge_setlink/dellink.  (And also implement the
 switchdev ops for vlans).

 If you use switchdev_port_bridge_setlink/dellink, you shouldn't need
 to implement ndo_vlan_rx_add_vid/ndo_vlan_rx_kill_vid at all.

 Scott,

 In fact I have to define these ndo, otherwise I get the "Buggy VLAN
 acceleration in driver!" warning from net/core/dev.c and the switch
 ports won't register.

 I'm actually defining a noop function for them in dsa_slave_netdev_ops.

 Is it correct to set NETIF_F_HW_VLAN_CTAG_FILTER in slave_dev->features?
 
 If you're nooping ndo VLAN ops then just remove setting
 NETIF_F_HW_VLAN_CTAG_FILTER and then you can remove the noop funcs.
 
 The setlink/dellink callbacks will give the same info (and more, e.g.
 pvid, untagged flags) and you'll automatically get support for stacked
 drivers, for example if you bonded swp2/3 and then included that bond
 in your vlan bridge.  Your commands will be slightly modified: when
 adding the vid to the port, specify master and self:

 bridge vlan add vid 400 dev swp4 master self

 Thanks it works! Now the switch VLAN database is consistent with the
 bridge commands, I'm sending a complete RFC very soon.

 Scott, David,

 I use this mail to expose a potential problem between iproute2 and the
 kernel, found with my previous code. When issuing ip link set dev swp0
 master br0, ndo_vlan_rx_add_vid is called, but not ndo_bridge_setlink,
 
 Remove NETIF_F_HW_VLAN_CTAG_FILTER and ndo_vlan_rx_add_vid will not be called.
 
 Issuing ip link set dev swp0 master br0 should only be setting the
 bridge member, not setting up any VLAN.  I suspect when you did this
 swp0 was admin UP and you're getting untagged VLAN 0 installed, which
 is the call to ndo_vlan_rx_add_vid.
 
 which results in an inconsistency between my switch VLAN database (and
 port settings) and bridge vlan, which shows swp0   1 PVID Egress
 Untagged.
 
 So that is a result of /sys/class/net/br0/bridge/default_pvid set to
 1.  If you don't want that, turn default_pvid off:
 
 echo 0 > /sys/class/net/br0/bridge/default_pvid
 
 Now you'll see None in the bridge vlan output.
 
 Seems like there is a call to ndo_bridge_setlink to add somewhere, but I
 have no clue where.

 In the meantime, I call bridge vlan add vid 1 dev swp0 pvid untagged
 [master self] at boot, to be consistent with the bridge output.
 
 Or turn off default_pvid.

Thanks, I confirm both fixes work. Thanks a lot.
-v


[PATCH V2 2/2] pci: Add VPD quirk for Intel Ethernet devices

2015-06-02 Thread Mark D Rustad
This quirk sets the PCI_DEV_FLAGS_VPD_REF_F0 flag on all Intel
Ethernet device functions other than function 0.

Signed-off-by: Mark Rustad mark.d.rus...@intel.com
---
 drivers/pci/quirks.c |9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index c6dc1dfd25d5..9ddf6a533f4f 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -1903,6 +1903,15 @@ static void quirk_netmos(struct pci_dev *dev)
 DECLARE_PCI_FIXUP_CLASS_HEADER(PCI_VENDOR_ID_NETMOS, PCI_ANY_ID,
 PCI_CLASS_COMMUNICATION_SERIAL, 8, quirk_netmos);
 
+static void quirk_f0_vpd_link(struct pci_dev *dev)
+{
+   if (!PCI_FUNC(dev->devfn))
+   return;
+   dev->dev_flags |= PCI_DEV_FLAGS_VPD_REF_F0;
+}
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, PCI_ANY_ID,
+ PCI_CLASS_NETWORK_ETHERNET, 8, quirk_f0_vpd_link);
+
 static void quirk_e100_interrupt(struct pci_dev *dev)
 {
u16 command, pmcsr;



Re: How do I avoid recvmsg races with IP_RECVERR?

2015-06-02 Thread Andy Lutomirski
On Tue, Jun 2, 2015 at 5:33 PM, Hannes Frederic Sowa
han...@stressinduktion.org wrote:
 On Wed, Jun 3, 2015, at 02:03, Andy Lutomirski wrote:
 On Tue, Jun 2, 2015 at 2:50 PM, Hannes Frederic Sowa
 han...@stressinduktion.org wrote:
  My proposal would be to make the error conversion lazy:
 
  Keeping duplicate data is not a good idea in general: So we shouldn't
  use sk->sk_err if IP_RECVERR is set at all but let sock_error just use
  the sk_error_queue and extract the error code from there.
 
  Only if IP_RECVERR was not set, we use sk->sk_err logic.
 
  What do you think?
 
  I just noticed that this will probably break existing user space
  applications which require that icmp errors are transient even with
  IP_RECVERR. We can mark that with a bit in the sk_error_queue pointer
  and xchg the pointer, hmmm

 Do you mean to fix the race like this but to otherwise leave the
 semantics
 alone?  That would be an improvement, but it might be nice to also add
 a non-crappy API for this, too.

 Yes, keep current semantics but fix the race you reported.

 I currently don't have good proposals for a decent API to handle this
 besides adding some ancillary cmsg data to msg_control. This still would
 not solve the problem fundamentally, as a -EFAULT/-EINVAL return value
 could also mean that msg_control should not be touched, thus we end up
 again relying on errno checking. :/ Thus checking the error queue after
 receiving an error indication is my best hunch so far.

 Your proposal with MSG_IGNORE_ERROR seems reasonable so far for ping or
 udp, but I haven't fully grasped the TCP semantics of sk->sk_err, yet.

I always assumed that TCP didn't have transient errors.  Shouldn't a
connection either be up or down but not up with errors?  If that's
wrong, then it's probably worth understanding what's going on before
trying to design a fix.

--Andy


Re: How do I avoid recvmsg races with IP_RECVERR?

2015-06-02 Thread Andy Lutomirski
On Tue, Jun 2, 2015 at 2:50 PM, Hannes Frederic Sowa
han...@stressinduktion.org wrote:
 On Tue, Jun 2, 2015, at 23:42, Hannes Frederic Sowa wrote:
 On Tue, Jun 2, 2015, at 23:33, Andy Lutomirski wrote:
  On Tue, Jun 2, 2015 at 2:17 PM, Hannes Frederic Sowa
  han...@stressinduktion.org wrote:
   On Tue, Jun 2, 2015, at 21:40, Andy Lutomirski wrote:
 
  [...]
 
  I do this already, which makes me think that there's a bug or another
  race somewhere.  I've only seen a failure once in several years of
  operation.
 
  The failure happened on a ping socket.  I suspect that the race is:
 
  ping_err: ip_icmp_error(...);
 
  user: recvmsg(MSG_ERRQUEUE) and dequeues the error.
 
  ping_err: sk_err = err;
 
  user: recvmsg(MSG_ERRQUEUE not set), and recvmsg sees and clears the
  error via sock_error.
 
  user: recvmsg(MSG_ERRQUEUE), and recvmsg returns -EAGAIN.
 
  Now the user code thinks that it was a real (non-transient) error and
  aborts.
 
  Shouldn't that sk->sk_err = err assignment at least use WRITE_ONCE?

 Hmm, I don't think this will help.

  Even if this race were fixed, this interface still sucks IMO.

 Yes. :/

 My proposal would be to make the error conversion lazy:

 Keeping duplicate data is not a good idea in general: So we shouldn't
 use sk->sk_err if IP_RECVERR is set at all but let sock_error just use
 the sk_error_queue and extract the error code from there.

 Only if IP_RECVERR was not set, we use sk->sk_err logic.

 What do you think?

 I just noticed that this will probably break existing user space
 applications which require that icmp errors are transient even with
 IP_RECVERR. We can mark that with a bit in the sk_error_queue pointer
 and xchg the pointer, hmmm

Do you mean to fix the race like this but to otherwise leave the semantics
alone?  That would be an improvement, but it might be nice to also add
a non-crappy API for this, too.

--Andy
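
The interleaving suspected in this thread can be replayed with a small
single-threaded model. This is only a sketch under stated assumptions: the
toy_* names are hypothetical, the struct is not the kernel's struct sock,
and EHOSTUNREACH is just an example error value. It models the two places
an ICMP error ends up (the error queue and sk_err) and shows why the last
MSG_ERRQUEUE read comes back empty.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy model of the two places an ICMP error is recorded. Not kernel code. */
struct toy_sock {
	bool queued;	/* an entry is sitting on the error queue */
	int sk_err;	/* pending error reported by ordinary calls */
};

/* Kernel side, first half: queue the extended error. */
static void toy_queue_error(struct toy_sock *sk)
{
	sk->queued = true;
}

/* Kernel side, second half: publish the errno value. */
static void toy_set_sk_err(struct toy_sock *sk, int err)
{
	sk->sk_err = err;
}

/* Models recvmsg(MSG_ERRQUEUE): dequeue one entry, or -EAGAIN if none. */
static int toy_recv_errqueue(struct toy_sock *sk)
{
	if (!sk->queued)
		return -EAGAIN;
	sk->queued = false;
	return 0;
}

/* Models recvmsg() without MSG_ERRQUEUE: report and clear sk_err. */
static int toy_recv(struct toy_sock *sk)
{
	int err = sk->sk_err;

	sk->sk_err = 0;
	return err ? -err : -EAGAIN;	/* no data is ever queued here */
}

/* Replays the interleaving; returns the result of the final queue read. */
static int toy_replay_race(int *plain_result)
{
	struct toy_sock sk = { false, 0 };

	toy_queue_error(&sk);			/* ping_err: ip_icmp_error() */
	(void)toy_recv_errqueue(&sk);		/* user: dequeues the error */
	toy_set_sk_err(&sk, EHOSTUNREACH);	/* ping_err: sk_err = err */
	*plain_result = toy_recv(&sk);		/* user: sees and clears sk_err */
	return toy_recv_errqueue(&sk);		/* user: queue already empty */
}
```

In this model the plain receive fails with the ICMP-derived error while
the follow-up error-queue read returns -EAGAIN, which is the combination
that makes the user code treat a transient error as a real one.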


Re: [RFC 5/9] net: dsa: mv88e6352: disable mirroring

2015-06-02 Thread Vivien Didelot
Hi Guenter, Andrew,

On Jun 2, 2015, at 10:53 AM, Andrew Lunn and...@lunn.ch wrote:
On Tue, Jun 02, 2015 at 07:16:10AM -0700, Guenter Roeck wrote:
 On 06/01/2015 06:27 PM, Vivien Didelot wrote:
 Disable the mirroring policy in the monitor control register, since this
 feature is not needed.
 
 Signed-off-by: Vivien Didelot vivien.dide...@savoirfairelinux.com
 
 Should this be a separate patch, unrelated to the patch set ?

Indeed, this one is an unrelated patch, sorry.

 If I understand correctly, this effectively disables IGMP/MLD snooping.
 I think this warrants an explanation why that is not needed, not just
 a statement that it is not needed.
 
 +1
 
 Especially since we might want to revisit this to implement IGMP/MLD
 snooping in the bridge. The hardware should be capable of it.

This is something I want to disable because I can have several times
gigabit traffic on my ports. This would end up in a bottleneck on the
CPU port. Am I right?

Thanks,
-v


Re: [PATCH net iproute2 v4] mpls: always set type RTN_UNICAST and scope RT_SCOPE_UNIVERSE for route add/deletes

2015-06-02 Thread Eric W. Biederman
Roopa Prabhu ro...@cumulusnetworks.com writes:

 From: Roopa Prabhu ro...@cumulusnetworks.com

 This patch fixes incorrect -EINVAL errors due to invalid
 scope and type during mpls route deletes.

 $ip -f mpls route add 100 as 200 via inet 10.1.1.2 dev swp1

 $ip -f mpls route show
 100 as to 200 via inet 10.1.1.2 dev swp1

 $ip -f mpls route del 100 as 200 via inet 10.1.1.2 dev swp1
 RTNETLINK answers: Invalid argument

 $ip -f mpls route del 100
 RTNETLINK answers: Invalid argument

 After patch:

 $ip -f mpls route show
 100 as to 200 via inet 10.1.1.2 dev swp1

 $ip -f mpls route del 100 as 200 via inet 10.1.1.2 dev swp1

 $ip -f mpls route show

 Always set type to RTN_UNICAST for mpls route add/deletes.
 Also to keep things consistent with kernel set scope to
 RT_SCOPE_UNIVERSE for both mpls and ipv6 routes. Both mpls and ipv6 route
 deletes ignore scope.

Acked-by: Eric W. Biederman ebied...@xmission.com


 Suggested-by: Eric W. Biederman ebied...@xmission.com
 Signed-off-by: Roopa Prabhu ro...@cumulusnetworks.com
 Signed-off-by: Vivek Venkataraman vi...@cumulusnetworks.com
 --
 v4 move fix to iproute2
 ---
  ip/iproute.c |   16 
  1 file changed, 12 insertions(+), 4 deletions(-)

 diff --git a/ip/iproute.c b/ip/iproute.c
 index 670a4c6..d0b9910 100644
 --- a/ip/iproute.c
 +++ b/ip/iproute.c
 @@ -803,6 +803,7 @@ static int iproute_modify(int cmd, unsigned flags, int 
 argc, char **argv)
   int scope_ok = 0;
   int table_ok = 0;
   int raw = 0;
 + int type_ok = 0;
  
   memset(&req, 0, sizeof(req));
  
 @@ -1095,6 +1096,7 @@ static int iproute_modify(int cmd, unsigned flags, int 
 argc, char **argv)
   rtnl_rtntype_a2n(&type, *argv) == 0) {
   NEXT_ARG();
   req.r.rtm_type = type;
 + type_ok = 1;
   }
  
   if (matches(*argv, "help") == 0)
 @@ -1136,6 +1138,9 @@ static int iproute_modify(int cmd, unsigned flags, int 
 argc, char **argv)
   if (nhs_ok)
   parse_nexthops(&req.n, &req.r, argc, argv);
  
 + if (req.r.rtm_family == AF_UNSPEC)
 + req.r.rtm_family = AF_INET;
 +
   if (!table_ok) {
   if (req.r.rtm_type == RTN_LOCAL ||
   req.r.rtm_type == RTN_BROADCAST ||
 @@ -1144,8 +1149,11 @@ static int iproute_modify(int cmd, unsigned flags, int 
 argc, char **argv)
   req.r.rtm_table = RT_TABLE_LOCAL;
   }
   if (!scope_ok) {
 - if (req.r.rtm_type == RTN_LOCAL ||
 - req.r.rtm_type == RTN_NAT)
 + if (req.r.rtm_family == AF_INET6 ||
 + req.r.rtm_family == AF_MPLS)
 + req.r.rtm_scope = RT_SCOPE_UNIVERSE;
 + else if (req.r.rtm_type == RTN_LOCAL ||
 +  req.r.rtm_type == RTN_NAT)
   req.r.rtm_scope = RT_SCOPE_HOST;
   else if (req.r.rtm_type == RTN_BROADCAST ||
req.r.rtm_type == RTN_MULTICAST ||
 @@ -1160,8 +1168,8 @@ static int iproute_modify(int cmd, unsigned flags, int 
 argc, char **argv)
   }
   }
  
 - if (req.r.rtm_family == AF_UNSPEC)
 - req.r.rtm_family = AF_INET;
 + if (!type_ok && req.r.rtm_family == AF_MPLS)
 + req.r.rtm_type = RTN_UNICAST;
  
   if (rtnl_talk(&rth, &req.n, 0, 0, NULL) < 0)
   return -2;


Re: [RFC 8/9] net: dsa: mv88e6352: set port 802.1Q mode to Secure

2015-06-02 Thread Guenter Roeck

On 06/01/2015 06:27 PM, Vivien Didelot wrote:

This commit changes the 802.1Q mode of each port from Disabled to
Secure. This enables the VLAN support, by checking the VTU entries on
ingress.

Signed-off-by: Vivien Didelot vivien.dide...@savoirfairelinux.com
---
  drivers/net/dsa/mv88e6xxx.c | 14 +++---
  drivers/net/dsa/mv88e6xxx.h |  5 +
  2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index ed49bd8..35243d8 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -1723,13 +1723,11 @@ static int mv88e6xxx_setup_port(struct dsa_switch *ds, 
int port)
goto abort;
}

-   /* Port Control 2: don't force a good FCS, set the maximum
-* frame size to 10240 bytes, don't let the switch add or
-* strip 802.1q tags, don't discard tagged or untagged frames
-* on this port, do a destination address lookup on all
-* received packets as usual, disable ARP mirroring and don't
-* send a copy of all transmitted/received frames on this port
-* to the CPU.
+   /* Port Control 2: don't force a good FCS, set the maximum frame size to
+* 10240 bytes, enable secure 802.1q tags, don't discard tagged or
+* untagged frames on this port, do a destination address lookup on all
+* received packets as usual, disable ARP mirroring and don't send a
+* copy of all transmitted/received frames on this port to the CPU.
 */
reg = 0;
if (mv88e6xxx_6352_family(ds) || mv88e6xxx_6351_family(ds) ||
@@ -1751,6 +1749,8 @@ static int mv88e6xxx_setup_port(struct dsa_switch *ds, 
int port)
reg |= PORT_CONTROL_2_FORWARD_UNKNOWN;
}

+   reg |= PORT_CONTROL_2_8021Q_SECURE;
+


Vivien,

With this patch, my non-VLAN configuration fails; it appears that untagged
packets are no longer received. I found two possible solutions:
- Use PORT_CONTROL_2_8021Q_FALLBACK
- Explicitly add a VLAN entry for vid=0.

Guenter



Re: [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls

2015-06-02 Thread Eric W. Biederman
Thomas Graf tg...@suug.ch writes:

 On 06/02/15 at 01:26pm, Eric W. Biederman wrote:
 What we really want here is xfrm-lite.  By lite I mean the tunnel
 selection criteria is simple enough that it fits into the normal
 routing table instead of having to do weird flow based magic that
 is rarely needed.
 
 I believe what we want are the xfrm stacking of dst entries.

 I assume you are referring to reusing the selector and stacked
 dst. I considered that for the transmit side.

 Can you elaborate on this some more? How would this look like
 for the specific case of VXLAN? Any thoughts on the receive
 side? You also mention that you dislike the net_device approach.
 What do you suggest instead? The encapsulation is often postponed
 to after the packet is fully constructed. Where should it get
 hooked into?

Thomas I may have misunderstood what you are trying to do.

Is what you were aiming for roughly the existing RTA_FLOW so you can
transmit packets out one network device and have enough information to
know which of a set of tunnels of a given type you want the packets to go
into?

Eric





Re: [RFC 8/9] net: dsa: mv88e6352: set port 802.1Q mode to Secure

2015-06-02 Thread Vivien Didelot
Hi Guenter,

On Jun 2, 2015, at 10:31 AM, Guenter Roeck li...@roeck-us.net wrote:
On 06/01/2015 06:27 PM, Vivien Didelot wrote:
 This commit changes the 802.1Q mode of each port from Disabled to
 Secure. This enables the VLAN support, by checking the VTU entries on
 ingress.

 Signed-off-by: Vivien Didelot vivien.dide...@savoirfairelinux.com
 ---
   drivers/net/dsa/mv88e6xxx.c | 14 +++---
   drivers/net/dsa/mv88e6xxx.h |  5 +
   2 files changed, 12 insertions(+), 7 deletions(-)

 diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
 index ed49bd8..35243d8 100644
 --- a/drivers/net/dsa/mv88e6xxx.c
 +++ b/drivers/net/dsa/mv88e6xxx.c
 @@ -1723,13 +1723,11 @@ static int mv88e6xxx_setup_port(struct dsa_switch 
 *ds,
 int port)
  goto abort;
  }

 -/* Port Control 2: don't force a good FCS, set the maximum
 - * frame size to 10240 bytes, don't let the switch add or
 - * strip 802.1q tags, don't discard tagged or untagged frames
 - * on this port, do a destination address lookup on all
 - * received packets as usual, disable ARP mirroring and don't
 - * send a copy of all transmitted/received frames on this port
 - * to the CPU.
 +/* Port Control 2: don't force a good FCS, set the maximum frame size to
 + * 10240 bytes, enable secure 802.1q tags, don't discard tagged or
 + * untagged frames on this port, do a destination address lookup on all
 + * received packets as usual, disable ARP mirroring and don't send a
 + * copy of all transmitted/received frames on this port to the CPU.
   */
  reg = 0;
  if (mv88e6xxx_6352_family(ds) || mv88e6xxx_6351_family(ds) ||
 @@ -1751,6 +1749,8 @@ static int mv88e6xxx_setup_port(struct dsa_switch *ds, 
 int
 port)
  reg |= PORT_CONTROL_2_FORWARD_UNKNOWN;
  }

 +reg |= PORT_CONTROL_2_8021Q_SECURE;
 +
 
 Hi Vivien,
 
 Unless I misunderstand the documentation, this effectively disables VLAN
 support on non-bridge ports, especially since the ndo_ functions to add VLAN
 entries to such ports are not implemented. Is that intentional, or am I
 missing something ?

Indeed, I intentionally set the port mode to Secure to work on 802.1q.
For both cases, the Fallback mode should be enough; this mode checks the
VTU for a valid entry, otherwise checks the port-based VLAN map.

Supporting port-based VLAN looks like another tricky thread.

Ideally, this must be configurable. In my case I do need strict 802.1q.
Can ethtool/iproute2 do something about the port 802.1q mode?

Thanks,
-v


[PATCH V2 0/2] pci: Provide a flag to access VPD through function 0

2015-06-02 Thread Mark D Rustad
Many multi-function devices provide shared registers in extended
config space for accessing VPD. The behavior of these registers
means that the state must be tracked and access locked correctly
for accesses not to hang or worse. One way to meet these needs is
to always perform the accesses through function 0, thereby using
the state tracking and mutex that already exists.

To provide this behavior, add a dev_flags bit to indicate that this
should be done. This bit can then be set for any non-zero function
that needs to redirect such VPD access to function 0. Do not set
this bit on the zero function or there will be an infinite recursion.

The second patch uses this new flag to invoke this behavior on all
multi-function Intel Ethernet devices.

Signed-off-by: Mark Rustad mark.d.rus...@intel.com

---
Changes in V2:
- Corrected a spelling error in a log message
- Added checks to see that the referenced function 0 is reasonable

---

Mark D Rustad (2):
  pci: Add dev_flags bit to access VPD through function 0
  pci: Add VPD quirk for Intel Ethernet devices


 drivers/pci/access.c |   48 +++-
 drivers/pci/quirks.c |9 +
 2 files changed, 56 insertions(+), 1 deletion(-)

-- 
Mark Rustad, Network Division, Intel Corporation
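
The function-0 redirection described above relies on how a PCI devfn packs
a slot and a function number. The following is a small user-space sketch
of that arithmetic, not part of the patch: the three macros are local
copies of the usual kernel definitions, and f0_devfn() and
needs_f0_redirect() are hypothetical helper names chosen for illustration.
Note that pci_get_slot() takes a devfn value, so the sketch builds one with
PCI_DEVFN(slot, 0).

```c
/*
 * Local copies of the devfn helpers: a devfn packs a 5-bit slot number
 * and a 3-bit function number into one byte.
 */
#define PCI_DEVFN(slot, func)	((((slot) & 0x1f) << 3) | ((func) & 0x07))
#define PCI_SLOT(devfn)		(((devfn) >> 3) & 0x1f)
#define PCI_FUNC(devfn)		((devfn) & 0x07)

/* Hypothetical helper: devfn of function 0 in the same slot. */
static unsigned int f0_devfn(unsigned int devfn)
{
	return PCI_DEVFN(PCI_SLOT(devfn), 0);
}

/*
 * Hypothetical helper: should VPD access be redirected to function 0?
 * Function 0 itself must never be flagged, or the access would recurse.
 */
static int needs_f0_redirect(unsigned int devfn)
{
	return PCI_FUNC(devfn) != 0;
}
```

For example, function 2 of slot 3 has devfn 26, and its function 0
counterpart has devfn 24.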


[PATCH V2 1/2] pci: Add dev_flags bit to access VPD through function 0

2015-06-02 Thread Mark D Rustad
Add a dev_flags bit, PCI_DEV_FLAGS_VPD_REF_F0, to access VPD through
function 0 to provide VPD access on other functions. This solves
concurrent access problems on many devices without changing the
attributes exposed in sysfs. Never set this bit on function 0 or
there will be an infinite recursion.

Signed-off-by: Mark Rustad mark.d.rus...@intel.com
---
Changes in V2:
- Corrected spelling in log message
- Added checks to see that the referenced function 0 is reasonable
---
 drivers/pci/access.c |   48 +++-
 1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index d9b64a175990..74634d4868a2 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -439,6 +439,40 @@ static const struct pci_vpd_ops pci_vpd_pci22_ops = {
.release = pci_vpd_pci22_release,
 };
 
+static ssize_t pci_vpd_f0_read(struct pci_dev *dev, loff_t pos, size_t count,
+  void *arg)
+{
+   struct pci_dev *tdev = pci_get_slot(dev->bus, PCI_SLOT(dev->devfn));
+   ssize_t ret;
+
+   if (!tdev)
+   return -ENODEV;
+
+   ret = pci_read_vpd(tdev, pos, count, arg);
+   pci_dev_put(tdev);
+   return ret;
+}
+
+static ssize_t pci_vpd_f0_write(struct pci_dev *dev, loff_t pos, size_t count,
+   const void *arg)
+{
+   struct pci_dev *tdev = pci_get_slot(dev->bus, PCI_SLOT(dev->devfn));
+   ssize_t ret;
+
+   if (!tdev)
+   return -ENODEV;
+
+   ret = pci_write_vpd(tdev, pos, count, arg);
+   pci_dev_put(tdev);
+   return ret;
+}
+
+static const struct pci_vpd_ops pci_vpd_f0_ops = {
+   .read = pci_vpd_f0_read,
+   .write = pci_vpd_f0_write,
+   .release = pci_vpd_pci22_release,
+};
+
 int pci_vpd_pci22_init(struct pci_dev *dev)
 {
struct pci_vpd_pci22 *vpd;
@@ -447,12 +481,24 @@ int pci_vpd_pci22_init(struct pci_dev *dev)
cap = pci_find_capability(dev, PCI_CAP_ID_VPD);
if (!cap)
return -ENODEV;
+   if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) {
+   struct pci_dev *tdev;
+
+   tdev = pci_get_slot(dev->bus, PCI_SLOT(dev->devfn));
+   if (!tdev || !dev->multifunction || !tdev->multifunction ||
+   dev->class != tdev->class || dev->vendor != tdev->vendor ||
+   dev->device != tdev->device)
+   return -ENODEV;
+   }
vpd = kzalloc(sizeof(*vpd), GFP_ATOMIC);
if (!vpd)
return -ENOMEM;
 
vpd->base.len = PCI_VPD_PCI22_SIZE;
-   vpd->base.ops = &pci_vpd_pci22_ops;
+   if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0)
+   vpd->base.ops = &pci_vpd_f0_ops;
+   else
+   vpd->base.ops = &pci_vpd_pci22_ops;
mutex_init(&vpd->lock);
vpd->cap = cap;
vpd->busy = false;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 353db8dc4c6e..194df6d635e6 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -180,6 +180,8 @@ enum pci_dev_flags {
PCI_DEV_FLAGS_NO_BUS_RESET = (__force pci_dev_flags_t) (1 << 6),
/* Do not use PM reset even if device advertises NoSoftRst- */
PCI_DEV_FLAGS_NO_PM_RESET = (__force pci_dev_flags_t) (1 << 7),
+   /* Get VPD from function 0 VPD */
+   PCI_DEV_FLAGS_VPD_REF_F0 = (__force pci_dev_flags_t) (1 << 8),
 };
 
 enum pci_irq_reroute_variant {



[PATCH net iproute2 v4] mpls: always set type RTN_UNICAST and scope RT_SCOPE_UNIVERSE for route add/deletes

2015-06-02 Thread Roopa Prabhu
From: Roopa Prabhu ro...@cumulusnetworks.com

This patch fixes incorrect -EINVAL errors due to invalid
scope and type during mpls route deletes.

$ip -f mpls route add 100 as 200 via inet 10.1.1.2 dev swp1

$ip -f mpls route show
100 as to 200 via inet 10.1.1.2 dev swp1

$ip -f mpls route del 100 as 200 via inet 10.1.1.2 dev swp1
RTNETLINK answers: Invalid argument

$ip -f mpls route del 100
RTNETLINK answers: Invalid argument

After patch:

$ip -f mpls route show
100 as to 200 via inet 10.1.1.2 dev swp1

$ip -f mpls route del 100 as 200 via inet 10.1.1.2 dev swp1

$ip -f mpls route show

Always set type to RTN_UNICAST for mpls route add/deletes.
Also to keep things consistent with kernel set scope to
RT_SCOPE_UNIVERSE for both mpls and ipv6 routes. Both mpls and ipv6 route
deletes ignore scope.

Suggested-by: Eric W. Biederman ebied...@xmission.com
Signed-off-by: Roopa Prabhu ro...@cumulusnetworks.com
Signed-off-by: Vivek Venkataraman vi...@cumulusnetworks.com
--
v4 move fix to iproute2
---
 ip/iproute.c |   16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index 670a4c6..d0b9910 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -803,6 +803,7 @@ static int iproute_modify(int cmd, unsigned flags, int 
argc, char **argv)
int scope_ok = 0;
int table_ok = 0;
int raw = 0;
+   int type_ok = 0;
 
memset(&req, 0, sizeof(req));
 
@@ -1095,6 +1096,7 @@ static int iproute_modify(int cmd, unsigned flags, int 
argc, char **argv)
rtnl_rtntype_a2n(&type, *argv) == 0) {
NEXT_ARG();
req.r.rtm_type = type;
+   type_ok = 1;
}
 
if (matches(*argv, "help") == 0)
@@ -1136,6 +1138,9 @@ static int iproute_modify(int cmd, unsigned flags, int 
argc, char **argv)
if (nhs_ok)
parse_nexthops(&req.n, &req.r, argc, argv);
 
+   if (req.r.rtm_family == AF_UNSPEC)
+   req.r.rtm_family = AF_INET;
+
if (!table_ok) {
if (req.r.rtm_type == RTN_LOCAL ||
req.r.rtm_type == RTN_BROADCAST ||
@@ -1144,8 +1149,11 @@ static int iproute_modify(int cmd, unsigned flags, int 
argc, char **argv)
req.r.rtm_table = RT_TABLE_LOCAL;
}
if (!scope_ok) {
-   if (req.r.rtm_type == RTN_LOCAL ||
-   req.r.rtm_type == RTN_NAT)
+   if (req.r.rtm_family == AF_INET6 ||
+   req.r.rtm_family == AF_MPLS)
+   req.r.rtm_scope = RT_SCOPE_UNIVERSE;
+   else if (req.r.rtm_type == RTN_LOCAL ||
+req.r.rtm_type == RTN_NAT)
req.r.rtm_scope = RT_SCOPE_HOST;
else if (req.r.rtm_type == RTN_BROADCAST ||
 req.r.rtm_type == RTN_MULTICAST ||
@@ -1160,8 +1168,8 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
}
}
 
-   if (req.r.rtm_family == AF_UNSPEC)
-   req.r.rtm_family = AF_INET;
+   if (!type_ok && req.r.rtm_family == AF_MPLS)
+   req.r.rtm_type = RTN_UNICAST;
 
if (rtnl_talk(&rth, &req.n, 0, 0, NULL) < 0)
return -2;
-- 
1.7.10.4



Re: [RFC 7/9] net: dsa: mv88e6352: lock CPU port from learning addresses

2015-06-02 Thread Vivien Didelot
Hi Guenter,

On Jun 2, 2015, at 10:24 AM, Guenter Roeck li...@roeck-us.net wrote:
On 06/01/2015 06:27 PM, Vivien Didelot wrote:
 This commit disables SA learning and refreshing for the CPU port.

 
 Hi Vivien,
 
 This patch also seems to be unrelated to the rest of the series.
 
 Can you add an explanation why it is needed ?
 
 With this in place, how does the CPU port SA find its way into the fdb ?
 Do we assume that it will be configured statically ?
 An explanation might be useful.

Without this patch, I noticed the CPU port was stealing the SA of a PC
behind a switch port. This happened when the port was a bridge member,
as the bridge was relaying broadcasts coming from one switch port to the
other switch ports in the same vlan.

Thanks,
-v


Re: How do I avoid recvmsg races with IP_RECVERR?

2015-06-02 Thread Andy Lutomirski
On Tue, Jun 2, 2015 at 2:17 PM, Hannes Frederic Sowa
han...@stressinduktion.org wrote:
 On Tue, Jun 2, 2015, at 21:40, Andy Lutomirski wrote:
 As far as I can tell, enabling IP_RECVERR causes the presence of a
 queued error to cause recvmsg, etc to return an error (once).  It's
 worse, though: a new error can be queued asynchronously at any time,
 thus setting sk_err to a nonzero value.  How do I sensibly distinguish
 recvmsg failures due to genuine errors receiving messages from recvmsg
 failures because there's a queued error?

 The only way I can see to get reliable error handling is to literally
 call recvmsg in a loop:

 while (true /* or while POLLIN is set */) {
   int ret = recvmsg(..., MSG_ERRQUEUE not set);
   if (ret < 0 && /* what goes here? */) {
     /* whoops!  this might be a harmless asynchronous error! */
     /* take no action! */
   }
 }

 I see two possibilities:

 We export the icmp_err_convert tables along with the udp_lib_err error
 conversions to user space and spice them up with flags to mark if they
 are transient (icmp_err_convert already has a fatal flag).

This seems overcomplicated.  I'd rather have a flag I pass to tell the
kernel that I don't want to see transient errors (and that I'll clear
them myself using POLLERR and either MSG_ERRQUEUE or SO_ERROR).


 Otherwise you should be able to call recvmsg with MSG_ERRQUEUE set after
 you got a ret < 0 when calling without MSG_ERRQUEUE and inspect the
 sock_extended_err, no?

I do this already, which makes me think that there's a bug or another
race somewhere.  I've only seen a failure once in several years of
operation.

The failure happened on a ping socket.  I suspect that the race is:

ping_err: ip_icmp_error(...);

user: recvmsg(MSG_ERRQUEUE) and dequeues the error.

ping_err: sk_err = err;

user: recvmsg(MSG_ERRQUEUE not set), and recvmsg sees and clears the
error via sock_error.

user: recvmsg(MSG_ERRQUEUE), and recvmsg returns -EAGAIN.

Now the user code thinks that it was a real (non-transient) error and aborts.

Shouldn't that sk->sk_err = err assignment at least use WRITE_ONCE?

Even if this race were fixed, this interface still sucks IMO.

--Andy


Re: [net-next v2] ipv4: inet_bind: check the addr_len first

2015-06-02 Thread Hannes Frederic Sowa


On Tue, Jun 2, 2015, at 17:13, Denis Kirjanov wrote:
 On 6/2/15, Hannes Frederic Sowa han...@stressinduktion.org wrote:
  Hello,
 
  On Tue, Jun 2, 2015, at 14:21, Denis Kirjanov wrote:
  Perform the address length check first, before calling
  the proto specific bind() function
 
  Can you give more detail why you did this change and what bug it fixes?
 I've sent the v2 version with the net-next tag. The idea is simple:
 check the error condition first and then do the useful work.

Hmm, IMHO the specific proto-bind handlers have to take care of the
check themselves. You could argue that we should do the checks always in
inet_bind but then you have to remove the addr_len checks from the raw
and ping bind handlers, otherwise people become confused if they modify
the code. I am in favor of leaving the current logic as is, sorry.

Thanks,
Hannes


Re: How do I avoid recvmsg races with IP_RECVERR?

2015-06-02 Thread Hannes Frederic Sowa
On Tue, Jun 2, 2015, at 23:33, Andy Lutomirski wrote:
 On Tue, Jun 2, 2015 at 2:17 PM, Hannes Frederic Sowa
 han...@stressinduktion.org wrote:
  On Tue, Jun 2, 2015, at 21:40, Andy Lutomirski wrote:
  As far as I can tell, enabling IP_RECVERR causes the presence of a
  queued error to cause recvmsg, etc to return an error (once).  It's
  worse, though: a new error can be queued asynchronously at any time,
  thus setting sk_err to a nonzero value.  How do I sensibly distinguish
  recvmsg failures due to genuine errors receiving messages from recvmsg
  failures because there's a queued error?
 
  The only way I can see to get reliable error handling is to literally
  call recvmsg in a loop:
 
  while (true /* or while POLLIN is set */) {
    int ret = recvmsg(..., MSG_ERRQUEUE not set);
    if (ret < 0 && /* what goes here? */) {
      /* whoops!  this might be a harmless asynchronous error! */
      /* take no action! */
    }
  }
 
  I see two possibilities:
 
  We export the icmp_err_convert tables along with the udp_lib_err error
  conversions to user space and spice them up with flags to mark if they
  are transient (icmp_err_convert already has a fatal flag).
 
 This seems overcomplicated.  I'd rather have a flag I pass to tell the
 kernel that I don't want to see transient errors (and that I'll clear
 them myself using POLLERR and either MSG_ERRQUEUE or SO_ERROR).
 
 
  Otherwise you should be able to call recvmsg with MSG_ERRQUEUE set after
  you got a ret < 0 when calling without MSG_ERRQUEUE and inspect the
  sock_extended_err, no?
 
 I do this already, which makes me think that there's a bug or another
 race somewhere.  I've only seen a failure once in several years of
 operation.
 
 The failure happened on a ping socket.  I suspect that the race is:
 
 ping_err: ip_icmp_error(...);
 
 user: recvmsg(MSG_ERRQUEUE) and dequeues the error.
 
 ping_err: sk_err = err;
 
 user: recvmsg(MSG_ERRQUEUE not set), and recvmsg sees and clears the
 error via sock_error.
 
 user: recvmsg(MSG_ERRQUEUE), and recvmsg returns -EAGAIN.
 
 Now the user code thinks that it was a real (non-transient) error and
 aborts.
 
 Shouldn't that sk->sk_err = err assignment at least use WRITE_ONCE?

Hmm, I don't think this will help.

 Even if this race were fixed, this interface still sucks IMO.

Yes. :/

My proposal would be to make the error conversion lazy:

Keeping duplicate data is not a good idea in general, so we shouldn't
use sk->sk_err at all if IP_RECVERR is set, but let sock_error just use
the sk_error_queue and extract the error code from there.

Only if IP_RECVERR was not set do we use the sk->sk_err logic.
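
A toy user-space model of this idea, purely to illustrate the intended
semantics. Nothing here is kernel code: struct toy_sock and the toy_*
functions are hypothetical names, and the sketch assumes that the lazy
sock_error only peeks at the head of the queue without dequeuing it.

```c
#include <errno.h>
#include <stddef.h>

#define TOY_ERRQ_MAX 8

/* Hypothetical stand-in for the relevant parts of struct sock. */
struct toy_sock {
	int recverr;			/* models the IP_RECVERR option */
	int sk_err;			/* models sk->sk_err */
	int errq[TOY_ERRQ_MAX];		/* models sk_error_queue (errno values) */
	size_t errq_len;
};

/* Models the ICMP error path: only one of the two stores is used. */
static void toy_report_error(struct toy_sock *sk, int err)
{
	if (!sk->recverr) {
		sk->sk_err = err;
		return;
	}
	if (sk->errq_len < TOY_ERRQ_MAX)
		sk->errq[sk->errq_len++] = err;
}

/* Models sock_error(): derived lazily from the queue when recverr is set. */
static int toy_sock_error(struct toy_sock *sk)
{
	int err;

	if (sk->recverr)
		return sk->errq_len ? -sk->errq[0] : 0;

	err = sk->sk_err;
	sk->sk_err = 0;
	return -err;
}

/* Models recvmsg(MSG_ERRQUEUE): dequeues the oldest queued error. */
static int toy_pop_error(struct toy_sock *sk)
{
	int err;
	size_t i;

	if (!sk->errq_len)
		return -EAGAIN;
	err = sk->errq[0];
	for (i = 1; i < sk->errq_len; i++)
		sk->errq[i - 1] = sk->errq[i];
	sk->errq_len--;
	return err;
}
```

With a single source of truth there is no window in which the queue is
already empty while a stale sk_err is still pending.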

What do you think?

Bye,
Hannes




Re: [RFC net-next 0/3] IP imposition of per-nh MPLS encap

2015-06-02 Thread Robert Shearman

On 02/06/15 22:10, Eric W. Biederman wrote:

Robert Shearman rshea...@brocade.com writes:


On 02/06/15 19:11, Eric W. Biederman wrote:

Robert Shearman rshea...@brocade.com writes:


In order to be able to function as a Label Edge Router in an MPLS
network, it is necessary to be able to take IP packets and impose an
MPLS encap and forward them out. The traditional approach of setting
up an interface for each tunnel endpoint doesn't scale for the
common MPLS use-cases where each IP route tends to be assigned a
different label as encap.

The solution suggested here for further discussion is to provide the
facility to define encap data on a per-nexthop basis using a new
netlink attribute, RTA_ENCAP, which would be opaque to the IPv4/IPv6
forwarding code, but interpreted by the virtual interface assigned to
the nexthop.

A new ipmpls interface type is defined to show the use of this
facility to allow IP packets to be imposed with an MPLS
encap. However, the facility is designed to be general enough to be
used by any encapsulation/tunneling mechanism that has similar
requirements of high-scale, high-variation-of-encap.


I am still digging into the details but adding a new network device to
make this possible is very undesirable.

It is a pain point.  Those network devices get to be a major source of
memory consumption when there are 4K network namespaces in existence.

It is conceptually wrong.  The network device will never be used as an
ordinary network device.  All the network device gives you is the
ability to avoid creating an enumeration of different kinds of
encapsulation.


This isn't true. The network device also gives some of the things you
take for granted. Things like fragmentation through specifying the mtu
on the shared tunnel device, being able to specify rules using the
shared tunnel output device, IP stats, and the ability to specify a
different destination namespace.


Granted you get a few more things.  It is still conceptually wrong as
the network device will never be used as an ordinary network device.

Fragmentation is already silly because we are talking about multiple
tunnels with different properties.  You need per-route mtu to handle
that case.


It's unlikely you'll have a huge variation in the mtus across routes,
unless you're running in an ISP environment. In the example uses we've
got in hand, it's highly likely there'll only be a handful of different
mtus, if that.



Further I am not saying you don't need an output device (which is what
is needed to specify a different destination namespace) I am saying that
having a funny mpls device is wrong as far as I can see.  Certainly it
is a lot of bloody unnecessary overhead.

If we are going to design for maximum scaling (and 1 million+ routes
sounds like maximum scaling) we should see how far we can go without
dragging in the horrible heaviness of additional network devices.  35K a
piece last I measured it.  Just a small handful of them are already
scaling issues for network namespaces.


For the ipmpls interface I've implemented here, you only need one per 
namespace. You could argue the same for the veth interfaces which would 
be much more commonly used in network namespaces.


BTW, maybe I've missed something, or maybe netdevs have gone on a diet, 
but I count the cost of creating a basic interface at ~2700 bytes on x86_64:
  sizeof(struct net_device)              /* 2112 */
  + 1 * sizeof(struct netdev_queue)      /* 384 */
  + 1 * sizeof(struct netdev_rx_queue)   /* 128 */
  + sizeof(struct netdev_hw_addr)        /* 80 */
  + sizeof(int) * nr_poss_cpus           /* 4 * n */
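
As a worked version of that sum, a small sketch using only the sizes quoted
above. The function and constant names are made up for illustration, and the
numbers are the ones from this message rather than fresh measurements.

```c
#include <stddef.h>

/* Sizes in bytes as quoted in the message (x86_64); names are hypothetical. */
enum {
	SZ_NET_DEVICE      = 2112,
	SZ_NETDEV_QUEUE    = 384,
	SZ_NETDEV_RX_QUEUE = 128,
	SZ_NETDEV_HW_ADDR  = 80,
	SZ_PER_CPU_INT     = 4,
};

/* Minimum cost of one basic interface for n possible CPUs: 2704 + 4 * n. */
static size_t netdev_min_bytes(size_t nr_poss_cpus)
{
	return SZ_NET_DEVICE
	     + 1 * SZ_NETDEV_QUEUE
	     + 1 * SZ_NETDEV_RX_QUEUE
	     + SZ_NETDEV_HW_ADDR
	     + SZ_PER_CPU_INT * nr_poss_cpus;
}
```

The fixed part adds up to 2704 bytes, so with 8 possible CPUs the total is
2736 bytes, which matches the ~2700 figure.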


Thanks,
Rob


Re: [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls

2015-06-02 Thread Eric W. Biederman
Thomas Graf tg...@suug.ch writes:

 On 06/02/15 at 01:26pm, Eric W. Biederman wrote:
 What we really want here is xfrm-lite.  By lite I mean the tunnel
 selection criteria is simple enough that it fits into the normal
 routing table instead of having to do weird flow based magic that
 is rarely needed.
 
 I believe what we want are the xfrm stacking of dst entries.

 I assume you are referring to reusing the selector and stacked
 dst. I considered that for the transmit side.

 Can you elaborate on this some more? What would this look like
 for the specific case of VXLAN? Any thoughts on the receive
 side? You also mention that you dislike the net_device approach.
 What do you suggest instead? The encapsulation is often postponed
 to after the packet is fully constructed. Where should it get
 hooked into?

Things I think xfrm does correctly today:
- Transmitting things when an appropriate dst has been found.

Things I think xfrm could do better:
- Finding the dst entry.  Having to perform a separate lookup in a
  second set of tables looks slow, and not much maintained.

So if we focus on the normal routing case where lookup works today (aka
no source port or destination port based routing or any of the other
weird things, so we can use a standard fib lookup) I think I can explain
what I imagine things would look like.

To be clear I am focusing on the very light weight tunnels and I am not
certain vxlan applies.  It may be more reasonable to simply have a
single ethernet looking device that speaks vxlan behind the scenes.

If I look at vxlan as a set of ipv4 host routes (no arp, no unknown host
support) it looks like the kind of light-weight tunnel that we are
dealing with for mpls.

On the reception side packets that match the magic udp socket have their
tunneling bits stripped off and pushed up to the ip layer.  Roughly
equivalent to the current af_mpls code.

On the transmit side there would be a host route for each remote host.
In the fib we would store a pointer to a data structure that holds a
precomputed header to be prepended to the packet (inner ethernet, vxlan,
outer udp, outer ip).  That data pointer would become dst->xfrm when the
route lookup happens and we generate a route/dst entry.  There would
also be an output function in the fib and that output function would
become dst->output.  I would be more specific but I forget the
details of the fib_trie data structures.

The output function in the dst entry in the ipv4 route would
know how to interpret the pointer in the ipv4 routing table, prepend
the precomputed headers, update the precomputed udp header's source port
with the flow hash of the inner packet, and have an inner dst
so that it would essentially call ip_finish_output2 again, sending
the packet to its destination.
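
To make the transmit-side idea concrete, here is a toy user-space sketch of
"prepend a precomputed header and patch the UDP source port from the flow
hash". It is not kernel code: struct toy_encap, toy_flow_sport and
toy_encap_output are hypothetical names, and the mapping of the hash into
the 49152-65535 port range is an assumption made for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-route encap data, as it might hang off a fib entry. */
struct toy_encap {
	uint8_t hdr[64];	/* precomputed outer headers */
	size_t hdr_len;		/* bytes of hdr[] actually used */
	size_t sport_off;	/* offset of the UDP source port in hdr[] */
};

/* Assumed mapping: fold the flow hash into the dynamic port range. */
static uint16_t toy_flow_sport(uint32_t flow_hash)
{
	return (uint16_t)(49152u + (flow_hash % 16384u));
}

/*
 * Builds "precomputed header + inner packet" in out[].
 * Returns the total length, or 0 if the inputs do not fit.
 */
static size_t toy_encap_output(const struct toy_encap *e, uint32_t flow_hash,
			       const uint8_t *inner, size_t inner_len,
			       uint8_t *out, size_t out_cap)
{
	uint16_t sport = toy_flow_sport(flow_hash);

	if (e->hdr_len > sizeof(e->hdr) || e->sport_off + 2 > e->hdr_len)
		return 0;
	if (inner_len > out_cap || e->hdr_len > out_cap - inner_len)
		return 0;

	memcpy(out, e->hdr, e->hdr_len);
	/* Patch the source port in network byte order. */
	out[e->sport_off] = (uint8_t)(sport >> 8);
	out[e->sport_off + 1] = (uint8_t)(sport & 0xff);
	memcpy(out + e->hdr_len, inner, inner_len);
	return e->hdr_len + inner_len;
}
```

The point of the sketch is that the per-packet work is a copy plus a
two-byte patch, with no per-tunnel network device involved.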

There is some wiggle room but that is how I imagine things working, and
that is what I think we want for the mpls case.  Adding two pointers to
the fib could be interesting.  One pointer can be a union with the
output network device, the other pointer I am not certain about.

And of course we get fun cases where we have tunnels running through
other tunnels.  So there likely needs to be a bit of indirection going
on.

The problem I think needs to be solved is how to make tunnels very light
weight and cheap, so they can scale to 1 million+.  Enough so that the
kernel can hold a full routing table full of tunnels.

It looks like xfrm is almost there but its data structures appear to be
excessively complicated and inscrutable, and they require an extra lookup.

Eric



Re: [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls

2015-06-02 Thread Thomas Graf
On 06/02/15 at 01:26pm, Eric W. Biederman wrote:
 What we really want here is xfrm-lite.  By lite I mean the tunnel
 selection criteria is simple enough that it fits into the normal
 routing table instead of having to do weird flow based magic that
 is rarely needed.
 
 I believe what we want are the xfrm stacking of dst entries.

I assume you are referring to reusing the selector and stacked
dst. I considered that for the transmit side.

Can you elaborate on this some more? What would this look like
for the specific case of VXLAN? Any thoughts on the receive
side? You also mention that you dislike the net_device approach.
What do you suggest instead? The encapsulation is often postponed
to after the packet is fully constructed. Where should it get
hooked into?


Re: [RFC net-next 0/3] IP imposition of per-nh MPLS encap

2015-06-02 Thread Thomas Graf
On 06/02/15 at 02:28pm, Robert Shearman wrote:
 Nesting attributes inside the RTA_ENCAP blob should be supported by the
 patch series today. Something like this:

Sure. I'm not seeing such a construct for the MPLS case yet.

I'm happy to rebase my patches on top of your nexthop implementation.
It is definitely superior. Are you maintaining a git tree somewhere?


Re: [PATCH net v3 2/2] mpls: fix mpls route deletes to not check for route scope

2015-06-02 Thread roopa

On 6/2/15, 2:13 PM, Eric W. Biederman wrote:

So I just stopped and looked at what is happening.  When you originally
reported this you said (or at least I understood) that rtm_scope was not
being set in iproute.  I assumed that meant it was not being touched
and it was taking a default value of zero (or else it was possibly
floating).  Having looked neither is true.  iproute sets rtm_scope
to RT_SCOPE_NOWHERE during delete deliberately to act as a wild card.

In the kernel in other protocols currently ipv4 treats RT_SCOPE_NOWHERE
as a wild card during delete, decnet treats RT_SCOPE_NOWHERE as a wild
card during delete, the remaining protocols (ipv6, phonet, and can) that
implement RTM_DELROUTE do not look at rtm_scope at all.  Further ipv6
and phonet set rtm_scope to RT_SCOPE_UNIVERSE when dumped.

Which says to me that we have semantics in the kernel that no one has
let userspace know about, and that scares me when there is a
misunderstanding between the kernel and userspace about what fields
mean.  That inevitably leads to bugs.  The kind of bugs that I have
had to create security fixes for recently.

So I really think we should fix this in userspace so that someone
reading iproute will have a chance at knowing that these scopes do not
exist in ipv6 and mpls and that scope logic is just noise in those
cases.

ack, I did start with handling both type and scope
in iproute2. I misunderstood you when you said you did not care
about the scope in earlier comments, so I made the kernel not care about the
scope. :) But I only handled type in 'iproute2' in v2. Now it's clear. I
do have a similar patch to the one below.
Sorry about the iterations. I will respin. (If you prefer to post your
patch below yourself, please do. I am OK either way.) Thanks.


Something like:

 From 837dddea49af874fe750ab0712b3ef8066a2f55a Mon Sep 17 00:00:00 2001
From: Eric W. Biederman ebied...@xmission.com
Date: Tue, 2 Jun 2015 15:51:31 -0500
Subject: [PATCH] iproute: When deleting routes don't always set the scope to 
RT_SCOPE_NOWHERE

IPv6 and MPLS do not implement scopes on addresses and using
RT_SCOPE_NOWHERE is just confusing noise.  Use RT_SCOPE_UNIVERSE
instead so that it is clear what is actually happening in the code.

Signed-off-by: Eric W. Biederman ebied...@xmission.com
---
  ip/iproute.c | 11 +++++++----
  1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index fba475f65314..e9b991fdf62f 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -1136,6 +1136,9 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
if (nhs_ok)
parse_nexthops(&req.n, &req.r, argc, argv);
  
+   if (req.r.rtm_family == AF_UNSPEC)
+   req.r.rtm_family = AF_INET;
+
if (!table_ok) {
if (req.r.rtm_type == RTN_LOCAL ||
req.r.rtm_type == RTN_BROADCAST ||
@@ -1144,7 +1147,10 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
req.r.rtm_table = RT_TABLE_LOCAL;
}
if (!scope_ok) {
-   if (req.r.rtm_type == RTN_LOCAL ||
+   if (req.r.rtm_family == AF_INET6 ||
+   req.r.rtm_family == AF_MPLS)
+   req.r.rtm_scope = RT_SCOPE_UNIVERSE;
+   else if (req.r.rtm_type == RTN_LOCAL ||
req.r.rtm_type == RTN_NAT)
req.r.rtm_scope = RT_SCOPE_HOST;
else if (req.r.rtm_type == RTN_BROADCAST ||
@@ -1160,9 +1166,6 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
}
}
  
-   if (req.r.rtm_family == AF_UNSPEC)
-   req.r.rtm_family = AF_INET;
-
if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
return -2;
  




Re: [PATCH net v3 2/2] mpls: fix mpls route deletes to not check for route scope

2015-06-02 Thread Eric W. Biederman
roopa ro...@cumulusnetworks.com writes:

 On 6/2/15, 2:13 PM, Eric W. Biederman wrote:
 So I just stopped and looked at what is happening.  When you originally
 reported this you said (or at least I understood) that rtm_scope was not
 being set in iproute.  I assumed that meant it was not being touched
 and it was taking a default value of zero (or else it was possibly
 floating).  Having looked neither is true.  iproute sets rtm_scope
 to RT_SCOPE_NOWHERE during delete deliberately to act as a wild card.

 In the kernel in other protocols currently ipv4 treats RT_SCOPE_NOWHERE
 as a wild card during delete, decnet treats RT_SCOPE_NOWHERE as a wild
 card during delete, the remaining protocols (ipv6, phonet, and can) that
 implement RTM_DELROUTE do not look at rtm_scope at all.  Further ipv6
 and phonet set rtm_scope to RT_SCOPE_UNIVERSE when dumped.

 Which says to me that we have semantics in the kernel that no one has
 let userspace know about, and that scares me when there is a
 misunderstanding between the kernel and userspace about what fields
 mean.  That inevitably leads to bugs.  The kind of bugs that I have
 had to create security fixes for recently.

 So I really think we should fix this in userspace so that someone
 reading iproute will have a chance at knowing that these scopes do not
 exist in ipv6 and mpls and that scope logic is just noise in those
 cases.
 ack, I did start with handling both type and scope
 in iproute2. I misunderstood you when you said you did not care
 about the scope in earlier comments, so I made the kernel not care about the
 scope. :) But I only handled type in 'iproute2' in v2. Now it's clear. I
 do have a similar patch to the one below.
 Sorry about the iterations. I will respin. (If you prefer to post your
 patch below yourself, please do. I am OK either way.) Thanks.

I don't have enough energy to follow through with more than review
today.

Eric


Re: [PATCH net v3 2/2] mpls: fix mpls route deletes to not check for route scope

2015-06-02 Thread Eric W. Biederman
Roopa Prabhu ro...@cumulusnetworks.com writes:

 From: Roopa Prabhu ro...@cumulusnetworks.com

 Ignore scope for route del messages

So I just stopped and looked at what is happening.  When you originally
reported this you said (or at least I understood) that rtm_scope was not
being set in iproute.  I assumed that meant it was not being touched
and it was taking a default value of zero (or else it was possibly
floating).  Having looked neither is true.  iproute sets rtm_scope
to RT_SCOPE_NOWHERE during delete deliberately to act as a wild card.

In the kernel in other protocols currently ipv4 treats RT_SCOPE_NOWHERE
as a wild card during delete, decnet treats RT_SCOPE_NOWHERE as a wild
card during delete, the remaining protocols (ipv6, phonet, and can) that
implement RTM_DELROUTE do not look at rtm_scope at all.  Further ipv6
and phonet set rtm_scope to RT_SCOPE_UNIVERSE when dumped.

Which says to me that we have semantics in the kernel that no one has
let userspace know about, and that scares me when there is a
misunderstanding between the kernel and userspace about what fields
mean.  That inevitably leads to bugs.  The kind of bugs that I have
had to create security fixes for recently.

So I really think we should fix this in userspace so that someone
reading iproute will have a chance at knowing that these scopes do not
exist in ipv6 and mpls and that scope logic is just noise in those
cases.

Something like:

From 837dddea49af874fe750ab0712b3ef8066a2f55a Mon Sep 17 00:00:00 2001
From: Eric W. Biederman ebied...@xmission.com
Date: Tue, 2 Jun 2015 15:51:31 -0500
Subject: [PATCH] iproute: When deleting routes don't always set the scope to 
RT_SCOPE_NOWHERE

IPv6 and MPLS do not implement scopes on addresses and using
RT_SCOPE_NOWHERE is just confusing noise.  Use RT_SCOPE_UNIVERSE
instead so that it is clear what is actually happening in the code.

Signed-off-by: Eric W. Biederman ebied...@xmission.com
---
 ip/iproute.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index fba475f65314..e9b991fdf62f 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -1136,6 +1136,9 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
if (nhs_ok)
parse_nexthops(&req.n, &req.r, argc, argv);
 
+   if (req.r.rtm_family == AF_UNSPEC)
+   req.r.rtm_family = AF_INET;
+
if (!table_ok) {
if (req.r.rtm_type == RTN_LOCAL ||
req.r.rtm_type == RTN_BROADCAST ||
@@ -1144,7 +1147,10 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
req.r.rtm_table = RT_TABLE_LOCAL;
}
if (!scope_ok) {
-   if (req.r.rtm_type == RTN_LOCAL ||
+   if (req.r.rtm_family == AF_INET6 ||
+   req.r.rtm_family == AF_MPLS)
+   req.r.rtm_scope = RT_SCOPE_UNIVERSE;
+   else if (req.r.rtm_type == RTN_LOCAL ||
req.r.rtm_type == RTN_NAT)
req.r.rtm_scope = RT_SCOPE_HOST;
else if (req.r.rtm_type == RTN_BROADCAST ||
@@ -1160,9 +1166,6 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
}
}
 
-   if (req.r.rtm_family == AF_UNSPEC)
-   req.r.rtm_family = AF_INET;
-
if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
return -2;
 
-- 
2.2.1
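
The defaulting rule this patch is after can be restated as a small
standalone function. This is a simplified sketch, not the iproute2 code:
default_route_scope is a hypothetical name, the broadcast/multicast branches
are left out, and the "delete uses RT_SCOPE_NOWHERE as a wild card" branch
is reduced to a single check on the command.

```c
#include <sys/socket.h>
#include <linux/rtnetlink.h>

#ifndef AF_MPLS
#define AF_MPLS 28	/* Linux value; some older libc headers lack it */
#endif

/*
 * Hypothetical helper: pick the scope to send when the user gave none.
 * The family is resolved first so that the family test below sees the
 * final value rather than AF_UNSPEC.
 */
static unsigned char default_route_scope(int cmd, unsigned char family,
					 unsigned char type,
					 unsigned char scope)
{
	if (family == AF_UNSPEC)
		family = AF_INET;

	/* IPv6 and MPLS have no route scopes: always say "universe". */
	if (family == AF_INET6 || family == AF_MPLS)
		return RT_SCOPE_UNIVERSE;

	if (type == RTN_LOCAL || type == RTN_NAT)
		return RT_SCOPE_HOST;

	/* Simplification: on delete, IPv4 uses NOWHERE as a wild card. */
	if (cmd == RTM_DELROUTE)
		return RT_SCOPE_NOWHERE;

	return scope;
}
```

Written this way, the scope logic for IPv6 and MPLS is visibly a constant,
which is the readability point made above.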




Re: [PATCH net iproute2 v3 1/2] mpls: always set type as RTN_UNICAST for route add/deletes

2015-06-02 Thread Eric W. Biederman
Roopa Prabhu ro...@cumulusnetworks.com writes:

 From: Roopa Prabhu ro...@cumulusnetworks.com

 Kernel expects type RTN_UNICAST for mpls route/dels

There is almost a bug in this patch.  You test req.r.rtm_family just
before the default case of AF_UNSPEC is set to AF_INET.  That should
not affect anything in this case, but it is downright confusing to think
about, and could lead to maintenance problems in the future.

Otherwise
Acked-by: Eric W. Biederman ebied...@xmission.com


 Signed-off-by: Vivek Venkataraman vi...@cumulusnetworks.com
 Signed-off-by: Roopa Prabhu ro...@cumulusnetworks.com
 Reviewed-by: Robert Shearman rshea...@brocade.com 
 ---
  ip/iproute.c |    5 +++++
  1 file changed, 5 insertions(+)

 diff --git a/ip/iproute.c b/ip/iproute.c
 index 670a4c6..71c088b 100644
 --- a/ip/iproute.c
 +++ b/ip/iproute.c
 @@ -803,6 +803,7 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
    int scope_ok = 0;
    int table_ok = 0;
    int raw = 0;
  + int type_ok = 0;
  
    memset(&req, 0, sizeof(req));
  
 @@ -1095,6 +1096,7 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
    rtnl_rtntype_a2n(&type, *argv) == 0) {
    NEXT_ARG();
    req.r.rtm_type = type;
  + type_ok = 1;
    }
  
    if (matches(*argv, "help") == 0)
 @@ -1160,6 +1162,9 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
    }
    }
  
  + if (!type_ok && req.r.rtm_family == AF_MPLS)
 + req.r.rtm_type = RTN_UNICAST;
 +
   if (req.r.rtm_family == AF_UNSPEC)
   req.r.rtm_family = AF_INET;


[PATCH iproute2 -next] tc: {f,m}_bpf: allow to retrieve uds path from env

2015-06-02 Thread Daniel Borkmann
Allow retrieving the uds path from the environment; this also makes
dealing with export a bit easier.

Signed-off-by: Daniel Borkmann dan...@iogearbox.net
---
 tc/f_bpf.c  | 6 ++++--
 tc/m_bpf.c  | 6 ++++--
 tc/tc_bpf.h | 2 ++
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/tc/f_bpf.c b/tc/f_bpf.c
index 597ef60..c21bf33 100644
--- a/tc/f_bpf.c
+++ b/tc/f_bpf.c
@@ -122,6 +122,7 @@ opt_bpf:
 
NEXT_ARG();
if (ebpf) {
+   bpf_uds_name = secure_getenv(BPF_ENV_UDS);
bpf_obj = *argv;
NEXT_ARG();
 
@@ -131,8 +132,9 @@ opt_bpf:
bpf_sec_name = *argv;
NEXT_ARG();
}
-   if (strcmp(*argv, "export") == 0 ||
-   strcmp(*argv, "exp") == 0) {
+   if (!bpf_uds_name &&
+   (strcmp(*argv, "export") == 0 ||
+strcmp(*argv, "exp") == 0)) {
NEXT_ARG();
bpf_uds_name = *argv;
NEXT_ARG();
diff --git a/tc/m_bpf.c b/tc/m_bpf.c
index 0621157..9ddb667 100644
--- a/tc/m_bpf.c
+++ b/tc/m_bpf.c
@@ -105,6 +105,7 @@ opt_bpf:
 
NEXT_ARG();
if (ebpf) {
+   bpf_uds_name = secure_getenv(BPF_ENV_UDS);
bpf_obj = *argv;
NEXT_ARG();
 
@@ -114,8 +115,9 @@ opt_bpf:
bpf_sec_name = *argv;
NEXT_ARG();
}
-   if (strcmp(*argv, "export") == 0 ||
-   strcmp(*argv, "exp") == 0) {
+   if (!bpf_uds_name &&
+   (strcmp(*argv, "export") == 0 ||
+strcmp(*argv, "exp") == 0)) {
NEXT_ARG();
bpf_uds_name = *argv;
NEXT_ARG();
diff --git a/tc/tc_bpf.h b/tc/tc_bpf.h
index 5a697e5..2ad8812 100644
--- a/tc/tc_bpf.h
+++ b/tc/tc_bpf.h
@@ -25,6 +25,8 @@
 #include "utils.h"
 #include "bpf_scm.h"
 
+#define BPF_ENV_UDS "TC_BPF_UDS"
+
 int bpf_parse_string(char *arg, bool from_file, __u16 *bpf_len,
 char **bpf_string, bool *need_release,
 const char separator);
-- 
1.9.3
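
The resulting precedence (environment first, "export" argument only as a
fallback) can be sketched on its own as follows. pick_uds_path is a
hypothetical helper, not part of tc, and the sketch deliberately swaps in
plain getenv() where the patch uses secure_getenv(), to keep the example
portable; the variable name TC_BPF_UDS is taken from the BPF_ENV_UDS define.

```c
#include <stdlib.h>

#define BPF_ENV_UDS "TC_BPF_UDS"

/*
 * Hypothetical helper: choose the uds path.
 * A path from the environment wins; the command-line "export" argument
 * (may be NULL) is only used when the variable is unset.
 * Note: plain getenv() is used here instead of secure_getenv().
 */
static const char *pick_uds_path(const char *export_arg)
{
	const char *from_env = getenv(BPF_ENV_UDS);

	return from_env ? from_env : export_arg;
}
```

With the variable exported once in the shell, repeated tc invocations would
not need to pass the export keyword each time.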



Re: How do I avoid recvmsg races with IP_RECVERR?

2015-06-02 Thread Andy Lutomirski
On Tue, Jun 2, 2015 at 2:42 PM, Hannes Frederic Sowa
han...@stressinduktion.org wrote:
 On Tue, Jun 2, 2015, at 23:33, Andy Lutomirski wrote:
 On Tue, Jun 2, 2015 at 2:17 PM, Hannes Frederic Sowa
 han...@stressinduktion.org wrote:
  On Tue, Jun 2, 2015, at 21:40, Andy Lutomirski wrote:
  As far as I can tell, enabling IP_RECVERR causes the presence of a
  queued error to cause recvmsg, etc to return an error (once).  It's
  worse, though: a new error can be queued asynchronously at any time,
  thus setting sk_err to a nonzero value.  How do I sensibly distinguish
  recvmsg failures due to genuine errors receiving messages from recvmsg
  failures because there's a queued error?
 
  The only way I can see to get reliable error handling is to literally
  call recvmsg in a loop:
 
  while (true /* or while POLLIN is set */) {
    int ret = recvmsg(..., MSG_ERRQUEUE not set);
    if (ret < 0 && /* what goes here? */) {
      /* whoops!  this might be a harmless asynchronous error! */
      /* take no action! */
    }
  }
 
  I see two possibilities:
 
  We export the icmp_err_convert tables along with the udp_lib_err error
  conversions to user space and spice them up with flags to mark if they
  are transient (icmp_err_convert already has a fatal flag).

 This seems overcomplicated.  I'd rather have a flag I pass to tell the
 kernel that I don't want to see transient errors (and that I'll clear
 them myself using POLLERR and either MSG_ERRQUEUE or SO_ERROR).

 
  Otherwise you should be able to call recvmsg with MSG_ERRQUEUE set after
  you got a ret < 0 when calling without MSG_ERRQUEUE and inspect the
  sock_extended_err, no?

 I do this already, which makes me think that there's a bug or another
 race somewhere.  I've only seen a failure once in several years of
 operation.

 The failure happened on a ping socket.  I suspect that the race is:

 ping_err: ip_icmp_error(...);

 user: recvmsg(MSG_ERRQUEUE) and dequeues the error.

 ping_err: sk_err = err;

 user: recvmsg(MSG_ERRQUEUE not set), and recvmsg sees and clears the
 error via sock_error.

 user: recvmsg(MSG_ERRQUEUE), and recvmsg returns -EAGAIN.

 Now the user code thinks that it was a real (non-transient) error and
 aborts.

 Shouldn't that sk->sk_err = err assignment at least use WRITE_ONCE?

 Hmm, I don't think this will help.

It won't help this race, but it'll at least make it clearer that the
code has some kind of reasonably well-defined semantics.


 Even if this race were fixed, this interface still sucks IMO.

 Yes. :/

 My proposal would be to make the error conversion lazy:

 Keeping duplicate data is not a good idea in general: So we shouldn't
 use sk->sk_err if IP_RECVERR is set at all but let sock_error just use
 the sk_error_queue and extract the error code from there.

 Only if IP_RECVERR was not set, we use sk->sk_err logic.

 What do you think?

That seems entirely sensible to me, except that it might break some
existing application.  There's also this code:

if ((family == AF_INET && !inet_sock->recverr) ||
    (family == AF_INET6 && !inet6_sk(sk)->recverr)) {
        if (!harderr || sk->sk_state != TCP_ESTABLISHED)
                goto out;  <-- skips the assignment to sk_err

which means that recverr kind of has the opposite semantics right now.

In fact, the man page agrees with the current behavior (minus the race):

   IP_RECVERR (since Linux 2.2)
      Enable extended reliable error message passing.  When enabled on
      a datagram socket, all generated errors will be queued in a
      per-socket error queue.  When the user receives an error from a
      socket operation, the errors can be received by calling
      recvmsg(2) with the MSG_ERRQUEUE flag set.  The
      sock_extended_err structure describing the error will be passed
      in an ancillary message with the type IP_RECVERR and the level
      IPPROTO_IP.  This is useful for reliable error handling on
      unconnected sockets.  The received data portion of the error
      queue contains the error packet.

The sensible semantics would be to change this to "When the user
receives POLLERR, the errors can be received ..."  So maybe there
should be another value for IP_RECVERR to opt in to the alternate
semantics.

--Andy


Re: How do I avoid recvmsg races with IP_RECVERR?

2015-06-02 Thread Hannes Frederic Sowa
On Tue, Jun 2, 2015, at 23:42, Hannes Frederic Sowa wrote:
 On Tue, Jun 2, 2015, at 23:33, Andy Lutomirski wrote:
  On Tue, Jun 2, 2015 at 2:17 PM, Hannes Frederic Sowa
  han...@stressinduktion.org wrote:
   On Tue, Jun 2, 2015, at 21:40, Andy Lutomirski wrote:
 
  [...]
 
  I do this already, which makes me think that there's a bug or another
  race somewhere.  I've only seen a failure once in several years of
  operation.
  
  The failure happened on a ping socket.  I suspect that the race is:
  
  ping_err: ip_icmp_error(...);
  
  user: recvmsg(MSG_ERRQUEUE) and dequeues the error.
  
  ping_err: sk_err = err;
  
  user: recvmsg(MSG_ERRQUEUE not set), and recvmsg sees and clears the
  error via sock_error.
  
  user: recvmsg(MSG_ERRQUEUE), and recvmsg returns -EAGAIN.
  
  Now the user code thinks that it was a real (non-transient) error and
  aborts.
  
  Shouldn't that sk->sk_err = err assignment at least use WRITE_ONCE?
 
 Hmm, I don't think this will help.
 
  Even if this race were fixed, this interface still sucks IMO.
 
 Yes. :/
 
 My proposal would be to make the error conversion lazy:
 
 Keeping duplicate data is not a good idea in general: So we shouldn't
 use sk->sk_err if IP_RECVERR is set at all but let sock_error just use
 the sk_error_queue and extract the error code from there.
 
 Only if IP_RECVERR was not set, we use sk->sk_err logic.
 
 What do you think?

I just noticed that this will probably break existing user space
applications which require that icmp errors are transient even with
IP_RECVERR. We can mark that with a bit in the sk_error_queue pointer
and xchg the pointer, hmmm

Bye,
Hannes


Re: [RFC 3/9] net: dsa: mv88e6xxx: add support for VTU ops

2015-06-02 Thread nolan

On 06/02/2015 12:44 AM, Scott Feldman wrote:

That brings up an interesting point about having multiple bridges with
the same vlan configured.  I struggled with that problem with rocker
also and I don't have an answer other than "don't do that".  Or,
better put, if you have multiple bridges on the same vlan, just use one
bridge for that vlan.  Otherwise, I don't know how at the device level
to partition the vlan between the bridges.  Maybe that's what Vivien
is facing also?  I can see how this works for software-only bridges,
because they should be isolated from each other and independent.  But
when offloading to a device which sees VLAN XXX global across the
entire switch, I don't see how we can preserve the bridge boundaries.


Scott,

I'm confused by this.  I think you're saying this config is problematic:

br0: eth0.100, eth1.100
br1: eth2.100, eth3.100

But this works fine today.

Could you clarify the issue you're referring to?

Thanks,
- nolan


[PATCH net-next] bpf: introduce bpf_clone_redirect() helper

2015-06-02 Thread Alexei Starovoitov
Allow eBPF programs attached to classifier/actions to call
bpf_clone_redirect(skb, ifindex, flags) helper which will
mirror or redirect the packet by dynamic ifindex selection
from within the program to a target device either at ingress
or at egress. Can be used for various scenarios, for example,
to load balance skbs into veths, split parts of the traffic
to local taps, etc.

Signed-off-by: Alexei Starovoitov a...@plumgrid.com
Acked-by: Daniel Borkmann dan...@iogearbox.net
---
 include/uapi/linux/bpf.h |   10 ++
 net/core/filter.c|   40 
 2 files changed, 50 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 72f3080afa1e..42aa19abab86 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -220,6 +220,16 @@ enum bpf_func_id {
 * Return: 0 on success
 */
BPF_FUNC_tail_call,
+
+   /**
+* bpf_clone_redirect(skb, ifindex, flags) - redirect to another netdev
+* @skb: pointer to skb
+* @ifindex: ifindex of the net device
+* @flags: bit 0 - if set, redirect to ingress instead of egress
+* other bits - reserved
+* Return: 0 on success
+*/
+   BPF_FUNC_clone_redirect,
__BPF_FUNC_MAX_ID,
 };
 
diff --git a/net/core/filter.c b/net/core/filter.c
index b78a010a957f..64c121c09655 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -46,6 +46,7 @@
 #include <linux/seccomp.h>
 #include <linux/if_vlan.h>
 #include <linux/bpf.h>
+#include <net/sch_generic.h>
 
 /**
  * sk_filter - run a packet through a socket filter
@@ -1407,6 +1408,43 @@ const struct bpf_func_proto bpf_l4_csum_replace_proto = {
.arg5_type  = ARG_ANYTHING,
 };
 
+#define BPF_IS_REDIRECT_INGRESS(flags) ((flags) & 1)
+
+static u64 bpf_clone_redirect(u64 r1, u64 ifindex, u64 flags, u64 r4, u64 r5)
+{
+   struct sk_buff *skb = (struct sk_buff *) (long) r1, *skb2;
+   struct net_device *dev;
+
+   dev = dev_get_by_index_rcu(dev_net(skb->dev), ifindex);
+   if (unlikely(!dev))
+   return -EINVAL;
+
+   if (unlikely(!(dev->flags & IFF_UP)))
+   return -EINVAL;
+
+   skb2 = skb_clone(skb, GFP_ATOMIC);
+   if (unlikely(!skb2))
+   return -ENOMEM;
+
+   if (G_TC_AT(skb2->tc_verd) & AT_INGRESS)
+   skb_push(skb2, skb2->mac_len);
+
+   if (BPF_IS_REDIRECT_INGRESS(flags))
+   return dev_forward_skb(dev, skb2);
+
+   skb2->dev = dev;
+   return dev_queue_xmit(skb2);
+}
+
+const struct bpf_func_proto bpf_clone_redirect_proto = {
+   .func   = bpf_clone_redirect,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_CTX,
+   .arg2_type  = ARG_ANYTHING,
+   .arg3_type  = ARG_ANYTHING,
+};
+
 static const struct bpf_func_proto *
 sk_filter_func_proto(enum bpf_func_id func_id)
 {
@@ -1440,6 +1478,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
return &bpf_l3_csum_replace_proto;
case BPF_FUNC_l4_csum_replace:
return &bpf_l4_csum_replace_proto;
+   case BPF_FUNC_clone_redirect:
+   return &bpf_clone_redirect_proto;
default:
return sk_filter_func_proto(func_id);
}
-- 
1.7.9.5



Re: [RFC net-next 0/3] IP imposition of per-nh MPLS encap

2015-06-02 Thread Eric W. Biederman
Robert Shearman rshea...@brocade.com writes:

 On 02/06/15 22:10, Eric W. Biederman wrote:
 Robert Shearman rshea...@brocade.com writes:

 On 02/06/15 19:11, Eric W. Biederman wrote:
 Robert Shearman rshea...@brocade.com writes:

 In order to be able to function as a Label Edge Router in an MPLS
 network, it is necessary to be able to take IP packets and impose an
 MPLS encap and forward them out. The traditional approach of setting
 up an interface for each tunnel endpoint doesn't scale for the
 common MPLS use-cases where each IP route tends to be assigned a
 different label as encap.

 The solution suggested here for further discussion is to provide the
 facility to define encap data on a per-nexthop basis using a new
 netlink attribute, RTA_ENCAP, which would be opaque to the IPv4/IPv6
 forwarding code, but interpreted by the virtual interface assigned to
 the nexthop.

 A new ipmpls interface type is defined to show the use of this
 facility to allow IP packets to be imposed with an MPLS
 encap. However, the facility is designed to be general enough to be
 used by any encapsulation/tunneling mechanism that has similar
 requirements of high-scale, high-variation-of-encap.

 I am still digging into the details but adding a new network device to
 make this possible is very undesirable.

 It is a pain point.  Those network devices get to be a major source of
 memory consumption when there are 4K network namespaces in existence.

 It is conceptually wrong.  The network device will never be used as an
 ordinary network device.  All the network device gives you is the
 ability to avoid creating an enumeration of different kinds of
 encapsulation.

 This isn't true. The network device also gives some of the things you
 take for granted. Things like fragmentation through specifying the mtu
 on the shared tunnel device, being able to specify rules using the
 shared tunnel output device, IP stats, and the ability to specify a
 different destination namespace.

 Granted you get a few more things.  It is still conceptually wrong as
 the network device will never be used as an ordinary network device.

 Fragmentation is already silly because we are talking about multiple
 tunnels with different properties.  You need per-route mtu to handle
 that case.

 It's unlikely you'll have a huge variation in the mtus across routes,
 unless you're running in an ISP environment. In the example uses we've
 got in hand, it's highly likely there'll only be a handful of different
 mtus, if that.

Did we ever implement an mpls mtu per netdevice (I think so).
Anyway the tunnel mtu is easy enough to calculate in context (base mtu -
tunnel overhead).  So for default we should not need to do much.
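
The "base mtu - tunnel overhead" default can be sketched as follows, assuming the only overhead is the MPLS label stack at 4 bytes per label stack entry. The helper name mpls_tunnel_mtu() is hypothetical and is not taken from any patch in this thread.

```c
#include <assert.h>
#include <limits.h>

/* One MPLS label stack entry is 32 bits wide. */
#define MPLS_LABEL_ENTRY_BYTES 4u

/* Hypothetical helper: payload MTU left after pushing num_labels labels
 * onto a link whose MTU is base_mtu.  Returns 0 if nothing is left. */
unsigned int mpls_tunnel_mtu(unsigned int base_mtu, unsigned int num_labels)
{
    unsigned int overhead;

    assert(num_labels <= UINT_MAX / MPLS_LABEL_ENTRY_BYTES);
    overhead = num_labels * MPLS_LABEL_ENTRY_BYTES;

    return overhead >= base_mtu ? 0 : base_mtu - overhead;
}
```

With a 1500-byte link, one label leaves 1496 bytes and two labels leave 1492.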

 Further I am not saying you don't need an output device (which is what
 is needed to specify a different destination namespace) I am saying that
 having a funny mpls device is wrong as far as I can see.  Certainly it
 is a lot of bloody unnecessary overhead.

 If we are going to design for maximum scaling (and 1 million+ routes
 sounds like maximum scaling) we should see how far we can go without
 dragging in the horrible heaviness of additional network devices.  35K a
 piece last I measured it.  Just a small handful of them are already
 scaling issues for network namespaces.

 For the ipmpls interface I've implemented here, you only need one per
 namespace. You could argue the same for the veth interfaces which
 would be much more commonly used in network namespaces.

But if I can avoid the extra 143M (35 KiB * 4096 namespaces) I would like to.

On the drawing board is getting cross namespace routes so with a little
luck I will only need loopback devices in most of my network namespaces
when the dust settles.

Outputting to network devices in another network namespace is
fundamentally simple but I haven't taken the time to figure out which
assumptions I may have to purge to make it work reliably.

 BTW, maybe I've missed something, or maybe netdevs have gone on a
 diet, but I count the cost of creating a basic interface at ~2700
 bytes on x86_64:
 sizeof(struct net_device) /* 2112 */ + 1 * sizeof(struct netdev_queue)
 /* 384 */ + 1 * sizeof(struct netdev_rx_queue) /* 128 */ +
 sizeof(struct netdev_hw_addr) /* 80 */ + sizeof(int) * nr_poss_cpus /*
 4 * n */

It is a non-trivial thing to measure.  You really have to create a lot
of them and see how much memory is consumed.  But between the per cpu
stats, the sysctl attributes, the sysfs attributes and everything else
an actual working netdevice in an all yes config kernel was consuming
something like 35K not too long ago.
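
To put the two estimates in this exchange side by side, here is a small arithmetic sketch. The structure sizes are the figures quoted above, not measured here, and both function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of the structure sizes quoted above for one basic interface on
 * x86_64, plus sizeof(int) == 4 bytes per possible CPU. */
size_t netdev_struct_bytes(size_t nr_possible_cpus)
{
    const size_t net_device = 2112;
    const size_t netdev_queue = 384;
    const size_t netdev_rx_queue = 128;
    const size_t netdev_hw_addr = 80;
    const size_t fixed = net_device + netdev_queue + netdev_rx_queue +
                         netdev_hw_addr;

    assert(nr_possible_cpus <= (SIZE_MAX - fixed) / 4);
    return fixed + 4 * nr_possible_cpus;
}

/* Cost of one device of per_device_bytes in each of nr_namespaces. */
size_t per_namespace_total_bytes(size_t per_device_bytes, size_t nr_namespaces)
{
    assert(nr_namespaces == 0 || per_device_bytes <= SIZE_MAX / nr_namespaces);
    return per_device_bytes * nr_namespaces;
}
```

With 8 possible CPUs the structure sum is 2736 bytes, or roughly 10.7 MiB across 4096 namespaces. At 35 KiB per working device, the same 4096 namespaces cost 146800640 bytes, which is 140 MiB.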

Eric



Re: [RFC 7/9] net: dsa: mv88e6352: lock CPU port from learning addresses

2015-06-02 Thread Guenter Roeck

On 06/02/2015 07:31 PM, Chris Healy wrote:

Guenter,

That's a very valid concern.  I have a configuration with a 6352 controlled by 
a low end ARM core with a 100mbps connection on the CPU port.  This switch 
needs to support passing multicast streams that are more than 100mbps on GigE 
links.  (The ARM does not need to consume the multicast, it just manages the 
switch.)



Possibly, but Vivien didn't answer my question (how the local SA address finds
its way into the switch fdb). I'll check it myself.

Thanks,
Guenter


On Jun 3, 2015 3:24 AM, Guenter Roeck li...@roeck-us.net wrote:

On Tue, Jun 02, 2015 at 09:06:15PM -0400, Vivien Didelot wrote:
  Hi Guenter,
 
  On Jun 2, 2015, at 10:24 AM, Guenter Roeck li...@roeck-us.net wrote:
  On 06/01/2015 06:27 PM, Vivien Didelot wrote:
   This commit disables SA learning and refreshing for the CPU port.
  
  
   Hi Vivien,
  
   This patch also seems to be unrelated to the rest of the series.
  
   Can you add an explanation why it is needed ?
  
   With this in place, how does the CPU port SA find its way into the fdb ?
   Do we assume that it will be configured statically ?
   An explanation might be useful.
 
  Without this patch, I noticed the CPU port was stealing the SA of a PC
  behind a switch port. this happened when the port was a bridge member,
  as the bridge was relaying broadcast coming from one switch port to the
  other switch ports in the same vlan.
 
Makes me feel really uncomfortable. I think we may be going into the wrong
direction. The whole point of offloading bridging is to have the switch handle
forwarding, and that includes multicasts and broadcasts. Instead of doing that,
it looks like we put more and more workarounds in place.

Maybe the software bridge code needs to understand that it isn't supposed to
forward broadcasts to ports of an offloaded bridge, and we should let the
switch chip handle it ?

Thanks,
Guenter





Re: [RFC 3/9] net: dsa: mv88e6xxx: add support for VTU ops

2015-06-02 Thread Vivien Didelot
Guenter,

On Jun 2, 2015, at 2:50 AM, Guenter Roeck li...@roeck-us.net wrote:
 On 06/01/2015 06:27 PM, Vivien Didelot wrote:
 +/* Bringing an interface up adds it to the VLAN 0. Ignore this. */
 +if (!vid)
 +return 0;
 +
 
 Me puzzled ;-). I brought this and the fid question up before.
 No idea if my e-mail got lost or what happened.
 
 Can you explain why we don't need a configuration for vlan 0 ?

Sorry for late reply. Initially, when issuing "ip link set up dev swp0",
ndo_vlan_rx_add_vid was called to add the interface in the VLAN 0.

2 things happen here:

  * this is inconsistent with the "bridge vlan" output which doesn't seem to
know about a VID 0;
  * VID 0 seems special for this switch: if an ingressing frame has VID 0, the
tagged port will override the VID bits with the port default VID at egress.

Thanks,
-v


Re: [RFC 6/9] net: dsa: mv88e6352: allow egress of unknown multicast

2015-06-02 Thread Vivien Didelot
Hi Guenter,

On Jun 2, 2015, at 10:20 AM, Guenter Roeck li...@roeck-us.net wrote:
 On 06/01/2015 06:27 PM, Vivien Didelot wrote:
 This patch disables egress of unknown unicast destination addresses.

 
 Hi Vivien,
 
 seems to me this patch is unrelated to the rest of the series.
 
 Not sure if we really want this. If an address is in the arp cache
 but has timed out from the bridge database, any unicast to that address
 will no longer be sent. If the bridge database has been flushed for some
 reason, such as a spanning tree reconfiguration, we'll have a hard time
 to send anything.
 
 What is the problem you are trying to solve with this patch ?

TBH, I don't remember which one of the test cases I described in 0/9
this patch was solving... Some ARP request didn't propagate correctly
without this, IIRC.

I'll try to revert the change and do my tests again in order to isolate
the problem.

Thanks,
-v


Re: [RFC 7/9] net: dsa: mv88e6352: lock CPU port from learning addresses

2015-06-02 Thread Guenter Roeck
On Tue, Jun 02, 2015 at 09:06:15PM -0400, Vivien Didelot wrote:
 Hi Guenter,
 
 On Jun 2, 2015, at 10:24 AM, Guenter Roeck li...@roeck-us.net wrote:
 On 06/01/2015 06:27 PM, Vivien Didelot wrote:
  This commit disables SA learning and refreshing for the CPU port.
 
  
  Hi Vivien,
  
  This patch also seems to be unrelated to the rest of the series.
  
  Can you add an explanation why it is needed ?
  
  With this in place, how does the CPU port SA find its way into the fdb ?
  Do we assume that it will be configured statically ?
  An explanation might be useful.
 
 Without this patch, I noticed the CPU port was stealing the SA of a PC
 behind a switch port. this happened when the port was a bridge member,
 as the bridge was relaying broadcast coming from one switch port to the
 other switch ports in the same vlan.
 
Makes me feel really uncomfortable. I think we may be going into the wrong
direction. The whole point of offloading bridging is to have the switch handle
forwarding, and that includes multicasts and broadcasts. Instead of doing that,
it looks like we put more and more workarounds in place.

Maybe the software bridge code needs to understand that it isn't supposed to
forward broadcasts to ports of an offloaded bridge, and we should let the
switch chip handle it ?

Thanks,
Guenter


Re: [RFC 5/9] net: dsa: mv88e6352: disable mirroring

2015-06-02 Thread Guenter Roeck
On Tue, Jun 02, 2015 at 09:12:30PM -0400, Vivien Didelot wrote:
 Hi Guenter, Andrew,
 
 On Jun 2, 2015, at 10:53 AM, Andrew Lunn and...@lunn.ch wrote:
 On Tue, Jun 02, 2015 at 07:16:10AM -0700, Guenter Roeck wrote:
  On 06/01/2015 06:27 PM, Vivien Didelot wrote:
  Disable the mirroring policy in the monitor control register, since this
  feature is not needed.
  
  Signed-off-by: Vivien Didelot vivien.dide...@savoirfairelinux.com
  
  Should this be a separate patch, unrelated to the patch set ?
 
 Indeed, this one is an unrelated patch, sorry.
 
  If I understand correctly, this effectively disables IGMP/MLD snooping.
  I think this warrants an explanation why it is not needed, not just
  a statement that it is not needed.
  
  +1
  
  Especially since we might want to revisit this to implement IGMP/MLD
  snooping in the bridge. The hardware should be capable of it.
 
 This is something I want to disable because I can have several times
 gigabit traffic on my ports. This would end up in a bottleneck on the
 CPU port. Am I right?
 
Not really. That should not be that much traffic. Besides, IGMP/MLD snooping
still needs to be enabled separately, as well as egress monitoring.

I don't think this has any impact on the traffic to the CPU port unless other
configuration bits are set as well.

Guenter


Re: [Intel-wired-lan] [PATCH V2 1/2] pci: Add dev_flags bit to access VPD through function 0

2015-06-02 Thread Alexander Duyck

On 06/02/2015 05:10 PM, Mark D Rustad wrote:

Add a dev_flags bit, PCI_DEV_FLAGS_VPD_REF_F0, to access VPD through
function 0 to provide VPD access on other functions. This solves
concurrent access problems on many devices without changing the
attributes exposed in sysfs. Never set this bit on function 0 or
there will be an infinite recursion.

Signed-off-by: Mark Rustad mark.d.rus...@intel.com
---
Changes in V2:
- Corrected spelling in log message
- Added checks to see that the referenced function 0 is reasonable
---
  drivers/pci/access.c |   48 +++-
  1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index d9b64a175990..74634d4868a2 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -439,6 +439,40 @@ static const struct pci_vpd_ops pci_vpd_pci22_ops = {
.release = pci_vpd_pci22_release,
  };
  
+static ssize_t pci_vpd_f0_read(struct pci_dev *dev, loff_t pos, size_t count,
+  void *arg)
+{
+   struct pci_dev *tdev = pci_get_slot(dev->bus, PCI_SLOT(dev->devfn));
+   ssize_t ret;
+
+   if (!tdev)
+   return -ENODEV;
+
+   ret = pci_read_vpd(tdev, pos, count, arg);
+   pci_dev_put(tdev);
+   return ret;
+}
+
+static ssize_t pci_vpd_f0_write(struct pci_dev *dev, loff_t pos, size_t count,
+   const void *arg)
+{
+   struct pci_dev *tdev = pci_get_slot(dev->bus, PCI_SLOT(dev->devfn));
+   ssize_t ret;
+
+   if (!tdev)
+   return -ENODEV;
+
+   ret = pci_write_vpd(tdev, pos, count, arg);
+   pci_dev_put(tdev);
+   return ret;
+}
+
+static const struct pci_vpd_ops pci_vpd_f0_ops = {
+   .read = pci_vpd_f0_read,
+   .write = pci_vpd_f0_write,
+   .release = pci_vpd_pci22_release,
+};
+
  int pci_vpd_pci22_init(struct pci_dev *dev)
  {
struct pci_vpd_pci22 *vpd;
@@ -447,12 +481,24 @@ int pci_vpd_pci22_init(struct pci_dev *dev)
cap = pci_find_capability(dev, PCI_CAP_ID_VPD);
if (!cap)
return -ENODEV;
+   if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) {
+   struct pci_dev *tdev;
+
+   tdev = pci_get_slot(dev->bus, PCI_SLOT(dev->devfn));
+   if (!tdev || !dev->multifunction || !tdev->multifunction ||
+   dev->class != tdev->class || dev->vendor != tdev->vendor ||
+   dev->device != tdev->device)
+   return -ENODEV;
+   }


You can probably combine the dev->multifunction check with the dev_flags 
check.  After all you don't need this workaround if the device is not 
multifunction.  It might even make more sense to move the multifunction 
check to the quirk in patch 2/2.


I also believe this leaks a reference to the device.  You should be 
calling pci_dev_put(tdev) if tdev is not NULL.  As such you probably 
need to split up the !tdev and the rest of the checks.
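
A rough userspace model of the split being suggested here, for illustration only. struct fake_dev, fake_get() and fake_put() are hypothetical stand-ins for struct pci_dev, pci_get_slot() and pci_dev_put(); this is not the kernel code, just the shape of "check for NULL first, then drop the reference on every remaining path".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev: only what the check reads. */
struct fake_dev {
    int refcount;
    bool multifunction;
    unsigned int class_code;
    unsigned int vendor;
    unsigned int device;
};

/* Models a lookup that returns its result with a reference held. */
struct fake_dev *fake_get(struct fake_dev *d)
{
    if (d)
        d->refcount++;
    return d;
}

/* Models dropping that reference. */
void fake_put(struct fake_dev *d)
{
    if (d)
        d->refcount--;
}

/* Returns 0 if f0 looks like a sane function 0 for dev, -1 otherwise.
 * f0 may be NULL to model a failed lookup. */
int check_function0(const struct fake_dev *dev, struct fake_dev *f0)
{
    struct fake_dev *tdev;
    int ret = 0;

    assert(dev != NULL);

    tdev = fake_get(f0);
    if (!tdev)
        return -1;      /* nothing acquired, nothing to release */

    if (!dev->multifunction || !tdev->multifunction ||
        dev->class_code != tdev->class_code ||
        dev->vendor != tdev->vendor ||
        dev->device != tdev->device)
        ret = -1;

    fake_put(tdev);     /* released on success and on mismatch alike */
    return ret;
}
```

The point is only that the reference count ends where it started regardless of which branch is taken.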


- Alex


[PATCH net-next] net: change fib behavior based on interface link status

2015-06-02 Thread Andy Gospodarek
This patch adds the ability to have the Linux kernel track whether or
not a particular route should be used based on the link-status of the
interface associated with the next-hop.

Before this patch any link-failure on an interface that was serving as a
gateway for some systems could result in those systems being isolated
from the rest of the network as the stack would continue to attempt to
send frames out of an interface that is actually linked-down.  When the
kernel is responsible for all forwarding, it should also be responsible
for taking action when the traffic can no longer be forwarded -- there
is no real need to outsource link-monitoring to userspace anymore.

This feature is only enabled with the new sysctl set (default is off):
net.core.kill_routes_on_linkdown = 1

When this is set, the following behavior can be observed (interface p8p1
is link-down):

# ip route show 
default via 10.0.5.2 dev p9p1 
10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead 
90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 dead 
90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
# ip route get 90.0.0.1 
90.0.0.1 via 70.0.0.2 dev p7p1  src 70.0.0.1 
cache 
# ip route get 80.0.0.1 
local 80.0.0.1 dev lo  src 80.0.0.1 
cache local 
# ip route get 80.0.0.2
80.0.0.2 via 10.0.5.2 dev p9p1  src 10.0.5.15 
cache 

While the route does remain in the table (so it can be modified if
needed rather than being wiped away as it would be if IFF_UP was
cleared), the proper next-hop is chosen automatically when the link is
down.  Now interface p8p1 is linked-up:

# ip route show 
default via 10.0.5.2 dev p9p1 
10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 
90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 
90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2 
# ip route get 90.0.0.1 
90.0.0.1 via 80.0.0.2 dev p8p1  src 80.0.0.1 
cache 
# ip route get 80.0.0.1 
local 80.0.0.1 dev lo  src 80.0.0.1 
cache local 
# ip route get 80.0.0.2
80.0.0.2 dev p8p1  src 80.0.0.1 
cache 

and the output changes to what one would expect.
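
The selection behavior shown in the two listings can be modeled in a few lines of userspace C. This is a simplified sketch, not the fib_trie implementation: struct route_entry and select_route() are hypothetical names, and the model only captures "skip entries whose link is down when the sysctl is set, then take the lowest metric".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified view of one candidate route. */
struct route_entry {
    const char *gateway;
    const char *dev;
    unsigned int metric;
    bool link_down;     /* true when the output device has no carrier */
};

/* Returns the lowest-metric usable route, or NULL if there is none.
 * When kill_routes_on_linkdown is false, link state is ignored, which
 * models the behavior before this patch. */
const struct route_entry *select_route(const struct route_entry *routes,
                                       size_t n, bool kill_routes_on_linkdown)
{
    const struct route_entry *best = NULL;
    size_t i;

    assert(routes != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (kill_routes_on_linkdown && routes[i].link_down)
            continue;   /* "dead" route: left in the table, not used */
        if (!best || routes[i].metric < best->metric)
            best = &routes[i];
    }
    return best;
}
```

Fed the two 90.0.0.0/24 entries from the listings, this picks 70.0.0.2 via p7p1 while p8p1 is down and 80.0.0.2 via p8p1 once the link is back.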

Signed-off-by: Andy Gospodarek go...@cumulusnetworks.com
Suggested-by: Dinesh Dutt dd...@cumulusnetworks.com

---
Some preferred not to have a configuration option and to make this
behavior the default when it was discussed in Ottawa earlier this year,
since it was time to do this.  Still, I wanted to propose the config
option to preserve the current behavior for those that desire it.  I'll
happily remove it if Dave and Linus approve.

An IPv6 implementation is also needed (DECnet too!), but I wanted to
start with the IPv4 implementation to get people comfortable with the
idea before moving forward.  If this is accepted the IPv6 implementation
can be posted shortly.  

FWIW, we have been running this patch with the sysctl setting above and
our customers have been happily using a backported version for IPv4 and
IPv6 for 6 months.

 include/linux/netdevice.h  |  1 +
 include/net/fib_rules.h|  1 +
 include/net/ip_fib.h   |  1 +
 include/uapi/linux/rtnetlink.h |  1 +
 include/uapi/linux/sysctl.h|  1 +
 kernel/sysctl_binary.c |  1 +
 net/core/dev.c |  2 ++
 net/core/sysctl_net_core.c |  7 +++
 net/ipv4/fib_frontend.c| 12 +--
 net/ipv4/fib_rules.c   |  7 ++-
 net/ipv4/fib_semantics.c   | 46 --
 net/ipv4/fib_trie.c| 19 +
 12 files changed, 86 insertions(+), 13 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 6f5f71f..5bd953c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2986,6 +2986,7 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
 bool is_skb_forwardable(struct net_device *dev, struct sk_buff *skb);
 
 extern int netdev_budget;
+extern int kill_routes_on_linkdown;
 
 /* Called by rtnetlink.c:rtnl_unlock() */
 void netdev_run_todo(void);
diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h
index 6d67383..4fbfda5 100644
--- a/include/net/fib_rules.h
+++ b/include/net/fib_rules.h
@@ -37,6 +37,7 @@ struct fib_lookup_arg {
struct fib_rule *rule;
int flags;
 #define FIB_LOOKUP_NOREF   1
+#define FIB_LOOKUP_ALLOWDEAD   2
 };
 
 struct fib_rules_ops {
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 54271ed..efb195b 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -250,6 +250,7 @@ struct fib_table *fib_new_table(struct net *net, u32 id);
 struct fib_table *fib_get_table(struct net *net, u32 id);
 
 int __fib_lookup(struct net 
