[PATCH RFC 1/2] cdc_ncm: add the currently processed NDP frame to global driver data
This is useful for splitting up the cdc_ncm_ndp function later on. The resulting code will be stateful anyway.

Signed-Off-By: Enrico Mioso mrkiko...@gmail.com
---
 include/linux/usb/cdc_ncm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/usb/cdc_ncm.h b/include/linux/usb/cdc_ncm.h
index 7c9b484..9172256 100644
--- a/include/linux/usb/cdc_ncm.h
+++ b/include/linux/usb/cdc_ncm.h
@@ -100,6 +100,7 @@ struct cdc_ncm_ctx {
 	struct sk_buff *tx_curr_skb;
 	struct sk_buff *tx_rem_skb;
 	__le32 tx_rem_sign;
+	struct usb_cdc_ncm_ndp16 *tx_curr_ndp16;
 
 	spinlock_t mtx;
 	atomic_t stop;
-- 
2.4.2

--
To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH RFC 0/2] cdc_ncm refactoring
I changed my mind and decided to try a new approach. This series splits the cdc_ncm_ndp function into two parts:
- one that finds NDP blocks already present in the SKB being sent out
- one that pushes new ones, starting from where the _find function left off.

After this split it seems easier to change the location where the NDP is placed. What do you think about this?

From now on, I need a little bit of help: I think we might work on the cdc_ncm_ndp16_push function, but I am open to any suggestion. Let me know if you like this.

Enrico

Enrico Mioso (2):
  cdc_ncm: add the currently processed NDP frame to global driver data
  cdc_ncm: split the cdc_ncm_ndp function

 drivers/net/usb/cdc_ncm.c   | 30 +-
 include/linux/usb/cdc_ncm.h |  1 +
 2 files changed, 22 insertions(+), 9 deletions(-)

-- 
2.4.2
Re: [PATCH net-next 5/5] rocker: remove support for legacy VLAN ndo ops
On Mon, Jun 1, 2015 at 10:24 PM, David Miller da...@davemloft.net wrote:
> From: Toshiaki Makita makita.toshi...@lab.ntt.co.jp
> Date: Tue, 02 Jun 2015 13:51:06 +0900
>
>> On 2015/06/02 3:39, sfel...@gmail.com wrote:
>>> From: Scott Feldman sfel...@gmail.com
>>>
>>> Remove support for legacy ndo ops
>>> .ndo_vlan_rx_add_vid/.ndo_vlan_rx_kill_vid. Rocker will use
>>> bridge_setlink/dellink exclusively for VLAN add/del operations.
>>>
>>> The legacy ops are needed if using 8021q driver module to setup VLANs
>>> on the port. But an alternative exists in using bridge_setlink/delink
>>> to setup VLANs, which doesn't depend on 8021q module. So rocker will
>>> switch to the newer setlink/dellink ops. VLANs can be added/deleted
>>> from the port, regardless if port is bridged or not, using the bridge
>>> commands:
>>>
>>>   bridge vlan [add|del] vid VID dev DEV self
>>
>> Hi Scott,
>>
>> This doesn't look transparent with bridge. Before this patch, I was
>> able to add vid in the same way as software bridge:
>>
>>   ip link set DEV master br0
>>   bridge vlan add vid VID dev DEV
>>
>> Now I need to add self, which is different from software bridge...
>
> I'm already not liking the looks of this

Actually, we're now consistent with the bridge man page, which says master is the default. What we want, I believe, is to adjust what the man page says (and the bridge vlan command itself), by making the default master and self. The kernel and driver are fine; it's the default in the bridge command that needs adjusting. Once we do this, we'll be back to transparent with software-only bridge.

How did we get here? So the RTM_SETLINK for PF_BRIDGE calls rtnl_bridge_setlink(). rtnl_bridge_setlink() calls ndo_bridge_setlink for the master (the bridge side of the port) and self (the device side of the port), depending on if MASTER and/or SELF flags are set. Since the default from the iproute2 bridge vlan cmd is to only set MASTER, only the bridge's ndo_bridge_setlink is called.

But if you dig down into the bridge's ndo_bridge_setlink, you'll see it will call into the port driver's ndo_vlan_rx_add_vid() to add the vlan to the device side of the port. So we have a MASTER cmd that is doing some SELF work. My guess is this was done to avoid having to update all the NIC drivers from ndo_vlan_rx_add_vid to ndo_bridge_setlink.

When you remove ndo_vlan_rx_add_vid() from the port driver, the cmd needs to target MASTER and SELF for both sides of the port to be called. But the current cmd only sets MASTER. This is why you (currently) need to add SELF for cmd to target the device side of the port.

On top of all of this, you can use RTM_SETLINK for PF_BRIDGE on non-bridged ports, in which case only SELF is used to program the VLAN on the device, using the device's ndo_bridge_setlink. This is the confusing part where you can set VLANs on non-bridged ports using the bridge cmd.

To summarize, pseudo code for rtnl_bridge_setlink() is:

  rtnl_bridge_setlink()
      if MASTER
          call bridge's ndo_bridge_setlink()
              if bridge port implements ndo_vlan_rx_add_vid()
                  call ndo_vlan_rx_add_vid() on port device to set vlan
      if SELF
          call port device's ndo_bridge_setlink()

If DEV is bridged, today we have:

  bridge vlan add vid VID dev DEV               sets MASTER (default)
  bridge vlan add vid VID dev DEV master        sets MASTER
  bridge vlan add vid VID dev DEV self          sets SELF
  bridge vlan add vid VID dev DEV master self   sets MASTER and SELF

If DEV is not bridged, today we have:

  bridge vlan add vid VID dev DEV               // fails (no master device)
  bridge vlan add vid VID dev DEV self          sets SELF

What I propose is we change the bridge vlan cmd for the DEV bridged case as such:

  bridge vlan add vid VID dev DEV               sets MASTER and SELF (default)
  bridge vlan add vid VID dev DEV master        sets MASTER
  bridge vlan add vid VID dev DEV self          sets SELF
  bridge vlan add vid VID dev DEV master self   sets MASTER and SELF

For existing users of ndo_vlan_rx_add_vid/ndo_vlan_rx_kill_vid, nothing really changes. If they also have an ndo_bridge_setlink, it'll get called but they're not doing any vlan stuff there today anyway, so it's ignored.

For rocker, we're switching to doing all vlan stuff in ndo_bridge_setlink. Switching to ndo_bridge_setlink for switchdev gives us support for stacked drivers with the transaction model, something we don't get with ndo_vlan_rx_add_vid.

If this makes sense, I'll post the follow up bridge vlan cmd change to default to master and self.

-scott
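To make the MASTER/SELF dispatch described in this message easier to follow, here is a small userspace model in C. It is only a sketch of the pseudo code in the mail, not kernel code: the flag values, struct names and the `-1` error are hypothetical stand-ins chosen for illustration, and the "proposed" default reflects the suggestion in the mail rather than any merged behaviour.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the MASTER and SELF flags of the bridge
 * command; the numeric values are illustrative only. */
enum { DEMO_FLAG_MASTER = 1 << 0, DEMO_FLAG_SELF = 1 << 1 };

/* Which callbacks a setlink request ends up reaching. */
struct demo_trace {
	bool bridge_setlink;  /* bridge's ndo_bridge_setlink() */
	bool legacy_add_vid;  /* port's ndo_vlan_rx_add_vid(), via the bridge */
	bool port_setlink;    /* port device's ndo_bridge_setlink() */
	int err;              /* non-zero if the request fails */
};

/* Minimal model of a port. */
struct demo_port {
	bool bridged;             /* enslaved to a bridge? */
	bool has_legacy_add_vid;  /* driver still implements the legacy op? */
};

/* Model of the dispatch summarized in the pseudo code of the mail. */
static struct demo_trace demo_setlink(const struct demo_port *port, int flags)
{
	struct demo_trace t = { false, false, false, 0 };

	if (flags & DEMO_FLAG_MASTER) {
		if (!port->bridged) {
			t.err = -1;  /* no master device; error value is made up */
			return t;
		}
		t.bridge_setlink = true;
		if (port->has_legacy_add_vid)
			t.legacy_add_vid = true;
	}
	if (flags & DEMO_FLAG_SELF)
		t.port_setlink = true;

	return t;
}

/* Default when neither keyword is given: today versus the proposal. */
static int demo_default_today(void)    { return DEMO_FLAG_MASTER; }
static int demo_default_proposed(void) { return DEMO_FLAG_MASTER | DEMO_FLAG_SELF; }
```

With a bridged port whose driver has dropped the legacy op (the rocker case), `demo_setlink(&port, demo_default_today())` never reaches the device side, while `demo_default_proposed()` reaches both sides, which is the transparency argument made above.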
Re: [PATCH 7/7] mac80211: Switch to new AEAD interface
On Mon, Jun 01, 2015 at 05:36:58PM +0200, Stephan Mueller wrote:
> On Monday, 1 June 2015, 16:35:26, Johannes Berg wrote:
> > IOW, I think something like this would make sense:
>
> That looks definitely cleaner :-)

Indeed.. That AAD length-in-the-buffer design came from the over ten year old code that was optimized to cover the CCM construction with the same buffer and that was not cleaned up when this was converted to use cryptoapi a couple of years ago.

> Though, my main concern was just to ensure that the aad length value is
> not zero.

It won't be in IEEE 802.11 use cases. The exact length depends on the IEEE 802.11 frame type, but AAD is constructed in a way that it is normally a bit over 20 octets while allowing CCM to fit the related operations into two AES blocks.

-- 
Jouni Malinen                                            PGP id EFC895FA
Re: [RFC] [PATCH] net: socket: Fix the wrong returns for recvmsg and sendmsg
On 2015/6/2 14:52, Willy Tarreau wrote:
> On Tue, Jun 02, 2015 at 02:43:54PM +0800, Junling Zheng wrote:
>> On 2015/6/2 14:27, Greg KH wrote:
>>> On Mon, Jun 01, 2015 at 10:23:57PM -0700, David Miller wrote:
>>>> From: Junling Zheng zhengjunl...@huawei.com
>>>> Date: Tue, 2 Jun 2015 12:05:32 +0800
>>>>
>>>>> So, the problem commit is 281c9c36 (net: compat: Update
>>>>> get_compat_msghdr() to match copy_msghdr_from_user() behaviour),
>>>>> which fixes db31c55a6fb2 and brings the get_compat_msghdr() in line
>>>>> with copy_msghdr_from_user().
>>>>
>>>> Upstream this got fixed by:
>>>>
>>>> 08adb7dabd4874cc5666b4490653b26534702ce0
>>>>
>>>> So the part that makes us not unconditionally return -EFAULT needs
>>>> to be backported, and that's probably equivalent to the patch you
>>>> proposed which therefore should be applied.
>>>
>>> Ok, thanks, now applied.
>>
>> Maybe other stable version also needs this fix:)
>
> Yes, from what I'm seeing, at least 3.2 and 2.6.32 need it as well.

Yeah, all other stable versions *except 3.19 and 4.0* may need this fix:)

> Thanks,
> Willy
Re: [RFC 3/9] net: dsa: mv88e6xxx: add support for VTU ops
Vivien,

On 06/01/2015 06:27 PM, Vivien Didelot wrote:
> This commit implements the port_vlan_add and port_vlan_del functions in
> the dsa_switch_driver structure for Marvell 88E6xxx compatible switches.
>
> This allows to access a switch VLAN Table Unit, and thus define VLANs
> from standard userspace commands such as bridge vlan.
>
> Signed-off-by: Vivien Didelot vivien.dide...@savoirfairelinux.com
> ---

[ ... ]

> +
> +int mv88e6xxx_port_vlan_add(struct dsa_switch *ds, int port, u16 vid,
> +			    u16 bridge_flags)
> +{
> +	struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
> +	struct mv88e6xxx_vtu_entry entry = { 0 };
> +	int prev_vid = vid ? vid - 1 : 4095;
> +	int i, ret;
> +
> +	/* Bringing an interface up adds it to the VLAN 0. Ignore this. */
> +	if (!vid)
> +		return 0;
> +

Me puzzled ;-). I brought this and the fid question up before. No idea if my e-mail got lost or what happened.

Can you explain why we don't need a configuration for vlan 0 ?

> +	/* The DSA port-based VLAN setup reserves FID 0 to DSA_MAX_PORTS;
> +	 * we will use the next FIDs for 802.1q;
> +	 * thus, forbid the last DSA_MAX_PORTS VLANs.
> +	 */
> +	if (vid > 4095 - DSA_MAX_PORTS)
> +		return -EINVAL;
> +
> +	mutex_lock(&ps->smi_mutex);
> +	ret = _mv88e6xxx_vtu_getnext(ds, prev_vid, &entry);
> +	if (ret < 0)
> +		goto unlock;
> +
> +	/* If the VLAN does not exist, re-initialize the entry for addition */
> +	if (entry.vid != vid || !entry.valid) {
> +		memset(&entry, 0, sizeof(entry));
> +		entry.valid = true;
> +		entry.vid = vid;
> +		entry.fid = DSA_MAX_PORTS + vid;

I brought this up before. No idea if my e-mail got lost or what happened.

We use a fid per port, and a fid per bridge group. With VLANs, this is completely ignored, and there is only a single fid per vlan for the entire switch. Either per-port fids are unnecessary as well, or something is wrong here, or I am missing something.

Can you explain why we only need a single fid per vlan, even if we have multiple bridge groups and the same vlan is configured in all of them ?

Thanks,
Guenter
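The FID question can be illustrated with a tiny userspace sketch of the mapping in the quoted hunk. This is an illustration under stated assumptions, not the driver code: `DEMO_DSA_MAX_PORTS` is set to 12 here as an assumed value, the function name is hypothetical, and VLAN 0 (which the patch silently ignores) is reported as an error instead.

```c
#include <errno.h>

#define DEMO_DSA_MAX_PORTS 12   /* assumed value, for illustration only */
#define DEMO_VID_MAX       4095

/* Map a VLAN ID to a FID the way the quoted hunk does:
 * FIDs 0..DSA_MAX_PORTS are reserved for the port-based setup,
 * so 802.1q VLANs use DSA_MAX_PORTS + vid.
 * Returns the FID, or -EINVAL if the VLAN cannot be mapped. */
static int demo_vid_to_fid(unsigned int vid)
{
	/* The patch returns early for vid 0; this sketch has no FID for it. */
	if (vid == 0)
		return -EINVAL;

	/* The last DSA_MAX_PORTS VLANs would overflow the FID space. */
	if (vid > DEMO_VID_MAX - DEMO_DSA_MAX_PORTS)
		return -EINVAL;

	return DEMO_DSA_MAX_PORTS + (int)vid;
}
```

Note that the result depends only on `vid`: neither the port nor the bridge group enters the computation, so the same VLAN configured in two bridge groups lands on the same FID. That is the behaviour being questioned above.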
Re: [RFC 2/9] net: dsa: add basic support for VLAN operations
On 06/01/2015 06:27 PM, Vivien Didelot wrote:
> This patch adds the glue between DSA and switchdev to add and delete
> SWITCHDEV_OBJ_PORT_VLAN objects.
>
> This will allow the DSA switch drivers implementing the port_vlan_add and
> port_vlan_del functions to access the switch VLAN database through
> userspace commands such as bridge vlan.
>
> Signed-off-by: Vivien Didelot vivien.dide...@savoirfairelinux.com
> ---
>  include/net/dsa.h |  7 +++
>  net/dsa/slave.c   | 61 +--
>  2 files changed, 66 insertions(+), 2 deletions(-)
>
> diff --git a/include/net/dsa.h b/include/net/dsa.h
> index fbca63b..726357b 100644
> --- a/include/net/dsa.h
> +++ b/include/net/dsa.h
> @@ -302,6 +302,13 @@ struct dsa_switch_driver {
>  			   const unsigned char *addr, u16 vid);
>  	int	(*fdb_getnext)(struct dsa_switch *ds, int port,
>  			       unsigned char *addr, bool *is_static);
> +
> +	/*
> +	 * VLAN support
> +	 */
> +	int	(*port_vlan_add)(struct dsa_switch *ds, int port, u16 vid,
> +				 u16 bridge_flags);
> +	int	(*port_vlan_del)(struct dsa_switch *ds, int port, u16 vid);
>  };
>
>  void register_switch_driver(struct dsa_switch_driver *type);
> diff --git a/net/dsa/slave.c b/net/dsa/slave.c
> index cbda00a..52ba5a1 100644
> --- a/net/dsa/slave.c
> +++ b/net/dsa/slave.c
> @@ -363,6 +363,25 @@ static int dsa_slave_port_attr_set(struct net_device *dev,
>  	return ret;
>  }
>
> +static int dsa_slave_port_vlans_add(struct net_device *dev,
> +				    struct switchdev_obj_vlan *vlan)
> +{
> +	struct dsa_slave_priv *p = netdev_priv(dev);
> +	struct dsa_switch *ds = p->parent;
> +	int vid, err = 0;
> +
> +	if (!ds->drv->port_vlan_add)
> +		return -ENOTSUPP;
> +
> +	for (vid = vlan->vid_start; vid <= vlan->vid_end; ++vid) {
> +		err = ds->drv->port_vlan_add(ds, p->port, vid, vlan->flags);
> +		if (err)
> +			break;
> +	}
> +
> +	return err;
> +}
> +
>  static int dsa_slave_port_obj_add(struct net_device *dev,
>  				  struct switchdev_obj *obj)
>  {
> @@ -378,6 +397,9 @@ static int dsa_slave_port_obj_add(struct net_device *dev,
>  		return 0;
>
>  	switch (obj->id) {
> +	case SWITCHDEV_OBJ_PORT_VLAN:
> +		err = dsa_slave_port_vlans_add(dev, &obj->u.vlan);
> +		break;
>  	default:
>  		err = -ENOTSUPP;
>  		break;
> @@ -386,12 +408,34 @@ static int dsa_slave_port_obj_add(struct net_device *dev,
>  	return err;
>  }
>
> +static int dsa_slave_port_vlans_del(struct net_device *dev,
> +				    struct switchdev_obj_vlan *vlan)
> +{
> +	struct dsa_slave_priv *p = netdev_priv(dev);
> +	struct dsa_switch *ds = p->parent;
> +	int vid, err = 0;
> +
> +	if (!ds->drv->port_vlan_del)
> +		return -ENOTSUPP;
> +
> +	for (vid = vlan->vid_start; vid <= vlan->vid_end; ++vid) {
> +		err = ds->drv->port_vlan_del(ds, p->port, vid);
> +		if (err)
> +			break;
> +	}
> +
> +	return err;
> +}
> +
>  static int dsa_slave_port_obj_del(struct net_device *dev,
>  				  struct switchdev_obj *obj)
>  {
>  	int err;
>
>  	switch (obj->id) {
> +	case SWITCHDEV_OBJ_PORT_VLAN:
> +		err = dsa_slave_port_vlans_del(dev, &obj->u.vlan);
> +		break;
>  	default:
>  		err = -EOPNOTSUPP;
>  		break;
> @@ -473,6 +517,15 @@ static netdev_tx_t dsa_slave_notag_xmit(struct sk_buff *skb,
>  	return NETDEV_TX_OK;
>  }
>
> +static int dsa_slave_vlan_noop(struct net_device *dev, __be16 proto, u16 vid)
> +{
> +	/* NETIF_F_HW_VLAN_CTAG_FILTER requires ndo_vlan_rx_add_vid and
> +	 * ndo_vlan_rx_kill_vid, otherwise the VLAN acceleration is considered
> +	 * buggy (see net/core/dev.c).
> +	 */

As Scott mentioned, just don't set NETIF_F_HW_VLAN_CTAG_FILTER.

I don't entirely understand why we would not want to filter VLANs in the switch. Can you explain ?

Thanks,
Guenter
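The range handling in the quoted dsa_slave_port_vlans_add/del helpers can be tried out in userspace with a sketch like the one below. The names (`demo_for_each_vid`, `demo_op`, `struct demo_fail_ctx`) are hypothetical, and `-EOPNOTSUPP` is used because the kernel-internal `ENOTSUPP` is not available to userspace programs.

```c
#include <errno.h>
#include <stddef.h>

/* Per-VID operation, standing in for port_vlan_add/port_vlan_del. */
typedef int (*demo_vlan_op)(void *ctx, int port, unsigned short vid);

/* Apply op to every VID in [vid_start, vid_end], stopping at the first
 * error, as the loops in the quoted patch do. */
static int demo_for_each_vid(demo_vlan_op op, void *ctx, int port,
			     unsigned short vid_start, unsigned short vid_end)
{
	int vid, err = 0;

	if (!op)
		return -EOPNOTSUPP;

	for (vid = vid_start; vid <= vid_end; ++vid) {
		err = op(ctx, port, (unsigned short)vid);
		if (err)
			break;
	}

	return err;
}

/* Hypothetical demo callback: counts calls and fails on one chosen VID. */
struct demo_fail_ctx {
	unsigned short fail_vid;
	int calls;
};

static int demo_op(void *ctx, int port, unsigned short vid)
{
	struct demo_fail_ctx *c = ctx;

	(void)port;
	c->calls++;
	return vid == c->fail_vid ? -EIO : 0;
}
```

One property worth noticing in this shape of loop: when the operation fails partway through a range, the VIDs already processed stay applied, since nothing in the loop undoes them.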
Re: [PATCH net-next v2 01/14] sfc: Add code to export port_num in netdev->dev_port
On 01/06/15 20:01, David Miller wrote:
> From: Shradha Shah ss...@solarflare.com
> Date: Mon, 1 Jun 2015 14:00:12 +0100
>
>> In the case where we have multiple functions (PFs and VFs), this sysfs
>> entry is useful to identify the physical port corresponding to the
>> function we are interested in.
>>
>> Signed-off-by: Shradha Shah ss...@solarflare.com
>
> This is a low effort change. You retained all of the error handling
> changes that were only necessary when you added the new sysfs file, but
> are completely unnecessary if you're just reporting it via
> netdev->dev_port.

With the addition of the sysfs change in my previous version, the error handling code required the addition of a fail4 tag to deal with the sysfs file on the error path.

Without the sysfs file in my current version v2, there is no extra fail4 tag; I have reverted back to using fail3.

The changes that are seen in the patch are stylistic changes following the rule that every branch of an if statement should have braces if one of the branches uses braces. The previous version of the patch touched this bit of code so the style change was relevant.

I think my mistake here is that I left it in v2 as a style change, but I can assure you that I looked at the error path before submitting the patch and also that this patch does not affect the error path.

Maybe I should have separated the style change to go as a different patch. I will do so now and submit a v3. Thanks.

> This is extremely disappointing, because you expect me to put a good
> effort into reviewing your changes yet you aren't putting that level of
> effort into the submission itself.

-- 
Many Thanks,
Regards,
Shradha Shah
linux-next: manual merge of the scsi tree with the net-next tree
Hi James,

Today's linux-next merge of the scsi tree got a conflict in drivers/target/target_core_user.c between commit 5538d294dd66 ("treewide: Add missing vmalloc.h inclusion") from the net-next tree and commit 7ad09a15e76b ("target: Minimize SCSI header #include directives") from the scsi tree.

I fixed it up (see below) and can carry the fix as necessary (no action is required).

-- 
Cheers,
Stephen Rothwell                    s...@canb.auug.org.au

diff --cc drivers/target/target_core_user.c
index edc98250,21b438ec4700..
--- a/drivers/target/target_core_user.c
+++ b/drivers/target/target_core_user.c
@@@ -19,13 -19,13 +19,14 @@@
  #include <linux/spinlock.h>
  #include <linux/module.h>
  #include <linux/idr.h>
+ #include <linux/kernel.h>
  #include <linux/timer.h>
  #include <linux/parser.h>
 +#include <linux/vmalloc.h>
- #include <scsi/scsi.h>
- #include <scsi/scsi_host.h>
  #include <linux/uio_driver.h>
  #include <net/genetlink.h>
+ #include <scsi/scsi_common.h>
+ #include <scsi/scsi_proto.h>
  #include <target/target_core_base.h>
  #include <target/target_core_fabric.h>
  #include <target/target_core_backend.h>
Re: [RFC 3/9] net: dsa: mv88e6xxx: add support for VTU ops
On Mon, Jun 1, 2015 at 11:50 PM, Guenter Roeck li...@roeck-us.net wrote:

[cut]

> I brought this up before. No idea if my e-mail got lost or what happened.
>
> We use a fid per port, and a fid per bridge group. With VLANs, this is
> completely ignored, and there is only a single fid per vlan for the
> entire switch. Either per-port fids are unnecessary as well, or
> something is wrong here, or I am missing something.
>
> Can you explain why we only need a single fid per vlan, even if we have
> multiple bridge groups and the same vlan is configured in all of them ?

That brings up an interesting point about having multiple bridges with the same vlan configured. I struggled with that problem with rocker also and I don't have an answer other than "don't do that". Or, better put, if you have multiple bridges on the same vlan, just use one bridge for that vlan. Otherwise, I don't know how at the device level to partition the vlan between the bridges. Maybe that's what Vivien is facing also? I can see how this works for software-only bridges, because they should be isolated from each other and independent. But when offloading to a device which sees VLAN XXX global across the entire switch, I don't see how we can preserve the bridge boundaries.

I hope I'm not misunderstanding the issue here; if I am, I apologize.

-scott
[BUG] be2net breaks when dma_alloc_coherent memory is not zeroed out
Hi,

yesterday I bisected an issue with one of my be2net adapters and AMD IOMMU enabled. In 4.1-rc it suddenly broke and didn't initialize anymore.

It turned out that the be2net driver breaks when the memory returned from dma_alloc_coherent is not zeroed out. I introduced that change to the AMD IOMMU driver for v4.1; other DMA-API implementations for x86 still zero out the memory.

The bug shows up like this in dmesg:

  be2net :02:00.0: FW config: function_mode=0x10003, function_caps=0x7
  be2net :02:00.0: FW not responding
  be2net :02:00.0: Unrecoverable Error detected in the adapter
  be2net :02:00.0: Please reboot server to recover
  be2net :02:00.0: UE: MPU bit set

or sometimes as:

  be2net :02:00.1: Waiting for POST, 52s elapsed
  be2net :02:00.1: Waiting for POST, 54s elapsed
  be2net :02:00.1: Waiting for POST, 56s elapsed
  be2net :02:00.1: Waiting for POST, 58s elapsed

But the result is always:

  be2net :02:00.1: Emulex OneConnect(be3) initialization failed
  be2net: probe of :02:00.1 failed with error -110

When the memory returned by dma_alloc_coherent is zeroed out, everything works fine. But strictly speaking dma_alloc_coherent is not required to zero out the memory; drivers need to call dma_zalloc_coherent when they need this. So the behavior of the AMD IOMMU driver is correct.

Can you guys please have a look and remove the assumption that dma_alloc_coherent returns initialized memory in the be2net driver? In the future I'd like to optimize out this needless zeroing out of memory from all IOMMU drivers.

Please let me know if you need further information or if I can help with testing or anything.

Thanks,

	Joerg
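As a userspace analogy for the report above, the sketch below contrasts an allocator that makes no promise about buffer contents with one that zeroes explicitly. Nothing here is be2net or DMA-API code: `demo_alloc_uninitialized()` merely plays the role of dma_alloc_coherent() by deliberately poisoning the buffer, `demo_alloc_zeroed()` plays the role of dma_zalloc_coherent(), and `struct demo_cmd` is a made-up command layout.

```c
#include <stdlib.h>
#include <string.h>

/* Made-up command buffer whose consumer expects unused fields to be zero. */
struct demo_cmd {
	unsigned int opcode;
	unsigned int flags;
	unsigned char payload[56];
};

/* Stand-in for an allocator that does not guarantee zeroed memory.
 * The buffer is poisoned on purpose so that a wrong assumption shows up. */
static void *demo_alloc_uninitialized(size_t size)
{
	void *p = malloc(size);

	if (p)
		memset(p, 0xa5, size);
	return p;
}

/* Stand-in for the zeroing variant: callers that need zeroes ask for them. */
static void *demo_alloc_zeroed(size_t size)
{
	void *p = demo_alloc_uninitialized(size);

	if (p)
		memset(p, 0, size);
	return p;
}

/* Build a command without relying on whatever the allocator left behind. */
static struct demo_cmd *demo_cmd_prepare(unsigned int opcode)
{
	struct demo_cmd *cmd = demo_alloc_zeroed(sizeof(*cmd));

	if (!cmd)
		return NULL;
	cmd->opcode = opcode;  /* flags and payload are known to be zero */
	return cmd;
}
```

A caller that used `demo_alloc_uninitialized()` and only set `opcode` would hand the consumer a buffer full of 0xa5 bytes in the remaining fields, which is the kind of hidden assumption the report asks to have removed.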
[PATCH net-next] net/mlx4_core: Fix build failure introduced by the EQ pool changes
When CONFIG_RFS_ACCEL or SMP aren't set, we fail to build; fix it. Also, avoid a build warning about an unused function in that setup.

Fixes: c66fa19c405a ('net/mlx4: Add EQ pool')
Reported-by: Michael Ellerman m...@ellerman.id.au
Signed-off-by: Matan Barak mat...@mellanox.com
Signed-off-by: Or Gerlitz ogerl...@mellanox.com
---
 drivers/net/ethernet/mellanox/mlx4/eq.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/eq.c b/drivers/net/ethernet/mellanox/mlx4/eq.c
index 1116882..aae13ad 100644
--- a/drivers/net/ethernet/mellanox/mlx4/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/eq.c
@@ -221,6 +221,7 @@ static void mlx4_slave_event(struct mlx4_dev *dev, int slave,
 	slave_event(dev, slave, eqe);
 }
 
+#if defined(CONFIG_SMP)
 static void mlx4_set_eq_affinity_hint(struct mlx4_priv *priv, int vec)
 {
 	int hint_err;
@@ -234,6 +235,7 @@ static void mlx4_set_eq_affinity_hint(struct mlx4_priv *priv, int vec)
 	if (hint_err)
 		mlx4_warn(dev, "irq_set_affinity_hint failed, err %d\n", hint_err);
 }
+#endif
 
 int mlx4_gen_pkey_eqe(struct mlx4_dev *dev, int slave, u8 port)
 {
@@ -1207,8 +1209,8 @@ int mlx4_init_eq_table(struct mlx4_dev *dev)
 					     MLX4_NUM_ASYNC_EQE + MLX4_NUM_SPARE_EQE,
 					     0, &priv->eq_table.eq[MLX4_EQ_ASYNC]);
 		} else {
-#ifdef CONFIG_RFS_ACCEL
 			struct mlx4_eq	*eq = &priv->eq_table.eq[i];
+#ifdef CONFIG_RFS_ACCEL
 			int port = find_first_bit(eq->actv_ports.ports,
 						  dev->caps.num_ports) + 1;
-- 
1.7.1
Re: [PATCH 1/2] Bluetooth: Add reset_resume function
On Mon, 2015-06-01 at 18:14 -0700, Laura Abbott wrote:
> Bluetooth devices off of some buses such as USB may lose power across
> suspend/resume. When this happens, drivers may need to have the setup
> function called again and behave differently than a cold power on.

Yes, but what is the point? We use reset_resume() to retain some features of a device across a loss of power. If power is lost, all settings are gone and all connections are broken. So what is the difference compared to a plug out/in cycle?

	Regards
		Oliver
[PATCH RFC 2/2] cdc_ncm: split the cdc_ncm_ndp function
Split this function into two new ones:
- cdc_ncm_ndp16_find: finds an NDP block in the chain matching a supplied signature; a pointer to it is returned in case of success;
- cdc_ncm_ndp16_push: creates a new NDP block and adds it to the skb.

cdc_ncm_ndp16_push refers to the last NDP visited by cdc_ncm_ndp16_find, hence this code is stateful.

Signed-Off-By: Enrico Mioso mrkiko...@gmail.com
---
 drivers/net/usb/cdc_ncm.c | 30 +-
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/drivers/net/usb/cdc_ncm.c b/drivers/net/usb/cdc_ncm.c
index 8067b8f..3c837d6 100644
--- a/drivers/net/usb/cdc_ncm.c
+++ b/drivers/net/usb/cdc_ncm.c
@@ -980,7 +980,7 @@ static void cdc_ncm_align_tail(struct sk_buff *skb, size_t modulus, size_t remai
 /* return a pointer to a valid struct usb_cdc_ncm_ndp16 of type sign, possibly
  * allocating a new one within skb
  */
-static struct usb_cdc_ncm_ndp16 *cdc_ncm_ndp(struct cdc_ncm_ctx *ctx, struct sk_buff *skb, __le32 sign, size_t reserve)
+static struct usb_cdc_ncm_ndp16 *cdc_ncm_ndp16_find(struct cdc_ncm_ctx *ctx, struct sk_buff *skb, __le32 sign)
 {
 	struct usb_cdc_ncm_ndp16 *ndp16 = NULL;
 	struct usb_cdc_ncm_nth16 *nth16 = (void *)skb->data;
@@ -988,12 +988,20 @@ static struct usb_cdc_ncm_ndp16 *cdc_ncm_ndp(struct cdc_ncm_ctx *ctx, struct sk_
 
 	/* follow the chain of NDPs, looking for a match */
 	while (ndpoffset) {
-		ndp16 = (struct usb_cdc_ncm_ndp16 *)(skb->data + ndpoffset);
-		if (ndp16->dwSignature == sign)
-			return ndp16;
+		ctx->tx_curr_ndp16 = (struct usb_cdc_ncm_ndp16 *)(skb->data + ndpoffset);
+		if (ctx->tx_curr_ndp16->dwSignature == sign)
+			ndp16 = ctx->tx_curr_ndp16;
 		ndpoffset = le16_to_cpu(ndp16->wNextNdpIndex);
 	}
+
+	return ndp16;
+}
 
+static struct usb_cdc_ncm_ndp16 *cdc_ncm_ndp16_push(struct cdc_ncm_ctx *ctx, struct sk_buff *skb, __le32 sign, size_t reserve)
+{
+	struct usb_cdc_ncm_ndp16 *ndp16 = ctx->tx_curr_ndp16;
+	struct usb_cdc_ncm_nth16 *nth16 = (void *)skb->data;
+
 	/* align new NDP */
 	cdc_ncm_align_tail(skb, ctx->tx_ndp_modulus, 0, ctx->tx_max);
 
@@ -1070,11 +1078,15 @@ cdc_ncm_fill_tx_frame(struct usbnet *dev, struct sk_buff *skb, __le32 sign)
 			break;
 		}
 
-		/* get the appropriate NDP for this skb */
-		ndp16 = cdc_ncm_ndp(ctx, skb_out, sign, skb->len + ctx->tx_modulus + ctx->tx_remainder);
-
-		/* align beginning of next frame */
-		cdc_ncm_align_tail(skb_out,  ctx->tx_modulus, ctx->tx_remainder, ctx->tx_max);
+		/* search for the appropriate NDP for this skb */
+		ndp16 = cdc_ncm_ndp16_find(ctx, skb_out, sign);
+
+		if (ndp16 == NULL)
+		{
+			ndp16 = cdc_ncm_ndp16_push(ctx, skb_out, sign, skb->len + ctx->tx_modulus + ctx->tx_remainder);
+		}
+else
+			cdc_ncm_align_tail(skb_out,  ctx->tx_modulus, ctx->tx_remainder, ctx->tx_max);
 
 		/* check if we had enough room left for both NDP and frame */
 		if (!ndp16 || skb_out->len + skb->len > ctx->tx_max) {
-- 
2.4.2
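For readers who want to experiment with the find/push split outside the driver, here is a self-contained userspace sketch of the idea. It is not the driver code and makes several simplifying assumptions: the `demo_*` structures are host-endian stand-ins for the NTH16/NDP16 headers, offsets are plain byte offsets into a flat buffer, and alignment is fixed at 4 bytes. In this sketch the walk reads the next offset from the NDP just visited, and the last visited offset is remembered in a context so that push can link the new block after it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-ins for the transfer header and the NDP header. */
struct demo_nth { uint16_t ndp_index; };  /* offset of first NDP, 0 = none */
struct demo_ndp { uint32_t signature; uint16_t next_index; uint16_t pad; };

/* State shared between find and push (the "stateful" part). */
struct demo_ctx { size_t last_ndp_off; };  /* 0 = no NDP visited yet */

/* Walk the chain; return the offset of the NDP matching sign, or 0. */
static size_t demo_ndp_find(struct demo_ctx *ctx, const uint8_t *buf,
			    size_t len, uint32_t sign)
{
	struct demo_nth nth;
	size_t off, hops = 0;

	ctx->last_ndp_off = 0;
	if (len < sizeof(nth))
		return 0;
	memcpy(&nth, buf, sizeof(nth));
	off = nth.ndp_index;

	/* hops guards against a malformed, cyclic chain */
	while (off && hops++ < len) {
		struct demo_ndp ndp;

		if (off + sizeof(ndp) > len)
			return 0;  /* chain points outside the buffer */
		memcpy(&ndp, buf + off, sizeof(ndp));
		ctx->last_ndp_off = off;
		if (ndp.signature == sign)
			return off;
		off = ndp.next_index;  /* advance via the NDP just visited */
	}
	return 0;
}

/* Append a new NDP at the (4-byte aligned) tail and link it in after the
 * last NDP visited by demo_ndp_find(), or from the header if none. */
static size_t demo_ndp_push(struct demo_ctx *ctx, uint8_t *buf, size_t *len,
			    size_t cap, uint32_t sign)
{
	struct demo_ndp ndp = { sign, 0, 0 };
	size_t off;
	uint16_t off16;

	if (*len < sizeof(struct demo_nth) || *len > cap)
		return 0;
	off = (*len + 3) & ~(size_t)3;
	if (off + sizeof(ndp) > cap || off > UINT16_MAX)
		return 0;  /* no room left */

	memset(buf + *len, 0, off - *len);  /* zero the alignment padding */
	memcpy(buf + off, &ndp, sizeof(ndp));
	off16 = (uint16_t)off;
	if (ctx->last_ndp_off)
		memcpy(buf + ctx->last_ndp_off +
		       offsetof(struct demo_ndp, next_index),
		       &off16, sizeof(off16));
	else
		memcpy(buf + offsetof(struct demo_nth, ndp_index),
		       &off16, sizeof(off16));

	*len = off + sizeof(ndp);
	ctx->last_ndp_off = off;
	return off;
}
```

The intended calling pattern mirrors the last hunk of the patch: call `demo_ndp_find()` first and only call `demo_ndp_push()` when it returns 0, so that push sees the tail of the chain left behind by find.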
[PATCH] ipv4: inet_bind: check the addr_len first
Perform the address length check first, before calling the proto specific bind() function

Signed-off-by: Denis Kirjanov k...@linux-powerpc.org
---
 net/ipv4/af_inet.c |    7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 6ad0f7a..333e2fa 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -426,14 +426,15 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 	int chk_addr_ret;
 	int err;
 
+	err = -EINVAL;
+	if (addr_len < sizeof(struct sockaddr_in))
+		goto out;
+
 	/* If the socket has its own bind function then use it. (RAW) */
 	if (sk->sk_prot->bind) {
 		err = sk->sk_prot->bind(sk, uaddr, addr_len);
 		goto out;
 	}
-	err = -EINVAL;
-	if (addr_len < sizeof(struct sockaddr_in))
-		goto out;
 
 	if (addr->sin_family != AF_INET) {
 		/* Compatibility games : accept AF_UNSPEC (mapped to AF_INET)
-- 
1.7.10.4
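The effect of the reordering can be shown with a small userspace sketch: the length check runs before any protocol-specific handler gets to look at the address. The names `demo_bind`, `demo_raw_bind` and `demo_raw_bind_calls` are hypothetical, and the AF_UNSPEC compatibility handling of the real function is left out.

```c
#include <errno.h>
#include <stddef.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Protocol-specific bind hook, standing in for sk->sk_prot->bind. */
typedef int (*demo_proto_bind)(const struct sockaddr *uaddr, int addr_len);

/* Validate the length first, then dispatch. */
static int demo_bind(demo_proto_bind proto_bind,
		     const struct sockaddr *uaddr, int addr_len)
{
	/* Checked up front, so no handler sees a short address. */
	if (addr_len < (int)sizeof(struct sockaddr_in))
		return -EINVAL;

	/* If the protocol has its own bind function then use it. */
	if (proto_bind)
		return proto_bind(uaddr, addr_len);

	if (uaddr->sa_family != AF_INET)
		return -EAFNOSUPPORT;

	return 0;
}

/* Hypothetical handler that reads sockaddr_in fields without checking
 * the length itself; it relies on the caller having done so. */
static int demo_raw_bind_calls;

static int demo_raw_bind(const struct sockaddr *uaddr, int addr_len)
{
	const struct sockaddr_in *sin = (const struct sockaddr_in *)uaddr;

	(void)addr_len;
	(void)sin->sin_port;
	demo_raw_bind_calls++;
	return 0;
}
```

With the check placed after the dispatch, as in the code being replaced, a handler like `demo_raw_bind()` would be reached even when the caller passed a length shorter than `struct sockaddr_in`.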
Re: [RFC] [PATCH] net: socket: Fix the wrong returns for recvmsg and sendmsg
On Tue, Jun 02, 2015 at 02:43:54PM +0800, Junling Zheng wrote:
> On 2015/6/2 14:27, Greg KH wrote:
>> On Mon, Jun 01, 2015 at 10:23:57PM -0700, David Miller wrote:
>>> From: Junling Zheng zhengjunl...@huawei.com
>>> Date: Tue, 2 Jun 2015 12:05:32 +0800
>>>
>>>> So, the problem commit is 281c9c36 (net: compat: Update
>>>> get_compat_msghdr() to match copy_msghdr_from_user() behaviour),
>>>> which fixes db31c55a6fb2 and brings the get_compat_msghdr() in line
>>>> with copy_msghdr_from_user().
>>>
>>> Upstream this got fixed by:
>>>
>>> 08adb7dabd4874cc5666b4490653b26534702ce0
>>>
>>> So the part that makes us not unconditionally return -EFAULT needs to
>>> be backported, and that's probably equivalent to the patch you
>>> proposed which therefore should be applied.
>>
>> Ok, thanks, now applied.
>
> Maybe other stable version also needs this fix:)

Yes, from what I'm seeing, at least 3.2 and 2.6.32 need it as well.

Thanks,
Willy
Re: [PATCH net] net/mlx4: need to call close fw if alloc icm is called twice
On Mon, Jun 1, 2015 at 5:41 PM, cls...@linux.vnet.ibm.com wrote:
> --- a/drivers/net/ethernet/mellanox/mlx4/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/main.c
> @@ -2837,6 +2837,7 @@ slave_start:
>  						  existing_vfs,
>  						  reset_flow);
>
> +			mlx4_close_fw(dev);
>  			mlx4_cmd_cleanup(dev, MLX4_CMD_CLEANUP_ALL);
>  			dev->flags = dev_flags;
>  			if (!SRIOV_VALID_STATE(dev->flags)) {

Acked-by: Or Gerlitz ogerl...@mellanox.com
Re: [PATCH net] net/mlx4: double free of dev_vfs
On Mon, Jun 1, 2015 at 5:41 PM, cls...@linux.vnet.ibm.com wrote:
> --- a/drivers/net/ethernet/mellanox/mlx4/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/main.c
> @@ -2685,6 +2685,7 @@ disable_sriov:
>  free_mem:
>  	dev->persist->num_vfs = 0;
>  	kfree(dev->dev_vfs);
> +	dev->dev_vfs = NULL;
>  	return dev_flags & ~MLX4_FLAG_MASTER;
>  }

Acked-by: Or Gerlitz ogerl...@mellanox.com
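The one-line fix quoted above follows a common pattern: clear the pointer right after freeing it, so that a second pass through the cleanup path frees NULL (a no-op) instead of freeing the same block twice. A minimal userspace sketch, with hypothetical `demo_*` names and libc `free()` in place of `kfree()`:

```c
#include <stdlib.h>

/* Hypothetical device state with a dynamically allocated VF array. */
struct demo_dev {
	int num_vfs;
	int *dev_vfs;
};

/* Cleanup path that is safe to run more than once. */
static void demo_free_vfs(struct demo_dev *dev)
{
	dev->num_vfs = 0;
	free(dev->dev_vfs);
	dev->dev_vfs = NULL;  /* a later free(NULL) is a harmless no-op */
}
```

Without the assignment to NULL, calling `demo_free_vfs()` twice would pass the same stale pointer to `free()` both times, which is undefined behaviour in userspace and a double free in the kernel case.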
[PATCH 00/10] net: thunderx: fix problems reported by static check tools
These are fixes for the problems that were reported by static check tools.

Aleksey Makarov (9):
  net: thunderx: fix constants
  net: thunderx: introduce a function for mailbox access
  net: thunderx: rework mac address handling
  net: thunderx: delete unused variables
  net: thunderx: add static
  net: thunderx: fix nicvf_set_rxfh()
  net: thunderx: remove unneeded type conversions
  net: thunderx: check if memory allocation was successful
  net: thunderx: use GFP_KERNEL in thread context

Robert Richter (1):
  net: thunderx: Cleanup duplicate NODE_ID macros, add nic_get_node_id()

 drivers/net/ethernet/cavium/thunder/nic.h          | 16 +++--
 drivers/net/ethernet/cavium/thunder/nic_main.c     | 12 +---
 .../net/ethernet/cavium/thunder/nicvf_ethtool.c    |  3 +-
 drivers/net/ethernet/cavium/thunder/nicvf_main.c   | 73 +++---
 drivers/net/ethernet/cavium/thunder/nicvf_queues.c |  9 +--
 drivers/net/ethernet/cavium/thunder/thunder_bgx.c  | 18 +++---
 drivers/net/ethernet/cavium/thunder/thunder_bgx.h  |  7 +--
 7 files changed, 67 insertions(+), 71 deletions(-)

-- 
2.4.1
Re: [RFC net-next 0/3] IP imposition of per-nh MPLS encap
Robert Shearman rshea...@brocade.com writes:

> In order to be able to function as a Label Edge Router in an MPLS
> network, it is necessary to be able to take IP packets and impose an
> MPLS encap and forward them out. The traditional approach of setting up
> an interface for each tunnel endpoint doesn't scale for the common MPLS
> use-cases where each IP route tends to be assigned a different label as
> encap.
>
> The solution suggested here for further discussion is to provide the
> facility to define encap data on a per-nexthop basis using a new
> netlink attribute, RTA_ENCAP, which would be opaque to the IPv4/IPv6
> forwarding code, but interpreted by the virtual interface assigned to
> the nexthop.
>
> A new ipmpls interface type is defined to show the use of this facility
> to allow IP packets to be imposed with an MPLS encap. However, the
> facility is designed to be general enough to be used by any
> encapsulation/tunneling mechanism that has similar requirements of
> high-scale, high-variation-of-encap.

I am still digging into the details but adding a new network device to make this possible is very undesirable.

It is a pain point. Those network devices get to be a major source of memory consumption when there are 4K network namespaces in existence.

It is conceptually wrong. The network device will never be used as an ordinary network device. All the network device gives you is the ability to avoid creating an enumeration of different kinds of encapsulation.

Eric
Re: [PATCH net-next 5/5] rocker: remove support for legacy VLAN ndo ops
On Tue, Jun 2, 2015 at 9:58 AM, roopa ro...@cumulusnetworks.com wrote:
On 6/2/15, 7:30 AM, Scott Feldman wrote:
On Tue, Jun 2, 2015 at 4:43 AM, Jamal Hadi Salim j...@mojatatu.com wrote:
On 06/02/15 03:10, Scott Feldman wrote:

Actually, we're now consistent with bridge man page which says master is the default. What we want, I believe, is to adjust what the man page says (and the bridge vlan command itself), by making the default master and self. The kernel and driver are fine, it's the default in the bridge command that needs adjusting. Once we do this, we'll be back to transparent with software-only bridge.

Question to ask when looking at something of this nature: Will it work with no surprises if you used today's unmodified app? The default behavior shouldn't change and unfortunately it does here.

The default behavior does change, yes, but there shouldn't be any surprises even if using today's unmodified app. The reason why is no in-kernel driver is using ndo_bridge_setlink for VLAN setup. The three drivers that have ndo_bridge_setlink use it to set hwmode to VEPA|VEB. For VLAN setup, they use the (default master) bridge's ndo_bridge_setlink->ndo_vlan_rx_add_vid. If the default changes from master to master|self, the bridge's ndo_bridge_setlink->ndo_vlan_rx_add_vid is still called for those drivers using ndo_vlan_rx_add_vid, and if they implement ndo_bridge_setlink, they'll get called a second time but will noop because there will be no IFLA_BRIDGE_MODE (hwmode) attr to process.

So it comes down to two choices:

  1) break ABI, which is inconsequential for in-kernel drivers and preserve (iproute2) command transparency, or
  2) embrace existing behavior which is consistent with man pages but breaks command transparency for any driver implementing ndo_bridge_setlink for VLAN setup, which currently is just rocker. I can see the DSA going down this path also based on another concurrent thread.

We're at option 2) right now.
It is not just iproute2 - since this is breaking ABI expectations. Looking at some app i wrote a while back based on analyzing kernel expectations at the time, I see the following logic:

  user can set master or self on command line.
  ...
  if (user DID NOT set master_on || user set self on) then set self to on

iow, current behavior:

  01: master is only set if user explicitly asked.
  11: master|self when user explicitly sets both
  10: self is on by default when the user doesn't specify anything
  00: and the last option is to have none set which is not possible since we have defaults.

cheers, jamal

So this is very similar to iproute2 - if nothing is set it defaults to self.

Ha, you're giving the behavior for bridge fdb command, where self is the default.

Oh...i did not realize this was the case either. That's unfortunate.

For bridge link and bridge vlan, the default is master. The user must explicitly specify self to act on the device side of the port. It's unfortunate the iproute2 defaults aren't consistent between commands. Maybe someone knows the history here and can explain.

scott, this brings back the discussion you and i had over the revert of my patches.. (commit id's at the end of this email)... which used to seamlessly offload to switchdev from bridge driver if the port was a switch port (similar to stp state offload). Your patch tried to do the same thing that the bridge's ndo_bridge_setlink/dellink is doing which is using the handler for MASTER to also set SELF stuff, when SELF was not specified.

I don't feel we should be overriding the application defaults in the kernel; instead, we should change the application if we want different behavior. The kernel should treat the two sides of the port independently (that's the basic algo in rtnetlink.c handlers for MASTER/SELF things). When you start doing kernel SELF things in the MASTER path, the application has lost the ability to address each side of the port independently.
'self' used to exist before switchdev infra came in. My suggestion was to use it where required...but not build the switchdev api on the presence of 'self'. switchdev layer should be consistent across...all fib/fdb/neigh layers.

I don't understand why you're bringing up fib/neigh because there is no master|self form for those. The master|self objects are bridge fdb, settings, and vlans. To be clear, they are PF_BRIDGE handlers for:

  PF_BRIDGE:RTM_NEWNEIGH: add fdb entry
  PF_BRIDGE:RTM_DELNEIGH: del fdb entry
  PF_BRIDGE:RTM_SETLINK: set bridge setting or add VLAN
  PF_BRIDGE:RTM_DELLINK: del VLAN

The net/core/rtnetlink.c code for these _is_ consistent right now. They all perform this same basic algorithm:

  handler()
      if (!flags || flags & MASTER)
          if (master && master->op->foo)
              master->op->foo();
      if (flags & SELF)
          if (port->op->foo)
              port->op->foo();

This lets the application set MASTER and/or SELF
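
The MASTER/SELF handler algorithm quoted in this message can be tried out in userspace. What follows is only a minimal sketch under stated assumptions: FLAG_MASTER, FLAG_SELF, struct side_ops, struct port, count_call() and dispatch() are hypothetical names invented for illustration, not the kernel's definitions, and the flag values are arbitrary.

```c
#include <stddef.h>

/* Hypothetical flag bits standing in for the MASTER/SELF request flags. */
enum { FLAG_MASTER = 1 << 0, FLAG_SELF = 1 << 1 };

/* Hypothetical ops table: one optional callback per side of the port. */
struct side_ops {
	int (*foo)(void *ctx);
};

struct port {
	const struct side_ops *master_ops;	/* NULL when the port has no master */
	const struct side_ops *self_ops;	/* ops of the port device itself */
	void *ctx;
};

/* Example callback: bumps the int counter that ctx points to. */
static int count_call(void *ctx)
{
	++*(int *)ctx;
	return 0;
}

/*
 * The master side is called when no flags are given or MASTER is set;
 * the device side is called only when SELF is set.
 * Returns a bitmask of the sides whose callback was actually invoked.
 */
static unsigned int dispatch(const struct port *p, unsigned int flags)
{
	unsigned int called = 0;

	if (!flags || (flags & FLAG_MASTER)) {
		if (p->master_ops && p->master_ops->foo) {
			p->master_ops->foo(p->ctx);
			called |= FLAG_MASTER;
		}
	}
	if (flags & FLAG_SELF) {
		if (p->self_ops && p->self_ops->foo) {
			p->self_ops->foo(p->ctx);
			called |= FLAG_SELF;
		}
	}
	return called;
}
```

With both ops tables populated, dispatch(&p, 0) reaches only the master side, which is the "default is master" behavior discussed for bridge link and bridge vlan; the device side is reached only when the caller passes FLAG_SELF.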
Re: [PATCH v4 00/25] Convert the posix_clock_operations and k_clock structure to ready for 2038
On Mon, 1 Jun 2015, Baolin Wang wrote:

You failed to thread the patch series again
[PATCH 04/10] net: thunderx: rework mac address handling
This fixes sparse message: drivers/net/ethernet/cavium/thunder/nicvf_main.c:385:40: sparse: cast to restricted __le64 Reported-by: kbuild test robot fengguang...@intel.com Signed-off-by: Aleksey Makarov aleksey.maka...@caviumnetworks.com --- drivers/net/ethernet/cavium/thunder/nic.h | 4 ++-- drivers/net/ethernet/cavium/thunder/nic_main.c| 8 +--- drivers/net/ethernet/cavium/thunder/nicvf_main.c | 8 ++-- drivers/net/ethernet/cavium/thunder/thunder_bgx.c | 4 ++-- drivers/net/ethernet/cavium/thunder/thunder_bgx.h | 4 ++-- 5 files changed, 9 insertions(+), 19 deletions(-) diff --git a/drivers/net/ethernet/cavium/thunder/nic.h b/drivers/net/ethernet/cavium/thunder/nic.h index 4f426db..6479ce2 100644 --- a/drivers/net/ethernet/cavium/thunder/nic.h +++ b/drivers/net/ethernet/cavium/thunder/nic.h @@ -301,7 +301,7 @@ struct nic_cfg_msg { u8vf_id; u8tns_mode; u8node_id; - u64 mac_addr; + u8mac_addr[ETH_ALEN]; }; /* Qset configuration */ @@ -331,7 +331,7 @@ struct sq_cfg_msg { struct set_mac_msg { u8msg; u8vf_id; - u64 addr; + u8mac_addr[ETH_ALEN]; }; /* Set Maximum frame size */ diff --git a/drivers/net/ethernet/cavium/thunder/nic_main.c b/drivers/net/ethernet/cavium/thunder/nic_main.c index 3ca7ad8..6e0c031 100644 --- a/drivers/net/ethernet/cavium/thunder/nic_main.c +++ b/drivers/net/ethernet/cavium/thunder/nic_main.c @@ -492,7 +492,6 @@ static void nic_handle_mbx_intr(struct nicpf *nic, int vf) u64 *mbx_data; u64 mbx_addr; u64 reg_addr; - u64 mac_addr; int bgx, lmac; int i; int ret = 0; @@ -555,12 +554,7 @@ static void nic_handle_mbx_intr(struct nicpf *nic, int vf) lmac = mbx.mac.vf_id; bgx = NIC_GET_BGX_FROM_VF_LMAC_MAP(nic-vf_lmac_map[lmac]); lmac = NIC_GET_LMAC_FROM_VF_LMAC_MAP(nic-vf_lmac_map[lmac]); -#ifdef __BIG_ENDIAN - mac_addr = cpu_to_be64(mbx.nic_cfg.mac_addr) 16; -#else - mac_addr = cpu_to_be64(mbx.nic_cfg.mac_addr) 16; -#endif - bgx_set_lmac_mac(nic-node, bgx, lmac, (u8 *)mac_addr); + bgx_set_lmac_mac(nic-node, bgx, lmac, mbx.mac.mac_addr); break; case 
NIC_MBOX_MSG_SET_MAX_FRS: ret = nic_update_hw_frs(nic, mbx.frs.max_frs, diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c index 989f005..54bba86 100644 --- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c +++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c @@ -197,8 +197,7 @@ static void nicvf_handle_mbx_intr(struct nicvf *nic) nic-vf_id = mbx.nic_cfg.vf_id 0x7F; nic-tns_mode = mbx.nic_cfg.tns_mode 0x7F; nic-node = mbx.nic_cfg.node_id; - ether_addr_copy(nic-netdev-dev_addr, - (u8 *)mbx.nic_cfg.mac_addr); + ether_addr_copy(nic-netdev-dev_addr, mbx.nic_cfg.mac_addr); nic-link_up = false; nic-duplex = 0; nic-speed = 0; @@ -248,13 +247,10 @@ static void nicvf_handle_mbx_intr(struct nicvf *nic) static int nicvf_hw_set_mac_addr(struct nicvf *nic, struct net_device *netdev) { union nic_mbx mbx = {}; - int i; mbx.mac.msg = NIC_MBOX_MSG_SET_MAC; mbx.mac.vf_id = nic-vf_id; - for (i = 0; i ETH_ALEN; i++) - mbx.mac.addr = (mbx.mac.addr 8) | -netdev-dev_addr[i]; + ether_addr_copy(mbx.mac.mac_addr, netdev-dev_addr); return nicvf_send_msg_to_pf(nic, mbx); } diff --git a/drivers/net/ethernet/cavium/thunder/thunder_bgx.c b/drivers/net/ethernet/cavium/thunder/thunder_bgx.c index cde604a..a58924c 100644 --- a/drivers/net/ethernet/cavium/thunder/thunder_bgx.c +++ b/drivers/net/ethernet/cavium/thunder/thunder_bgx.c @@ -163,7 +163,7 @@ void bgx_get_lmac_link_state(int node, int bgx_idx, int lmacid, void *status) } EXPORT_SYMBOL(bgx_get_lmac_link_state); -const char *bgx_get_lmac_mac(int node, int bgx_idx, int lmacid) +const u8 *bgx_get_lmac_mac(int node, int bgx_idx, int lmacid) { struct bgx *bgx = bgx_vnic[(node * MAX_BGX_PER_CN88XX) + bgx_idx]; @@ -174,7 +174,7 @@ const char *bgx_get_lmac_mac(int node, int bgx_idx, int lmacid) } EXPORT_SYMBOL(bgx_get_lmac_mac); -void bgx_set_lmac_mac(int node, int bgx_idx, int lmacid, const char *mac) +void bgx_set_lmac_mac(int node, int bgx_idx, int lmacid, const u8 *mac) { struct 
bgx *bgx = bgx_vnic[(node * MAX_BGX_PER_CN88XX) + bgx_idx]; diff --git a/drivers/net/ethernet/cavium/thunder/thunder_bgx.h b/drivers/net/ethernet/cavium/thunder/thunder_bgx.h index f9e2170..ba4f53b 100644 --- a/drivers/net/ethernet/cavium/thunder/thunder_bgx.h +++ b/drivers/net/ethernet/cavium/thunder/thunder_bgx.h @@ -183,8 +183,8 @@ enum MCAST_MODE { void
[PATCH 07/10] net: thunderx: fix nicvf_set_rxfh()
This fixes a copypaste bug that was discovered by a static analysis tool:

The patch 4863dea3fab0: "net: Adding support for Cavium ThunderX network controller" from May 26, 2015, leads to the following static checker warning:

	drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c:517 nicvf_set_rxfh()
	warn: we tested 'hkey' before and it was 'false'

drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c
   506		/* We do not allow change in unsupported parameters */
   507		if (hkey ||

We return here.

   508		    (hfunc != ETH_RSS_HASH_NO_CHANGE && hfunc != ETH_RSS_HASH_TOP))
   509			return -EOPNOTSUPP;
   510
   511		rss->enable = true;
   512		if (indir) {
   513			for (idx = 0; idx < rss->rss_size; idx++)
   514				rss->ind_tbl[idx] = indir[idx];
   515		}
   516
   517		if (hkey) {

So this is dead code.

   518			memcpy(rss->key, hkey, RSS_HASH_KEY_SIZE * sizeof(u64));
   519			nicvf_set_rss_key(nic);
   520		}
   521
   522		nicvf_config_rss(nic);
   523		return 0;
   524	}

regards,
dan carpenter

Reported-by: Dan Carpenter dan.carpen...@oracle.com
Signed-off-by: Aleksey Makarov aleksey.maka...@caviumnetworks.com
---
 drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c b/drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c
index 0fc4a53..16bd2d7 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c
@@ -504,8 +504,7 @@ static int nicvf_set_rxfh(struct net_device *dev, const u32 *indir,
 	}
 
 	/* We do not allow change in unsupported parameters */
-	if (hkey ||
-	    (hfunc != ETH_RSS_HASH_NO_CHANGE && hfunc != ETH_RSS_HASH_TOP))
+	if (hfunc != ETH_RSS_HASH_NO_CHANGE && hfunc != ETH_RSS_HASH_TOP)
 		return -EOPNOTSUPP;
 
 	rss->enable = true;
-- 
2.4.1
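
The effect of the early hkey test can be reproduced outside the driver. This is a reduced sketch, not the driver code: the HASH_* constants, the function names and the key_updated out-parameter are hypothetical stand-ins chosen for illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical hash-function selectors; only the first two are "supported". */
enum { HASH_NO_CHANGE = 0, HASH_TOP = 1, HASH_XOR = 2 };

/* Before the fix: any non-NULL key is rejected up front. */
static int set_rxfh_before(const unsigned char *hkey, int hfunc, int *key_updated)
{
	*key_updated = 0;
	if (hkey || (hfunc != HASH_NO_CHANGE && hfunc != HASH_TOP))
		return -EOPNOTSUPP;

	if (hkey)		/* unreachable: hkey is always NULL here */
		*key_updated = 1;
	return 0;
}

/* After the fix: only the hash function is validated, so a key can be set. */
static int set_rxfh_after(const unsigned char *hkey, int hfunc, int *key_updated)
{
	*key_updated = 0;
	if (hfunc != HASH_NO_CHANGE && hfunc != HASH_TOP)
		return -EOPNOTSUPP;

	if (hkey)
		*key_updated = 1;
	return 0;
}
```

Passing a non-NULL key with HASH_TOP returns -EOPNOTSUPP from the first variant, while the second variant accepts it and reports the key as updated; an unsupported hash function is still rejected by both.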
[PATCH 03/10] net: thunderx: introduce a function for mailbox access
This fixes sparse message:

drivers/net/ethernet/cavium/thunder/nicvf_main.c:153:25: sparse: cast to restricted __le64

Reported-by: kbuild test robot fengguang...@intel.com
Signed-off-by: Aleksey Makarov aleksey.maka...@caviumnetworks.com
---
 drivers/net/ethernet/cavium/thunder/nicvf_main.c | 27 +++-
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index f81182c..989f005 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -110,17 +110,23 @@ u64 nicvf_queue_reg_read(struct nicvf *nic, u64 offset, u64 qidx)
 
 /* VF -> PF mailbox communication */
 
+static void nicvf_write_to_mbx(struct nicvf *nic, union nic_mbx *mbx)
+{
+	u64 *msg = (u64 *)mbx;
+
+	nicvf_reg_write(nic, NIC_VF_PF_MAILBOX_0_1 + 0, msg[0]);
+	nicvf_reg_write(nic, NIC_VF_PF_MAILBOX_0_1 + 8, msg[1]);
+}
+
 int nicvf_send_msg_to_pf(struct nicvf *nic, union nic_mbx *mbx)
 {
 	int timeout = NIC_MBOX_MSG_TIMEOUT;
 	int sleep = 10;
-	u64 *msg = (u64 *)mbx;
 
 	nic->pf_acked = false;
 	nic->pf_nacked = false;
 
-	nicvf_reg_write(nic, NIC_VF_PF_MAILBOX_0_1 + 0, msg[0]);
-	nicvf_reg_write(nic, NIC_VF_PF_MAILBOX_0_1 + 8, msg[1]);
+	nicvf_write_to_mbx(nic, mbx);
 
 	/* Wait for previous message to be acked, timeout 2sec */
 	while (!nic->pf_acked) {
@@ -146,12 +152,13 @@ int nicvf_send_msg_to_pf(struct nicvf *nic, union nic_mbx *mbx)
 static int nicvf_check_pf_ready(struct nicvf *nic)
 {
 	int timeout = 5000, sleep = 20;
+	union nic_mbx mbx = {};
+
+	mbx.msg.msg = NIC_MBOX_MSG_READY;
 
 	nic->pf_ready_to_rcv_msg = false;
 
-	nicvf_reg_write(nic, NIC_VF_PF_MAILBOX_0_1 + 0,
-			le64_to_cpu(NIC_MBOX_MSG_READY));
-	nicvf_reg_write(nic, NIC_VF_PF_MAILBOX_0_1 + 8, 1ULL);
+	nicvf_write_to_mbx(nic, &mbx);
 
 	while (!nic->pf_ready_to_rcv_msg) {
 		msleep(sleep);
@@ -368,7 +375,9 @@ int nicvf_set_real_num_queues(struct net_device *netdev,
 static int nicvf_init_resources(struct nicvf *nic)
 {
 	int err;
-	u64 mbx_addr = NIC_VF_PF_MAILBOX_0_1;
+	union nic_mbx mbx = {};
+
+	mbx.msg.msg = NIC_MBOX_MSG_CFG_DONE;
 
 	/* Enable Qset */
 	nicvf_qset_config(nic, true);
@@ -382,9 +391,7 @@ static int nicvf_init_resources(struct nicvf *nic)
 	}
 
 	/* Send VF config done msg to PF */
-	nicvf_reg_write(nic, mbx_addr, le64_to_cpu(NIC_MBOX_MSG_CFG_DONE));
-	mbx_addr += (NIC_PF_VF_MAILBOX_SIZE - 1) * 8;
-	nicvf_reg_write(nic, mbx_addr, 1ULL);
+	nicvf_write_to_mbx(nic, &mbx);
 
 	return 0;
 }
-- 
2.4.1
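
The idea of funnelling every mailbox write through one helper can be illustrated with a mock register window. This is a userspace sketch only: union mbx_msg, mock_mailbox, mock_reg_write() and write_to_mbx() are hypothetical names, and the sketch copies the message with memcpy() rather than casting the union pointer to a u64 pointer as the patch does.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte mailbox message; the real union has more variants. */
union mbx_msg {
	struct {
		uint8_t msg;	/* message type lives in the first byte */
	} hdr;
	uint8_t raw[16];
};

/* Mock register window standing in for the two 64-bit mailbox registers. */
static uint64_t mock_mailbox[2];

static void mock_reg_write(unsigned int offset, uint64_t val)
{
	mock_mailbox[offset / 8] = val;
}

/* The single place that copies a message into the mailbox, word by word. */
static void write_to_mbx(const union mbx_msg *mbx)
{
	uint64_t words[2];

	memcpy(words, mbx, sizeof(words));
	mock_reg_write(0, words[0]);
	mock_reg_write(8, words[1]);
}
```

Callers fill in a zero-initialized message and hand it to write_to_mbx(), so the message type always lands in the first byte of the first word instead of being passed around as a bare 64-bit constant.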
Re: [Intel-wired-lan] [PATCH 1/2] pci: Add dev_flags bit to access VPD through function 0
On Jun 2, 2015, at 10:48 AM, Alexander Duyck alexander.h.du...@redhat.com wrote:

I'm pretty sure these could cause some serious errors if you direct assign the device into a VM since you then end up with multiple devices sharing a bus. Also it would likely have side-effects on a LOM (Lan On Motherboard) as it also shares the bus with multiple non-Ethernet devices.

I believe you still need to add something like a check for !pci_is_root_bus(dev->bus) before you attempt to grab function 0.

It probably also wouldn't hurt to check the dev->multifunction bit before running this code since it wouldn't make sense to go chasing down the VPD on another function if the device doesn't have one. You could probably do that either as a part of this code, or perhaps put it in the quirk.

I'll look into those. I think you are right about more checks being needed. Thanks for the comments.

-- 
Mark Rustad, Networking Division, Intel Corporation
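
The two guards suggested in this exchange can be sketched with mock types. Everything below is hypothetical and only mirrors the shape of the checks: struct mock_bus, struct mock_dev, mock_is_root_bus() and may_use_function0_vpd() are illustration names, and the final "not already function 0" test is an assumption added for the sketch rather than something stated in the thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Mock stand-ins for the bus and device structures. */
struct mock_bus {
	struct mock_bus *parent;	/* NULL for a root bus */
};

struct mock_dev {
	struct mock_bus *bus;
	bool multifunction;		/* device exposes more than one function */
	unsigned int devfn;		/* low three bits hold the function number */
};

static bool mock_is_root_bus(const struct mock_bus *bus)
{
	return bus->parent == NULL;
}

/* Decide whether VPD access should be redirected to function 0. */
static bool may_use_function0_vpd(const struct mock_dev *dev)
{
	if (mock_is_root_bus(dev->bus))
		return false;	/* unrelated devices may share a root bus */
	if (!dev->multifunction)
		return false;	/* no other function to borrow VPD from */
	return (dev->devfn & 7) != 0;	/* assumption: function 0 reads its own VPD */
}
```

A device on a root bus, or one that is not multifunction, keeps its own VPD path in this sketch; only a non-zero function of a multifunction device behind a bridge is redirected.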
Re: [RFC net-next 0/3] IP imposition of per-nh MPLS encap
roopa ro...@cumulusnetworks.com writes:

On 6/1/15, 9:46 AM, Robert Shearman wrote:

In order to be able to function as a Label Edge Router in an MPLS network, it is necessary to be able to take IP packets and impose an MPLS encap and forward them out.

The traditional approach of setting up an interface for each tunnel endpoint doesn't scale for the common MPLS use-cases where each IP route tends to be assigned a different label as encap.

The solution suggested here for further discussion is to provide the facility to define encap data on a per-nexthop basis using a new netlink attribute, RTA_ENCAP, which would be opaque to the IPv4/IPv6 forwarding code, but interpreted by the virtual interface assigned to the nexthop.

A new ipmpls interface type is defined to show the use of this facility to allow IP packets to be imposed with an MPLS encap. However, the facility is designed to be general enough to be used by any encapsulation/tunneling mechanism that has similar requirements of high-scale, high-variation-of-encap.

RFC because:
- IPv6 side not implemented
- struct rtable shouldn't be bloated by pointer+uint
- Hasn't been thoroughly tested yet

Robert Shearman (3):
  net: infra for per-nexthop encap data
  ipv4: storing and retrieval of per-nexthop encap
  mpls: new ipmpls device for encapsulating IP packets as mpls

Glad to see these patches! I have a similar series i have been working on...but no netdevice. A set of ops similar to iptun_encaps and I store encap data in fib_nh and in ip_route_output_slow i point the dst.output to the output func provided by one of the encap ops. I see the advantages of using a netdevice...and i see this align with patches from thomas.

roopa

I think I would prefer your patches. I think using a netdevice the way Robert is proposing is quite possibly a mess, from a scalability standpoint.

Do you mean ip_route_input_slow? There is no ip_route_output_slow.

Eric
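
The "ops table instead of a netdevice" approach mentioned in this exchange can be pictured with a small userspace model. This is purely a hypothetical sketch of the general pattern, not code from either series: struct packet, struct encap_ops, struct nexthop and the function names are invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_LABELS 4

/* Toy packet: just a label stack. */
struct packet {
	uint32_t labels[MAX_LABELS];
	size_t num_labels;
};

/* Hypothetical per-encap-type operations table. */
struct encap_ops {
	int (*output)(struct packet *pkt, const void *encap_data);
};

/* Hypothetical nexthop carrying opaque encap data and the ops to apply it. */
struct nexthop {
	const struct encap_ops *ops;	/* NULL means plain forwarding */
	const void *encap_data;
};

/* Example encap: push one label taken from the nexthop's encap data. */
static int label_push_output(struct packet *pkt, const void *encap_data)
{
	const uint32_t *label = encap_data;

	if (pkt->num_labels >= MAX_LABELS)
		return -1;
	pkt->labels[pkt->num_labels++] = *label;
	return 0;
}

/* Output path: use the encap's output function when the nexthop has one. */
static int nexthop_output(const struct nexthop *nh, struct packet *pkt)
{
	if (nh->ops && nh->ops->output)
		return nh->ops->output(pkt, nh->encap_data);
	return 0;	/* nothing to impose */
}
```

In this model each route's nexthop carries its own label, so no per-tunnel device is needed; the cost per route is one pointer to shared ops plus the encap data itself.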
[PATCH 05/10] net: thunderx: delete unused variables
They were left from development stage

Reported-by: kbuild test robot fengguang...@intel.com
Signed-off-by: Aleksey Makarov aleksey.maka...@caviumnetworks.com
---
 drivers/net/ethernet/cavium/thunder/thunder_bgx.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/thunder_bgx.c b/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
index a58924c..83476f0 100644
--- a/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
+++ b/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
@@ -38,7 +38,7 @@ struct lmac {
 	bool			is_sgmii;
 	struct delayed_work	dwork;
 	struct workqueue_struct *check_link;
-} lmac;
+};
 
 struct bgx {
 	u8			bgx_id;
@@ -50,7 +50,7 @@ struct bgx {
 	int			use_training;
 	void __iomem		*reg_base;
 	struct pci_dev		*pdev;
-} bgx;
+};
 
 struct bgx *bgx_vnic[MAX_BGX_THUNDER];
 static int lmac_count; /* Total no of LMACs in system */
-- 
2.4.1
[PATCH 09/10] net: thunderx: check if memory allocation was successful
This fixes a coccinelle warning:

coccinelle warnings: (new ones prefixed by >>)

>> drivers/net/ethernet/cavium/thunder/nicvf_queues.c:360:1-11: alloc with no test, possible model on line 367

vim +360 drivers/net/ethernet/cavium/thunder/nicvf_queues.c

   354		err = nicvf_alloc_q_desc_mem(nic, &sq->dmem, q_len, SND_QUEUE_DESC_SIZE,
   355					     NICVF_SQ_BASE_ALIGN_BYTES);
   356		if (err)
   357			return err;
   358
   359		sq->desc = sq->dmem.base;
   360		sq->skbuff = kcalloc(q_len, sizeof(u64), GFP_ATOMIC);
   361		sq->head = 0;
   362		sq->tail = 0;
   363		atomic_set(&sq->free_cnt, q_len - 1);
   364		sq->thresh = SND_QUEUE_THRESH;
   365
   366		/* Preallocate memory for TSO segment's header */
   367		sq->tso_hdrs = dma_alloc_coherent(&nic->pdev->dev,
   368						  q_len * TSO_HEADER_SIZE,
   369						  &sq->tso_hdrs_phys, GFP_KERNEL);
   370		if (!sq->tso_hdrs)

Reported-by: kbuild test robot fengguang...@intel.com
Signed-off-by: Aleksey Makarov aleksey.maka...@caviumnetworks.com
---
 drivers/net/ethernet/cavium/thunder/nicvf_queues.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
index 8929029..2ed7d1b 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
@@ -357,6 +357,8 @@ static int nicvf_init_snd_queue(struct nicvf *nic,
 
 	sq->desc = sq->dmem.base;
 	sq->skbuff = kcalloc(q_len, sizeof(u64), GFP_ATOMIC);
+	if (!sq->skbuff)
+		return -ENOMEM;
 	sq->head = 0;
 	sq->tail = 0;
 	atomic_set(&sq->free_cnt, q_len - 1);
-- 
2.4.1
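
The same "test the allocation before using it" pattern, shown in userspace with calloc() standing in for kcalloc(). A minimal sketch: struct snd_queue here is a reduced, hypothetical version of the driver's send queue, and init_snd_queue() is an illustration name.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Reduced, hypothetical send queue: only the fields the sketch needs. */
struct snd_queue {
	uint64_t *skbuff;	/* one slot per queue entry */
	size_t head;
	size_t tail;
};

/* Allocate the per-entry array and bail out before touching other state. */
static int init_snd_queue(struct snd_queue *sq, size_t q_len)
{
	sq->skbuff = calloc(q_len, sizeof(uint64_t));
	if (!sq->skbuff)
		return -ENOMEM;

	sq->head = 0;
	sq->tail = 0;
	return 0;
}
```

Returning -ENOMEM at the allocation site means a later store into sq->skbuff can never dereference NULL, which is the situation the coccinelle report points at.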
[PATCH 10/10] net: thunderx: use GFP_KERNEL in thread context
GFP_KERNEL should be used in the thread context

Signed-off-by: Aleksey Makarov aleksey.maka...@caviumnetworks.com
---
 drivers/net/ethernet/cavium/thunder/nicvf_queues.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
index 2ed7d1b..d69d228 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
@@ -356,7 +356,7 @@ static int nicvf_init_snd_queue(struct nicvf *nic,
 		return err;
 
 	sq->desc = sq->dmem.base;
-	sq->skbuff = kcalloc(q_len, sizeof(u64), GFP_ATOMIC);
+	sq->skbuff = kcalloc(q_len, sizeof(u64), GFP_KERNEL);
 	if (!sq->skbuff)
 		return -ENOMEM;
 	sq->head = 0;
-- 
2.4.1
[PATCH 08/10] net: thunderx: remove unneeded type conversions
No need to cast void* to u8*: pointer arithmetic works the same way for both.

Signed-off-by: Aleksey Makarov aleksey.maka...@caviumnetworks.com
---
 drivers/net/ethernet/cavium/thunder/nicvf_queues.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
index 7f0e108..8929029 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
@@ -62,8 +62,7 @@ static int nicvf_alloc_q_desc_mem(struct nicvf *nic, struct q_desc_mem *dmem,
 
 	/* Align memory address for 'align_bytes' */
 	dmem->phys_base = NICVF_ALIGNED_ADDR((u64)dmem->dma, align_bytes);
-	dmem->base = (void *)((u8 *)dmem->unalign_base +
-			      (dmem->phys_base - dmem->dma));
+	dmem->base = dmem->unalign_base + (dmem->phys_base - dmem->dma);
 	return 0;
 }
 
-- 
2.4.1
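
A small check of the claim in this commit message. Arithmetic on void * is a GNU C extension (the pointee is treated as having size 1), which the kernel build relies on; it is not part of ISO C. The sketch below uses hypothetical helper names and compiles the GNU form only when __GNUC__ is defined.

```c
#include <stddef.h>

/* Portable form: step through the buffer as bytes. */
static void *offset_portable(void *base, size_t off)
{
	return (unsigned char *)base + off;
}

#ifdef __GNUC__
/* GNU C form: arithmetic directly on void * advances by bytes as well. */
static void *offset_gnu(void *base, size_t off)
{
	return base + off;
}
#endif
```

Under a GNU-compatible compiler both helpers return the same address for the same inputs, which is why the cast can be dropped in kernel code; a strictly conforming build would still need the byte-pointer cast.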
[PATCH 06/10] net: thunderx: add static
This fixes sparse messages like this: drivers/net/ethernet/cavium/thunder/nicvf_main.c:1141:26: sparse: symbol 'nicvf_get_stats64' was not declared. Should it be static? Also remove unused declarations Reported-by: kbuild test robot fengguang...@intel.com Signed-off-by: Aleksey Makarov aleksey.maka...@caviumnetworks.com --- drivers/net/ethernet/cavium/thunder/nic.h | 2 -- drivers/net/ethernet/cavium/thunder/nicvf_main.c | 28 ++ drivers/net/ethernet/cavium/thunder/nicvf_queues.c | 2 +- drivers/net/ethernet/cavium/thunder/thunder_bgx.c | 6 ++--- 4 files changed, 16 insertions(+), 22 deletions(-) diff --git a/drivers/net/ethernet/cavium/thunder/nic.h b/drivers/net/ethernet/cavium/thunder/nic.h index 6479ce2..a3b43e5 100644 --- a/drivers/net/ethernet/cavium/thunder/nic.h +++ b/drivers/net/ethernet/cavium/thunder/nic.h @@ -413,10 +413,8 @@ int nicvf_set_real_num_queues(struct net_device *netdev, int nicvf_open(struct net_device *netdev); int nicvf_stop(struct net_device *netdev); int nicvf_send_msg_to_pf(struct nicvf *vf, union nic_mbx *mbx); -void nicvf_config_cpi(struct nicvf *nic); void nicvf_config_rss(struct nicvf *nic); void nicvf_set_rss_key(struct nicvf *nic); -void nicvf_free_skb(struct nicvf *nic, struct sk_buff *skb); void nicvf_set_ethtool_ops(struct net_device *netdev); void nicvf_update_stats(struct nicvf *nic); void nicvf_update_lmac_stats(struct nicvf *nic); diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c index 54bba86..02da802 100644 --- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c +++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c @@ -50,10 +50,6 @@ module_param(cpi_alg, int, S_IRUGO); MODULE_PARM_DESC(cpi_alg, PFC algorithm (0=none, 1=VLAN, 2=VLAN16, 3=IP Diffserv)); -static int nicvf_enable_msix(struct nicvf *nic); -static netdev_tx_t nicvf_xmit(struct sk_buff *skb, struct net_device *netdev); -static void nicvf_read_bgx_stats(struct nicvf *nic, struct bgx_stats_msg *bgx); - 
static inline void nicvf_set_rx_frame_cnt(struct nicvf *nic, struct sk_buff *skb) { @@ -174,6 +170,14 @@ static int nicvf_check_pf_ready(struct nicvf *nic) return 1; } +static void nicvf_read_bgx_stats(struct nicvf *nic, struct bgx_stats_msg *bgx) +{ + if (bgx-rx) + nic-bgx_stats.rx_stats[bgx-idx] = bgx-stats; + else + nic-bgx_stats.tx_stats[bgx-idx] = bgx-stats; +} + static void nicvf_handle_mbx_intr(struct nicvf *nic) { union nic_mbx mbx = {}; @@ -255,7 +259,7 @@ static int nicvf_hw_set_mac_addr(struct nicvf *nic, struct net_device *netdev) return nicvf_send_msg_to_pf(nic, mbx); } -void nicvf_config_cpi(struct nicvf *nic) +static void nicvf_config_cpi(struct nicvf *nic) { union nic_mbx mbx = {}; @@ -267,7 +271,7 @@ void nicvf_config_cpi(struct nicvf *nic) nicvf_send_msg_to_pf(nic, mbx); } -void nicvf_get_rss_size(struct nicvf *nic) +static void nicvf_get_rss_size(struct nicvf *nic) { union nic_mbx mbx = {}; @@ -575,7 +579,7 @@ static int nicvf_poll(struct napi_struct *napi, int budget) * * As of now only CQ errors are handled */ -void nicvf_handle_qs_err(unsigned long data) +static void nicvf_handle_qs_err(unsigned long data) { struct nicvf *nic = (struct nicvf *)data; struct queue_set *qs = nic-qs; @@ -1043,14 +1047,6 @@ static int nicvf_set_mac_address(struct net_device *netdev, void *p) return 0; } -static void nicvf_read_bgx_stats(struct nicvf *nic, struct bgx_stats_msg *bgx) -{ - if (bgx-rx) - nic-bgx_stats.rx_stats[bgx-idx] = bgx-stats; - else - nic-bgx_stats.tx_stats[bgx-idx] = bgx-stats; -} - void nicvf_update_lmac_stats(struct nicvf *nic) { int stat = 0; @@ -1141,7 +1137,7 @@ void nicvf_update_stats(struct nicvf *nic) nicvf_update_sq_stats(nic, qidx); } -struct rtnl_link_stats64 *nicvf_get_stats64(struct net_device *netdev, +static struct rtnl_link_stats64 *nicvf_get_stats64(struct net_device *netdev, struct rtnl_link_stats64 *stats) { struct nicvf *nic = netdev_priv(netdev); diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c 
b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c index 1962466..7f0e108 100644 --- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c +++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c @@ -228,7 +228,7 @@ static void nicvf_free_rbdr(struct nicvf *nic, struct rbdr *rbdr) /* Refill receive buffer descriptors with new buffers. */ -void nicvf_refill_rbdr(struct nicvf *nic, gfp_t gfp) +static void nicvf_refill_rbdr(struct nicvf *nic, gfp_t gfp) { struct queue_set *qs = nic-qs; int rbdr_idx = qs-rbdr_cnt; diff --git a/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
Re: [ovs-dev] [net-next RFC 00/14] Convert OVS tunnel vports to use regular net_devices
It seems patch 01 didn't make it to ovs dev mailing list, but it is available on netdev mailing list.

fbl
[PATCH 02/10] net: thunderx: fix constants
This fixes sparse messages like this:

drivers/net/ethernet/cavium/thunder/thunder_bgx.c:897:24: sparse: constant 0x3000 is so big it is long

Reported-by: kbuild test robot fengguang...@intel.com
Signed-off-by: Aleksey Makarov aleksey.maka...@caviumnetworks.com
---
 drivers/net/ethernet/cavium/thunder/nicvf_main.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index abd446e6..f81182c 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -326,11 +326,11 @@ static int nicvf_rss_init(struct nicvf *nic)
 	rss->enable = true;
 
 	/* Using the HW reset value for now */
-	rss->key[0] = 0xFEED0BADFEED0BAD;
-	rss->key[1] = 0xFEED0BADFEED0BAD;
-	rss->key[2] = 0xFEED0BADFEED0BAD;
-	rss->key[3] = 0xFEED0BADFEED0BAD;
-	rss->key[4] = 0xFEED0BADFEED0BAD;
+	rss->key[0] = 0xFEED0BADFEED0BADULL;
+	rss->key[1] = 0xFEED0BADFEED0BADULL;
+	rss->key[2] = 0xFEED0BADFEED0BADULL;
+	rss->key[3] = 0xFEED0BADFEED0BADULL;
+	rss->key[4] = 0xFEED0BADFEED0BADULL;
 
 	nicvf_set_rss_key(nic);
 
-- 
2.4.1
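
For reference, a standalone sketch of what the suffix changes. An unsuffixed hexadecimal constant that does not fit in int is silently given the first wider type that can hold it, which is what sparse complains about; the ULL suffix states the 64-bit unsigned type explicitly without changing the value. RSS_KEY_WORDS and rss_key_init() are hypothetical names for this illustration.

```c
#include <stdint.h>

#define RSS_KEY_WORDS 5	/* five key words, as in the hunk above */

/* Fill the key with the same 64-bit pattern, typed explicitly as ULL. */
static void rss_key_init(uint64_t key[RSS_KEY_WORDS])
{
	for (int i = 0; i < RSS_KEY_WORDS; i++)
		key[i] = 0xFEED0BADFEED0BADULL;
}
```

The stored bits are identical with or without the suffix; the suffix only removes the dependence on the implicit type selection rules, which differ between 32-bit and 64-bit targets.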
Re: [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls
Robert Shearman rshea...@brocade.com writes:

Allow creating an mpls device for the purposes of encapsulating IP packets with:

  ip link add type ipmpls

This device defines its per-nexthop encapsulation data as a stack of labels, in the same format as for RTA_NEWDST. It uses the encap data which will have been stored in the IP route to encapsulate the packet with that stack of labels, with the last label corresponding to a local label that defines how the packet will be sent out. The device sends packets over loopback to the local MPLS forwarding logic which performs all of the work.

Stats are implemented, although any error in the sending via the real interface will be handled by the main mpls forwarding code and so not accounted by the interface.

Eeek stats! Lots of unnecessary overhead. If stats were ok we could have simply reduced the cost of struct net_device to the point where it would not matter.

This is really a bad hack for not getting in and being able to set dst_output the way the xfrm infrastructure does.

What we really want here is xfrm-lite. By lite I mean the tunnel selection criteria is simple enough that it fits into the normal routing table instead of having to do weird flow based magic that is rarely needed.

I believe what we want is the xfrm stacking of dst entries.

Eric

This implementation is based on an alternative earlier implementation by Eric W. Biederman.
Signed-off-by: Robert Shearman rshea...@brocade.com --- include/uapi/linux/if_arp.h | 1 + net/mpls/Kconfig| 5 + net/mpls/Makefile | 1 + net/mpls/af_mpls.c | 2 + net/mpls/ipmpls.c | 284 5 files changed, 293 insertions(+) create mode 100644 net/mpls/ipmpls.c diff --git a/include/uapi/linux/if_arp.h b/include/uapi/linux/if_arp.h index 4d024d75d64b..17d669fd1781 100644 --- a/include/uapi/linux/if_arp.h +++ b/include/uapi/linux/if_arp.h @@ -88,6 +88,7 @@ #define ARPHRD_IEEE80211_RADIOTAP 803/* IEEE 802.11 + radiotap header */ #define ARPHRD_IEEE802154 804 #define ARPHRD_IEEE802154_MONITOR 805/* IEEE 802.15.4 network monitor */ +#define ARPHRD_MPLS 806 /* IP and IPv6 over MPLS tunnels */ #define ARPHRD_PHONET820 /* PhoNet media type */ #define ARPHRD_PHONET_PIPE 821 /* PhoNet pipe header */ diff --git a/net/mpls/Kconfig b/net/mpls/Kconfig index 17bde799c854..5264da94733a 100644 --- a/net/mpls/Kconfig +++ b/net/mpls/Kconfig @@ -27,4 +27,9 @@ config MPLS_ROUTING help Add support for forwarding of mpls packets. 
+config MPLS_IPTUNNEL + tristate MPLS: IP over MPLS tunnel support + help + A network device that encapsulates ip packets as mpls + endif # MPLS diff --git a/net/mpls/Makefile b/net/mpls/Makefile index 65bbe68c72e6..3a93c14b23c5 100644 --- a/net/mpls/Makefile +++ b/net/mpls/Makefile @@ -3,5 +3,6 @@ # obj-$(CONFIG_NET_MPLS_GSO) += mpls_gso.o obj-$(CONFIG_MPLS_ROUTING) += mpls_router.o +obj-$(CONFIG_MPLS_IPTUNNEL) += ipmpls.o mpls_router-y := af_mpls.o diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c index 7b3f732269e4..68bdfbdddfaf 100644 --- a/net/mpls/af_mpls.c +++ b/net/mpls/af_mpls.c @@ -615,6 +615,7 @@ int nla_put_labels(struct sk_buff *skb, int attrtype, return 0; } +EXPORT_SYMBOL(nla_put_labels); int nla_get_labels(const struct nlattr *nla, u32 max_labels, u32 *labels, u32 label[]) @@ -660,6 +661,7 @@ int nla_get_labels(const struct nlattr *nla, *labels = nla_labels; return 0; } +EXPORT_SYMBOL(nla_get_labels); static int rtm_to_route_config(struct sk_buff *skb, struct nlmsghdr *nlh, struct mpls_route_config *cfg) diff --git a/net/mpls/ipmpls.c b/net/mpls/ipmpls.c new file mode 100644 index ..cf6894ae0c61 --- /dev/null +++ b/net/mpls/ipmpls.c @@ -0,0 +1,284 @@ +#include linux/types.h +#include linux/netdevice.h +#include linux/if_vlan.h +#include linux/if_arp.h +#include linux/ip.h +#include linux/ipv6.h +#include linux/module.h +#include linux/mpls.h +#include internal.h + +static LIST_HEAD(ipmpls_dev_list); + +#define MAX_NEW_LABELS 2 + +struct ipmpls_dev_priv { + struct net_device *out_dev; + struct list_head list; + struct net_device *dev; +}; + +static netdev_tx_t ipmpls_dev_xmit(struct sk_buff *skb, struct net_device *dev) +{ + struct ipmpls_dev_priv *priv = netdev_priv(dev); + struct net_device *out_dev = priv-out_dev; + struct mpls_shim_hdr *hdr; + bool bottom_of_stack = true; + int len = skb-len; + const void *encap; + int num_labels; + unsigned ttl; + const u32 *labels; + int ret; + int i; + + num_labels = dst_get_encap(skb, encap) / 4; + if 
(!num_labels) + goto drop; + +
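For readers following the xmit path above: it derives num_labels by dividing the encap length by 4, so each label occupies one 32-bit label stack entry. A minimal userspace sketch of that encoding follows, assuming the usual layout (20-bit label, 3-bit traffic class, 1-bit bottom-of-stack flag, 8-bit TTL). The helper names mpls_encode_entry and mpls_build_stack are made up for illustration and are not taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>

/* Bit positions inside one 32-bit label stack entry (host order). */
#define SKETCH_LABEL_SHIFT 12
#define SKETCH_TC_SHIFT     9
#define SKETCH_BOS_SHIFT    8

/* Hypothetical helper: encode one entry and return it in network order. */
static uint32_t mpls_encode_entry(uint32_t label, uint8_t tc, bool bos,
				  uint8_t ttl)
{
	uint32_t v = ((label & 0xFFFFFu) << SKETCH_LABEL_SHIFT) |
		     ((uint32_t)(tc & 0x7u) << SKETCH_TC_SHIFT) |
		     ((bos ? 1u : 0u) << SKETCH_BOS_SHIFT) |
		     ttl;

	return htonl(v);
}

/*
 * Hypothetical helper: write n entries, outermost label first. Only the
 * last entry carries the bottom-of-stack flag. Returns the number of
 * bytes written, i.e. 4 per label.
 */
static size_t mpls_build_stack(uint32_t *out, const uint32_t *labels,
			       size_t n, uint8_t ttl)
{
	for (size_t i = 0; i < n; i++)
		out[i] = mpls_encode_entry(labels[i], 0, i == n - 1, ttl);

	return n * sizeof(uint32_t);
}
```

With two labels the result is 8 bytes, which matches the "/ 4" arithmetic in the driver code.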
[PATCH] drivers/net/ethernet/dec/tulip/uli526x.c: fix misleading indentation in uli526x_timer
This code in drivers/net/ethernet/dec/tulip/uli526x.c, function uli526x_timer:

  1086		} else
  1087		if ((tmp_cr12 & 0x3) && db->link_failed) {
  [...snip...]
  1109		}
  1110		else if(!(tmp_cr12 & 0x3) && db->link_failed) {
  [...snip...]
  1117		}
  1118			db->init=0;

is misleadingly indented: the db->init=0 is indented as if part of the else clause at line 1086, but it is independent of it (no braces before the if at line 1087).

This patch fixes the indentation to reflect the actual meaning of the code, though is it actually meant to be part of the else clause? (I'm a compiler developer, not a kernel person).

It also adds spaces around the assignment, to placate checkpatch.pl.

Seen via an experimental new gcc warning I'm working on for gcc 6, -Wmisleading-indentation, using gcc r223098 adding -Werror=misleading-indentation to KBUILD_CFLAGS in Makefile.

The experimental GCC emits this warning (as an error), rightly IMHO:

drivers/net/ethernet/dec/tulip/uli526x.c: In function ‘uli526x_timer’:
drivers/net/ethernet/dec/tulip/uli526x.c:1118:3: error: statement is indented as if it were guarded by... [-Werror=misleading-indentation]
   db->init=0;
   ^
drivers/net/ethernet/dec/tulip/uli526x.c:1086:4: note: ...this ‘else’ clause, but it is not
  } else
    ^

Hope this is helpful
Dave

Signed-off-by: David Malcolm dmalc...@redhat.com
---
 drivers/net/ethernet/dec/tulip/uli526x.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/dec/tulip/uli526x.c b/drivers/net/ethernet/dec/tulip/uli526x.c
index 2c30c0c..447d092 100644
--- a/drivers/net/ethernet/dec/tulip/uli526x.c
+++ b/drivers/net/ethernet/dec/tulip/uli526x.c
@@ -1115,7 +1115,7 @@ static void uli526x_timer(unsigned long data)
 			netif_carrier_off(dev);
 		}
 	}
-		db->init=0;
+	db->init = 0;
 
 	/* Timer active again */
 	db->timer.expires = ULI526X_TIMER_WUT;
-- 
1.8.5.3
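To make the two possible readings of that code concrete, here is a small standalone sketch. It is not the driver code: the function names and the reduced control flow are invented for illustration. The first function is what the compiler sees (the assignment follows the whole if/else chain), the second is what the original indentation suggested (the assignment belongs to the else arm).

```c
/*
 * Reading A: what the C grammar gives. Without braces, "else" governs
 * only the single if statement that follows it, so the assignment runs
 * on every path.
 */
static int init_after_chain(int first_branch, int init)
{
	if (first_branch) {
		/* first arm, details omitted */
	} else if (init) {
		/* nested arms, details omitted */
	}
	init = 0;

	return init;
}

/*
 * Reading B: what the old indentation implied. The assignment is part
 * of the else arm and is skipped when the first arm is taken.
 */
static int init_inside_else(int first_branch, int init)
{
	if (first_branch) {
		/* first arm, details omitted */
	} else {
		if (init) {
			/* nested arms, details omitted */
		}
		init = 0;
	}

	return init;
}
```

The two only differ when the first arm is taken, which is exactly the case the question in the patch description is about.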
[PATCH net-next V10 1/4] openvswitch: 802.1ad uapi changes.
openvswitch: Add support for 802.1AD

Change the description of the VLAN tpid field.

Signed-off-by: Thomas F Herbert thomasfherb...@gmail.com
---
 include/uapi/linux/openvswitch.h | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index bbd49a0..f2ccdef 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -559,13 +559,13 @@ struct ovs_action_push_mpls {
  * @vlan_tci: Tag control identifier (TCI) to push. The CFI bit must be set
  * (but it will not be set in the 802.1Q header that is pushed).
  *
- * The @vlan_tpid value is typically %ETH_P_8021Q. The only acceptable TPID
- * values are those that the kernel module also parses as 802.1Q headers, to
- * prevent %OVS_ACTION_ATTR_PUSH_VLAN followed by %OVS_ACTION_ATTR_POP_VLAN
- * from having surprising results.
+ * The @vlan_tpid value is typically %ETH_P_8021Q or %ETH_P_8021AD.
+ * The only acceptable TPID values are those that the kernel module also parses
+ * as 802.1Q or 802.1AD headers, to prevent %OVS_ACTION_ATTR_PUSH_VLAN followed
+ * by %OVS_ACTION_ATTR_POP_VLAN from having surprising results.
  */
 struct ovs_action_push_vlan {
-	__be16 vlan_tpid;	/* 802.1Q TPID. */
+	__be16 vlan_tpid;	/* 802.1Q or 802.1ad TPID. */
 	__be16 vlan_tci;	/* 802.1Q TCI (VLAN ID and priority). */
 };
 
@@ -605,9 +605,10 @@ struct ovs_action_hash {
  * is copied from the value to the packet header field, rest of the bits are
  * left unchanged. The non-masked value bits must be passed in as zeroes.
  * Masking is not supported for the %OVS_KEY_ATTR_TUNNEL attribute.
- * @OVS_ACTION_ATTR_PUSH_VLAN: Push a new outermost 802.1Q header onto the
- * packet.
- * @OVS_ACTION_ATTR_POP_VLAN: Pop the outermost 802.1Q header off the packet.
+ * @OVS_ACTION_ATTR_PUSH_VLAN: Push a new outermost 802.1Q or 802.1ad header
+ * onto the packet.
+ * @OVS_ACTION_ATTR_POP_VLAN: Pop the outermost 802.1Q or 802.1ad header
+ * from the packet.
  * @OVS_ACTION_ATTR_SAMPLE: Probabilitically executes actions, as specified in
  * the nested %OVS_SAMPLE_ATTR_* attributes.
  * @OVS_ACTION_ATTR_PUSH_MPLS: Push a new MPLS label stack entry onto the
-- 
2.1.0
[PATCH 01/10] net: thunderx: Cleanup duplicate NODE_ID macros, add nic_get_node_id()
From: Robert Richter rrich...@cavium.com There are duplicate NODE_ID macro definitions. Move all of them to nic.h for usage in nic and bgx driver and introduce nic_get_node_id() helper function. This patch also fixes 64bit mask which should have been ULL by reworking the node calculation. Signed-off-by: Robert Richter rrich...@cavium.com --- drivers/net/ethernet/cavium/thunder/nic.h | 10 ++ drivers/net/ethernet/cavium/thunder/nic_main.c| 4 +--- drivers/net/ethernet/cavium/thunder/thunder_bgx.c | 4 ++-- drivers/net/ethernet/cavium/thunder/thunder_bgx.h | 3 --- 4 files changed, 13 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/cavium/thunder/nic.h b/drivers/net/ethernet/cavium/thunder/nic.h index 9b0be52..4f426db 100644 --- a/drivers/net/ethernet/cavium/thunder/nic.h +++ b/drivers/net/ethernet/cavium/thunder/nic.h @@ -11,6 +11,7 @@ #include linux/netdevice.h #include linux/interrupt.h +#include linux/pci.h #include thunder_bgx.h /* PCI device IDs */ @@ -398,6 +399,15 @@ union nic_mbx { struct bgx_link_status link_status; }; +#define NIC_NODE_ID_MASK 0x03 +#define NIC_NODE_ID_SHIFT 44 + +static inline int nic_get_node_id(struct pci_dev *pdev) +{ + u64 addr = pci_resource_start(pdev, PCI_CFG_REG_BAR_NUM); + return ((addr NIC_NODE_ID_SHIFT) NIC_NODE_ID_MASK); +} + int nicvf_set_real_num_queues(struct net_device *netdev, int tx_queues, int rx_queues); int nicvf_open(struct net_device *netdev); diff --git a/drivers/net/ethernet/cavium/thunder/nic_main.c b/drivers/net/ethernet/cavium/thunder/nic_main.c index 0f1f58b..3ca7ad8 100644 --- a/drivers/net/ethernet/cavium/thunder/nic_main.c +++ b/drivers/net/ethernet/cavium/thunder/nic_main.c @@ -23,8 +23,6 @@ struct nicpf { struct pci_dev *pdev; u8 rev_id; -#define NIC_NODE_ID_MASK 0x3000 -#define NIC_NODE_ID(x) ((x NODE_ID_MASK) 44) u8 node; unsigned intflags; u8 num_vf_en; /* No of VF enabled */ @@ -851,7 +849,7 @@ static int nic_probe(struct pci_dev *pdev, const struct pci_device_id *ent) 
pci_read_config_byte(pdev, PCI_REVISION_ID, nic-rev_id); - nic-node = NIC_NODE_ID(pci_resource_start(pdev, PCI_CFG_REG_BAR_NUM)); + nic-node = nic_get_node_id(pdev); nic_set_lmac_vf_mapping(nic); diff --git a/drivers/net/ethernet/cavium/thunder/thunder_bgx.c b/drivers/net/ethernet/cavium/thunder/thunder_bgx.c index 020e11c..cde604a 100644 --- a/drivers/net/ethernet/cavium/thunder/thunder_bgx.c +++ b/drivers/net/ethernet/cavium/thunder/thunder_bgx.c @@ -894,8 +894,8 @@ static int bgx_probe(struct pci_dev *pdev, const struct pci_device_id *ent) goto err_release_regions; } bgx-bgx_id = (pci_resource_start(pdev, PCI_CFG_REG_BAR_NUM) 24) 1; - bgx-bgx_id += NODE_ID(pci_resource_start(pdev, PCI_CFG_REG_BAR_NUM)) - * MAX_BGX_PER_CN88XX; + bgx-bgx_id += nic_get_node_id(pdev) * MAX_BGX_PER_CN88XX; + bgx_vnic[bgx-bgx_id] = bgx; bgx_get_qlm_mode(bgx); diff --git a/drivers/net/ethernet/cavium/thunder/thunder_bgx.h b/drivers/net/ethernet/cavium/thunder/thunder_bgx.h index 9d91ce4..f9e2170 100644 --- a/drivers/net/ethernet/cavium/thunder/thunder_bgx.h +++ b/drivers/net/ethernet/cavium/thunder/thunder_bgx.h @@ -20,9 +20,6 @@ #defineMAX_LMAC(MAX_BGX_PER_CN88XX * MAX_LMAC_PER_BGX) -#defineNODE_ID_MASK0x3000 -#defineNODE_ID(x) ((x NODE_ID_MASK) 44) - /* Registers */ #define BGX_CMRX_CFG 0x00 #define CMR_PKT_TX_EN BIT_ULL(13) -- 2.4.1 -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
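The node calculation in this patch shifts first and masks afterwards, so the mask stays a small constant and needs no ULL suffix. A userspace sketch of the same arithmetic is below. The shift (44) and mask (0x03) come from the patch. The function name nic_node_id_from_addr and the sample addresses are made up, since the real helper reads the BAR address from a struct pci_dev.

```c
#include <stdint.h>

#define NIC_NODE_ID_MASK  0x03
#define NIC_NODE_ID_SHIFT 44

/*
 * Hypothetical userspace stand-in for nic_get_node_id(): extract the
 * two node-id bits (bits 44 and 45) from a 64-bit resource address.
 * Shifting before masking keeps the mask within int range.
 */
static inline int nic_node_id_from_addr(uint64_t addr)
{
	return (int)((addr >> NIC_NODE_ID_SHIFT) & NIC_NODE_ID_MASK);
}
```

For example, an address with only bit 44 set yields node 1, and any bits above bit 45 are ignored by the mask.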
Re: [net-next RFC 00/14] Convert OVS tunnel vports to use regular net_devices
Thomas Graf tg...@suug.ch writes: This is the first series in a greater effort to bring the scalability and programmability advantages of OVS to the rest of the network stack and to get rid of as much OVS specific code as possible. This first series focuses on getting rid of OVS tunnel vports and use regular tunnel net_devices instead. As part of this effort, the routing subsystem is extended with support for flow based tunneling. In this new tunneling mode, the route is able to match on tunnel information as well as set tunnel encapsulation parameters per route. This allows to perform L3 forwarding for a large number of tunnel endpoints and virtual networks using a single tunnel net_device. This is a different direction than I was imagining things evolving when I was looking at mpls. However there is a lot of overlap. I get the imperession there are two directions you are looking at: - Allowing more configurable keeps in route based lookup. - Reducing the costs of the tunnels. We already have a similar subsystem xfrm. If we are going to use more flexible keys when lookup up routes, if it is reasonably possible (while maintaining performance) I suggest we use the xfrm data structure or more likely rework xfrm on top of the new data structures. That way there is less code to maintain overall. Certainly any work that plays with tunnels a new way to do tunnels in the kernel needs to answer the question. Why not xfrm. As xfrm already exists to do exactly that job. I think a clumsy api and excess flexibility start to be an answer for mpls ingress. Just using the existing routing table can result in cleaner faster code with a better userspace API. But I still think the mpls case where we attach labels needs to answer that case. If you are using flow based flexibility from openvswitch I think why not use xfrm becomes a more challenge question to answer. 
Eric
How do I avoid recvmsg races with IP_RECVERR?
As far as I can tell, enabling IP_RECVERR causes the presence of a queued error to cause recvmsg, etc. to return an error (once). It's worse, though: a new error can be queued asynchronously at any time, thus setting sk_err to a nonzero value.

How do I sensibly distinguish recvmsg failures due to genuine errors receiving messages from recvmsg failures because there's a queued error?

The only way I can see to get reliable error handling is to literally call recvmsg in a loop:

	while (true /* or while POLLIN is set */) {
		int ret = recvmsg(..., MSG_ERRQUEUE not set);
		if (ret < 0 /* what goes here? */) {
			whoops! this might be a harmless asynchronous error!
			take no action!
		}

		/* if POLLERR (or maybe unconditionally) */
		recvmsg(..., MSG_ERRQUEUE);
	}

The problem is that, if I'm screwing something up (thus causing EINVAL or something similar), this will just spin forever.

Am I missing something here? Would it make sense to add MSG_IGNORE_ERROR to suppress the sock_error check or IP_RECVERR=2 to stop setting sk_err?

Thanks,
Andy
Re: [PATCH 00/10] net: thunderx: fix problems reported by static check tools
From: Aleksey Makarov aleksey.maka...@caviumnetworks.com
Date: Tue, 2 Jun 2015 11:00:17 -0700

> These are fixes for the problems that were reported by static check tools.

Series applied, thanks.
[PATCH net-next V10 0/4] openvswitch: Add support for 802.1AD
Add support for 802.1AD to the openvswitch kernel module.

V10: Implement reviewer comments:
  Consolidate vlan parsing functions.
  Splits netlink parsing and flow conversion into a separate patch.
  Uses double encap attribute encapsulation for 802.1ad.

  Netlink attributes now look like this:

    eth_type(0x88a8),vlan(vid=100),encap(eth_type(0x8100), vlan(vid=200), encap(eth_type(0x0800), ...))

  The double encap attributes in this version of the patch are incompatible with old versions of the user level 802.1ad patch. A new user level patch is being submitted simultaneously to the openvswitch dev mailing list.

V9: Includes changes suggested by reviewers
V8: Includes changes suggested by reviewers
V7: Includes changes suggested by reviewers
V6: Rebased to net-next
V5: Use encapsulated attributes

Although the OpenFlow specification specifies support for 802.1AD (qinq) as well as push and pop vlan headers, so far Open vSwitch has only supported a single tag header.

This patch accompanies version 10 of the user level openvswitch patch submitted to the openvswitch dev list.

For discussion, history and previous versions of the kernel module patch and the user code patch see the OVS dev mailing list, openvswitch.org/pipermail/dev/..

Thomas F Herbert (4):
  openvswitch: 802.1ad uapi changes.
  Check for vlan ethernet types for 8021.q or 802.1ad
  8021AD: Flow handling actions and parsing
  8021AD: Flow key parsing and netlink attributes.

 include/linux/if_vlan.h          |   9 ++
 include/uapi/linux/openvswitch.h |  17 ++--
 net/openvswitch/flow.c           |  82 ++---
 net/openvswitch/flow.h           |   3 +
 net/openvswitch/flow_netlink.c   | 186 +--
 5 files changed, 248 insertions(+), 49 deletions(-)

-- 
2.1.0
[PATCH net-next V10 2/4] General check for vlan ethernet types
This patch adds a function to check for vlan ethernet types. There is a use case in openvswitch and it should be useful elsewhere.

Signed-off-by: Thomas F Herbert thomasfherb...@gmail.com
---
 include/linux/if_vlan.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
index 920e445..3713454 100644
--- a/include/linux/if_vlan.h
+++ b/include/linux/if_vlan.h
@@ -627,5 +627,14 @@ static inline netdev_features_t vlan_features_check(const struct sk_buff *skb,
 	return features;
 }
 
+/**
+ * Check for legal valid vlan ether type.
+ */
+static inline bool eth_type_vlan(__be16 ethertype)
+{
+	if (ethertype == htons(ETH_P_8021Q) || ethertype == htons(ETH_P_8021AD))
+		return true;
+	return false;
+}
 #endif /* !(_LINUX_IF_VLAN_H_) */
-- 
2.1.0
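The same check can be tried outside the kernel. The sketch below is a userspace approximation, assuming the standard TPID values 0x8100 (802.1Q) and 0x88A8 (802.1ad). The constant names are local stand-ins for ETH_P_8021Q and ETH_P_8021AD, and uint16_t replaces __be16.

```c
#include <stdbool.h>
#include <stdint.h>
#include <arpa/inet.h>

/* Local stand-ins for ETH_P_8021Q and ETH_P_8021AD (host order). */
#define SKETCH_TPID_8021Q  0x8100
#define SKETCH_TPID_8021AD 0x88A8

/*
 * Userspace approximation of eth_type_vlan(): the argument is the
 * ethertype exactly as it appears on the wire, i.e. network byte order.
 */
static inline bool eth_type_vlan(uint16_t ethertype_be)
{
	return ethertype_be == htons(SKETCH_TPID_8021Q) ||
	       ethertype_be == htons(SKETCH_TPID_8021AD);
}
```

Callers pass the field straight from the frame without converting it first, which is why the comparison is done against htons() of the constants.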
[PATCH net-next V10 4/4] 8021AD: Flow key parsing and netlink attributes.
Add support for 802.1ad to netlink parsing and flow conversation. Uses double nested encap attributes to represent double tagged vlan. Signed-off-by: Thomas F Herbert thomasfherb...@gmail.com --- net/openvswitch/flow_netlink.c | 186 ++--- 1 file changed, 157 insertions(+), 29 deletions(-) diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c index c691b1a..8fd4f63 100644 --- a/net/openvswitch/flow_netlink.c +++ b/net/openvswitch/flow_netlink.c @@ -771,6 +771,28 @@ static int metadata_from_nlattrs(struct sw_flow_match *match, u64 *attrs, return 0; } +static int cust_vlan_from_nlattrs(struct sw_flow_match *match, u64 attrs, + const struct nlattr **a, bool is_mask, + bool log) +{ + /* This should be nested inner or customer tci */ + if (attrs (1 OVS_KEY_ATTR_VLAN)) { + __be16 ctci; + + ctci = nla_get_be16(a[OVS_KEY_ATTR_VLAN]); + if (!(ctci htons(VLAN_TAG_PRESENT))) { + if (is_mask) + OVS_NLERR(log, VLAN CTCI mask does not have exact match for VLAN_TAG_PRESENT bit.); + else + OVS_NLERR(log, VLAN CTCI does not have VLAN_TAG_PRESENT bit set.); + + return -EINVAL; + } + SW_FLOW_KEY_PUT(match, eth.ctci, ctci, is_mask); + } + return 0; +} + static int ovs_key_from_nlattrs(struct sw_flow_match *match, u64 attrs, const struct nlattr **a, bool is_mask, bool log) @@ -1024,6 +1046,105 @@ static void mask_set_nlattr(struct nlattr *attr, u8 val) nlattr_set(attr, val, ovs_key_lens); } +static int parse_vlan_from_nlattrs(const struct nlattr *nla, + struct sw_flow_match *match, + u64 *key_attrs, bool *ie_valid, + const struct nlattr **a, bool is_mask, + bool log) +{ + int err; + __be16 tci; + const struct nlattr *encap; + + if (!is_mask) { + u64 v_attrs = 0; + + tci = nla_get_be16(a[OVS_KEY_ATTR_VLAN]); + + if (tci htons(VLAN_TAG_PRESENT)) { + if (unlikely((nla_get_be16(a[OVS_KEY_ATTR_ETHERTYPE]) == + htons(ETH_P_8021AD { + err = parse_flow_nlattrs(nla, a, v_attrs, log); + if (err) + return err; + if (!v_attrs) + return -EINVAL; + + if (!((v_attrs + (1ULL 
OVS_KEY_ATTR_VLAN)) + (v_attrs + (1ULL OVS_KEY_ATTR_ENCAP { + OVS_NLERR(log, Invalid Vlan frame.); + return -EINVAL; + } + v_attrs = ~(1 OVS_KEY_ATTR_ETHERTYPE); + encap = a[OVS_KEY_ATTR_ENCAP]; + v_attrs = ~(1 OVS_KEY_ATTR_ENCAP); + *ie_valid = true; + + err = cust_vlan_from_nlattrs(match, v_attrs, +encap, is_mask, +log); + if (err) + return err; + /* Insure that tci key attribute isn't +* overwritten by encapsulated customer tci. +*/ + v_attrs = ~(1 OVS_KEY_ATTR_VLAN); + *key_attrs |= v_attrs; + } else { + *key_attrs = ~(1 OVS_KEY_ATTR_VLAN); + err = parse_flow_nlattrs(nla, a, key_attrs, +log); + if (err) + return err; + } + } else if (!tci) { + /* Corner case for truncated 802.1Q header. */ + if (nla_len(nla)) { + OVS_NLERR(log, Truncated 802.1Q header has non-zero encap attribute.); + return -EINVAL; + } + } else { + OVS_NLERR(log, Encap attr is set for non-VLAN frame); + return -EINVAL; + } + + } else { + u64 mask_v_attrs = 0;
[PATCH net-next V10 3/4] 802.1AD: Flow handling, actions and vlan parsing
Add support for 802.1ad including the ability to push and pop double tagged vlans. Signed-off-by: Thomas F Herbert thomasfherb...@gmail.com --- net/openvswitch/flow.c | 82 ++ net/openvswitch/flow.h | 3 ++ 2 files changed, 73 insertions(+), 12 deletions(-) diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c index 2dacc7b..9c73a2e 100644 --- a/net/openvswitch/flow.c +++ b/net/openvswitch/flow.c @@ -298,21 +298,78 @@ static bool icmp6hdr_ok(struct sk_buff *skb) static int parse_vlan(struct sk_buff *skb, struct sw_flow_key *key) { struct qtag_prefix { - __be16 eth_type; /* ETH_P_8021Q */ + __be16 eth_type; /* ETH_P_8021Q or ETH_P_8021AD */ __be16 tci; }; - struct qtag_prefix *qp; + struct qtag_prefix *qp = (struct qtag_prefix *)skb-data; - if (unlikely(skb-len sizeof(struct qtag_prefix) + sizeof(__be16))) + struct qinqtag_prefix { + __be16 eth_type; /* ETH_P_8021Q or ETH_P_8021AD */ + __be16 tci; + __be16 inner_tpid; /* ETH_P_8021Q */ + __be16 ctci; + }; + + if (likely(skb_vlan_tag_present(skb))) { + key-eth.tci = htons(skb-vlan_tci); + + /* Case where upstream +* processing has already stripped the outer vlan tag. 
+*/ + if (unlikely(skb-vlan_proto == htons(ETH_P_8021AD))) { + if (unlikely(skb-len sizeof(struct qtag_prefix) + + sizeof(__be16))) { + key-eth.tci = 0; + return 0; + } + + if (unlikely(!pskb_may_pull(skb, + sizeof(struct qtag_prefix) + + sizeof(__be16 { + return -ENOMEM; + } + + if (likely(qp-eth_type == htons(ETH_P_8021Q))) { + key-eth.ctci = qp-tci | + htons(VLAN_TAG_PRESENT); + __skb_pull(skb, sizeof(struct qtag_prefix)); + } + } return 0; + } - if (unlikely(!pskb_may_pull(skb, sizeof(struct qtag_prefix) + -sizeof(__be16 - return -ENOMEM; - qp = (struct qtag_prefix *) skb-data; - key-eth.tci = qp-tci | htons(VLAN_TAG_PRESENT); - __skb_pull(skb, sizeof(struct qtag_prefix)); + if (qp-eth_type == htons(ETH_P_8021AD)) { + struct qinqtag_prefix *qinqp = + (struct qinqtag_prefix *)skb-data; + + if (unlikely(skb-len sizeof(struct qinqtag_prefix) + + sizeof(__be16))) + return 0; + + if (unlikely(!pskb_may_pull(skb, sizeof(struct qinqtag_prefix) + + sizeof(__be16 { + return -ENOMEM; + } + key-eth.tci = qinqp-tci | htons(VLAN_TAG_PRESENT); + key-eth.ctci = qinqp-ctci | htons(VLAN_TAG_PRESENT); + + __skb_pull(skb, sizeof(struct qinqtag_prefix)); + + return 0; + } + if (qp-eth_type == htons(ETH_P_8021Q)) { + if (unlikely(skb-len sizeof(struct qtag_prefix) + + sizeof(__be16))) + return -ENOMEM; + + if (unlikely(!pskb_may_pull(skb, sizeof(struct qtag_prefix) + + sizeof(__be16 + return 0; + key-eth.tci = qp-tci | htons(VLAN_TAG_PRESENT); + + __skb_pull(skb, sizeof(struct qtag_prefix)); + } return 0; } @@ -474,9 +531,10 @@ static int key_extract(struct sk_buff *skb, struct sw_flow_key *key) */ key-eth.tci = 0; - if (skb_vlan_tag_present(skb)) - key-eth.tci = htons(skb-vlan_tci); - else if (eth-h_proto == htons(ETH_P_8021Q)) + key-eth.ctci = 0; + if ((skb_vlan_tag_present(skb)) || + (eth-h_proto == htons(ETH_P_8021Q)) || + (eth-h_proto == htons(ETH_P_8021AD))) if (unlikely(parse_vlan(skb, key))) return -ENOMEM; diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h index 
a076e44..fa83c61 100644 --- a/net/openvswitch/flow.h +++ b/net/openvswitch/flow.h @@ -134,6 +134,9 @@ struct sw_flow_key { u8 src[ETH_ALEN]; /* Ethernet source address. */ u8 dst[ETH_ALEN]; /* Ethernet destination address. */ __be16 tci; /* 0 if no VLAN, VLAN_TAG_PRESENT set otherwise. */ + __be16
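As a rough illustration of what the double-tag parsing has to recognise, here is a userspace sketch that walks a byte buffer starting at the first TPID (right after the two MAC addresses). It is not the kernel code: struct vlan_tags, parse_vlan_tags and the return convention are invented, and the TCI values are returned in host order without the VLAN_TAG_PRESENT bit. Like the patch, it requires the ethertype following each tag to be present, and it only looks for an inner 802.1Q tag when the outer tag is 802.1ad.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_TPID_8021Q  0x8100
#define SKETCH_TPID_8021AD 0x88A8

/* Hypothetical result type: host-order TCIs plus the number of tags. */
struct vlan_tags {
	uint16_t tci;	/* outer (service) tag, 0 if none */
	uint16_t ctci;	/* inner (customer) tag, 0 if none */
	int ntags;
};

/* Read a big-endian 16-bit value from the buffer. */
static uint16_t rd16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Returns the number of tags found (0, 1 or 2), or -1 if the buffer is
 * too short to hold a tag plus the ethertype that must follow it.
 */
static int parse_vlan_tags(const uint8_t *p, size_t len,
			   struct vlan_tags *out)
{
	uint16_t tpid;

	memset(out, 0, sizeof(*out));
	if (len < 2)
		return -1;

	tpid = rd16(p);
	if (tpid != SKETCH_TPID_8021Q && tpid != SKETCH_TPID_8021AD)
		return 0;		/* untagged frame */

	if (len < 4 + 2)		/* one tag plus next ethertype */
		return -1;
	out->tci = rd16(p + 2);
	out->ntags = 1;

	if (tpid == SKETCH_TPID_8021AD && rd16(p + 4) == SKETCH_TPID_8021Q) {
		if (len < 8 + 2)	/* two tags plus next ethertype */
			return -1;
		out->ctci = rd16(p + 6);
		out->ntags = 2;
	}

	return out->ntags;
}
```

Feeding it the bytes 88 a8 00 64 81 00 00 c8 08 00 gives an outer VID of 100 and an inner VID of 200, the same values used in the netlink attribute example in the cover letter of this series.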
Re: [RFC net-next 1/3] net: infra for per-nexthop encap data
Robert Shearman rshea...@brocade.com writes: Having to add a new interface to apply encap onto a packet is a mechanism that works well today, allowing the setup of the encap to be done separately from the routes out of them, meaning that routing protocols and other user-space apps don't need to do anything special to add routes out of a new type of interface. However, the overhead of creating an interface is high, especially in terms of memory. Therefore, the traditional method won't work very well for large numbers of routes applying encap where there is a low degree of sharing of the encap. The solution is to introduce a way of defining encap on a per-nexthop basis (i.e. per-route if only one nexthop) through the addition of a new netlink attribute, RTA_ENCAP. The semantics of this attribute is that the data is interpreted according to the output interface type (RTA_OIF) and is opaque to the normal forwarding path. The output interface doesn't have to be defined per-nexthop, but instead represents the way of encapsulating the packet. There could be as few as one per namespace, but more could be created, particularly if they are used to define parameters which are shared by a large number of routes. However, the split of what goes in the encap data and what might be specified via interface attributes is entirely up to the encap-type implementation. New rtnetlink operations are defined to assist with the management of this data: - parse_encap for parsing the attribute given through rtnl and either sizing the in-memory version (if encap ptr is NULL) or filling in the in-memory version. RTA_ENCAP work for IPv4. This operations allows the interface to reject invalid encap specified by user-space and the sizing allows the kernel to have a different in memory implementation to the netlink API (which might be optimised for extensibility rather than speed of packet forwarding). 
- fill_encap for taking the in-memory version of the encap and filling in an RTA_ENCAP attribute in a netlink message. - match_encap for comparing an in-memory version of encap with an RTA_ENCAP version, returning 0 if matching or 1 if different. A new dst operation is also defined to allow encap-type interfaces to retrieve the encap data from their xmit functions and use it for encapsulating the packet and for further forwarding. This bit of infrastructure should be more like rtnl_register. Where we register an encap type and the operations to go with it. Just like rtnl_register we can have small array with the operations for each supported encapsulation. Eric Suggested-by: Eric W. Biederman ebied...@xmission.com Signed-off-by: Robert Shearman rshea...@brocade.com --- include/linux/rtnetlink.h | 7 +++ include/net/dst.h | 11 +++ include/net/dst_ops.h | 2 ++ include/net/rtnetlink.h| 11 +++ include/uapi/linux/rtnetlink.h | 1 + net/core/rtnetlink.c | 36 6 files changed, 68 insertions(+) diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h index a2324fb45cf4..470d822ddd61 100644 --- a/include/linux/rtnetlink.h +++ b/include/linux/rtnetlink.h @@ -22,6 +22,13 @@ struct sk_buff *rtmsg_ifinfo_build_skb(int type, struct net_device *dev, void rtmsg_ifinfo_send(struct sk_buff *skb, struct net_device *dev, gfp_t flags); +int rtnl_parse_encap(const struct net_device *dev, const struct nlattr *nla, + void *encap); +int rtnl_fill_encap(const struct net_device *dev, struct sk_buff *skb, + int encap_len, const void *encap); +int rtnl_match_encap(const struct net_device *dev, const struct nlattr *nla, + int encap_len, const void *encap); + /* RTNL is used as a global lock for all changes to network configuration */ extern void rtnl_lock(void); diff --git a/include/net/dst.h b/include/net/dst.h index 2bc73f8a00a9..df0e6ec18eca 100644 --- a/include/net/dst.h +++ b/include/net/dst.h @@ -506,4 +506,15 @@ static inline struct xfrm_state *dst_xfrm(const struct dst_entry 
*dst) } #endif +/* Get encap data for destination */ +static inline int dst_get_encap(struct sk_buff *skb, const void **encap) +{ + const struct dst_entry *dst = skb_dst(skb); + + if (!dst || !dst-ops-get_encap) + return 0; + + return dst-ops-get_encap(dst, encap); +} + #endif /* _NET_DST_H */ diff --git a/include/net/dst_ops.h b/include/net/dst_ops.h index d64253914a6a..97f48cf8ef7d 100644 --- a/include/net/dst_ops.h +++ b/include/net/dst_ops.h @@ -32,6 +32,8 @@ struct dst_ops { struct neighbour * (*neigh_lookup)(const struct dst_entry *dst, struct sk_buff *skb, const void *daddr); + int (*get_encap)(const struct
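The rtnl_register-style alternative suggested in the reply above (register an encap type together with its operations, and keep a small array of ops per supported encapsulation) could look roughly like the userspace sketch below. Everything in it is hypothetical: the enum, the struct encap_ops layout and the function names are made up to illustrate the shape of such a registry. The three callbacks only echo the parse/fill/match operations described in the patch.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical encap type numbering; slot 0 is reserved for "none". */
enum {
	ENCAP_TYPE_NONE,
	ENCAP_TYPE_MPLS,
	ENCAP_TYPE_MAX
};

/* Hypothetical per-type operations, mirroring parse/fill/match. */
struct encap_ops {
	int (*parse_encap)(const void *attr, size_t attr_len, void *encap);
	int (*fill_encap)(void *buf, size_t buf_len,
			  const void *encap, size_t encap_len);
	int (*match_encap)(const void *attr, size_t attr_len,
			   const void *encap, size_t encap_len);
};

/* Small fixed array, one slot per supported encapsulation. */
static const struct encap_ops *encap_ops_table[ENCAP_TYPE_MAX];

/* Register ops for a type; fails if the type is invalid or taken. */
static int encap_ops_register(int type, const struct encap_ops *ops)
{
	if (type <= ENCAP_TYPE_NONE || type >= ENCAP_TYPE_MAX || !ops)
		return -EINVAL;
	if (encap_ops_table[type])
		return -EEXIST;
	encap_ops_table[type] = ops;
	return 0;
}

/* Look up the ops for a type; NULL if nothing is registered. */
static const struct encap_ops *encap_ops_lookup(int type)
{
	if (type <= ENCAP_TYPE_NONE || type >= ENCAP_TYPE_MAX)
		return NULL;
	return encap_ops_table[type];
}
```

With a table like this the route code would dispatch on an encap type carried with the route rather than on the output device. A real kernel version would also need locking or RCU around the table, which is left out here.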
Re: [RFC net-next 0/3] IP imposition of per-nh MPLS encap
On 6/2/15, 11:30 AM, Eric W. Biederman wrote: roopa ro...@cumulusnetworks.com writes: On 6/1/15, 9:46 AM, Robert Shearman wrote: In order to be able to function as a Label Edge Router in an MPLS network, it is necessary to be able to take IP packets and impose an MPLS encap and forward them out. The traditional approach of setting up an interface for each tunnel endpoint doesn't scale for the common MPLS use-cases where each IP route tends to be assigned a different label as encap. The solution suggested here for further discussion is to provide the facility to define encap data on a per-nexthop basis using a new netlink attribue, RTA_ENCAP, which would be opaque to the IPv4/IPv6 forwarding code, but interpreted by the virtual interface assigned to the nexthop. A new ipmpls interface type is defined to show the use of this facility to allow IP packets to be imposed with an MPLS encap. However, the facility is designed to be general enough to be used by any encapsulation/tunneling mechanism that has similar requirements of high-scale, high-variation-of-encap. RFC because: - IPv6 side not implemented - struct rtable shouldn't be bloated by pointer+uint - Hasn't been thoroughly tested yet Robert Shearman (3): net: infra for per-nexthop encap data ipv4: storing and retrieval of per-nexthop encap mpls: new ipmpls device for encapsulating IP packets as mpls Glad to see these patches!. I have a similar series i have been working on...but no netdevice. A set of ops similar to iptun_encaps and I store encap data in fib_nh and in ip_route_output_slow i point the dst.output to the output func provided by one of the encap ops. I see the advantages of using a netdevice...and i see this align with patches from thomas. roopa I think I would prefer your patches. I thinking using a netdevice the way Robert is proposing is quite possibly a mess, from a scalability stand point. Do you mean ip_route_input_slow? There is no ip_route_output_slow. yes, correct, sorry. 
I mean ip_route_input_slow. They need work but I will try to get them out today to add more context to the discussion.

thanks,
Roopa
Re: [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls
On 6/2/15, 9:33 AM, Robert Shearman wrote: On 02/06/15 17:15, roopa wrote: On 6/1/15, 9:46 AM, Robert Shearman wrote: Allow creating an mpls device for the purposes of encapsulating IP packets with: ip link add type ipmpls This device defines its per-nexthop encapsulation data as a stack of labels, in the same format as for RTA_NEWST. It uses the encap data which will have been stored in the IP route to encapsulate the packet with that stack of labels, with the last label corresponding to a local label that defines how the packet will be sent out. The device sends packets over loopback to the local MPLS forwarding logic which performs all of the work. Maybe a silly question, but when you loop the packet back, what does the local MPLS forwarding logic lookup with ? It probably assumes there is a mpls route with that label and nexthop. Will this need any internal labels (thinking same label stack different tunnel device etc) ? Yes, it requires that local/internal labels have been allocated and label routes installed in the label table for them. This is our only concern. It is entirely possible to put the outgoing interface into the encap data to avoid having to allocate extra labels, but I did it this way in order to support PIC Core for MPLS-VPN routes. hmm..., is a netdevice must in this case.., can you please elaborate on this ?. Note: I have two extra patches which avoid using the loopback device (which causes the TTL to end up being one less than it should on output), but I haven't posted them here because they were dependent on other mpls changes in my tree. ok, thanks. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 00/25] Convert the posix_clock_operations and k_clock structure to ready for 2038
On Mon, 1 Jun 2015, Baolin Wang wrote:

> This patch series changes the 32-bit time types (timespec/itimerspec) to
> the 64-bit types (timespec64/itimerspec64), since 32-bit time types will
> break in the year 2038.

That's only true for 32bit systems.

All in all the patch series looks rather reasonable now, except for the subject lines and the changelogs.

The only technical objection I have is the macro conversion magic in patch #6. This can be done in a less cryptic and more efficient way.

See the comments to the various patches and please apply them to all of the series.

Thanks,

	tglx
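As a quick check of where a signed 32-bit seconds counter runs out, here is a small sketch. The function names are made up. It assumes the host's time_t can represent INT32_MAX, which holds for both 32-bit and 64-bit time_t.

```c
#include <stdint.h>
#include <time.h>

/*
 * Hypothetical helper: break the largest value a signed 32-bit seconds
 * counter can hold into UTC calendar fields. Returns 0 on success.
 */
static int last_32bit_second(struct tm *out)
{
	time_t t = (time_t)INT32_MAX;
	struct tm *r = gmtime(&t);

	if (!r)
		return -1;
	*out = *r;
	return 0;
}

/* Does a 64-bit seconds value still fit a signed 32-bit field? */
static int fits_in_32bit_time(int64_t seconds)
{
	return seconds >= INT32_MIN && seconds <= INT32_MAX;
}
```

The first helper reports 2038-01-19 03:14:07 UTC. One second later the value no longer fits in 32 bits, which is the case the timespec64/itimerspec64 conversion is meant to cover.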
Re: [RFC net-next 0/3] IP imposition of per-nh MPLS encap
On 02/06/15 19:11, Eric W. Biederman wrote:
> Robert Shearman rshea...@brocade.com writes:
>
>> In order to be able to function as a Label Edge Router in an MPLS
>> network, it is necessary to be able to take IP packets and impose an
>> MPLS encap and forward them out. The traditional approach of setting
>> up an interface for each tunnel endpoint doesn't scale for the
>> common MPLS use-cases where each IP route tends to be assigned a
>> different label as encap.
>>
>> The solution suggested here for further discussion is to provide the
>> facility to define encap data on a per-nexthop basis using a new
>> netlink attribute, RTA_ENCAP, which would be opaque to the IPv4/IPv6
>> forwarding code, but interpreted by the virtual interface assigned to
>> the nexthop.
>>
>> A new ipmpls interface type is defined to show the use of this
>> facility to allow IP packets to be imposed with an MPLS
>> encap. However, the facility is designed to be general enough to be
>> used by any encapsulation/tunneling mechanism that has similar
>> requirements of high-scale, high-variation-of-encap.
>
> I am still digging into the details but adding a new network device to
> make this possible is very undesirable.
>
> It is a pain point. Those network devices get to be a major source of
> memory consumption when there are 4K network namespaces in existence.
>
> It is conceptually wrong. The network device will never be used as an
> ordinary network device. All the network device gives you is the
> ability to avoid creating an enumeration of different kinds of
> encapsulation.

This isn't true. The network device also gives some of the things you take for granted. Things like fragmentation through specifying the mtu on the shared tunnel device, being able to specify rules using the shared tunnel output device, IP stats, and the ability to specify a different destination namespace.

Thanks,
Rob
Re: [RFC net-next 0/3] IP imposition of per-nh MPLS encap
Robert Shearman rshea...@brocade.com writes:

On 02/06/15 19:11, Eric W. Biederman wrote:

Robert Shearman rshea...@brocade.com writes:

In order to be able to function as a Label Edge Router in an MPLS network, it is necessary to be able to take IP packets and impose an MPLS encap and forward them out. The traditional approach of setting up an interface for each tunnel endpoint doesn't scale for the common MPLS use-cases where each IP route tends to be assigned a different label as encap.

The solution suggested here for further discussion is to provide the facility to define encap data on a per-nexthop basis using a new netlink attribute, RTA_ENCAP, which would be opaque to the IPv4/IPv6 forwarding code, but interpreted by the virtual interface assigned to the nexthop.

A new ipmpls interface type is defined to show the use of this facility to allow IP packets to be imposed with an MPLS encap. However, the facility is designed to be general enough to be used by any encapsulation/tunneling mechanism that has similar requirements of high-scale, high-variation-of-encap.

I am still digging into the details but adding a new network device to make this possible is very undesirable.

It is a pain point. Those network devices get to be a major source of memory consumption when there are 4K network namespaces in existence.

It is conceptually wrong. The network device will never be used as an ordinary network device. All the network device gives you is the ability to avoid creating an enumeration of different kinds of encapsulation.

This isn't true. The network device also gives some of the things you take for granted. Things like fragmentation through specifying the mtu on the shared tunnel device, being able to specify rules using the shared tunnel output device, IP stats, and the ability to specify a different destination namespace.

Granted you get a few more things. It is still conceptually wrong as the network device will never be used as an ordinary network device.

Fragmentation is already silly because we are talking about multiple tunnels with different properties. You need per-route mtu to handle that case.

Further, I am not saying you don't need an output device (which is what is needed to specify a different destination namespace); I am saying that having a funny mpls device is wrong as far as I can see. Certainly it is a lot of bloody unnecessary overhead.

If we are going to design for maximum scaling (and 1 million+ routes sounds like maximum scaling) we should see how far we can go without dragging in the horrible heaviness of additional network devices. 35K a piece last I measured it. Just a small handful of them are already scaling issues for network namespaces.

Eric
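The memory argument above can be checked with simple arithmetic. The sketch below is only an illustration: it reads the "35K a piece" figure as 35 KiB per network device, which is an assumption, and the function name netdev_footprint_kib is made up.

```c
#include <stdint.h>

/*
 * Total memory, in KiB, consumed by 'count' network devices that each
 * cost 'kib_per_dev' KiB. Both inputs are supplied by the caller; the
 * 35 KiB figure used in the discussion is quoted, not measured here.
 */
static uint64_t netdev_footprint_kib(uint64_t count, uint64_t kib_per_dev)
{
	return count * kib_per_dev;
}
```

With one such device in each of 4096 namespaces, netdev_footprint_kib(4096, 35) gives 143360 KiB, i.e. 140 MiB. At one device per route and a million routes the same formula lands at roughly 33 GiB, which is why a per-nexthop attribute is attractive compared with a per-tunnel device.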
Re: ray_cs: Change 1 to true for bool type variable.
The variable translate is bool type. So assigning true instead of 1.

Signed-off-by: Shailendra Verma shailendra.capric...@gmail.com

Thanks, applied to wireless-drivers-next.git.

Kalle Valo
[PATCH v5 1/2] Renesas Ethernet AVB driver proper
Ethernet AVB includes a Gigabit Ethernet controller (E-MAC) that is basically compatible with SuperH Gigabit Ethernet E-MAC. Ethernet AVB has a dedicated direct memory access controller (AVB-DMAC) that is a new design compared to the SuperH E-DMAC. The AVB-DMAC is compliant with 3 standards formulated for IEEE 802.1BA: IEEE 802.1AS timing and synchronization protocol, IEEE 802.1Qav real-time transfer, and the IEEE 802.1Qat stream reservation protocol.

The driver only supports device tree probing, so the binding document is included in this patch.

Based on the original patches by Mitsuhiro Kimura.

Signed-off-by: Mitsuhiro Kimura mitsuhiro.kimura...@renesas.com
Signed-off-by: Sergei Shtylyov sergei.shtyl...@cogentembedded.com
---
This patch is against David Miller's 'net-next.git' repo.

Changes in version 5:
- switched to calling multiqueue APIs, implementing ndo_select_queue() method;
- fixed the ring-full check in ravb_start_xmit() and turned it into a sanity check, adjusting the error message and promoting dev_warn() to dev_err();
- fixed skb_put_padto() failure path to drop the packet;
- moved the 'priv->cur_tx' increment after enqueuing the packet in ravb_start_xmit();
- added the ring-full check after the packet gets enqueued in ravb_start_xmit(), moved ravb_tx_free() call from the sanity check to this new code;
- moved mmiowb() call after the 'exit' label in ravb_start_xmit();
- removed superfluous netif_tx_disable() call from ravb_set_ringparam().
Changes in version 4:
- switched from the bit fields in the descriptor structures to normal fields and masks;
- declared 16/32-bit descriptor fields as '__le{16/32}' and started using cpu_to_le{16|32}() and le{16|32}_to_cpu() when accessing them;
- started registering/deregistering the PTP driver each time the AVB-DMAC enters/leaves the operation mode instead of ravb_{probe|remove}(), and thus removed ravb_ptp_is_config() and also the checks in ravb_ptp_tcr_request(), ravb_ptp_time_write(), and ravb_ptp_update_addend();
- folded ravb_free_dma_buffer() into ravb_ring_free(), clarified the comment to the latter function; was then able to simplify the error cleanup path in ravb_ring_init();
- fixed totally brain damaged ravb_tx_timeout() by first stopping DMA and then calling ravb_ring_free() and moving most of the code into work-queue function;
- started calling ravb_ring_init() from ravb_dmac_init();
- started allocating the TX buffers with GFP_KERNEL instead of GFP_ATOMIC;
- started checking the result of ravb_wait() where it was previously ignored;
- propagated errors from ravb_wait() calls outside ravb_ptp_tcr_request();
- propagated errors from ravb_ptp_tcr_request() calls outside ravb_ptp_time_{read|write}();
- propagated errors from ravb_ptp_time_{read|write}() outside ravb_ptp_adjtime() and ravb_ptp_{get|set}time64();
- switched from using ravb_wait() after setting a bit in the GCCR register to just checking if a bit is zero before setting it in ravb_ptp_time_write(), ravb_ptp_select_counter(), and ravb_ptp_update_addend();
- added check for ravb_ptp_update_compare() failure to ravb_ptp_perout() and propagated the error outside this function;
- added mmiowb() calls before releasing the spinlock;
- merged the spinlock release code from different *if* statement branches in ravb_ptp_ptp_perout();
- fixed the 'reg' parameter type in ravb_wait();
- fixed the result type for ravb_start_xmit();
- fixed the 'request' parameter type in ravb_ptp_tcr_request();
- added the 'size' local variable in ravb_tx_free();
- fixed TX ring cleanup threshold in ravb_start_xmit();
- fixed kmalloc() error cleanup in ravb_start_xmit();
- fixed ravb_start_xmit() racing with the interrupt handler and NAPI poller by holding spinlock till the end of ravb_start_xmit();
- factored out the GIC interrupt handling into ravb_ptp_interrupt();
- added the 'dma_addr' local variable in ravb_start_xmit();
- switched from '%=' operator to '=' in ravb_start_xmit();
- changed the format specifier in ravb_tx_timeout();
- removed useless type cast from ravb_rx();
- acquired the spinlock earlier in ravb_poll();
- made ravb_ptp_init() return *void* since its result isn't checked anyway;
- removed 'ravb_private::rx_buffer_size' since the RX buffer size should be constant;
- removed unused 'ravb_private::edmac_endian';
- expanded update_mac_address() inline at its only call site;
- expanded ravb_ptp_cnt_{read|write}() inline at their single call sites;
- expanded ravb_ptp_cnt_select_counter() inline at its only call site;
- expanded ravb_ptp_update_addend() at its only call site;
- added 'ravb_' prefix to read_mac_address();
- renamed ravb_wait_stop_dma() to ravb_stop_dma();
- also disabled E-MAC TX in ravb_stop_dma() by calling ravb_rcv_snd_disable();
- moved the ravb_config() call from ravb_stop_dma() callers to this function;
- removed redundant register reads/writes in ravb_set_duplex();
- converted the 'new_state' local variable from *int* to
[PATCH v5 2/2] Renesas Ethernet AVB PTP clock driver
Ethernet AVB device includes the gPTP timer, so we can implement a PTP clock driver. We're doing that in a separate file, with the main Ethernet driver calling the PTP driver's [de]initialization and interrupt handler functions. Unfortunately, the clock seems tightly coupled with the AVB-DMAC, so when that one leaves the operation mode, we have to unregister the PTP clock... :-( Based on the original patches by Masaru Nagai. Signed-off-by: Masaru Nagai masaru.nagai...@renesas.com Signed-off-by: Sergei Shtylyov sergei.shtyl...@cogentembedded.com --- This patch is against David Miller's 'net-next.git' repo. Changes in version 5: - resolved rejects, refreshed the patch. Changes in version 4: - new patch, split from the main Ethernet driver patch. drivers/net/ethernet/renesas/Makefile |2 drivers/net/ethernet/renesas/ravb.c | 33 ++ drivers/net/ethernet/renesas/ravb.h | 26 ++ drivers/net/ethernet/renesas/ravb_ptp.c | 357 4 files changed, 412 insertions(+), 6 deletions(-) Index: net-next/drivers/net/ethernet/renesas/Makefile === --- net-next.orig/drivers/net/ethernet/renesas/Makefile +++ net-next/drivers/net/ethernet/renesas/Makefile @@ -3,4 +3,4 @@ # obj-$(CONFIG_SH_ETH) += sh_eth.o -obj-$(CONFIG_RAVB) += ravb.o +obj-$(CONFIG_RAVB) += ravb.o ravb_ptp.o Index: net-next/drivers/net/ethernet/renesas/ravb.c === --- net-next.orig/drivers/net/ethernet/renesas/ravb.c +++ net-next/drivers/net/ethernet/renesas/ravb.c @@ -28,7 +28,6 @@ #include linux/of_irq.h #include linux/of_mdio.h #include linux/of_net.h -#include linux/platform_device.h #include linux/pm_runtime.h #include linux/slab.h #include linux/spinlock.h @@ -41,8 +40,7 @@ NETIF_MSG_RX_ERR | \ NETIF_MSG_TX_ERR) -static int ravb_wait(struct net_device *ndev, enum ravb_reg reg, u32 mask, -u32 value) +int ravb_wait(struct net_device *ndev, enum ravb_reg reg, u32 mask, u32 value) { int i; @@ -785,6 +783,9 @@ static irqreturn_t ravb_interrupt(int ir result = IRQ_HANDLED; } + if (iss ISS_CGIS) + result = 
ravb_ptp_interrupt(ndev); + mmiowb(); spin_unlock(priv-lock); return result; @@ -1124,6 +1125,8 @@ static int ravb_set_ringparam(struct net if (netif_running(ndev)) { netif_device_detach(ndev); + /* Stop PTP Clock driver */ + ravb_ptp_stop(ndev); /* Wait for DMA stopping */ error = ravb_stop_dma(ndev); if (error) { @@ -1153,6 +1156,9 @@ static int ravb_set_ringparam(struct net ravb_emac_init(ndev); + /* Initialise PTP Clock driver */ + ravb_ptp_init(ndev, priv-pdev); + netif_device_attach(ndev); } @@ -1162,6 +1168,8 @@ static int ravb_set_ringparam(struct net static int ravb_get_ts_info(struct net_device *ndev, struct ethtool_ts_info *info) { + struct ravb_private *priv = netdev_priv(ndev); + info-so_timestamping = SOF_TIMESTAMPING_TX_SOFTWARE | SOF_TIMESTAMPING_RX_SOFTWARE | @@ -1174,7 +1182,7 @@ static int ravb_get_ts_info(struct net_d (1 HWTSTAMP_FILTER_NONE) | (1 HWTSTAMP_FILTER_PTP_V2_L2_EVENT) | (1 HWTSTAMP_FILTER_ALL); - info-phc_index = -1; + info-phc_index = ptp_clock_index(priv-ptp.clock); return 0; } @@ -1215,15 +1223,21 @@ static int ravb_open(struct net_device * goto out_free_irq; ravb_emac_init(ndev); + /* Initialise PTP Clock driver */ + ravb_ptp_init(ndev, priv-pdev); + netif_tx_start_all_queues(ndev); /* PHY control start */ error = ravb_phy_start(ndev); if (error) - goto out_free_irq; + goto out_ptp_stop; return 0; +out_ptp_stop: + /* Stop PTP Clock driver */ + ravb_ptp_stop(ndev); out_free_irq: free_irq(ndev-irq, ndev); out_napi_off: @@ -1254,6 +1268,9 @@ static void ravb_tx_timeout_work(struct netif_tx_stop_all_queues(ndev); + /* Stop PTP Clock driver */ + ravb_ptp_stop(ndev); + /* Wait for DMA stopping */ ravb_stop_dma(ndev); @@ -1264,6 +1281,9 @@ static void ravb_tx_timeout_work(struct ravb_dmac_init(ndev); ravb_emac_init(ndev); + /* Initialise PTP Clock driver */ + ravb_ptp_init(ndev, priv-pdev); + netif_tx_start_all_queues(ndev); } @@ -1428,6 +1448,9 @@ static int ravb_close(struct net_device ravb_write(ndev, 0, RIC2); ravb_write(ndev, 0, 
TIC); + /* Stop PTP Clock driver */ + ravb_ptp_stop(ndev); + /* Set the config mode to stop the
[PATCH net-next 3/3] net/mlx4_core: fix typo in mlx4_set_vf_mac
From: Carol Soto cls...@linux.vnet.ibm.com

fix typo in mlx4_set_vf_mac

Acked-by: Or Gerlitz ogerl...@mellanox.com
Signed-off-by: Carol L Soto cls...@linux.vnet.ibm.com
---
 drivers/net/ethernet/mellanox/mlx4/cmd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index 91d8344..68ae765 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -2917,7 +2917,7 @@ int mlx4_set_vf_mac(struct mlx4_dev *dev, int port, int vf, u64 mac)
 	port = mlx4_slaves_closest_port(dev, slave, port);
 	s_info = &priv->mfunc.master.vf_admin[slave].vport[port];
 	s_info->mac = mac;
-	mlx4_info(dev, "default mac on vf %d port %d to %llX will take afect only after vf restart\n",
+	mlx4_info(dev, "default mac on vf %d port %d to %llX will take effect only after vf restart\n",
 		  vf, port, s_info->mac);
 	return 0;
 }
-- 
1.8.3.1
[PATCH net-next 2/3] net/mlx4_core: need to call close fw if alloc icm is called twice
From: Carol Soto cls...@linux.vnet.ibm.com

If mlx4_enable_sriov is called by an adapter without the MLX4_DEV_CAP_FLAG2_SYS_EQS feature, then along this path the alloc icm function is called twice without freeing the structures from the first call.

Acked-by: Or Gerlitz ogerl...@mellanox.com
Signed-off-by: Carol L Soto cls...@linux.vnet.ibm.com
---
 drivers/net/ethernet/mellanox/mlx4/main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
index 9485cbe..7d5 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -2976,6 +2976,7 @@ slave_start:
 			existing_vfs, reset_flow);
 
+		mlx4_close_fw(dev);
 		mlx4_cmd_cleanup(dev, MLX4_CMD_CLEANUP_ALL);
 		dev->flags = dev_flags;
 		if (!SRIOV_VALID_STATE(dev->flags)) {
-- 
1.8.3.1
[PATCH net-next 1/3] net/mlx4_core: double free of dev_vfs
From: Carol L Soto cls...@linux.vnet.ibm.com

If user loads mlx4_core with num_vfs greater than supported then variable dev->dev_vfs is freed 2 times after unloading the driver.

Acked-by: Or Gerlitz ogerl...@mellanox.com
Signed-off-by: Carol L Soto cls...@linux.vnet.ibm.com
---
 drivers/net/ethernet/mellanox/mlx4/main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
index 0dbd704..9485cbe 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -2824,6 +2824,7 @@ disable_sriov:
 free_mem:
 	dev->persist->num_vfs = 0;
 	kfree(dev->dev_vfs);
+	dev->dev_vfs = NULL;
 	return dev_flags & ~MLX4_FLAG_MASTER;
 }
-- 
1.8.3.1
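The fix above relies on a common idiom: clear a pointer right after freeing it, so that a later release path that frees the same field becomes a no-op instead of a double free. The following is a userspace sketch of that idiom only. It uses free() in place of kfree(), and struct demo_dev and demo_free_vfs() are hypothetical names, not mlx4 code.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the driver structure that owns the array. */
struct demo_dev {
	int *dev_vfs;
};

/*
 * Release the VF array and clear the pointer. Calling this twice is
 * harmless because free(NULL) is defined to do nothing (kfree(NULL)
 * behaves the same way in the kernel).
 */
static void demo_free_vfs(struct demo_dev *dev)
{
	free(dev->dev_vfs);
	dev->dev_vfs = NULL;
}
```

Without the assignment to NULL, the second call would pass a dangling pointer to free(), which is the userspace analogue of the bug described in the commit message.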
Re: [PATCH net-next v3 00/15] sfc: ndo_get_phys_port_id, vadaptor stats and PF unload when Vf's assigned to guest
From: Shradha Shah ss...@solarflare.com
Date: Tue, 2 Jun 2015 11:36:00 +0100

This is the third and last instalment of SRIOV for EF10 patches. This patch set includes implementation of ndo_get_phys_port_id and changes to the MAC statistics code in order to support vadaptor statistics. It also includes code to deal with PF unload when Vf's are still assigned to the guest.

The first couple of patches create sysfs files for physical port and link control flags which are particularly useful when we have enabled a large number of VF's.

These patches have been tested with and without CONFIG_SFC_SRIOV. The creation and content of the sysfs files has been tested. The statistics are tested using ethtool for monitoring.

Series applied, thanks.
Re: [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls
On 02/06/15 19:57, roopa wrote:

On 6/2/15, 9:33 AM, Robert Shearman wrote:

On 02/06/15 17:15, roopa wrote:

On 6/1/15, 9:46 AM, Robert Shearman wrote:

Allow creating an mpls device for the purposes of encapsulating IP packets with:

ip link add type ipmpls

This device defines its per-nexthop encapsulation data as a stack of labels, in the same format as for RTA_NEWDST. It uses the encap data which will have been stored in the IP route to encapsulate the packet with that stack of labels, with the last label corresponding to a local label that defines how the packet will be sent out. The device sends packets over loopback to the local MPLS forwarding logic which performs all of the work.

Maybe a silly question, but when you loop the packet back, what does the local MPLS forwarding logic lookup with ? It probably assumes there is a mpls route with that label and nexthop. Will this need any internal labels (thinking same label stack different tunnel device etc) ?

Yes, it requires that local/internal labels have been allocated and label routes installed in the label table for them.

This is our only concern.

It is entirely possible to put the outgoing interface into the encap data to avoid having to allocate extra labels, but I did it this way in order to support PIC Core for MPLS-VPN routes.

hmm..., is a netdevice a must in this case.., can you please elaborate on this ?.

Yes, the ipmpls device would still be used to perform the encapsulation, transitioning from the IP forwarding path to the MPLS forwarding path.

Thanks,
Rob
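For readers unfamiliar with what "a stack of labels" looks like on the wire, here is a small sketch of the standard MPLS label stack entry layout from RFC 3032: 20 bits of label, 3 bits of traffic class, 1 bottom-of-stack bit and 8 bits of TTL. It is not code from the RFC patches; mpls_lse() and mpls_build_stack() are hypothetical helper names, and the values are kept in host byte order (a real sender would convert each entry to network byte order).

```c
#include <stddef.h>
#include <stdint.h>

/* Pack one label stack entry: label(20) | tc(3) | bottom-of-stack(1) | ttl(8). */
static uint32_t mpls_lse(uint32_t label, uint8_t tc, int bos, uint8_t ttl)
{
	return ((label & 0xFFFFFu) << 12) |
	       ((uint32_t)(tc & 0x7u) << 9) |
	       ((bos ? 1u : 0u) << 8) |
	       (uint32_t)ttl;
}

/*
 * Build a stack of 'n' entries into 'out'. Only the last entry has the
 * bottom-of-stack bit set, matching the "last label" role described in
 * the patch description.
 */
static void mpls_build_stack(const uint32_t *labels, size_t n,
			     uint8_t ttl, uint32_t *out)
{
	size_t i;

	for (i = 0; i < n; i++)
		out[i] = mpls_lse(labels[i], 0, i == n - 1, ttl);
}
```

For example, a two-label stack {16, 100} with TTL 64 yields 0x10040 for the outer entry and 0x64140 for the inner one, where the extra 0x100 in the second value is the bottom-of-stack bit.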
Re: How do I avoid recvmsg races with IP_RECVERR?
On Tue, Jun 2, 2015, at 21:40, Andy Lutomirski wrote:

As far as I can tell, enabling IP_RECVERR causes the presence of a queued error to cause recvmsg, etc to return an error (once). It's worse, though: a new error can be queued asynchronously at any time, thus setting sk_err to a nonzero value. How do I sensibly distinguish recvmsg failures due to genuine errors receiving messages from recvmsg failures because there's a queued error?

The only way I can see to get reliable error handling is to literally call recvmsg in a loop:

while (true /* or while POLLIN is set */) {
	int ret = recvmsg(..., MSG_ERRQUEUE not set);
	if (ret < 0 /* what goes here? */) {
		whoops! this might be a harmless asynchronous error! take no action!
	}

I see two possibilities:

We export the icmp_err_convert tables along with the udp_lib_err error conversions to user space and spice them up with flags to mark if they are transient (icmp_err_convert already has a fatal flag).

Otherwise you should be able to call recvmsg with MSG_ERRQUEUE set after you got a ret < 0 when calling without MSG_ERRQUEUE and inspect the sock_extended_err, no?

Bye,
Hannes
Re: How do I avoid recvmsg races with IP_RECVERR?
On Wed, Jun 3, 2015, at 02:03, Andy Lutomirski wrote:

On Tue, Jun 2, 2015 at 2:50 PM, Hannes Frederic Sowa han...@stressinduktion.org wrote:

My proposal would be to make the error conversion lazy: Keeping duplicate data is not a good idea in general: So we shouldn't use sk->sk_err if IP_RECVERR is set at all but let sock_error just use the sk_error_queue and extract the error code from there. Only if IP_RECVERR was not set, we use sk->sk_err logic. What do you think?

I just noticed that this will probably break existing user space applications which require that icmp errors are transient even with IP_RECVERR. We can mark that with a bit in the sk_error_queue pointer and xchg the pointer, hmmm

Do you mean to fix the race like this but to otherwise leave the semantics alone? That would be an improvement, but it might be nice to also add a non-crappy API for this, too.

Yes, keep current semantics but fix the race you reported.

I currently don't have good proposals for a decent API to handle this besides adding some ancillary cmsg data to msg_control. This still would not solve the problem fundamentally, as a -EFAULT/-EINVAL return value could also mean that msg_control should not be touched, thus we end up again relying on errno checking. :/ Thus checking the error queue after receiving an error indication is my best hunch so far.

Your proposal with MSG_IGNORE_ERROR seems reasonable so far for ping or udp, but I haven't fully grasped the TCP semantics of sk->sk_err, yet.

Bye,
Hannes
Re: [RFC 2/9] net: dsa: add basic support for VLAN operations
Hi Guenter, On Jun 2, 2015, at 10:42 AM, Guenter Roeck li...@roeck-us.net wrote: On 06/01/2015 06:27 PM, Vivien Didelot wrote: This patch adds the glue between DSA and switchdev to add and delete SWITCHDEV_OBJ_PORT_VLAN objects. This will allow the DSA switch drivers implementing the port_vlan_add and port_vlan_del functions to access the switch VLAN database through userspace commands such as bridge vlan. Signed-off-by: Vivien Didelot vivien.dide...@savoirfairelinux.com --- include/net/dsa.h | 7 +++ net/dsa/slave.c | 61 +-- 2 files changed, 66 insertions(+), 2 deletions(-) diff --git a/include/net/dsa.h b/include/net/dsa.h index fbca63b..726357b 100644 --- a/include/net/dsa.h +++ b/include/net/dsa.h @@ -302,6 +302,13 @@ struct dsa_switch_driver { const unsigned char *addr, u16 vid); int (*fdb_getnext)(struct dsa_switch *ds, int port, unsigned char *addr, bool *is_static); + +/* + * VLAN support + */ +int (*port_vlan_add)(struct dsa_switch *ds, int port, u16 vid, + u16 bridge_flags); +int (*port_vlan_del)(struct dsa_switch *ds, int port, u16 vid); }; void register_switch_driver(struct dsa_switch_driver *type); diff --git a/net/dsa/slave.c b/net/dsa/slave.c index cbda00a..52ba5a1 100644 --- a/net/dsa/slave.c +++ b/net/dsa/slave.c @@ -363,6 +363,25 @@ static int dsa_slave_port_attr_set(struct net_device *dev, return ret; } +static int dsa_slave_port_vlans_add(struct net_device *dev, +struct switchdev_obj_vlan *vlan) +{ +struct dsa_slave_priv *p = netdev_priv(dev); +struct dsa_switch *ds = p-parent; +int vid, err = 0; + +if (!ds-drv-port_vlan_add) +return -ENOTSUPP; + +for (vid = vlan-vid_start; vid = vlan-vid_end; ++vid) { +err = ds-drv-port_vlan_add(ds, p-port, vid, vlan-flags); +if (err) +break; +} + +return err; +} + static int dsa_slave_port_obj_add(struct net_device *dev, struct switchdev_obj *obj) { @@ -378,6 +397,9 @@ static int dsa_slave_port_obj_add(struct net_device *dev, return 0; switch (obj-id) { +case SWITCHDEV_OBJ_PORT_VLAN: +err = 
dsa_slave_port_vlans_add(dev, obj-u.vlan); +break; default: err = -ENOTSUPP; break; @@ -386,12 +408,34 @@ static int dsa_slave_port_obj_add(struct net_device *dev, return err; } +static int dsa_slave_port_vlans_del(struct net_device *dev, +struct switchdev_obj_vlan *vlan) +{ +struct dsa_slave_priv *p = netdev_priv(dev); +struct dsa_switch *ds = p-parent; +int vid, err = 0; + +if (!ds-drv-port_vlan_del) +return -ENOTSUPP; + +for (vid = vlan-vid_start; vid = vlan-vid_end; ++vid) { +err = ds-drv-port_vlan_del(ds, p-port, vid); +if (err) +break; +} + +return err; +} + static int dsa_slave_port_obj_del(struct net_device *dev, struct switchdev_obj *obj) { int err; switch (obj-id) { +case SWITCHDEV_OBJ_PORT_VLAN: +err = dsa_slave_port_vlans_del(dev, obj-u.vlan); +break; default: err = -EOPNOTSUPP; break; @@ -473,6 +517,15 @@ static netdev_tx_t dsa_slave_notag_xmit(struct sk_buff *skb, return NETDEV_TX_OK; } +static int dsa_slave_vlan_noop(struct net_device *dev, __be16 proto, u16 vid) +{ +/* NETIF_F_HW_VLAN_CTAG_FILTER requires ndo_vlan_rx_add_vid and + * ndo_vlan_rx_kill_vid, otherwise the VLAN acceleration is considered + * buggy (see net/core/dev.c). 
+ */ +return 0; +} + /* ethtool operations ***/ static int @@ -734,6 +787,10 @@ static const struct net_device_ops dsa_slave_netdev_ops = { .ndo_fdb_dump = dsa_slave_fdb_dump, .ndo_do_ioctl = dsa_slave_ioctl, .ndo_get_iflink = dsa_slave_get_iflink, +.ndo_vlan_rx_add_vid= dsa_slave_vlan_noop, +.ndo_vlan_rx_kill_vid = dsa_slave_vlan_noop, +.ndo_bridge_setlink = switchdev_port_bridge_setlink, +.ndo_bridge_dellink = switchdev_port_bridge_dellink, }; static const struct switchdev_ops dsa_slave_switchdev_ops = { @@ -924,7 +981,7 @@ int dsa_slave_create(struct dsa_switch *ds, struct device *parent, if (slave_dev == NULL) return -ENOMEM; -slave_dev-features = master-vlan_features; +slave_dev-features = master-vlan_features | NETIF_F_VLAN_FEATURES; Hi Vivien, NETIF_F_VLAN_FEATURES declares that the device supports receive and transmit tagging offload. We do this on transmit, by calling vlan_hwaccel_push_inside() with patch 9, but not on the receive side. I think you may need to add matching code on the receive side to remove the VLAN
Re: [PATCH v2 net-next] vlan: Add GRO support for non hardware accelerated vlan
On Mon, Jun 01, 2015 at 02:56:25PM -0700, David Miller wrote:

From: Eric Dumazet eric.duma...@gmail.com
Date: Mon, 01 Jun 2015 07:12:37 -0700

Can we ensure offload_base contains a sensible order of expected types ?

This seemed easy enough to kill, so I pushed the following into net-next:

[PATCH] net: Add priority to packet_offload objects.

When we scan a packet for GRO processing, we want to see the most common packet types in the front of the offload_base list. So add a priority field so we can handle this properly. IPv4/IPv6 get the highest priority with the implicit zero priority field. Next comes ethernet with a priority of 10, and then we have the MPLS types with a priority of 15.

FWIW I have no objections to the priority assigned to MPLS.

Suggested-by: Eric Dumazet eric.duma...@gmail.com
Suggested-by: Toshiaki Makita makita.toshi...@lab.ntt.co.jp
Signed-off-by: David S. Miller da...@davemloft.net
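The ordering described in the commit message (lower priority value scanned first, zero for IPv4/IPv6, 10 for ethernet, 15 for MPLS) amounts to a sorted insert into a list. The sketch below shows that idea with a plain singly linked list in userspace C. It is not the kernel implementation: struct offload and offload_add() are hypothetical, and the kernel uses its own list primitives.

```c
#include <stddef.h>

/* Hypothetical, loosely modelled on the idea of a packet_offload entry. */
struct offload {
	const char *name;
	int priority;		/* lower value is scanned earlier */
	struct offload *next;
};

/*
 * Insert 'po' in front of the first entry whose priority value is
 * strictly greater, so entries with equal priority keep their
 * registration order.
 */
static void offload_add(struct offload **head, struct offload *po)
{
	struct offload **pp = head;

	while (*pp && (*pp)->priority <= po->priority)
		pp = &(*pp)->next;
	po->next = *pp;
	*pp = po;
}
```

Registering MPLS (15), IPv4 (0), ethernet (10) and IPv6 (0) in that order leaves the list as IPv4, IPv6, ethernet, MPLS, so a scan from the head meets the most common packet types first regardless of registration order.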
Re: [RFC 0/3] DSA and Marvell 88E6352 802.1q support
Hi Scott, On Jun 2, 2015, at 2:18 AM, Scott Feldman sfel...@gmail.com wrote: On Mon, Jun 1, 2015 at 5:18 PM, Vivien Didelot vivien.dide...@savoirfairelinux.com wrote: On May 29, 2015, at 1:02 AM, Scott Feldman sfel...@gmail.com wrote: On Thu, May 28, 2015 at 2:37 PM, Vivien Didelot vivien.dide...@savoirfairelinux.com wrote: This RFC is based on v4.1-rc3. It is meant to get a glance to the commits responsible to implement the necessary NDOs between DSA and the Marvell 88E6352 switch driver. With this support, I am able to create VLANs with (un)tagged ports, setting their default VID, from a bridge. To create a bridge containing all switch ports, with a VLAN ID 400, swp2 and swp3 untagged (pvid), and swp4 tagged, the userspace commands look like this: ip link add name br0 type bridge [...] ip link set dev swp2 up master br0 [...] bridge vlan add vid 400 pvid untagged dev swp2 bridge vlan add vid 400 pvid untagged dev swp3 bridge vlan add vid 400 dev swp4 [...] ip link add link br0 name br0.400 type vlan id 400 [...] bridge vlan add dev br0 vid 400 self The code is currently being rebased to the latest net-next/master. Seems like the way to go now is through switchdev attr getter/setter... Indeed, for dsa_slave you should be able to port this to switchdev and set your ndo_bridge_setlink/dellink handlers to switchdev_port_bridge_setlink/dellink. (And also implement the switchdev ops for vlans). If you use switchdev_port_bridge_setlink/dellink, you shouldn't need to implement ndo_vlan_rx_add_vid/ndo_vlan_rx_kill_vid at all. Scott, In fact I have to define these ndo, otherwise I get the Buggy VLAN acceleration in driver! warning from net/core/dev.c and the switch ports won't register. I'm actually defining a noop function for them in dsa_slave_netdev_ops. Is it correct to set NETIF_F_HW_VLAN_CTAG_FILTER in slave_dev-features? If your nooping ndo VLAN ops then just remove setting NETIF_F_HW_VLAN_CTAG_FILTER and then you can remove the noop funcs. 
The setlink/dellink callbacks will give the same info (and more, e.g. pvid, untagged flags) and you'll automatically get support for stacked drivers, for example if you bonded swp2/3 and then included that bond in your vlan bridge. Your commands will be slightly modified: when adding the vid to the port, specify master and self: bridge vlan add vid 400 dev swp4 master self Thanks it works! Now the switch VLAN database is consistent with the bridge commands, I'm sending a complete RFC very soon. Scott, David, I use this mail to expose a potential problem between iproute2 and the kernel, found with my previous code. When issuing ip link set dev swp0 master br0, ndo_vlan_rx_add_vid is called, but not ndo_bridge_setlink, Remove NETIF_F_HW_VLAN_CTAG_FILTER and ndo_vlan_rx_add_vid will not be called. Issuing ip link set dev swp0 master br0 should only be setting the bridge member, not setting up any VLAN. I suspect when you did this swp0 was admin UP and you're getting untagged VLAN 0 installed, which is the call to ndo_vlan_rx_add_vid. which results in an inconsistency between my switch VLAN database (and port settings) and bridge vlan, which shows swp0 1 PVID Egress Untagged. So that is a result of /sys/class/net/br0/bridge/default_pvid set to 1. If you don't want that, turn default_pvid off: echo 0 /sys/class/net/br0/bridge/default_pvid Now you'll see None in the bridge vlan output. Seems like there is a call to ndo_bridge_setlink to add somewhere, but I have no clue where. In the meantime, I call bridge vlan add vid 1 dev swp0 pvid untagged [master self] at boot, to be consistent with the bridge output. Or turn off default_pvid. Thanks, I confirm both fixes work. Thanks a lot. -v -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH V2 2/2] pci: Add VPD quirk for Intel Ethernet devices
This quirk sets the PCI_DEV_FLAGS_VPD_REF_F0 flag on all Intel Ethernet device functions other than function 0.

Signed-off-by: Mark Rustad mark.d.rus...@intel.com
---
 drivers/pci/quirks.c |    9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index c6dc1dfd25d5..9ddf6a533f4f 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -1903,6 +1903,15 @@ static void quirk_netmos(struct pci_dev *dev)
 DECLARE_PCI_FIXUP_CLASS_HEADER(PCI_VENDOR_ID_NETMOS, PCI_ANY_ID,
 			 PCI_CLASS_COMMUNICATION_SERIAL, 8, quirk_netmos);
 
+static void quirk_f0_vpd_link(struct pci_dev *dev)
+{
+	if (!PCI_FUNC(dev->devfn))
+		return;
+	dev->dev_flags |= PCI_DEV_FLAGS_VPD_REF_F0;
+}
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, PCI_ANY_ID,
+			      PCI_CLASS_NETWORK_ETHERNET, 8, quirk_f0_vpd_link);
+
 static void quirk_e100_interrupt(struct pci_dev *dev)
 {
 	u16 command, pmcsr;
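The quirk's test on PCI_FUNC(dev->devfn) depends on how a PCI devfn byte is laid out: the upper five bits are the slot (device) number and the lower three bits are the function number. The sketch below re-creates that layout in userspace with DEMO_-prefixed macros so the decision can be exercised outside the kernel. The flag value and the demo_quirk_flags() helper are hypothetical and only mirror the shape of the quirk.

```c
#include <stdint.h>

/* Same bit layout as the kernel's devfn encoding, under demo names. */
#define DEMO_PCI_DEVFN(slot, func) ((((slot) & 0x1f) << 3) | ((func) & 0x07))
#define DEMO_PCI_SLOT(devfn)       (((devfn) >> 3) & 0x1f)
#define DEMO_PCI_FUNC(devfn)       ((devfn) & 0x07)

/* Hypothetical flag bit standing in for PCI_DEV_FLAGS_VPD_REF_F0. */
#define DEMO_FLAG_VPD_REF_F0 (1u << 0)

/*
 * Mirror of the quirk's decision: function 0 is left alone, every
 * other function is marked as referring to function 0 for VPD access.
 */
static unsigned int demo_quirk_flags(uint8_t devfn)
{
	if (!DEMO_PCI_FUNC(devfn))
		return 0;
	return DEMO_FLAG_VPD_REF_F0;
}
```

On a multi-function adapter in slot 3, function 0 (devfn 0x18) gets no flag, while functions 1 through 7 all get the flag set.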
Re: How do I avoid recvmsg races with IP_RECVERR?
On Tue, Jun 2, 2015 at 5:33 PM, Hannes Frederic Sowa han...@stressinduktion.org wrote:
> On Wed, Jun 3, 2015, at 02:03, Andy Lutomirski wrote:
>> On Tue, Jun 2, 2015 at 2:50 PM, Hannes Frederic Sowa han...@stressinduktion.org wrote:
>>> My proposal would be to make the error conversion lazy:
>>>
>>> Keeping duplicate data is not a good idea in general: So we
>>> shouldn't use sk->sk_err if IP_RECVERR is set at all but let
>>> sock_error just use the sk_error_queue and extract the error code
>>> from there. Only if IP_RECVERR was not set, we use sk->sk_err
>>> logic. What do you think?
>>>
>>> I just noticed that this will probably break existing user space
>>> applications which require that icmp errors are transient even with
>>> IP_RECVERR. We can mark that with a bit in the sk_error_queue
>>> pointer and xchg the pointer, hmmm
>>
>> Do you mean to fix the race like this but to otherwise leave the
>> semantics alone? That would be an improvement, but it might be nice
>> to also add a non-crappy API for this, too.
>
> Yes, keep current semantics but fix the race you reported.
>
> I currently don't have good proposals for a decent API to handle this
> besides adding some ancillary cmsg data to msg_control. This still
> would not solve the problem fundamentally, as a -EFAULT/-EINVAL return
> value could also mean that msg_control should not be touched, thus we
> end up again relying on errno checking. :/ Thus checking error queue
> after receiving an error indication is my best hunch so far.
>
> Your proposal with MSG_IGNORE_ERROR seems reasonable so far for ping or
> udp, but I haven't fully grasped the TCP semantics of sk->sk_err, yet.

I always assumed that TCP didn't have transient errors. Shouldn't a
connection either be up or down but not up with errors? If that's
wrong, then it's probably worth understanding what's going on before
trying to design a fix.

--Andy
Re: How do I avoid recvmsg races with IP_RECVERR?
On Tue, Jun 2, 2015 at 2:50 PM, Hannes Frederic Sowa han...@stressinduktion.org wrote:
> On Tue, Jun 2, 2015, at 23:42, Hannes Frederic Sowa wrote:
>> On Tue, Jun 2, 2015, at 23:33, Andy Lutomirski wrote:
>>> On Tue, Jun 2, 2015 at 2:17 PM, Hannes Frederic Sowa han...@stressinduktion.org wrote:
>>>> On Tue, Jun 2, 2015, at 21:40, Andy Lutomirski wrote:
>> [...]
>>> I do this already, which makes me think that there's a bug or
>>> another race somewhere. I've only seen a failure once in several
>>> years of operation. The failure happened on a ping socket. I
>>> suspect that the race is:
>>>
>>> ping_err: ip_icmp_error(...);
>>> user: recvmsg(MSG_ERRQUEUE) and dequeues the error.
>>> ping_err: sk_err = err;
>>> user: recvmsg(MSG_ERRQUEUE not set), and recvmsg sees and clears
>>>       the error via sock_error.
>>> user: recvmsg(MSG_ERRQUEUE), and recvmsg returns -EAGAIN.
>>>
>>> Now the user code thinks that it was a real (non-transient) error
>>> and aborts.
>>>
>>> Shouldn't that sk->sk_err = err assignment at least use WRITE_ONCE?
>>
>> Hmm, I don't think this will help.
>>
>>> Even if this race were fixed, this interface still sucks IMO.
>>
>> Yes. :/
>>
>> My proposal would be to make the error conversion lazy:
>>
>> Keeping duplicate data is not a good idea in general: So we shouldn't
>> use sk->sk_err if IP_RECVERR is set at all but let sock_error just
>> use the sk_error_queue and extract the error code from there. Only
>> if IP_RECVERR was not set, we use sk->sk_err logic. What do you
>> think?
>
> I just noticed that this will probably break existing user space
> applications which require that icmp errors are transient even with
> IP_RECVERR. We can mark that with a bit in the sk_error_queue pointer
> and xchg the pointer, hmmm

Do you mean to fix the race like this but to otherwise leave the
semantics alone? That would be an improvement, but it might be nice
to also add a non-crappy API for this, too.

--Andy
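
The interleaving suspected in this message can be replayed with a small single-threaded toy model. This is only a sketch under stated assumptions: `struct toy_sock` and every `toy_` function are hypothetical stand-ins for the socket error queue and `sk->sk_err`, not kernel code, and `EHOSTUNREACH` is just an example error value.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model: the two places where one ICMP error ends up being recorded. */
struct toy_sock {
	int queued;	/* entries on the error queue (stands in for sk_error_queue) */
	int sk_err;	/* pending one-shot error (stands in for sk->sk_err) */
};

/* Kernel side, step 1: queue the extended error (the ip_icmp_error() step). */
static void toy_queue_error(struct toy_sock *sk)
{
	sk->queued++;
}

/* Kernel side, step 2: publish the errno (the "sk_err = err" step). */
static void toy_set_sk_err(struct toy_sock *sk, int err)
{
	sk->sk_err = err;
}

/* recvmsg(MSG_ERRQUEUE): dequeue one entry, or fail with -EAGAIN if empty. */
static int toy_recv_errqueue(struct toy_sock *sk)
{
	if (sk->queued == 0)
		return -EAGAIN;
	sk->queued--;
	return 0;
}

/* recvmsg() without MSG_ERRQUEUE: report and clear sk_err, as sock_error()
 * is described to do in the thread. The model never has data to deliver. */
static int toy_recv_data(struct toy_sock *sk)
{
	int err = sk->sk_err;

	sk->sk_err = 0;
	return err ? -err : -EAGAIN;
}

/* Replay the five steps in order. Returns true when user space is left with
 * a failed receive and an empty error queue, i.e. no way to tell that the
 * failure was only a transient ICMP error. */
static bool toy_replay_race(void)
{
	struct toy_sock sk = { 0, 0 };
	int first, data, second;

	toy_queue_error(&sk);			/* ping_err: ip_icmp_error(...) */
	first = toy_recv_errqueue(&sk);		/* user: drains the queue early */
	toy_set_sk_err(&sk, EHOSTUNREACH);	/* ping_err: sk_err = err */
	data = toy_recv_data(&sk);		/* user: plain recvmsg fails */
	second = toy_recv_errqueue(&sk);	/* user: looks for the cause */

	return first == 0 && data == -EHOSTUNREACH && second == -EAGAIN;
}
```

The model only shows that the described ordering is self-consistent; it says nothing about how often the window is hit in practice.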
Re: [RFC 5/9] net: dsa: mv88e6352: disable mirroring
Hi Guenter, Andrew,

On Jun 2, 2015, at 10:53 AM, Andrew Lunn and...@lunn.ch wrote:
> On Tue, Jun 02, 2015 at 07:16:10AM -0700, Guenter Roeck wrote:
>> On 06/01/2015 06:27 PM, Vivien Didelot wrote:
>>> Disable the mirroring policy in the monitor control register, since
>>> this feature is not needed.
>>>
>>> Signed-off-by: Vivien Didelot vivien.dide...@savoirfairelinux.com
>>
>> Should this be a separate patch, unrelated to the patch set ?

Indeed, this one is an unrelated patch, sorry.

>> If I understand correctly, this effectively disables IGMP/MLD
>> snooping. I think this warrants an explanation why that is not
>> needed, not just a statement that it is not needed.
>
> +1
>
> Especially since we might want to revisit this to implement IGMP/MLD
> snooping in the bridge. The hardware should be capable of it.

This is something I want to disable because I can have several gigabits
of traffic on my ports. This would end up as a bottleneck on the CPU
port. Am I right?

Thanks,
-v
Re: [PATCH net iproute2 v4] mpls: always set type RTN_UNICAST and scope RT_SCOPE_UNIVERSE for route add/deletes
Roopa Prabhu ro...@cumulusnetworks.com writes:

> From: Roopa Prabhu ro...@cumulusnetworks.com
>
> This patch fixes incorrect -EINVAL errors due to invalid scope and type
> during mpls route deletes.
>
> $ip -f mpls route add 100 as 200 via inet 10.1.1.2 dev swp1
>
> $ip -f mpls route show
> 100 as to 200 via inet 10.1.1.2 dev swp1
>
> $ip -f mpls route del 100 as 200 via inet 10.1.1.2 dev swp1
> RTNETLINK answers: Invalid argument
>
> $ip -f mpls route del 100
> RTNETLINK answers: Invalid argument
>
> After patch:
>
> $ip -f mpls route show
> 100 as to 200 via inet 10.1.1.2 dev swp1
>
> $ip -f mpls route del 100 as 200 via inet 10.1.1.2 dev swp1
>
> $ip -f mpls route show
>
> Always set type to RTN_UNICAST for mpls route add/deletes.
> Also to keep things consistent with kernel set scope to
> RT_SCOPE_UNIVERSE for both mpls and ipv6 routes. Both mpls and ipv6
> route deletes ignore scope.

Acked-by: Eric W. Biederman ebied...@xmission.com

> Suggested-by: Eric W. Biederman ebied...@xmission.com
> Signed-off-by: Roopa Prabhu ro...@cumulusnetworks.com
> Signed-off-by: Vivek Venkataraman vi...@cumulusnetworks.com
> --
> v4 move fix to iproute2
> ---
>  ip/iproute.c | 16 ++++++++++++----
>  1 file changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/ip/iproute.c b/ip/iproute.c
> index 670a4c6..d0b9910 100644
> --- a/ip/iproute.c
> +++ b/ip/iproute.c
> @@ -803,6 +803,7 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
>  	int scope_ok = 0;
>  	int table_ok = 0;
>  	int raw = 0;
> +	int type_ok = 0;
>  
>  	memset(&req, 0, sizeof(req));
>  
> @@ -1095,6 +1096,7 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
>  				   rtnl_rtntype_a2n(&type, *argv) == 0) {
>  				NEXT_ARG();
>  				req.r.rtm_type = type;
> +				type_ok = 1;
>  			}
>  
>  			if (matches(*argv, "help") == 0)
> @@ -1136,6 +1138,9 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
>  	if (nhs_ok)
>  		parse_nexthops(&req.n, &req.r, argc, argv);
>  
> +	if (req.r.rtm_family == AF_UNSPEC)
> +		req.r.rtm_family = AF_INET;
> +
>  	if (!table_ok) {
>  		if (req.r.rtm_type == RTN_LOCAL ||
>  		    req.r.rtm_type == RTN_BROADCAST ||
> @@ -1144,8 +1149,11 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
>  			req.r.rtm_table = RT_TABLE_LOCAL;
>  	}
>  	if (!scope_ok) {
> -		if (req.r.rtm_type == RTN_LOCAL ||
> -		    req.r.rtm_type == RTN_NAT)
> +		if (req.r.rtm_family == AF_INET6 ||
> +		    req.r.rtm_family == AF_MPLS)
> +			req.r.rtm_scope = RT_SCOPE_UNIVERSE;
> +		else if (req.r.rtm_type == RTN_LOCAL ||
> +			 req.r.rtm_type == RTN_NAT)
>  			req.r.rtm_scope = RT_SCOPE_HOST;
>  		else if (req.r.rtm_type == RTN_BROADCAST ||
>  			 req.r.rtm_type == RTN_MULTICAST ||
> @@ -1160,8 +1168,8 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
>  		}
>  	}
>  
> -	if (req.r.rtm_family == AF_UNSPEC)
> -		req.r.rtm_family = AF_INET;
> +	if (!type_ok && req.r.rtm_family == AF_MPLS)
> +		req.r.rtm_type = RTN_UNICAST;
>  
>  	if (rtnl_talk(&rth, &req.n, 0, 0, NULL) < 0)
>  		return -2;
Re: [RFC 8/9] net: dsa: mv88e6352: set port 802.1Q mode to Secure
On 06/01/2015 06:27 PM, Vivien Didelot wrote:
> This commit changes the 802.1Q mode of each port from Disabled to
> Secure. This enables the VLAN support, by checking the VTU entries on
> ingress.
>
> Signed-off-by: Vivien Didelot vivien.dide...@savoirfairelinux.com
> ---
>  drivers/net/dsa/mv88e6xxx.c | 14 +++++++-------
>  drivers/net/dsa/mv88e6xxx.h |  5 +++++
>  2 files changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
> index ed49bd8..35243d8 100644
> --- a/drivers/net/dsa/mv88e6xxx.c
> +++ b/drivers/net/dsa/mv88e6xxx.c
> @@ -1723,13 +1723,11 @@ static int mv88e6xxx_setup_port(struct dsa_switch *ds, int port)
>  		goto abort;
>  	}
>  
> -	/* Port Control 2: don't force a good FCS, set the maximum
> -	 * frame size to 10240 bytes, don't let the switch add or
> -	 * strip 802.1q tags, don't discard tagged or untagged frames
> -	 * on this port, do a destination address lookup on all
> -	 * received packets as usual, disable ARP mirroring and don't
> -	 * send a copy of all transmitted/received frames on this port
> -	 * to the CPU.
> +	/* Port Control 2: don't force a good FCS, set the maximum frame size to
> +	 * 10240 bytes, enable secure 802.1q tags, don't discard tagged or
> +	 * untagged frames on this port, do a destination address lookup on all
> +	 * received packets as usual, disable ARP mirroring and don't send a
> +	 * copy of all transmitted/received frames on this port to the CPU.
>  	 */
>  	reg = 0;
>  	if (mv88e6xxx_6352_family(ds) || mv88e6xxx_6351_family(ds) ||
> @@ -1751,6 +1749,8 @@ static int mv88e6xxx_setup_port(struct dsa_switch *ds, int port)
>  		reg |= PORT_CONTROL_2_FORWARD_UNKNOWN;
>  	}
>  
> +	reg |= PORT_CONTROL_2_8021Q_SECURE;
> +

Vivien,

With this patch, my non-VLAN configuration fails; it appears that
untagged packets are no longer received. I found two possible
solutions:

- Use PORT_CONTROL_2_8021Q_FALLBACK
- Explicitly add a VLAN entry for vid=0.

Guenter
Re: [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls
Thomas Graf tg...@suug.ch writes:

> On 06/02/15 at 01:26pm, Eric W. Biederman wrote:
>> What we really want here is xfrm-lite. By lite I mean the tunnel
>> selection criteria is simple enough that it fits into the normal
>> routing table instead of having to do weird flow based magic that
>> is rarely needed.
>>
>> I believe what we want are the xfrm stacking of dst entries.
>
> I assume you are referring to reusing the selector and stacked
> dst. I considered that for the transmit side.
>
> Can you elaborate on this some more? How would this look like
> for the specific case of VXLAN? Any thoughts on the receive
> side? You also mention that you dislike the net_device approach.
> What do you suggest instead? The encapsulation is often postponed
> to after the packet is fully constructed. Where should it get
> hooked into?

Thomas, I may have misunderstood what you are trying to do.

Is what you were aiming for roughly the existing RTA_FLOW so you can
transmit packets out one network device and have enough information to
know which of a set of tunnels of a given type you want the packets go
into?

Eric
Re: [RFC 8/9] net: dsa: mv88e6352: set port 802.1Q mode to Secure
Hi Guenter,

On Jun 2, 2015, at 10:31 AM, Guenter Roeck li...@roeck-us.net wrote:
> On 06/01/2015 06:27 PM, Vivien Didelot wrote:
>> This commit changes the 802.1Q mode of each port from Disabled to
>> Secure. This enables the VLAN support, by checking the VTU entries
>> on ingress.
>>
>> Signed-off-by: Vivien Didelot vivien.dide...@savoirfairelinux.com
>> ---
>>  drivers/net/dsa/mv88e6xxx.c | 14 +++++++-------
>>  drivers/net/dsa/mv88e6xxx.h |  5 +++++
>>  2 files changed, 12 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
>> index ed49bd8..35243d8 100644
>> --- a/drivers/net/dsa/mv88e6xxx.c
>> +++ b/drivers/net/dsa/mv88e6xxx.c
>> @@ -1723,13 +1723,11 @@ static int mv88e6xxx_setup_port(struct dsa_switch *ds, int port)
>>  		goto abort;
>>  	}
>>  
>> -	/* Port Control 2: don't force a good FCS, set the maximum
>> -	 * frame size to 10240 bytes, don't let the switch add or
>> -	 * strip 802.1q tags, don't discard tagged or untagged frames
>> -	 * on this port, do a destination address lookup on all
>> -	 * received packets as usual, disable ARP mirroring and don't
>> -	 * send a copy of all transmitted/received frames on this port
>> -	 * to the CPU.
>> +	/* Port Control 2: don't force a good FCS, set the maximum frame size to
>> +	 * 10240 bytes, enable secure 802.1q tags, don't discard tagged or
>> +	 * untagged frames on this port, do a destination address lookup on all
>> +	 * received packets as usual, disable ARP mirroring and don't send a
>> +	 * copy of all transmitted/received frames on this port to the CPU.
>>  	 */
>>  	reg = 0;
>>  	if (mv88e6xxx_6352_family(ds) || mv88e6xxx_6351_family(ds) ||
>> @@ -1751,6 +1749,8 @@ static int mv88e6xxx_setup_port(struct dsa_switch *ds, int port)
>>  		reg |= PORT_CONTROL_2_FORWARD_UNKNOWN;
>>  	}
>>  
>> +	reg |= PORT_CONTROL_2_8021Q_SECURE;
>> +
>
> Hi Vivien,
>
> Unless I misunderstand the documentation, this effectively disables
> VLAN support on non-bridge ports, especially since the ndo_ functions
> to add VLAN entries to such ports are not implemented.
>
> Is that intentional, or am I missing something ?

Indeed, I intentionally set the port mode to Secure to work on 802.1q.

For both cases, the Fallback mode should be enough; this mode checks
the VTU for a valid entry, otherwise checks the port-based VLAN map.

Supporting port-based VLAN looks like another tricky thread. Ideally,
this must be configurable. In my case I do need strict 802.1q. Can
ethtool/iproute2 do something about the port 802.1q mode?

Thanks,
-v
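
The difference between the modes, as described in this thread, can be written down as a small decision function. This is a toy model, assuming only what the messages say: Secure accepts a frame only when its VID has a valid VTU entry, Fallback tries the VTU first and otherwise consults the port-based VLAN map, and Disabled uses the port-based map alone. The enum, the function, and its parameters are hypothetical names; the real switch applies further checks (such as port membership) that are not modelled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three 802.1Q port modes named in the thread. */
enum toy_8021q_mode {
	TOY_8021Q_DISABLED,
	TOY_8021Q_FALLBACK,
	TOY_8021Q_SECURE,
};

/* Decide whether a frame is accepted on ingress in this simplified model.
 *   vid_in_vtu:      the frame's VID has a valid VTU entry
 *   port_map_allows: the port-based VLAN map would let the frame through
 */
static bool toy_ingress_accepts(enum toy_8021q_mode mode,
				bool vid_in_vtu, bool port_map_allows)
{
	switch (mode) {
	case TOY_8021Q_SECURE:
		/* Strict 802.1Q: no VTU entry means the frame is dropped. */
		return vid_in_vtu;
	case TOY_8021Q_FALLBACK:
		/* VTU first, port-based VLAN map as the fallback. */
		return vid_in_vtu || port_map_allows;
	case TOY_8021Q_DISABLED:
	default:
		/* VTU is not consulted at all. */
		return port_map_allows;
	}
}
```

Under this model an untagged frame with no VTU entry for vid 0 is rejected in Secure mode and accepted in Fallback mode, which matches the two workarounds reported for the non-VLAN configuration.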
[PATCH V2 0/2] pci: Provide a flag to access VPD through function 0
Many multi-function devices provide shared registers in extended
config space for accessing VPD. The behavior of these registers
means that the state must be tracked and access locked correctly
for accesses not to hang or worse. One way to meet these needs is
to always perform the accesses through function 0, thereby using
the state tracking and mutex that already exist.

To provide this behavior, add a dev_flags bit to indicate that this
should be done. This bit can then be set for any non-zero function
that needs to redirect such VPD access to function 0. Do not set
this bit on the zero function or there will be an infinite recursion.

The second patch uses this new flag to invoke this behavior on all
multi-function Intel Ethernet devices.

Signed-off-by: Mark Rustad mark.d.rus...@intel.com
---
Changes in V2:
- Corrected a spelling error in a log message
- Added checks to see that the referenced function 0 is reasonable

---

Mark D Rustad (2):
      pci: Add dev_flags bit to access VPD through function 0
      pci: Add VPD quirk for Intel Ethernet devices

 drivers/pci/access.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
 drivers/pci/quirks.c |  9 +++++++++
 2 files changed, 56 insertions(+), 1 deletion(-)

-- 
Mark Rustad, Network Division, Intel Corporation
[PATCH V2 1/2] pci: Add dev_flags bit to access VPD through function 0
Add a dev_flags bit, PCI_DEV_FLAGS_VPD_REF_F0, to access VPD through
function 0 to provide VPD access on other functions. This solves
concurrent access problems on many devices without changing the
attributes exposed in sysfs. Never set this bit on function 0 or there
will be an infinite recursion.

Signed-off-by: Mark Rustad mark.d.rus...@intel.com
---
Changes in V2:
- Corrected spelling in log message
- Added checks to see that the referenced function 0 is reasonable
---
 drivers/pci/access.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index d9b64a175990..74634d4868a2 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -439,6 +439,40 @@ static const struct pci_vpd_ops pci_vpd_pci22_ops = {
 	.release = pci_vpd_pci22_release,
 };
 
+static ssize_t pci_vpd_f0_read(struct pci_dev *dev, loff_t pos, size_t count,
+			       void *arg)
+{
+	struct pci_dev *tdev = pci_get_slot(dev->bus, PCI_SLOT(dev->devfn));
+	ssize_t ret;
+
+	if (!tdev)
+		return -ENODEV;
+
+	ret = pci_read_vpd(tdev, pos, count, arg);
+	pci_dev_put(tdev);
+	return ret;
+}
+
+static ssize_t pci_vpd_f0_write(struct pci_dev *dev, loff_t pos, size_t count,
+				const void *arg)
+{
+	struct pci_dev *tdev = pci_get_slot(dev->bus, PCI_SLOT(dev->devfn));
+	ssize_t ret;
+
+	if (!tdev)
+		return -ENODEV;
+
+	ret = pci_write_vpd(tdev, pos, count, arg);
+	pci_dev_put(tdev);
+	return ret;
+}
+
+static const struct pci_vpd_ops pci_vpd_f0_ops = {
+	.read = pci_vpd_f0_read,
+	.write = pci_vpd_f0_write,
+	.release = pci_vpd_pci22_release,
+};
+
 int pci_vpd_pci22_init(struct pci_dev *dev)
 {
 	struct pci_vpd_pci22 *vpd;
@@ -447,12 +481,24 @@ int pci_vpd_pci22_init(struct pci_dev *dev)
 	cap = pci_find_capability(dev, PCI_CAP_ID_VPD);
 	if (!cap)
 		return -ENODEV;
+	if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) {
+		struct pci_dev *tdev;
+
+		tdev = pci_get_slot(dev->bus, PCI_SLOT(dev->devfn));
+		if (!tdev || !dev->multifunction || !tdev->multifunction ||
+		    dev->class != tdev->class || dev->vendor != tdev->vendor ||
+		    dev->device != tdev->device)
+			return -ENODEV;
+	}
 	vpd = kzalloc(sizeof(*vpd), GFP_ATOMIC);
 	if (!vpd)
 		return -ENOMEM;
 
 	vpd->base.len = PCI_VPD_PCI22_SIZE;
-	vpd->base.ops = &pci_vpd_pci22_ops;
+	if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0)
+		vpd->base.ops = &pci_vpd_f0_ops;
+	else
+		vpd->base.ops = &pci_vpd_pci22_ops;
 	mutex_init(&vpd->lock);
 	vpd->cap = cap;
 	vpd->busy = false;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 353db8dc4c6e..194df6d635e6 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -180,6 +180,8 @@ enum pci_dev_flags {
 	PCI_DEV_FLAGS_NO_BUS_RESET = (__force pci_dev_flags_t) (1 << 6),
 	/* Do not use PM reset even if device advertises NoSoftRst- */
 	PCI_DEV_FLAGS_NO_PM_RESET = (__force pci_dev_flags_t) (1 << 7),
+	/* Get VPD from function 0 VPD */
+	PCI_DEV_FLAGS_VPD_REF_F0 = (__force pci_dev_flags_t) (1 << 8),
 };
 
 enum pci_irq_reroute_variant {
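
Both this patch and the quirk that uses it depend on splitting a devfn into slot and function. The sketch below shows that arithmetic in user space. It assumes the usual encoding (5-bit slot in the upper bits, 3-bit function in the lower bits); the `TOY_` macros and `toy_` helpers are hypothetical names chosen so they are not mistaken for the kernel's own definitions.

```c
#include <assert.h>

/* Assumed devfn layout: bits 7..3 = slot (device), bits 2..0 = function. */
#define TOY_PCI_DEVFN(slot, func)	((((slot) & 0x1f) << 3) | ((func) & 0x07))
#define TOY_PCI_SLOT(devfn)		(((devfn) >> 3) & 0x1f)
#define TOY_PCI_FUNC(devfn)		((devfn) & 0x07)

/* devfn of function 0 in the same slot as the given function. */
static unsigned int toy_function0_devfn(unsigned int devfn)
{
	return TOY_PCI_DEVFN(TOY_PCI_SLOT(devfn), 0);
}

/* Non-zero when the function would be redirected, i.e. it is not
 * function 0 itself (function 0 must never point at itself). */
static int toy_needs_f0_redirect(unsigned int devfn)
{
	return TOY_PCI_FUNC(devfn) != 0;
}
```

For slot 3, function 1 (devfn 0x19) the function-0 devfn is 0x18, while the bare slot number is 3. As far as I can tell, pci_get_slot() expects an encoded devfn rather than a bare slot number, in which case the PCI_SLOT(dev->devfn) argument in the posted code would need re-encoding; treat that as a reader's guess, not something this thread establishes.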
[PATCH net iproute2 v4] mpls: always set type RTN_UNICAST and scope RT_SCOPE_UNIVERSE for route add/deletes
From: Roopa Prabhu ro...@cumulusnetworks.com

This patch fixes incorrect -EINVAL errors due to invalid scope and type
during mpls route deletes.

$ip -f mpls route add 100 as 200 via inet 10.1.1.2 dev swp1

$ip -f mpls route show
100 as to 200 via inet 10.1.1.2 dev swp1

$ip -f mpls route del 100 as 200 via inet 10.1.1.2 dev swp1
RTNETLINK answers: Invalid argument

$ip -f mpls route del 100
RTNETLINK answers: Invalid argument

After patch:

$ip -f mpls route show
100 as to 200 via inet 10.1.1.2 dev swp1

$ip -f mpls route del 100 as 200 via inet 10.1.1.2 dev swp1

$ip -f mpls route show

Always set type to RTN_UNICAST for mpls route add/deletes.
Also to keep things consistent with kernel set scope to
RT_SCOPE_UNIVERSE for both mpls and ipv6 routes. Both mpls and ipv6
route deletes ignore scope.

Suggested-by: Eric W. Biederman ebied...@xmission.com
Signed-off-by: Roopa Prabhu ro...@cumulusnetworks.com
Signed-off-by: Vivek Venkataraman vi...@cumulusnetworks.com
--
v4 move fix to iproute2
---
 ip/iproute.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index 670a4c6..d0b9910 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -803,6 +803,7 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
 	int scope_ok = 0;
 	int table_ok = 0;
 	int raw = 0;
+	int type_ok = 0;
 
 	memset(&req, 0, sizeof(req));
 
@@ -1095,6 +1096,7 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
 				   rtnl_rtntype_a2n(&type, *argv) == 0) {
 				NEXT_ARG();
 				req.r.rtm_type = type;
+				type_ok = 1;
 			}
 
 			if (matches(*argv, "help") == 0)
@@ -1136,6 +1138,9 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
 	if (nhs_ok)
 		parse_nexthops(&req.n, &req.r, argc, argv);
 
+	if (req.r.rtm_family == AF_UNSPEC)
+		req.r.rtm_family = AF_INET;
+
 	if (!table_ok) {
 		if (req.r.rtm_type == RTN_LOCAL ||
 		    req.r.rtm_type == RTN_BROADCAST ||
@@ -1144,8 +1149,11 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
 			req.r.rtm_table = RT_TABLE_LOCAL;
 	}
 	if (!scope_ok) {
-		if (req.r.rtm_type == RTN_LOCAL ||
-		    req.r.rtm_type == RTN_NAT)
+		if (req.r.rtm_family == AF_INET6 ||
+		    req.r.rtm_family == AF_MPLS)
+			req.r.rtm_scope = RT_SCOPE_UNIVERSE;
+		else if (req.r.rtm_type == RTN_LOCAL ||
+			 req.r.rtm_type == RTN_NAT)
 			req.r.rtm_scope = RT_SCOPE_HOST;
 		else if (req.r.rtm_type == RTN_BROADCAST ||
 			 req.r.rtm_type == RTN_MULTICAST ||
@@ -1160,8 +1168,8 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
 		}
 	}
 
-	if (req.r.rtm_family == AF_UNSPEC)
-		req.r.rtm_family = AF_INET;
+	if (!type_ok && req.r.rtm_family == AF_MPLS)
+		req.r.rtm_type = RTN_UNICAST;
 
 	if (rtnl_talk(&rth, &req.n, 0, 0, NULL) < 0)
 		return -2;
-- 
1.7.10.4
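
The defaulting rules this patch introduces can be pulled out into two small functions for illustration. This is a sketch, not iproute2 code: the `sketch_` names are hypothetical, only the branches visible in the hunks above are modelled, and the branches the diff context cuts off are left as "keep the current value". The `AF_MPLS` fallback definition of 28 is an assumption for older C libraries that do not provide it.

```c
#include <assert.h>
#include <sys/socket.h>
#include <linux/rtnetlink.h>

#ifndef AF_MPLS
#define AF_MPLS 28	/* assumed value, for C libraries that lack it */
#endif

/* Scope chosen when the user gave none. IPv6 and MPLS always get
 * RT_SCOPE_UNIVERSE; local and NAT routes get RT_SCOPE_HOST; every other
 * case is outside the quoted hunk, so the current value is returned. */
static int sketch_default_scope(int family, int type, int current)
{
	if (family == AF_INET6 || family == AF_MPLS)
		return RT_SCOPE_UNIVERSE;
	if (type == RTN_LOCAL || type == RTN_NAT)
		return RT_SCOPE_HOST;
	return current;
}

/* Route type chosen when the user gave none: MPLS routes default to
 * RTN_UNICAST, other families keep whatever was already set. */
static int sketch_default_type(int family, int type_given, int type)
{
	if (!type_given && family == AF_MPLS)
		return RTN_UNICAST;
	return type;
}
```

Moving the AF_UNSPEC to AF_INET fallback ahead of these checks, as the patch does, means the family is already final when the scope default is computed.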
Re: [RFC 7/9] net: dsa: mv88e6352: lock CPU port from learning addresses
Hi Guenter,

On Jun 2, 2015, at 10:24 AM, Guenter Roeck li...@roeck-us.net wrote:
> On 06/01/2015 06:27 PM, Vivien Didelot wrote:
>> This commit disables SA learning and refreshing for the CPU port.
>
> Hi Vivien,
>
> This patch also seems to be unrelated to the rest of the series.
>
> Can you add an explanation why it is needed ?
>
> With this in place, how does the CPU port SA find its way into the
> fdb ? Do we assume that it will be configured statically ?
> An explanation might be useful.

Without this patch, I noticed the CPU port was stealing the SA of a PC
behind a switch port. This happened when the port was a bridge member,
as the bridge was relaying broadcast coming from one switch port to the
other switch ports in the same vlan.

Thanks,
-v
Re: How do I avoid recvmsg races with IP_RECVERR?
On Tue, Jun 2, 2015 at 2:17 PM, Hannes Frederic Sowa han...@stressinduktion.org wrote:
> On Tue, Jun 2, 2015, at 21:40, Andy Lutomirski wrote:
>> As far as I can tell, enabling IP_RECVERR causes the presence of a
>> queued error to cause recvmsg, etc to return an error (once). It's
>> worse, though: a new error can be queued asynchronously at any time,
>> thus setting sk_err to a nonzero value. How do I sensibly
>> distinguish recvmsg failures due to genuine errors receiving
>> messages from recvmsg failures because there's a queued error?
>>
>> The only way I can see to get reliable error handling is to literally
>> call recvmsg in a loop:
>>
>> while (true /* or while POLLIN is set */) {
>>   int ret = recvmsg(..., MSG_ERRQUEUE not set);
>>   if (ret < 0 /* what goes here? */) {
>>     whoops! this might be a harmless asynchronous error!
>>     take no action!
>>   }
>
> I see either two possibilities:
>
> We export the icmp_err_convert tables along with the udp_lib_err error
> conversions to user space and spice them up with flags to mark if they
> are transient (icmp_err_convert already has a fatal flag).

This seems overcomplicated. I'd rather have a flag I pass to tell the
kernel that I don't want to see transient errors (and that I'll clear
them myself using POLLERR and either MSG_ERRQUEUE or SO_ERROR).

> Otherwise you should be able to call recvmsg with MSG_ERRQUEUE set
> after you got a ret < 0 when calling without MSG_ERRQUEUE and inspect
> the sock_extended_err, no?

I do this already, which makes me think that there's a bug or another
race somewhere. I've only seen a failure once in several years of
operation.

The failure happened on a ping socket. I suspect that the race is:

ping_err: ip_icmp_error(...);
user: recvmsg(MSG_ERRQUEUE) and dequeues the error.
ping_err: sk_err = err;
user: recvmsg(MSG_ERRQUEUE not set), and recvmsg sees and clears the
      error via sock_error.
user: recvmsg(MSG_ERRQUEUE), and recvmsg returns -EAGAIN.

Now the user code thinks that it was a real (non-transient) error and
aborts.

Shouldn't that sk->sk_err = err assignment at least use WRITE_ONCE?

Even if this race were fixed, this interface still sucks IMO.

--Andy
Re: [net-next v2] ipv4: inet_bind: check the addr_len first
On Tue, Jun 2, 2015, at 17:13, Denis Kirjanov wrote:
> On 6/2/15, Hannes Frederic Sowa han...@stressinduktion.org wrote:
>> Hello,
>>
>> On Tue, Jun 2, 2015, at 14:21, Denis Kirjanov wrote:
>>> Perform the address length check first, before calling the proto
>>> specific bind() function
>>
>> Can you give more detail why you did this change and what bug it
>> fixes?
>
> I've sent the v2 version with the net-next tag. The idea is simple:
> check the error condition first and then do the useful work.

Hmm, IMHO the specific proto-bind handlers have to take care of the
check themselves. You could argue that we should do the checks always
in inet_bind but then you have to remove the addr_len checks from the
raw and ping bind handlers, otherwise people become confused if they
modify the code.

I am in favor of leaving the current logic as is, sorry.

Thanks,
Hannes
Re: How do I avoid recvmsg races with IP_RECVERR?
On Tue, Jun 2, 2015, at 23:33, Andy Lutomirski wrote:
> On Tue, Jun 2, 2015 at 2:17 PM, Hannes Frederic Sowa han...@stressinduktion.org wrote:
>> On Tue, Jun 2, 2015, at 21:40, Andy Lutomirski wrote:
>>> As far as I can tell, enabling IP_RECVERR causes the presence of a
>>> queued error to cause recvmsg, etc to return an error (once). It's
>>> worse, though: a new error can be queued asynchronously at any
>>> time, thus setting sk_err to a nonzero value. How do I sensibly
>>> distinguish recvmsg failures due to genuine errors receiving
>>> messages from recvmsg failures because there's a queued error?
>>>
>>> The only way I can see to get reliable error handling is to
>>> literally call recvmsg in a loop:
>>>
>>> while (true /* or while POLLIN is set */) {
>>>   int ret = recvmsg(..., MSG_ERRQUEUE not set);
>>>   if (ret < 0 /* what goes here? */) {
>>>     whoops! this might be a harmless asynchronous error!
>>>     take no action!
>>>   }
>>
>> I see either two possibilities:
>>
>> We export the icmp_err_convert tables along with the udp_lib_err
>> error conversions to user space and spice them up with flags to mark
>> if they are transient (icmp_err_convert already has a fatal flag).
>
> This seems overcomplicated. I'd rather have a flag I pass to tell the
> kernel that I don't want to see transient errors (and that I'll clear
> them myself using POLLERR and either MSG_ERRQUEUE or SO_ERROR).
>
>> Otherwise you should be able to call recvmsg with MSG_ERRQUEUE set
>> after you got a ret < 0 when calling without MSG_ERRQUEUE and
>> inspect the sock_extended_err, no?
>
> I do this already, which makes me think that there's a bug or another
> race somewhere. I've only seen a failure once in several years of
> operation.
>
> The failure happened on a ping socket. I suspect that the race is:
>
> ping_err: ip_icmp_error(...);
> user: recvmsg(MSG_ERRQUEUE) and dequeues the error.
> ping_err: sk_err = err;
> user: recvmsg(MSG_ERRQUEUE not set), and recvmsg sees and clears the
>       error via sock_error.
> user: recvmsg(MSG_ERRQUEUE), and recvmsg returns -EAGAIN.
>
> Now the user code thinks that it was a real (non-transient) error and
> aborts.
>
> Shouldn't that sk->sk_err = err assignment at least use WRITE_ONCE?

Hmm, I don't think this will help.

> Even if this race were fixed, this interface still sucks IMO.

Yes. :/

My proposal would be to make the error conversion lazy:

Keeping duplicate data is not a good idea in general: So we shouldn't
use sk->sk_err if IP_RECVERR is set at all but let sock_error just use
the sk_error_queue and extract the error code from there.

Only if IP_RECVERR was not set, we use sk->sk_err logic.

What do you think?

Bye,
Hannes
Re: [RFC net-next 0/3] IP imposition of per-nh MPLS encap
On 02/06/15 22:10, Eric W. Biederman wrote:
> Robert Shearman rshea...@brocade.com writes:
>
>> On 02/06/15 19:11, Eric W. Biederman wrote:
>>> Robert Shearman rshea...@brocade.com writes:
>>>
>>>> In order to be able to function as a Label Edge Router in an MPLS
>>>> network, it is necessary to be able to take IP packets and impose
>>>> an MPLS encap and forward them out. The traditional approach of
>>>> setting up an interface for each tunnel endpoint doesn't scale for
>>>> the common MPLS use-cases where each IP route tends to be assigned
>>>> a different label as encap.
>>>>
>>>> The solution suggested here for further discussion is to provide
>>>> the facility to define encap data on a per-nexthop basis using a
>>>> new netlink attribute, RTA_ENCAP, which would be opaque to the
>>>> IPv4/IPv6 forwarding code, but interpreted by the virtual
>>>> interface assigned to the nexthop.
>>>>
>>>> A new ipmpls interface type is defined to show the use of this
>>>> facility to allow IP packets to be imposed with an MPLS
>>>> encap. However, the facility is designed to be general enough to
>>>> be used by any encapsulation/tunneling mechanism that has similar
>>>> requirements of high-scale, high-variation-of-encap.
>>>
>>> I am still digging into the details but adding a new network device
>>> to make this possible is very undesirable.
>>>
>>> It is a pain point. Those network devices get to be a major source
>>> of memory consumption when there are 4K network namespaces in
>>> existence.
>>>
>>> It is conceptually wrong. The network device will never be used as
>>> an ordinary network device. All the network device gives you is the
>>> ability to avoid creating an enumeration of different kinds of
>>> encapsulation.
>>
>> This isn't true. The network device also gives some of the things you
>> take for granted. Things like fragmentation through specifying the
>> mtu on the shared tunnel device, being able to specify rules using
>> the shared tunnel output device, IP stats, and the ability to specify
>> a different destination namespace.
>
> Granted you get a few more things. It is still conceptually wrong as
> the network device will never be used as an ordinary network device.
>
> Fragmentation is already silly because we are talking about multiple
> tunnels with different properties. You need per-route mtu to handle
> that case.

It's unlikely you'll have a huge variation in the mtus across routes,
unless you're running in an ISP environment. In the example uses we've
got in hand, it's highly likely there'll only be a handful of different
mtus, if that.

> Further I am not saying you don't need an output device (which is what
> is needed to specify a different destination namespace) I am saying
> that having a funny mpls device is wrong as far as I can see.
>
> Certainly it is a lot of bloody unnecessary overhead.
>
> If we are going to design for maximum scaling (and 1 million+ routes)
> sounds like maximum scaling we should see how far we can go without
> dragging in the horrible heaviness of additional network devices. 35K
> a piece last I measured it. Just a small handful of them are already
> scaling issues for network namespaces.

For the ipmpls interface I've implemented here, you only need one per
namespace. You could argue the same for the veth interfaces which would
be much more commonly used in network namespaces.

BTW, maybe I've missed something, or maybe netdevs have gone on a diet,
but I count the cost of creating a basic interface at ~2700 bytes on
x86_64:

    sizeof(struct net_device) /* 2112 */ +
    1 * sizeof(struct netdev_queue) /* 384 */ +
    1 * sizeof(struct netdev_rx_queue) /* 128 */ +
    sizeof(struct netdev_hw_addr) /* 80 */ +
    sizeof(int) * nr_poss_cpus /* 4 * n */

Thanks,
Rob
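
The ~2700 byte figure follows directly from the sizes quoted in the message: 2112 + 384 + 128 + 80 = 2704 bytes, plus 4 bytes per possible CPU. A minimal sketch of that arithmetic, assuming the quoted numbers (they are the values given for that x86_64 build, not measured here) and using a made-up function name:

```c
#include <assert.h>
#include <stddef.h>

/* Estimated bytes for one basic interface, using the sizes quoted in the
 * message: net_device 2112, one netdev_queue 384, one netdev_rx_queue 128,
 * one netdev_hw_addr 80, plus one 4-byte int per possible CPU. */
static size_t netdev_estimate_bytes(size_t nr_possible_cpus)
{
	return 2112 + 384 + 128 + 80 + 4 * nr_possible_cpus;
}
```

With 8 possible CPUs the estimate is 2736 bytes, roughly a thirteenth of the 35K per device quoted earlier in the thread, which is the gap the two participants are disagreeing about.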
Re: [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls
Thomas Graf tg...@suug.ch writes:
On 06/02/15 at 01:26pm, Eric W. Biederman wrote:

What we really want here is xfrm-lite. By lite I mean the tunnel selection criteria is simple enough that it fits into the normal routing table instead of having to do weird flow based magic that is rarely needed. I believe what we want are the xfrm stacking of dst entries.

I assume you are referring to reusing the selector and stacked dst. I considered that for the transmit side. Can you elaborate on this some more? How would this look like for the specific case of VXLAN? Any thoughts on the receive side? You also mention that you dislike the net_device approach. What do you suggest instead? The encapsulation is often postponed to after the packet is fully constructed. Where should it get hooked into?

Things I think xfrm does correctly today:
- Transmitting things when an appropriate dst has been found.

Things I think xfrm could do better:
- Finding the dst entry. Having to perform a separate lookup in a second set of tables looks slow, and not much maintained.

So if we focus on the normal routing case where lookup works today (aka no source port or destination port based routing or any of the other weird things, so we can use a standard fib lookup) I think I can explain what I imagine things would look like. To be clear I am focusing on the very light weight tunnels and I am not certain vxlan applies. It may be more reasonable to simply have a single ethernet looking device that speaks vxlan behind the scenes. If I look at vxlan as a set of ipv4 host routes (no arp, no unknown host support) it looks like the kind of light-weight tunnel that we are dealing with for mpls.

On the reception side packets that match the magic udp socket have their tunneling bits stripped off and pushed up to the ip layer. Roughly equivalent to the current af_mpls code.

On the transmit side there would be a host route for each remote host. In the fib we would store a pointer to a data structure that holds a precomputed header to be prepended to the packet (inner ethernet, vxlan, outer udp, outer ip). That data pointer would become dst->xfrm when the route lookup happens and we generate a route/dst entry. There would also be an output function in the fib and that output function would become dst->output. I would be more specific but I forget the details of the fib_trie data structures.

The output function in the dst entry in the ipv4 route would know how to interpret the pointer in the ipv4 routing table, prepend the precomputed headers, update the precomputed udp header's source port with the flow hash of the inner packet, and have an inner dst, so that it would essentially call ip_finish_output2 again, sending the packet to its destination.

There is some wiggle room but that is how I imagine things working, and that is what I think we want for the mpls case. Adding two pointers to the fib could be interesting. One pointer can be a union with the output network device, the other pointer I am not certain about. And of course we get fun cases where we have tunnels running through other tunnels. So there likely needs to be a bit of indirection going on.

The problem I think needs to be solved is how to make tunnels very light weight and cheap, so they can scale to 1 million+. Enough so that the kernel can hold a full routing table full of tunnels. It looks like xfrm is almost there but its data structures appear to be excessively complicated and inscrutable, and they require an extra lookup.

Eric
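The transmit path described above can be sketched in userspace C. This is only a model under stated assumptions: struct lwt_encap, lwt_output and the field names are hypothetical and are not kernel APIs, the route lookup is assumed to have already produced the encap pointer, and checksums are ignored. It shows only the step of prepending a precomputed header and patching the outer UDP source port with a flow hash.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the per-route encap data; not a kernel structure. */
struct lwt_encap {
	size_t  hdr_len;        /* length of the precomputed header */
	size_t  udp_sport_off;  /* offset of the outer UDP source port */
	uint8_t hdr[64];        /* precomputed outer headers */
};

/*
 * Copy the precomputed header in front of the payload and write the
 * flow hash into the outer UDP source port (network byte order).
 * Returns the total frame length, or 0 if the arguments do not fit.
 */
static size_t lwt_output(const struct lwt_encap *e, uint16_t flow_hash,
			 const uint8_t *payload, size_t payload_len,
			 uint8_t *out, size_t out_len)
{
	if (e->hdr_len > sizeof(e->hdr) || e->udp_sport_off + 2 > e->hdr_len)
		return 0;
	if (out_len < e->hdr_len || out_len - e->hdr_len < payload_len)
		return 0;

	memcpy(out, e->hdr, e->hdr_len);
	memcpy(out + e->hdr_len, payload, payload_len);

	out[e->udp_sport_off]     = (uint8_t)(flow_hash >> 8);
	out[e->udp_sport_off + 1] = (uint8_t)(flow_hash & 0xff);

	return e->hdr_len + payload_len;
}
```

In this model the per-packet work is two copies and a two-byte store, with no second lookup, which is the property the mail is asking for.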
Re: [RFC net-next 3/3] mpls: new ipmpls device for encapsulating IP packets as mpls
On 06/02/15 at 01:26pm, Eric W. Biederman wrote: What we really want here is xfrm-lite. By lite I mean the tunnel selection criteria is simple enough that it fits into the normal routing table instead of having to do weird flow based magic that is rarely needed. I believe what we want are the xfrm stacking of dst entries. I assume you are referring to reusing the selector and stacked dst. I considered that for the transmit side. Can you elaborate on this some more? How would this look like for the specific case of VXLAN? Any thoughts on the receive side? You also mention that you dislike the net_device approach. What do you suggest instead? The encapsulation is often postponed to after the packet is fully constructed. Where should it get hooked into?
Re: [RFC net-next 0/3] IP imposition of per-nh MPLS encap
On 06/02/15 at 02:28pm, Robert Shearman wrote: Nesting attributes inside the RTA_ENCAP blob should be supported by the patch series today. Something like this: Sure. I'm not seeing such a construct for the MPLS case yet. I'm happy to rebase my patches on top of your nexthop implementation. It is definitely superior. Are you maintaining a git tree somewhere?
Re: [PATCH net v3 2/2] mpls: fix mpls route deletes to not check for route scope
On 6/2/15, 2:13 PM, Eric W. Biederman wrote:

So I just stopped and looked at what is happening. When you originally reported this you said (or at least I understood) that rtm_scope was not being set in iproute. I assumed that meant it was not being touched and it was taking a default value of zero (or else it was possibly floating). Having looked, neither is true. iproute sets rtm_scope to RT_SCOPE_NOWHERE during delete deliberately to act as a wild card. In the kernel in other protocols: currently ipv4 treats RT_SCOPE_NOWHERE as a wild card during delete, decnet treats RT_SCOPE_NOWHERE as a wild card during delete, and the remaining protocols (ipv6, phonet, and can) that implement RTM_DELROUTE do not look at rtm_scope at all. Further ipv6 and phonet set rtm_scope to RT_SCOPE_UNIVERSE when dumped. Which says to me that we have semantics in the kernel that no one has let userspace know about, and that scares me when there is a misunderstanding between the kernel and userspace about what fields mean. That inevitably leads to bugs. The kind of bugs that I have had to create security fixes for recently. So I really think we should fix this in userspace so that someone reading iproute will have a chance at knowing that these scopes do not exist in ipv6 and mpls and that scope logic is just noise in those cases.

ack, i did start with handling both type and scope in iproute2. I misunderstood you when you said you did not care abt the scope in earlier comments. so i made the kernel not care abt the scope. :) but only handled type in 'iproute2' in v2. now it's clear. I do have a similar patch like below. sorry abt the iterations. I will respin (If you prefer to post your below patch yourself, pls do. I am ok either way). Thanks.

Something like:

From 837dddea49af874fe750ab0712b3ef8066a2f55a Mon Sep 17 00:00:00 2001
From: Eric W. Biederman ebied...@xmission.com
Date: Tue, 2 Jun 2015 15:51:31 -0500
Subject: [PATCH] iproute: When deleting routes don't always set the scope to RT_SCOPE_NOWHERE

IPv6 and MPLS do not implement scopes on addresses and using RT_SCOPE_NOWHERE is just confusing noise. Use RT_SCOPE_UNIVERSE instead so that it is clear what is actually happening in the code.

Signed-off-by: Eric W. Biederman ebied...@xmission.com
---
 ip/iproute.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index fba475f65314..e9b991fdf62f 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -1136,6 +1136,9 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
 	if (nhs_ok)
 		parse_nexthops(&req.n, &req.r, argc, argv);
 
+	if (req.r.rtm_family == AF_UNSPEC)
+		req.r.rtm_family = AF_INET;
+
 	if (!table_ok) {
 		if (req.r.rtm_type == RTN_LOCAL ||
 		    req.r.rtm_type == RTN_BROADCAST ||
@@ -1144,7 +1147,10 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
 			req.r.rtm_table = RT_TABLE_LOCAL;
 	}
 	if (!scope_ok) {
-		if (req.r.rtm_type == RTN_LOCAL ||
+		if (req.r.rtm_family == AF_INET6 ||
+		    req.r.rtm_family == AF_MPLS)
+			req.r.rtm_scope = RT_SCOPE_UNIVERSE;
+		else if (req.r.rtm_type == RTN_LOCAL ||
 		    req.r.rtm_type == RTN_NAT)
 			req.r.rtm_scope = RT_SCOPE_HOST;
 		else if (req.r.rtm_type == RTN_BROADCAST ||
@@ -1160,9 +1166,6 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
 		}
 	}
 
-	if (req.r.rtm_family == AF_UNSPEC)
-		req.r.rtm_family = AF_INET;
-
 	if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
 		return -2;
 
Re: [PATCH net v3 2/2] mpls: fix mpls route deletes to not check for route scope
roopa ro...@cumulusnetworks.com writes:
On 6/2/15, 2:13 PM, Eric W. Biederman wrote:

So I just stopped and looked at what is happening. When you originally reported this you said (or at least I understood) that rtm_scope was not being set in iproute. I assumed that meant it was not being touched and it was taking a default value of zero (or else it was possibly floating). Having looked, neither is true. iproute sets rtm_scope to RT_SCOPE_NOWHERE during delete deliberately to act as a wild card. In the kernel in other protocols: currently ipv4 treats RT_SCOPE_NOWHERE as a wild card during delete, decnet treats RT_SCOPE_NOWHERE as a wild card during delete, and the remaining protocols (ipv6, phonet, and can) that implement RTM_DELROUTE do not look at rtm_scope at all. Further ipv6 and phonet set rtm_scope to RT_SCOPE_UNIVERSE when dumped. Which says to me that we have semantics in the kernel that no one has let userspace know about, and that scares me when there is a misunderstanding between the kernel and userspace about what fields mean. That inevitably leads to bugs. The kind of bugs that I have had to create security fixes for recently. So I really think we should fix this in userspace so that someone reading iproute will have a chance at knowing that these scopes do not exist in ipv6 and mpls and that scope logic is just noise in those cases.

ack, i did start with handling both type and scope in iproute2. I misunderstood you when you said you did not care abt the scope in earlier comments. so i made the kernel not care abt the scope. :) but only handled type in 'iproute2' in v2. now it's clear. I do have a similar patch like below. sorry abt the iterations. I will respin (If you prefer to post your below patch yourself, pls do. I am ok either way). Thanks.

I don't have enough energy to follow through with more than review today.
Eric
Re: [PATCH net v3 2/2] mpls: fix mpls route deletes to not check for route scope
Roopa Prabhu ro...@cumulusnetworks.com writes:

From: Roopa Prabhu ro...@cumulusnetworks.com

Ignore scope for route del messages

So I just stopped and looked at what is happening. When you originally reported this you said (or at least I understood) that rtm_scope was not being set in iproute. I assumed that meant it was not being touched and it was taking a default value of zero (or else it was possibly floating). Having looked, neither is true. iproute sets rtm_scope to RT_SCOPE_NOWHERE during delete deliberately to act as a wild card. In the kernel in other protocols: currently ipv4 treats RT_SCOPE_NOWHERE as a wild card during delete, decnet treats RT_SCOPE_NOWHERE as a wild card during delete, and the remaining protocols (ipv6, phonet, and can) that implement RTM_DELROUTE do not look at rtm_scope at all. Further ipv6 and phonet set rtm_scope to RT_SCOPE_UNIVERSE when dumped. Which says to me that we have semantics in the kernel that no one has let userspace know about, and that scares me when there is a misunderstanding between the kernel and userspace about what fields mean. That inevitably leads to bugs. The kind of bugs that I have had to create security fixes for recently. So I really think we should fix this in userspace so that someone reading iproute will have a chance at knowing that these scopes do not exist in ipv6 and mpls and that scope logic is just noise in those cases.

Something like:

From 837dddea49af874fe750ab0712b3ef8066a2f55a Mon Sep 17 00:00:00 2001
From: Eric W. Biederman ebied...@xmission.com
Date: Tue, 2 Jun 2015 15:51:31 -0500
Subject: [PATCH] iproute: When deleting routes don't always set the scope to RT_SCOPE_NOWHERE

IPv6 and MPLS do not implement scopes on addresses and using RT_SCOPE_NOWHERE is just confusing noise. Use RT_SCOPE_UNIVERSE instead so that it is clear what is actually happening in the code.

Signed-off-by: Eric W. Biederman ebied...@xmission.com
---
 ip/iproute.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index fba475f65314..e9b991fdf62f 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -1136,6 +1136,9 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
 	if (nhs_ok)
 		parse_nexthops(&req.n, &req.r, argc, argv);
 
+	if (req.r.rtm_family == AF_UNSPEC)
+		req.r.rtm_family = AF_INET;
+
 	if (!table_ok) {
 		if (req.r.rtm_type == RTN_LOCAL ||
 		    req.r.rtm_type == RTN_BROADCAST ||
@@ -1144,7 +1147,10 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
 			req.r.rtm_table = RT_TABLE_LOCAL;
 	}
 	if (!scope_ok) {
-		if (req.r.rtm_type == RTN_LOCAL ||
+		if (req.r.rtm_family == AF_INET6 ||
+		    req.r.rtm_family == AF_MPLS)
+			req.r.rtm_scope = RT_SCOPE_UNIVERSE;
+		else if (req.r.rtm_type == RTN_LOCAL ||
 		    req.r.rtm_type == RTN_NAT)
 			req.r.rtm_scope = RT_SCOPE_HOST;
 		else if (req.r.rtm_type == RTN_BROADCAST ||
@@ -1160,9 +1166,6 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
 		}
 	}
 
-	if (req.r.rtm_family == AF_UNSPEC)
-		req.r.rtm_family = AF_INET;
-
 	if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
 		return -2;
 
-- 
2.2.1
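For readers who want the scope defaulting in isolation, here is a minimal sketch of the rule the patch above proposes. default_scope is a hypothetical helper, not a function in iproute2. It models only the branches visible in the hunk (family override, then RTN_LOCAL/RTN_NAT) and otherwise returns whatever scope the caller already had, which is an assumption standing in for the rest of the if/else chain. The AF_MPLS fallback value is an assumption for older libc headers.

```c
#include <sys/socket.h>
#include <linux/rtnetlink.h>

#ifndef AF_MPLS
#define AF_MPLS 28	/* assumed fallback for older libc headers */
#endif

/*
 * Hypothetical helper: choose the scope to send when the user gave none.
 * "current" is the scope the request already carries.
 */
static unsigned char default_scope(unsigned char family, unsigned char type,
				   unsigned char current)
{
	/* Resolve the family first, as the patch does by moving this check up. */
	if (family == AF_UNSPEC)
		family = AF_INET;

	/* IPv6 and MPLS have no route scopes, so state that explicitly. */
	if (family == AF_INET6 || family == AF_MPLS)
		return RT_SCOPE_UNIVERSE;

	if (type == RTN_LOCAL || type == RTN_NAT)
		return RT_SCOPE_HOST;

	return current;
}
```

With this ordering an MPLS delete no longer goes out with RT_SCOPE_NOWHERE, while an IPv4 delete keeps the wild card value it had before.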
Re: [PATCH net iproute2 v3 1/2] mpls: always set type as RTN_UNICAST for route add/deletes
Roopa Prabhu ro...@cumulusnetworks.com writes:

From: Roopa Prabhu ro...@cumulusnetworks.com

Kernel expects type RTN_UNICAST for mpls route/dels

There is almost a bug in this patch. You test req.r.rtm_family just before the default case of AF_UNSPEC is set to AF_INET. That should not affect anything in this case, but it is downright confusing to think about and could lead to maintenance problems in the future.

Otherwise
Acked-by: Eric W. Biederman ebied...@xmission.com

Signed-off-by: Vivek Venkataraman vi...@cumulusnetworks.com
Signed-off-by: Roopa Prabhu ro...@cumulusnetworks.com
Reviewed-by: Robert Shearman rshea...@brocade.com
---
 ip/iproute.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/ip/iproute.c b/ip/iproute.c
index 670a4c6..71c088b 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -803,6 +803,7 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
 	int scope_ok = 0;
 	int table_ok = 0;
 	int raw = 0;
+	int type_ok = 0;
 
 	memset(&req, 0, sizeof(req));
 
@@ -1095,6 +1096,7 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
 				    rtnl_rtntype_a2n(&type, *argv) == 0) {
 					NEXT_ARG();
 					req.r.rtm_type = type;
+					type_ok = 1;
 				}
 
 				if (matches(*argv, "help") == 0)
@@ -1160,6 +1162,9 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
 		}
 	}
 
+	if (!type_ok && req.r.rtm_family == AF_MPLS)
+		req.r.rtm_type = RTN_UNICAST;
+
 	if (req.r.rtm_family == AF_UNSPEC)
 		req.r.rtm_family = AF_INET;
 
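The ordering concern raised in the review can be illustrated with a small sketch that settles the family before anything tests it. struct route_defaults and apply_route_defaults are hypothetical names used only for this illustration. They stand in for the rtm_family and rtm_type fields of the request and are not iproute2 code.

```c
#include <sys/socket.h>
#include <linux/rtnetlink.h>

#ifndef AF_MPLS
#define AF_MPLS 28	/* assumed fallback for older libc headers */
#endif

/* Hypothetical stand-in for the two request fields the patch touches. */
struct route_defaults {
	unsigned char family;
	unsigned char type;
};

static void apply_route_defaults(struct route_defaults *r, int type_ok)
{
	/* Settle the family first so later tests see its final value. */
	if (r->family == AF_UNSPEC)
		r->family = AF_INET;

	/* MPLS routes are always unicast unless the user chose a type. */
	if (!type_ok && r->family == AF_MPLS)
		r->type = RTN_UNICAST;
}
```

The result is the same as in the posted patch for every input, since AF_UNSPEC never equals AF_MPLS. The only difference is that a reader no longer has to reason about a family test that runs before the default is applied.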
[PATCH iproute2 -next] tc: {f,m}_bpf: allow to retrieve uds path from env
Allow retrieving the uds path from the environment, which also makes dealing with export a bit easier.

Signed-off-by: Daniel Borkmann dan...@iogearbox.net
---
 tc/f_bpf.c  | 6 ++++--
 tc/m_bpf.c  | 6 ++++--
 tc/tc_bpf.h | 2 ++
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/tc/f_bpf.c b/tc/f_bpf.c
index 597ef60..c21bf33 100644
--- a/tc/f_bpf.c
+++ b/tc/f_bpf.c
@@ -122,6 +122,7 @@ opt_bpf:
 
 			NEXT_ARG();
 			if (ebpf) {
+				bpf_uds_name = secure_getenv(BPF_ENV_UDS);
 				bpf_obj = *argv;
 				NEXT_ARG();
 
@@ -131,8 +132,9 @@ opt_bpf:
 					bpf_sec_name = *argv;
 					NEXT_ARG();
 				}
-				if (strcmp(*argv, "export") == 0 ||
-				    strcmp(*argv, "exp") == 0) {
+				if (!bpf_uds_name &&
+				    (strcmp(*argv, "export") == 0 ||
+				     strcmp(*argv, "exp") == 0)) {
 					NEXT_ARG();
 					bpf_uds_name = *argv;
 					NEXT_ARG();
diff --git a/tc/m_bpf.c b/tc/m_bpf.c
index 0621157..9ddb667 100644
--- a/tc/m_bpf.c
+++ b/tc/m_bpf.c
@@ -105,6 +105,7 @@ opt_bpf:
 
 			NEXT_ARG();
 			if (ebpf) {
+				bpf_uds_name = secure_getenv(BPF_ENV_UDS);
 				bpf_obj = *argv;
 				NEXT_ARG();
 
@@ -114,8 +115,9 @@ opt_bpf:
 					bpf_sec_name = *argv;
 					NEXT_ARG();
 				}
-				if (strcmp(*argv, "export") == 0 ||
-				    strcmp(*argv, "exp") == 0) {
+				if (!bpf_uds_name &&
+				    (strcmp(*argv, "export") == 0 ||
+				     strcmp(*argv, "exp") == 0)) {
 					NEXT_ARG();
 					bpf_uds_name = *argv;
 					NEXT_ARG();
diff --git a/tc/tc_bpf.h b/tc/tc_bpf.h
index 5a697e5..2ad8812 100644
--- a/tc/tc_bpf.h
+++ b/tc/tc_bpf.h
@@ -25,6 +25,8 @@
 #include "utils.h"
 #include "bpf_scm.h"
 
+#define BPF_ENV_UDS	"TC_BPF_UDS"
+
 int bpf_parse_string(char *arg, bool from_file, __u16 *bpf_len,
 		     char **bpf_string, bool *need_release,
 		     const char separator);
-- 
1.9.3
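The precedence the patch introduces (environment first, "export" argument only as a fallback) can be shown with a short standalone sketch. pick_uds_path and resolve_uds_path are hypothetical helpers, not functions in tc. The sketch also swaps in plain getenv() for portability, whereas the patch itself uses secure_getenv().

```c
#include <stddef.h>
#include <stdlib.h>

#define BPF_ENV_UDS "TC_BPF_UDS"

/* Hypothetical helper: a value from the environment wins over the CLI one. */
static const char *pick_uds_path(const char *env_val, const char *cli_val)
{
	return env_val ? env_val : cli_val;
}

/* Look the variable up and apply the precedence rule above. */
static const char *resolve_uds_path(const char *cli_val)
{
	/* getenv() is an assumption made here; the patch calls secure_getenv(). */
	return pick_uds_path(getenv(BPF_ENV_UDS), cli_val);
}
```

One difference from the patch is worth noting: there, when the variable is set, the "export" keyword is not consumed by that branch at all, while this sketch simply ignores the command-line value.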
Re: How do I avoid recvmsg races with IP_RECVERR?
On Tue, Jun 2, 2015 at 2:42 PM, Hannes Frederic Sowa han...@stressinduktion.org wrote:
On Tue, Jun 2, 2015, at 23:33, Andy Lutomirski wrote:
On Tue, Jun 2, 2015 at 2:17 PM, Hannes Frederic Sowa han...@stressinduktion.org wrote:
On Tue, Jun 2, 2015, at 21:40, Andy Lutomirski wrote:

As far as I can tell, enabling IP_RECVERR causes the presence of a queued error to cause recvmsg, etc to return an error (once). It's worse, though: a new error can be queued asynchronously at any time, thus setting sk_err to a nonzero value. How do I sensibly distinguish recvmsg failures due to genuine errors receiving messages from recvmsg failures because there's a queued error? The only way I can see to get reliable error handling is to literally call recvmsg in a loop:

while (true /* or while POLLIN is set */) {
	int ret = recvmsg(..., MSG_ERRQUEUE not set);
	if (ret < 0 && /* what goes here? */) {
		whoops! this might be a harmless asynchronous error! take no action!
	}

I see either two possibilities: We export the icmp_err_convert tables along with the udp_lib_err error conversions to user space and spice them up with flags to mark if they are transient (icmp_err_convert already has a fatal flag).

This seems overcomplicated. I'd rather have a flag I pass to tell the kernel that I don't want to see transient errors (and that I'll clear them myself using POLLERR and either MSG_ERRQUEUE or SO_ERROR).

Otherwise you should be able to call recvmsg with MSG_ERRQUEUE set after you got a ret < 0 when calling without MSG_ERRQUEUE and inspect the sock_extended_err, no?

I do this already, which makes me think that there's a bug or another race somewhere. I've only seen a failure once in several years of operation. The failure happened on a ping socket. I suspect that the race is:

ping_err: ip_icmp_error(...);
user: recvmsg(MSG_ERRQUEUE) and dequeues the error.
ping_err: sk_err = err;
user: recvmsg(MSG_ERRQUEUE not set), and recvmsg sees and clears the error via sock_error.
user: recvmsg(MSG_ERRQUEUE), and recvmsg returns -EAGAIN.

Now the user code thinks that it was a real (non-transient) error and aborts. Shouldn't that sk->sk_err = err assignment at least use WRITE_ONCE?

Hmm, I don't think this will help.

It won't help this race, but it'll at least make it clearer that the code has some kind of reasonably well-defined semantics.

Even if this race were fixed, this interface still sucks IMO.

Yes. :/ My proposal would be to make the error conversion lazy: Keeping duplicate data is not a good idea in general: So we shouldn't use sk->sk_err if IP_RECVERR is set at all but let sock_error just use the sk_error_queue and extract the error code from there. Only if IP_RECVERR was not set, we use sk->sk_err logic. What do you think?

That seems entirely sensible to me, except that it might break some existing application. There's also this code:

	if ((family == AF_INET && !inet_sock->recverr) ||
	    (family == AF_INET6 && !inet6_sk(sk)->recverr)) {
		if (!harderr || sk->sk_state != TCP_ESTABLISHED)
			goto out;    <-- skips the assignment to sk_err

which means that recverr kind of has the opposite semantics right now. In fact, the man page agrees with the current behavior (minus the race):

    IP_RECVERR (since Linux 2.2)
    Enable extended reliable error message passing. When enabled on a datagram socket, all generated errors will be queued in a per-socket error queue. When the user receives an error from a socket operation, the errors can be received by calling recvmsg(2) with the MSG_ERRQUEUE flag set. The sock_extended_err structure describing the error will be passed in an ancillary message with the type IP_RECVERR and the level IPPROTO_IP. This is useful for reliable error handling on unconnected sockets. The received data portion of the error queue contains the error packet.

The sensible semantics would be to change this to: When the user receives POLLERR, the errors can be received

So maybe there should be another value for IP_RECVERR to opt in to the alternate semantics.
--Andy
Re: How do I avoid recvmsg races with IP_RECVERR?
On Tue, Jun 2, 2015, at 23:42, Hannes Frederic Sowa wrote:
On Tue, Jun 2, 2015, at 23:33, Andy Lutomirski wrote:
On Tue, Jun 2, 2015 at 2:17 PM, Hannes Frederic Sowa han...@stressinduktion.org wrote:
On Tue, Jun 2, 2015, at 21:40, Andy Lutomirski wrote:

[...]

I do this already, which makes me think that there's a bug or another race somewhere. I've only seen a failure once in several years of operation. The failure happened on a ping socket. I suspect that the race is:

ping_err: ip_icmp_error(...);
user: recvmsg(MSG_ERRQUEUE) and dequeues the error.
ping_err: sk_err = err;
user: recvmsg(MSG_ERRQUEUE not set), and recvmsg sees and clears the error via sock_error.
user: recvmsg(MSG_ERRQUEUE), and recvmsg returns -EAGAIN.

Now the user code thinks that it was a real (non-transient) error and aborts. Shouldn't that sk->sk_err = err assignment at least use WRITE_ONCE?

Hmm, I don't think this will help.

Even if this race were fixed, this interface still sucks IMO.

Yes. :/ My proposal would be to make the error conversion lazy: Keeping duplicate data is not a good idea in general: So we shouldn't use sk->sk_err if IP_RECVERR is set at all but let sock_error just use the sk_error_queue and extract the error code from there. Only if IP_RECVERR was not set, we use sk->sk_err logic. What do you think?

I just noticed that this will probably break existing user space applications which require that icmp errors are transient even with IP_RECVERR. We can mark that with a bit in the sk_error_queue pointer and xchg the pointer, hmmm

Bye,
Hannes
Re: [RFC 3/9] net: dsa: mv88e6xxx: add support for VTU ops
On 06/02/2015 12:44 AM, Scott Feldman wrote:

That brings up an interesting point about having multiple bridges with the same vlan configured. I struggled with that problem with rocker also and I don't have an answer other than don't do that. Or, better put, if you have multiple bridges on the same vlan, just use one bridge for that vlan. Otherwise, I don't know how at the device level to partition the vlan between the bridges. Maybe that's what Vivien is facing also? I can see how this works for software-only bridges, because they should be isolated from each other and independent. But when offloading to a device which sees VLAN XXX global across the entire switch, I don't see how we can preserve the bridge boundaries.

Scott, I'm confused by this. I think you're saying this config is problematic:

br0: eth0.100, eth1.100
br1: eth2.100, eth3.100

But this works fine today. Could you clarify the issue you're referring to?

Thanks,
- nolan
[PATCH net-next] bpf: introduce bpf_clone_redirect() helper
Allow eBPF programs attached to classifier/actions to call bpf_clone_redirect(skb, ifindex, flags) helper which will mirror or redirect the packet by dynamic ifindex selection from within the program to a target device either at ingress or at egress. Can be used for various scenarios, for example, to load balance skbs into veths, split parts of the traffic to local taps, etc.

Signed-off-by: Alexei Starovoitov a...@plumgrid.com
Acked-by: Daniel Borkmann dan...@iogearbox.net
---
 include/uapi/linux/bpf.h | 10 ++++++++++
 net/core/filter.c        | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 72f3080afa1e..42aa19abab86 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -220,6 +220,16 @@ enum bpf_func_id {
 	 * Return: 0 on success
 	 */
 	BPF_FUNC_tail_call,
+
+	/**
+	 * bpf_clone_redirect(skb, ifindex, flags) - redirect to another netdev
+	 * @skb: pointer to skb
+	 * @ifindex: ifindex of the net device
+	 * @flags: bit 0 - if set, redirect to ingress instead of egress
+	 *         other bits - reserved
+	 * Return: 0 on success
+	 */
+	BPF_FUNC_clone_redirect,
 	__BPF_FUNC_MAX_ID,
 };
 
diff --git a/net/core/filter.c b/net/core/filter.c
index b78a010a957f..64c121c09655 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -46,6 +46,7 @@
 #include <linux/seccomp.h>
 #include <linux/if_vlan.h>
 #include <linux/bpf.h>
+#include <net/sch_generic.h>
 
 /**
  *	sk_filter - run a packet through a socket filter
@@ -1407,6 +1408,43 @@ const struct bpf_func_proto bpf_l4_csum_replace_proto = {
 	.arg5_type	= ARG_ANYTHING,
 };
 
+#define BPF_IS_REDIRECT_INGRESS(flags)	((flags) & 1)
+
+static u64 bpf_clone_redirect(u64 r1, u64 ifindex, u64 flags, u64 r4, u64 r5)
+{
+	struct sk_buff *skb = (struct sk_buff *) (long) r1, *skb2;
+	struct net_device *dev;
+
+	dev = dev_get_by_index_rcu(dev_net(skb->dev), ifindex);
+	if (unlikely(!dev))
+		return -EINVAL;
+
+	if (unlikely(!(dev->flags & IFF_UP)))
+		return -EINVAL;
+
+	skb2 = skb_clone(skb, GFP_ATOMIC);
+	if (unlikely(!skb2))
+		return -ENOMEM;
+
+	if (G_TC_AT(skb2->tc_verd) & AT_INGRESS)
+		skb_push(skb2, skb2->mac_len);
+
+	if (BPF_IS_REDIRECT_INGRESS(flags))
+		return dev_forward_skb(dev, skb2);
+
+	skb2->dev = dev;
+	return dev_queue_xmit(skb2);
+}
+
+const struct bpf_func_proto bpf_clone_redirect_proto = {
+	.func		= bpf_clone_redirect,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_ANYTHING,
+};
+
 static const struct bpf_func_proto *
 sk_filter_func_proto(enum bpf_func_id func_id)
 {
@@ -1440,6 +1478,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
 		return &bpf_l3_csum_replace_proto;
 	case BPF_FUNC_l4_csum_replace:
 		return &bpf_l4_csum_replace_proto;
+	case BPF_FUNC_clone_redirect:
+		return &bpf_clone_redirect_proto;
 	default:
 		return sk_filter_func_proto(func_id);
 	}
-- 
1.7.9.5
Re: [RFC net-next 0/3] IP imposition of per-nh MPLS encap
Robert Shearman rshea...@brocade.com writes: On 02/06/15 22:10, Eric W. Biederman wrote: Robert Shearman rshea...@brocade.com writes: On 02/06/15 19:11, Eric W. Biederman wrote: Robert Shearman rshea...@brocade.com writes: In order to be able to function as a Label Edge Router in an MPLS network, it is necessary to be able to take IP packets and impose an MPLS encap and forward them out. The traditional approach of setting up an interface for each tunnel endpoint doesn't scale for the common MPLS use-cases where each IP route tends to be assigned a different label as encap. The solution suggested here for further discussion is to provide the facility to define encap data on a per-nexthop basis using a new netlink attribue, RTA_ENCAP, which would be opaque to the IPv4/IPv6 forwarding code, but interpreted by the virtual interface assigned to the nexthop. A new ipmpls interface type is defined to show the use of this facility to allow IP packets to be imposed with an MPLS encap. However, the facility is designed to be general enough to be used by any encapsulation/tunneling mechanism that has similar requirements of high-scale, high-variation-of-encap. I am still digging into the details but adding a new network device to make this possible if very undesirable. It is a pain point. Those network devices get to be a major source of memory consumption when there are 4K network namespaces in existence. It is conceptually wrong. The network device will never be used as an ordinary network device. All the network device gives you is the ability to avoid creating an enumeration of different kinds of encapsulation. This isn't true. The network device also gives some of the things you take for granted. Things like fragmentation through specifying the mtu on the shared tunnel device, being able to specify rules using the shared tunnel output device, IP stats, and the ability specify a different destination namespace. Granted you get a few more things. 
It is still conceptually wrong as the network device will netver be used as an ordinary network device. Fragmentation is already silly because we are talking about multiple tunnels with different properties. You need per-route mtu to handle that case. It's unlikely you'll have a huge variation in the mtus across routes, unless you're running in an ISP environment. In the example uses we've got in hand, it's highly likely they'll only be a handful of different mtus, if that. Did we ever implement an mpls mtu per netdevice (I think so). Anyway the tunnel mtu is easy enough to calculate in context (base mtu - tunnel overhead). So for default we should not need to do much. Further I am not saying you don't need an output device (which is what is needed to specify a different destination namespace) I am saying that having a funny mpls device is wrong as far as I can see. Certainly it is a lot of bloody unnecessary overhead. If we are going to design for maximum scaling (and 1 million+ routes) sounds like maximum scaling we should see how far we can go without dragging in the horrible heaviness of additional network devices. 35K a piece last I measured it. Just a small handful of them are already scaling issues for network namespaces. For the ipmpls interface I've implemented here, you only need one per namespace. You could argue the same for the veth interfaces which would be much more commonly used in network namespaces. But if I can avoid the extra 143M (35Kibibytes*4096namespaces) I would like to. On the drawing board is getting cross namespace routes so with a little luck I will only need loopback devices in most of my network namespaces when the dust settles. Outputing to network devices in another network namespace is fundamentally simple but I haven't take the time to figure out which assumptions I may have to purge to make it work reliably. 
BTW, maybe I've missed something, or maybe netdevs have gone on a diet, but I count the cost of creating a basic interface at ~2700 bytes on x86_64:
sizeof(struct net_device) /* 2112 */
+ 1 * sizeof(struct netdev_queue) /* 384 */
+ 1 * sizeof(struct netdev_rx_queue) /* 128 */
+ sizeof(struct netdev_hw_addr) /* 80 */
+ sizeof(int) * nr_poss_cpus /* 4 * n */
It is a non-trivial thing to measure. You really have to create a lot of them and see how much memory is consumed. But between the per cpu stats, the sysctl attributes, the sysfs attribute and everything else an actual working netdevice in an all yes config kernel was consuming something like 35K not too long ago. Eric
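For reference, the two figures traded in this thread can be reproduced with a few lines of arithmetic. The sketch below only restates the numbers quoted in the messages (the per-structure sizes and the 35K measured cost); the function names are made up for illustration and nothing here measures a real kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Structure sizes as quoted in the thread (x86_64, bytes). */
enum {
	NET_DEVICE_BYTES      = 2112,
	NETDEV_QUEUE_BYTES    = 384,
	NETDEV_RX_QUEUE_BYTES = 128,
	NETDEV_HW_ADDR_BYTES  = 80,
	PER_CPU_INT_BYTES     = 4,
};

/* Sum of the bare structures for one device with one tx and one rx queue. */
static size_t netdev_struct_bytes(size_t nr_possible_cpus)
{
	return NET_DEVICE_BYTES + NETDEV_QUEUE_BYTES + NETDEV_RX_QUEUE_BYTES +
	       NETDEV_HW_ADDR_BYTES + PER_CPU_INT_BYTES * nr_possible_cpus;
}

/* Total cost in KiB of one extra device in each of `namespaces` namespaces. */
static size_t extra_device_cost_kib(size_t per_device_kib, size_t namespaces)
{
	assert(per_device_kib == 0 ||
	       namespaces <= (size_t)-1 / per_device_kib);
	return per_device_kib * namespaces;
}
```

The structure sum comes to 2704 bytes plus 4 bytes per possible CPU, which matches the "~2700 bytes" estimate. The measured figure gives 35 * 4096 = 143360 KiB, i.e. the "143M" above (140 MiB). The gap between the two is the per cpu stats, sysctl and sysfs state that the sizeof sum does not count.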
Re: [RFC 7/9] net: dsa: mv88e6352: lock CPU port from learning addresses
On 06/02/2015 07:31 PM, Chris Healy wrote: Guenter, That's a very valid concern. I have a configuration with a 6352 controlled by a low-end ARM core with a 100 Mbps connection on the CPU port. This switch needs to support passing multicast streams that are more than 100 Mbps on GigE links. (The ARM does not need to consume the multicast, it just manages the switch.) Possibly, but Vivien didn't answer my question (how the local SA address finds its way into the switch fdb). I'll check it myself. Thanks, Guenter On Jun 3, 2015 3:24 AM, Guenter Roeck li...@roeck-us.net wrote: On Tue, Jun 02, 2015 at 09:06:15PM -0400, Vivien Didelot wrote: Hi Guenter, On Jun 2, 2015, at 10:24 AM, Guenter Roeck li...@roeck-us.net wrote: On 06/01/2015 06:27 PM, Vivien Didelot wrote: This commit disables SA learning and refreshing for the CPU port. Hi Vivien, This patch also seems to be unrelated to the rest of the series. Can you add an explanation why it is needed ? With this in place, how does the CPU port SA find its way into the fdb ? Do we assume that it will be configured statically ? An explanation might be useful. Without this patch, I noticed the CPU port was stealing the SA of a PC behind a switch port. This happened when the port was a bridge member, as the bridge was relaying broadcast coming from one switch port to the other switch ports in the same vlan. Makes me feel really uncomfortable. I think we may be going in the wrong direction. The whole point of offloading bridging is to have the switch handle forwarding, and that includes multicasts and broadcasts. Instead of doing that, it looks like we put more and more workarounds in place. Maybe the software bridge code needs to understand that it isn't supposed to forward broadcasts to ports of an offloaded bridge, and we should let the switch chip handle it ? 
Thanks, Guenter
Re: [RFC 3/9] net: dsa: mv88e6xxx: add support for VTU ops
Guenter, On Jun 2, 2015, at 2:50 AM, Guenter Roeck li...@roeck-us.net wrote: On 06/01/2015 06:27 PM, Vivien Didelot wrote:
+	/* Bringing an interface up adds it to the VLAN 0. Ignore this. */
+	if (!vid)
+		return 0;
+
Me puzzled ;-). I brought this and the fid question up before. No idea if my e-mail got lost or what happened. Can you explain why we don't need a configuration for vlan 0 ? Sorry for the late reply. Initially, when issuing ip link set up dev swp0, ndo_vlan_rx_add_vid was called to add the interface to VLAN 0. Two things happen here: * this is inconsistent with the bridge vlan output which doesn't seem to know about a VID 0; * VID 0 seems special for this switch: if an ingressing frame has VID 0, the tagged port will override the VID bits with the port default VID at egress. Thanks, -v
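The two points above can be modelled in a few lines. This is a toy model only, assuming the behaviour is exactly as described in this message; effective_egress_vid() and port_vlan_add() are hypothetical names and none of this is mv88e6xxx driver code.

```c
#include <assert.h>
#include <stdint.h>

/* A VLAN ID is a 12-bit field. */
#define VLAN_VID_MASK 0x0fffu

/*
 * Toy model of the switch behaviour described above: a frame that
 * ingresses with VID 0 leaves a tagged port carrying the port's
 * default VID instead.
 */
static uint16_t effective_egress_vid(uint16_t frame_vid,
				     uint16_t port_default_vid)
{
	assert(frame_vid <= VLAN_VID_MASK);
	assert(port_default_vid <= VLAN_VID_MASK);
	return frame_vid ? frame_vid : port_default_vid;
}

/*
 * Mirrors the early return in the quoted hunk: a request for VID 0 is
 * accepted but nothing is programmed. *table_writes stands in for the
 * real VLAN table update and only counts how often it would happen.
 */
static int port_vlan_add(uint16_t vid, unsigned int *table_writes)
{
	if (!vid)
		return 0;
	(*table_writes)++;
	return 0;
}
```

Under this model an entry for VID 0 would never be matched at egress, which is one way to read why the hunk ignores the request instead of programming it.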
Re: [RFC 6/9] net: dsa: mv88e6352: allow egress of unknown multicast
Hi Guenter, On Jun 2, 2015, at 10:20 AM, Guenter Roeck li...@roeck-us.net wrote: On 06/01/2015 06:27 PM, Vivien Didelot wrote: This patch disables egress of unknown unicast destination addresses. Hi Vivien, seems to me this patch is unrelated to the rest of the series. Not sure if we really want this. If an address is in the arp cache but has timed out from the bridge database, any unicast to that address will no longer be sent. If the bridge database has been flushed for some reason, such as a spanning tree reconfiguration, we'll have a hard time sending anything. What is the problem you are trying to solve with this patch ? TBH, I don't remember which one of the test cases I described in 0/9 this patch was solving... Some ARP request didn't propagate correctly without this, IIRC. I'll try to revert the change and do my tests again in order to isolate the problem. Thanks, -v
Re: [RFC 7/9] net: dsa: mv88e6352: lock CPU port from learning addresses
On Tue, Jun 02, 2015 at 09:06:15PM -0400, Vivien Didelot wrote: Hi Guenter, On Jun 2, 2015, at 10:24 AM, Guenter Roeck li...@roeck-us.net wrote: On 06/01/2015 06:27 PM, Vivien Didelot wrote: This commit disables SA learning and refreshing for the CPU port. Hi Vivien, This patch also seems to be unrelated to the rest of the series. Can you add an explanation why it is needed ? With this in place, how does the CPU port SA find its way into the fdb ? Do we assume that it will be configured statically ? An explanation might be useful. Without this patch, I noticed the CPU port was stealing the SA of a PC behind a switch port. This happened when the port was a bridge member, as the bridge was relaying broadcast coming from one switch port to the other switch ports in the same vlan. Makes me feel really uncomfortable. I think we may be going in the wrong direction. The whole point of offloading bridging is to have the switch handle forwarding, and that includes multicasts and broadcasts. Instead of doing that, it looks like we put more and more workarounds in place. Maybe the software bridge code needs to understand that it isn't supposed to forward broadcasts to ports of an offloaded bridge, and we should let the switch chip handle it ? Thanks, Guenter
Re: [RFC 5/9] net: dsa: mv88e6352: disable mirroring
On Tue, Jun 02, 2015 at 09:12:30PM -0400, Vivien Didelot wrote: Hi Guenter, Andrew, On Jun 2, 2015, at 10:53 AM, Andrew Lunn and...@lunn.ch wrote: On Tue, Jun 02, 2015 at 07:16:10AM -0700, Guenter Roeck wrote: On 06/01/2015 06:27 PM, Vivien Didelot wrote: Disable the mirroring policy in the monitor control register, since this feature is not needed. Signed-off-by: Vivien Didelot vivien.dide...@savoirfairelinux.com Should this be a separate patch, unrelated to the patch set ? Indeed, this one is an unrelated patch, sorry. If I understand correctly, this effectively disables IGMP/MLD snooping. I think this warrants an explanation why it is not needed, not just a statement that it is not needed. +1 Especially since we might want to revisit this to implement IGMP/MLD snooping in the bridge. The hardware should be capable of it. This is something I want to disable because I can have several times gigabit traffic on my ports. This would end up in a bottleneck on the CPU port. Am I right? Not really. That should not be that much traffic. Besides, IGMP/MLD snooping still needs to be enabled separately, as well as egress monitoring. I don't think this has any impact on the traffic to the CPU port unless other configuration bits are set as well. Guenter
Re: [Intel-wired-lan] [PATCH V2 1/2] pci: Add dev_flags bit to access VPD through function 0
On 06/02/2015 05:10 PM, Mark D Rustad wrote:

Add a dev_flags bit, PCI_DEV_FLAGS_VPD_REF_F0, to access VPD through function 0 to provide VPD access on other functions. This solves concurrent access problems on many devices without changing the attributes exposed in sysfs. Never set this bit on function 0 or there will be an infinite recursion.

Signed-off-by: Mark Rustad mark.d.rus...@intel.com
---
Changes in V2:
- Corrected spelling in log message
- Added checks to see that the referenced function 0 is reasonable
---
 drivers/pci/access.c | 48 +++-
 1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index d9b64a175990..74634d4868a2 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -439,6 +439,40 @@ static const struct pci_vpd_ops pci_vpd_pci22_ops = {
 	.release = pci_vpd_pci22_release,
 };
 
+static ssize_t pci_vpd_f0_read(struct pci_dev *dev, loff_t pos, size_t count,
+			       void *arg)
+{
+	struct pci_dev *tdev = pci_get_slot(dev->bus, PCI_SLOT(dev->devfn));
+	ssize_t ret;
+
+	if (!tdev)
+		return -ENODEV;
+
+	ret = pci_read_vpd(tdev, pos, count, arg);
+	pci_dev_put(tdev);
+	return ret;
+}
+
+static ssize_t pci_vpd_f0_write(struct pci_dev *dev, loff_t pos, size_t count,
+				const void *arg)
+{
+	struct pci_dev *tdev = pci_get_slot(dev->bus, PCI_SLOT(dev->devfn));
+	ssize_t ret;
+
+	if (!tdev)
+		return -ENODEV;
+
+	ret = pci_write_vpd(tdev, pos, count, arg);
+	pci_dev_put(tdev);
+	return ret;
+}
+
+static const struct pci_vpd_ops pci_vpd_f0_ops = {
+	.read = pci_vpd_f0_read,
+	.write = pci_vpd_f0_write,
+	.release = pci_vpd_pci22_release,
+};
+
 int pci_vpd_pci22_init(struct pci_dev *dev)
 {
 	struct pci_vpd_pci22 *vpd;
@@ -447,12 +481,24 @@ int pci_vpd_pci22_init(struct pci_dev *dev)
 	cap = pci_find_capability(dev, PCI_CAP_ID_VPD);
 	if (!cap)
 		return -ENODEV;
+	if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) {
+		struct pci_dev *tdev;
+
+		tdev = pci_get_slot(dev->bus, PCI_SLOT(dev->devfn));
+		if (!tdev || !dev->multifunction || !tdev->multifunction ||
+		    dev->class != tdev->class || dev->vendor != tdev->vendor ||
+		    dev->device != tdev->device)
+			return -ENODEV;
+	}

You can probably combine the dev->multifunction check with the dev_flags check. After all you don't need this workaround if the device is not multifunction. It might even make more sense to move the multifunction check to the quirk in patch 2/2.

I also believe this leaks a reference to the device. You should be calling pci_dev_put(tdev) if tdev is not NULL. As such you probably need to split up the !tdev and the rest of the checks.

- Alex
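The reference leak pointed out in the review can be shown with a small user-space mock. This is a sketch of the suggested restructuring, not the actual fix: struct mock_dev, mock_get() and mock_put() are stand-ins invented here for struct pci_dev, pci_get_slot() and pci_dev_put(), and the code is not tied to any real PCI API.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for struct pci_dev; only the fields the check looks at. */
struct mock_dev {
	int refcount;
	bool multifunction;
	unsigned int class_code;
	unsigned short vendor;
	unsigned short device;
};

/* Stand-in for pci_get_slot(): returns the device with a reference held. */
static struct mock_dev *mock_get(struct mock_dev *d)
{
	if (d)
		d->refcount++;
	return d;
}

/* Stand-in for pci_dev_put(): drops one reference. */
static void mock_put(struct mock_dev *d)
{
	if (d) {
		assert(d->refcount > 0);
		d->refcount--;
	}
}

/*
 * Is function 0 (f0, may be NULL) a plausible VPD proxy for dev?
 * The NULL check is split from the other checks so that every path
 * that obtained a reference also drops it.
 */
static int vpd_f0_check(const struct mock_dev *dev, struct mock_dev *f0)
{
	struct mock_dev *tdev;
	int ret = 0;

	/* No lookup needed at all for single-function devices. */
	if (!dev->multifunction)
		return -ENODEV;

	tdev = mock_get(f0);
	if (!tdev)
		return -ENODEV;

	if (!tdev->multifunction ||
	    dev->class_code != tdev->class_code ||
	    dev->vendor != tdev->vendor ||
	    dev->device != tdev->device)
		ret = -ENODEV;

	mock_put(tdev);
	return ret;
}
```

In the quoted hunk the single combined condition returns -ENODEV while still holding the reference taken by pci_get_slot() whenever tdev is non-NULL but one of the other checks fails; the split above is one way to avoid that.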
[PATCH net-next] net: change fib behavior based on interface link status
This patch adds the ability to have the Linux kernel track whether or not a particular route should be used based on the link-status of the interface associated with the next-hop.

Before this patch any link-failure on an interface that was serving as a gateway for some systems could result in those systems being isolated from the rest of the network as the stack would continue to attempt to send frames out of an interface that is actually linked-down. When the kernel is responsible for all forwarding, it should also be responsible for taking action when the traffic can no longer be forwarded -- there is no real need to outsource link-monitoring to userspace anymore.

This feature is only enabled with the new sysctl set (default is off):

net.core.kill_routes_on_linkdown = 1

When this is set, the following behavior can be observed (interface p8p1 is link-down):

# ip route show
default via 10.0.5.2 dev p9p1
10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15
70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1
80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1 dead
90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1 dead
90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2
# ip route get 90.0.0.1
90.0.0.1 via 70.0.0.2 dev p7p1 src 70.0.0.1
    cache
# ip route get 80.0.0.1
local 80.0.0.1 dev lo src 80.0.0.1
    cache local
# ip route get 80.0.0.2
80.0.0.2 via 10.0.5.2 dev p9p1 src 10.0.5.15
    cache

While the route does remain in the table (so it can be modified if needed rather than being wiped away as it would be if IFF_UP was cleared), the proper next-hop is chosen automatically when the link is down.
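The selection behaviour described above can be modelled in user space as follows. This is a toy sketch under stated assumptions, not the fib_trie implementation from this patch: struct nexthop and select_nexthop() are invented for illustration, and route preference is reduced to "lowest metric wins".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative next-hop record; not the kernel's struct fib_nh. */
struct nexthop {
	const char *gateway;
	const char *dev;
	unsigned int metric;
	bool link_up;
};

/*
 * Pick the lowest-metric next hop. When kill_routes_on_linkdown is
 * set, next hops whose interface has no link ("dead" in the output
 * above) are skipped but stay in the table. Returns NULL when no
 * usable next hop is left.
 */
static const struct nexthop *select_nexthop(const struct nexthop *nh,
					    size_t count,
					    bool kill_routes_on_linkdown)
{
	const struct nexthop *best = NULL;
	size_t i;

	assert(nh != NULL || count == 0);

	for (i = 0; i < count; i++) {
		if (kill_routes_on_linkdown && !nh[i].link_up)
			continue;
		if (!best || nh[i].metric < best->metric)
			best = &nh[i];
	}
	return best;
}
```

Fed with the two 90.0.0.0/24 routes from the listing and p8p1 marked link-down, the model returns the metric 2 route via 70.0.0.2 when the sysctl is on, and the metric 1 route via 80.0.0.2 when it is off, which corresponds to the pre-patch behaviour of sending into a dead link.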
Now interface p8p1 is linked-up:

# ip route show
default via 10.0.5.2 dev p9p1
10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15
70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1
80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1
90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1
90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2
192.168.56.0/24 dev p2p1 proto kernel scope link src 192.168.56.2
# ip route get 90.0.0.1
90.0.0.1 via 80.0.0.2 dev p8p1 src 80.0.0.1
    cache
# ip route get 80.0.0.1
local 80.0.0.1 dev lo src 80.0.0.1
    cache local
# ip route get 80.0.0.2
80.0.0.2 dev p8p1 src 80.0.0.1
    cache

and the output changes to what one would expect.

Signed-off-by: Andy Gospodarek go...@cumulusnetworks.com
Suggested-by: Dinesh Dutt dd...@cumulusnetworks.com
---
When this was discussed in Ottawa earlier this year, some preferred not to have a configuration option and to make this behavior the default, since it was time to do this. I wanted to propose the config option anyway, to preserve the current behavior for those who desire it. I'll happily remove it if Dave and Linus approve.

An IPv6 implementation is also needed (DECnet too!), but I wanted to start with the IPv4 implementation to get people comfortable with the idea before moving forward. If this is accepted the IPv6 implementation can be posted shortly.

FWIW, we have been running this patch with the sysctl setting above and our customers have been happily using a backported version for IPv4 and IPv6 for 6 months.
 include/linux/netdevice.h      |  1 +
 include/net/fib_rules.h        |  1 +
 include/net/ip_fib.h           |  1 +
 include/uapi/linux/rtnetlink.h |  1 +
 include/uapi/linux/sysctl.h    |  1 +
 kernel/sysctl_binary.c         |  1 +
 net/core/dev.c                 |  2 ++
 net/core/sysctl_net_core.c     |  7 +++
 net/ipv4/fib_frontend.c        | 12 +--
 net/ipv4/fib_rules.c           |  7 ++-
 net/ipv4/fib_semantics.c       | 46 --
 net/ipv4/fib_trie.c            | 19 +
 12 files changed, 86 insertions(+), 13 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 6f5f71f..5bd953c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2986,6 +2986,7 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
 bool is_skb_forwardable(struct net_device *dev, struct sk_buff *skb);
 
 extern int netdev_budget;
+extern int kill_routes_on_linkdown;
 
 /* Called by rtnetlink.c:rtnl_unlock() */
 void netdev_run_todo(void);
diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h
index 6d67383..4fbfda5 100644
--- a/include/net/fib_rules.h
+++ b/include/net/fib_rules.h
@@ -37,6 +37,7 @@ struct fib_lookup_arg {
 	struct fib_rule *rule;
 	int flags;
 #define FIB_LOOKUP_NOREF	1
+#define FIB_LOOKUP_ALLOWDEAD	2
 };
 
 struct fib_rules_ops {
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 54271ed..efb195b 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -250,6 +250,7 @@ struct fib_table *fib_new_table(struct net *net, u32 id);
 struct fib_table *fib_get_table(struct net *net, u32 id);
 int __fib_lookup(struct net