[PATCH] net: hns: dereference ppe_cb->ppe_common_cb if it is non-null
From: Colin Ian King

ppe_cb->ppe_common_cb is being dereferenced before a null check is made on
it. If ppe_cb->ppe_common_cb is null then we end up with a null pointer
dereference when assigning dsaf_dev. Fix this by moving the initialisation
of dsaf_dev to after we know ppe_cb->ppe_common_cb is OK to dereference.

Signed-off-by: Colin Ian King
---
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c
index ff8b6a4..6ea8722 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c
@@ -328,9 +328,10 @@ static void hns_ppe_init_hw(struct hns_ppe_cb *ppe_cb)
 static void hns_ppe_uninit_hw(struct hns_ppe_cb *ppe_cb)
 {
 	u32 port;
-	struct dsaf_device *dsaf_dev = ppe_cb->ppe_common_cb->dsaf_dev;
 
 	if (ppe_cb->ppe_common_cb) {
+		struct dsaf_device *dsaf_dev = ppe_cb->ppe_common_cb->dsaf_dev;
+
 		port = ppe_cb->index;
 		dsaf_dev->misc_op->ppe_srst(dsaf_dev, port, 0);
 	}
--
2.9.3
Re: [PATCH 0/5] Networking cgroup controller
On Tue, Aug 23, 2016 at 1:49 AM, Parav Pandit wrote:
> Hi Anoop,
>
> Regardless of usecase, I think this functionality is best handled as
> LSM functionality instead of cgroup.
>
I'm not so sure about that. Cgroup APIs are useful, and this is just an
extension to them.

> Tasks which are proposed in this patch are related to access control checks.
> LSM already has the required hooks for socket operations such as bind()
> and listen(), as a few small examples.
>
> Refer to security_socket_listen(), which invokes LSM-specific hooks.
> This is invoked in net/socket.c as part of the listen() system call.
> An LSM hook callback can check whether a given process can listen on
> the requested UDP port or not.
>
This has administrative overhead that is not addressed. The underlying
cgroup infrastructure takes care of it in this (current) implementation.

> Parav
>
> [...]
[PATCH] softirq: fix tasklet_kill() and its users
Semantically, the expectation from the tasklet init/kill API should be as below:

tasklet_init() == Init and enable scheduling
tasklet_kill() == Disable scheduling and destroy

tasklet_init() exhibits the above behaviour, but tasklet_kill() does not:
the tasklet handler can still get scheduled and run even after
tasklet_kill(). There are a few places where drivers work around this
issue by calling tasklet_disable(), which adds a usecount and thereby
avoids the handler being called. tasklet_enable()/tasklet_disable() are a
paired API and are expected to be used together. Using tasklet_disable()
*just* to work around tasklet scheduling after kill is probably not the
correct and intended use of the API. We also happened to see a similar
issue where, in the shutdown path, the tasklet handler was getting called
even after tasklet_kill(). We fix this by making tasklet_kill() do the
right thing, thereby ensuring with a very simple change that the tasklet
handler won't run after tasklet_kill(). Patch fixes the tasklet code and
also a few driver workarounds.

Cc: Greg Kroah-Hartman
Cc: Andrew Morton
Cc: Thomas Gleixner
Cc: Tadeusz Struk
Cc: Herbert Xu
Cc: "David S. Miller"
Cc: Paul Bolle
Cc: Giovanni Cabiddu
Cc: Salvatore Benedetto
Cc: Karsten Keil
Cc: "Peter Zijlstra (Intel)"
Signed-off-by: Santosh Shilimkar
---
Removed RFC tag from last post and dropped the atmel serial driver, which
seems to have been fixed in 4.8: https://lkml.org/lkml/2016/8/7/7

 drivers/crypto/qat/qat_common/adf_isr.c    | 1 -
 drivers/crypto/qat/qat_common/adf_sriov.c  | 1 -
 drivers/crypto/qat/qat_common/adf_vf_isr.c | 2 --
 drivers/isdn/gigaset/interface.c           | 1 -
 kernel/softirq.c                           | 7 ++++---
 5 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/drivers/crypto/qat/qat_common/adf_isr.c b/drivers/crypto/qat/qat_common/adf_isr.c
index 06d4901..fd5e900 100644
--- a/drivers/crypto/qat/qat_common/adf_isr.c
+++ b/drivers/crypto/qat/qat_common/adf_isr.c
@@ -296,7 +296,6 @@ static void adf_cleanup_bh(struct adf_accel_dev *accel_dev)
 	int i;
 
 	for (i = 0; i < hw_data->num_banks; i++) {
-		tasklet_disable(&priv_data->banks[i].resp_handler);
 		tasklet_kill(&priv_data->banks[i].resp_handler);
 	}
 }
diff --git a/drivers/crypto/qat/qat_common/adf_sriov.c b/drivers/crypto/qat/qat_common/adf_sriov.c
index 9320ae1..bc7c2fa 100644
--- a/drivers/crypto/qat/qat_common/adf_sriov.c
+++ b/drivers/crypto/qat/qat_common/adf_sriov.c
@@ -204,7 +204,6 @@ void adf_disable_sriov(struct adf_accel_dev *accel_dev)
 	}
 
 	for (i = 0, vf = accel_dev->pf.vf_info; i < totalvfs; i++, vf++) {
-		tasklet_disable(&vf->vf2pf_bh_tasklet);
 		tasklet_kill(&vf->vf2pf_bh_tasklet);
 		mutex_destroy(&vf->pf2vf_lock);
 	}
diff --git a/drivers/crypto/qat/qat_common/adf_vf_isr.c b/drivers/crypto/qat/qat_common/adf_vf_isr.c
index bf99e11..6e38bff 100644
--- a/drivers/crypto/qat/qat_common/adf_vf_isr.c
+++ b/drivers/crypto/qat/qat_common/adf_vf_isr.c
@@ -191,7 +191,6 @@ static int adf_setup_pf2vf_bh(struct adf_accel_dev *accel_dev)
 
 static void adf_cleanup_pf2vf_bh(struct adf_accel_dev *accel_dev)
 {
-	tasklet_disable(&accel_dev->vf.pf2vf_bh_tasklet);
 	tasklet_kill(&accel_dev->vf.pf2vf_bh_tasklet);
 	mutex_destroy(&accel_dev->vf.vf2pf_lock);
 }
@@ -268,7 +267,6 @@ static void adf_cleanup_bh(struct adf_accel_dev *accel_dev)
 {
 	struct adf_etr_data *priv_data = accel_dev->transport;
 
-	tasklet_disable(&priv_data->banks[0].resp_handler);
 	tasklet_kill(&priv_data->banks[0].resp_handler);
 }
diff --git a/drivers/isdn/gigaset/interface.c b/drivers/isdn/gigaset/interface.c
index 600c79b..2ce63b6 100644
--- a/drivers/isdn/gigaset/interface.c
+++ b/drivers/isdn/gigaset/interface.c
@@ -524,7 +524,6 @@ void gigaset_if_free(struct cardstate *cs)
 	if (!drv->have_tty)
 		return;
 
-	tasklet_disable(&cs->if_wake_tasklet);
 	tasklet_kill(&cs->if_wake_tasklet);
 	cs->tty_dev = NULL;
 	tty_unregister_device(drv->tty, cs->minor_index);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b..21397eb 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -498,7 +498,7 @@ static void tasklet_action(struct softirq_action *a)
 		list = list->next;
 
 		if (tasklet_trylock(t)) {
-			if (!atomic_read(&t->count)) {
+			if (atomic_read(&t->count) == 1) {
 				if (!test_and_clear_bit(TASKLET_STATE_SCHED,
							&t->state))
 					BUG();
@@ -534,7 +534,7 @@ static void tasklet_hi_action(struct softirq_action *a)
 		list = list->next;
 
 		if (tasklet_trylock(t)) {
-			if (!atomic_read(&t->count)) {
+			if (atomic_read(&t
[PATCH net-next 5/6] net: dsa: bcm_sf2: Utilize core B53 driver when possible
The Broadcom Starfighter 2 is almost entirely register compatible with B53, yet for historical reasons came up first in the tree and is now being updated to utilize b53_common.c to the fullest extent possible. A few things need to be adjusted to allow that:

- the switch "core" registers currently operate on a 32-bit address, whereas b53 passes a page + reg pair to offset from, so we need to convert that; thankfully there is a generic formula to do that

- the link management is not self-contained within the B53/CORE register set, but instead is in the SWITCH_REG block, which is part of the integration glue logic, so we keep that entirely custom here because this really is part of the existing bcm_sf2 implementation

- there are additional power management constraints on the port's memories that make us keep the port_enable/disable callbacks custom for now; also, we support tagging whereas b53_common does not support that yet

All the VLAN and bridge code is entirely identical though, so avoid duplicating it. Other things will be migrated in the future, like EEE and possibly Wake-on-LAN.

Signed-off-by: Florian Fainelli
---
 drivers/net/dsa/Kconfig   |   1 +
 drivers/net/dsa/bcm_sf2.c | 230 ++++++++++++++++++++++++++++--------------
 drivers/net/dsa/bcm_sf2.h |  11 +++
 3 files changed, 195 insertions(+), 47 deletions(-)

diff --git a/drivers/net/dsa/Kconfig b/drivers/net/dsa/Kconfig
index 8f4544394f44..de6d04429a70 100644
--- a/drivers/net/dsa/Kconfig
+++ b/drivers/net/dsa/Kconfig
@@ -16,6 +16,7 @@ config NET_DSA_BCM_SF2
 	select FIXED_PHY
 	select BCM7XXX_PHY
 	select MDIO_BCM_UNIMAC
+	select B53
 	---help---
 	  This enables support for the Broadcom Starfighter 2 Ethernet
 	  switch chips.
diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c index b47a74b37a42..56e898f01c0f 100644 --- a/drivers/net/dsa/bcm_sf2.c +++ b/drivers/net/dsa/bcm_sf2.c @@ -29,9 +29,12 @@ #include #include #include +#include #include "bcm_sf2.h" #include "bcm_sf2_regs.h" +#include "b53/b53_priv.h" +#include "b53/b53_regs.h" /* String, offset, and register size in bytes if different from 4 bytes */ static const struct bcm_sf2_hw_stats bcm_sf2_mib[] = { @@ -106,7 +109,7 @@ static void bcm_sf2_sw_get_strings(struct dsa_switch *ds, static void bcm_sf2_sw_get_ethtool_stats(struct dsa_switch *ds, int port, uint64_t *data) { - struct bcm_sf2_priv *priv = ds_to_priv(ds); + struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds); const struct bcm_sf2_hw_stats *s; unsigned int i; u64 val = 0; @@ -143,7 +146,7 @@ static enum dsa_tag_protocol bcm_sf2_sw_get_tag_protocol(struct dsa_switch *ds) static void bcm_sf2_imp_vlan_setup(struct dsa_switch *ds, int cpu_port) { - struct bcm_sf2_priv *priv = ds_to_priv(ds); + struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds); unsigned int i; u32 reg; @@ -163,7 +166,7 @@ static void bcm_sf2_imp_vlan_setup(struct dsa_switch *ds, int cpu_port) static void bcm_sf2_imp_setup(struct dsa_switch *ds, int port) { - struct bcm_sf2_priv *priv = ds_to_priv(ds); + struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds); u32 reg, val; /* Enable the port memories */ @@ -228,7 +231,7 @@ static void bcm_sf2_imp_setup(struct dsa_switch *ds, int port) static void bcm_sf2_eee_enable_set(struct dsa_switch *ds, int port, bool enable) { - struct bcm_sf2_priv *priv = ds_to_priv(ds); + struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds); u32 reg; reg = core_readl(priv, CORE_EEE_EN_CTRL); @@ -241,7 +244,7 @@ static void bcm_sf2_eee_enable_set(struct dsa_switch *ds, int port, bool enable) static void bcm_sf2_gphy_enable_set(struct dsa_switch *ds, bool enable) { - struct bcm_sf2_priv *priv = ds_to_priv(ds); + struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds); u32 reg; reg = reg_readl(priv, 
REG_SPHY_CNTRL); @@ -315,7 +318,7 @@ static inline void bcm_sf2_port_intr_disable(struct bcm_sf2_priv *priv, static int bcm_sf2_port_setup(struct dsa_switch *ds, int port, struct phy_device *phy) { - struct bcm_sf2_priv *priv = ds_to_priv(ds); + struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds); s8 cpu_port = ds->dst[ds->index].cpu_port; u32 reg; @@ -371,7 +374,7 @@ static int bcm_sf2_port_setup(struct dsa_switch *ds, int port, static void bcm_sf2_port_disable(struct dsa_switch *ds, int port, struct phy_device *phy) { - struct bcm_sf2_priv *priv = ds_to_priv(ds); + struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds); u32 off, reg; if (priv->wol_ports_mask & (1 << port)) @@ -403,7 +406,7 @@ static void bcm_sf2_port_disable(struct dsa_switch *ds, int port, static int bcm_sf2_eee_init(struct dsa_switch *ds, int port, str
[PATCH net-next 3/6] net: dsa: b53: Define SF2 MIB layout
The 58xx and 7445 chips use the Starfighter2 code, define its MIB layout and introduce a helper function: is58xx() which checks for both of these IDs for now. Signed-off-by: Florian Fainelli --- drivers/net/dsa/b53/b53_common.c | 63 drivers/net/dsa/b53/b53_priv.h | 6 2 files changed, 69 insertions(+) diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c index 0e6b8125a8ea..e59d799880e4 100644 --- a/drivers/net/dsa/b53/b53_common.c +++ b/drivers/net/dsa/b53/b53_common.c @@ -167,6 +167,65 @@ static const struct b53_mib_desc b53_mibs[] = { #define B53_MIBS_SIZE ARRAY_SIZE(b53_mibs) +static const struct b53_mib_desc b53_mibs_58xx[] = { + { 8, 0x00, "TxOctets" }, + { 4, 0x08, "TxDropPkts" }, + { 4, 0x0c, "TxQPKTQ0" }, + { 4, 0x10, "TxBroadcastPkts" }, + { 4, 0x14, "TxMulticastPkts" }, + { 4, 0x18, "TxUnicastPKts" }, + { 4, 0x1c, "TxCollisions" }, + { 4, 0x20, "TxSingleCollision" }, + { 4, 0x24, "TxMultipleCollision" }, + { 4, 0x28, "TxDeferredCollision" }, + { 4, 0x2c, "TxLateCollision" }, + { 4, 0x30, "TxExcessiveCollision" }, + { 4, 0x34, "TxFrameInDisc" }, + { 4, 0x38, "TxPausePkts" }, + { 4, 0x3c, "TxQPKTQ1" }, + { 4, 0x40, "TxQPKTQ2" }, + { 4, 0x44, "TxQPKTQ3" }, + { 4, 0x48, "TxQPKTQ4" }, + { 4, 0x4c, "TxQPKTQ5" }, + { 8, 0x50, "RxOctets" }, + { 4, 0x58, "RxUndersizePkts" }, + { 4, 0x5c, "RxPausePkts" }, + { 4, 0x60, "RxPkts64Octets" }, + { 4, 0x64, "RxPkts65to127Octets" }, + { 4, 0x68, "RxPkts128to255Octets" }, + { 4, 0x6c, "RxPkts256to511Octets" }, + { 4, 0x70, "RxPkts512to1023Octets" }, + { 4, 0x74, "RxPkts1024toMaxPktsOctets" }, + { 4, 0x78, "RxOversizePkts" }, + { 4, 0x7c, "RxJabbers" }, + { 4, 0x80, "RxAlignmentErrors" }, + { 4, 0x84, "RxFCSErrors" }, + { 8, 0x88, "RxGoodOctets" }, + { 4, 0x90, "RxDropPkts" }, + { 4, 0x94, "RxUnicastPkts" }, + { 4, 0x98, "RxMulticastPkts" }, + { 4, 0x9c, "RxBroadcastPkts" }, + { 4, 0xa0, "RxSAChanges" }, + { 4, 0xa4, "RxFragments" }, + { 4, 0xa8, "RxJumboPkt" }, + { 4, 0xac, "RxSymblErr" }, + { 4, 
0xb0, "InRangeErrCount" }, + { 4, 0xb4, "OutRangeErrCount" }, + { 4, 0xb8, "EEELpiEvent" }, + { 4, 0xbc, "EEELpiDuration" }, + { 4, 0xc0, "RxDiscard" }, + { 4, 0xc8, "TxQPKTQ6" }, + { 4, 0xcc, "TxQPKTQ7" }, + { 4, 0xd0, "TxPkts64Octets" }, + { 4, 0xd4, "TxPkts65to127Octets" }, + { 4, 0xd8, "TxPkts128to255Octets" }, + { 4, 0xdc, "TxPkts256to511Ocets" }, + { 4, 0xe0, "TxPkts512to1023Ocets" }, + { 4, 0xe4, "TxPkts1024toMaxPktOcets" }, +}; + +#define B53_MIBS_58XX_SIZE ARRAY_SIZE(b53_mibs_58xx) + static int b53_do_vlan_op(struct b53_device *dev, u8 op) { unsigned int i; @@ -635,6 +694,8 @@ static const struct b53_mib_desc *b53_get_mib(struct b53_device *dev) return b53_mibs_65; else if (is63xx(dev)) return b53_mibs_63xx; + else if (is58xx(dev)) + return b53_mibs_58xx; else return b53_mibs; } @@ -645,6 +706,8 @@ static unsigned int b53_get_mib_size(struct b53_device *dev) return B53_MIBS_65_SIZE; else if (is63xx(dev)) return B53_MIBS_63XX_SIZE; + else if (is58xx(dev)) + return B53_MIBS_58XX_SIZE; else return B53_MIBS_SIZE; } diff --git a/drivers/net/dsa/b53/b53_priv.h b/drivers/net/dsa/b53/b53_priv.h index cf2ff2cbc8ab..76672dae412d 100644 --- a/drivers/net/dsa/b53/b53_priv.h +++ b/drivers/net/dsa/b53/b53_priv.h @@ -175,6 +175,12 @@ static inline int is5301x(struct b53_device *dev) dev->chip_id == BCM53019_DEVICE_ID; } +static inline int is58xx(struct b53_device *dev) +{ + return dev->chip_id == BCM58XX_DEVICE_ID || + dev->chip_id == BCM7445_DEVICE_ID; +} + #define B53_CPU_PORT_255 #define B53_CPU_PORT 8 -- 2.7.4
[PATCH net-next 1/6] net: dsa: b53: Initialize ds->drv in b53_switch_alloc
In order to allow drivers to override specific dsa_switch_driver callbacks, initialize ds->drv to b53_switch_ops earlier, which avoids having to expose this structure to glue drivers.

Signed-off-by: Florian Fainelli
---
 drivers/net/dsa/b53/b53_common.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index 65ecb51f99e5..30377ceb1928 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -1602,7 +1602,6 @@ static const struct b53_chip_data b53_switch_chips[] = {
 
 static int b53_switch_init(struct b53_device *dev)
 {
-	struct dsa_switch *ds = dev->ds;
 	unsigned int i;
 	int ret;
 
@@ -1618,7 +1617,6 @@ static int b53_switch_init(struct b53_device *dev)
 	dev->vta_regs[1] = chip->vta_regs[1];
 	dev->vta_regs[2] = chip->vta_regs[2];
 	dev->jumbo_pm_reg = chip->jumbo_pm_reg;
-	ds->drv = &b53_switch_ops;
 	dev->cpu_port = chip->cpu_port;
 	dev->num_vlans = chip->vlans;
 	dev->num_arl_entries = chip->arl_entries;
@@ -1706,6 +1704,7 @@ struct b53_device *b53_switch_alloc(struct device *base,
 	dev->ds = ds;
 	dev->priv = priv;
 	dev->ops = ops;
+	ds->drv = &b53_switch_ops;
 
 	mutex_init(&dev->reg_mutex);
 	mutex_init(&dev->stats_mutex);
-- 
2.7.4
Re: [net] i40e: Change some init flow for the client
On Wed, 2016-08-24 at 17:51 -0700, Jeff Kirsher wrote:
> From: Anjali Singhai Jain
>
> This change makes a common flow for Client instance open during the init
> and reset paths. The Client subtask can handle both cases instead of
> making a separate notify_client_of_open call.
> Also it may fix a bug during reset where the service task was leaking
> some memory and causing issues.
>
> Change-Id: I7232a32fd52b82e863abb54266fa83122f80a0cd
> Signed-off-by: Anjali Singhai Jain
> Tested-by: Andrew Bowers
> Signed-off-by: Jeff Kirsher
> ---
>  drivers/net/ethernet/intel/i40e/i40e_client.c | 41 ++++++++++++++++++++++++++++++---------
>  drivers/net/ethernet/intel/i40e/i40e_main.c   |  1 -
>  2 files changed, 30 insertions(+), 12 deletions(-)

While the original patch description did not call this out clearly, this patch fixes an issue with the RDMA/iWARP driver i40iw, which would randomly crash or hang without these changes.
Improving OCTEON II 10G Ethernet performance
I'm trying to migrate from the Octeon SDK to a vanilla Linux 4.4 kernel for a Cavium OCTEON II (CN6880) board running in 64-bit little-endian mode. So far I've gotten most of the hardware features I need working, including XAUI/RXAUI, USB, boot bus and I2C, with a fairly small set of patches. https://github.com/skyportsystems/linux/compare/master...octeon2 The biggest remaining hurdle is improving 10G Ethernet performance: iperf -P 10 on the SDK kernel gets close to 10 Gbit/sec throughput, while on my 4.4 kernel, it tops out around 1 Gbit/sec. Comparing the octeon-ethernet driver in the SDK (http://git.yoctoproject.org/cgit/cgit.cgi/linux-yocto-contrib/tree/drivers/net/ethernet/octeon?h=apaliwal/octeon) against the one in 4.4, the latter appears to utilize only a single CPU core for the rx path. It's not clear to me if there is a similar issue on the tx side, or other bottlenecks. I started trying to port multi-CPU rx from the SDK octeon-ethernet driver, but had trouble teasing out just the necessary bits without following a maze of dependencies on unrelated functions. (Dragging major parts of the SDK wholesale into 4.4 defeats the purpose of switching to a vanilla kernel, and doesn't bring us closer to getting octeon-ethernet out of staging.) Has there been any work on the octeon-ethernet driver since this patch set? https://www.linux-mips.org/archives/linux-mips/2015-08/msg00338.html Any hints on what to pick out of the SDK code to improve 10G performance would be appreciated. --Ed
[PATCH net] veth: sctp: add NETIF_F_SCTP_CRC to device features
Commit b17c706987fa ("loopback: sctp: add NETIF_F_SCTP_CSUM to device features") added NETIF_F_SCTP_CRC to the device features of the lo device to improve the performance of SCTP over lo. This patch adds NETIF_F_SCTP_CRC to the device features of veth to improve the performance of SCTP over veth.

Before this patch:
ip netns exec cs_client netperf -H 10.167.12.2 -t SCTP_STREAM -- -m 10K

Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

212992 212992  10240    10.00    1117.16

After this patch:
ip netns exec cs_client netperf -H 10.167.12.2 -t SCTP_STREAM -- -m 10K

Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

212992 212992  10240    10.20    1415.22

Tested-by: Li Shuang
Signed-off-by: Xin Long
---
 drivers/net/veth.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index f37a6e6..4bda502 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -313,7 +313,7 @@ static const struct net_device_ops veth_netdev_ops = {
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
-		       NETIF_F_RXCSUM | NETIF_F_HIGHDMA | \
+		       NETIF_F_RXCSUM | NETIF_F_SCTP_CRC | NETIF_F_HIGHDMA | \
 		       NETIF_F_GSO_SOFTWARE | NETIF_F_GSO_ENCAP_ALL | \
 		       NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX | \
 		       NETIF_F_HW_VLAN_STAG_TX | NETIF_F_HW_VLAN_STAG_RX )
-- 
2.1.0
[PATCH net-next 0/6] net: dsa: Make bcm_sf2 utilize b53_common
Hi all,

This patch series makes the bcm_sf2 driver utilize a large number of the core functions offered by the b53_common driver, since the SWITCH_CORE registers are mostly register compatible with the switches driven by b53_common. In order to accomplish that, we just override the dsa_driver_ops callbacks that we need to.

There is still integration-specific logic from bcm_sf2 that we cannot absorb into b53_common because it is just not there, mostly in the area of link management and power management, but most of the features are within b53_common now: VLAN, FDB, bridge.

Along the process, we also improve support for the BCM58xx SoCs, since those also have the same version of the switching IP that 7445 has (for which bcm_sf2 was developed).

Florian Fainelli (6):
  net: dsa: b53: Initialize ds->drv in b53_switch_alloc
  net: dsa: b53: Prepare to support 7445 switch
  net: dsa: b53: Define SF2 MIB layout
  net: dsa: b53: Add JOIN_ALL_VLAN support
  net: dsa: bcm_sf2: Utilize core B53 driver when possible
  net: dsa: bcm_sf2: Remove duplicate code

 drivers/net/dsa/Kconfig          |   1 +
 drivers/net/dsa/b53/b53_common.c | 108 ++++-
 drivers/net/dsa/b53/b53_priv.h   |   7 +
 drivers/net/dsa/b53/b53_regs.h   |   3 +
 drivers/net/dsa/bcm_sf2.c        | 932 +++++++++------------------------------
 drivers/net/dsa/bcm_sf2.h        |  82 +---
 drivers/net/dsa/bcm_sf2_regs.h   | 122 ------
 7 files changed, 288 insertions(+), 967 deletions(-)

-- 
2.7.4
[PATCH net-next 2/6] net: dsa: b53: Prepare to support 7445 switch
Allocate a device entry for the Broadcom BCM7445 integrated switch currently backed by bcm_sf2.c. Since this is the latest generation, it has 4 ARL entries, 4K VLANs and uses Port 8 for the CPU/IMP port. Signed-off-by: Florian Fainelli --- drivers/net/dsa/b53/b53_common.c | 12 drivers/net/dsa/b53/b53_priv.h | 1 + 2 files changed, 13 insertions(+) diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c index 30377ceb1928..0e6b8125a8ea 100644 --- a/drivers/net/dsa/b53/b53_common.c +++ b/drivers/net/dsa/b53/b53_common.c @@ -1598,6 +1598,18 @@ static const struct b53_chip_data b53_switch_chips[] = { .jumbo_pm_reg = B53_JUMBO_PORT_MASK, .jumbo_size_reg = B53_JUMBO_MAX_SIZE, }, + { + .chip_id = BCM7445_DEVICE_ID, + .dev_name = "BCM7445", + .vlans = 4096, + .enabled_ports = 0x1ff, + .arl_entries = 4, + .cpu_port = B53_CPU_PORT, + .vta_regs = B53_VTA_REGS, + .duplex_reg = B53_DUPLEX_STAT_GE, + .jumbo_pm_reg = B53_JUMBO_PORT_MASK, + .jumbo_size_reg = B53_JUMBO_MAX_SIZE, + }, }; static int b53_switch_init(struct b53_device *dev) diff --git a/drivers/net/dsa/b53/b53_priv.h b/drivers/net/dsa/b53/b53_priv.h index d268493a5fec..cf2ff2cbc8ab 100644 --- a/drivers/net/dsa/b53/b53_priv.h +++ b/drivers/net/dsa/b53/b53_priv.h @@ -60,6 +60,7 @@ enum { BCM53018_DEVICE_ID = 0x53018, BCM53019_DEVICE_ID = 0x53019, BCM58XX_DEVICE_ID = 0x5800, + BCM7445_DEVICE_ID = 0x7445, }; #define B53_N_PORTS9 -- 2.7.4
Re: [PATCH net 2/2] sctp: not copying duplicate addrs to the assoc's bind address list
> Or add a refcnt to its members.
> NETDEV_UP, it gets a ++ if it's already there
> NETDEV_DOWN, it gets a -- and cleans it up if it reaches 0
> And the rest probably could stay the same.
>
Yes, that could also avoid the issue of large numbers of duplicate addrs,
as would adding a NIC index variable to its members. But I still prefer
the current patch:

1. This issue only happens when the server binds 'ANY' addresses. We
   don't need to add any new members to struct sctp_sockaddr_entry,
   especially since it's really a corner case; we fix this as an
   improvement.

2. There are actually two issues here: the duplicate addrs may come from
   a) different local NICs, or b) the same NIC. It may be unexpected to
   filter them in the NETDEV_UP/DOWN events.

3. We check it only when sctp really binds it, just like sctp_do_bind.

What do you think?
[PATCH net-next v4 2/3] net: mpls: Fixups for GSO
As reported by Lennert, the MPLS GSO code is failing to properly segment large packets. There are a couple of problems:

1. The inner protocol is not set, so the GSO segment functions for the
   inner protocol layers are not getting run.

2. MPLS labels for packets that use the "native" (non-OVS) MPLS code are
   not properly accounted for in mpls_gso_segment.

The MPLS GSO code was added for OVS. It is re-using skb_mac_gso_segment to call the gso segment functions for the higher layer protocols. That means skb_mac_gso_segment is called twice -- once with the network protocol set to MPLS and again with the network protocol set to the inner protocol.

This patch sets the inner skb protocol, addressing item 1 above, and sets the network_header and inner_network_header to mark where the MPLS labels start and end. The MPLS code in OVS is also updated to set the two network markers.

From there, the MPLS GSO code uses the difference between the network header and the inner network header to know the size of the MPLS header that was pushed. It then pulls the MPLS header, resets the mac_len and protocol for the inner protocol and then calls skb_mac_gso_segment to segment the skb. After the inner protocol segmentation is done, the skb protocol is set to mpls for each segment and the network and mac headers are restored.
Reported-by: Lennert Buytenhek Signed-off-by: David Ahern --- net/mpls/mpls_gso.c | 40 +--- net/mpls/mpls_iptunnel.c | 4 net/openvswitch/actions.c | 9 +++-- 3 files changed, 40 insertions(+), 13 deletions(-) diff --git a/net/mpls/mpls_gso.c b/net/mpls/mpls_gso.c index 2055e57ed1c3..b4da6d8e8632 100644 --- a/net/mpls/mpls_gso.c +++ b/net/mpls/mpls_gso.c @@ -23,32 +23,50 @@ static struct sk_buff *mpls_gso_segment(struct sk_buff *skb, netdev_features_t features) { struct sk_buff *segs = ERR_PTR(-EINVAL); + u16 mac_offset = skb->mac_header; netdev_features_t mpls_features; + u16 mac_len = skb->mac_len; __be16 mpls_protocol; + unsigned int mpls_hlen; + + skb_reset_network_header(skb); + mpls_hlen = skb_inner_network_header(skb) - skb_network_header(skb); + if (unlikely(!pskb_may_pull(skb, mpls_hlen))) + goto out; /* Setup inner SKB. */ mpls_protocol = skb->protocol; skb->protocol = skb->inner_protocol; - /* Push back the mac header that skb_mac_gso_segment() has pulled. -* It will be re-pulled by the call to skb_mac_gso_segment() below -*/ - __skb_push(skb, skb->mac_len); + __skb_pull(skb, mpls_hlen); + + skb->mac_len = 0; + skb_reset_mac_header(skb); /* Segment inner packet. */ mpls_features = skb->dev->mpls_features & features; segs = skb_mac_gso_segment(skb, mpls_features); + if (IS_ERR_OR_NULL(segs)) { + skb_gso_error_unwind(skb, mpls_protocol, mpls_hlen, mac_offset, +mac_len); + goto out; + } + skb = segs; + + mpls_hlen += mac_len; + do { + skb->mac_len = mac_len; + skb->protocol = mpls_protocol; + skb_reset_inner_network_header(skb); - /* Restore outer protocol. */ - skb->protocol = mpls_protocol; + __skb_push(skb, mpls_hlen); - /* Re-pull the mac header that the call to skb_mac_gso_segment() -* above pulled. It will be re-pushed after returning -* skb_mac_gso_segment(), an indirect caller of this function. 
-*/ - __skb_pull(skb, skb->data - skb_mac_header(skb)); + skb_reset_mac_header(skb); + skb_set_network_header(skb, mac_len); + } while ((skb = skb->next)); +out: return segs; } diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c index aed872cc05a6..cf52cf30ac4b 100644 --- a/net/mpls/mpls_iptunnel.c +++ b/net/mpls/mpls_iptunnel.c @@ -90,7 +90,11 @@ static int mpls_xmit(struct sk_buff *skb) if (skb_cow(skb, hh_len + new_header_size)) goto drop; + skb_set_inner_protocol(skb, skb->protocol); + skb_reset_inner_network_header(skb); + skb_push(skb, new_header_size); + skb_reset_network_header(skb); skb->dev = out_dev; diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c index 1ecbd7715f6d..ca91fc33f8a9 100644 --- a/net/openvswitch/actions.c +++ b/net/openvswitch/actions.c @@ -162,10 +162,16 @@ static int push_mpls(struct sk_buff *skb, struct sw_flow_key *key, if (skb_cow_head(skb, MPLS_HLEN) < 0) return -ENOMEM; + if (!skb->inner_protocol) { + skb_set_inner_network_header(skb, skb->mac_len); + skb_set_inner_protocol(skb, skb->protocol); + } + skb_push(skb, MPLS_HLEN); memmove(skb_mac_header(skb) - MPLS_HLEN, skb_mac_
RE: [RFC PATCH 3/5] bnx2x: Add support for segmentation of tunnels with outer checksums
> >> This patch assumes that the bnx2x hardware will ignore existing
> >> IPv4/v6 header fields for length and checksum as well as the length
> >> and checksum fields for outer UDP and GRE headers.
> >>
> >> I have no means of testing this as I do not have any bnx2x hardware
> >> but thought I would submit it as an RFC to see if anyone out there
> >> wants to test this and see if this does in fact enable this
> >> functionality, allowing us to segment tunneled frames that have an
> >> outer checksum.
> >>
> >> Signed-off-by: Alexander Duyck
> >
> > So it took me some [well, a lot of] time to reach this, but I've
> > finally given it a try.
> > I saw a performance boost with the partial support - throughput for
> > vxlan tunnels with and without udpcsum was almost identical after
> > this, whereas without this patch the udpcsum prevented GSO and a
> > TCP/IPv4 connection on top of it got roughly half the throughput.
> >
> > However, I did encounter one oddity I couldn't explain - after I
> > disabled tx-udp_tnl-segmentation via ethtool on the base interface, I
> > got left with:
> >    tx-gso-partial: on
> >    tx-udp_tnl-segmentation: off
> >    tx-udp_tnl-csum-segmentation: on
> >
> > When I ran traffic over both vxlan tunnels, the one with the udpcsum
> > was still passing GSO aggregations to the base device to transmit [and
> > the throughput was the same as before], whereas the tunnel without the
> > udpcsum showed only MTU-sized packets reaching the base interface for
> > transmission [which is what I expected].
> >
> > Any idea why that happened?
>
> So the way they are implemented, tx-udp_tnl-segmentation and
> tx-udp_tnl-csum-segmentation are treated as two separate features.
> The kernel currently gives them the same treatment as NETIF_F_TSO and
> NETIF_F_TSO6. You can disable one and the other still functions.
>
> Now if you disable tx-gso-partial you should expect to see
> tx-udp_tnl-csum-segmentation be disabled because it is dependent on the
> partial GSO offload.
>
> - Alex

O.k., thanks. Then I'll run some more testing scenarios, and assuming everything works fine I'll re-send this.

Alex - should I place you in the 'From:' field?
[PATCH net-next v4 0/3] net: mpls: fragmentation and gso fixes for locally originated traffic
This series fixes mtu and fragmentation for tunnels using lwtunnel output redirect, and fixes GSO for MPLS for locally originated traffic, as reported by Lennert Buytenhek.

A follow-on series will address fragmentation and GSO for forwarded MPLS traffic. Hardware offload of GSO with MPLS also needs to be addressed.

Simon: Can you verify this works with OVS for single and multiple labels?

v4
- more updates to mpls_gso_segment per Alex's comments (thanks, Alex)
- updates to teaching OVS about marking MPLS labels as the network header

v3
- updates to mpls_gso_segment per Alex's comments
- dropped skb->encapsulation = 1 from mpls_xmit per Alex's comment

v2
- consistent use of network_header in skb to fix GSO for MPLS
- update MPLS code in OVS to use network_header and inner_network_header

David Ahern (2):
  net: mpls: Fixups for GSO
  net: veth: Set features for MPLS

Roopa Prabhu (1):
  net: lwtunnel: Handle fragmentation

 drivers/net/veth.c        |  1 +
 include/net/lwtunnel.h    | 44 ++++++++++++++++
 net/core/lwtunnel.c       | 35 +++++++++++++
 net/ipv4/ip_output.c      |  8 ++++
 net/ipv4/route.c          |  4 +++-
 net/ipv6/ip6_output.c     |  8 ++++
 net/ipv6/route.c          |  4 +++-
 net/mpls/mpls_gso.c       | 40 +++++++++++----
 net/mpls/mpls_iptunnel.c  | 13 +++++
 net/openvswitch/actions.c |  9 +++--
 10 files changed, 147 insertions(+), 19 deletions(-)

-- 
2.1.4
[PATCH net-next v4 1/3] net: lwtunnel: Handle fragmentation
From: Roopa Prabhu Today mpls iptunnel lwtunnel_output redirect expects the tunnel output function to handle fragmentation. This is ok but can be avoided if we did not do the mpls output redirect too early. ie we could wait until ip fragmentation is done and then call mpls output for each ip fragment. To make this work we will need, 1) the lwtunnel state to carry encap headroom 2) and do the redirect to the encap output handler on the ip fragment (essentially do the output redirect after fragmentation) This patch adds tunnel headroom in lwtstate to make sure we account for tunnel data in mtu calculations during fragmentation and adds new xmit redirect handler to redirect to lwtunnel xmit func after ip fragmentation. This includes IPV6 and some mtu fixes and testing from David Ahern. Signed-off-by: Roopa Prabhu Signed-off-by: David Ahern --- include/net/lwtunnel.h | 44 net/core/lwtunnel.c | 35 +++ net/ipv4/ip_output.c | 8 net/ipv4/route.c | 4 +++- net/ipv6/ip6_output.c| 8 net/ipv6/route.c | 4 +++- net/mpls/mpls_iptunnel.c | 9 + 7 files changed, 106 insertions(+), 6 deletions(-) diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h index e9f116e29c22..ea3f80f58fd6 100644 --- a/include/net/lwtunnel.h +++ b/include/net/lwtunnel.h @@ -13,6 +13,13 @@ /* lw tunnel state flags */ #define LWTUNNEL_STATE_OUTPUT_REDIRECT BIT(0) #define LWTUNNEL_STATE_INPUT_REDIRECT BIT(1) +#define LWTUNNEL_STATE_XMIT_REDIRECT BIT(2) + +enum { + LWTUNNEL_XMIT_DONE, + LWTUNNEL_XMIT_CONTINUE, +}; + struct lwtunnel_state { __u16 type; @@ -21,6 +28,7 @@ struct lwtunnel_state { int (*orig_output)(struct net *net, struct sock *sk, struct sk_buff *skb); int (*orig_input)(struct sk_buff *); int len; + __u16 headroom; __u8data[0]; }; @@ -34,6 +42,7 @@ struct lwtunnel_encap_ops { struct lwtunnel_state *lwtstate); int (*get_encap_size)(struct lwtunnel_state *lwtstate); int (*cmp_encap)(struct lwtunnel_state *a, struct lwtunnel_state *b); + int (*xmit)(struct sk_buff *skb); }; #ifdef 
CONFIG_LWTUNNEL @@ -75,6 +84,24 @@ static inline bool lwtunnel_input_redirect(struct lwtunnel_state *lwtstate) return false; } + +static inline bool lwtunnel_xmit_redirect(struct lwtunnel_state *lwtstate) +{ + if (lwtstate && (lwtstate->flags & LWTUNNEL_STATE_XMIT_REDIRECT)) + return true; + + return false; +} + +static inline unsigned int lwtunnel_headroom(struct lwtunnel_state *lwtstate, +unsigned int mtu) +{ + if (lwtunnel_xmit_redirect(lwtstate) && lwtstate->headroom < mtu) + return lwtstate->headroom; + + return 0; +} + int lwtunnel_encap_add_ops(const struct lwtunnel_encap_ops *op, unsigned int num); int lwtunnel_encap_del_ops(const struct lwtunnel_encap_ops *op, @@ -90,6 +117,7 @@ struct lwtunnel_state *lwtunnel_state_alloc(int hdr_len); int lwtunnel_cmp_encap(struct lwtunnel_state *a, struct lwtunnel_state *b); int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb); int lwtunnel_input(struct sk_buff *skb); +int lwtunnel_xmit(struct sk_buff *skb); #else @@ -117,6 +145,17 @@ static inline bool lwtunnel_input_redirect(struct lwtunnel_state *lwtstate) return false; } +static inline bool lwtunnel_xmit_redirect(struct lwtunnel_state *lwtstate) +{ + return false; +} + +static inline unsigned int lwtunnel_headroom(struct lwtunnel_state *lwtstate, +unsigned int mtu) +{ + return 0; +} + static inline int lwtunnel_encap_add_ops(const struct lwtunnel_encap_ops *op, unsigned int num) { @@ -170,6 +209,11 @@ static inline int lwtunnel_input(struct sk_buff *skb) return -EOPNOTSUPP; } +static inline int lwtunnel_xmit(struct sk_buff *skb) +{ + return -EOPNOTSUPP; +} + #endif /* CONFIG_LWTUNNEL */ #define MODULE_ALIAS_RTNL_LWT(encap_type) MODULE_ALIAS("rtnl-lwt-" __stringify(encap_type)) diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c index 669ecc9f884e..e5f84c26ba1a 100644 --- a/net/core/lwtunnel.c +++ b/net/core/lwtunnel.c @@ -251,6 +251,41 @@ int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb) } 
EXPORT_SYMBOL(lwtunnel_output); +int lwtunnel_xmit(struct sk_buff *skb) +{ + struct dst_entry *dst = skb_dst(skb); + const struct lwtunnel_encap_ops *ops; + struct lwtunnel_state *lwtstate; + int ret = -EINVAL; + + if (!dst) + goto drop; + + lwtstate = dst->lwtstate; + + if (lwtstate->type == LWTUNNEL_ENCAP_NONE || + lwtstate->type > LWTUNNEL_EN
Re: A second case of XPS considerably reducing single-stream performance
Also, while it doesn't seem to have the same massive effect on throughput, I can also see out of order behaviour happening when the sending VM is on a node with a ConnectX-3 Pro NIC. Its driver is also enabling XPS it would seem. I'm not *certain* but looking at the traces it appears that with the ConnectX-3 Pro there is more interleaving of the out-of-order traffic than there is with the Skyhawk. The ConnectX-3 Pro happens to be in a newer generation server with a newer processor than the other systems where I've seen this. I do not see the out-of-order behaviour when the NIC at the sending end is a BCM57840. It does not appear that the bnx2x driver in the 4.4 kernel is enabling XPS. So, it would seem that there are three cases of enabling XPS resulting in out-of-order traffic, two of which result in a non-trivial loss of performance. happy benchmarking, rick jones
Re: [PATCH net-next v2 1/2] net: diag: slightly refactor the inet_diag_bc_audit error checks.
From: Lorenzo Colitti Date: Wed, 24 Aug 2016 15:46:25 +0900 > This simplifies the code a bit and also allows inet_diag_bc_audit > to send to userspace an error that isn't EINVAL. > > Signed-off-by: Lorenzo Colitti Applied.
Re: [PATCH net-next v2 2/2] net: diag: allow socket bytecode filters to match socket marks
From: Lorenzo Colitti Date: Wed, 24 Aug 2016 15:46:26 +0900 > This allows a privileged process to filter by socket mark when > dumping sockets via INET_DIAG_BY_FAMILY. This is useful on > systems that use mark-based routing such as Android. > > The ability to filter socket marks requires CAP_NET_ADMIN, which > is consistent with other privileged operations allowed by the > SOCK_DIAG interface such as the ability to destroy sockets and > the ability to inspect BPF filters attached to packet sockets. > > Tested: https://android-review.googlesource.com/261350 > Signed-off-by: Lorenzo Colitti Applied.
Re: [PATCH for-next 0/2] {IB,net}/hns: Add support of ACPI to the Hisilicon RoCE Driver
From: Salil Mehta Date: Wed, 24 Aug 2016 04:44:48 +0800 > This patch is meant to add support of ACPI to the Hisilicon RoCE driver. > The following changes have been made in the driver(s): > > Patch 1/2: HNS Ethernet Driver: changes to support ACPI have been done in >the RoCE reset function part of the HNS ethernet driver. Earlier it only >supported DT/syscon. > > Patch 2/2: HNS RoCE driver: changes done in the RoCE driver are meant to detect >the type and then use either DT-specific or ACPI-specific functions. Wherever >possible, this patch tries to make use of the "Unified Device Property >Interface" APIs to support both DT and ACPI through a single interface. > > NOTE 1: ACPI changes done in both of the drivers depend upon the ACPI Table > (DSDT and IORT tables) changes part of UEFI/BIOS. These changes are NOT > part of this patch-set. > NOTE 2: Reset function in Patch 1/2 depends upon the reset function added in > ACPI tables (basically the DSDT table) part of the UEFI/BIOS. Again, this > change is NOT reflected in this patch-set. I can't apply this series to my tree because the hns infiniband driver doesn't exist in it.
Re: [PATCH net-next 2/3] net: mpls: Fixups for GSO
On 8/24/16 12:53 PM, David Ahern wrote: > What change is needed in pop_mpls? It already resets the mac_header and if > MPLS labels are removed there is no need to set network_header. I take it you > mean if the protocol is still MPLS and there are still labels then the > network header needs to be set and that means finding the bottom label. Does > OVS set the bottom of stack bit? From what I can tell OVS is not parsing the > MPLS label so no requirement that BOS is set. Without that there is no way to > tell when the labels are done short of guessing. I was confusing the inner network layer with the MPLS network header. Just sent a v4. Can you verify it works for single and multiple labels with OVS?
[PATCH net-next v4 3/3] net: veth: Set features for MPLS
veth does not really transmit packets; it only moves the skb from one netdev to another, so GSO and checksum offload are not really needed. Add the features to mpls_features to get the same benefit and performance with MPLS as without it. Reported-by: Lennert Buytenhek Signed-off-by: David Ahern --- drivers/net/veth.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/veth.c b/drivers/net/veth.c index f37a6e61d4ad..5db320a4d5cf 100644 --- a/drivers/net/veth.c +++ b/drivers/net/veth.c @@ -340,6 +340,7 @@ static void veth_setup(struct net_device *dev) dev->hw_features = VETH_FEATURES; dev->hw_enc_features = VETH_FEATURES; + dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE; } /* -- 2.1.4
[PATCH v1 1/1 net-next] 8139cp: Fix one possible deadloop in cp_rx_poll
From: Gao Feng When cp_rx_poll does not get enough packets, it checks the rx interrupt status again. If there are still pending rx interrupts, it jumps back to rx_status_loop. But the goto also resets the rx variable to zero, which makes an endless loop possible. Assume this case: each pass of rx_status_loop gets a packet count that is less than budget, and the (cpr16(IntrStatus) & cp_rx_intr_mask) condition is always true. Then the loop never terminates and the system is blocked. Fix this by moving the rx_status_loop label below the "rx = 0" initialisation, so the packet count accumulates across restarts. Signed-off-by: Gao Feng --- drivers/net/ethernet/realtek/8139cp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c index deae10d..5297bf7 100644 --- a/drivers/net/ethernet/realtek/8139cp.c +++ b/drivers/net/ethernet/realtek/8139cp.c @@ -467,8 +467,8 @@ static int cp_rx_poll(struct napi_struct *napi, int budget) unsigned int rx_tail = cp->rx_tail; int rx; -rx_status_loop: rx = 0; +rx_status_loop: cpw16(IntrStatus, cp_rx_intr_mask); while (rx < budget) { -- 1.9.1
Re: [PATCH net-next] net: dsa: rename switch operations structure
From: Vivien Didelot Date: Tue, 23 Aug 2016 12:38:56 -0400 > Now that the dsa_switch_driver structure contains only function pointers > as it is supposed to, rename it to the more appropriate dsa_switch_ops, > uniformly to any other operations structure in the kernel. > > No functional changes here, basically just the result of something like: > s/dsa_switch_driver *drv/dsa_switch_ops *ops/g > > However keep the {un,}register_switch_driver functions and their > dsa_switch_drivers list as is, since they represent the -- likely to be > deprecated soon -- legacy DSA registration framework. > > In the meantime, also fix the following checks from checkpatch.pl to > make it happy with this patch: ... > Signed-off-by: Vivien Didelot Applied, thanks Vivien.
[PATCH net-next v3 1/2] net: ethernet: mediatek: modify to use the PDMA instead of the QDMA for Ethernet RX
Because the PDMA has richer features than the QDMA for Ethernet RX (such as multiple RX rings, HW LRO, etc.), the patch modifies to use the PDMA to handle Ethernet RX. Signed-off-by: Nelson Chang --- drivers/net/ethernet/mediatek/mtk_eth_soc.c | 76 + drivers/net/ethernet/mediatek/mtk_eth_soc.h | 31 +++- 2 files changed, 74 insertions(+), 33 deletions(-) mode change 100644 => 100755 drivers/net/ethernet/mediatek/mtk_eth_soc.c mode change 100644 => 100755 drivers/net/ethernet/mediatek/mtk_eth_soc.h diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c old mode 100644 new mode 100755 index 1801fd8..cbeb793 --- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c +++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c @@ -342,25 +342,27 @@ static void mtk_mdio_cleanup(struct mtk_eth *eth) mdiobus_free(eth->mii_bus); } -static inline void mtk_irq_disable(struct mtk_eth *eth, u32 mask) +static inline void mtk_irq_disable(struct mtk_eth *eth, + unsigned reg, u32 mask) { unsigned long flags; u32 val; spin_lock_irqsave(ð->irq_lock, flags); - val = mtk_r32(eth, MTK_QDMA_INT_MASK); - mtk_w32(eth, val & ~mask, MTK_QDMA_INT_MASK); + val = mtk_r32(eth, reg); + mtk_w32(eth, val & ~mask, reg); spin_unlock_irqrestore(ð->irq_lock, flags); } -static inline void mtk_irq_enable(struct mtk_eth *eth, u32 mask) +static inline void mtk_irq_enable(struct mtk_eth *eth, + unsigned reg, u32 mask) { unsigned long flags; u32 val; spin_lock_irqsave(ð->irq_lock, flags); - val = mtk_r32(eth, MTK_QDMA_INT_MASK); - mtk_w32(eth, val | mask, MTK_QDMA_INT_MASK); + val = mtk_r32(eth, reg); + mtk_w32(eth, val | mask, reg); spin_unlock_irqrestore(ð->irq_lock, flags); } @@ -897,12 +899,12 @@ release_desc: * we continue */ wmb(); - mtk_w32(eth, ring->calc_idx, MTK_QRX_CRX_IDX0); + mtk_w32(eth, ring->calc_idx, MTK_PRX_CRX_IDX0); done++; } if (done < budget) - mtk_w32(eth, MTK_RX_DONE_INT, MTK_QMTK_INT_STATUS); + mtk_w32(eth, MTK_RX_DONE_INT, MTK_PDMA_INT_STATUS); return 
done; } @@ -1012,7 +1014,7 @@ static int mtk_napi_tx(struct napi_struct *napi, int budget) return budget; napi_complete(napi); - mtk_irq_enable(eth, MTK_TX_DONE_INT); + mtk_irq_enable(eth, MTK_QDMA_INT_MASK, MTK_TX_DONE_INT); return tx_done; } @@ -1024,12 +1026,12 @@ static int mtk_napi_rx(struct napi_struct *napi, int budget) int rx_done = 0; mtk_handle_status_irq(eth); - mtk_w32(eth, MTK_RX_DONE_INT, MTK_QMTK_INT_STATUS); + mtk_w32(eth, MTK_RX_DONE_INT, MTK_PDMA_INT_STATUS); rx_done = mtk_poll_rx(napi, budget, eth); if (unlikely(netif_msg_intr(eth))) { - status = mtk_r32(eth, MTK_QMTK_INT_STATUS); - mask = mtk_r32(eth, MTK_QDMA_INT_MASK); + status = mtk_r32(eth, MTK_PDMA_INT_STATUS); + mask = mtk_r32(eth, MTK_PDMA_INT_MASK); dev_info(eth->dev, "done rx %d, intr 0x%08x/0x%x\n", rx_done, status, mask); @@ -1038,12 +1040,12 @@ static int mtk_napi_rx(struct napi_struct *napi, int budget) if (rx_done == budget) return budget; - status = mtk_r32(eth, MTK_QMTK_INT_STATUS); + status = mtk_r32(eth, MTK_PDMA_INT_STATUS); if (status & MTK_RX_DONE_INT) return budget; napi_complete(napi); - mtk_irq_enable(eth, MTK_RX_DONE_INT); + mtk_irq_enable(eth, MTK_PDMA_INT_MASK, MTK_RX_DONE_INT); return rx_done; } @@ -1092,6 +1094,7 @@ static int mtk_tx_alloc(struct mtk_eth *eth) mtk_w32(eth, ring->phys + ((MTK_DMA_SIZE - 1) * sz), MTK_QTX_DRX_PTR); + mtk_w32(eth, (QDMA_RES_THRES << 8) | QDMA_RES_THRES, MTK_QTX_CFG(0)); return 0; @@ -1162,11 +1165,10 @@ static int mtk_rx_alloc(struct mtk_eth *eth) */ wmb(); - mtk_w32(eth, eth->rx_ring.phys, MTK_QRX_BASE_PTR0); - mtk_w32(eth, MTK_DMA_SIZE, MTK_QRX_MAX_CNT0); - mtk_w32(eth, eth->rx_ring.calc_idx, MTK_QRX_CRX_IDX0); - mtk_w32(eth, MTK_PST_DRX_IDX0, MTK_QDMA_RST_IDX); - mtk_w32(eth, (QDMA_RES_THRES << 8) | QDMA_RES_THRES, MTK_QTX_CFG(0)); + mtk_w32(eth, eth->rx_ring.phys, MTK_PRX_BASE_PTR0); + mtk_w32(eth, MTK_DMA_SIZE, MTK_PRX_MAX_CNT0); + mtk_w32(eth, eth->rx_ring.calc_idx, MTK_PRX_CRX_IDX0); + mtk_w32(eth, MTK_PST_DRX_IDX0, 
MTK_PDMA_RST_IDX); return 0; } @@ -1285,7 +1287,7 @@ static irqreturn_t mtk_handle_irq_rx(int irq, void *_eth) if (likely(napi_schedule_prep(ð->rx_napi)
[PATCH net-next v3 0/2] net: ethernet: mediatek: modify to use the PDMA for Ethernet RX
This patch set fixes the following issues v1 -> v2: Fix the bugs of PDMA cpu index and interrupt settings in mtk_poll_rx() v2 -> v3: Add GDM hardware settings to send packets to PDMA for RX Nelson Chang (2): net: ethernet: mediatek: modify to use the PDMA instead of the QDMA for Ethernet RX net: ethernet: mediatek: modify GDM to send packets to the PDMA for RX drivers/net/ethernet/mediatek/mtk_eth_soc.c | 80 + drivers/net/ethernet/mediatek/mtk_eth_soc.h | 31 ++- 2 files changed, 76 insertions(+), 35 deletions(-) -- 1.9.1
[net] i40e: Change some init flow for the client
From: Anjali Singhai Jain This change makes a common flow for Client instance open during init and reset path. The Client subtask can handle both the cases instead of making a separate notify_client_of_open call. Also it may fix a bug during reset where the service task was leaking some memory and causing issues. Change-Id: I7232a32fd52b82e863abb54266fa83122f80a0cd Signed-off-by: Anjali Singhai Jain Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_client.c | 41 --- drivers/net/ethernet/intel/i40e/i40e_main.c | 1 - 2 files changed, 30 insertions(+), 12 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_client.c b/drivers/net/ethernet/intel/i40e/i40e_client.c index e1370c5..618f184 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_client.c +++ b/drivers/net/ethernet/intel/i40e/i40e_client.c @@ -199,6 +199,7 @@ void i40e_notify_client_of_l2_param_changes(struct i40e_vsi *vsi) void i40e_notify_client_of_netdev_open(struct i40e_vsi *vsi) { struct i40e_client_instance *cdev; + int ret = 0; if (!vsi) return; @@ -211,7 +212,14 @@ void i40e_notify_client_of_netdev_open(struct i40e_vsi *vsi) "Cannot locate client instance open routine\n"); continue; } - cdev->client->ops->open(&cdev->lan_info, cdev->client); + if (!(test_bit(__I40E_CLIENT_INSTANCE_OPENED, + &cdev->state))) { + ret = cdev->client->ops->open(&cdev->lan_info, + cdev->client); + if (!ret) + set_bit(__I40E_CLIENT_INSTANCE_OPENED, + &cdev->state); + } } } mutex_unlock(&i40e_client_instance_mutex); @@ -407,12 +415,14 @@ struct i40e_vsi *i40e_vsi_lookup(struct i40e_pf *pf, * i40e_client_add_instance - add a client instance struct to the instance list * @pf: pointer to the board struct * @client: pointer to a client struct in the client list. 
+ * @existing: if there was already an existing instance * - * Returns cdev ptr on success, NULL on failure + * Returns cdev ptr on success or if already exists, NULL on failure **/ static struct i40e_client_instance *i40e_client_add_instance(struct i40e_pf *pf, - struct i40e_client *client) +struct i40e_client *client, +bool *existing) { struct i40e_client_instance *cdev; struct netdev_hw_addr *mac = NULL; @@ -421,7 +431,7 @@ struct i40e_client_instance *i40e_client_add_instance(struct i40e_pf *pf, mutex_lock(&i40e_client_instance_mutex); list_for_each_entry(cdev, &i40e_client_instances, list) { if ((cdev->lan_info.pf == pf) && (cdev->client == client)) { - cdev = NULL; + *existing = true; goto out; } } @@ -505,6 +515,7 @@ void i40e_client_subtask(struct i40e_pf *pf) { struct i40e_client_instance *cdev; struct i40e_client *client; + bool existing = false; int ret = 0; if (!(pf->flags & I40E_FLAG_SERVICE_CLIENT_REQUESTED)) @@ -528,18 +539,25 @@ void i40e_client_subtask(struct i40e_pf *pf) /* check if L2 VSI is up, if not we are not ready */ if (test_bit(__I40E_DOWN, &pf->vsi[pf->lan_vsi]->state)) continue; + } else { + dev_warn(&pf->pdev->dev, "This client %s is being instanciated at probe\n", +client->name); } /* Add the client instance to the instance list */ - cdev = i40e_client_add_instance(pf, client); + cdev = i40e_client_add_instance(pf, client, &existing); if (!cdev) continue; - /* Also up the ref_cnt of no. of instances of this client */ - atomic_inc(&client->ref_cnt); - dev_info(&pf->pdev->dev, "Added instance of Client %s to PF%d bus=0x%02x func=0x%02x\n", -client->name, pf->hw.pf_id, -pf->hw.bus.device, pf->hw.bus.func); + if (!existing) { + /* Also up the ref_cnt for no. of instances of this +* client. +*/ + atomic_inc(&client->ref_cnt); + dev_info(&pf->pdev->dev, "Added instance of Client %s to PF%d bus=0x%02x func=0x%02x\n", +
Re: [PATCH net-next 2/3] net: mpls: Fixups for GSO
On Wed, Aug 24, 2016 at 11:53 AM, David Ahern wrote: > On 8/24/16 11:41 AM, pravin shelar wrote: >> You also need to change pop_mpls(). > > What change is needed in pop_mpls? It already resets the mac_header and if > MPLS labels are removed there is no need to set network_header. I take it you > mean if the protocol is still MPLS and there are still labels then the > network header needs to be set and that means finding the bottom label. Does > OVS set the bottom of stack bit? From what I can tell OVS is not parsing the > MPLS label so no requirement that BOS is set. Without that there is no way to > tell when the labels are done short of guessing. > The OVS mpls push and pop actions work on the outermost MPLS label. So, according to the new MPLS offset tracking scheme, the mpls_pop action needs to adjust the skb network offset.
Re: [PATCH net-next 1/6] net: dsa: b53: Initialize ds->drv in b53_switch_alloc
Le 24/08/2016 à 18:33, Florian Fainelli a écrit : > In order to alloc drivers to override specific dsa_switch_driver > callbacks, initialize ds->drv to b53_switch_ops earlier, which avoids > having to expose this structure to glue drivers. > > Signed-off-by: Florian Fainelli This will need some refactoring after Vivien's "net: dsa: rename switch operations structure" patch. > --- > drivers/net/dsa/b53/b53_common.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/drivers/net/dsa/b53/b53_common.c > b/drivers/net/dsa/b53/b53_common.c > index 65ecb51f99e5..30377ceb1928 100644 > --- a/drivers/net/dsa/b53/b53_common.c > +++ b/drivers/net/dsa/b53/b53_common.c > @@ -1602,7 +1602,6 @@ static const struct b53_chip_data b53_switch_chips[] = { > > static int b53_switch_init(struct b53_device *dev) > { > - struct dsa_switch *ds = dev->ds; > unsigned int i; > int ret; > > @@ -1618,7 +1617,6 @@ static int b53_switch_init(struct b53_device *dev) > dev->vta_regs[1] = chip->vta_regs[1]; > dev->vta_regs[2] = chip->vta_regs[2]; > dev->jumbo_pm_reg = chip->jumbo_pm_reg; > - ds->drv = &b53_switch_ops; > dev->cpu_port = chip->cpu_port; > dev->num_vlans = chip->vlans; > dev->num_arl_entries = chip->arl_entries; > @@ -1706,6 +1704,7 @@ struct b53_device *b53_switch_alloc(struct device *base, > dev->ds = ds; > dev->priv = priv; > dev->ops = ops; > + ds->drv = &b53_switch_ops; > mutex_init(&dev->reg_mutex); > mutex_init(&dev->stats_mutex); > > -- Florian
Re: kernel BUG at net/unix/garbage.c:149!"
On Thu, Aug 25, 2016 at 12:40 AM, Hannes Frederic Sowa wrote: > On 24.08.2016 16:24, Nikolay Borisov wrote: [SNIP] > > One commit which could have to do with that is > > commit fc64869c48494a401b1fb627c9ecc4e6c1d74b0d > Author: Andrey Ryabinin > Date: Wed May 18 19:19:27 2016 +0300 > > net: sock: move ->sk_shutdown out of bitfields. > > but that is only a wild guess. > > Which unix_sock did you extract specifically in the url you provided? In > unix_notinflight we are specifically checking an unix domain socket that > is itself being transferred over another af_unix domain socket and not > the unix domain socket being released at this point. So this is the state of the socket that is being passed to unix_notinflight. I have a complete crashdump so if you need more info to diagnose it I'm happy to provide it. I'm not too familiar with the code in question so I will need a bit of time to grasp what actually is happening. > > Can you reproduce this and maybe also with a newer kernel? Unfortunately I cannot reproduce this since it happened on a production server nor can I change the kernel. But clearly there is something wrong, and given that this is a stable kernel and no relevant changes have gone in latest stable I believe the problem (albeit hardly reproducible) would still persist. > > Thanks for the report, > Hannes >
[PATCH v2] net: macb: Increase DMA TX buffer size
From: Nathan Sullivan In recent testing with the RT patchset, we have seen cases where the transmit ring can fill even with up to 200 txbds in the ring. Increase the size of the DMA TX ring to avoid overruns. Signed-off-by: Xander Huff Signed-off-by: Nathan Sullivan --- drivers/net/ethernet/cadence/macb.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/cadence/macb.c b/drivers/net/ethernet/cadence/macb.c index 3256839..3efddb7 100644 --- a/drivers/net/ethernet/cadence/macb.c +++ b/drivers/net/ethernet/cadence/macb.c @@ -40,7 +40,7 @@ #define RX_RING_SIZE 512 /* must be power of 2 */ #define RX_RING_BYTES (sizeof(struct macb_dma_desc) * RX_RING_SIZE) -#define TX_RING_SIZE 128 /* must be power of 2 */ +#define TX_RING_SIZE 512 /* must be power of 2 */ #define TX_RING_BYTES (sizeof(struct macb_dma_desc) * TX_RING_SIZE) /* level of occupied TX descriptors under which we wake up TX process */ -- 1.9.1
Re: [PATCH v2 2/6] cgroup: add support for eBPF programs
Hi Tejun, On 08/24/2016 11:54 PM, Tejun Heo wrote: > On Wed, Aug 24, 2016 at 10:24:19PM +0200, Daniel Mack wrote: >> +void cgroup_bpf_free(struct cgroup *cgrp) >> +{ >> +unsigned int type; >> + >> +rcu_read_lock(); >> + >> +for (type = 0; type < __MAX_BPF_ATTACH_TYPE; type++) { >> +if (!cgrp->bpf.prog[type]) >> +continue; >> + >> +bpf_prog_put(cgrp->bpf.prog[type]); >> +static_branch_dec(&cgroup_bpf_enabled_key); >> +} >> + >> +rcu_read_unlock(); > > These rcu locking seem suspicious to me. RCU locking on writer side > is usually bogus. We sometimes do it to work around locking > assertions in accessors but it's a better idea to make the assertions > better in those cases - e.g. sth like assert_mylock_or_rcu_locked(). Right, in this case, it is unnecessary, as the bpf.prog[] is not under RCU. >> +void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent) >> +{ >> +unsigned int type; >> + >> +rcu_read_lock(); > > Ditto. > >> +for (type = 0; type < __MAX_BPF_ATTACH_TYPE; type++) >> +rcu_assign_pointer(cgrp->bpf.prog_effective[type], >> +rcu_dereference(parent->bpf.prog_effective[type])); Okay, yes. We're under cgroup_mutex write-path protection here, so that's unnecessary too. >> +void __cgroup_bpf_update(struct cgroup *cgrp, >> + struct cgroup *parent, >> + struct bpf_prog *prog, >> + enum bpf_attach_type type) >> +{ >> +struct bpf_prog *old_prog, *effective; >> +struct cgroup_subsys_state *pos; >> + >> +rcu_read_lock(); > > Ditto. Yes, agreed, as above. >> +old_prog = xchg(cgrp->bpf.prog + type, prog); >> +if (old_prog) { >> +bpf_prog_put(old_prog); >> +static_branch_dec(&cgroup_bpf_enabled_key); >> +} >> + >> +if (prog) >> +static_branch_inc(&cgroup_bpf_enabled_key); > > Minor but probably better to inc first and then dec so that you can > avoid unnecessary enabled -> disabled -> enabled sequence. Good point. Will fix. 
>> +rcu_read_unlock(); >> + >> +css_for_each_descendant_pre(pos, &cgrp->self) { > > On the other hand, this walk actually requires rcu read locking unless > you're holding cgroup_mutex. I am - this function is always called with cgroup_mutex held through the wrapper in kernel/cgroup.c. Thanks a lot - will put all that changes in v3. Daniel
Continue a discussion about the netlink interface
Hello, I want to return to a discussion about the netlink interface and how to use it outside of the network subsystem. I'm developing a new interface to get information about processes (task_diag). task_diag is like socket_diag, but for processes. [0] In the first two versions [1] [2], I used the netlink interface to communicate with the kernel. There was a discussion [4] that the netlink interface is not suitable for this task and that it has a few known security issues, so it probably should not be used for task_diag. Then, in a third version [3], I used a proc transaction file instead of the netlink interface. But that was not accepted either, because we already have the netlink interface [5] and it's a bad idea to add one more similar, less generic interface. Then Andy Lutomirski suggested reworking netlink [6], but nobody answered his suggestion. Can we continue this discussion and find a final solution? Maybe we need to schedule a face-to-face meeting at one of the conferences - Linux Plumbers, for example. Here is Andy's idea of how the netlink interface could be reworked: On Wed, May 04, 2016 at 08:39:51PM -0700, Andy Lutomirski wrote: > Netlink had, and possibly still has, tons of serious security bugs > involving code checking send() callers' creds. I found and fixed a > few a couple years ago. To reiterate once again, send() CANNOT use > caller creds safely. (I feel like I say this once every few weeks. > It's getting old.) > > I realize that it's convenient to use a socket as a context to keep > state between syscalls, but it has some annoying side effects: > > - It makes people want to rely on send()'s caller's creds. > > - It's miserable in combination with seccomp. > > - It doesn't play nicely with namespaces. > > - It makes me wonder why things like task_diag, which have nothing to > do with networking, seem to get tangled up with networking.
> > > Would it be worth considering adding a parallel interface, using it > for new things, and slowly migrating old use cases over? > > int issue_kernel_command(int ns, int command, const struct iovec *iov, > int iovcnt, int flags); > > ns is an actual namespace fd or: > > KERNEL_COMMAND_CURRENT_NETNS > KERNEL_COMMAND_CURRENT_PIDNS > etc, or a special one: > KERNEL_COMMAND_GLOBAL. KERNEL_COMMAND_GLOBAL can't be used in a > non-root namespace. > > KERNEL_COMMAND_GLOBAL works even for namespaced things, if the > relevant current ns is the init namespace. (This feature is optional, > but it would allow gradually namespacing global things.) > command is an enumerated command. Each command implies a namespace > type, and, if you feed this thing the wrong namespace type, you get > EINVAL. The high bit of command indicates whether it's read-only > command. > > iov gives a command in the format expected, which, for the most part, > would be a netlink message. > > The return value is an fd that you can call read/readv on to read the > response. It's not a socket (or at least you can't do normal socket > operations on it if it is a socket behind the scenes). The > implementation of read() promises *not* to look at caller creds. The > returned fd is unconditionally cloexec -- it's 2016 already. Sheesh. > > When you've read all the data, all you can do is close the fd. You > can't issue another command on the same fd. You also can't call > write() or send() on the fd unless someone has a good reason why you > should be able to and why it's safe. You can't issue another command > on the same fd. > > > I imagine that the implementation could re-use a bunch of netlink code > under the hood. 
[6] https://www.mail-archive.com/netdev@vger.kernel.org/msg109212.html [5] https://lkml.org/lkml/2016/5/4/785 [4] https://lkml.org/lkml/2015/7/6/708 [3] https://lwn.net/Articles/683371/ [2] https://lkml.org/lkml/2015/7/6/142 [1] https://lwn.net/Articles/633622/ [0] https://criu.org/Task-diag Thanks, Andrei
Re: [PATCH v2 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
Hello, On Wed, Aug 24, 2016 at 10:24:20PM +0200, Daniel Mack wrote: > SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, > size) > { > union bpf_attr attr = {}; > @@ -888,6 +957,16 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, > uattr, unsigned int, siz > case BPF_OBJ_GET: > err = bpf_obj_get(&attr); > break; > + > +#ifdef CONFIG_CGROUP_BPF > + case BPF_PROG_ATTACH: > + err = bpf_prog_attach(&attr); > + break; > + case BPF_PROG_DETACH: > + err = bpf_prog_detach(&attr); > + break; > +#endif So, this is one thing I haven't realized while pushing for "just embed it in cgroup". Breaking it out to a separate controller allows using its own locking instead of having to piggyback on cgroup_mutex. That said, as long as cgroup_mutex is not nested inside some inner mutex, this shouldn't be a problem. I still think the embedding is fine and whether we make it an implicit controller or not doesn't affect userland API at all, so it's an implementation detail that we can change later if necessary. Thanks. -- tejun
[PATCH v2] Revert "phy: IRQ cannot be shared"
This reverts: commit 33c133cc7598 ("phy: IRQ cannot be shared") On hardware with multiple PHY devices hooked up to the same IRQ line, allow them to share it. Sergei Shtylyov says: "I'm not sure now what was the reason I concluded that the IRQ sharing was impossible... most probably I thought that the kernel IRQ handling code exited the loop over the IRQ actions once IRQ_HANDLED was returned -- which is obviously not so in reality..." Signed-off-by: Xander Huff Signed-off-by: Nathan Sullivan --- Note: this reverted code fails "CHECK: Alignment should match open parenthesis" --- drivers/net/phy/phy.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c index c5dc2c36..c6f6683 100644 --- a/drivers/net/phy/phy.c +++ b/drivers/net/phy/phy.c @@ -722,8 +722,10 @@ phy_err: int phy_start_interrupts(struct phy_device *phydev) { atomic_set(&phydev->irq_disable, 0); - if (request_irq(phydev->irq, phy_interrupt, 0, "phy_interrupt", - phydev) < 0) { + if (request_irq(phydev->irq, phy_interrupt, + IRQF_SHARED, + "phy_interrupt", + phydev) < 0) { pr_warn("%s: Can't get IRQ %d (PHY)\n", phydev->mdio.bus->name, phydev->irq); phydev->irq = PHY_POLL; -- 1.9.1
Re: [PATCH] phy: request shared IRQ
On 8/24/2016 1:41 PM, Sergei Shtylyov wrote: Hello. On 08/24/2016 08:53 PM, Xander Huff wrote: From: Nathan Sullivan On hardware with multiple PHY devices hooked up to the same IRQ line, allow them to share it. Note that it had been allowed until my (erroneous?) commit 33c133cc7598e60976a069344910d63e56cc4401 ("phy: IRQ cannot be shared"), so I'd like this commit just reverted instead... I'm not sure now what was the reason I concluded that the IRQ sharing was impossible... most probably I thought that the kernel IRQ handling code exited the loop over the IRQ actions once IRQ_HANDLED was returned -- which is obviously not so in reality... MBR, Sergei Thanks for the suggestion, Sergei. I'll do just that. -- Xander Huff Staff Software Engineer National Instruments
Re: [PATCH v2 2/6] cgroup: add support for eBPF programs
Hello, Daniel. On Wed, Aug 24, 2016 at 10:24:19PM +0200, Daniel Mack wrote: > +void cgroup_bpf_free(struct cgroup *cgrp) > +{ > + unsigned int type; > + > + rcu_read_lock(); > + > + for (type = 0; type < __MAX_BPF_ATTACH_TYPE; type++) { > + if (!cgrp->bpf.prog[type]) > + continue; > + > + bpf_prog_put(cgrp->bpf.prog[type]); > + static_branch_dec(&cgroup_bpf_enabled_key); > + } > + > + rcu_read_unlock(); These rcu locking seem suspicious to me. RCU locking on writer side is usually bogus. We sometimes do it to work around locking assertions in accessors but it's a better idea to make the assertions better in those cases - e.g. sth like assert_mylock_or_rcu_locked(). > +void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent) > +{ > + unsigned int type; > + > + rcu_read_lock(); Ditto. > + for (type = 0; type < __MAX_BPF_ATTACH_TYPE; type++) > + rcu_assign_pointer(cgrp->bpf.prog_effective[type], > + rcu_dereference(parent->bpf.prog_effective[type])); > + > + rcu_read_unlock(); > +} ... > +void __cgroup_bpf_update(struct cgroup *cgrp, > + struct cgroup *parent, > + struct bpf_prog *prog, > + enum bpf_attach_type type) > +{ > + struct bpf_prog *old_prog, *effective; > + struct cgroup_subsys_state *pos; > + > + rcu_read_lock(); Ditto. > + old_prog = xchg(cgrp->bpf.prog + type, prog); > + if (old_prog) { > + bpf_prog_put(old_prog); > + static_branch_dec(&cgroup_bpf_enabled_key); > + } > + > + if (prog) > + static_branch_inc(&cgroup_bpf_enabled_key); Minor but probably better to inc first and then dec so that you can avoid unnecessary enabled -> disabled -> enabled sequence. > + effective = (!prog && parent) ? > + rcu_dereference(parent->bpf.prog_effective[type]) : prog; If this is what's triggering rcu warnings, there's an accessor to use in these situations. > + rcu_read_unlock(); > + > + css_for_each_descendant_pre(pos, &cgrp->self) { On the other hand, this walk actually requires rcu read locking unless you're holding cgroup_mutex. Thanks. -- tejun
Re: kernel BUG at net/unix/garbage.c:149!
On 24.08.2016 16:24, Nikolay Borisov wrote: > Hello, > > I hit the following BUG: > > [1851513.239831] [ cut here ] > [1851513.240079] kernel BUG at net/unix/garbage.c:149! > [1851513.240313] invalid opcode: [#1] SMP > [1851513.248320] CPU: 37 PID: 11683 Comm: nginx Tainted: G O > 4.4.14-clouder3 #26 > [1851513.248719] Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015 > [1851513.248966] task: 883b0f6f ti: 880189cf task.ti: > 880189cf > [1851513.249361] RIP: 0010:[] [] > unix_notinflight+0x8d/0x90 > [1851513.249846] RSP: 0018:880189cf3cf8 EFLAGS: 00010246 > [1851513.250082] RAX: 883b05491968 RBX: 883b05491680 RCX: > 8807f9967330 > [1851513.250476] RDX: 0001 RSI: 882e6d8bae00 RDI: > 82073f10 > [1851513.250886] RBP: 880189cf3d08 R08: 880cbc70e200 R09: > 00018021 > [1851513.251280] R10: 883fff3b9dc0 R11: ea0032f1c380 R12: > 883fbaf5 > [1851513.251674] R13: 815f6354 R14: 881a7c77b140 R15: > 881a7c7792c0 > [1851513.252083] FS: 7f4f19573720() GS:883fff3a() > knlGS: > [1851513.252481] CS: 0010 DS: ES: CR0: 80050033 > [1851513.252717] CR2: 013062d8 CR3: 001712f32000 CR4: > 001406e0 > [1851513.253116] Stack: > [1851513.253345] 880189cf3d40 880189cf3d28 > 815f4383 > [1851513.254022] 8839ee11a800 8839ee11a800 880189cf3d60 > 815f53b8 > [1851513.254685] 883406788de0 > > [1851513.255360] Call Trace: > [1851513.255594] [] unix_detach_fds.isra.19+0x43/0x50 > [1851513.255851] [] unix_destruct_scm+0x48/0x80 > [1851513.256090] [] skb_release_head_state+0x4f/0xb0 > [1851513.256328] [] skb_release_all+0x12/0x30 > [1851513.256564] [] kfree_skb+0x32/0xa0 > [1851513.256810] [] unix_release_sock+0x1e4/0x2c0 > [1851513.257046] [] unix_release+0x20/0x30 > [1851513.257284] [] sock_release+0x1f/0x80 > [1851513.257521] [] sock_close+0x12/0x20 > [1851513.257769] [] __fput+0xea/0x1f0 > [1851513.258005] [] fput+0xe/0x10 > [1851513.258244] [] task_work_run+0x7f/0xb0 > [1851513.258488] [] exit_to_usermode_loop+0xc0/0xd0 > [1851513.258728] [] syscall_return_slowpath+0x80/0xf0 > 
[1851513.258983] [] int_ret_from_sys_call+0x25/0x9f > [1851513.259222] Code: 7e 5b 41 5c 5d c3 48 8b 8b e8 02 00 00 48 8b 93 f0 02 > 00 00 48 89 51 08 48 89 0a 48 89 83 e8 02 00 00 48 89 83 f0 02 00 00 eb b8 > <0f> 0b 90 0f 1f 44 00 00 55 48 c7 c7 10 3f 07 82 48 89 e5 41 54 > [1851513.268473] RIP [] unix_notinflight+0x8d/0x90 > [1851513.268793] RSP > > That's essentially BUG_ON(list_empty(&u->link)); > > I see that all the code involving the ->link member hasn't really been > touched since it was introduced in 2007. So this must be a latent bug. > This is the first time I've observed it. The state > of the struct unix_sock can be found here http://sprunge.us/WCMW . Evidently, > there are no inflight sockets. One commit which could have to do with that is commit fc64869c48494a401b1fb627c9ecc4e6c1d74b0d Author: Andrey Ryabinin Date: Wed May 18 19:19:27 2016 +0300 net: sock: move ->sk_shutdown out of bitfields. but that is only a wild guess. Which unix_sock did you extract specifically in the url you provided? In unix_notinflight we are specifically checking an unix domain socket that is itself being transferred over another af_unix domain socket and not the unix domain socket being released at this point. Can you reproduce this and maybe also with a newer kernel? Thanks for the report, Hannes
[PATCH] net: systemport: Fix ordering in intrl2_*_mask_clear macro
Since we keep shadow copies of which interrupt sources are enabled through the intrl2_*_mask_{set,clear} macros, make sure that the ordering in which we do these two operations: update the copy, then unmask the register is correct. This is not currently a problem because we actually do not use them, but we will in a subsequent patch optimizing register accesses, so better be safe here. Fixes: 80105befdb4b ("net: systemport: add Broadcom SYSTEMPORT Ethernet MAC driver") Signed-off-by: Florian Fainelli --- David, This is intentionally targetting the "net-next" tree since it is not yet a problem, yet this is still technically a bugfix. No need to backport this to -stable or anything. Thanks! drivers/net/ethernet/broadcom/bcmsysport.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c b/drivers/net/ethernet/broadcom/bcmsysport.c index b2d30863caeb..2059911014db 100644 --- a/drivers/net/ethernet/broadcom/bcmsysport.c +++ b/drivers/net/ethernet/broadcom/bcmsysport.c @@ -58,8 +58,8 @@ BCM_SYSPORT_IO_MACRO(topctrl, SYS_PORT_TOPCTRL_OFFSET); static inline void intrl2_##which##_mask_clear(struct bcm_sysport_priv *priv, \ u32 mask) \ { \ - intrl2_##which##_writel(priv, mask, INTRL2_CPU_MASK_CLEAR); \ priv->irq##which##_mask &= ~(mask); \ + intrl2_##which##_writel(priv, mask, INTRL2_CPU_MASK_CLEAR); \ } \ static inline void intrl2_##which##_mask_set(struct bcm_sysport_priv *priv, \ u32 mask) \ -- 2.7.4
[PATCH V2] dt: net: enhance DWC EQoS binding to support Tegra186
From: Stephen Warren The Synopsys DWC EQoS is a configurable IP block which supports multiple options for bus type, clocking and reset structure, and feature list. Extend the DT binding to define a "compatible value" for the configuration contained in NVIDIA's Tegra186 SoC, and define some new properties and list property entries required by that configuration. Signed-off-by: Stephen Warren --- v2: * Add an explicit compatible value for the Axis SoC's version of the EQOS IP; this allows the driver to handle any SoC-specific integration quirks that are required, rather than only knowing about the IP block in isolation. This is good general DT practice. The existing value is still documented to support existing DTs. * Reworked the list of clocks the binding requires: - Combined "tx" and "phy_ref_clk"; for GMII/RGMII configurations, these are the same thing. - Added extra description to the "rx" and "tx" clocks, to make it clear exactly which HW clock they represent. - Made the new "tx" and "slave_bus" names more prominent than the original "phy_ref_clk" and "apb_pclk". The new names are more generic and should work for any enhanced version of the binding (e.g. to support additional PHY types). New compatible values will hopefully choose to require the new names. * Added a couple extra clocks to the list that may need to be supported in future binding revisions. * Fixed a typo; "clocks" -> "resets". 
--- .../bindings/net/snps,dwc-qos-ethernet.txt | 75 -- 1 file changed, 71 insertions(+), 4 deletions(-) diff --git a/Documentation/devicetree/bindings/net/snps,dwc-qos-ethernet.txt b/Documentation/devicetree/bindings/net/snps,dwc-qos-ethernet.txt index 51f8d2eba8d8..1d028259824a 100644 --- a/Documentation/devicetree/bindings/net/snps,dwc-qos-ethernet.txt +++ b/Documentation/devicetree/bindings/net/snps,dwc-qos-ethernet.txt @@ -1,21 +1,87 @@ * Synopsys DWC Ethernet QoS IP version 4.10 driver (GMAC) +This binding supports the Synopsys Designware Ethernet QoS (Quality Of Service) +IP block. The IP supports multiple options for bus type, clocking and reset +structure, and feature list. Consequently, a number of properties and list +entries in properties are marked as optional, or only required in specific HW +configurations. Required properties: -- compatible: Should be "snps,dwc-qos-ethernet-4.10" +- compatible: One of: + - "axis,artpec6-eqos", "snps,dwc-qos-ethernet-4.10" +Represents the IP core when integrated into the Axis ARTPEC-6 SoC. + - "nvidia,tegra186-eqos", "snps,dwc-qos-ethernet-4.10" +Represents the IP core when integrated into the NVIDIA Tegra186 SoC. + - "snps,dwc-qos-ethernet-4.10" +This combination is deprecated. It should be treated as equivalent to +"axis,artpec6-eqos", "snps,dwc-qos-ethernet-4.10". It is supported to be +compatible with earlier revisions of this binding. - reg: Address and length of the register set for the device -- clocks: Phandles to the reference clock and the bus clock -- clock-names: Should be "phy_ref_clk" for the reference clock and "apb_pclk" - for the bus clock. +- clocks: Phandle and clock specifiers for each entry in clock-names, in the + same order. See ../clock/clock-bindings.txt. +- clock-names: May contain any/all of the following depending on the IP + configuration, in any order: + - "tx" +(Alternate name "phy_ref_clk"; only one alternate must appear.) +The EQOS transmit path clock. The HW signal name is clk_tx_i. 
+In some configurations (e.g. GMII/RGMII), this clock also drives the PHY TX +path. In other configurations, other clocks (such as tx_125, rmii) may +drive the PHY TX path. + - "rx" +The EQOS receive path clock. The HW signal name is clk_rx_i. +In some configurations (e.g. GMII/RGMII), this clock also drives the PHY RX +path. In other configurations, other clocks (such as rx_125, pmarx_0, +pmarx_1, rmii) may drive the PHY RX path. + - "slave_bus" +(Alternate name "apb_pclk"; only one alternate must appear.) +The CPU/slave-bus (CSR) interface clock. Despite the name, this applies to +any bus type; APB, AHB, AXI, etc. The HW signal name is hclk_i (AHB) or +clk_csr_i (other buses). + - "master_bus" +The master bus interface clock. Only required in configurations that use a +separate clock for the master and slave bus interfaces. The HW signal name +is hclk_i (AHB) or aclk_i (AXI). + - "ptp_ref" +The PTP reference clock. The HW signal name is clk_ptp_ref_i. + + Note: Support for additional IP configurations may require adding the + following clocks to this list in the future: clk_rx_125_i, clk_tx_125_i, + clk_pmarx_0_i, clk_pmarx1_i, clk_rmii_i, clk_revmii_rx_i, clk_revmii_tx_i. + + The following compatible values require the following set of clocks: + - "nvidia,tegra186-eqos", "snps,dwc-qos-ethernet-4.10": +- "slave_bus" +- "master_bus" +- "rx" +- "tx" +-
Re: [PATCH net-next] net: dsa: rename switch operations structure
On 08/23/2016 09:38 AM, Vivien Didelot wrote: > Now that the dsa_switch_driver structure contains only function pointers > as it is supposed to, rename it to the more appropriate dsa_switch_ops, > uniformly to any other operations structure in the kernel. > > No functional changes here, basically just the result of something like: > s/dsa_switch_driver *drv/dsa_switch_ops *ops/g > > However keep the {un,}register_switch_driver functions and their > dsa_switch_drivers list as is, since they represent the -- likely to be > deprecated soon -- legacy DSA registration framework. > > In the meantime, also fix the following checks from checkpatch.pl to > make it happy with this patch: > > CHECK: Comparison to NULL could be written "!ops" > #403: FILE: net/dsa/dsa.c:470: > + if (ops == NULL) { > > CHECK: Comparison to NULL could be written "ds->ops->get_strings" > #773: FILE: net/dsa/slave.c:697: > + if (ds->ops->get_strings != NULL) > > CHECK: Comparison to NULL could be written "ds->ops->get_ethtool_stats" > #824: FILE: net/dsa/slave.c:785: > + if (ds->ops->get_ethtool_stats != NULL) > > CHECK: Comparison to NULL could be written "ds->ops->get_sset_count" > #835: FILE: net/dsa/slave.c:798: > + if (ds->ops->get_sset_count != NULL) > > total: 0 errors, 0 warnings, 4 checks, 784 lines checked > > Signed-off-by: Vivien Didelot Acked-by: Florian Fainelli Thanks! -- Florian
Re: [PATCH 0/5] Networking cgroup controller
Hello, Anoop. On Wed, Aug 10, 2016 at 05:53:13PM -0700, Anoop Naravaram wrote: > This patchset introduces a cgroup controller for the networking subsystem as a > whole. As of now, this controller will be used for: > > * Limiting the specific ports that a process in a cgroup is allowed to bind > to or listen on. For example, you can say that all the processes in a > cgroup can only bind to ports 1000-2000, and listen on ports 1000-1100, > which > guarantees that the remaining ports will be available for other processes. > > * Restricting which DSCP values processes can use with their sockets. For > example, you can say that all the processes in a cgroup can only send > packets with a DSCP tag between 48 and 63 (corresponding to TOS values of > 192 to 255). > > * Limiting the total number of udp ports that can be used by a process in a > cgroup. For example, you can say that all the processes in one cgroup are > allowed to use a total of up to 100 udp ports. Since the total number of udp > ports that can be used by all processes is limited, this is useful for > rationing out the ports to different process groups. > > In the future, more networking-related properties may be added to this > controller. Thanks for working on this; however, I share the sentiment expressed by others that this looks like too piecemeal an approach. If there are no alternatives, we surely should consider this but it at least *looks* like bpf should be able to cover the same functionalities without having to revise and extend in-kernel capabilities constantly. Thanks. -- tejun
Re: [net] openvswitch: Allow deferred action fifo to expand during run time
> From: "David Miller" > To: az...@ovn.org > Cc: d...@openvswitch.com, netdev@vger.kernel.org > Sent: Friday, March 18, 2016 5:19:09 PM > Subject: Re: [net] openvswitch: Allow deferred action fifo to expand during > run time > > From: Andy Zhou > Date: Thu, 17 Mar 2016 21:32:13 -0700 > > > Current openvswitch implementation allows up to 10 recirculation actions > > for each packet. This limit was sufficient for most use cases in the > > past, but with more new features, such as supporting connection > > tracking, and testing in larger scale network environment, > > This limit may be too restrictive. > ... > > Actions that need to recirculate that many times are extremely poorly > designed, and will have significant performance problems. > > I think the way rules are put together and processed should be redone > before we do insane stuff like this. > > There is no way I'm applying a patch like this, sorry. > Apologies for coming into this thread so late, I happened on it after finding out that this is actually an issue in some production networks. The need to buffer so many deferred actions seems to be mostly due to having relatively simple rules (that have, say, one or two recirculations) that get multiplied per packet by the number of egress ports. For example, a configuration with 11 or more OVS bond ports in balance-tcp mode (which needs one recirculation) will exceed the deferred action fifo limit of 10 every time a broadcast (or multicast or unknown unicast) is forwarded by the OVS bridge because one entry will be consumed by each egress port. Since the order in which egress ports are handled is deterministic, this means e.g. broadcast ARP requests will only ever make it out the first 10 bond ports in this scenario. Note that bonding isn't necessary to have this issue, it just makes for a relatively straightforward example. Andy's patch certainly seems to be an improvement on this situation, but maybe there is another/better way. Regards, Lance
Re: [ethtool PATCH v4 0/4] Add support for QSFP+/QSFP28 Diagnostics and 25G/50G/100G port speeds
On Wed, Aug 24, 2016 at 10:33:04AM -0400, John W. Linville wrote: > On Wed, Aug 24, 2016 at 04:29:22AM +, Yuval Mintz wrote: > > > This patch series provides following support > > > a) Reorganized fields based out of SFF-8024 fields i.e. Identifier/ > > >Encoding/Connector types which are common across SFP/SFP+ (SFF-8472) > > >and QSFP+/QSFP28 (SFF-8436/SFF-8636) modules into sff-common files. > > > b) Support for diagnostics information for QSFP Plus/QSFP28 modules > > >based on SFF-8436/SFF-8636 > > > c) Supporting 25G/50G/100G speeds in supported/advertising fields > > > d) Tested across various QSFP+/QSFP28 Copper/Optical modules > > > > > > Standards for QSFP+/QSFP28 > > > a) QSFP+/QSFP28 - SFF 8636 Rev 2.7 dated January 26,2016 > > > b) SFF-8024 Rev 4.0 dated May 31, 2016 > > > > > > v4: > > > Sync ethtool-copy.h to kernel commit > > > 89da45b8b5b2187734a11038b8593714f964ffd1 > > > which includes support for 50G base SR2 > > > > What about the man-page? > > I can just apply your man page patch on top. And, I did. -- John W. LinvilleSomeday the world will need a hero, and you linvi...@tuxdriver.com might be all we have. Be ready.
Re: [ethtool PATCH v4 0/4] Add support for QSFP+/QSFP28 Diagnostics and 25G/50G/100G port speeds
I have pushed this series. I did modify patches 3 and 4 a bit, to properly update Makefile.am in order to keep "make distcheck" from failing -- please be more careful in the future. John P.S. I have not yet tagged this as an official release, so please test! On Tue, Aug 23, 2016 at 06:30:29AM -0700, Vidya Sagar Ravipati wrote: > From: Vidya Sagar Ravipati > > This patch series provides following support > a) Reorganized fields based out of SFF-8024 fields i.e. Identifier/ >Encoding/Connector types which are common across SFP/SFP+ (SFF-8472) >and QSFP+/QSFP28 (SFF-8436/SFF-8636) modules into sff-common files. > b) Support for diagnostics information for QSFP Plus/QSFP28 modules >based on SFF-8436/SFF-8636 > c) Supporting 25G/50G/100G speeds in supported/advertising fields > d) Tested across various QSFP+/QSFP28 Copper/Optical modules > > Standards for QSFP+/QSFP28 > a) QSFP+/QSFP28 - SFF 8636 Rev 2.7 dated January 26,2016 > b) SFF-8024 Rev 4.0 dated May 31, 2016 > > v4: > Sync ethtool-copy.h to kernel commit > 89da45b8b5b2187734a11038b8593714f964ffd1 > which includes support for 50G base SR2 > > v3: > Review comments from Ben Hutchings: >Make sff diags structure common across sfpdiag.c and >qsfp.c and use common function to print common threshold >values. > Review comments from Rami Rosen: >Cleanup description messages.
> > v2: > Included support for 25G/50G/100G speeds in supported/ > advertised speed modes > Review comments from Ben Hutchings: > Split the sff-8024 reorganzing patch and QSFP+/QSFP28 > patch > Fixed all checkpatch warnings (except couple of over 80 character) > > v1: > Support for SFF-8636 Rev 2.7 > Review comments from Ben Hutchings: >Updating copyright holders information for QSFP >Reusing the common functions and macros across sfpid and qsfp > > Vidya Sagar Ravipati (4): > ethtool-copy.h:sync with net > ethtool:Reorganizing SFF-8024 fields for SFP/QSFP > ethtool:QSFP Plus/QSFP28 Diagnostics Information Support > ethtool: Enhancing link mode bits to support 25G/50G/100G > > Makefile.am| 2 +- > ethtool-copy.h | 18 +- > ethtool.c | 35 +++ > internal.h | 3 + > qsfp.c | 788 > + > qsfp.h | 595 +++ > sff-common.c | 304 ++ > sff-common.h | 189 ++ > sfpdiag.c | 105 +--- > sfpid.c| 103 +--- > 10 files changed, 1945 insertions(+), 197 deletions(-) > create mode 100644 qsfp.c > create mode 100644 qsfp.h > create mode 100644 sff-common.c > create mode 100644 sff-common.h > > -- > 2.1.4 > > -- John W. LinvilleSomeday the world will need a hero, and you linvi...@tuxdriver.com might be all we have. Be ready.
Re: [PATCH] dt: net: enhance DWC EQoS binding to support Tegra186
On 08/24/2016 02:10 AM, Lars Persson wrote: On 08/23/2016 10:47 PM, Stephen Warren wrote: The Synopsys DWC EQoS is a configurable IP block which supports multiple options for bus type, clocking and reset structure, and feature list. Extend the DT binding to define a "compatible value" for the configuration contained in NVIDIA's Tegra186 SoC, and define some new properties and list property entries required by that configuration. diff --git a/Documentation/devicetree/bindings/net/snps,dwc-qos-ethernet.txt b/Documentation/devicetree/bindings/net/snps,dwc-qos-ethernet.txt Optional properties: +- phy-reset-gpios: Phandle and specifier for any GPIO used to reset the PHY. + See ../gpio/gpio.txt. IMHO the phy reset gpio belongs in the binding for the PHY. I notice some other ethernet drivers have this, but the PHY should be managed entirely through the phylib and any special handling for reset can be hidden in phy specific drivers. I can see that argument; this GPIO certainly does control the PHY so seems part of it. However, presumably this GPIO must be manipulated before being able to communicate with the PHY at all, and hence instantiate any driver that might control the PHY. As such, this seems more like a property of the MDIO bus than the PHY itself, even if it electrically is part of the PHY. Also, Documentation/devicetree/bindings/net/phy.txt doesn't contain any phy-reset-gpios property or similar, so we'd have to add that if we wanted to rely upon it. For now I'll post V2 without changing this, but I can always post V3 if needed.
[PATCH] iproute: disallow ip rule del without parameters
Disallow running `ip rule del` without any parameters, to avoid deleting whichever rule happens to match first in the table. Signed-off-by: Andrey Jr. Melnikov --- diff --git a/ip/iprule.c b/ip/iprule.c index 8f24206..70562c5 100644 --- a/ip/iprule.c +++ b/ip/iprule.c @@ -346,6 +346,11 @@ static int iprule_modify(int cmd, int argc, char **argv) req.r.rtm_type = RTN_UNICAST; } + if (cmd == RTM_DELRULE && argc == 0) { + fprintf(stderr, "\"ip rule del\" requires arguments.\n"); + return -1; + } + while (argc > 0) { if (strcmp(*argv, "not") == 0) { req.r.rtm_flags |= FIB_RULE_INVERT;
Re: [ethtool PATCH v4 0/4] Add support for QSFP+/QSFP28 Diagnostics and 25G/50G/100G port speeds
On Wed, Aug 24, 2016 at 1:01 PM, John W. Linville wrote: > I have pushed this series. I did modify patches 3 and 4 a bit, > to properly update Makefile.am in order to keep "make distcheck" > from failing -- please be more careful in the future. > Thanks for pushing the patches. Not aware of "make distcheck" and will be careful going forward. Quickly validated the build on SFP+/QSFP+/QSFP28 and everything seems fine > John > > P.S. I have not yet tagged this as an official release, so please test! > > On Tue, Aug 23, 2016 at 06:30:29AM -0700, Vidya Sagar Ravipati wrote: >> From: Vidya Sagar Ravipati >> >> This patch series provides following support >> a) Reorganized fields based out of SFF-8024 fields i.e. Identifier/ >>Encoding/Connector types which are common across SFP/SFP+ (SFF-8472) >>and QSFP+/QSFP28 (SFF-8436/SFF-8636) modules into sff-common files. >> b) Support for diagnostics information for QSFP Plus/QSFP28 modules >>based on SFF-8436/SFF-8636 >> c) Supporting 25G/50G/100G speeds in supported/advertising fields >> d) Tested across various QSFP+/QSFP28 Copper/Optical modules >> >> Standards for QSFP+/QSFP28 >> a) QSFP+/QSFP28 - SFF 8636 Rev 2.7 dated January 26,2016 >> b) SFF-8024 Rev 4.0 dated May 31, 2016 >> >> v4: >> Sync ethtool-copy.h to kernel commit >> 89da45b8b5b2187734a11038b8593714f964ffd1 >> which includes support for 50G base SR2 >> >> v3: >> Review comments from Ben Hutchings: >>Make sff diags structure common across sfpdiag.c and >>qsfp.c and use common function to print common threshold >>values. >> Review comments from Rami Rosen: >>Cleanup description messages.
>> >> v2: >> Included support for 25G/50G/100G speeds in supported/ >> advertised speed modes >> Review comments from Ben Hutchings: >> Split the sff-8024 reorganzing patch and QSFP+/QSFP28 >> patch >> Fixed all checkpatch warnings (except couple of over 80 character) >> >> v1: >> Support for SFF-8636 Rev 2.7 >> Review comments from Ben Hutchings: >>Updating copyright holders information for QSFP >>Reusing the common functions and macros across sfpid and qsfp >> >> Vidya Sagar Ravipati (4): >> ethtool-copy.h:sync with net >> ethtool:Reorganizing SFF-8024 fields for SFP/QSFP >> ethtool:QSFP Plus/QSFP28 Diagnostics Information Support >> ethtool: Enhancing link mode bits to support 25G/50G/100G >> >> Makefile.am| 2 +- >> ethtool-copy.h | 18 +- >> ethtool.c | 35 +++ >> internal.h | 3 + >> qsfp.c | 788 >> + >> qsfp.h | 595 +++ >> sff-common.c | 304 ++ >> sff-common.h | 189 ++ >> sfpdiag.c | 105 +--- >> sfpid.c| 103 +--- >> 10 files changed, 1945 insertions(+), 197 deletions(-) >> create mode 100644 qsfp.c >> create mode 100644 qsfp.h >> create mode 100644 sff-common.c >> create mode 100644 sff-common.h >> >> -- >> 2.1.4 >> >> > > -- > John W. LinvilleSomeday the world will need a hero, and you > linvi...@tuxdriver.com might be all we have. Be ready.
Re: CVE-2014-9900 fix is not upstream
On 24.08.2016 16:03, Lennart Sorensen wrote: > On Tue, Aug 23, 2016 at 10:25:45PM +0100, Al Viro wrote: >> Sadly, sizeof is what we use when copying that sucker to userland. So these >> padding bits in the end would've leaked, true enough, and the case is >> somewhat >> weaker. And any normal architecture will have those, but then any such >> architecture will have no more trouble zeroing a 32bit value than 16bit one. > > Hmm, good point. Too bad I don't see a compiler option of "zero all > padding in structs". Certainly generating the code should not really > be that different. > > I see someone did request it 2 years ago: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63479 I don't think this is sufficient. Basically if you write one field in a struct after a memset again, the compiler is allowed by the standard to write padding bytes again, causing them to be undefined. If we want to go down this route, probably the only option is to add __attribute__((packed)) to those structs to just have no padding at all, thus breaking uapi. E.g. the x11 protocol implementation specifies padding bytes in their binary representation of the wire protocol to limit the leaking: https://cgit.freedesktop.org/xorg/proto/xproto/tree/Xproto.h ... which would be another option. Bye, Hannes
[PATCH v2 1/6] bpf: add new prog type for cgroup socket filtering
For now, this program type is equivalent to BPF_PROG_TYPE_SOCKET_FILTER in terms of checks during the verification process. It may access the skb as well. Programs of this type will be attached to cgroups for network filtering and accounting. Signed-off-by: Daniel Mack --- include/uapi/linux/bpf.h | 7 +++ kernel/bpf/verifier.c| 1 + net/core/filter.c| 6 ++ 3 files changed, 14 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index e4c5a1b..1d5db42 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -95,6 +95,13 @@ enum bpf_prog_type { BPF_PROG_TYPE_SCHED_ACT, BPF_PROG_TYPE_TRACEPOINT, BPF_PROG_TYPE_XDP, + BPF_PROG_TYPE_CGROUP_SOCKET_FILTER, +}; + +enum bpf_attach_type { + BPF_ATTACH_TYPE_CGROUP_INET_INGRESS, + BPF_ATTACH_TYPE_CGROUP_INET_EGRESS, + __MAX_BPF_ATTACH_TYPE }; #define BPF_PSEUDO_MAP_FD 1 diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index abb61f3..12ca880 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1805,6 +1805,7 @@ static bool may_access_skb(enum bpf_prog_type type) case BPF_PROG_TYPE_SOCKET_FILTER: case BPF_PROG_TYPE_SCHED_CLS: case BPF_PROG_TYPE_SCHED_ACT: + case BPF_PROG_TYPE_CGROUP_SOCKET_FILTER: return true; default: return false; diff --git a/net/core/filter.c b/net/core/filter.c index a83766b..bc04e5c 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2848,12 +2848,18 @@ static struct bpf_prog_type_list xdp_type __read_mostly = { .type = BPF_PROG_TYPE_XDP, }; +static struct bpf_prog_type_list cg_sk_filter_type __read_mostly = { + .ops= &sk_filter_ops, + .type = BPF_PROG_TYPE_CGROUP_SOCKET_FILTER, +}; + static int __init register_sk_filter_ops(void) { bpf_register_prog_type(&sk_filter_type); bpf_register_prog_type(&sched_cls_type); bpf_register_prog_type(&sched_act_type); bpf_register_prog_type(&xdp_type); + bpf_register_prog_type(&cg_sk_filter_type); return 0; } -- 2.5.5
[PATCH v2 4/6] net: filter: run cgroup eBPF ingress programs
If the cgroup associated with the receiving socket has eBPF programs installed, run them from sk_filter_trim_cap(). eBPF programs used in this context are expected to either return 1 to let the packet pass, or != 1 to drop them. The programs have access to the full skb, including the MAC headers. Note that cgroup_bpf_run_filter() is stubbed out as static inline nop for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if the feature is unused. Signed-off-by: Daniel Mack --- net/core/filter.c | 5 + 1 file changed, 5 insertions(+) diff --git a/net/core/filter.c b/net/core/filter.c index bc04e5c..163f75b 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -78,6 +78,11 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap) if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC)) return -ENOMEM; + err = cgroup_bpf_run_filter(sk, skb, + BPF_ATTACH_TYPE_CGROUP_INET_INGRESS); + if (err) + return err; + err = security_sock_rcv_skb(sk, skb); if (err) return err; -- 2.5.5
[PATCH v2 5/6] net: core: run cgroup eBPF egress programs
If the cgroup associated with the sending socket has eBPF programs installed, run them from __dev_queue_xmit(). eBPF programs used in this context are expected to either return 1 to let the packet pass, or != 1 to drop them. The programs have access to the full skb, including the MAC headers. Note that cgroup_bpf_run_filter() is stubbed out as static inline nop for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if the feature is unused. Signed-off-by: Daniel Mack --- net/core/dev.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/net/core/dev.c b/net/core/dev.c index a75df86..17484e6 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -141,6 +141,7 @@ #include #include #include +#include #include "net-sysfs.h" @@ -3329,6 +3330,11 @@ static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv) if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP)) __skb_tstamp_tx(skb, NULL, skb->sk, SCM_TSTAMP_SCHED); + rc = cgroup_bpf_run_filter(skb->sk, skb, + BPF_ATTACH_TYPE_CGROUP_INET_EGRESS); + if (rc) + return rc; + /* Disable soft irqs for various locks below. Also * stops preemption for RCU. */ -- 2.5.5
[PATCH v2 0/6] Add eBPF hooks for cgroups
This is v2 of the patch set to allow eBPF programs for network filtering and accounting to be attached to cgroups, so that they apply to all sockets of all tasks placed in that cgroup. The logic can also be extended for other cgroup-based eBPF logic. Changes from v1: * Moved all bpf specific cgroup code into its own file, and stub out related functions for !CONFIG_CGROUP_BPF as static inline nops. This way, the call sites are not cluttered with #ifdef guards while the feature remains compile-time configurable. * Implemented the new scheme proposed by Tejun. Per cgroup, store one set of pointers that are pinned to the cgroup, and one for the programs that are effective. When a program is attached or detached, the change is propagated to all the cgroup's descendants. If a subcgroup has its own pinned program, skip the whole subbranch in order to allow delegation models. * The hookup for egress packets is now done from __dev_queue_xmit(). * A static key is now used in both the ingress and egress fast paths to keep performance penalties close to zero if the feature is not in use. * Overall cleanup to make the accessors use the program arrays. This should make it much easier to add new program types, which will then automatically follow the pinned vs. effective logic. * Fixed locking issues, as pointed out by Eric Dumazet and Alexei Starovoitov. Changes to the program array are now done with xchg() and are protected by cgroup_mutex. * eBPF programs are now expected to return 1 to let the packet pass, not >= 0. Pointed out by Alexei. * Operation is now limited to INET sockets, so local AF_UNIX sockets are not affected. The enum members are renamed accordingly. In case other socket families should be supported, this can be extended in the future. * The sample program learned to support both ingress and egress, and can now optionally make the eBPF program drop packets by making it return 0. As always, feedback is much appreciated.
Thanks, Daniel Daniel Mack (6): bpf: add new prog type for cgroup socket filtering cgroup: add support for eBPF programs bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands net: filter: run cgroup eBPF ingress programs net: core: run cgroup eBPF egress programs samples: bpf: add userspace example for attaching eBPF programs to cgroups include/linux/bpf-cgroup.h | 70 ++ include/linux/cgroup-defs.h | 4 + include/uapi/linux/bpf.h| 16 init/Kconfig| 12 +++ kernel/bpf/Makefile | 1 + kernel/bpf/cgroup.c | 159 kernel/bpf/syscall.c| 79 kernel/bpf/verifier.c | 1 + kernel/cgroup.c | 18 + net/core/dev.c | 6 ++ net/core/filter.c | 11 +++ samples/bpf/Makefile| 2 + samples/bpf/libbpf.c| 23 ++ samples/bpf/libbpf.h| 3 + samples/bpf/test_cgrp2_attach.c | 147 + 15 files changed, 552 insertions(+) create mode 100644 include/linux/bpf-cgroup.h create mode 100644 kernel/bpf/cgroup.c create mode 100644 samples/bpf/test_cgrp2_attach.c -- 2.5.5
[PATCH v2 6/6] samples: bpf: add userspace example for attaching eBPF programs to cgroups
Add a simple userspace program to demonstrate the new API to attach eBPF programs to cgroups. This is what it does:

* Creates an arraymap in the kernel with 4-byte keys and 8-byte values

* Loads an eBPF program

  The eBPF program accesses the map passed in to store two pieces of information. The number of invocations of the program, which maps to the number of packets received, is stored to key 0. Key 1 is incremented on each iteration by the number of bytes stored in the skb.

* Detaches any eBPF program previously attached to the cgroup

* Attaches the new program to the cgroup using BPF_PROG_ATTACH

* Once a second, reads map[0] and map[1] to see how many bytes and packets were seen on any socket of tasks in the given cgroup.

The program takes a cgroup path as 1st argument, and either "ingress" or "egress" as 2nd. Optionally, "drop" can be passed as 3rd argument, which will make the generated eBPF program return 0 instead of 1, so the kernel will drop the packet.

libbpf gained two new wrappers for the new syscall commands.
Signed-off-by: Daniel Mack --- samples/bpf/Makefile| 2 + samples/bpf/libbpf.c| 23 +++ samples/bpf/libbpf.h| 3 + samples/bpf/test_cgrp2_attach.c | 147 4 files changed, 175 insertions(+) create mode 100644 samples/bpf/test_cgrp2_attach.c diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index db3cb06..5c752f5 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -22,6 +22,7 @@ hostprogs-y += spintest hostprogs-y += map_perf_test hostprogs-y += test_overhead hostprogs-y += test_cgrp2_array_pin +hostprogs-y += test_cgrp2_attach hostprogs-y += xdp1 hostprogs-y += xdp2 hostprogs-y += test_current_task_under_cgroup @@ -47,6 +48,7 @@ spintest-objs := bpf_load.o libbpf.o spintest_user.o map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o +test_cgrp2_attach-objs := libbpf.o test_cgrp2_attach.o xdp1-objs := bpf_load.o libbpf.o xdp1_user.o # reuse xdp1 source intentionally xdp2-objs := bpf_load.o libbpf.o xdp1_user.o diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c index 9969e35..95e196e 100644 --- a/samples/bpf/libbpf.c +++ b/samples/bpf/libbpf.c @@ -104,6 +104,29 @@ int bpf_prog_load(enum bpf_prog_type prog_type, return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr)); } +int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type) +{ + union bpf_attr attr = { + .target_fd = target_fd, + .attach_bpf_fd = prog_fd, + .attach_type = type, + .attach_flags = 0, + }; + + return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr)); +} + +int bpf_prog_detach(int target_fd, enum bpf_attach_type type) +{ + union bpf_attr attr = { + .target_fd = target_fd, + .attach_type = type, + .attach_flags = 0, + }; + + return syscall(__NR_bpf, BPF_PROG_DETACH, &attr, sizeof(attr)); +} + int bpf_obj_pin(int fd, const char *pathname) { union bpf_attr attr = { diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h 
index 364582b..f973241 100644 --- a/samples/bpf/libbpf.h +++ b/samples/bpf/libbpf.h @@ -15,6 +15,9 @@ int bpf_prog_load(enum bpf_prog_type prog_type, const struct bpf_insn *insns, int insn_len, const char *license, int kern_version); +int bpf_prog_attach(int prog_fd, int attachable_fd, enum bpf_attach_type type); +int bpf_prog_detach(int attachable_fd, enum bpf_attach_type type); + int bpf_obj_pin(int fd, const char *pathname); int bpf_obj_get(const char *pathname); diff --git a/samples/bpf/test_cgrp2_attach.c b/samples/bpf/test_cgrp2_attach.c new file mode 100644 index 000..0a44c3d --- /dev/null +++ b/samples/bpf/test_cgrp2_attach.c @@ -0,0 +1,147 @@ +/* eBPF example program: + * + * - Creates arraymap in kernel with 4 bytes keys and 8 byte values + * + * - Loads eBPF program + * + * The eBPF program accesses the map passed in to store two pieces of + * information. The number of invocations of the program, which maps + * to the number of packets received, is stored to key 0. Key 1 is + * incremented on each iteration by the number of bytes stored in + * the skb. + * + * - Detaches any eBPF program previously attached to the cgroup + * + * - Attaches the new program to a cgroup using BPF_PROG_ATTACH + * + * - Every second, reads map[0] and map[1] to see how many bytes and + * packets were seen on any socket of tasks in the given cgroup. + */ + +#define _GNU_SOURCE + +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include "libbpf.h" + +enum { + MAP_KEY_PACKETS, + MAP_KEY_BYTES, +}; + +static int prog_load(int map_fd, int verdict) +{ +
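The body of prog_load() is truncated above, but behaviourally the program described in the file's comment block reduces to two counter updates per packet. The following is a minimal user-space model of that accounting in plain C — a sketch, not the actual eBPF instruction sequence from the patch (the real program uses bpf_map_lookup_elem() plus atomic adds on the map values):

```c
#include <assert.h>
#include <stdint.h>

enum {
	MAP_KEY_PACKETS,
	MAP_KEY_BYTES,
	MAP_NKEYS,
};

/* Model of the arraymap the sample creates: 4-byte keys, 8-byte values. */
static uint64_t cgrp_map[MAP_NKEYS];

/* What the generated eBPF program does on each invocation: bump the
 * packet count at key 0, add the skb length to the byte count at key 1,
 * and return the verdict (1 = let the packet pass, 0 = drop, matching
 * the sample's optional "drop" argument). */
static int handle_skb(uint32_t skb_len, int verdict)
{
	__sync_fetch_and_add(&cgrp_map[MAP_KEY_PACKETS], 1);
	__sync_fetch_and_add(&cgrp_map[MAP_KEY_BYTES], skb_len);
	return verdict;
}
```

The sample's monitoring loop then simply reads map[0] and map[1] once a second.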
[PATCH v2 2/6] cgroup: add support for eBPF programs
This patch adds two sets of eBPF program pointers to struct cgroup: one for programs that are pinned directly to a cgroup, and one for programs that are effective for it. To illustrate the logic behind that, assume the following example cgroup hierarchy:

  A - B - C
        \ D - E

If only B has a program attached, it will be effective for B, C, D and E. If D then attaches a program itself, that will be effective for both D and E, and the program in B will only affect B and C. Only one program of a given type is effective for a cgroup.

Attaching and detaching programs will be done through the bpf(2) syscall. For now, ingress and egress inet socket filtering are the only supported use-cases.

Signed-off-by: Daniel Mack --- include/linux/bpf-cgroup.h | 70 +++ include/linux/cgroup-defs.h | 4 ++ init/Kconfig| 12 kernel/bpf/Makefile | 1 + kernel/bpf/cgroup.c | 159 kernel/cgroup.c | 18 + 6 files changed, 264 insertions(+) create mode 100644 include/linux/bpf-cgroup.h create mode 100644 kernel/bpf/cgroup.c diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h new file mode 100644 index 000..d85d50f --- /dev/null +++ b/include/linux/bpf-cgroup.h @@ -0,0 +1,70 @@ +#ifndef _BPF_CGROUP_H +#define _BPF_CGROUP_H + +#include +#include + +struct sock; +struct cgroup; +struct sk_buff; + +#ifdef CONFIG_CGROUP_BPF + +extern struct static_key_false cgroup_bpf_enabled_key; +#define cgroup_bpf_enabled static_branch_unlikely(&cgroup_bpf_enabled_key) + +struct cgroup_bpf { + /* +* Store two sets of bpf_prog pointers, one for programs that are +* pinned directly to this cgroup, and one for those that are effective +* when this cgroup is accessed.
+*/ + struct bpf_prog *prog[__MAX_BPF_ATTACH_TYPE]; + struct bpf_prog *prog_effective[__MAX_BPF_ATTACH_TYPE]; +}; + +void cgroup_bpf_free(struct cgroup *cgrp); +void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent); + +void __cgroup_bpf_update(struct cgroup *cgrp, +struct cgroup *parent, +struct bpf_prog *prog, +enum bpf_attach_type type); + +/* Wrapper for __cgroup_bpf_update() protected by cgroup_mutex */ +void cgroup_bpf_update(struct cgroup *cgrp, + struct bpf_prog *prog, + enum bpf_attach_type type); + +int __cgroup_bpf_run_filter(struct sock *sk, + struct sk_buff *skb, + enum bpf_attach_type type); + +/* Wrapper for __cgroup_bpf_run_filter() guarded by cgroup_bpf_enabled */ +static inline int cgroup_bpf_run_filter(struct sock *sk, + struct sk_buff *skb, + enum bpf_attach_type type) +{ + if (cgroup_bpf_enabled) + return __cgroup_bpf_run_filter(sk, skb, type); + + return 0; +} + +#else + +struct cgroup_bpf {}; +static inline void cgroup_bpf_free(struct cgroup *cgrp) {} +static inline void cgroup_bpf_inherit(struct cgroup *cgrp, + struct cgroup *parent) {} + +static inline int cgroup_bpf_run_filter(struct sock *sk, + struct sk_buff *skb, + enum bpf_attach_type type) +{ + return 0; +} + +#endif /* CONFIG_CGROUP_BPF */ + +#endif /* _BPF_CGROUP_H */ diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 5b17de6..861b467 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -16,6 +16,7 @@ #include #include #include +#include #ifdef CONFIG_CGROUPS @@ -300,6 +301,9 @@ struct cgroup { /* used to schedule release agent */ struct work_struct release_agent_work; + /* used to store eBPF programs */ + struct cgroup_bpf bpf; + /* ids of the ancestors at each level including self */ int ancestor_ids[]; }; diff --git a/init/Kconfig b/init/Kconfig index cac3f09..5a89c83 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1144,6 +1144,18 @@ config CGROUP_PERF Say N if unsure. 
+config CGROUP_BPF + bool "Support for eBPF programs attached to cgroups" + depends on BPF_SYSCALL && SOCK_CGROUP_DATA + help + Allow attaching eBPF programs to a cgroup using the bpf(2) + syscall command BPF_PROG_ATTACH. + + In which context these programs are accessed depends on the type + of attachment. For instance, programs that are attached using + BPF_ATTACH_TYPE_CGROUP_INET_INGRESS will be executed on the + ingress path of inet sockets. + config CGROUP_DEBUG bool "Example controller" default n diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile index eed911d..b22256b 100644 --- a/kernel/bpf/Makefile +++ b/kernel/bpf/Makefile @@ -5,3 +5
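The pinned vs. effective split and its downward propagation can be sketched as a small user-space model. This is a hypothetical illustration only — the kernel keeps one pointer per attach type in struct cgroup_bpf and walks css descendants instead of an index array:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct node {
	int parent;            /* index of the parent node, -1 for the root */
	const char *pinned;    /* program attached directly to this cgroup */
	const char *effective; /* program that actually runs for this cgroup */
};

/* Recompute the effective program for every descendant of 'idx': a node
 * with its own pinned program keeps it (shielding its whole sub-branch,
 * which is what enables delegation), otherwise it inherits the parent's
 * effective program. */
static void propagate(struct node *t, int n, int idx)
{
	for (int i = 0; i < n; i++) {
		if (t[i].parent != idx)
			continue;
		if (!t[i].pinned)
			t[i].effective = t[idx].effective;
		propagate(t, n, i);
	}
}

/* Attach (or, with prog == NULL, detach) a program at node 'idx' and
 * push the change down the hierarchy. */
static void attach(struct node *t, int n, int idx, const char *prog)
{
	t[idx].pinned = prog;
	if (prog)
		t[idx].effective = prog;
	else
		t[idx].effective = t[idx].parent >= 0 ?
				   t[t[idx].parent].effective : NULL;
	propagate(t, n, idx);
}
```

With the hierarchy from the commit message (A=0; B=1; C=2 and D=3 as children of B; E=4 as child of D), attaching to B makes the program effective for B through E, and a later attach on D shields D and E without touching C.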
[PATCH v2 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and BPF_PROG_DETACH, which allow attaching eBPF programs to a target and detaching them again. On the API level, the target could be anything that has an fd in userspace, hence the field in union bpf_attr is named 'target_fd'.

When called with BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS, the target is expected to be a valid file descriptor of a cgroup v2 directory which has the bpf controller enabled. These are the only use-cases implemented by this patch at this point, but more can be added.

If a program of the given type already exists in the given cgroup, the program is swapped atomically, so userspace does not have to drop an existing program first before installing a new one, which would otherwise leave a gap in which no program is attached.

For more information on the propagation logic to sub-cgroups, please refer to the bpf cgroup controller implementation.

The API is guarded by CAP_NET_ADMIN.

Signed-off-by: Daniel Mack --- include/uapi/linux/bpf.h | 9 ++ kernel/bpf/syscall.c | 79 2 files changed, 88 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 1d5db42..4cc2dcf 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -73,6 +73,8 @@ enum bpf_cmd { BPF_PROG_LOAD, BPF_OBJ_PIN, BPF_OBJ_GET, + BPF_PROG_ATTACH, + BPF_PROG_DETACH, }; enum bpf_map_type { @@ -147,6 +149,13 @@ union bpf_attr { __aligned_u64 pathname; __u32 bpf_fd; }; + + struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */ + __u32 target_fd; /* container object to attach to */ + __u32 attach_bpf_fd; /* eBPF program to attach */ + __u32 attach_type;/* BPF_ATTACH_TYPE_* */ + __u64 attach_flags; + }; } __attribute__((aligned(8))); /* integer value in 'imm' field of BPF_CALL instruction selects which helper diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 228f962..208cba2 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -822,6 +822,75 @@
static int bpf_obj_get(const union bpf_attr *attr) return bpf_obj_get_user(u64_to_ptr(attr->pathname)); } +#ifdef CONFIG_CGROUP_BPF +static int bpf_prog_attach(const union bpf_attr *attr) +{ + struct bpf_prog *prog; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + /* Flags are unused for now */ + if (attr->attach_flags != 0) + return -EINVAL; + + switch (attr->attach_type) { + case BPF_ATTACH_TYPE_CGROUP_INET_INGRESS: + case BPF_ATTACH_TYPE_CGROUP_INET_EGRESS: { + struct cgroup *cgrp; + + prog = bpf_prog_get_type(attr->attach_bpf_fd, +BPF_PROG_TYPE_CGROUP_SOCKET_FILTER); + if (IS_ERR(prog)) + return PTR_ERR(prog); + + cgrp = cgroup_get_from_fd(attr->target_fd); + if (IS_ERR(cgrp)) { + bpf_prog_put(prog); + return PTR_ERR(cgrp); + } + + cgroup_bpf_update(cgrp, prog, attr->attach_type); + cgroup_put(cgrp); + + break; + } + + default: + return -EINVAL; + } + + return 0; +} + +static int bpf_prog_detach(const union bpf_attr *attr) +{ + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + switch (attr->attach_type) { + case BPF_ATTACH_TYPE_CGROUP_INET_INGRESS: + case BPF_ATTACH_TYPE_CGROUP_INET_EGRESS: { + struct cgroup *cgrp; + + cgrp = cgroup_get_from_fd(attr->target_fd); + if (IS_ERR(cgrp)) + return PTR_ERR(cgrp); + + cgroup_bpf_update(cgrp, NULL, attr->attach_type); + cgroup_put(cgrp); + + break; + } + + default: + return -EINVAL; + } + + return 0; +} +#endif /* CONFIG_CGROUP_BPF */ + SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size) { union bpf_attr attr = {}; @@ -888,6 +957,16 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz case BPF_OBJ_GET: err = bpf_obj_get(&attr); break; + +#ifdef CONFIG_CGROUP_BPF + case BPF_PROG_ATTACH: + err = bpf_prog_attach(&attr); + break; + case BPF_PROG_DETACH: + err = bpf_prog_detach(&attr); + break; +#endif + default: err = -EINVAL; break; -- 2.5.5
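The "swapped atomically" guarantee the commit message relies on can be illustrated with C11 atomics. This is a user-space sketch with hypothetical names, not the kernel implementation — in the kernel the exchange is an xchg() on the program array, additionally serialized by cgroup_mutex (per the cover letter):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bpf_prog; only refcounting is modelled. */
struct prog { int refcnt; };

/* Publish the new program with an atomic exchange, so a reader always
 * observes either the old or the new program, never a window with none
 * attached. The displaced program is then released (standing in for
 * bpf_prog_put()). Passing NULL models BPF_PROG_DETACH. */
static struct prog *swap_prog(_Atomic(struct prog *) *slot, struct prog *new)
{
	struct prog *old = atomic_exchange(slot, new);

	if (old)
		old->refcnt--;
	return old;
}
```

This is why userspace can install a replacement filter with a single BPF_PROG_ATTACH instead of a detach/attach pair that would leave packets unfiltered in between.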
Re: [PATCH net-next V2 2/4] net/dst: Utility functions to build dst_metadata without supplying an skb
Hi, On Wed, 24 Aug 2016 15:27:08 +0300 Amir Vadai wrote: > Extract _ip_tun_rx_dst() and _ipv6_tun_rx_dst() out of ip_tun_rx_dst() > and ipv6_tun_rx_dst(), to be used without supplying an skb. One additional thing: in subsequent patches, the newly introduced '_ip_tun_rx_dst' and '_ipv6_tun_rx_dst' are used in a non-"rx" context (e.g. for constructing an IP_TUNNEL_INFO_TX in act_tunnel_key), so the names are misleading. Consider renaming.
Re: [PATCH net-next V2 2/4] net/dst: Utility functions to build dst_metadata without supplying an skb
Hi, On Wed, 24 Aug 2016 15:27:08 +0300 Amir Vadai wrote: > +static inline struct metadata_dst * > +_ipv6_tun_rx_dst(struct in6_addr saddr, struct in6_addr daddr, > + __u8 tos, __u8 ttl, __be32 label, > + __be16 flags, __be64 tunnel_id, int md_size) > +{ Prefer a 'const struct in6_addr *saddr' parameter (daddr too). This is aligned with almost all functions taking an 'in6_addr' parameter, and it prevents the costly argument copy.
Re: [PATCH net-next] net: minor optimization in qdisc_qstats_cpu_drop()
On Wed, 2016-08-24 at 11:04 -0700, Rick Jones wrote: > On 08/24/2016 10:23 AM, Eric Dumazet wrote: > > From: Eric Dumazet > > > > per_cpu_inc() is faster (at least on x86) than per_cpu_ptr(xxx)++; > > Is it possible it is non-trivially slower on other architectures? No; in the worst case, the compiler would emit the same code.
Re: [PATCH] net: macb: Increase DMA buffer size
On 24/08/2016 at 20:25, Xander Huff wrote: > From: Nathan Sullivan > > In recent testing with the RT patchset, we have seen cases where the > transmit ring can fill even with up to 200 txbds in the ring. Increase > the size of the DMA rings to avoid overruns. > > Signed-off-by: Nathan Sullivan > Acked-by: Ben Shelton > Acked-by: Jaeden Amero > Natinst-ReviewBoard-ID: 83662 > --- > drivers/net/ethernet/cadence/macb.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/net/ethernet/cadence/macb.c > b/drivers/net/ethernet/cadence/macb.c > index 3256839..86a8e20 100644 > --- a/drivers/net/ethernet/cadence/macb.c > +++ b/drivers/net/ethernet/cadence/macb.c > @@ -35,12 +35,12 @@ > > #include "macb.h" > > -#define MACB_RX_BUFFER_SIZE 128 > +#define MACB_RX_BUFFER_SIZE 1536 This change does not seem to be covered by the commit message. Can you please split this into 2 patches, or elaborate a bit more on the reason for this RX buffer size change? Bye, > #define RX_BUFFER_MULTIPLE 64 /* bytes */ > #define RX_RING_SIZE 512 /* must be power of 2 */ > #define RX_RING_BYTES(sizeof(struct macb_dma_desc) * > RX_RING_SIZE) > > -#define TX_RING_SIZE 128 /* must be power of 2 */ > +#define TX_RING_SIZE 512 /* must be power of 2 */ > #define TX_RING_BYTES(sizeof(struct macb_dma_desc) * > TX_RING_SIZE) > > /* level of occupied TX descriptors under which we wake up TX process */ > -- Nicolas Ferre
Re: [PATCH net-next] net: minor optimization in qdisc_qstats_cpu_drop()
On Wed, 2016-08-24 at 11:00 -0700, John Fastabend wrote: > Looks good to me. I guess we can also do the same for overlimit qstats. > > Acked-by: John Fastabend Not sure about overlimit, although I could probably change these : net/sched/act_bpf.c:85: qstats_drop_inc(this_cpu_ptr(prog->common.cpu_qstats)); net/sched/act_gact.c:145: qstats_drop_inc(this_cpu_ptr(gact->common.cpu_qstats));
Re: [net-next PATCH 05/15] net: sched: a dflt qdisc may be used with per cpu stats
On Wed, 2016-08-24 at 10:50 -0700, John Fastabend wrote: > On 16-08-24 10:26 AM, Eric Dumazet wrote: > > On Wed, 2016-08-24 at 10:13 -0700, John Fastabend wrote: > > > >>> > >> > >> I could fully allocate it in qdisc_alloc() but we don't know if the > >> qdisc needs per cpu data structures until after the init call > > > > Shouldn't we have a flag to advertise the need for per cpu stats on a > > qdisc? > > > > It is not clear why ->init() can know this, and not its caller. > > > > sure, we could add a static_flags field in the ops structure. What do > you think about doing that? This is what I was suggesting, yes. > > We would still need some flags to be set at init though, like the bypass > bit; it looks like some qdiscs set that based on user input. >
Re: [PATCH net-next V2 4/4] net/sched: Introduce act_tunnel_key
Hi, On Wed, 24 Aug 2016 15:27:10 +0300 Amir Vadai wrote: > +config NET_ACT_TUNNEL_KEY > +tristate "IP tunnel metadata manipulation" > +depends on NET_CLS_ACT > +---help--- > + Say Y here to set/release ip tunnel metadata. > + > + If unsure, say N. > + > + To compile this code as a module, choose M here: the > + module will be called act_tunnel. actually looks like it's called "act_tunnel_key" ;) > +static int tunnel_key_act(struct sk_buff *skb, const struct tc_action *a, > + struct tcf_result *res) > +{ > + struct tcf_tunnel_key *t = to_tunnel_key(a); > + int action; > + > + spin_lock(&t->tcf_lock); > + tcf_lastuse_update(&t->tcf_tm); > + bstats_update(&t->tcf_bstats, skb); > + action = t->tcf_action; > + > + switch (t->tcft_action) { > + case TCA_TUNNEL_KEY_ACT_RELEASE: > + skb_dst_set_noref(skb, NULL); > + break; > + case TCA_TUNNEL_KEY_ACT_SET: > + skb_dst_set_noref(skb, &t->tcft_enc_metadata->dst); > + > + break; nit: empty line unneeded here. > +static int tunnel_key_init(struct net *net, struct nlattr *nla, > +struct nlattr *est, struct tc_action **a, > +int ovr, int bind) > +{ > + struct tc_action_net *tn = net_generic(net, tunnel_key_net_id); > + struct nlattr *tb[TCA_TUNNEL_KEY_MAX + 1]; > + struct metadata_dst *metadata = NULL; > + struct tc_tunnel_key *parm; > + struct tcf_tunnel_key *t; > + __be64 key_id; > + int encapdecap; > + bool exists = false; > + int ret = 0; > + int err; > + > + if (!nla) > + return -EINVAL; > + > + err = nla_parse_nested(tb, TCA_TUNNEL_KEY_MAX, nla, tunnel_key_policy); > + if (err < 0) > + return err; > + > + if (!tb[TCA_TUNNEL_KEY_PARMS]) > + return -EINVAL; > + > + parm = nla_data(tb[TCA_TUNNEL_KEY_PARMS]); > + exists = tcf_hash_check(tn, parm->index, a, bind); > + if (exists && bind) > + return 0; > + > + encapdecap = parm->t_action; > + > + switch (encapdecap) { As we no longer have "encapdecap" actions, either rename or just use parm->t_action explicitly (only needed twice). 
> +static int tunnel_key_dump_addresses(struct sk_buff *skb, > + const struct ip_tunnel_info *info) > +{ > + unsigned short family = ip_tunnel_info_af(info); > + > + if (family == AF_INET) { > + __be32 saddr = info->key.u.ipv4.src; > + __be32 daddr = info->key.u.ipv4.dst; > + > + if (!nla_put_be32(skb, TCA_TUNNEL_KEY_ENC_IPV4_SRC, saddr) && > + !nla_put_be32(skb, TCA_TUNNEL_KEY_ENC_IPV4_DST, daddr)) > + return 0; > + } > + > + if (family == AF_INET6) { > + struct in6_addr saddr6 = info->key.u.ipv6.src; > + struct in6_addr daddr6 = info->key.u.ipv6.dst; Why the in6_addr copy? Point to the things, then pass the pointers to nla_put_in6_addr(). Also, there are few lines too long. Regards, Shmulik
Re: [PATCH net-next 2/3] net: mpls: Fixups for GSO
On 8/24/16 11:41 AM, pravin shelar wrote: > You also need to change pop_mpls(). What change is needed in pop_mpls? It already resets the mac_header and if MPLS labels are removed there is no need to set network_header. I take it you mean if the protocol is still MPLS and there are still labels then the network header needs to be set and that means finding the bottom label. Does OVS set the bottom of stack bit? From what I can tell OVS is not parsing the MPLS label so no requirement that BOS is set. Without that there is no way to tell when the labels are done short of guessing. > > Anyways I was thinking about the neigh output functions skb pull > issue, where it is using network-header offset. Can we use mac_len? > this way we would not use any inner offsets for MPLS skb and current > scheme used by OVS datapath works. neigh_resolve_output and neigh_connected_output both do an __skb_pull to the network offset. When these functions are called there may or may not be a mac header set in the skb making the mac_header unreliable for how you want to use it. e.g. I tried this: diff --git a/net/core/neighbour.c b/net/core/neighbour.c index 2ae929f9bd06..9f20a0b8e6be 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -1292,12 +1292,16 @@ int neigh_resolve_output(struct neighbour *neigh, struct sk_buff *skb) int err; struct net_device *dev = neigh->dev; unsigned int seq; + unsigned int offset = skb_network_offset(skb); + + if (unlikely(skb_mac_header_was_set(skb))) + offset = skb_mac_header(skb) - skb->data; if (dev->header_ops->cache && !neigh->hh.hh_len) neigh_hh_init(neigh); do { - __skb_pull(skb, skb_network_offset(skb)); + __skb_pull(skb, offset); seq = read_seqbegin(&neigh->ha_lock); err = dev_hard_header(skb, dev, ntohs(skb->protocol), neigh->ha, NULL, skb->len); It does not work. 
The MPLS packet goes down the stack fine, but when the packet is forwarded from one namespace to another you can get a panic since it hits neigh_resolve_output with a mac header and the pull above will do the wrong thing. [ 18.254133] BUG: unable to handle kernel paging request at 88023860404a [ 18.255566] IP: [] eth_header+0x40/0xaf [ 18.256649] PGD 1c40067 PUD 0 [ 18.257277] Oops: 0002 [#1] SMP [ 18.257872] Modules linked in: veth 8021q garp mrp stp llc vrf [ 18.259168] CPU: 2 PID: 868 Comm: ping Not tainted 4.8.0-rc2+ #81 [ 18.260308] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014 [ 18.262184] task: 88013ab61040 task.stack: 88013509 [ 18.263285] RIP: 0010:[] [] eth_header+0x40/0xaf [ 18.264762] RSP: 0018:88013fd03c80 EFLAGS: 00010216 [ 18.265791] RAX: 88023860403e RBX: 0008 RCX: 88013a5c18a0 [ 18.267040] RDX: 88023860403e RSI: 000e RDI: 88013ab0a200 [ 18.268307] RBP: 88013fd03ca8 R08: R09: 0058 [ 18.269556] R10: 88023860403e R11: R12: 88013a5c18a0 [ 18.270807] R13: 880135b0b000 R14: 880135b0b000 R15: 88013a5c1828 [ 18.272064] FS: 7fbc44b66700() GS:88013fd0() knlGS: [ 18.273477] CS: 0010 DS: ES: CR0: 80050033 [ 18.274492] CR2: 88023860404a CR3: 0001350c8000 CR4: 000406e0 [ 18.275746] Stack: [ 18.276125] 00580246 88013ab0a200 0002 [ 18.277519] 88013a5c1800 88013fd03cb8 813d5912 88013fd03d00 [ 18.278904] 813d73ea 88013a5c18a0 fffc01000246 88013a5c1838 [ 18.280295] Call Trace: [ 18.280712] [ 18.281049] [] dev_hard_header.constprop.42+0x26/0x28 [ 18.282204] [] neigh_resolve_output+0x1b9/0x270 [ 18.283228] [] neigh_update+0x372/0x497 [ 18.284160] [] arp_process+0x520/0x572 [ 18.285061] [] arp_rcv+0x10e/0x17d [ 18.285909] [] __netif_receive_skb_core+0x3ea/0x4b8 [ 18.286995] [] __netif_receive_skb+0x16/0x66 [ 18.287993] [] process_backlog+0xa4/0x132 [ 18.288935] [] net_rx_action+0xd1/0x242 [ 18.289854] [] __do_softirq+0x100/0x26d [ 18.290764] [] do_softirq_own_stack+0x1c/0x30 [ 18.291775] [ 18.292100] [] 
do_softirq+0x30/0x3b [ 18.292968] [] __local_bh_enable_ip+0x69/0x73 [ 18.293919] [] local_bh_enable+0x15/0x17 [ 18.294798] [] neigh_xmit+0x93/0xe3 [ 18.295626] [] mpls_xmit+0x379/0x3c0 [ 18.296464] [] lwtunnel_xmit+0x48/0x63 Generically though this approach just feels wrong. You want to lump the MPLS labels with the ethernet header but not formally, just by playing games with skb markers. The core networking stack is resisting this approach.
[PATCH] net: macb: Increase DMA buffer size
From: Nathan Sullivan In recent testing with the RT patchset, we have seen cases where the transmit ring can fill even with up to 200 txbds in the ring. Increase the size of the DMA rings to avoid overruns. Signed-off-by: Nathan Sullivan Acked-by: Ben Shelton Acked-by: Jaeden Amero Natinst-ReviewBoard-ID: 83662 --- drivers/net/ethernet/cadence/macb.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/cadence/macb.c b/drivers/net/ethernet/cadence/macb.c index 3256839..86a8e20 100644 --- a/drivers/net/ethernet/cadence/macb.c +++ b/drivers/net/ethernet/cadence/macb.c @@ -35,12 +35,12 @@ #include "macb.h" -#define MACB_RX_BUFFER_SIZE128 +#define MACB_RX_BUFFER_SIZE1536 #define RX_BUFFER_MULTIPLE 64 /* bytes */ #define RX_RING_SIZE 512 /* must be power of 2 */ #define RX_RING_BYTES (sizeof(struct macb_dma_desc) * RX_RING_SIZE) -#define TX_RING_SIZE 128 /* must be power of 2 */ +#define TX_RING_SIZE 512 /* must be power of 2 */ #define TX_RING_BYTES (sizeof(struct macb_dma_desc) * TX_RING_SIZE) /* level of occupied TX descriptors under which we wake up TX process */ -- 1.9.1
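A side note on the "must be power of 2" comments this patch preserves: ring sizes are constrained that way because free-running producer/consumer indices are mapped onto ring slots with a bitwise AND rather than a modulo. A sketch of the common idiom follows — the macb driver's own wrap helper may differ in detail:

```c
#include <assert.h>

#define TX_RING_SIZE 512	/* must be power of 2 */

/* Map a free-running descriptor index onto a ring slot. Because
 * TX_RING_SIZE is a power of 2, (index & (size - 1)) is equivalent
 * to (index % size) but compiles to a single AND instruction. */
static unsigned int tx_ring_wrap(unsigned int index)
{
	return index & (TX_RING_SIZE - 1);
}
```

This is also why simply bumping 128 to 512 is safe for the wrap logic, while a non-power-of-2 value such as 200 would not be.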
Re: [PATCH] phy: request shared IRQ
Hello. On 08/24/2016 08:53 PM, Xander Huff wrote: From: Nathan Sullivan On hardware with multiple PHY devices hooked up to the same IRQ line, allow them to share it. Note that it had been allowed until my (erroneous?) commit 33c133cc7598e60976a069344910d63e56cc4401 ("phy: IRQ cannot be shared"), so I'd like this commit just reverted instead... I'm not sure now what was the reason I concluded that the IRQ sharing was impossible... most probably I thought that the kernel IRQ handling code exited the loop over the IRQ actions once IRQ_HANDLED was returned -- which is obviously not so in reality... Signed-off-by: Nathan Sullivan Signed-off-by: Xander Huff Acked-by: Ben Shelton Acked-by: Jaeden Amero [...] MBR, Sergei
[PATCH iproute] iptuntap: show processes using tuntap interface
Show which processes are using which tun/tap devices, e.g.: $ ip -d tuntap tun0: tun Attached to processes: vpnc(9531) vnet0: tap vnet_hdr Attached to processes: qemu-system-x86(10442) virbr0-nic: tap UNKNOWN_FLAGS:800 Attached to processes: Signed-off-by: Hannes Frederic Sowa --- ip/iptuntap.c | 109 ++ 1 file changed, 109 insertions(+) diff --git a/ip/iptuntap.c b/ip/iptuntap.c index 43774f96e335ef..b5aa0542c1f8f2 100644 --- a/ip/iptuntap.c +++ b/ip/iptuntap.c @@ -25,6 +25,7 @@ #include #include #include +#include #include "rt_names.h" #include "utils.h" @@ -273,6 +274,109 @@ static void print_flags(long flags) printf(" UNKNOWN_FLAGS:%lx", flags); } +static char *pid_name(pid_t pid) +{ + char *comm; + FILE *f; + int err; + + err = asprintf(&comm, "/proc/%d/comm", pid); + if (err < 0) + return NULL; + + f = fopen(comm, "r"); + free(comm); + if (!f) { + perror("fopen"); + return NULL; + } + + if (fscanf(f, "%ms\n", &comm) != 1) { + perror("fscanf"); + comm = NULL; + } + + + if (fclose(f)) + perror("fclose"); + + return comm; +} + +static void show_processes(const char *name) +{ + glob_t globbuf = { }; + char **fd_path; + int err; + + err = glob("/proc/[0-9]*/fd/[0-9]*", GLOB_NOSORT, + NULL, &globbuf); + if (err) + return; + + fd_path = globbuf.gl_pathv; + while (*fd_path) { + const char *dev_net_tun = "/dev/net/tun"; + const size_t linkbuf_len = strlen(dev_net_tun) + 2; + char linkbuf[linkbuf_len], *fdinfo; + int pid, fd; + FILE *f; + + if (sscanf(*fd_path, "/proc/%d/fd/%d", &pid, &fd) != 2) + goto next; + + if (pid == getpid()) + goto next; + + err = readlink(*fd_path, linkbuf, linkbuf_len - 1); + if (err < 0) { + perror("readlink"); + goto next; + } + linkbuf[err] = '\0'; + if (strcmp(dev_net_tun, linkbuf)) + goto next; + + if (asprintf(&fdinfo, "/proc/%d/fdinfo/%d", pid, fd) < 0) + goto next; + + f = fopen(fdinfo, "r"); + free(fdinfo); + if (!f) { + perror("fopen"); + goto next; + } + + while (!feof(f)) { + char *key = NULL, *value = NULL; + + err = fscanf(f, 
"%m[^:]: %ms\n", &key, &value); + if (err == EOF) { + if (ferror(f)) + perror("fscanf"); + break; + } else if (err == 2 && + !strcmp("iff", key) && !strcmp(name, value)) { + char *pname = pid_name(pid); + printf(" %s(%d)", pname ? pname : "", pid); + free(pname); + } + + free(key); + free(value); + } + if (fclose(f)) + perror("fclose"); + +next: + ++fd_path; + } + + globfree(&globbuf); + return; +} + + static int do_show(int argc, char **argv) { DIR *dir; @@ -302,6 +406,11 @@ static int do_show(int argc, char **argv) if (group != -1) printf(" group %ld", group); printf("\n"); + if (show_details) { + printf("\tAttached to processes:"); + show_processes(d->d_name); + printf("\n"); + } } closedir(dir); return 0; -- 2.7.4
Re: [PATCH net-next V2 1/4] net/ip_tunnels: Introduce tunnel_id_to_key32() and key32_to_tunnel_id()
On Wed, 24 Aug 2016 15:27:07 +0300 Amir Vadai wrote: > Add utility functions to convert a 32 bits key into a 64 bits tunnel and > vice versa. > These functions will be used instead of cloning code in GRE and VXLAN, > and in tc act_iptunnel which will be introduced in a following patch in > this patchset. > > Signed-off-by: Amir Vadai Reviewed-by: Shmulik Ladkani
Re: [net-next PATCH 05/15] net: sched: a dflt qdisc may be used with per cpu stats
On 16-08-24 09:41 AM, Eric Dumazet wrote: > On Tue, 2016-08-23 at 13:24 -0700, John Fastabend wrote: >> Enable dflt qdisc support for per cpu stats before this patch a >> dflt qdisc was required to use the global statistics qstats and >> bstats. >> >> Signed-off-by: John Fastabend >> --- >> net/sched/sch_generic.c | 24 >> 1 file changed, 20 insertions(+), 4 deletions(-) >> >> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c >> index 80544c2..910b4d15 100644 >> --- a/net/sched/sch_generic.c >> +++ b/net/sched/sch_generic.c >> @@ -646,18 +646,34 @@ struct Qdisc *qdisc_create_dflt(struct netdev_queue >> *dev_queue, >> struct Qdisc *sch; >> >> if (!try_module_get(ops->owner)) >> -goto errout; >> +return NULL; >> >> sch = qdisc_alloc(dev_queue, ops); >> if (IS_ERR(sch)) >> -goto errout; >> +return NULL; >> sch->parent = parentid; >> >> -if (!ops->init || ops->init(sch, NULL) == 0) >> +if (!ops->init) >> return sch; >> >> -qdisc_destroy(sch); >> +if (ops->init(sch, NULL)) >> +goto errout; >> + >> +/* init() may have set percpu flags so init data structures */ >> +if (qdisc_is_percpu_stats(sch)) { >> +sch->cpu_bstats = >> +netdev_alloc_pcpu_stats(struct gnet_stats_basic_cpu); >> +if (!sch->cpu_bstats) >> +goto errout; >> + >> +sch->cpu_qstats = alloc_percpu(struct gnet_stats_queue); >> +if (!sch->cpu_qstats) >> +goto errout; >> +} >> + > > Why are you attempting these allocations here instead of qdisc_alloc() > > This looks weird, I would expect base qdisc being fully allocated before > ops->init() is attempted. > > > I could fully allocate it in qdisc_alloc() but we don't know if the qdisc needs per cpu data structures until after the init call. So it would sit unused in those cases if done from qdisc_alloc(). It seems best to me at least to just avoid the allocation in qdisc_alloc() and do it after init like I did here. Perhaps it would be nice to pull these into a function call post_init_qdisc_alloc() that does all this allocation? .John
Re: [PATCH net-next] net: minor optimization in qdisc_qstats_cpu_drop()
On 08/24/2016 10:23 AM, Eric Dumazet wrote: From: Eric Dumazet per_cpu_inc() is faster (at least on x86) than per_cpu_ptr(xxx)++; Is it possible it is non-trivially slower on other architectures? rick jones Signed-off-by: Eric Dumazet --- include/net/sch_generic.h |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index 0d501779cc68f9426e58da6d039dd64adc937c20..52a2015667b49c8315edbb26513a98d4c677fee5 100644 --- a/include/net/sch_generic.h +++ b/include/net/sch_generic.h @@ -592,7 +592,7 @@ static inline void qdisc_qstats_drop(struct Qdisc *sch) static inline void qdisc_qstats_cpu_drop(struct Qdisc *sch) { - qstats_drop_inc(this_cpu_ptr(sch->cpu_qstats)); + this_cpu_inc(sch->cpu_qstats->drops); } static inline void qdisc_qstats_overlimit(struct Qdisc *sch)
[PATCH net] net: dsa: bcm_sf2: Fix race condition while unmasking interrupts
We kept shadow copies of which interrupt sources we have enabled and
disabled, but due to an ordering bug in how intrl2_mask_clear was defined,
we could run into the following scenario:

	CPU0					CPU1
	intrl2_1_mask_clear(..)
	 sets INTRL2_CPU_MASK_CLEAR
						bcm_sf2_switch_1_isr
						 read INTRL2_CPU_STATUS
						 and masks with stale
						 irq1_mask value
	 updates irq1_mask value

Which would make us loop again and again trying to process an interrupt
we are not clearing, since our copy of whether it was enabled before
still indicates it was not. Fix this by updating the shadow copy first,
and then unmasking at the HW level.

Fixes: 246d7f773c13 ("net: dsa: add Broadcom SF2 switch driver")
Signed-off-by: Florian Fainelli
---
 drivers/net/dsa/bcm_sf2.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/dsa/bcm_sf2.h b/drivers/net/dsa/bcm_sf2.h
index 463bed8cbe4c..dd446e466699 100644
--- a/drivers/net/dsa/bcm_sf2.h
+++ b/drivers/net/dsa/bcm_sf2.h
@@ -205,8 +205,8 @@ static inline void name##_writeq(struct bcm_sf2_priv *priv, u64 val,	\
 static inline void intrl2_##which##_mask_clear(struct bcm_sf2_priv *priv, \
 					       u32 mask) \
 { \
-	intrl2_##which##_writel(priv, mask, INTRL2_CPU_MASK_CLEAR); \
 	priv->irq##which##_mask &= ~(mask); \
+	intrl2_##which##_writel(priv, mask, INTRL2_CPU_MASK_CLEAR); \
 } \
 static inline void intrl2_##which##_mask_set(struct bcm_sf2_priv *priv, \
 					     u32 mask) \
-- 
2.7.4
Re: [PATCH net-next] tcp: md5: add LINUX_MIB_TCPMD5FAILURE counter
On Wed, 24 Aug 2016 09:01:23 -0700 Eric Dumazet wrote: > From: Eric Dumazet > > Adds SNMP counter for drops caused by MD5 mismatches. > > The current syslog might help, but a counter is more precise and helps > monitoring. > > Signed-off-by: Eric Dumazet > --- > include/uapi/linux/snmp.h |1 + > net/ipv4/proc.c |1 + > net/ipv4/tcp_ipv4.c |1 + > net/ipv6/tcp_ipv6.c |1 + > 4 files changed, 4 insertions(+) > > diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h > index > 25a9ad8bcef1240915f2553a8acade447186d869..e7a31f8306903f53bc5881ae4c271f85cad2e361 > 100644 > --- a/include/uapi/linux/snmp.h > +++ b/include/uapi/linux/snmp.h > @@ -235,6 +235,7 @@ enum > LINUX_MIB_TCPSPURIOUSRTOS, /* TCPSpuriousRTOs */ > LINUX_MIB_TCPMD5NOTFOUND, /* TCPMD5NotFound */ > LINUX_MIB_TCPMD5UNEXPECTED, /* TCPMD5Unexpected */ > + LINUX_MIB_TCPMD5FAILURE,/* TCPMD5Failure */ > LINUX_MIB_SACKSHIFTED, > LINUX_MIB_SACKMERGED, > LINUX_MIB_SACKSHIFTFALLBACK, You can't add value in middle of user API enum without breaking binary compatibility.
Re: [PATCH net-next] net: minor optimization in qdisc_qstats_cpu_drop()
On 16-08-24 10:23 AM, Eric Dumazet wrote: > From: Eric Dumazet > > per_cpu_inc() is faster (at least on x86) than per_cpu_ptr(xxx)++; > > Signed-off-by: Eric Dumazet > --- > include/net/sch_generic.h |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h > index > 0d501779cc68f9426e58da6d039dd64adc937c20..52a2015667b49c8315edbb26513a98d4c677fee5 > 100644 > --- a/include/net/sch_generic.h > +++ b/include/net/sch_generic.h > @@ -592,7 +592,7 @@ static inline void qdisc_qstats_drop(struct Qdisc *sch) > > static inline void qdisc_qstats_cpu_drop(struct Qdisc *sch) > { > - qstats_drop_inc(this_cpu_ptr(sch->cpu_qstats)); > + this_cpu_inc(sch->cpu_qstats->drops); > } > > static inline void qdisc_qstats_overlimit(struct Qdisc *sch) > > Looks good to me. I guess we can also do the same for overlimit qstats. Acked-by: John Fastabend
[PATCH] phy: request shared IRQ
From: Nathan Sullivan On hardware with multiple PHY devices hooked up to the same IRQ line, allow them to share it. Signed-off-by: Nathan Sullivan Signed-off-by: Xander Huff Acked-by: Ben Shelton Acked-by: Jaeden Amero --- drivers/net/phy/phy.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c index c5dc2c36..0050531 100644 --- a/drivers/net/phy/phy.c +++ b/drivers/net/phy/phy.c @@ -722,8 +722,8 @@ phy_err: int phy_start_interrupts(struct phy_device *phydev) { atomic_set(&phydev->irq_disable, 0); - if (request_irq(phydev->irq, phy_interrupt, 0, "phy_interrupt", - phydev) < 0) { + if (request_irq(phydev->irq, phy_interrupt, IRQF_SHARED, + "phy_interrupt", phydev) < 0) { pr_warn("%s: Can't get IRQ %d (PHY)\n", phydev->mdio.bus->name, phydev->irq); phydev->irq = PHY_POLL; -- 1.9.1
Re: [PATCH net-next] tcp: md5: add LINUX_MIB_TCPMD5FAILURE counter
On Wed, 2016-08-24 at 10:35 -0700, Stephen Hemminger wrote:
> You can't add value in middle of user API enum without breaking
> binary compatibility.

There is no binary compatibility here. /proc/net/netstat is a text file
with a defined format. The first line contains the headers. If 'binary
compatibility' was an issue, we would not have added anything in this
file. Programs need to be able to properly parse these TcpExt: lines.
nstat is doing the right thing.

I could put LINUX_MIB_TCPMD5FAILURE at the end, but 'nstat' would have
these MD5 counters in different places. So for the few people (ie not
programs) looking at nstat, it seems better to place this MIB at this
point.
Re: [net-next PATCH 05/15] net: sched: a dflt qdisc may be used with per cpu stats
On 16-08-24 10:26 AM, Eric Dumazet wrote:
> On Wed, 2016-08-24 at 10:13 -0700, John Fastabend wrote:
>
>> I could fully allocate it in qdisc_alloc() but we don't know if the
>> qdisc needs per cpu data structures until after the init call
>
> Should not we have a flag to advertise the need of per cpu stats on a
> qdisc?
>
> It is not clear why ->init() can know this, and not its caller.

Sure, we could add a static_flags field in the ops structure. What do
you think about doing that? We would still need some flags to be set at
init though, like the bypass bit; it looks like some qdiscs set that
based on user input.

>> . So it
>> would sit unused in those cases if done from qdisc_alloc(). It seems
>> best to me at least to just avoid the allocation in qdisc_alloc() and
>> do it after init like I did here.
>>
>> Perhaps it would be nice to pull these into a function call
>> post_init_qdisc_alloc() that does all this allocation?
>>
>> .John
Re: [PATCH net-next 2/3] net: mpls: Fixups for GSO
On Wed, Aug 24, 2016 at 9:37 AM, David Ahern wrote: > On 8/24/16 10:28 AM, pravin shelar wrote: >>> How do you feel about implementing the do_output() idea I suggested above? >>> I'm happy to provide testing and review. >> >> I am not sure about changing do_output(). why not just use same scheme >> to track mpls header in OVS datapath as done in mpls device? >> > > was just replying with the same. > > Something like this should be able to handle multiple labels. The inner > network header is set once and the outer one pointing to MPLS is adjusted > each time a label is pushed: > > diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c > index 1ecbd7715f6d..0f37b17e3a73 100644 > --- a/net/openvswitch/actions.c > +++ b/net/openvswitch/actions.c > @@ -162,10 +162,16 @@ static int push_mpls(struct sk_buff *skb, struct > sw_flow_key *key, > if (skb_cow_head(skb, MPLS_HLEN) < 0) > return -ENOMEM; > > + if (!skb->inner_protocol) { > + skb_set_inner_network_header(skb, skb->mac_len); > + skb_set_inner_protocol(skb, skb->protocol); > + } > + > skb_push(skb, MPLS_HLEN); > memmove(skb_mac_header(skb) - MPLS_HLEN, skb_mac_header(skb), > skb->mac_len); > skb_reset_mac_header(skb); > + skb_set_network_header(skb, skb->mac_len); > > new_mpls_lse = (__be32 *)skb_mpls_header(skb); > *new_mpls_lse = mpls->mpls_lse; > @@ -173,8 +179,7 @@ static int push_mpls(struct sk_buff *skb, struct > sw_flow_key *key, > skb_postpush_rcsum(skb, new_mpls_lse, MPLS_HLEN); > > update_ethertype(skb, eth_hdr(skb), mpls->mpls_ethertype); > - if (!skb->inner_protocol) > - skb_set_inner_protocol(skb, skb->protocol); > + > skb->protocol = mpls->mpls_ethertype; > > invalidate_flow_key(key); > > > > > If it does, what else needs to be changed in OVS to handle the network layer > now pointing to the MPLS labels? > You also need to change pop_mpls(). Anyways I was thinking about the neigh output functions skb pull issue, where it is using network-header offset. Can we use mac_len? 
This way we would not use any inner offsets for the MPLS skb, and the current scheme used by the OVS datapath works.
Re: [PATCH net-next v1] gso: Support partial splitting at the frag_list pointer
On 24-08-2016 13:27, Alexander Duyck wrote: On Wed, Aug 24, 2016 at 2:32 AM, Steffen Klassert wrote: On Tue, Aug 23, 2016 at 07:47:32AM -0700, Alexander Duyck wrote: On Mon, Aug 22, 2016 at 10:20 PM, Steffen Klassert wrote: Since commit 8a29111c7 ("net: gro: allow to build full sized skb") gro may build buffers with a frag_list. This can hurt forwarding because most NICs can't offload such packets, they need to be segmented in software. This patch splits buffers with a frag_list at the frag_list pointer into buffers that can be TSO offloaded. Signed-off-by: Steffen Klassert --- net/core/skbuff.c | 89 +- net/ipv4/af_inet.c | 7 ++-- net/ipv4/gre_offload.c | 7 +++- net/ipv4/tcp_offload.c | 3 ++ net/ipv4/udp_offload.c | 9 +++-- net/ipv6/ip6_offload.c | 6 +++- 6 files changed, 114 insertions(+), 7 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 3864b4b6..a614e9d 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -3078,6 +3078,92 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb, sg = !!(features & NETIF_F_SG); csum = !!can_checksum_protocol(features, proto); + headroom = skb_headroom(head_skb); + + if (list_skb && net_gso_ok(features, skb_shinfo(head_skb)->gso_type) && + csum && sg && (mss != GSO_BY_FRAGS) && + !(features & NETIF_F_GSO_PARTIAL)) { Does this really need to be mutually exclusive with NETIF_F_GSO_PARTIAL and GSO_BY_FRAGS? It should be possible to extend this to NETIF_F_GSO_PARTIAL but I have no test for this. Regarding GSO_BY_FRAGS, this is rather new and just used for sctp. I don't know what sctp does with GSO_BY_FRAGS. I'm adding Marcelo as he could probably explain the GSO_BY_FRAGS functionality better than I could since he is the original author.
I believe the idea behind GSO_BY_FRAGS was to allow for segmenting a frame at the frag_list level instead of having it done just based on MSS. That was the only reason why I brought it up.

That's exactly it.

On this no-data-in-the-first-buffer limitation, we probably can allow it to have some data in there. It was done this way just because sctp is using skb_gro_receive() to build such skbs and this was the way I found to get such a frag_list skb generated by it, thus preserving frame boundaries.

For using GSO_BY_FRAGS in gso_size, it's how skb_is_gso() returns true, but it's similar to the SKB_GSO_PARTIAL rationale in here. We can make sctp also flag it as SKB_GSO_PARTIAL if needed, I guess, in case you need to maintain the gso_size value.

  Marcelo

In your case though we may be able to make this easier. If I am not mistaken I believe we should have the main skb, and any in the chain excluding the last, containing the same amount of data. That being the case we should be able to determine the size that you would need to segment at by taking skb->len, and removing the length of all the skbuffs hanging off of frag_list. At that point you just use that as your MSS for segmentation and it should break things up so that you have a series of equal sized segments split at the frag_list buffer boundaries. After that all that is left is to update the gso info for the buffers.

For GSO_PARTIAL I was handling that on the first segment only. For this change you would need to update that code to address the fact that you would have to determine the number of segments on the first frame and the last, since the last could be less than the first, but all of the others in-between should have the same number of segments.

- Alex
[PATCH -next] ibmvnic: convert to use simple_open()
From: Wei Yongjun Remove an open coded simple_open() function and replace file operations references to the function with simple_open() instead. Generated by: scripts/coccinelle/api/simple_open.cocci Signed-off-by: Wei Yongjun --- drivers/net/ethernet/ibm/ibmvnic.c | 18 ++ 1 file changed, 6 insertions(+), 12 deletions(-) diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c index b942108..e862530 100644 --- a/drivers/net/ethernet/ibm/ibmvnic.c +++ b/drivers/net/ethernet/ibm/ibmvnic.c @@ -2779,12 +2779,6 @@ static void handle_control_ras_rsp(union ibmvnic_crq *crq, } } -static int ibmvnic_fw_comp_open(struct inode *inode, struct file *file) -{ - file->private_data = inode->i_private; - return 0; -} - static ssize_t trace_read(struct file *file, char __user *user_buf, size_t len, loff_t *ppos) { @@ -2836,7 +2830,7 @@ static ssize_t trace_read(struct file *file, char __user *user_buf, size_t len, static const struct file_operations trace_ops = { .owner = THIS_MODULE, - .open = ibmvnic_fw_comp_open, + .open = simple_open, .read = trace_read, }; @@ -2886,7 +2880,7 @@ static ssize_t paused_write(struct file *file, const char __user *user_buf, static const struct file_operations paused_ops = { .owner = THIS_MODULE, - .open = ibmvnic_fw_comp_open, + .open = simple_open, .read = paused_read, .write = paused_write, }; @@ -2934,7 +2928,7 @@ static ssize_t tracing_write(struct file *file, const char __user *user_buf, static const struct file_operations tracing_ops = { .owner = THIS_MODULE, - .open = ibmvnic_fw_comp_open, + .open = simple_open, .read = tracing_read, .write = tracing_write, }; @@ -2987,7 +2981,7 @@ static ssize_t error_level_write(struct file *file, const char __user *user_buf, static const struct file_operations error_level_ops = { .owner = THIS_MODULE, - .open = ibmvnic_fw_comp_open, + .open = simple_open, .read = error_level_read, .write = error_level_write, }; @@ -3038,7 +3032,7 @@ static ssize_t trace_level_write(struct file 
*file, const char __user *user_buf, static const struct file_operations trace_level_ops = { .owner = THIS_MODULE, - .open = ibmvnic_fw_comp_open, + .open = simple_open, .read = trace_level_read, .write = trace_level_write, }; @@ -3091,7 +3085,7 @@ static ssize_t trace_buff_size_write(struct file *file, static const struct file_operations trace_size_ops = { .owner = THIS_MODULE, - .open = ibmvnic_fw_comp_open, + .open = simple_open, .read = trace_buff_size_read, .write = trace_buff_size_write, };
Re: [net-next PATCH 05/15] net: sched: a dflt qdisc may be used with per cpu stats
On Wed, 2016-08-24 at 10:13 -0700, John Fastabend wrote:
> I could fully allocate it in qdisc_alloc() but we don't know if the
> qdisc needs per cpu data structures until after the init call

Should not we have a flag to advertise the need of per cpu stats on a
qdisc?

It is not clear why ->init() can know this, and not its caller.

> . So it
> would sit unused in those cases if done from qdisc_alloc(). It seems
> best to me at least to just avoid the allocation in qdisc_alloc() and
> do it after init like I did here.
>
> Perhaps it would be nice to pull these into a function call
> post_init_qdisc_alloc() that does all this allocation?
>
> .John
Re: [PATCH net] qdisc: fix a module refcount leak in qdisc_create_dflt()
On 16-08-24 09:39 AM, Eric Dumazet wrote: > From: Eric Dumazet > > Should qdisc_alloc() fail, we must release the module refcount > we got right before. > > Fixes: 6da7c8fcbcbd ("qdisc: allow setting default queuing discipline") > Signed-off-by: Eric Dumazet > --- > net/sched/sch_generic.c |9 + > 1 file changed, 5 insertions(+), 4 deletions(-) > > diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c > index e95b67cd5718..657c13362b19 100644 > --- a/net/sched/sch_generic.c > +++ b/net/sched/sch_generic.c > @@ -643,18 +643,19 @@ struct Qdisc *qdisc_create_dflt(struct netdev_queue > *dev_queue, > struct Qdisc *sch; > > if (!try_module_get(ops->owner)) > - goto errout; > + return NULL; > > sch = qdisc_alloc(dev_queue, ops); > - if (IS_ERR(sch)) > - goto errout; > + if (IS_ERR(sch)) { > + module_put(ops->owner); > + return NULL; > + } > sch->parent = parentid; > > if (!ops->init || ops->init(sch, NULL) == 0) > return sch; > > qdisc_destroy(sch); > -errout: > return NULL; > } > EXPORT_SYMBOL(qdisc_create_dflt); > > Thanks! Acked-by: John Fastabend
[PATCH net-next] net: minor optimization in qdisc_qstats_cpu_drop()
From: Eric Dumazet per_cpu_inc() is faster (at least on x86) than per_cpu_ptr(xxx)++; Signed-off-by: Eric Dumazet --- include/net/sch_generic.h |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index 0d501779cc68f9426e58da6d039dd64adc937c20..52a2015667b49c8315edbb26513a98d4c677fee5 100644 --- a/include/net/sch_generic.h +++ b/include/net/sch_generic.h @@ -592,7 +592,7 @@ static inline void qdisc_qstats_drop(struct Qdisc *sch) static inline void qdisc_qstats_cpu_drop(struct Qdisc *sch) { - qstats_drop_inc(this_cpu_ptr(sch->cpu_qstats)); + this_cpu_inc(sch->cpu_qstats->drops); } static inline void qdisc_qstats_overlimit(struct Qdisc *sch)
Re: [PATCH 0/4] net: phy: Register header file for Microsemi PHYs.
On 08/24/2016 04:58 AM, Raju Lakkaraju wrote: > From: Nagaraju Lakkaraju > > This is Microsemi's VSC 85xx PHY register definitions header file. Please keep these register definitions local to the code using them unless they are shared between multiple drivers. -- Florian
Re: [PATCH 3/3] net: fs_enet: make rx_copybreak value configurable
On 08/24/2016 03:36 AM, Christophe Leroy wrote: > Measurement shows that on a MPC8xx running at 132MHz, the optimal > limit is 112: > * 114 bytes packets are processed in 147 TB ticks with higher copybreak > * 114 bytes packets are processed in 148 TB ticks with lower copybreak > * 128 bytes packets are processed in 154 TB ticks with higher copybreak > * 128 bytes packets are processed in 148 TB ticks with lower copybreak > * 238 bytes packets are processed in 172 TB ticks with higher copybreak > * 238 bytes packets are processed in 148 TB ticks with lower copybreak > > However it might be different on other processors > and/or frequencies. So it is useful to make it configurable. > > Signed-off-by: Christophe Leroy > --- > drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c | 8 +--- > include/linux/fs_enet_pd.h| 1 - > 2 files changed, 5 insertions(+), 4 deletions(-) > > diff --git a/drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c > b/drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c > index addcae6..b59bbf8 100644 > --- a/drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c > +++ b/drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c > @@ -60,6 +60,10 @@ module_param(fs_enet_debug, int, 0); > MODULE_PARM_DESC(fs_enet_debug, >"Freescale bitmapped debugging message enable value"); > > +static int rx_copybreak = 240; > +module_param(rx_copybreak, int, S_IRUGO | S_IWUSR); > +MODULE_PARM_DESC(rx_copybreak, "Receive copy threshold"); There is an ethtool tunable knob for copybreak now, which you should prefer over a module parameter, see drivers/net/ethernet/cisco/enic/enic_ethtool.c -- Florian
Re: [PATCH net-next 0/2] rxrpc: More fixes
From: David Howells Date: Wed, 24 Aug 2016 15:59:46 +0100 > Tagged thusly: > > git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git > rxrpc-rewrite-20160824-1 Both -1 and -2 pulled, thanks David!
[PATCH net] qdisc: fix a module refcount leak in qdisc_create_dflt()
From: Eric Dumazet Should qdisc_alloc() fail, we must release the module refcount we got right before. Fixes: 6da7c8fcbcbd ("qdisc: allow setting default queuing discipline") Signed-off-by: Eric Dumazet --- net/sched/sch_generic.c |9 + 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index e95b67cd5718..657c13362b19 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -643,18 +643,19 @@ struct Qdisc *qdisc_create_dflt(struct netdev_queue *dev_queue, struct Qdisc *sch; if (!try_module_get(ops->owner)) - goto errout; + return NULL; sch = qdisc_alloc(dev_queue, ops); - if (IS_ERR(sch)) - goto errout; + if (IS_ERR(sch)) { + module_put(ops->owner); + return NULL; + } sch->parent = parentid; if (!ops->init || ops->init(sch, NULL) == 0) return sch; qdisc_destroy(sch); -errout: return NULL; } EXPORT_SYMBOL(qdisc_create_dflt);
Re: [for-next 00/15][PULL request] Mellanox mlx5 core driver updates 2016-08-24
From: Saeed Mahameed Date: Wed, 24 Aug 2016 13:38:59 +0300 > This series contains some low level and API updates for mlx5 core > driver interface and mlx5_ifc.h, plus mlx5 LAG core driver support, > to be shared as base code for net-next and rdma mlx5 4.9 submissions. Pulled, thanks.
Re: [patch net 0/2] mlxsw: couple of fixes
From: Jiri Pirko Date: Wed, 24 Aug 2016 11:18:50 +0200 > Ido Schimmel (1): > mlxsw: spectrum: Add missing flood to router port > > Yotam Gigi (1): > mlxsw: router: Enable neighbors to be created on stacked devices Both applied, thanks Jiri.
Re: [PATCH net-next] bnx2x: Don't flush multicast MACs
From: Yuval Mintz Date: Wed, 24 Aug 2016 13:27:19 +0300 > When ndo_set_rx_mode() is called for bnx2x, as part of process of > configuring the new MAC address filters [both unicast & multicast] > driver begins by flushing the existing configuration and then iterating > over the network device's list of addresses and configures those instead. > > This has the side-effect of creating a short gap where traffic wouldn't > be properly classified, as no filters are configured in HW. > While for unicasts this is rather insignificant [as unicast MACs don't > frequently change while interface is actually running], > for multicast traffic it does pose an issue as there are multicast-based > networks where new multicast groups would constantly be removed and > added. > > This patch tries to remedy this [at least for the newer adapters] - > Instead of flushing & reconfiguring all existing multicast filters, > the driver would instead create the approximate hash match that would > result from the required filters. It would then compare it against the > currently configured approximate hash match, and only add and remove the > delta between those. > > Signed-off-by: Yuval Mintz Applied.
Re: [PATCH net-next 2/3] net: mpls: Fixups for GSO
On 8/24/16 10:28 AM, pravin shelar wrote: >> How do you feel about implementing the do_output() idea I suggested above? >> I'm happy to provide testing and review. > > I am not sure about changing do_output(). why not just use same scheme > to track mpls header in OVS datapath as done in mpls device? > was just replying with the same. Something like this should be able to handle multiple labels. The inner network header is set once and the outer one pointing to MPLS is adjusted each time a label is pushed: diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c index 1ecbd7715f6d..0f37b17e3a73 100644 --- a/net/openvswitch/actions.c +++ b/net/openvswitch/actions.c @@ -162,10 +162,16 @@ static int push_mpls(struct sk_buff *skb, struct sw_flow_key *key, if (skb_cow_head(skb, MPLS_HLEN) < 0) return -ENOMEM; + if (!skb->inner_protocol) { + skb_set_inner_network_header(skb, skb->mac_len); + skb_set_inner_protocol(skb, skb->protocol); + } + skb_push(skb, MPLS_HLEN); memmove(skb_mac_header(skb) - MPLS_HLEN, skb_mac_header(skb), skb->mac_len); skb_reset_mac_header(skb); + skb_set_network_header(skb, skb->mac_len); new_mpls_lse = (__be32 *)skb_mpls_header(skb); *new_mpls_lse = mpls->mpls_lse; @@ -173,8 +179,7 @@ static int push_mpls(struct sk_buff *skb, struct sw_flow_key *key, skb_postpush_rcsum(skb, new_mpls_lse, MPLS_HLEN); update_ethertype(skb, eth_hdr(skb), mpls->mpls_ethertype); - if (!skb->inner_protocol) - skb_set_inner_protocol(skb, skb->protocol); + skb->protocol = mpls->mpls_ethertype; invalidate_flow_key(key); If it does, what else needs to be changed in OVS to handle the network layer now pointing to the MPLS labels?
Re: [net-next PATCH 05/15] net: sched: a dflt qdisc may be used with per cpu stats
On Tue, 2016-08-23 at 13:24 -0700, John Fastabend wrote: > Enable dflt qdisc support for per cpu stats before this patch a > dflt qdisc was required to use the global statistics qstats and > bstats. > > Signed-off-by: John Fastabend > --- > net/sched/sch_generic.c | 24 > 1 file changed, 20 insertions(+), 4 deletions(-) > > diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c > index 80544c2..910b4d15 100644 > --- a/net/sched/sch_generic.c > +++ b/net/sched/sch_generic.c > @@ -646,18 +646,34 @@ struct Qdisc *qdisc_create_dflt(struct netdev_queue > *dev_queue, > struct Qdisc *sch; > > if (!try_module_get(ops->owner)) > - goto errout; > + return NULL; > > sch = qdisc_alloc(dev_queue, ops); > if (IS_ERR(sch)) > - goto errout; > + return NULL; > sch->parent = parentid; > > - if (!ops->init || ops->init(sch, NULL) == 0) > + if (!ops->init) > return sch; > > - qdisc_destroy(sch); > + if (ops->init(sch, NULL)) > + goto errout; > + > + /* init() may have set percpu flags so init data structures */ > + if (qdisc_is_percpu_stats(sch)) { > + sch->cpu_bstats = > + netdev_alloc_pcpu_stats(struct gnet_stats_basic_cpu); > + if (!sch->cpu_bstats) > + goto errout; > + > + sch->cpu_qstats = alloc_percpu(struct gnet_stats_queue); > + if (!sch->cpu_qstats) > + goto errout; > + } > + Why are you attempting these allocations here instead of qdisc_alloc() This looks weird, I would expect base qdisc being fully allocated before ops->init() is attempted.
Re: [patch net-next 0/7] mlxsw: Offload FDB learning configuration
From: Jiri Pirko Date: Wed, 24 Aug 2016 12:00:22 +0200 > From: Jiri Pirko > > Ido says: > This patchset addresses two long standing issues in the mlxsw driver > concerning FDB learning. > > Patch 1 limits the number of FDB records processed by the driver in a > single session. This is useful in situations in which many new records > need to be processed, thereby causing the RTNL mutex to be held for > long periods of time. > > Patches 2-6 offload the learning configuration (on / off) of bridge > ports to the device instead of having the driver decide whether a > record needs to be learned or not. > > The last patch is fallout and removes configuration no longer necessary > after the first patches are applied. Looks good, series applied, thanks!
Re: [PATCH net-next 2/3] net: mpls: Fixups for GSO
On Wed, Aug 24, 2016 at 12:20 AM, Simon Horman wrote: > Hi David, > > On Tue, Aug 23, 2016 at 01:24:51PM -0600, David Ahern wrote: >> On 8/22/16 8:51 AM, Simon Horman wrote: >> > >> > The scheme that OvS uses so far is that mac_len denotes the number of bytes >> > from the start of the MAC header until its end. In the absence of MPLS that >> > will be the beginning of the network header. And in the presence of MPLS it >> > will be the beginning of the MPLS label stack. The network header is... the >> > network header. This allows the MAC header, MPLS label stack and network >> > header to be tracked. >> >> The neigh output functions do '__skb_pull(skb, skb_network_offset(skb))' so >> if mpls_xmit does not reset the network header the labels get dropped. To me >> this says MPLS labels can not be lumped with the mac header which leaves the >> only option as the outer network header. >> >> > >> > Pravin (CCed) may have different ideas but I wonder if the above scheme can >> > be preserved while also meeting the needs of your new MPLS GSO scheme if >> > you set skb_set_network_header() and skb_set_inner_network_header() in >> > net/openvswitch/actions.c:do_output(). >> > >> > It may also be possible to teach OvS to use skb_set_network_header to >> > denote the beginning of the MPLS LSE and skb_set_inner_network_header to >> > denote the network header in the presence of MPLS. Which is my current >> > understanding of what you are trying to achieve. But I think its likely >> > that I misunderstand things as it seems strange to me to pretend that an >> > MPLS LSE is a network header and the outer most network header is an inner >> > network header >> > >> >> This is the only option I can see working, but open to patches showing an >> alternative. > > On reflection I came to a similar conclusion. > >> I would like to get it resolved this week so I can move on to gso in the >> mpls forward case. > > How do you feel about implementing the do_output() idea I suggested above? 
> I'm happy to provide testing and review. I am not sure about changing do_output(). why not just use same scheme to track mpls header in OVS datapath as done in mpls device?
Re: [net-next PATCH 05/15] net: sched: a dflt qdisc may be used with per cpu stats
On Tue, 2016-08-23 at 13:24 -0700, John Fastabend wrote:
> Enable dflt qdisc support for per cpu stats before this patch a
> dflt qdisc was required to use the global statistics qstats and
> bstats.
>
> Signed-off-by: John Fastabend
> ---
> net/sched/sch_generic.c | 24
> 1 file changed, 20 insertions(+), 4 deletions(-)
>
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 80544c2..910b4d15 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -646,18 +646,34 @@ struct Qdisc *qdisc_create_dflt(struct netdev_queue
> *dev_queue,
> struct Qdisc *sch;
>
> if (!try_module_get(ops->owner))
> - goto errout;
> + return NULL;
>
> sch = qdisc_alloc(dev_queue, ops);
> if (IS_ERR(sch))
> - goto errout;
> + return NULL;
> sch->parent = parentid;
>
> - if (!ops->init || ops->init(sch, NULL) == 0)
> + if (!ops->init)
> return sch;
>
> - qdisc_destroy(sch);
> + if (ops->init(sch, NULL))
> + goto errout;
> +
> + /* init() may have set percpu flags so init data structures */
> + if (qdisc_is_percpu_stats(sch)) {
> + sch->cpu_bstats =
> + netdev_alloc_pcpu_stats(struct gnet_stats_basic_cpu);
> + if (!sch->cpu_bstats)
> + goto errout;
> +
> + sch->cpu_qstats = alloc_percpu(struct gnet_stats_queue);
> + if (!sch->cpu_qstats)
> + goto errout;
> + }
> +
> + return sch;
> errout:
> + qdisc_destroy(sch);
> return NULL;
> }
> EXPORT_SYMBOL(qdisc_create_dflt);

Hmm... apparently we have a bug here, added in 6da7c8fcbcbdb50ec
("qdisc: allow setting default queuing discipline"). We do not undo the
try_module_get() in case of an error.

I will send a fix.
Re: [PATCH net-next v1] gso: Support partial splitting at the frag_list pointer
On Wed, Aug 24, 2016 at 2:32 AM, Steffen Klassert wrote: > On Tue, Aug 23, 2016 at 07:47:32AM -0700, Alexander Duyck wrote: >> On Mon, Aug 22, 2016 at 10:20 PM, Steffen Klassert >> wrote: >> > Since commit 8a29111c7 ("net: gro: allow to build full sized skb") >> > gro may build buffers with a frag_list. This can hurt forwarding >> > because most NICs can't offload such packets, they need to be >> > segmented in software. This patch splits buffers with a frag_list >> > at the frag_list pointer into buffers that can be TSO offloaded. >> > >> > Signed-off-by: Steffen Klassert >> > --- >> > net/core/skbuff.c | 89 >> > +- >> > net/ipv4/af_inet.c | 7 ++-- >> > net/ipv4/gre_offload.c | 7 +++- >> > net/ipv4/tcp_offload.c | 3 ++ >> > net/ipv4/udp_offload.c | 9 +++-- >> > net/ipv6/ip6_offload.c | 6 +++- >> > 6 files changed, 114 insertions(+), 7 deletions(-) >> > >> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c >> > index 3864b4b6..a614e9d 100644 >> > --- a/net/core/skbuff.c >> > +++ b/net/core/skbuff.c >> > @@ -3078,6 +3078,92 @@ struct sk_buff *skb_segment(struct sk_buff >> > *head_skb, >> > sg = !!(features & NETIF_F_SG); >> > csum = !!can_checksum_protocol(features, proto); >> > >> > + headroom = skb_headroom(head_skb); >> > + >> > + if (list_skb && net_gso_ok(features, >> > skb_shinfo(head_skb)->gso_type) && >> > + csum && sg && (mss != GSO_BY_FRAGS) && >> > + !(features & NETIF_F_GSO_PARTIAL)) { >> >> Does this really need to be mutually exclusive with >> NETIF_F_GSO_PARTIAL and GSO_BY_FRAGS? > > It should be possible to extend this to NETIF_F_GSO_PARTIAL but > I have no test for this. Regarding GSO_BY_FRAGS, this is rather > new and just used for sctp. I don't know what sctp does with > GSO_BY_FRAGS. I'm adding Marcelo as he could probably explain the GSO_BY_FRAGS functionality better than I could since he is the original author. 
If I recall, GSO_BY_FRAGS does something similar to what you are doing,
although I believe it doesn't carry any data in the first buffer other
than just a header. I believe the idea behind GSO_BY_FRAGS was to allow
for segmenting a frame at the frag_list level instead of having it done
just based on MSS. That was the only reason why I brought it up.

In your case though we may be able to make this easier. If I am not
mistaken, the main skb and every skb in the chain except the last
should contain the same amount of data. That being the case, we should
be able to determine the size that you would need to segment at by
taking skb->len and removing the length of all the skbuffs hanging off
of frag_list. At that point you just use that as your MSS for
segmentation, and it should break things up so that you have a series
of equal sized segments split at the frag_list buffer boundaries.

After that, all that is left is to update the gso info for the buffers.
For GSO_PARTIAL I was handling that on the first segment only. For this
change you would need to update that code to address the fact that you
would have to determine the number of segments on the first frame and
the last, since the last could be less than the first, but all of the
others in-between should have the same number of segments.

- Alex
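Alex's sizing idea can be checked with a toy model: if the head skb and every chained skb except the last carry the same amount of data, subtracting the frag_list bytes from the head's total length yields the per-segment split size. A self-contained sketch with a mock sk_buff (hypothetical names, not the kernel structures):

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for sk_buff: as in the kernel, len of the head skb
 * includes all data hanging off frag_list. */
struct toy_skb {
	unsigned int len;
	struct toy_skb *frag_list;
};

/* Alex's suggestion: derive the segmentation size by removing the
 * bytes carried by the frag_list chain from the head's total length,
 * leaving only the head's own payload. */
static unsigned int frag_list_mss(const struct toy_skb *head)
{
	unsigned int mss = head->len;
	const struct toy_skb *iter;

	for (iter = head->frag_list; iter; iter = iter->frag_list)
		mss -= iter->len;
	return mss;
}
```

For a head carrying 1000 bytes of its own data plus chained skbs of 1000, 1000 and 500 bytes (total len 3500), this yields 1000 — segmenting at that size splits exactly on the frag_list boundaries, with only the last segment shorter.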
[PATCH net-next] tcp: md5: add LINUX_MIB_TCPMD5FAILURE counter
From: Eric Dumazet

Adds SNMP counter for drops caused by MD5 mismatches.

The current syslog might help, but a counter is more precise and helps
monitoring.

Signed-off-by: Eric Dumazet
---
 include/uapi/linux/snmp.h |    1 +
 net/ipv4/proc.c           |    1 +
 net/ipv4/tcp_ipv4.c       |    1 +
 net/ipv6/tcp_ipv6.c       |    1 +
 4 files changed, 4 insertions(+)

diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index 25a9ad8bcef1240915f2553a8acade447186d869..e7a31f8306903f53bc5881ae4c271f85cad2e361 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -235,6 +235,7 @@ enum
 	LINUX_MIB_TCPSPURIOUSRTOS,	/* TCPSpuriousRTOs */
 	LINUX_MIB_TCPMD5NOTFOUND,	/* TCPMD5NotFound */
 	LINUX_MIB_TCPMD5UNEXPECTED,	/* TCPMD5Unexpected */
+	LINUX_MIB_TCPMD5FAILURE,	/* TCPMD5Failure */
 	LINUX_MIB_SACKSHIFTED,
 	LINUX_MIB_SACKMERGED,
 	LINUX_MIB_SACKSHIFTFALLBACK,
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 9f665b63a927202b9aaf2b6b3d42205058a2ae59..1ed015e4bc792acdd520a5df95ffac33ebefc4db 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -257,6 +257,7 @@ static const struct snmp_mib snmp4_net_list[] = {
 	SNMP_MIB_ITEM("TCPSpuriousRTOs", LINUX_MIB_TCPSPURIOUSRTOS),
 	SNMP_MIB_ITEM("TCPMD5NotFound", LINUX_MIB_TCPMD5NOTFOUND),
 	SNMP_MIB_ITEM("TCPMD5Unexpected", LINUX_MIB_TCPMD5UNEXPECTED),
+	SNMP_MIB_ITEM("TCPMD5Failure", LINUX_MIB_TCPMD5FAILURE),
 	SNMP_MIB_ITEM("TCPSackShifted", LINUX_MIB_SACKSHIFTED),
 	SNMP_MIB_ITEM("TCPSackMerged", LINUX_MIB_SACKMERGED),
 	SNMP_MIB_ITEM("TCPSackShiftFallback", LINUX_MIB_SACKSHIFTFALLBACK),
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 32b048e524d6773538918eca175b3f422f9c2aa7..45aac7ada13592c6f1c9f28aea4426b40520e0c8 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1169,6 +1169,7 @@ static bool tcp_v4_inbound_md5_hash(const struct sock *sk,
 				      NULL, skb);

 	if (genhash || memcmp(hash_location, newhash, 16) != 0) {
+		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMD5FAILURE);
 		net_info_ratelimited("MD5 Hash failed for (%pI4, %d)->(%pI4, %d)%s\n",
 				     &iph->saddr, ntohs(th->source),
 				     &iph->daddr, ntohs(th->dest),
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index e0f46439e391f2a8b2fac2e13b6f61a11c082715..60a65d058349c93fb66275434f6fe162a621782e 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -671,6 +671,7 @@ static bool tcp_v6_inbound_md5_hash(const struct sock *sk,
 				      NULL, skb);

 	if (genhash || memcmp(hash_location, newhash, 16) != 0) {
+		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMD5FAILURE);
 		net_info_ratelimited("MD5 Hash %s for [%pI6c]:%u->[%pI6c]:%u\n",
 				     genhash ? "failed" : "mismatch",
 				     &ip6h->saddr, ntohs(th->source),
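The drop condition being instrumented is a 16-byte digest comparison plus a hash-generation failure flag. A userspace model of the same bookkeeping — mock counter and hypothetical function name; the real code keeps this in per-cpu SNMP mibs and bumps it with NET_INC_STATS():

```c
#include <assert.h>
#include <string.h>

/* Mock of the LINUX_MIB_TCPMD5FAILURE counter. */
static unsigned long tcp_md5_failure;

/* Mirror of the patched check: genhash != 0 means the digest could not
 * be computed at all; otherwise the received and recomputed 16-byte
 * MD5 digests must match exactly. Returns nonzero when the segment
 * would be dropped, bumping the counter as the patch does. */
static int md5_check(int genhash, const unsigned char *hash_location,
		     const unsigned char *newhash)
{
	if (genhash || memcmp(hash_location, newhash, 16) != 0) {
		tcp_md5_failure++;	/* what the patch adds */
		return 1;
	}
	return 0;
}
```

The counter increments on both failure modes, which is why the IPv6 log message distinguishes "failed" from "mismatch" while the MIB entry counts them together.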
Re: wan-cosa: Use memdup_user() rather than duplicating its implementation
> What about the GFP_DMA attribute, which your patch deletes?
> The buffer in question has to be ISA DMA-able.

Thanks for your constructive feedback.

Would you be interested in using a variant of the function "memdup_…"
with which the corresponding memory allocation option can be preserved?

Regards,
Markus
Re: [ethtool PATCH v4 0/4] Add support for QSFP+/QSFP28 Diagnostics and 25G/50G/100G port speeds
On Wed, Aug 24, 2016 at 04:29:22AM +0000, Yuval Mintz wrote:
> > This patch series provides following support
> > a) Reorganized fields based out of SFF-8024 fields i.e. Identifier/
> >    Encoding/Connector types which are common across SFP/SFP+ (SFF-8472)
> >    and QSFP+/QSFP28 (SFF-8436/SFF-8636) modules into sff-common files.
> > b) Support for diagnostics information for QSFP Plus/QSFP28 modules
> >    based on SFF-8436/SFF-8636
> > c) Supporting 25G/50G/100G speeds in supported/advertising fields
> > d) Tested across various QSFP+/QSFP28 Copper/Optical modules
> >
> > Standards for QSFP+/QSFP28
> > a) QSFP+/QSFP28 - SFF 8636 Rev 2.7 dated January 26, 2016
> > b) SFF-8024 Rev 4.0 dated May 31, 2016
> >
> > v4:
> > Sync ethtool-copy.h to kernel commit
> > 89da45b8b5b2187734a11038b8593714f964ffd1
> > which includes support for 50G base SR2
>
> What about the man-page?

I can just apply your man page patch on top.

John
--
John W. Linville		Someday the world will need a hero, and you
linvi...@tuxdriver.com		might be all we have.  Be ready.
Re: [PATCH 1/3 v2] net: smsc911x: augment device tree bindings
On Wednesday, August 24, 2016 2:59:40 PM CEST Linus Walleij wrote:
> +- interrupts : Should contain the SMSC LAN
> +  interrupt line as cell 0, cell 1 is an OPTIONAL PME (power
> +  management event) interrupt that is able to wake up the host
> +  system with a 50ms pulse on network activity
> +  For generic bindings for interrupt controller parents, refer to
> +  interrupt-controller/interrupts.txt

I think you should (slightly) reword this to avoid using the term
"cell", which refers to a 32-bit word in the property, not the
interrupt specifier that is often made up of two or three cells.

Maybe something like

- interrupts: one or two interrupt specifiers:
  - The first interrupt is the SMSC LAN interrupt line.
  - The second interrupt (if present) is the power management event
    ...

	Arnd
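Arnd's "one or two interrupt specifiers" wording would map to a device node like the following sketch. This is a hypothetical example, not part of the binding patch: the unit address, controller phandle, interrupt numbers, and trigger flags are all made up, and the IRQ_TYPE_* macros assume the usual dt-bindings header is included.

```dts
ethernet@18000000 {
	compatible = "smsc,lan9115";
	reg = <0x18000000 0x100>;
	interrupt-parent = <&gpio1>;
	/* First specifier: LAN interrupt.
	 * Second specifier (optional): PME wake-up interrupt. */
	interrupts = <5 IRQ_TYPE_LEVEL_LOW>, <6 IRQ_TYPE_EDGE_RISING>;
};
```

Note that each specifier here is itself two cells (number and flags), which is exactly why describing the PME line as "cell 1" is misleading.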