Re: [PATCH 1/1] bonding: restrict up state in 802.3ad mode
On 01/07/2016 10:43 AM, Tantilov, Emil S wrote: -Original Message- From: zhuyj [mailto:zyjzyj2...@gmail.com] Sent: Tuesday, January 05, 2016 7:05 PM To: Tantilov, Emil S; Michal Kubecek; Jay Vosburgh Cc: vfal...@gmail.com; go...@cumulusnetworks.com; netdev@vger.kernel.org; Shteinbock, Boris (Wind River) Subject: Re: [PATCH 1/1] bonding: restrict up state in 802.3ad mode On 01/06/2016 09:26 AM, Tantilov, Emil S wrote: -Original Message- From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On Behalf Of zhuyj Sent: Monday, December 28, 2015 1:19 AM To: Michal Kubecek; Jay Vosburgh Cc: vfal...@gmail.com; go...@cumulusnetworks.com; netdev@vger.kernel.org; Shteinbock, Boris (Wind River) Subject: Re: [PATCH 1/1] bonding: restrict up state in 802.3ad mode On 12/28/2015 04:43 PM, Michal Kubecek wrote: On Thu, Dec 17, 2015 at 01:57:16PM -0800, Jay Vosburgh wrote: wrote: In 802.3ad mode, the speed and duplex is needed. But in some NIC, there is a time span between NIC up state and getting speed and duplex. As such, sometimes a slave in 802.3ad mode is in up state without speed and duplex. This will make bonding in 802.3ad mode can not work well. To make bonding driver be compatible with more NICs, it is necessary to restrict the up state in 802.3ad mode. What device is this? It seems a bit odd that an Ethernet device can be carrier up but not have the duplex and speed available. ... In general, though, bonding expects a speed or duplex change to be announced via a NETDEV_UPDATE or NETDEV_UP notifier, which would propagate to the 802.3ad logic. If the device here is going carrier up prior to having speed or duplex available, then maybe it should call netdev_state_change() when the duplex and speed are available, or delay calling netif_carrier_on(). I have encountered this problem (NIC having carrier on before being able to detect speed/duplex and driver not notifying when speed/duplex becomes available) with netxen cards earlier. But it was eventually fixed in the driver by commit 9d01412ae76f ("netxen: Fix link event handling.") so this example rather supports what you said. Michal Kubecek Thanks a lot. I checked the commit 9d01412ae76f ("netxen: Fix link event handling."). The symptoms are the same with mine. The root cause is different. In my problem, the root cause is that LINKS register[] can not provide link_up and link_speed at the same time. There is a time span between link_up and link_speed. The LINK_UP and LINK_SPEED bits in the LINKS register for ixgbe HW are updated simultaneously. Do you have any proof to show the delay you are referring to as I am sure our HW engineers would like to know about it. Sorry. I can not reproduce this problem locally. What I have is the feedback from the customer. So you are assuming that there is a delay due to the issue you are seeing? Settings for eth0: Supported ports: [ TP ] Supported link modes: 100baseT/Full 1000baseT/Full 1baseT/Full Supported pause frame use: No Supports auto-negotiation: Yes Advertised link modes: 100baseT/Full 1000baseT/Full 1baseT/Full Advertised pause frame use: No Advertised auto-negotiation: Yes Speed: Unknown! Duplex: Unknown! (255) Port: Twisted Pair PHYAD: 0 Transceiver: external Auto-negotiation: on MDI-X: Unknown Supports Wake-on: d Wake-on: d Current message level: 0x0007 (7) drv probe link Link detected: yes The speed and the link state here are reported from different sources: Link detected: yes Comes from a netif_carrier_ok() check. This is done via ethtool_op_get_link(). Only the speed is reported through the LINKS register - that is why it is reported as "Unknown" - in other words link_up is false. This is a trace from the case where the bonding driver reports 0 Mbps: kworker/u48:1-27950 [010] 6493.084916: ixgbe_service_task: eth1: link_speed = 80, link_up = false kworker/u48:1-27950 [011] 6493.184894: ixgbe_service_task: eth1: link_speed = 80, link_up = false kworker/u48:1-27950 [000] 6494.439883: ixgbe_service_task: eth1: link_speed = 80, link_up = true kworker/u48:1-27950 [000] 6494.464204: ixgbe_service_task: eth1: NIC Link is Up 10 Gbps, Flow Control: RX/TX kworker/0:2-1926 [000] 6494.464249: ixgbe_get_settings: eth1: link_speed = 80, link_up = false NetworkManager-3819 [008] 6494.464484: ixgbe_get_settings: eth1: link_speed = 80, link_up = false kworker/u48:1-27950 [007] 6494.496886: bond_mii_monitor: bond0: link status definitely up for interface eth1, 0 Mbps full duplex NetworkManager-3819 [008] 6494.496967: ixgbe_get_settings: eth1: link_speed = 80, link
Re: header conflict introduced by change to netfilter_ipv4/ip_tables.h
On Wed, Jan 06, 2016 at 09:20:07AM -0800, Stephen Hemminger wrote: > This commit breaks compilation of iproute2 with net-next. Ok, linux/if.h and libc net/if.h have overlapping defines, and this is not the only one. I saw lots of them in the core dump headers. How should we handle them? Another ifndef for IFNAMSIZ into kernel uapi headers? -Mikko > commit 1ffad83dffd675cd742286ae82dca7d746cb0da8 > Author: Mikko Rapeli > Date: Thu Oct 15 07:56:30 2015 +0200 > > netfilter: fix include files for compilation > > Add missing header dependencies and other small changes so that each file > compiles alone in userspace. > > Signed-off-by: Mikko Rapeli > Signed-off-by: Pablo Neira Ayuso > > For iproute2, a copy of kernel headers (from make install_headers) is used. > After this change. the build of x_tables.c fails because IFNAMSIZ is already > defined in net/if.h > > gcc -Wall -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations > -Wold-style-definition -Wformat=2 -O2 -I../include -DRESOLVE_HOSTNAMES > -DLIBDIR=\"/usr/lib\" -DCONFDIR=\"/etc/iproute2\" -D_GNU_SOURCE > -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE > -DHAVE_SETNS -DHAVE_ELF -DCONFIG_GACT -DCONFIG_GACT_PROB > -DIPT_LIB_DIR=\"/lib/xtables\" -DYY_NO_INPUT -Wl,-export-dynamic -shared > -fpic -o q_atm.so q_atm.c -latm > gcc -Wall -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations > -Wold-style-definition -Wformat=2 -O2 -I../include -DRESOLVE_HOSTNAMES > -DLIBDIR=\"/usr/lib\" -DCONFDIR=\"/etc/iproute2\" -D_GNU_SOURCE > -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE > -DHAVE_SETNS -DHAVE_ELF -DCONFIG_GACT -DCONFIG_GACT_PROB > -DIPT_LIB_DIR=\"/lib/xtables\" -DYY_NO_INPUT -Wl,-export-dynamic -shared > -fpic -o m_xt.so m_xt.c $(pkg-config xtables --cflags --libs) > In file included from ../include/linux/netfilter_ipv4/ip_tables.h:20:0, > from m_xt.c:20: > ../include/linux/if.h:26:0: warning: "IFNAMSIZ" redefined > #define IFNAMSIZ 16 > ^ > In file included from m_xt.c:17:0: > /usr/include/net/if.h:129:0: note: this is the location of the previous > definition > # define IFNAMSIZ IF_NAMESIZE > ^ > ../include/linux/if.h:71:2: error: redeclaration of enumerator ‘IFF_UP’ > IFF_UP= 1<<0, /* sysfs */ > ^ > /usr/include/net/if.h:44:5: note: previous definition of ‘IFF_UP’ was here > IFF_UP = 0x1, /* Interface is up. */ > ^ > ../include/linux/if.h:72:2: error: redeclaration of enumerator ‘IFF_BROADCAST’ > IFF_BROADCAST = 1<<1, /* __volatile__ */ > ^ > /usr/include/net/if.h:46:5: note: previous definition of ‘IFF_BROADCAST’ was > here > IFF_BROADCAST = 0x2, /* Broadcast address valid. */ > ^ > ../include/linux/if.h:73:2: error: redeclaration of enumerator ‘IFF_DEBUG’ > IFF_DEBUG = 1<<2, /* sysfs */ -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: WARN trace - skb_warn_bad_offload - vxlan - large udp packet - udp checksum disabled
On Thu, Jan 07, 2016 at 03:36:27AM +0100, Hannes Frederic Sowa wrote: > On 15.12.2015 02:35, Nelson, Shannon wrote: > >Using a slightly modified version of udpspam (see diff below - > >hopefully not mangled by corporate email servers), where I set the > >SO_NO_CHECK socket option and can specify a large buffer size, I can > >reliably get the following WARN trace. I have reproduced this on > >both ixgbe and i40e drivers using "udpspam-no-check > >6000". > > > >It looks to me like this is in the Tx path before we get to the > >actual NIC drivers, but I may be wrong. > > Does UFO offloading on the tunnel interface fix this error? > > ethtool -K vxlan ufo off > > If yes we can't feed SO_NO_CHECK udp packets into UFO as gso/ufo > requires CHECKSUM_PARTIAL (or add some more logic into the skb or > query socket status from skb_gso_segment). I believe you are right. This sounds like the issue addressed by commit acf8dd0a9d0b ("udp: only allow UFO for packets from SOCK_DGRAM sockets") except for a regular SOCK_DGRAM socket with SO_NO_CHECK option. If so, changing both instances of the check (sk->sk_type == SOCK_DGRAM) added by this commit to (sk->sk_type == SOCK_DGRAM) && !sk->sk_no_check_tx should fix the problem (and it should be done even if the issue reported here is caused by something else). Michal Kubecek -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/1] bonding: restrict up state in 802.3ad mode
Hi, Emil Would you like to help me to make tests with this patch? If the root cause is not the time span, I will make a new patch for this. Thanks a lot. Zhu Yanjun On 01/07/2016 02:15 PM, zyjzyj2...@gmail.com wrote: From: Zhu Yanjun In 802.3ad mode, the speed and duplex is needed. But in some NIC, there is a time span between NIC up state and getting speed and duplex. As such, sometimes a slave in 802.3ad mode is in up state without speed and duplex. This will make bonding in 802.3ad mode can not work well. To make bonding driver be compatible with more NICs, it is necessary to restrict the up state in 802.3ad mode. Signed-off-by: Zhu Yanjun --- drivers/net/bonding/bond_main.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 09f8a48..7df8af5 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1991,6 +1991,17 @@ static int bond_miimon_inspect(struct bonding *bond) link_state = bond_check_dev_link(bond, slave->dev, 0); + if ((BMSR_LSTATUS == link_state) && + (BOND_MODE(bond) == BOND_MODE_8023AD)) { + rtnl_lock(); + bond_update_speed_duplex(slave); + rtnl_unlock(); + if ((slave->speed == SPEED_UNKNOWN) || + (slave->duplex == DUPLEX_UNKNOWN)) { + link_state = 0; + netdev_info(bond->dev, "In 802.3ad mode, it is not enough to up without speed and duplex"); + } + } switch (slave->link) { case BOND_LINK_UP: if (link_state) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 1/1] bonding: restrict up state in 802.3ad mode
>-Original Message- >From: zhuyj [mailto:zyjzyj2...@gmail.com] >Sent: Wednesday, January 06, 2016 7:34 PM >To: Tantilov, Emil S; Michal Kubecek; Jay Vosburgh >Cc: vfal...@gmail.com; go...@cumulusnetworks.com; netdev@vger.kernel.org; >Shteinbock, Boris (Wind River) >Subject: Re: [PATCH 1/1] bonding: restrict up state in 802.3ad mode > >On 01/07/2016 10:43 AM, Tantilov, Emil S wrote: >>> -Original Message- >>> From: zhuyj [mailto:zyjzyj2...@gmail.com] >>> Sent: Tuesday, January 05, 2016 7:05 PM >>> To: Tantilov, Emil S; Michal Kubecek; Jay Vosburgh >>> Cc: vfal...@gmail.com; go...@cumulusnetworks.com; >netdev@vger.kernel.org; >>> Shteinbock, Boris (Wind River) >>> Subject: Re: [PATCH 1/1] bonding: restrict up state in 802.3ad mode >>> >>> On 01/06/2016 09:26 AM, Tantilov, Emil S wrote: > -Original Message- > From: netdev-ow...@vger.kernel.org [mailto:netdev- >ow...@vger.kernel.org] >>> On > Behalf Of zhuyj > Sent: Monday, December 28, 2015 1:19 AM > To: Michal Kubecek; Jay Vosburgh > Cc: vfal...@gmail.com; go...@cumulusnetworks.com; >>> netdev@vger.kernel.org; > Shteinbock, Boris (Wind River) > Subject: Re: [PATCH 1/1] bonding: restrict up state in 802.3ad mode > > On 12/28/2015 04:43 PM, Michal Kubecek wrote: >> On Thu, Dec 17, 2015 at 01:57:16PM -0800, Jay Vosburgh wrote: >>> wrote: In 802.3ad mode, the speed and duplex is needed. But in some NIC, there is a time span between NIC up state and getting speed and >>> duplex. As such, sometimes a slave in 802.3ad mode is in up state without speed and duplex. This will make bonding in 802.3ad mode can not work well. To make bonding driver be compatible with more NICs, it is necessary to restrict the up state in 802.3ad mode. >>> What device is this? It seems a bit odd that an Ethernet >device >>> can be carrier up but not have the duplex and speed available. >> ... >>> In general, though, bonding expects a speed or duplex change to >>> be announced via a NETDEV_UPDATE or NETDEV_UP notifier, which would >>> propagate to the 802.3ad logic. >>> >>> If the device here is going carrier up prior to having speed or >>> duplex available, then maybe it should call netdev_state_change() >when >>> the duplex and speed are available, or delay calling >>> netif_carrier_on(). >> I have encountered this problem (NIC having carrier on before being >>> able >> to detect speed/duplex and driver not notifying when speed/duplex >> becomes available) with netxen cards earlier. But it was eventually >> fixed in the driver by commit 9d01412ae76f ("netxen: Fix link event >> handling.") so this example rather supports what you said. >> >> Michal >>> Kubecek > Thanks a lot. > I checked the commit 9d01412ae76f ("netxen: Fix link event > handling."). The symptoms are the same with mine. > > The root cause is different. In my problem, the root cause is that >LINKS > register[] can not provide link_up and link_speed at the same time. > There is a time span between link_up and link_speed. The LINK_UP and LINK_SPEED bits in the LINKS register for ixgbe HW are >>> updated simultaneously. Do you have any proof to show the delay you are >referring >>> to as I am sure our HW engineers would like to know about it. >>> Sorry. I can not reproduce this problem locally. What I have is the >>> feedback from the customer. >> So you are assuming that there is a delay due to the issue you are >seeing? > >Sure. Before I get the further feedback from the customer, I can not >make further conclusion. >My patch is based on the feedback from the customer. Your patch is throwing an RTNL assertion warning: RTNL: assertion failed at net/core/ethtool.c (357) Looks like you may need to hold an RTNL lock for the slave before calling bond_update_speed_duplex(), though I am not sure if it's a good idea in general. Thanks, Emil -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next] net, sched: add clsact qdisc
On Wed, Jan 06, 2016 at 02:00:56AM +0100, Daniel Borkmann wrote: > > I decided to extend the sch_ingress module with clsact functionality so > that commonly used code can be reused, the module is being aliased with > sch_clsact so that it can be auto-loaded properly. Alternative would have been > to add a flag when initializing ingress to alter its behaviour plus aliasing > to a different name (as it's more than just ingress). However, the first would > end up, based on the flag, choosing the new/old behaviour by calling different > function implementations to handle each anyway, the latter would require to > register ingress qdisc once again under different alias. So, this really begs > to provide a minimal, cleaner approach to have Qdisc_ops and Qdisc_class_ops > by its own that share callbacks used by both. ... > Signed-off-by: Daniel Borkmann we've been going back and forth on the design and this final approach presented seems to be the best, since pros outweigh the cons. Acked-by: Alexei Starovoitov -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/1] bonding: restrict up state in 802.3ad mode
On 01/07/2016 10:43 AM, Tantilov, Emil S wrote: -Original Message- From: zhuyj [mailto:zyjzyj2...@gmail.com] Sent: Tuesday, January 05, 2016 7:05 PM To: Tantilov, Emil S; Michal Kubecek; Jay Vosburgh Cc: vfal...@gmail.com; go...@cumulusnetworks.com; netdev@vger.kernel.org; Shteinbock, Boris (Wind River) Subject: Re: [PATCH 1/1] bonding: restrict up state in 802.3ad mode On 01/06/2016 09:26 AM, Tantilov, Emil S wrote: -Original Message- From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On Behalf Of zhuyj Sent: Monday, December 28, 2015 1:19 AM To: Michal Kubecek; Jay Vosburgh Cc: vfal...@gmail.com; go...@cumulusnetworks.com; netdev@vger.kernel.org; Shteinbock, Boris (Wind River) Subject: Re: [PATCH 1/1] bonding: restrict up state in 802.3ad mode On 12/28/2015 04:43 PM, Michal Kubecek wrote: On Thu, Dec 17, 2015 at 01:57:16PM -0800, Jay Vosburgh wrote: wrote: In 802.3ad mode, the speed and duplex is needed. But in some NIC, there is a time span between NIC up state and getting speed and duplex. As such, sometimes a slave in 802.3ad mode is in up state without speed and duplex. This will make bonding in 802.3ad mode can not work well. To make bonding driver be compatible with more NICs, it is necessary to restrict the up state in 802.3ad mode. What device is this? It seems a bit odd that an Ethernet device can be carrier up but not have the duplex and speed available. ... In general, though, bonding expects a speed or duplex change to be announced via a NETDEV_UPDATE or NETDEV_UP notifier, which would propagate to the 802.3ad logic. If the device here is going carrier up prior to having speed or duplex available, then maybe it should call netdev_state_change() when the duplex and speed are available, or delay calling netif_carrier_on(). I have encountered this problem (NIC having carrier on before being able to detect speed/duplex and driver not notifying when speed/duplex becomes available) with netxen cards earlier. But it was eventually fixed in the driver by commit 9d01412ae76f ("netxen: Fix link event handling.") so this example rather supports what you said. Michal Kubecek Thanks a lot. I checked the commit 9d01412ae76f ("netxen: Fix link event handling."). The symptoms are the same with mine. The root cause is different. In my problem, the root cause is that LINKS register[] can not provide link_up and link_speed at the same time. There is a time span between link_up and link_speed. The LINK_UP and LINK_SPEED bits in the LINKS register for ixgbe HW are updated simultaneously. Do you have any proof to show the delay you are referring to as I am sure our HW engineers would like to know about it. Sorry. I can not reproduce this problem locally. What I have is the feedback from the customer. So you are assuming that there is a delay due to the issue you are seeing? Sure. Before I get the further feedback from the customer, I can not make further conclusion. My patch is based on the feedback from the customer. Settings for eth0: Supported ports: [ TP ] Supported link modes: 100baseT/Full 1000baseT/Full 1baseT/Full Supported pause frame use: No Supports auto-negotiation: Yes Advertised link modes: 100baseT/Full 1000baseT/Full 1baseT/Full Advertised pause frame use: No Advertised auto-negotiation: Yes Speed: Unknown! Duplex: Unknown! (255) Port: Twisted Pair PHYAD: 0 Transceiver: external Auto-negotiation: on MDI-X: Unknown Supports Wake-on: d Wake-on: d Current message level: 0x0007 (7) drv probe link Link detected: yes The speed and the link state here are reported from different sources: Sure. ixgbe_get_settings->hw->mac.ops.check_link(X540)->ixgbe_check_mac_link_generic In this function ixgbe_check_mac_link_generic, the register IXGBE_LINKS is checked. link_up and link_speed is got from this register. Link detected: yes Comes from a netif_carrier_ok() check. This is done via ethtool_op_get_link() Only the speed is reported through the LINKS register - that is why it is reported as "Unknown" - in other words link_up is false. Sorry. I do not agree with you. static inline bool netif_carrier_ok(const struct net_device *dev) { return !test_bit(__LINK_STATE_NOCARRIER, &dev->state); } netif_carrier_ok will check __LINK_STATE_NOCARRIER. This __LINK_STATE_NOCARRIER is set by netif_carrier_on. /** * netif_carrier_on - set carrier * @dev: network device * * Device has detected that carrier. */ void netif_carrier_on(struct net_device *dev) { if (test_and_clear_bit(__LINK_STATE_NOCARRIER, &dev->state)) { if (dev->reg_state == NETREG_UNINITIALIZED)
Re: [PATCH net-next] bpf: add skb_postpush_rcsum and fix dev_forward_skb occasions
Hi Daniel, [auto build test ERROR on net-next/master] url: https://github.com/0day-ci/linux/commits/Daniel-Borkmann/bpf-add-skb_postpush_rcsum-and-fix-dev_forward_skb-occasions/20160107-090423 config: x86_64-lkp (attached as .config) reproduce: # save the attached .config to linux build tree make ARCH=x86_64 All errors (new ones prefixed by >>): In file included from include/net/sch_generic.h:8:0, from include/linux/filter.h:16, from include/net/sock.h:64, from include/net/inet_sock.h:27, from include/net/ip.h:30, from net/core/filter.c:34: net/core/filter.c: In function 'bpf_clone_redirect': >> net/core/filter.c:1530:19: error: 'struct sk_buff' has no member named >> 'tc_verd' if (G_TC_AT(skb2->tc_verd) & AT_INGRESS) ^ include/uapi/linux/pkt_cls.h:11:25: note: in definition of macro '_TC_MAKE32' #define _TC_MAKE32(x) ((x)) ^ include/uapi/linux/pkt_cls.h:54:26: note: in expansion of macro '_TC_GETVALUE' #define G_TC_AT(x) _TC_GETVALUE(x,S_TC_AT,M_TC_AT) ^ net/core/filter.c:1530:7: note: in expansion of macro 'G_TC_AT' if (G_TC_AT(skb2->tc_verd) & AT_INGRESS) ^ net/core/filter.c: In function 'skb_do_redirect': net/core/filter.c:1578:18: error: 'struct sk_buff' has no member named 'tc_verd' if (G_TC_AT(skb->tc_verd) & AT_INGRESS) ^ include/uapi/linux/pkt_cls.h:11:25: note: in definition of macro '_TC_MAKE32' #define _TC_MAKE32(x) ((x)) ^ include/uapi/linux/pkt_cls.h:54:26: note: in expansion of macro '_TC_GETVALUE' #define G_TC_AT(x) _TC_GETVALUE(x,S_TC_AT,M_TC_AT) ^ net/core/filter.c:1578:7: note: in expansion of macro 'G_TC_AT' if (G_TC_AT(skb->tc_verd) & AT_INGRESS) ^ vim +1530 net/core/filter.c 1524 1525 skb2 = skb_clone(skb, GFP_ATOMIC); 1526 if (unlikely(!skb2)) 1527 return -ENOMEM; 1528 1529 if (BPF_IS_REDIRECT_INGRESS(flags)) { > 1530 if (G_TC_AT(skb2->tc_verd) & AT_INGRESS) 1531 skb_postpush_rcsum(skb2, skb_mac_header(skb2), 1532 skb2->mac_len); 1533 return dev_forward_skb(dev, skb2); --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
Re: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64
On 07.01.2016 03:36, Tom Herbert wrote: On Wed, Jan 6, 2016 at 5:52 PM, Hannes Frederic Sowa wrote: Hi Tom, On 05.01.2016 19:41, Tom Herbert wrote: --- /dev/null +++ b/arch/x86/lib/csum-partial_64.S @@ -0,0 +1,147 @@ +/* Copyright 2016 Tom Herbert + * + * Checksum partial calculation + * + * __wsum csum_partial(const void *buff, int len, __wsum sum) + * + * Computes the checksum of a memory block at buff, length len, + * and adds in "sum" (32-bit) + * + * Returns a 32-bit number suitable for feeding into itself + * or csum_tcpudp_magic + * + * Register usage: + * %rdi: argument 1, buff + * %rsi: argument 2, length + * %rdx: argument 3, add in value I think you forgot to carry-add-in the %rdx register. The assembly code replaces do_csum but not csum_partial. First instruction is: movl%edx, %eax /* Initialize with initial sum argument */ Ups, sorry. I only grepped through the source. Meanwhile my test also uses non-zero sums and shows that it is fine. Bye, Hannes -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 1/1] bonding: restrict up state in 802.3ad mode
>-Original Message- >From: zhuyj [mailto:zyjzyj2...@gmail.com] >Sent: Tuesday, January 05, 2016 7:05 PM >To: Tantilov, Emil S; Michal Kubecek; Jay Vosburgh >Cc: vfal...@gmail.com; go...@cumulusnetworks.com; netdev@vger.kernel.org; >Shteinbock, Boris (Wind River) >Subject: Re: [PATCH 1/1] bonding: restrict up state in 802.3ad mode > >On 01/06/2016 09:26 AM, Tantilov, Emil S wrote: >>> -Original Message- >>> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] >On >>> Behalf Of zhuyj >>> Sent: Monday, December 28, 2015 1:19 AM >>> To: Michal Kubecek; Jay Vosburgh >>> Cc: vfal...@gmail.com; go...@cumulusnetworks.com; >netdev@vger.kernel.org; >>> Shteinbock, Boris (Wind River) >>> Subject: Re: [PATCH 1/1] bonding: restrict up state in 802.3ad mode >>> >>> On 12/28/2015 04:43 PM, Michal Kubecek wrote: On Thu, Dec 17, 2015 at 01:57:16PM -0800, Jay Vosburgh wrote: > wrote: >> In 802.3ad mode, the speed and duplex is needed. But in some NIC, >> there is a time span between NIC up state and getting speed and >duplex. >> As such, sometimes a slave in 802.3ad mode is in up state without >> speed and duplex. This will make bonding in 802.3ad mode can not >> work well. >> To make bonding driver be compatible with more NICs, it is >> necessary to restrict the up state in 802.3ad mode. > What device is this? It seems a bit odd that an Ethernet device > can be carrier up but not have the duplex and speed available. ... > In general, though, bonding expects a speed or duplex change to > be announced via a NETDEV_UPDATE or NETDEV_UP notifier, which would > propagate to the 802.3ad logic. > > If the device here is going carrier up prior to having speed or > duplex available, then maybe it should call netdev_state_change() when > the duplex and speed are available, or delay calling >netif_carrier_on(). I have encountered this problem (NIC having carrier on before being >able to detect speed/duplex and driver not notifying when speed/duplex becomes available) with netxen cards earlier. But it was eventually fixed in the driver by commit 9d01412ae76f ("netxen: Fix link event handling.") so this example rather supports what you said. Michal >Kubecek >>> Thanks a lot. >>> I checked the commit 9d01412ae76f ("netxen: Fix link event >>> handling."). The symptoms are the same with mine. >>> >>> The root cause is different. In my problem, the root cause is that LINKS >>> register[] can not provide link_up and link_speed at the same time. >>> There is a time span between link_up and link_speed. >> The LINK_UP and LINK_SPEED bits in the LINKS register for ixgbe HW are >updated >> simultaneously. Do you have any proof to show the delay you are referring >to >> as I am sure our HW engineers would like to know about it. >Sorry. I can not reproduce this problem locally. What I have is the >feedback from the customer. So you are assuming that there is a delay due to the issue you are seeing? >Settings for eth0: >Supported ports: [ TP ] >Supported link modes: 100baseT/Full >1000baseT/Full >1baseT/Full >Supported pause frame use: No >Supports auto-negotiation: Yes >Advertised link modes: 100baseT/Full >1000baseT/Full >1baseT/Full >Advertised pause frame use: No >Advertised auto-negotiation: Yes >Speed: Unknown! >Duplex: Unknown! (255) >Port: Twisted Pair >PHYAD: 0 >Transceiver: external >Auto-negotiation: on >MDI-X: Unknown >Supports Wake-on: d >Wake-on: d >Current message level: 0x0007 (7) > drv probe link >Link detected: yes The speed and the link state here are reported from different sources: >Link detected: yes Comes from a netif_carrier_ok() check. This is done via ethtool_op_get_link(). Only the speed is reported through the LINKS register - that is why it is reported as "Unknown" - in other words link_up is false. This is a trace from the case where the bonding driver reports 0 Mbps: kworker/u48:1-27950 [010] 6493.084916: ixgbe_service_task: eth1: link_speed = 80, link_up = false kworker/u48:1-27950 [011] 6493.184894: ixgbe_service_task: eth1: link_speed = 80, link_up = false kworker/u48:1-27950 [000] 6494.439883: ixgbe_service_task: eth1: link_speed = 80, link_up = true kworker/u48:1-27950 [000] 6494.464204: ixgbe_service_task: eth1: NIC Link is Up 10 Gbps, Flow Control: RX/TX kworker/0:2-1926 [000] 6494.464249: ixgbe_get_settings: eth1: link_speed = 80, link_up = false NetworkManager-3819 [008] 6494.464484: ixgbe_get_settings: eth1: link_speed = 80, link_up = false kworker/u48:1-27950 [007] 6494.496886: bond_mii_monitor: bo
Re: [Intel-wired-lan] [PATCH 2/2] ixgbe: restrict synchronization of link_up and speed
On 01/06/2016 11:30 PM, Tantilov, Emil S wrote: -Original Message- From: zhuyj [mailto:zyjzyj2...@gmail.com] Sent: Tuesday, January 05, 2016 9:42 PM To: Tantilov, Emil S; Kirsher, Jeffrey T; Brandeburg, Jesse; Nelson, Shannon; Wyborny, Carolyn; Skidmore, Donald C; Allan, Bruce W; Ronciak, John; Williams, Mitch A; intel-wired-...@lists.osuosl.org; netdev@vger.kernel.org; e1000-de...@lists.sourceforge.net Cc: Viswanathan, Ven (Wind River); Shteinbock, Boris (Wind River); Bourg, Vincent (Wind River) Subject: Re: [Intel-wired-lan] [PATCH 2/2] ixgbe: restrict synchronization of link_up and speed On 12/31/2015 12:37 AM, Tantilov, Emil S wrote: -Original Message- From: zhuyj [mailto:zyjzyj2...@gmail.com] Sent: Wednesday, December 30, 2015 12:20 AM To: Tantilov, Emil S; Kirsher, Jeffrey T; Brandeburg, Jesse; Nelson, Shannon; Wyborny, Carolyn; Skidmore, Donald C; Allan, Bruce W; Ronciak, John; Williams, Mitch A; intel-wired-...@lists.osuosl.org; netdev@vger.kernel.org; e1000-de...@lists.sourceforge.net Cc: Viswanathan, Ven (Wind River); Shteinbock, Boris (Wind River); Bourg, Vincent (Wind River) Subject: Re: [Intel-wired-lan] [PATCH 2/2] ixgbe: restrict synchronization of link_up and speed On 12/30/2015 02:55 PM, Tantilov, Emil S wrote: -Original Message- From: zhuyj [mailto:zyjzyj2...@gmail.com] Sent: Tuesday, December 29, 2015 6:49 PM To: Tantilov, Emil S; Kirsher, Jeffrey T; Brandeburg, Jesse; Nelson, Shannon; Wyborny, Carolyn; Skidmore, Donald C; Allan, Bruce W; Ronciak, John; Williams, Mitch A; intel-wired-...@lists.osuosl.org; netdev@vger.kernel.org; e1000-de...@lists.sourceforge.net Cc: Viswanathan, Ven (Wind River); Shteinbock, Boris (Wind River); Bourg, Vincent (Wind River) Subject: Re: [Intel-wired-lan] [PATCH 2/2] ixgbe: restrict synchronization of link_up and speed On 12/30/2015 12:18 AM, Tantilov, Emil S wrote: -Original Message- From: Intel-wired-lan [mailto:intel-wired-lan- boun...@lists.osuosl.org] On Behalf Of zyjzyj2...@gmail.com Sent: Monday, December 28, 2015 6:32 PM To: Kirsher, Jeffrey T; Brandeburg, Jesse; Nelson, Shannon; Wyborny, Carolyn; Skidmore, Donald C; Allan, Bruce W; Ronciak, John; Williams, Mitch A; intel-wired-...@lists.osuosl.org; netdev@vger.kernel.org; e1000- de...@lists.sourceforge.net Cc: Viswanathan, Ven (Wind River); Shteinbock, Boris (Wind River); Bourg, Vincent (Wind River) Subject: [Intel-wired-lan] [PATCH 2/2] ixgbe: restrict synchronization of link_up and speed From: Zhu Yanjun When the X540 NIC acts as a slave of some virtual NICs, it is very important to synchronize link_up and link_speed, such as a bonding driver in 802.3ad mode. When X540 NIC acts as an independent interface, it is not necessary to synchronize link_up and link_speed. That is, the time span between link_up and link_speed is acceptable. What exactly do you mean by "time span between link_up and link_speed"? In the previous mail, I show you some ethtool logs. In these logs, there is some time with NIC up while speed is unknown. I think this "some time" is time span between link_up and link_speed. Please see the previous mail for details. Was this when reporting the link state from check_link() (reading the LINKS register) or reporting the adapter->link_speed? Where is it you think the de-synchronization occurs? When a NIC interface acts as a slave, a flag "IFF_SLAVE" is set in netdevice struct. Before we enter this function, we check IFF_SLAVE flag. If this flag is set, we continue to check link_speed. If not, this function is executed whether this link_speed is unknown or not. I can already see this in your patch. I was asking about the reason why your change is needed. an extreme example, let us assume this scenario: Is this the scenario you are trying to fix? Sure. If IFF_SLAVE is checked, this scenario will not happen. I already explained why this is not a valid scenario, but if you were able to set it up somehow I'd like to know how you did it Sorry. Why can we not disable auto-negotiate on ixgbe NIC ? To be honest, I am interested in it. This is related with hardware or driver? Thanks a lot. Zhu Yanjun If we are to enter ixgbe_watchdog_link_is_up() with unknown link this would be an issue regardless of whether the interface is a part of a bond or not, but you haven't provided any proof that this is the case. Do you have a dmesg log that shows ixgbe reporting unknown speed? Was your patch tested by the customer that reported this issue? Thanks, Emil -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64
On Wed, Jan 6, 2016 at 5:52 PM, Hannes Frederic Sowa wrote: > Hi Tom, > > On 05.01.2016 19:41, Tom Herbert wrote: >> >> --- /dev/null >> +++ b/arch/x86/lib/csum-partial_64.S >> @@ -0,0 +1,147 @@ >> +/* Copyright 2016 Tom Herbert >> + * >> + * Checksum partial calculation >> + * >> + * __wsum csum_partial(const void *buff, int len, __wsum sum) >> + * >> + * Computes the checksum of a memory block at buff, length len, >> + * and adds in "sum" (32-bit) >> + * >> + * Returns a 32-bit number suitable for feeding into itself >> + * or csum_tcpudp_magic >> + * >> + * Register usage: >> + * %rdi: argument 1, buff >> + * %rsi: argument 2, length >> + * %rdx: argument 3, add in value > > > I think you forgot to carry-add-in the %rdx register. > > The assembly code replaces do_csum but not csum_partial. > First instruction is: movl%edx, %eax /* Initialize with initial sum argument */ Tom > Bye, > Hannes > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: WARN trace - skb_warn_bad_offload - vxlan - large udp packet - udp checksum disabled
On 15.12.2015 02:35, Nelson, Shannon wrote: Using a slightly modified version of udpspam (see diff below - hopefully not mangled by corporate email servers), where I set the SO_NO_CHECK socket option and can specify a large buffer size, I can reliably get the following WARN trace. I have reproduced this on both ixgbe and i40e drivers using "udpspam-no-check 6000". It looks to me like this is in the Tx path before we get to the actual NIC drivers, but I may be wrong. Does UFO offloading on the tunnel interface fix this error? ethtool -K vxlan ufo off If yes we can't feed SO_NO_CHECK udp packets into UFO as gso/ufo requires CHECKSUM_PARTIAL (or add some more logic into the skb or query socket status from skb_gso_segment). Thanks, Hannes -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: WARN trace - skb_warn_bad_offload - vxlan - large udp packet - udp checksum disabled
On Tue, 15 Dec 2015 01:35:27 + "Nelson, Shannon" wrote: > Using a slightly modified version of udpspam (see diff below - hopefully > not mangled by corporate email servers), where I set the SO_NO_CHECK > socket option and can specify a large buffer size, I can reliably get > the following WARN trace. I have reproduced this on both ixgbe and > i40e drivers using "udpspam-no-check 6000". > > It looks to me like this is in the Tx path before we get to the actual > NIC drivers, but I may be wrong. > > > [ 1757.644324] [ cut here ] > [ 1757.644333] WARNING: CPU: 22 PID: 5537 at net/core/dev.c:2423 > skb_warn_bad_offload+0x104/0x111() Was anyone able to take a look at this? There is a pretty clear reproducer in the original mail. I know SO_NO_CHECK is a lightly used option, but it does exist and we found a userspace program that tries to use it. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Intel-wired-lan] [PATCH 2/2] ixgbe: restrict synchronization of link_up and speed
On 01/06/2016 11:30 PM, Tantilov, Emil S wrote: -Original Message- From: zhuyj [mailto:zyjzyj2...@gmail.com] Sent: Tuesday, January 05, 2016 9:42 PM To: Tantilov, Emil S; Kirsher, Jeffrey T; Brandeburg, Jesse; Nelson, Shannon; Wyborny, Carolyn; Skidmore, Donald C; Allan, Bruce W; Ronciak, John; Williams, Mitch A; intel-wired-...@lists.osuosl.org; netdev@vger.kernel.org; e1000-de...@lists.sourceforge.net Cc: Viswanathan, Ven (Wind River); Shteinbock, Boris (Wind River); Bourg, Vincent (Wind River) Subject: Re: [Intel-wired-lan] [PATCH 2/2] ixgbe: restrict synchronization of link_up and speed On 12/31/2015 12:37 AM, Tantilov, Emil S wrote: -Original Message- From: zhuyj [mailto:zyjzyj2...@gmail.com] Sent: Wednesday, December 30, 2015 12:20 AM To: Tantilov, Emil S; Kirsher, Jeffrey T; Brandeburg, Jesse; Nelson, Shannon; Wyborny, Carolyn; Skidmore, Donald C; Allan, Bruce W; Ronciak, John; Williams, Mitch A; intel-wired-...@lists.osuosl.org; netdev@vger.kernel.org; e1000-de...@lists.sourceforge.net Cc: Viswanathan, Ven (Wind River); Shteinbock, Boris (Wind River); Bourg, Vincent (Wind River) Subject: Re: [Intel-wired-lan] [PATCH 2/2] ixgbe: restrict synchronization of link_up and speed On 12/30/2015 02:55 PM, Tantilov, Emil S wrote: -Original Message- From: zhuyj [mailto:zyjzyj2...@gmail.com] Sent: Tuesday, December 29, 2015 6:49 PM To: Tantilov, Emil S; Kirsher, Jeffrey T; Brandeburg, Jesse; Nelson, Shannon; Wyborny, Carolyn; Skidmore, Donald C; Allan, Bruce W; Ronciak, John; Williams, Mitch A; intel-wired-...@lists.osuosl.org; netdev@vger.kernel.org; e1000-de...@lists.sourceforge.net Cc: Viswanathan, Ven (Wind River); Shteinbock, Boris (Wind River); Bourg, Vincent (Wind River) Subject: Re: [Intel-wired-lan] [PATCH 2/2] ixgbe: restrict synchronization of link_up and speed On 12/30/2015 12:18 AM, Tantilov, Emil S wrote: -Original Message- From: Intel-wired-lan [mailto:intel-wired-lan- boun...@lists.osuosl.org] On Behalf Of zyjzyj2...@gmail.com Sent: Monday, December 28, 2015 6:32 PM To: Kirsher, Jeffrey T; Brandeburg, Jesse; Nelson, Shannon; Wyborny, Carolyn; Skidmore, Donald C; Allan, Bruce W; Ronciak, John; Williams, Mitch A; intel-wired-...@lists.osuosl.org; netdev@vger.kernel.org; e1000- de...@lists.sourceforge.net Cc: Viswanathan, Ven (Wind River); Shteinbock, Boris (Wind River); Bourg, Vincent (Wind River) Subject: [Intel-wired-lan] [PATCH 2/2] ixgbe: restrict synchronization of link_up and speed From: Zhu Yanjun When the X540 NIC acts as a slave of some virtual NICs, it is very important to synchronize link_up and link_speed, such as a bonding driver in 802.3ad mode. When X540 NIC acts as an independent interface, it is not necessary to synchronize link_up and link_speed. That is, the time span between link_up and link_speed is acceptable. What exactly do you mean by "time span between link_up and link_speed"? In the previous mail, I show you some ethtool logs. In these logs, there is some time with NIC up while speed is unknown. I think this "some time" is time span between link_up and link_speed. Please see the previous mail for details. Was this when reporting the link state from check_link() (reading the LINKS register) or reporting the adapter->link_speed? Where is it you think the de-synchronization occurs? When a NIC interface acts as a slave, a flag "IFF_SLAVE" is set in netdevice struct. Before we enter this function, we check IFF_SLAVE flag. If this flag is set, we continue to check link_speed. If not, this function is executed whether this link_speed is unknown or not. I can already see this in your patch. I was asking about the reason why your change is needed. an extreme example, let us assume this scenario: Is this the scenario you are trying to fix? Sure. If IFF_SLAVE is checked, this scenario will not happen. I already explained why this is not a valid scenario, but if you were able to set it up somehow I'd like to know how you did it If it is not a valid scenario, maybe there is something wrong with NIC driver/hardware. We should pay attention to it. Zhu Yanjun If we are to enter ixgbe_watchdog_link_is_up() with unknown link this would be an issue regardless of whether the interface is a part of a bond or not, but you haven't provided any proof that this is the case. Do you have a dmesg log that shows ixgbe reporting unknown speed? Was your patch tested by the customer that reported this issue? Thanks, Emil -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next] bpf: add skb_postpush_rcsum and fix dev_forward_skb occasions
On 07.01.2016 02:01, Daniel Borkmann wrote: +static inline void skb_postpush_rcsum(struct sk_buff *skb, + const void *start, unsigned int len) +{ + if (skb->ip_summed == CHECKSUM_COMPLETE) + skb->csum = csum_add(skb->csum, csum_partial(start, len, 0)); skb->csum = csum_partial(start, len, skb->csum); should work without calling carry-add twice, no? Bye, Hannes -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64
Hi Tom, On 05.01.2016 19:41, Tom Herbert wrote: --- /dev/null +++ b/arch/x86/lib/csum-partial_64.S @@ -0,0 +1,147 @@ +/* Copyright 2016 Tom Herbert + * + * Checksum partial calculation + * + * __wsum csum_partial(const void *buff, int len, __wsum sum) + * + * Computes the checksum of a memory block at buff, length len, + * and adds in "sum" (32-bit) + * + * Returns a 32-bit number suitable for feeding into itself + * or csum_tcpudp_magic + * + * Register usage: + * %rdi: argument 1, buff + * %rsi: argument 2, length + * %rdx: argument 3, add in value I think you forgot to carry-add-in the %rdx register. The assembly code replaces do_csum but not csum_partial. Bye, Hannes -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v5 5/6] Add PTP_SYS_OFFSET_PRECISE for driver crosstimestamping
Hi Richard, This all sounds fine. Thanks for the feedback. I'll roll this into the next patchset. Chris On Tue, 05 Jan 2016 07:27:32 -0800, Richard Cochran wrote: On Mon, Jan 04, 2016 at 04:45:22AM -0800, Christopher S. Hall wrote: + case PTP_SYS_OFFSET_PRECISE: + if (!ptp->info->getsynctime) { + err = -EINVAL; -EOPNOTSUPP would be better here. + precise_offset.sys_real.sec = + div_u64_rem(ktime_to_ns(xtstamp.sys_realtime), + NSEC_PER_SEC, &rem); + precise_offset.sys_real.nsec = rem; How about this instead: ts = ktime_to_timespec64(xtstamp.sys_realtime); precise_offset.sys_real.sec = ts.tv_sec; precise_offset.sys_real.nsec = ts.tv_nsec; + precise_offset.sys_raw.sec = + div_u64_rem(ktime_to_ns(xtstamp.sys_monoraw), + NSEC_PER_SEC, &rem); + precise_offset.sys_raw.nsec = rem; + precise_offset.dev.sec = + div_u64_rem(ktime_to_ns(xtstamp.device), NSEC_PER_SEC, + &rem); + precise_offset.dev.nsec = rem; And for these as well. Thanks, Richard -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ovs-dev] [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
On Wed, Jan 6, 2016 at 4:29 PM, David Wragg wrote: > Jesse Gross writes: >> On Wed, Jan 6, 2016 at 3:25 PM, David Wragg wrote: >>> I'm certainly open to suggestions of better ways to solve the problem. >> >> One option is to simply set the MTU on the device from userspace. > > If that worked I wouldn't be submitting a patch. > > The MTU value of 1500 is not merely the default. It is also the maximum > allowed for a vxlan netdev not associated with an underlying netdev. If > you do e.g. "ip link set dev vxlan-6784 mtu 8950", where vxlan-6784 > was created by an ovs vport, it fails with EINVAL. > > The first patch of the two submitted removes that limit. I saw your first patch and I agree that it fixes a problem. I was referring to the second patch. >> The reality is that the code you're modifying is compatibility code. >> Maybe we should make this change to preserve the old behavior or old >> callers (although, again, it should not be just for VXLAN). But no new >> features or tunnel types will be supported in this manner. > > That's fine. Naturally, the ideal from our point of view is if the > compatibility code is fully compatible, so we don't have to make changes > on our side that involve different code paths for different kernel > versions. That's what my patches are intended to achieve. The intention is to be fully backwards compatible with existing software. If you want to take advantage of future functionality then, yes, you will need to change to the new model. I agree that behavior changed with existing compatibility code. I'm fine with your series as long as you generalize it to all tunnel types and not just VXLAN. Just be aware that you're going to have to find another solution long term. > Ok. But please try to be gentle on the poor souls who have to come up > with a single codebase that works on a range of kernel versions going > back a few years. I maintain a large program that needs to do this myself, so I am aware that it can be challenging. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: question about vrf-lite
>> >> is it right? > > > no. The above works fine for me. I literally copied and pasted all of the > commands except the master ones which were adapted to my setup -- eth9 and > eth11 for me instead of eth0 and eth1. tcpdump on N2, N3 show the right one > is receiving packets based on which 'ping -I vrf' is run. > > Do tables 5 and 6 have the right routes? Thanks, David; it is not VRF issue, it is my configuration issue about qemu; I am testing VRF on qemu, and the issue/solution is same as issue under below link https://lists.gnu.org/archive/html/qemu-discuss/2014-06/msg00059.html -Roy -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next] bpf: add skb_postpush_rcsum and fix dev_forward_skb occasions
Add a small helper skb_postpush_rcsum() and fix up redirect locations that need CHECKSUM_COMPLETE fixups on ingress. dev_forward_skb() expects a proper csum that covers also Ethernet header, f.e. since 2c26d34bbcc0 ("net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding"), we also do skb_postpull_rcsum() after pulling Ethernet header off via eth_type_trans(). When using eBPF in a netns setup f.e. with vxlan in collect metadata mode, I can trigger the following csum issue with an IPv6 setup: [ 505.144065] dummy1: hw csum failure [...] [ 505.144108] Call Trace: [ 505.144112][] dump_stack+0x44/0x5c [ 505.144134] [] netdev_rx_csum_fault+0x3a/0x40 [ 505.144142] [] __skb_checksum_complete+0xcf/0xe0 [ 505.144149] [] nf_ip6_checksum+0xb2/0x120 [ 505.144161] [] icmpv6_error+0x17e/0x328 [nf_conntrack_ipv6] [ 505.144170] [] ? ip6t_do_table+0x2fa/0x645 [ip6_tables] [ 505.144177] [] ? ipv6_get_l4proto+0x65/0xd0 [nf_conntrack_ipv6] [ 505.144189] [] nf_conntrack_in+0xc2/0x5a0 [nf_conntrack] [ 505.144196] [] ipv6_conntrack_in+0x1c/0x20 [nf_conntrack_ipv6] [ 505.144204] [] nf_iterate+0x5d/0x70 [ 505.144210] [] nf_hook_slow+0x66/0xc0 [ 505.144218] [] ipv6_rcv+0x3f2/0x4f0 [ 505.144225] [] ? ip6_make_skb+0x1b0/0x1b0 [ 505.144232] [] __netif_receive_skb_core+0x36b/0x9a0 [ 505.144239] [] ? __netif_receive_skb+0x18/0x60 [ 505.144245] [] __netif_receive_skb+0x18/0x60 [ 505.144252] [] process_backlog+0x9f/0x140 [ 505.144259] [] net_rx_action+0x145/0x320 [...] What happens is that on ingress, we push Ethernet header back in, either from cls_bpf or right before skb_do_redirect(), but without updating csum. The "hw csum failure" can be fixed by using the new skb_postpush_rcsum() helper for the dev_forward_skb() case to correct the csum diff again. Fixes: 3896d655f4d4 ("bpf: introduce bpf_clone_redirect() helper") Fixes: 27b29f63058d ("bpf: add bpf_redirect() helper") Signed-off-by: Daniel Borkmann --- ( Given -net pull req is already out and nobody seemed to have triggered this so far, net-next seems just fine imho. ) include/linux/skbuff.h | 7 +++ net/core/filter.c | 17 + 2 files changed, 20 insertions(+), 4 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 6b6bd42..f8fad71 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2805,6 +2805,13 @@ static inline void skb_postpull_rcsum(struct sk_buff *skb, unsigned char *skb_pull_rcsum(struct sk_buff *skb, unsigned int len); +static inline void skb_postpush_rcsum(struct sk_buff *skb, + const void *start, unsigned int len) +{ + if (skb->ip_summed == CHECKSUM_COMPLETE) + skb->csum = csum_add(skb->csum, csum_partial(start, len, 0)); +} + /** * pskb_trim_rcsum - trim received skb and update checksum * @skb: buffer to trim diff --git a/net/core/filter.c b/net/core/filter.c index 35e6fed..c6160ce 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -1368,8 +1368,9 @@ static u64 bpf_skb_store_bytes(u64 r1, u64 r2, u64 r3, u64 r4, u64 flags) /* skb_store_bits cannot return -EFAULT here */ skb_store_bits(skb, offset, ptr, len); - if (BPF_RECOMPUTE_CSUM(flags) && skb->ip_summed == CHECKSUM_COMPLETE) - skb->csum = csum_add(skb->csum, csum_partial(ptr, len, 0)); + if (BPF_RECOMPUTE_CSUM(flags)) + skb_postpush_rcsum(skb, ptr, len); + return 0; } @@ -1525,8 +1526,12 @@ static u64 bpf_clone_redirect(u64 r1, u64 ifindex, u64 flags, u64 r4, u64 r5) if (unlikely(!skb2)) return -ENOMEM; - if (BPF_IS_REDIRECT_INGRESS(flags)) + if (BPF_IS_REDIRECT_INGRESS(flags)) { + if (G_TC_AT(skb2->tc_verd) & AT_INGRESS) + skb_postpush_rcsum(skb2, skb_mac_header(skb2), + skb2->mac_len); return dev_forward_skb(dev, skb2); + } skb2->dev = dev; skb_sender_cpu_clear(skb2); @@ -1569,8 +1574,12 @@ int skb_do_redirect(struct sk_buff *skb) return -EINVAL; } - if (BPF_IS_REDIRECT_INGRESS(ri->flags)) + if (BPF_IS_REDIRECT_INGRESS(ri->flags)) { + if (G_TC_AT(skb->tc_verd) & AT_INGRESS) + skb_postpush_rcsum(skb, skb_mac_header(skb), + skb->mac_len); return dev_forward_skb(dev, skb); + } skb->dev = dev; skb_sender_cpu_clear(skb); -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ovs-dev] [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
On Wed, Jan 6, 2016 at 4:14 PM, Hannes Frederic Sowa wrote: > Hi, > > > On 07.01.2016 00:57, Jesse Gross wrote: >> >> On Wed, Jan 6, 2016 at 3:25 PM, David Wragg wrote: >>> >>> David Miller writes: > > Prior to 4.3, openvswitch vxlan vports could transmit vxlan packets of > any size, constrained only by the ability to transmit the resulting > UDP packets. 4.3 introduced vxlan netdevs corresponding to vxlan > vports. These netdevs have an MTU, which limits the size of a packet > that can be successfully vxlan-encapsulated. The default value for > this MTU is 1500, which is awkwardly small, and leads to a conspicuous > change in behaviour for userspace. > > These two patches set the MTU on openvswitch-crated vxlan devices to > be 65465 (the maximum IP packet size minus the vxlan-on-IPv6 > overhead), effectively restoring the behaviour prior to 4.3. In order > to accomplish this, the first patch removes the MTU constraint of 1500 > for vxlan netdevs without an underlying device. Is this really the right thing to do? >>> >>> >>> I'm certainly open to suggestions of better ways to solve the problem. >> >> >> One option is to simply set the MTU on the device from userspace. >> >> The reality is that the code you're modifying is compatibility code. >> Maybe we should make this change to preserve the old behavior for old >> callers (although, again, it should not be just for VXLAN). But no new >> features or tunnel types will be supported in this manner. >> >> New or updated userspace programs should work by simply creating and >> adding tunnel devices to OVS. That won't go through this path at all >> so you're going to need to find another approach in the near future in >> any case. > > > I don't see any other way as to make MTUs part of the flow if we want to > have correct ip_local_error notifications. And those must also work across > VMs, so openvswitch in quasi brouting mode would need to emit ICMP PtBs > (hopefully with a correct source address, otherwise uRPF kills them before > reaching the applications) or do error signaling via virtio_net. I actually implemented this a long time ago and then there was some additional discussion on this about a year ago. I agree it's the right solution overall but it's not entirely clearly to me how to get the details correct. > Either the openvswitch user space can feed those information to the datapath > or the ovs dataplane can do a lookup on the outer ip address while filling > out the metadata_dst and caching it in the flow or we just keep the dst in > the flow anyway. So a net_device used by ovs has no real mtu anymore. I agree that the concept of MTU is much more complicated than a single number on a device, we just have to find the right way to model it. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next v3 8/8] geneve: break dependency to network drivers
Hi Jesse, On 07.01.2016 01:18, Jesse Gross wrote: On Wed, Jan 6, 2016 at 3:39 PM, Hannes Frederic Sowa wrote: Signed-off-by: Hannes Frederic Sowa --- drivers/net/geneve.c | 30 +++--- include/net/geneve.h | 7 +++ 2 files changed, 30 insertions(+), 7 deletions(-) Thanks a lot for going through all the drivers! You found more things than I thought. I noticed just a couple new things with this version but overall I think it is a good direction. diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c index 24b077a32c1c9c..5b8d728b1204be 100644 --- a/drivers/net/geneve.c +++ b/drivers/net/geneve.c @@ -1110,7 +1110,7 @@ static struct device_type geneve_type = { * supply the listening GENEVE udp ports. Callers are expected * to implement the ndo_add_geneve_port. */ -void geneve_get_rx_port(struct net_device *dev) +static void geneve_notify_refresh_netdev(struct net_device *dev) { struct net *net = dev_net(dev); struct geneve_net *gn = net_generic(net, geneve_net_id); Both this function and the corresponding one in VXLAN assume that the add port NDO is implemented and blindly dereference the pointer. Now that all protocols get refreshed at the same time, we should check first. There's also a comment above the function that says this, which we can remove now. Absolutely, will fix this in the next version. With the separate offloading types this wouldn't be possible. Thanks for seeing this! I actually had this in a follow-up patch because I want to call this on NETDEV_REGISTER, too, so we don't need to add the geneve/vxlan functions to the drivers in ndo_start anymore. I also think we can update the comments in netdevice.h for the VXLAN/Geneve add/remove NDOs to say that they are protected by RTNL. Will do. Thanks! diff --git a/include/net/geneve.h b/include/net/geneve.h index e6c23dc765f7ec..7d52077b72faa3 100644 --- a/include/net/geneve.h +++ b/include/net/geneve.h -#if IS_ENABLED(CONFIG_GENEVE) -void geneve_get_rx_port(struct net_device *netdev); -#else static inline void geneve_get_rx_port(struct net_device *netdev) { + call_netdevice_notifiers(NETDEV_REFRESH_OFFLOADS, netdev); } -#endif I think we should merge vxlan_get_rx_port() and geneve_get_rx_port() into a single generic function. For drivers that support both, they now have two calls to get notified for all offloaded ports. This actually can cause problems related to duplicates, similar to what you feared before. Agreed. I add comments to the offloading functions in netdevice.h so they need to be implemented in a idempotent way and review the drivers how they deal with it. Thanks for the review, Hannes -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ovs-dev] [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
Jesse Gross writes: > On Wed, Jan 6, 2016 at 3:25 PM, David Wragg wrote: >> I'm certainly open to suggestions of better ways to solve the problem. > > One option is to simply set the MTU on the device from userspace. If that worked I wouldn't be submitting a patch. The MTU value of 1500 is not merely the default. It is also the maximum allowed for a vxlan netdev not associated with an underlying netdev. If you do e.g. "ip link set dev vxlan-6784 mtu 8950", where vxlan-6784 was created by an ovs vport, it fails with EINVAL. The first patch of the two submitted removes that limit. > The reality is that the code you're modifying is compatibility code. > Maybe we should make this change to preserve the old behavior or old > callers (although, again, it should not be just for VXLAN). But no new > features or tunnel types will be supported in this manner. That's fine. Naturally, the ideal from our point of view is if the compatibility code is fully compatible, so we don't have to make changes on our side that involve different code paths for different kernel versions. That's what my patches are intended to achieve. But we can live with such changes on our side, as long as there is some reasonable way to do so. In the case of this vxlan MTU issue, there doesn't seem to be one. > New or updated userspace programs should work by simply creating and > adding tunnel devices to OVS. That won't go through this path at all > so you're going to need to find another approach in the near future in > any case. Ok. But please try to be gentle on the poor souls who have to come up with a single codebase that works on a range of kernel versions going back a few years. David -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next v3 8/8] geneve: break dependency to network drivers
On Wed, Jan 6, 2016 at 3:39 PM, Hannes Frederic Sowa wrote: > Signed-off-by: Hannes Frederic Sowa > --- > drivers/net/geneve.c | 30 +++--- > include/net/geneve.h | 7 +++ > 2 files changed, 30 insertions(+), 7 deletions(-) Thanks a lot for going through all the drivers! You found more things than I thought. I noticed just a couple new things with this version but overall I think it is a good direction. > diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c > index 24b077a32c1c9c..5b8d728b1204be 100644 > --- a/drivers/net/geneve.c > +++ b/drivers/net/geneve.c > @@ -1110,7 +1110,7 @@ static struct device_type geneve_type = { > * supply the listening GENEVE udp ports. Callers are expected > * to implement the ndo_add_geneve_port. > */ > -void geneve_get_rx_port(struct net_device *dev) > +static void geneve_notify_refresh_netdev(struct net_device *dev) > { > struct net *net = dev_net(dev); > struct geneve_net *gn = net_generic(net, geneve_net_id); Both this function and the corresponding one in VXLAN assume that the add port NDO is implemented and blindly dereference the pointer. Now that all protocols get refreshed at the same time, we should check first. There's also a comment above the function that says this, which we can remove now. I also think we can update the comments in netdevice.h for the VXLAN/Geneve add/remove NDOs to say that they are protected by RTNL. > diff --git a/include/net/geneve.h b/include/net/geneve.h > index e6c23dc765f7ec..7d52077b72faa3 100644 > --- a/include/net/geneve.h > +++ b/include/net/geneve.h > -#if IS_ENABLED(CONFIG_GENEVE) > -void geneve_get_rx_port(struct net_device *netdev); > -#else > static inline void geneve_get_rx_port(struct net_device *netdev) > { > + call_netdevice_notifiers(NETDEV_REFRESH_OFFLOADS, netdev); > } > -#endif I think we should merge vxlan_get_rx_port() and geneve_get_rx_port() into a single generic function. For drivers that support both, they now have two calls to get notified for all offloaded ports. This actually can cause problems related to duplicates, similar to what you feared before. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ovs-dev] [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
Hi, On 07.01.2016 00:57, Jesse Gross wrote: On Wed, Jan 6, 2016 at 3:25 PM, David Wragg wrote: David Miller writes: Prior to 4.3, openvswitch vxlan vports could transmit vxlan packets of any size, constrained only by the ability to transmit the resulting UDP packets. 4.3 introduced vxlan netdevs corresponding to vxlan vports. These netdevs have an MTU, which limits the size of a packet that can be successfully vxlan-encapsulated. The default value for this MTU is 1500, which is awkwardly small, and leads to a conspicuous change in behaviour for userspace. These two patches set the MTU on openvswitch-crated vxlan devices to be 65465 (the maximum IP packet size minus the vxlan-on-IPv6 overhead), effectively restoring the behaviour prior to 4.3. In order to accomplish this, the first patch removes the MTU constraint of 1500 for vxlan netdevs without an underlying device. Is this really the right thing to do? I'm certainly open to suggestions of better ways to solve the problem. One option is to simply set the MTU on the device from userspace. The reality is that the code you're modifying is compatibility code. Maybe we should make this change to preserve the old behavior for old callers (although, again, it should not be just for VXLAN). But no new features or tunnel types will be supported in this manner. New or updated userspace programs should work by simply creating and adding tunnel devices to OVS. That won't go through this path at all so you're going to need to find another approach in the near future in any case. I don't see any other way as to make MTUs part of the flow if we want to have correct ip_local_error notifications. And those must also work across VMs, so openvswitch in quasi brouting mode would need to emit ICMP PtBs (hopefully with a correct source address, otherwise uRPF kills them before reaching the applications) or do error signaling via virtio_net. Either the openvswitch user space can feed those information to the datapath or the ovs dataplane can do a lookup on the outer ip address while filling out the metadata_dst and caching it in the flow or we just keep the dst in the flow anyway. So a net_device used by ovs has no real mtu anymore. Bye, Hannes -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] net: Add eth_platform_get_mac_address() helper.
From: Joe Perches Date: Wed, 06 Jan 2016 15:32:24 -0800 > On Thu, 2016-01-07 at 00:26 +0100, Bjørn Mork wrote: >> Joe Perches writes: >> > On Wed, 2016-01-06 at 16:33 -0500, David Miller wrote: >> > > A repeating pattern in drivers has become to use OF node information >> > > and, if not found, platform specific host information to extract the >> > > ethernet address for a given device. >> > [] >> > > diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h >> > [] >> > > @@ -485,3 +487,32 @@ static int __init eth_offload_init(void) >> > > } >> > > >> > > fs_initcall(eth_offload_init); >> > > + >> > > +unsigned char * __weak arch_get_platform_mac_address(void) >> > > +{ >> > > + return NULL; >> > >> > WARN_ON_ONCE ? >> >> That would prevent a driver from using this with additional fallback >> methods. For what reason? > > It's declared __weak and should be overridden by > an arch specific function. > > A NULL address would cause a fault when using > a function like copy_ether_addr. The caller checks for NULL, and this is the default implementation doing precisely what it is meant to do. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] net: Add eth_platform_get_mac_address() helper.
On Wed, 2016-01-06 at 19:01 -0500, David Miller wrote: > The caller checks for NULL, and this is the default implementation > doing precisely what it is meant to do. Yeah, I noticed that later. Nevermind... cheers, Joe -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ovs-dev] [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
On Wed, Jan 6, 2016 at 3:25 PM, David Wragg wrote: > David Miller writes: >>> Prior to 4.3, openvswitch vxlan vports could transmit vxlan packets of >>> any size, constrained only by the ability to transmit the resulting >>> UDP packets. 4.3 introduced vxlan netdevs corresponding to vxlan >>> vports. These netdevs have an MTU, which limits the size of a packet >>> that can be successfully vxlan-encapsulated. The default value for >>> this MTU is 1500, which is awkwardly small, and leads to a conspicuous >>> change in behaviour for userspace. >>> >>> These two patches set the MTU on openvswitch-crated vxlan devices to >>> be 65465 (the maximum IP packet size minus the vxlan-on-IPv6 >>> overhead), effectively restoring the behaviour prior to 4.3. In order >>> to accomplish this, the first patch removes the MTU constraint of 1500 >>> for vxlan netdevs without an underlying device. >> >> Is this really the right thing to do? > > I'm certainly open to suggestions of better ways to solve the problem. One option is to simply set the MTU on the device from userspace. The reality is that the code you're modifying is compatibility code. Maybe we should make this change to preserve the old behavior for old callers (although, again, it should not be just for VXLAN). But no new features or tunnel types will be supported in this manner. New or updated userspace programs should work by simply creating and adding tunnel devices to OVS. That won't go through this path at all so you're going to need to find another approach in the near future in any case. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation
Florian Westphal wrote: > Thadeu Lima de Souza Cascardo wrote: > > On Wed, Jan 06, 2016 at 11:11:41PM +0300, Konstantin Khlebnikov wrote: [ skb_gso_segment uses skb->cb[], causes oops if ip_fragment is invoked on segmented skbs ] > > I have hit this as well, this fixes it for me on an older kernel. Can you > > try it > > on latest kernel? > > > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c > > index d8a1745..f44bc91 100644 > > --- a/net/ipv4/ip_output.c > > +++ b/net/ipv4/ip_output.c > > @@ -216,6 +216,7 @@ static int ip_finish_output_gso(struct sk_buff *skb) > > netdev_features_t features; > > struct sk_buff *segs; > > int ret = 0; > > + struct inet_skb_parm ipcb; > > > > if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb)) > > return ip_finish_output2(skb); > > @@ -227,6 +228,10 @@ static int ip_finish_output_gso(struct sk_buff *skb) > > * 2) skb arrived via virtio-net, we thus get TSO/GSO skbs directly > > * from host network stack. > > */ > > + /* We need to save IPCB here because skb_gso_segment will use > > +* SKB_GSO_CB. > > +*/ > > + ipcb = *IPCB(skb); > > features = netif_skb_features(skb); > > segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK); > > if (IS_ERR_OR_NULL(segs)) { > > @@ -241,6 +246,7 @@ static int ip_finish_output_gso(struct sk_buff *skb) > > int err; > > > > segs->next = NULL; > > + *IPCB(segs) = ipcb; > > err = ip_fragment(segs, ip_finish_output2); > > > > if (err && ret == 0) > > I'm worried that this doesn't solve all cases. f.e. xfrm may also > call skb_gso_segment(), and it will call into ipv4/ipv6 netfilter > postrouting + ipv4 output functions... > > nfqnl_enqueue_packet() is also affected. ... but it seems that those three are the only affected callers of skb_gso_segment (tbf is ok since skb isn't owned by anyone, ovs does save/restore already). I think this patch is the right way, we just need similar save/restore in nfqnl_enqueue_packet and xfrm_output_gso(). The latter two can be used by either ipv4 or ipv6 so it might be preferable to just save/restore sizeof(struct skb_gso_cb); or a union of inet_skb_parm+inet6_skb_parm. Cascardo, can you cook a patch? Thanks! -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] net: Add eth_platform_get_mac_address() helper.
Joe Perches writes: > On Thu, 2016-01-07 at 00:26 +0100, Bjørn Mork wrote: >> Joe Perches writes: >> > On Wed, 2016-01-06 at 16:33 -0500, David Miller wrote: >> > > A repeating pattern in drivers has become to use OF node information >> > > and, if not found, platform specific host information to extract the >> > > ethernet address for a given device. >> > [] >> > > diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h >> > [] >> > > @@ -485,3 +487,32 @@ static int __init eth_offload_init(void) >> > > } >> > > >> > > fs_initcall(eth_offload_init); >> > > + >> > > +unsigned char * __weak arch_get_platform_mac_address(void) >> > > +{ >> > > + return NULL; >> > >> > WARN_ON_ONCE ? >> >> That would prevent a driver from using this with additional fallback >> methods. For what reason? > > It's declared __weak and should be overridden by > an arch specific function. "should"? Why? Do you have a suggested implementation for - say - x86? > A NULL address would cause a fault when using > a function like copy_ether_addr. You should not do that then :) Bjørn -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next v3 4/8] benet: add rtnl lock protection around be_open in be_resume
Cc: Sathya Perla Cc: Ajit Khaparde Cc: Padmanabh Ratnakar Cc: Sriharsha Basavapatna Signed-off-by: Hannes Frederic Sowa --- drivers/net/ethernet/emulex/benet/be_main.c | 9 + 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c index f99de3657ce3b5..3200f48ddd5a68 100644 --- a/drivers/net/ethernet/emulex/benet/be_main.c +++ b/drivers/net/ethernet/emulex/benet/be_main.c @@ -4846,11 +4846,12 @@ static int be_resume(struct be_adapter *adapter) if (status) return status; - if (netif_running(netdev)) { + rtnl_lock(); + if (netif_running(netdev)) status = be_open(netdev); - if (status) - return status; - } + rtnl_unlock(); + if (status) + return status; netif_device_attach(netdev); -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next v3 1/8] qlcnic: protect qlcnic_82xx_io_slot_reset with rtnl lock
Cc: dept-gelinuxnic...@qlogic.com Signed-off-by: Hannes Frederic Sowa --- drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c index 1205f6f9c94173..1c29105b6c364f 100644 --- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c +++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c @@ -3952,8 +3952,14 @@ static pci_ers_result_t qlcnic_82xx_io_error_detected(struct pci_dev *pdev, static pci_ers_result_t qlcnic_82xx_io_slot_reset(struct pci_dev *pdev) { - return qlcnic_attach_func(pdev) ? PCI_ERS_RESULT_DISCONNECT : - PCI_ERS_RESULT_RECOVERED; + pci_ers_result_t res; + + rtnl_lock(); + res = qlcnic_attach_func(pdev) ? PCI_ERS_RESULT_DISCONNECT : +PCI_ERS_RESULT_RECOVERED; + rtnl_unlock(); + + return res; } static void qlcnic_82xx_io_resume(struct pci_dev *pdev) -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next v3 5/8] fm10k: add rtnl lock protection in fm10k_io_resume
Cc: Jeff Kirsher Signed-off-by: Hannes Frederic Sowa --- drivers/net/ethernet/intel/fm10k/fm10k_pci.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c index 4eb7a6fa6b0ddc..846d9b731c8c14 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c @@ -2350,8 +2350,10 @@ static void fm10k_io_resume(struct pci_dev *pdev) /* reset clock */ fm10k_ts_reset(interface); + rtnl_lock(); if (netif_running(netdev)) err = fm10k_open(netdev); + rtnl_lock(); /* final check of hardware state before registering the interface */ err = err ? : fm10k_hw_ready(interface); -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next v3 8/8] geneve: break dependency to network drivers
Signed-off-by: Hannes Frederic Sowa --- drivers/net/geneve.c | 30 +++--- include/net/geneve.h | 7 +++ 2 files changed, 30 insertions(+), 7 deletions(-) diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c index 24b077a32c1c9c..5b8d728b1204be 100644 --- a/drivers/net/geneve.c +++ b/drivers/net/geneve.c @@ -1110,7 +1110,7 @@ static struct device_type geneve_type = { * supply the listening GENEVE udp ports. Callers are expected * to implement the ndo_add_geneve_port. */ -void geneve_get_rx_port(struct net_device *dev) +static void geneve_notify_refresh_netdev(struct net_device *dev) { struct net *net = dev_net(dev); struct geneve_net *gn = net_generic(net, geneve_net_id); @@ -1128,7 +1128,6 @@ void geneve_get_rx_port(struct net_device *dev) } rcu_read_unlock(); } -EXPORT_SYMBOL_GPL(geneve_get_rx_port); /* Initialize the device structure. */ static void geneve_setup(struct net_device *dev) @@ -1450,6 +1449,24 @@ struct net_device *geneve_dev_create_fb(struct net *net, const char *name, } EXPORT_SYMBOL_GPL(geneve_dev_create_fb); +static int geneve_notifier(struct notifier_block *unused, + unsigned long event, void *ptr) +{ + struct net_device *dev = netdev_notifier_info_to_dev(ptr); + + switch (event) { + case NETDEV_REFRESH_OFFLOADS: + geneve_notify_refresh_netdev(dev); + break; + } + + return NOTIFY_DONE; +} + +static struct notifier_block geneve_notifier_block __read_mostly = { + .notifier_call = geneve_notifier, +}; + static __net_init int geneve_init_net(struct net *net) { struct geneve_net *gn = net_generic(net, geneve_net_id); @@ -1502,11 +1519,17 @@ static int __init geneve_init_module(void) if (rc) goto out1; - rc = rtnl_link_register(&geneve_link_ops); + rc = register_netdevice_notifier(&geneve_notifier_block); if (rc) goto out2; + rc = rtnl_link_register(&geneve_link_ops); + if (rc) + goto out3; + return 0; +out3: + unregister_netdevice_notifier(&geneve_notifier_block); out2: unregister_pernet_subsys(&geneve_net_ops); out1: @@ -1517,6 +1540,7 @@ late_initcall(geneve_init_module); static void __exit geneve_cleanup_module(void) { rtnl_link_unregister(&geneve_link_ops); + unregister_netdevice_notifier(&geneve_notifier_block); unregister_pernet_subsys(&geneve_net_ops); } module_exit(geneve_cleanup_module); diff --git a/include/net/geneve.h b/include/net/geneve.h index e6c23dc765f7ec..7d52077b72faa3 100644 --- a/include/net/geneve.h +++ b/include/net/geneve.h @@ -1,6 +1,8 @@ #ifndef __NET_GENEVE_H #define __NET_GENEVE_H 1 +#include + #ifdef CONFIG_INET #include #endif @@ -62,13 +64,10 @@ struct genevehdr { struct geneve_opt options[]; }; -#if IS_ENABLED(CONFIG_GENEVE) -void geneve_get_rx_port(struct net_device *netdev); -#else static inline void geneve_get_rx_port(struct net_device *netdev) { + call_netdevice_notifiers(NETDEV_REFRESH_OFFLOADS, netdev); } -#endif #ifdef CONFIG_INET struct net_device *geneve_dev_create_fb(struct net *net, const char *name, -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next v3 3/8] ixgbe: add rtnl locking in service task around vxlan_get_rx_port
Protect IXGBE_FLAG2_VXLAN_REREG_NEEDED flag with rtnl_lock as it seems it could be concurrently modified by ixgbe_set_features and ixgbe_del_vxlan_port. Cc: Jeff Kirsher Signed-off-by: Hannes Frederic Sowa --- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index ea9537d0e63a9d..8bef07196c804d 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -7115,10 +7115,12 @@ static void ixgbe_service_task(struct work_struct *work) return; } #ifdef CONFIG_IXGBE_VXLAN + rtnl_lock(); if (adapter->flags2 & IXGBE_FLAG2_VXLAN_REREG_NEEDED) { adapter->flags2 &= ~IXGBE_FLAG2_VXLAN_REREG_NEEDED; vxlan_get_rx_port(adapter->netdev); } + rtnl_unlock(); #endif /* CONFIG_IXGBE_VXLAN */ ixgbe_reset_subtask(adapter); ixgbe_phy_interrupt_subtask(adapter); -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next v3 7/8] vxlan: break dependency to network drivers
Signed-off-by: Hannes Frederic Sowa --- drivers/net/vxlan.c | 17 +++-- include/net/vxlan.h | 5 + 2 files changed, 12 insertions(+), 10 deletions(-) diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c index fecf7b6c732e96..10f1304d58e1ea 100644 --- a/drivers/net/vxlan.c +++ b/drivers/net/vxlan.c @@ -2468,7 +2468,7 @@ static struct device_type vxlan_type = { * supply the listening VXLAN udp ports. Callers are expected * to implement the ndo_add_vxlan_port. */ -void vxlan_get_rx_port(struct net_device *dev) +static void vxlan_notify_refresh_netdev(struct net_device *dev) { struct vxlan_sock *vs; struct net *net = dev_net(dev); @@ -2488,7 +2488,6 @@ void vxlan_get_rx_port(struct net_device *dev) } spin_unlock(&vn->sock_lock); } -EXPORT_SYMBOL_GPL(vxlan_get_rx_port); /* Initialize the device structure. */ static void vxlan_setup(struct net_device *dev) @@ -3164,20 +3163,26 @@ static void vxlan_handle_lowerdev_unregister(struct vxlan_net *vn, unregister_netdevice_many(&list_kill); } -static int vxlan_lowerdev_event(struct notifier_block *unused, - unsigned long event, void *ptr) +static int vxlan_notifier(struct notifier_block *unused, + unsigned long event, void *ptr) { struct net_device *dev = netdev_notifier_info_to_dev(ptr); struct vxlan_net *vn = net_generic(dev_net(dev), vxlan_net_id); - if (event == NETDEV_UNREGISTER) + switch (event) { + case NETDEV_REFRESH_OFFLOADS: + vxlan_notify_refresh_netdev(dev); + break; + case NETDEV_UNREGISTER: vxlan_handle_lowerdev_unregister(vn, dev); + break; + } return NOTIFY_DONE; } static struct notifier_block vxlan_notifier_block __read_mostly = { - .notifier_call = vxlan_lowerdev_event, + .notifier_call = vxlan_notifier, }; static __net_init int vxlan_init_net(struct net *net) diff --git a/include/net/vxlan.h b/include/net/vxlan.h index 0fb86442544b26..48d0450160c91f 100644 --- a/include/net/vxlan.h +++ b/include/net/vxlan.h @@ -242,13 +242,10 @@ static inline netdev_features_t vxlan_features_check(struct sk_buff *skb, /* IPv6 header + UDP + VXLAN + Ethernet header */ #define VXLAN6_HEADROOM (40 + 8 + 8 + 14) -#if IS_ENABLED(CONFIG_VXLAN) -void vxlan_get_rx_port(struct net_device *netdev); -#else static inline void vxlan_get_rx_port(struct net_device *netdev) { + call_netdevice_notifiers(NETDEV_REFRESH_OFFLOADS, netdev); } -#endif static inline unsigned short vxlan_get_sk_family(struct vxlan_sock *vs) { -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next v3 6/8] netdev: add netdevice notifier type to trigger a reprogramming of offloads
Signed-off-by: Hannes Frederic Sowa --- include/linux/netdevice.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 8d8e5ca951b493..9750e46760695d 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2183,6 +2183,7 @@ struct netdev_lag_lower_state_info { #define NETDEV_BONDING_INFO0x0019 #define NETDEV_PRECHANGEUPPER 0x001A #define NETDEV_CHANGELOWERSTATE0x001B +#define NETDEV_REFRESH_OFFLOADS0x001C int register_netdevice_notifier(struct notifier_block *nb); int unregister_netdevice_notifier(struct notifier_block *nb); -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next v3 2/8] mlx4: add rtnl lock protection in mlx4_en_restart
I don't really understand the use of the state_lock. Can't it be simply replaced by rtnl_lock? It seems a lot of current users depend on rtnl_lock anyway. Anyway, fix this up for the moment. Cc: Eugenia Emantayev Signed-off-by: Hannes Frederic Sowa --- drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c index 0c7e3f69a73bb6..94abd0843901cf 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c @@ -1846,6 +1846,7 @@ static void mlx4_en_restart(struct work_struct *work) en_dbg(DRV, priv, "Watchdog task called for port %d\n", priv->port); + rtnl_lock(); mutex_lock(&mdev->state_lock); if (priv->port_up) { mlx4_en_stop_port(dev, 1); @@ -1853,6 +1854,7 @@ static void mlx4_en_restart(struct work_struct *work) en_err(priv, "Failed restarting port %d\n", priv->port); } mutex_unlock(&mdev->state_lock); + rtnl_unlock(); } static void mlx4_en_clear_stats(struct net_device *dev) -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next v3 0/8] net: break dependency of drivers on geneve and vxlan
Device drivers which support geneve or vxlan offloading have a dependency on the correlating tunnel kernel modules. Thus those drivers automatically load the geneve or vxlan modules. Break this dependency with this small series. Additionally this series features a review of the respective ->ndo_open and other functions around vxlan_get_rx_port and geneve_get_rx_port. * Result: $ cd drivers/net/ethernet/ $ find . -name '*.ko' | xargs modinfo | egrep '^depends:.*(vxlan|geneve)' | wc -l 0 I also incorporated feedback from Jesse Gross to only use one new netdevice notifiers type, namely NETDEV_REFRESH_OFFLOADS. Otherwise this series is very much the same as v1. Hannes Frederic Sowa (8): qlcnic: protect qlcnic_82xx_io_slot_reset with rtnl lock mlx4: add rtnl lock protection in mlx4_en_restart ixgbe: add rtnl locking in service task around vxlan_get_rx_port benet: add rtnl lock protection around be_open in be_resume fm10k: add rtnl lock protection in fm10k_io_resume netdev: add netdevice notifier type to trigger a reprogramming of offloads vxlan: break dependency to network drivers geneve: break dependency to network drivers drivers/net/ethernet/emulex/benet/be_main.c | 9 +++ drivers/net/ethernet/intel/fm10k/fm10k_pci.c | 2 ++ drivers/net/ethernet/intel/ixgbe/ixgbe_main.c| 2 ++ drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 2 ++ drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c | 10 ++-- drivers/net/geneve.c | 30 +--- drivers/net/vxlan.c | 17 +- include/linux/netdevice.h| 1 + include/net/geneve.h | 7 +++--- include/net/vxlan.h | 5 +--- 10 files changed, 62 insertions(+), 23 deletions(-) -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] net: Add eth_platform_get_mac_address() helper.
On Thu, 2016-01-07 at 00:26 +0100, Bjørn Mork wrote: > Joe Perches writes: > > On Wed, 2016-01-06 at 16:33 -0500, David Miller wrote: > > > A repeating pattern in drivers has become to use OF node information > > > and, if not found, platform specific host information to extract the > > > ethernet address for a given device. > > [] > > > diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h > > [] > > > @@ -485,3 +487,32 @@ static int __init eth_offload_init(void) > > > } > > > > > > fs_initcall(eth_offload_init); > > > + > > > +unsigned char * __weak arch_get_platform_mac_address(void) > > > +{ > > > + return NULL; > > > > WARN_ON_ONCE ? > > That would prevent a driver from using this with additional fallback > methods. For what reason? It's declared __weak and should be overridden by an arch specific function. A NULL address would cause a fault when using a function like copy_ether_addr. > I don't have a specific usecase, but I can imagine drivers falling back > to e.g a random address without wanting to be noisy about it. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] net: Add eth_platform_get_mac_address() helper.
Joe Perches writes: > On Wed, 2016-01-06 at 16:33 -0500, David Miller wrote: >> A repeating pattern in drivers has become to use OF node information >> and, if not found, platform specific host information to extract the >> ethernet address for a given device. > [] >> diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h > [] >> @@ -485,3 +487,32 @@ static int __init eth_offload_init(void) >> } >> >> fs_initcall(eth_offload_init); >> + >> +unsigned char * __weak arch_get_platform_mac_address(void) >> +{ >> + return NULL; > > WARN_ON_ONCE ? That would prevent a driver from using this with additional fallback methods. For what reason? I don't have a specific usecase, but I can imagine drivers falling back to e.g a random address without wanting to be noisy about it. Bjørn -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
David Miller writes: >> Prior to 4.3, openvswitch vxlan vports could transmit vxlan packets of >> any size, constrained only by the ability to transmit the resulting >> UDP packets. 4.3 introduced vxlan netdevs corresponding to vxlan >> vports. These netdevs have an MTU, which limits the size of a packet >> that can be successfully vxlan-encapsulated. The default value for >> this MTU is 1500, which is awkwardly small, and leads to a conspicuous >> change in behaviour for userspace. >> >> These two patches set the MTU on openvswitch-crated vxlan devices to >> be 65465 (the maximum IP packet size minus the vxlan-on-IPv6 >> overhead), effectively restoring the behaviour prior to 4.3. In order >> to accomplish this, the first patch removes the MTU constraint of 1500 >> for vxlan netdevs without an underlying device. > > Is this really the right thing to do? I'm certainly open to suggestions of better ways to solve the problem. To be clear, the problem from our perspective is that a use of the kernel openvswitch that worked fine in 4.2 and earlier is hobbled in 4.3. Previously the MTU of an openvswitch-based vxlan overlay network was constrained only by the MTU of the physical network. In 4.3, we can't take advantage of physical networks that support jumbo frames, causing a huge hit to throughput across the overlay network. The specific limit of 1500 seems very arbitrary. For a vxlan overlay network on top of a traditional ethernet network, the "correct" MTU for the vxlan netdevs is 1450 rather than 1500. And in general with openvswitch, the destination for vxlan packets is determined on a packet-by-packet basis, possibly involving different path MTUs of the underlying network. There is no single "correct" value. > Won't we get a lot of fragmentation > by using such a large MTU, especially since you're making it the default > for OVS setups? In the context of the openvswitch vxlan vport transmit path, I can't find a place where the dev->mtu is used (and it would be surprising, on the basis that the relevant parts of vxlan.c have not changed that much since 4.2, when no netdev was involved in that path). Considering non-openvswitch scenarios, when using vxlan netdevs directly, a vxlan netdev locked to an underlying device supporting jumbo frames can use a larger MTU. It's only vxlan netdevs without an underlying device that have the limit of 1500 imposed. But why shouldn't there be the same flexibility to select an MTU for best performance in both cases? Aren't the fragmentation concerns the same? > Things like path MTU discovery hinge strongly upon accurate MTU settings. > Otherwise they won't function properly. True. But in what sense is 1500 accurate? Uses/users of the kernel openvswitch code have always had to get this right, making sure that the MTU set on a vxlan overlay network conforms to the underlying network paths involved. David -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] net: Add eth_platform_get_mac_address() helper.
On Wed, 2016-01-06 at 16:33 -0500, David Miller wrote: > A repeating pattern in drivers has become to use OF node information > and, if not found, platform specific host information to extract the > ethernet address for a given device. [] > diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h [] > @@ -485,3 +487,32 @@ static int __init eth_offload_init(void) > } > > fs_initcall(eth_offload_init); > + > +unsigned char * __weak arch_get_platform_mac_address(void) > +{ > + return NULL; WARN_ON_ONCE ? > +} > + > +int eth_platform_get_mac_address(struct device *dev, u8 *mac_addr) > +{ > + const unsigned char *addr; > + struct device_node *dp; > + > + if (dev_is_pci(dev)) > + dp = pci_device_to_OF_node(to_pci_dev(dev)); > + else > + dp = dev->of_node; > + > + addr = NULL; > + if (dp) > + addr = of_get_mac_address(dp); > + if (!addr) > + addr = arch_get_platform_mac_address(); > + > + if (!addr) > + return -ENODEV; > + > + ether_addr_copy(mac_addr, addr); Couldn't some oddball arch have an unaligned addr? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Fwd: [PATCH net 0/3] ipv4: fix various issues reported by sparse
- Forwarded Message - > This trivial patch series addresses a number of endianness- and > lock-related issues reported by sparse. > > Lance Richardson (3): > ipv4: fix endianness warnings in ip_tunnel_core.c > ipv4: eliminate endianness warnings in ip_fib.h > ipv4: eliminate lock count warnings in ping.c > > include/net/ip_fib.h | 3 ++- > net/ipv4/ip_tunnel_core.c | 16 > net/ipv4/ping.c | 2 ++ > 3 files changed, 12 insertions(+), 9 deletions(-) > > -- > 2.5.0 > > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Apologies, I must have fumbled the "--cc-cmd=" option to git send-email. Resending cover letter with cc: list. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
On Wed, Jan 6, 2016 at 12:59 PM, David Miller wrote: > From: David Wragg > Date: Wed, 6 Jan 2016 13:33:04 + > >> Prior to 4.3, openvswitch vxlan vports could transmit vxlan packets of >> any size, constrained only by the ability to transmit the resulting >> UDP packets. 4.3 introduced vxlan netdevs corresponding to vxlan >> vports. These netdevs have an MTU, which limits the size of a packet >> that can be successfully vxlan-encapsulated. The default value for >> this MTU is 1500, which is awkwardly small, and leads to a conspicuous >> change in behaviour for userspace. >> >> These two patches set the MTU on openvswitch-crated vxlan devices to >> be 65465 (the maximum IP packet size minus the vxlan-on-IPv6 >> overhead), effectively restoring the behaviour prior to 4.3. In order >> to accomplish this, the first patch removes the MTU constraint of 1500 >> for vxlan netdevs without an underlying device. > > Is this really the right thing to do? Won't we get a lot of fragmentation > by using such a large MTU, especially since you're making it the default > for OVS setups? > > Things like path MTU discovery hinge strongly upon accurate MTU settings. > Otherwise they won't function properly. At a minimum, I don't think this should be VXLAN specific. But I agree that I'm not sure this is the right thing to do. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net 3/3] ipv4: eliminate lock count warnings in ping.c
Add lock release/acquire annotations to ping_seq_start() and ping_seq_stop() to satisfy sparse. Signed-off-by: Lance Richardson --- net/ipv4/ping.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c index e89094a..c117b21 100644 --- a/net/ipv4/ping.c +++ b/net/ipv4/ping.c @@ -1063,6 +1063,7 @@ static struct sock *ping_get_idx(struct seq_file *seq, loff_t pos) } void *ping_seq_start(struct seq_file *seq, loff_t *pos, sa_family_t family) + __acquires(ping_table.lock) { struct ping_iter_state *state = seq->private; state->bucket = 0; @@ -1094,6 +1095,7 @@ void *ping_seq_next(struct seq_file *seq, void *v, loff_t *pos) EXPORT_SYMBOL_GPL(ping_seq_next); void ping_seq_stop(struct seq_file *seq, void *v) + __releases(ping_table.lock) { read_unlock_bh(&ping_table.lock); } -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net 1/3] ipv4: fix endianness warnings in ip_tunnel_core.c
Eliminate endianness mismatch warnings (reported by sparse) in this file by using appropriate nla_put_*()/nla_get_*() calls. Signed-off-by: Lance Richardson --- net/ipv4/ip_tunnel_core.c | 16 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c index 6cb9009..bf70c1c 100644 --- a/net/ipv4/ip_tunnel_core.c +++ b/net/ipv4/ip_tunnel_core.c @@ -251,7 +251,7 @@ static int ip_tun_build_state(struct net_device *dev, struct nlattr *attr, tun_info = lwt_tun_info(new_state); if (tb[LWTUNNEL_IP_ID]) - tun_info->key.tun_id = nla_get_u64(tb[LWTUNNEL_IP_ID]); + tun_info->key.tun_id = nla_get_be64(tb[LWTUNNEL_IP_ID]); if (tb[LWTUNNEL_IP_DST]) tun_info->key.u.ipv4.dst = nla_get_be32(tb[LWTUNNEL_IP_DST]); @@ -266,7 +266,7 @@ static int ip_tun_build_state(struct net_device *dev, struct nlattr *attr, tun_info->key.tos = nla_get_u8(tb[LWTUNNEL_IP_TOS]); if (tb[LWTUNNEL_IP_FLAGS]) - tun_info->key.tun_flags = nla_get_u16(tb[LWTUNNEL_IP_FLAGS]); + tun_info->key.tun_flags = nla_get_be16(tb[LWTUNNEL_IP_FLAGS]); tun_info->mode = IP_TUNNEL_INFO_TX; tun_info->options_len = 0; @@ -281,12 +281,12 @@ static int ip_tun_fill_encap_info(struct sk_buff *skb, { struct ip_tunnel_info *tun_info = lwt_tun_info(lwtstate); - if (nla_put_u64(skb, LWTUNNEL_IP_ID, tun_info->key.tun_id) || + if (nla_put_be64(skb, LWTUNNEL_IP_ID, tun_info->key.tun_id) || nla_put_be32(skb, LWTUNNEL_IP_DST, tun_info->key.u.ipv4.dst) || nla_put_be32(skb, LWTUNNEL_IP_SRC, tun_info->key.u.ipv4.src) || nla_put_u8(skb, LWTUNNEL_IP_TOS, tun_info->key.tos) || nla_put_u8(skb, LWTUNNEL_IP_TTL, tun_info->key.ttl) || - nla_put_u16(skb, LWTUNNEL_IP_FLAGS, tun_info->key.tun_flags)) + nla_put_be16(skb, LWTUNNEL_IP_FLAGS, tun_info->key.tun_flags)) return -ENOMEM; return 0; @@ -346,7 +346,7 @@ static int ip6_tun_build_state(struct net_device *dev, struct nlattr *attr, tun_info = lwt_tun_info(new_state); if (tb[LWTUNNEL_IP6_ID]) - tun_info->key.tun_id = nla_get_u64(tb[LWTUNNEL_IP6_ID]); + tun_info->key.tun_id = nla_get_be64(tb[LWTUNNEL_IP6_ID]); if (tb[LWTUNNEL_IP6_DST]) tun_info->key.u.ipv6.dst = nla_get_in6_addr(tb[LWTUNNEL_IP6_DST]); @@ -361,7 +361,7 @@ static int ip6_tun_build_state(struct net_device *dev, struct nlattr *attr, tun_info->key.tos = nla_get_u8(tb[LWTUNNEL_IP6_TC]); if (tb[LWTUNNEL_IP6_FLAGS]) - tun_info->key.tun_flags = nla_get_u16(tb[LWTUNNEL_IP6_FLAGS]); + tun_info->key.tun_flags = nla_get_be16(tb[LWTUNNEL_IP6_FLAGS]); tun_info->mode = IP_TUNNEL_INFO_TX | IP_TUNNEL_INFO_IPV6; tun_info->options_len = 0; @@ -376,12 +376,12 @@ static int ip6_tun_fill_encap_info(struct sk_buff *skb, { struct ip_tunnel_info *tun_info = lwt_tun_info(lwtstate); - if (nla_put_u64(skb, LWTUNNEL_IP6_ID, tun_info->key.tun_id) || + if (nla_put_be64(skb, LWTUNNEL_IP6_ID, tun_info->key.tun_id) || nla_put_in6_addr(skb, LWTUNNEL_IP6_DST, &tun_info->key.u.ipv6.dst) || nla_put_in6_addr(skb, LWTUNNEL_IP6_SRC, &tun_info->key.u.ipv6.src) || nla_put_u8(skb, LWTUNNEL_IP6_HOPLIMIT, tun_info->key.tos) || nla_put_u8(skb, LWTUNNEL_IP6_TC, tun_info->key.ttl) || - nla_put_u16(skb, LWTUNNEL_IP6_FLAGS, tun_info->key.tun_flags)) + nla_put_be16(skb, LWTUNNEL_IP6_FLAGS, tun_info->key.tun_flags)) return -ENOMEM; return 0; -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net 0/3] ipv4: fix various issues reported by sparse
This trivial patch series addresses a number of endianness- and lock-related issues reported by sparse. Lance Richardson (3): ipv4: fix endianness warnings in ip_tunnel_core.c ipv4: eliminate endianness warnings in ip_fib.h ipv4: eliminate lock count warnings in ping.c include/net/ip_fib.h | 3 ++- net/ipv4/ip_tunnel_core.c | 16 net/ipv4/ping.c | 2 ++ 3 files changed, 12 insertions(+), 9 deletions(-) -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net 2/3] ipv4: eliminate endianness warnings in ip_fib.h
fib_multipath_hash() computes a hash using __be32 values, force cast these to u32 to pacify sparse. Signed-off-by: Lance Richardson --- include/net/ip_fib.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index 9f4df68..7029527 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -325,7 +325,8 @@ extern u32 fib_multipath_secret __read_mostly; static inline int fib_multipath_hash(__be32 saddr, __be32 daddr) { - return jhash_2words(saddr, daddr, fib_multipath_secret) >> 1; + return jhash_2words((__force u32)saddr, (__force u32)daddr, + fib_multipath_secret) >> 1; } void fib_select_multipath(struct fib_result *res, int hash); -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next] bpf: cleanup bpf_prog_run_{save,clear}_cb helpers
On Wed, Jan 06, 2016 at 10:32:16PM +0100, Daniel Borkmann wrote: > Move the details behind the cb[] access into a small helper to decouple > and make them generic for bpf_prog_run_save_cb()/bpf_prog_run_clear_cb() > that was introduced via commit ff936a04e5f2 ("bpf: fix cb access in socket > filter programs"). Also add a comment to better clarify what is done in > bpf_skb_cb(). > > Signed-off-by: Daniel Borkmann Thanks. Great catch. Acked-by: Alexei Starovoitov -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] arm64: net: bpf: don't BUG() on large shifts
On Wed, Jan 06, 2016 at 09:31:27PM +0100, Rabin Vincent wrote: > On Tue, Jan 05, 2016 at 09:55:58AM -0800, Alexei Starovoitov wrote: > > this one is better to be addressed in verifier instead of eBPF JITs. > > Please reject it in check_alu_op() instead. > > AFAICS the eBPF verifier is not called on the eBPF filters generated by > the BPF->eBPF conversion in net/core/filter.c, so performing this check > only in check_alu_op() will be insufficient. So I think we'd need to > add this check to bpf_check_classic() too. Or am I missing something? correct. the check is needed in bpf_check_classic() too and I believe it's ok to tighten it up in this case, since >32 shift is invalid/undefined anyway. We can either accept it as nop or K&=31 or error. I think returning error is more user friendly long term, though there is a small risk of rejecting previously loadable broken programs. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[GIT] Networking
As usual, there are a couple straggler bug fixes: 1) qlcnic_alloc_mbx_args() error returns are not checked in qlcnic driver. Fix from Insu Yun. 2) SKB refcounting bug in connector, from Florian Westphal. 3) vrf_get_saddr() has to propagate fib_lookup() errors to it's callers, from David Ahern. 4) Fix AF_UNIX splice/bind deadlock, from Rainer Weikusat. 5) qdisc_rcu_free() fails to free the per-cpu qstats. Fix from John Fastabend. 6) vmxnet3 driver passes wrong page to dma_map_page(), fix from Shrikrishna Khare. 7) Don't allow zero cwnd in tcp_cwnd_reduction(), from Yuchung Cheng. Please pull, thanks a lot! The following changes since commit 9c982e86dbdbaa3fb248dfc776a32cbc8927: Merge tag 'pci-v4.4-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci (2015-12-31 14:59:21 -0800) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git for you to fetch changes up to 8b8a321ff72c785ed5e8b4cf6eda20b35d427390: tcp: fix zero cwnd in tcp_cwnd_reduction (2016-01-06 16:39:56 -0500) Alan (1): mkiss: fix scribble on freed memory David Ahern (1): net: Propagate lookup failure in l3mdev_get_saddr to caller Florian Westphal (1): connector: bump skb->users before callback invocation Francesco Ruggeri (1): net: possible use after free in dst_release Hannes Frederic Sowa (1): bridge: Only call /sbin/bridge-stp for the initial network namespace Insu Yun (2): qlcnic: correctly handle qlcnic_alloc_mbx_args cxgb4: correctly handling failed allocation John Fastabend (1): net: sched: fix missing free per cpu on qstats Kristian Evensen (1): net: qmi_wwan: Add WeTelecom-WPD600N One Thousand Gnomes (1): 6pack: fix free memory scribbles Rabin Vincent (2): net: filter: make JITs zero A for SKF_AD_ALU_XOR_X ARM: net: bpf: fix zero right shift Rainer Weikusat (1): af_unix: Fix splice-bind deadlock Shrikrishna Khare (1): Driver: Vmxnet3: Fix regression caused by 5738a09 Yuchung Cheng (1): tcp: fix zero cwnd in tcp_cwnd_reduction hayeswang (1): r8152: add reset_resume function arch/arm/net/bpf_jit_32.c | 19 +++ arch/mips/net/bpf_jit.c | 16 +--- arch/powerpc/net/bpf_jit_comp.c | 13 ++--- arch/sparc/net/bpf_jit_comp.c | 17 ++--- drivers/connector/connector.c | 11 +++ drivers/net/ethernet/chelsio/cxgb4/clip_tbl.c | 4 drivers/net/ethernet/qlogic/qlcnic/qlcnic_ctx.c | 6 -- drivers/net/hamradio/6pack.c| 6 ++ drivers/net/hamradio/mkiss.c| 5 + drivers/net/usb/qmi_wwan.c | 1 + drivers/net/usb/r8152.c | 10 +- drivers/net/vmxnet3/vmxnet3_drv.c | 8 drivers/net/vmxnet3/vmxnet3_int.h | 4 ++-- drivers/net/vrf.c | 10 +++--- include/linux/filter.h | 19 +++ include/net/l3mdev.h| 16 ++-- include/net/route.h | 7 ++- net/bridge/br_stp_if.c | 5 - net/core/dst.c | 3 ++- net/ipv4/raw.c | 7 +-- net/ipv4/tcp_input.c| 3 +++ net/ipv4/udp.c | 7 +-- net/sched/sch_generic.c | 4 +++- net/unix/af_unix.c | 66 -- 24 files changed, 150 insertions(+), 117 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation
Thadeu Lima de Souza Cascardo wrote: > On Wed, Jan 06, 2016 at 11:11:41PM +0300, Konstantin Khlebnikov wrote: > > On Wed, Jan 6, 2016 at 10:59 PM, Cong Wang wrote: > > > On Wed, Jan 6, 2016 at 11:15 AM, Konstantin Khlebnikov > > > wrote: > > >> Looks like this happens because ip_options_fragment() relies on > > >> correct ip options length in ip control block in skb. But in > > >> ip_finish_output_gso() control block in segments is reused by > > >> skb_gso_segment(). following ip_fragment() sees some garbage. > > >> > > >> In my case there was no ip options but length becomes non-zero and > > >> ip_options_fragment() picked some bytes from payload and decides to > > >> fill huge range with IPOPT_NOOP (1). One of that ones flipped nr_frags > > >> in skb_shared_info at the end of data =) > > >> > > > > > > Hmm, it looks like SKB_GSO_CB should be cleared after skb_gso_segment() > > > since all the gso information should be saved in shared_info after it > > > finishes. > > > > > > Does a memset(0) on SKB_GSO_CB after skb_gso_segment() work as well? > > > > This will break present logic around ip_options_fragment() - it clears > > options from > > second and following fragments. With zeroed cb it will do nothing. > > > > ip_options_fragment() can get required information directly from ip header > > but > > it also resets fields in IPCB -- probably it should stay valid here > > and somebody else will use it later. [..] > I have hit this as well, this fixes it for me on an older kernel. Can you try > it > on latest kernel? > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c > index d8a1745..f44bc91 100644 > --- a/net/ipv4/ip_output.c > +++ b/net/ipv4/ip_output.c > @@ -216,6 +216,7 @@ static int ip_finish_output_gso(struct sk_buff *skb) > netdev_features_t features; > struct sk_buff *segs; > int ret = 0; > + struct inet_skb_parm ipcb; > > if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb)) > return ip_finish_output2(skb); > @@ -227,6 +228,10 @@ static int ip_finish_output_gso(struct sk_buff *skb) >* 2) skb arrived via virtio-net, we thus get TSO/GSO skbs directly >* from host network stack. >*/ > + /* We need to save IPCB here because skb_gso_segment will use > + * SKB_GSO_CB. > + */ > + ipcb = *IPCB(skb); > features = netif_skb_features(skb); > segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK); > if (IS_ERR_OR_NULL(segs)) { > @@ -241,6 +246,7 @@ static int ip_finish_output_gso(struct sk_buff *skb) > int err; > > segs->next = NULL; > + *IPCB(segs) = ipcb; > err = ip_fragment(segs, ip_finish_output2); > > if (err && ret == 0) I'm worried that this doesn't solve all cases. f.e. xfrm may also call skb_gso_segment(), and it will call into ipv4/ipv6 netfilter postrouting + ipv4 output functions... nfqnl_enqueue_packet() is also affected. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next V1 10/12] net/mlx5_core: Export flow steering API
From: Maor Gottlieb Add exports to flow steering API for mlx5_ib usage. The following functions are exported: 1. mlx5_create_auto_grouped_flow_table - used to create flow table with auto flow grouping management (create and destroy flow groups). In auto-grouped flow tables, we create groups automatically if needed (if we don't find an existing flow group with same match criteria when we add new rule). 2. mlx5_destroy_flow_table - used to destroy a flow table. 3. mlx5_add_flow_rule - used to add flow rule into a flow table. 4. mlx5_del_flow_rule - used to delete flow rule from its flow table. 5. mlx5_get_flow_namespace - used to get a handle to the required namespace sub-tree. Signed-off-by: Maor Gottlieb Signed-off-by: Moni Shoua Signed-off-by: Matan Barak Signed-off-by: Saeed Mahameed --- drivers/net/ethernet/mellanox/mlx5/core/fs_core.c |5 + 1 files changed, 5 insertions(+), 0 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c index 757725b..6f68dba 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c @@ -702,6 +702,7 @@ struct mlx5_flow_table *mlx5_create_auto_grouped_flow_table(struct mlx5_flow_nam return ft; } +EXPORT_SYMBOL(mlx5_create_auto_grouped_flow_table); /* Flow table should be locked */ static struct mlx5_flow_group *create_flow_group_common(struct mlx5_flow_table *ft, @@ -1013,11 +1014,13 @@ unlock: unlock_ref_node(&ft->node); return rule; } +EXPORT_SYMBOL(mlx5_add_flow_rule); void mlx5_del_flow_rule(struct mlx5_flow_rule *rule) { tree_remove_node(&rule->node); } +EXPORT_SYMBOL(mlx5_del_flow_rule); /* Assuming prio->node.children(flow tables) is sorted by level */ static struct mlx5_flow_table *find_next_ft(struct mlx5_flow_table *ft) @@ -1099,6 +1102,7 @@ int mlx5_destroy_flow_table(struct mlx5_flow_table *ft) return err; } +EXPORT_SYMBOL(mlx5_destroy_flow_table); void mlx5_destroy_flow_group(struct mlx5_flow_group *fg) { @@ -1143,6 +1147,7 @@ struct mlx5_flow_namespace *mlx5_get_flow_namespace(struct mlx5_core_dev *dev, return ns; } +EXPORT_SYMBOL(mlx5_get_flow_namespace); static struct fs_prio *fs_create_prio(struct mlx5_flow_namespace *ns, unsigned prio, int max_ft) -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next V1 12/12] IB/mlx5: Add flow steering support
From: Maor Gottlieb Adding flow steering support by creating a flow-table per priority (if rules exist in the priority). mlx5_ib uses autogrouping and thus only creates the required destinations. Signed-off-by: Maor Gottlieb Signed-off-by: Moni Shoua Signed-off-by: Matan Barak Signed-off-by: Saeed Mahameed --- drivers/infiniband/hw/mlx5/main.c| 285 ++ drivers/infiniband/hw/mlx5/mlx5_ib.h | 45 +- 2 files changed, 329 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c index e16f13f..01f7ef5 100644 --- a/drivers/infiniband/hw/mlx5/main.c +++ b/drivers/infiniband/hw/mlx5/main.c @@ -45,6 +45,7 @@ #include #include #include +#include #include "user.h" #include "mlx5_ib.h" @@ -1012,6 +1013,281 @@ static bool is_valid_attr(struct ib_flow_attr *flow_attr) return !has_ipv4_spec || eth_type_ipv4; } +static void put_flow_table(struct mlx5_ib_dev *dev, + struct mlx5_ib_flow_prio *prio, bool ft_added) +{ + prio->refcount -= !!ft_added; + if (!prio->refcount) { + mlx5_destroy_flow_table(prio->flow_table); + prio->flow_table = NULL; + } +} + +static int mlx5_ib_destroy_flow(struct ib_flow *flow_id) +{ + struct mlx5_ib_dev *dev = to_mdev(flow_id->qp->device); + struct mlx5_ib_flow_handler *handler = container_of(flow_id, + struct mlx5_ib_flow_handler, + ibflow); + struct mlx5_ib_flow_handler *iter, *tmp; + + mutex_lock(&dev->flow_db.lock); + + list_for_each_entry_safe(iter, tmp, &handler->list, list) { + mlx5_del_flow_rule(iter->rule); + list_del(&iter->list); + kfree(iter); + } + + mlx5_del_flow_rule(handler->rule); + put_flow_table(dev, &dev->flow_db.prios[handler->prio], true); + mutex_unlock(&dev->flow_db.lock); + + kfree(handler); + + return 0; +} + +#define MLX5_FS_MAX_TYPES 10 +#define MLX5_FS_MAX_ENTRIES 32000UL +static struct mlx5_ib_flow_prio *get_flow_table(struct mlx5_ib_dev *dev, + struct ib_flow_attr *flow_attr) +{ + struct mlx5_flow_namespace *ns = NULL; + struct mlx5_ib_flow_prio *prio; + struct mlx5_flow_table *ft; + int num_entries; + int num_groups; + int priority; + int err = 0; + + if (flow_attr->type == IB_FLOW_ATTR_NORMAL) { + if (flow_is_multicast_only(flow_attr)) + priority = MLX5_IB_FLOW_MCAST_PRIO; + else + priority = flow_attr->priority; + ns = mlx5_get_flow_namespace(dev->mdev, +MLX5_FLOW_NAMESPACE_BYPASS); + num_entries = MLX5_FS_MAX_ENTRIES; + num_groups = MLX5_FS_MAX_TYPES; + prio = &dev->flow_db.prios[priority]; + } else if (flow_attr->type == IB_FLOW_ATTR_ALL_DEFAULT || + flow_attr->type == IB_FLOW_ATTR_MC_DEFAULT) { + ns = mlx5_get_flow_namespace(dev->mdev, +MLX5_FLOW_NAMESPACE_LEFTOVERS); + build_leftovers_ft_param(&priority, +&num_entries, +&num_groups); + prio = &dev->flow_db.prios[MLX5_IB_FLOW_LEFTOVERS_PRIO]; + } + + if (!ns) + return ERR_PTR(-ENOTSUPP); + + ft = prio->flow_table; + if (!ft) { + ft = mlx5_create_auto_grouped_flow_table(ns, priority, +num_entries, +num_groups); + + if (!IS_ERR(ft)) { + prio->refcount = 0; + prio->flow_table = ft; + } else { + err = PTR_ERR(ft); + } + } + + return err ? ERR_PTR(err) : prio; +} + +static struct mlx5_ib_flow_handler *create_flow_rule(struct mlx5_ib_dev *dev, +struct mlx5_ib_flow_prio *ft_prio, +struct ib_flow_attr *flow_attr, +struct mlx5_flow_destination *dst) +{ + struct mlx5_flow_table *ft = ft_prio->flow_table; + struct mlx5_ib_flow_handler *handler; + void *ib_flow = flow_attr + 1; + u8 match_criteria_enable = 0; + unsigned int spec_index; + u32 *match_c; + u32 *match_v; + int err = 0; + + if (!is_valid_attr(flow_attr)) + return ERR_PTR(-EINVAL); + + match_c = kzalloc(MLX5_ST_SZ_BYTES(fte_match_param), GFP_KERNEL); + match_v = kzalloc(MLX5
[PATCH net-next V1 08/12] net/mlx5_core: Enable flow steering support for the IB driver
From: Maor Gottlieb When the driver is loaded, we create flow steering namespace for kernel bypass with nine priorities and another namespace for leftovers(in order to catch packets that weren't matched). Verbs applications will use these priorities. we found nine as a number that balances the requirements from the user and retains performance. The bypass namespace is used by verbs applications that want to bypass the kernel networking stack. The leftovers namespace is used by verbs applications and the sniffer in order to catch packets that weren't handled by any preceding rules. Signed-off-by: Maor Gottlieb Signed-off-by: Moni Shoua Signed-off-by: Matan Barak Signed-off-by: Saeed Mahameed --- drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 55 ++--- include/linux/mlx5/device.h |2 + include/linux/mlx5/fs.h |2 + 3 files changed, 51 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c index 96e287a..757725b 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c @@ -40,18 +40,19 @@ #define INIT_TREE_NODE_ARRAY_SIZE(...) (sizeof((struct init_tree_node[]){__VA_ARGS__}) /\ sizeof(struct init_tree_node)) -#define ADD_PRIO(min_level_val, max_ft_val, caps_val,\ +#define ADD_PRIO(num_prios_val, min_level_val, max_ft_val, caps_val,\ ...) {.type = FS_TYPE_PRIO,\ .min_ft_level = min_level_val,\ .max_ft = max_ft_val,\ + .num_leaf_prios = num_prios_val,\ .caps = caps_val,\ .children = (struct init_tree_node[]) {__VA_ARGS__},\ .ar_size = INIT_TREE_NODE_ARRAY_SIZE(__VA_ARGS__) \ } -#define ADD_FT_PRIO(max_ft_val, ...)\ - ADD_PRIO(0, max_ft_val, {},\ - __VA_ARGS__)\ +#define ADD_MULTIPLE_PRIO(num_prios_val, max_ft_val, ...)\ + ADD_PRIO(num_prios_val, 0, max_ft_val, {},\ +__VA_ARGS__)\ #define ADD_NS(...) {.type = FS_TYPE_NAMESPACE,\ .children = (struct init_tree_node[]) {__VA_ARGS__},\ @@ -66,7 +67,14 @@ #define FS_REQUIRED_CAPS(...) {.arr_sz = INIT_CAPS_ARRAY_SIZE(__VA_ARGS__), \ .caps = (long[]) {__VA_ARGS__} } +#define LEFTOVERS_MAX_FT 1 +#define LEFTOVERS_NUM_PRIOS 1 +#define BY_PASS_PRIO_MAX_FT 1 +#define BY_PASS_MIN_LEVEL (KENREL_MIN_LEVEL + MLX5_BY_PASS_NUM_PRIOS +\ + LEFTOVERS_MAX_FT) + #define KERNEL_MAX_FT 2 +#define KERNEL_NUM_PRIOS 1 #define KENREL_MIN_LEVEL 2 struct node_caps { @@ -79,14 +87,27 @@ static struct init_tree_node { int ar_size; struct node_caps caps; int min_ft_level; + int num_leaf_prios; int prio; int max_ft; } root_fs = { .type = FS_TYPE_NAMESPACE, - .ar_size = 1, + .ar_size = 3, .children = (struct init_tree_node[]) { - ADD_PRIO(KENREL_MIN_LEVEL, 0, {}, -ADD_NS(ADD_FT_PRIO(KERNEL_MAX_FT))), + ADD_PRIO(0, BY_PASS_MIN_LEVEL, 0, + FS_REQUIRED_CAPS(FS_CAP(flow_table_properties_nic_receive.flow_modify_en), + FS_CAP(flow_table_properties_nic_receive.modify_root), + FS_CAP(flow_table_properties_nic_receive.identified_miss_table_mode), + FS_CAP(flow_table_properties_nic_receive.flow_table_modify)), +ADD_NS(ADD_MULTIPLE_PRIO(MLX5_BY_PASS_NUM_PRIOS, BY_PASS_PRIO_MAX_FT))), + ADD_PRIO(0, KENREL_MIN_LEVEL, 0, {}, +ADD_NS(ADD_MULTIPLE_PRIO(KERNEL_NUM_PRIOS, KERNEL_MAX_FT))), + ADD_PRIO(0, BY_PASS_MIN_LEVEL, 0, + FS_REQUIRED_CAPS(FS_CAP(flow_table_properties_nic_receive.flow_modify_en), + FS_CAP(flow_table_properties_nic_receive.modify_root), + FS_CAP(flow_table_properties_nic_receive.identified_miss_table_mode), + FS_CAP(flow_table_properties_nic_receive.flow_table_modify)), +ADD_NS(ADD_MULTIPLE_PRIO(LEFTOVERS_NUM_PRIOS, LEFTOVERS_MAX_FT))), } }; @@ -1098,8 +1119,10 @@ struct mlx5_flow_namespace *mlx5_get_flow_namespace(struct mlx5_core_dev *dev, return NULL; switch (type) { + case MLX5_FLOW_NAMESPACE_BYPASS: case MLX5_FLOW_NAMESPACE_KERNEL: - prio = 0; + case MLX5_FLOW_NAMESPACE_LEFTOVERS: + prio = type; break; case MLX5_FLOW_NAMESPACE_FDB: if (dev->priv.fdb_root_ns) @@ -1164,6 +1187,20 @@ static struct mlx5_flow_namespace *fs_create_namespace(struct fs_prio *prio) return ns; } +static in
[PATCH net-next V1 05/12] net/mlx5_core: Connect flow tables
From: Maor Gottlieb Flow tables from different priorities should be chained together. When a packet arrives we search for a match in the by-pass flow tables (first we search for a match in priority 0 and if we don't find a match we move to the next priority). If we can't find a match in any of the bypass flow-tables, we continue searching in the flow-tables of the next priority, which are the kernel's flow tables. Setting the miss flow table in a new flow table to be the next one in the list is performed via create flow table API. If we want to change an existing flow table, for example in order to point from an existing flow table to the new next-in-list flow table, we use the modify flow table API. Signed-off-by: Maor Gottlieb Signed-off-by: Moni Shoua Signed-off-by: Matan Barak Signed-off-by: Saeed Mahameed --- drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c |7 +- drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h |3 +- drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 104 +++-- 3 files changed, 104 insertions(+), 10 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c index 2b55625..a9894d2 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c @@ -58,7 +58,8 @@ int mlx5_cmd_update_root_ft(struct mlx5_core_dev *dev, int mlx5_cmd_create_flow_table(struct mlx5_core_dev *dev, enum fs_flow_table_type type, unsigned int level, - unsigned int log_size, unsigned int *table_id) + unsigned int log_size, struct mlx5_flow_table + *next_ft, unsigned int *table_id) { u32 out[MLX5_ST_SZ_DW(create_flow_table_out)]; u32 in[MLX5_ST_SZ_DW(create_flow_table_in)]; @@ -69,6 +70,10 @@ int mlx5_cmd_create_flow_table(struct mlx5_core_dev *dev, MLX5_SET(create_flow_table_in, in, opcode, MLX5_CMD_OP_CREATE_FLOW_TABLE); + if (next_ft) { + MLX5_SET(create_flow_table_in, in, table_miss_mode, 1); + MLX5_SET(create_flow_table_in, in, table_miss_id, next_ft->id); + } MLX5_SET(create_flow_table_in, in, table_type, type); MLX5_SET(create_flow_table_in, in, level, level); MLX5_SET(create_flow_table_in, in, log_size, log_size); diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h index 1ae9b68..9814d47 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h @@ -35,7 +35,8 @@ int mlx5_cmd_create_flow_table(struct mlx5_core_dev *dev, enum fs_flow_table_type type, unsigned int level, - unsigned int log_size, unsigned int *table_id); + unsigned int log_size, struct mlx5_flow_table + *next_ft, unsigned int *table_id); int mlx5_cmd_destroy_flow_table(struct mlx5_core_dev *dev, struct mlx5_flow_table *ft); diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c index 64bdb54..c6f864d 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c @@ -510,6 +510,48 @@ static struct mlx5_flow_table *find_prev_chained_ft(struct fs_prio *prio) return find_closest_ft(prio, true); } +static int connect_fts_in_prio(struct mlx5_core_dev *dev, + struct fs_prio *prio, + struct mlx5_flow_table *ft) +{ + struct mlx5_flow_table *iter; + int i = 0; + int err; + + fs_for_each_ft(iter, prio) { + i++; + err = mlx5_cmd_modify_flow_table(dev, +iter, +ft); + if (err) { + mlx5_core_warn(dev, "Failed to modify flow table %d\n", + iter->id); + /* The driver is out of sync with the FW */ + if (i > 1) + WARN_ON(true); + return err; + } + } + return 0; +} + +/* Connect flow tables from previous priority of prio to ft */ +static int connect_prev_fts(struct mlx5_core_dev *dev, + struct mlx5_flow_table *ft, + struct fs_prio *prio) +{ + struct mlx5_flow_table *prev_ft; + + prev_ft = find_prev_chained_ft(prio); + if (prev_ft) { + struct fs_prio *prev_prio; + + fs_get_obj(prev_prio, prev_ft->node.parent); + return connect_fts_in_prio(dev, prev_prio, ft); + } + retur
[PATCH net-next V1 06/12] net/mlx5_core: Set priority attributes
From: Maor Gottlieb Each priority has two attributes: 1. max_ft - maximum allowed flow tables under this priority. 2. start_level - start level range of the flow tables in the priority. These attributes are set by traversing the tree nodes by DFS and set start level and max flow tables to each priority. Start level depends on the max flow tables of the prior priorities in the tree. The leaves of the trees have max_ft set in them. Each node accumulates the max_ft of its children and set it accordingly. Signed-off-by: Maor Gottlieb Signed-off-by: Moni Shoua Signed-off-by: Matan Barak Signed-off-by: Saeed Mahameed --- drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 71 +++-- drivers/net/ethernet/mellanox/mlx5/core/fs_core.h |3 + 2 files changed, 55 insertions(+), 19 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c index c6f864d..e1282e8 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c @@ -41,20 +41,19 @@ sizeof(struct init_tree_node)) #define INIT_PRIO(min_level_val, max_ft_val,\ - start_level_val, ...) {.type = FS_TYPE_PRIO,\ + ...) {.type = FS_TYPE_PRIO,\ .min_ft_level = min_level_val,\ - .start_level = start_level_val,\ .max_ft = max_ft_val,\ .children = (struct init_tree_node[]) {__VA_ARGS__},\ .ar_size = INIT_TREE_NODE_ARRAY_SIZE(__VA_ARGS__) \ } -#define ADD_PRIO(min_level_val, max_ft_val, start_level_val, ...)\ - INIT_PRIO(min_level_val, max_ft_val, start_level_val,\ +#define ADD_PRIO(min_level_val, max_ft_val, ...)\ + INIT_PRIO(min_level_val, max_ft_val,\ __VA_ARGS__)\ -#define ADD_FT_PRIO(max_ft_val, start_level_val, ...)\ - INIT_PRIO(0, max_ft_val, start_level_val,\ +#define ADD_FT_PRIO(max_ft_val, ...)\ + INIT_PRIO(0, max_ft_val,\ __VA_ARGS__)\ #define ADD_NS(...) {.type = FS_TYPE_NAMESPACE,\ @@ -62,8 +61,6 @@ .ar_size = INIT_TREE_NODE_ARRAY_SIZE(__VA_ARGS__) \ } -#define KERNEL_START_LEVEL 0 -#define KERNEL_P0_START_LEVEL KERNEL_START_LEVEL #define KERNEL_MAX_FT 2 #define KENREL_MIN_LEVEL 2 static struct init_tree_node { @@ -73,15 +70,12 @@ static struct init_tree_node { int min_ft_level; int prio; int max_ft; - int start_level; } root_fs = { .type = FS_TYPE_NAMESPACE, .ar_size = 1, .children = (struct init_tree_node[]) { - ADD_PRIO(KENREL_MIN_LEVEL, KERNEL_MAX_FT, -KERNEL_START_LEVEL, -ADD_NS(ADD_FT_PRIO(KERNEL_MAX_FT, - KERNEL_P0_START_LEVEL))), + ADD_PRIO(KENREL_MIN_LEVEL, 0, +ADD_NS(ADD_FT_PRIO(KERNEL_MAX_FT))), } }; @@ -1117,8 +,7 @@ struct mlx5_flow_namespace *mlx5_get_flow_namespace(struct mlx5_core_dev *dev, } static struct fs_prio *fs_create_prio(struct mlx5_flow_namespace *ns, - unsigned prio, int max_ft, - int start_level) + unsigned prio, int max_ft) { struct fs_prio *fs_prio; @@ -1131,7 +1124,6 @@ static struct fs_prio *fs_create_prio(struct mlx5_flow_namespace *ns, tree_add_node(&fs_prio->node, &ns->node); fs_prio->max_ft = max_ft; fs_prio->prio = prio; - fs_prio->start_level = start_level; list_add_tail(&fs_prio->node.list, &ns->node.children); return fs_prio; @@ -1177,8 +1169,7 @@ static int init_root_tree_recursive(int max_ft_level, struct init_tree_node *ini return -ENOTSUPP; fs_get_obj(fs_ns, fs_parent_node); - fs_prio = fs_create_prio(fs_ns, index, init_node->max_ft, -init_node->start_level); + fs_prio = fs_create_prio(fs_ns, index, init_node->max_ft); if (IS_ERR(fs_prio)) return PTR_ERR(fs_prio); base = &fs_prio->node; @@ -1245,6 +1236,46 @@ static struct mlx5_flow_root_namespace *create_root_ns(struct mlx5_core_dev *dev return root_ns; } +static void set_prio_attrs_in_prio(struct fs_prio *prio, int acc_level); + +static int set_prio_attrs_in_ns(struct mlx5_flow_namespace *ns, int acc_level) +{ + struct fs_prio *prio; + + fs_for_each_prio(prio, ns) { +/* This updates prio start_level and max_ft */ + set_prio_attrs_in_prio(prio, acc_level); + acc_level += prio->max_ft; + } + return acc_level; +} + +static void set_prio_attrs_in_prio(struct fs_prio *prio, int acc_level) +{ + struct mlx5_flow_namespace *ns; + int acc_level_ns = acc_level; + + prio->start_level = acc
[PATCH net-next V1 09/12] net/mlx5_core: Make ipv4/ipv6 location more clear
From: Maor Gottlieb Change the mlx5 firmware interface header to make it more clear which bytes should be used by IPv4 or IPv6 addresses. Signed-off-by: Maor Gottlieb Signed-off-by: Moni Shoua Signed-off-by: Matan Barak Signed-off-by: Saeed Mahameed --- include/linux/mlx5/mlx5_ifc.h | 20 ++-- 1 files changed, 18 insertions(+), 2 deletions(-) diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h index 7f16695..68d73f8 100644 --- a/include/linux/mlx5/mlx5_ifc.h +++ b/include/linux/mlx5/mlx5_ifc.h @@ -298,6 +298,22 @@ struct mlx5_ifc_odp_per_transport_service_cap_bits { u8 reserved_1[0x1a]; }; +struct mlx5_ifc_ipv4_layout_bits { + u8 reserved_0[0x60]; + + u8 ipv4[0x20]; +}; + +struct mlx5_ifc_ipv6_layout_bits { + u8 ipv6[16][0x8]; +}; + +union mlx5_ifc_ipv6_layout_ipv4_layout_auto_bits { + struct mlx5_ifc_ipv6_layout_bits ipv6_layout; + struct mlx5_ifc_ipv4_layout_bits ipv4_layout; + u8 reserved_0[0x80]; +}; + struct mlx5_ifc_fte_match_set_lyr_2_4_bits { u8 smac_47_16[0x20]; @@ -328,9 +344,9 @@ struct mlx5_ifc_fte_match_set_lyr_2_4_bits { u8 udp_sport[0x10]; u8 udp_dport[0x10]; - u8 src_ip[4][0x20]; + union mlx5_ifc_ipv6_layout_ipv4_layout_auto_bits src_ipv4_src_ipv6; - u8 dst_ip[4][0x20]; + union mlx5_ifc_ipv6_layout_ipv4_layout_auto_bits dst_ipv4_dst_ipv6; }; struct mlx5_ifc_fte_match_set_misc_bits { -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next V1 04/12] net/mlx5_core: Introduce modify flow table command
From: Maor Gottlieb Introduce the modify flow table command. This command is used when we want to change the next flow table of an existing flow table. The next flow table is defined as the table we search (in order to find a match), if we couldn't find a match in any of the flow table entries in the current flow table. Signed-off-by: Maor Gottlieb Signed-off-by: Moni Shoua Signed-off-by: Matan Barak Signed-off-by: Saeed Mahameed --- drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c | 27 ++ drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h |4 ++ include/linux/mlx5/mlx5_ifc.h| 56 -- 3 files changed, 83 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c index d8b1195..2b55625 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c @@ -101,6 +101,33 @@ int mlx5_cmd_destroy_flow_table(struct mlx5_core_dev *dev, sizeof(out)); } +int mlx5_cmd_modify_flow_table(struct mlx5_core_dev *dev, + struct mlx5_flow_table *ft, + struct mlx5_flow_table *next_ft) +{ + u32 in[MLX5_ST_SZ_DW(modify_flow_table_in)]; + u32 out[MLX5_ST_SZ_DW(modify_flow_table_out)]; + + memset(in, 0, sizeof(in)); + memset(out, 0, sizeof(out)); + + MLX5_SET(modify_flow_table_in, in, opcode, +MLX5_CMD_OP_MODIFY_FLOW_TABLE); + MLX5_SET(modify_flow_table_in, in, table_type, ft->type); + MLX5_SET(modify_flow_table_in, in, table_id, ft->id); + MLX5_SET(modify_flow_table_in, in, modify_field_select, +MLX5_MODIFY_FLOW_TABLE_MISS_TABLE_ID); + if (next_ft) { + MLX5_SET(modify_flow_table_in, in, table_miss_mode, 1); + MLX5_SET(modify_flow_table_in, in, table_miss_id, next_ft->id); + } else { + MLX5_SET(modify_flow_table_in, in, table_miss_mode, 0); + } + + return mlx5_cmd_exec_check_status(dev, in, sizeof(in), out, + sizeof(out)); +} + int mlx5_cmd_create_flow_group(struct mlx5_core_dev *dev, struct mlx5_flow_table *ft, u32 *in, diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h index 70d18ec..1ae9b68 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h @@ -40,6 +40,10 @@ int mlx5_cmd_create_flow_table(struct mlx5_core_dev *dev, int mlx5_cmd_destroy_flow_table(struct mlx5_core_dev *dev, struct mlx5_flow_table *ft); +int mlx5_cmd_modify_flow_table(struct mlx5_core_dev *dev, + struct mlx5_flow_table *ft, + struct mlx5_flow_table *next_ft); + int mlx5_cmd_create_flow_group(struct mlx5_core_dev *dev, struct mlx5_flow_table *ft, u32 *in, unsigned int *group_id); diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h index 323e713..7f16695 100644 --- a/include/linux/mlx5/mlx5_ifc.h +++ b/include/linux/mlx5/mlx5_ifc.h @@ -194,7 +194,8 @@ enum { MLX5_CMD_OP_QUERY_FLOW_GROUP = 0x935, MLX5_CMD_OP_SET_FLOW_TABLE_ENTRY = 0x936, MLX5_CMD_OP_QUERY_FLOW_TABLE_ENTRY= 0x937, - MLX5_CMD_OP_DELETE_FLOW_TABLE_ENTRY = 0x938 + MLX5_CMD_OP_DELETE_FLOW_TABLE_ENTRY = 0x938, + MLX5_CMD_OP_MODIFY_FLOW_TABLE = 0x93c }; struct mlx5_ifc_flow_table_fields_supported_bits { @@ -260,7 +261,9 @@ struct mlx5_ifc_flow_table_prop_layout_bits { u8 reserved_0[0x2]; u8 flow_modify_en[0x1]; u8 modify_root[0x1]; - u8 reserved_1[0x1b]; + u8 identified_miss_table_mode[0x1]; + u8 flow_table_modify[0x1]; + u8 reserved_1[0x19]; u8 reserved_2[0x2]; u8 log_max_ft_size[0x6]; @@ -5669,12 +5672,16 @@ struct mlx5_ifc_create_flow_table_in_bits { u8 reserved_4[0x20]; - u8 reserved_5[0x8]; + u8 reserved_5[0x4]; + u8 table_miss_mode[0x4]; u8 level[0x8]; u8 reserved_6[0x8]; u8 log_size[0x8]; - u8 reserved_7[0x120]; + u8 reserved_7[0x8]; + u8 table_miss_id[0x18]; + + u8 reserved_8[0x100]; }; struct mlx5_ifc_create_flow_group_out_bits { @@ -6975,4 +6982,45 @@ struct mlx5_ifc_set_flow_table_root_in_bits { u8 reserved_5[0x140]; }; +enum { + MLX5_MODIFY_FLOW_TABLE_MISS_TABLE_ID = 0x1, +}; + +struct mlx5_ifc_modify_flow_table_out_bits {
[PATCH net-next V1 11/12] IB/mlx5: Add flow steering utilities
From: Maor Gottlieb Add three utility functions for support flow steering: 1. Parsing verbs flow attributes hardware steering specs. 2. Check if flow is multicast - this is required in order to decide to which flow table will we add the steering rule. 3. Set outer headers in flow match criteria to zeros. Signed-off-by: Maor Gottlieb Signed-off-by: Moni Shoua Signed-off-by: Matan Barak Signed-off-by: Saeed Mahameed --- drivers/infiniband/hw/mlx5/main.c | 177 + include/linux/mlx5/fs.h | 10 ++ 2 files changed, 187 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c index 7e97cb5..e16f13f 100644 --- a/drivers/infiniband/hw/mlx5/main.c +++ b/drivers/infiniband/hw/mlx5/main.c @@ -43,6 +43,8 @@ #include #include #include +#include +#include #include "user.h" #include "mlx5_ib.h" @@ -835,6 +837,181 @@ static int mlx5_ib_dealloc_pd(struct ib_pd *pd) return 0; } +static bool outer_header_zero(u32 *match_criteria) +{ + int size = MLX5_ST_SZ_BYTES(fte_match_param); + char *outer_headers_c = MLX5_ADDR_OF(fte_match_param, match_criteria, +outer_headers); + + return outer_headers_c[0] == 0 && !memcmp(outer_headers_c, + outer_headers_c + 1, + size - 1); +} + +static int parse_flow_attr(u32 *match_c, u32 *match_v, + union ib_flow_spec *ib_spec) +{ + void *outer_headers_c = MLX5_ADDR_OF(fte_match_param, match_c, +outer_headers); + void *outer_headers_v = MLX5_ADDR_OF(fte_match_param, match_v, +outer_headers); + switch (ib_spec->type) { + case IB_FLOW_SPEC_ETH: + if (ib_spec->size != sizeof(ib_spec->eth)) + return -EINVAL; + + ether_addr_copy(MLX5_ADDR_OF(fte_match_set_lyr_2_4, outer_headers_c, +dmac_47_16), + ib_spec->eth.mask.dst_mac); + ether_addr_copy(MLX5_ADDR_OF(fte_match_set_lyr_2_4, outer_headers_v, +dmac_47_16), + ib_spec->eth.val.dst_mac); + + if (ib_spec->eth.mask.vlan_tag) { + MLX5_SET(fte_match_set_lyr_2_4, outer_headers_c, +vlan_tag, 1); + MLX5_SET(fte_match_set_lyr_2_4, outer_headers_v, +vlan_tag, 1); + + MLX5_SET(fte_match_set_lyr_2_4, outer_headers_c, +first_vid, ntohs(ib_spec->eth.mask.vlan_tag)); + MLX5_SET(fte_match_set_lyr_2_4, outer_headers_v, +first_vid, ntohs(ib_spec->eth.val.vlan_tag)); + + MLX5_SET(fte_match_set_lyr_2_4, outer_headers_c, +first_cfi, +ntohs(ib_spec->eth.mask.vlan_tag) >> 12); + MLX5_SET(fte_match_set_lyr_2_4, outer_headers_v, +first_cfi, +ntohs(ib_spec->eth.val.vlan_tag) >> 12); + + MLX5_SET(fte_match_set_lyr_2_4, outer_headers_c, +first_prio, +ntohs(ib_spec->eth.mask.vlan_tag) >> 13); + MLX5_SET(fte_match_set_lyr_2_4, outer_headers_v, +first_prio, +ntohs(ib_spec->eth.val.vlan_tag) >> 13); + } + MLX5_SET(fte_match_set_lyr_2_4, outer_headers_c, +ethertype, ntohs(ib_spec->eth.mask.ether_type)); + MLX5_SET(fte_match_set_lyr_2_4, outer_headers_v, +ethertype, ntohs(ib_spec->eth.val.ether_type)); + break; + case IB_FLOW_SPEC_IPV4: + if (ib_spec->size != sizeof(ib_spec->ipv4)) + return -EINVAL; + + MLX5_SET(fte_match_set_lyr_2_4, outer_headers_c, +ethertype, 0x); + MLX5_SET(fte_match_set_lyr_2_4, outer_headers_v, +ethertype, ETH_P_IP); + + memcpy(MLX5_ADDR_OF(fte_match_set_lyr_2_4, outer_headers_c, + src_ipv4_src_ipv6.ipv4_layout.ipv4), + &ib_spec->ipv4.mask.src_ip, + sizeof(ib_spec->ipv4.mask.src_ip)); + memcpy(MLX5_ADDR_OF(fte_match_set_lyr_2_4, outer_headers_v, + src_ipv4_src_ipv6.ipv4_layout.ipv4), + &ib_spec->ipv4.val.src_ip, + sizeof(ib_spec->ipv4.val.src_ip)); + me
[PATCH net-next V1 00/12] net/mlx5_core: Enhance flow steering support
Hi Dave, This series adds three new functionalists to the driver flow-steering infrastructure: auto-grouped flow tables, chaining of flow tables and updates for the root flow table. Changes since V0: - Fixed improperly formatted comments. - Compare value of ib_spec->eth.mask.ether_type in network byte order in ('IB/mlx5: Add flow steering utilities'). 1. Auto-grouped flow tables - Flow table with auto grouping management. When a flow table is created, hints regarding the number of rule types and the number of rules are given in advance. Thus, a flow table is divided into #NUM_TYPES+1 groups each contains (#NUM_RULES)/(#NUM_TYPES+1) rules. The first #NUM_TYPES parts are groups which are filled if the added rule matches the group specification or the group is empty. The last part is filled by rules that can't fit any of the former groups. 2. Chaining flow tables - Flow tables from different priorities are chained together, if there is no match in flow table of priority i we continue searching for a match in priority i+1. This is both true if priorities i and i+1 belongs to the same namespace or not. 3. Updating the root flow table - the root flow table is the flow table with the lowest level. The hardware start searching for a match in the root flow table and continue according to the matches it find along the way. The first usage for the new functionality is flow steering for user-space ConnectX-4 offloaded HW Eth RX queues done through the mlx5 IB driver. When the mlx5 core driver is loaded, it opens three flow namespaces: 1. By-pass namespace (used by mlx5 IB driver). 2. Kernel namespace (used in order to get packets to the networking stack through mlx5 EN driver). 3. Leftovers namespace (used by mlx5 IB and future sniffer) The series is built as follows: Patch #1 introduces auto-grouped flow tables support. Patch #2 add utility functions for finding the next and the previous flow tables in different priorities. This is used in order to chain the flow tables in a downstream patch. Patch #3 introduces a firmware command for updating the root flow table. Patch #4 introduces modify flow table firmware command, this command is used when we want to change the next flow table of an existing flow table. This is used for chaining flow tables as well. Patch #5 connect/disconnect flow tables. This is actually the chaining process when we want to link flow tables. This means that if we couldn't find a match in the first flow table, we'll continue in the chained flow table. Patch #6 updates priority's attributes that is required for flow table level allocation. We update both the max_fts (the number of allowed FTs in the sub-tree of this priority) and the start_level (which is the first level we'll assign to the flow-tables created inside the priority). Patch #7 adds checking of required device capabilities. Some namespaces could be only created if the hardware supports certain attributes. This is especially true for the Bypass and leftovers namespaces. This adds a generic mechanism to check these required attributes. Patch #8 creates two additional namespaces: a. Bypass flow rules(has nine priorities) b. Leftovers packets(have one priority) - for unmatched packets. Patch #9 re-factors ipv4/ipv6 match fields in the mlx5 firmware interface header to be more clear. Patch #10 exports the flow steering API for mlx5_ib usage Patch #11 and #12 implements the required support in mlx5_ib in order to support the RDMA flow steering verbs. Regards, Moni, Matan and Maor Maor Gottlieb (12): net/mlx5_core: Introduce flow steering autogrouped flow table net/mlx5_core: Add utilities to find next and prev flow-tables net/mlx5_core: Managing root flow table net/mlx5_core: Introduce modify flow table command net/mlx5_core: Connect flow tables net/mlx5_core: Set priority attributes net/mlx5_core: Initialize namespaces only when supported by device net/mlx5_core: Enable flow steering support for the IB driver net/mlx5_core: Make ipv4/ipv6 location more clear net/mlx5_core: Export flow steering API IB/mlx5: Add flow steering utilities IB/mlx5: Add flow steering support drivers/infiniband/hw/mlx5/main.c | 462 drivers/infiniband/hw/mlx5/mlx5_ib.h | 45 ++- drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c | 52 ++- drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h |9 +- drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 601 ++--- drivers/net/ethernet/mellanox/mlx5/core/fs_core.h | 14 + include/linux/mlx5/device.h |2 + include/linux/mlx5/fs.h | 18 + include/linux/mlx5/mlx5_ifc.h | 105 - 9 files changed, 1232 insertions(+), 76 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger
Re: [PATCH net] tcp: fix zero cwnd in tcp_cwnd_reduction
From: Yuchung Cheng Date: Wed, 6 Jan 2016 12:42:38 -0800 > Patch 3759824da87b ("tcp: PRR uses CRB mode by default and SS mode > conditionally") introduced a bug that cwnd may become 0 when both > inflight and sndcnt are 0 (cwnd = inflight + sndcnt). This may lead > to a div-by-zero if the connection starts another cwnd reduction > phase by setting tp->prior_cwnd to the current cwnd (0) in > tcp_init_cwnd_reduction(). > > To prevent this we skip PRR operation when nothing is acked or > sacked. Then cwnd must be positive in all cases as long as ssthresh > is positive: > > 1) The proportional reduction mode >inflight > ssthresh > 0 > > 2) The reduction bound mode > a) inflight == ssthresh > 0 > > b) inflight < ssthresh > sndcnt > 0 since newly_acked_sacked > 0 and inflight < ssthresh > > Therefore in all cases inflight and sndcnt can not both be 0. > We check invalid tp->prior_cwnd to avoid potential div0 bugs. > > In reality this bug is triggered only with a sequence of less common > events. For example, the connection is terminating an ECN-triggered > cwnd reduction with an inflight 0, then it receives reordered/old > ACKs or DSACKs from prior transmission (which acks nothing). Or the > connection is in fast recovery stage that marks everything lost, > but fails to retransmit due to local issues, then receives data > packets from other end which acks nothing. > > Fixes: 3759824da87b ("tcp: PRR uses CRB mode by default and SS mode > conditionally") > Reported-by: Oleksandr Natalenko > Signed-off-by: Yuchung Cheng > Signed-off-by: Neal Cardwell > Signed-off-by: Eric Dumazet Applied and queued up for -stable, thanks! -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next V1 02/12] net/mlx5_core: Add utilities to find next and prev flow-tables
From: Maor Gottlieb Add two utility functions for find next and prev flow table. Find next flow table function gets priority and return the first flow table of the next priority in the tree. Find prev flow table return the last flow table of the previous priority in the tree. These utility functions are used for chaining flow table from different priorities. Signed-off-by: Maor Gottlieb Signed-off-by: Moni Shoua Signed-off-by: Matan Barak Signed-off-by: Saeed Mahameed --- drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 67 + 1 files changed, 67 insertions(+), 0 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c index 7d24bbb..c5a96e6 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c @@ -443,6 +443,73 @@ static struct mlx5_flow_table *alloc_flow_table(int level, int max_fte, return ft; } +/* If reverse is false, then we search for the first flow table in the + * root sub-tree from start(closest from right), else we search for the + * last flow table in the root sub-tree till start(closest from left). + */ +static struct mlx5_flow_table *find_closest_ft_recursive(struct fs_node *root, +struct list_head *start, +bool reverse) +{ +#define list_advance_entry(pos, reverse) \ + ((reverse) ? list_prev_entry(pos, list) : list_next_entry(pos, list)) + +#define list_for_each_advance_continue(pos, head, reverse) \ + for (pos = list_advance_entry(pos, reverse);\ +&pos->list != (head); \ +pos = list_advance_entry(pos, reverse)) + + struct fs_node *iter = list_entry(start, struct fs_node, list); + struct mlx5_flow_table *ft = NULL; + + if (!root) + return NULL; + + list_for_each_advance_continue(iter, &root->children, reverse) { + if (iter->type == FS_TYPE_FLOW_TABLE) { + fs_get_obj(ft, iter); + return ft; + } + ft = find_closest_ft_recursive(iter, &iter->children, reverse); + if (ft) + return ft; + } + + return ft; +} + +/* If reverse if false then return the first flow table in next priority of + * prio in the tree, else return the last flow table in the previous priority + * of prio in the tree. + */ +static struct mlx5_flow_table *find_closest_ft(struct fs_prio *prio, bool reverse) +{ + struct mlx5_flow_table *ft = NULL; + struct fs_node *curr_node; + struct fs_node *parent; + + parent = prio->node.parent; + curr_node = &prio->node; + while (!ft && parent) { + ft = find_closest_ft_recursive(parent, &curr_node->list, reverse); + curr_node = parent; + parent = curr_node->parent; + } + return ft; +} + +/* Assuming all the tree is locked by mutex chain lock */ +static struct mlx5_flow_table *find_next_chained_ft(struct fs_prio *prio) +{ + return find_closest_ft(prio, false); +} + +/* Assuming all the tree is locked by mutex chain lock */ +static struct mlx5_flow_table *find_prev_chained_ft(struct fs_prio *prio) +{ + return find_closest_ft(prio, true); +} + struct mlx5_flow_table *mlx5_create_flow_table(struct mlx5_flow_namespace *ns, int prio, int max_fte) -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next V1 01/12] net/mlx5_core: Introduce flow steering autogrouped flow table
From: Maor Gottlieb When user add rule to autogrouped flow table, we search for flow group with the same match criteria, if we don't find such group then we create new flow group with the required match criteria and insert the rule to this group. We divide the flow table into required_groups + 1, in order to reserve a part of the flow table for rules which don't match any existing group. Signed-off-by: Maor Gottlieb Signed-off-by: Moni Shoua Signed-off-by: Matan Barak Signed-off-by: Saeed Mahameed --- drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 166 ++--- drivers/net/ethernet/mellanox/mlx5/core/fs_core.h |5 + include/linux/mlx5/fs.h |6 + 3 files changed, 158 insertions(+), 19 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c index f7d62fe..7d24bbb 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c @@ -85,6 +85,12 @@ static struct init_tree_node { } }; +enum fs_i_mutex_lock_class { + FS_MUTEX_GRANDPARENT, + FS_MUTEX_PARENT, + FS_MUTEX_CHILD +}; + static void del_rule(struct fs_node *node); static void del_flow_table(struct fs_node *node); static void del_flow_group(struct fs_node *node); @@ -119,10 +125,11 @@ static void tree_get_node(struct fs_node *node) atomic_inc(&node->refcount); } -static void nested_lock_ref_node(struct fs_node *node) +static void nested_lock_ref_node(struct fs_node *node, +enum fs_i_mutex_lock_class class) { if (node) { - mutex_lock_nested(&node->lock, SINGLE_DEPTH_NESTING); + mutex_lock_nested(&node->lock, class); atomic_inc(&node->refcount); } } @@ -481,9 +488,7 @@ struct mlx5_flow_table *mlx5_create_flow_table(struct mlx5_flow_namespace *ns, list_add_tail(&ft->node.list, &fs_prio->node.children); fs_prio->num_ft++; unlock_ref_node(&fs_prio->node); - return ft; - free_ft: kfree(ft); unlock_prio: @@ -491,8 +496,32 @@ unlock_prio: return ERR_PTR(err); } -struct mlx5_flow_group *mlx5_create_flow_group(struct mlx5_flow_table *ft, - u32 *fg_in) +struct mlx5_flow_table *mlx5_create_auto_grouped_flow_table(struct mlx5_flow_namespace *ns, + int prio, + int num_flow_table_entries, + int max_num_groups) +{ + struct mlx5_flow_table *ft; + + if (max_num_groups > num_flow_table_entries) + return ERR_PTR(-EINVAL); + + ft = mlx5_create_flow_table(ns, prio, num_flow_table_entries); + if (IS_ERR(ft)) + return ft; + + ft->autogroup.active = true; + ft->autogroup.required_groups = max_num_groups; + + return ft; +} + +/* Flow table should be locked */ +static struct mlx5_flow_group *create_flow_group_common(struct mlx5_flow_table *ft, + u32 *fg_in, + struct list_head + *prev_fg, + bool is_auto_fg) { struct mlx5_flow_group *fg; struct mlx5_core_dev *dev = get_dev(&ft->node); @@ -505,18 +534,33 @@ struct mlx5_flow_group *mlx5_create_flow_group(struct mlx5_flow_table *ft, if (IS_ERR(fg)) return fg; - lock_ref_node(&ft->node); err = mlx5_cmd_create_flow_group(dev, ft, fg_in, &fg->id); if (err) { kfree(fg); - unlock_ref_node(&ft->node); return ERR_PTR(err); } + + if (ft->autogroup.active) + ft->autogroup.num_groups++; /* Add node to tree */ - tree_init_node(&fg->node, 1, del_flow_group); + tree_init_node(&fg->node, !is_auto_fg, del_flow_group); tree_add_node(&fg->node, &ft->node); /* Add node to group list */ list_add(&fg->node.list, ft->node.children.prev); + + return fg; +} + +struct mlx5_flow_group *mlx5_create_flow_group(struct mlx5_flow_table *ft, + u32 *fg_in) +{ + struct mlx5_flow_group *fg; + + if (ft->autogroup.active) + return ERR_PTR(-EPERM); + + lock_ref_node(&ft->node); + fg = create_flow_group_common(ft, fg_in, &ft->node.children, false); unlock_ref_node(&ft->node); return fg; @@ -614,7 +658,63 @@ static struct fs_fte *create_fte(struct mlx5_flow_group *fg, return fte; } -/* Assuming parent fg(flow table) is locked */ +static struct mlx5_flow_group *create_autogroup(struct mlx5_flow_table *ft, +
[PATCH net-next V1 03/12] net/mlx5_core: Managing root flow table
From: Maor Gottlieb The root Flow Table for each Flow Table Type is defined, by default, as the Flow Table with level 0. In order not to use an empty flow tables and introduce new hops, but still preserve space for flow-tables that have a priority greater(lower number) than the current flow table, we introduce this new set root flow table command. This command tells the HW to start matching packets from the assigned root flow table. This command is used when we create new flow table with level lower than the current lowest flow table or it is the first flow table. Signed-off-by: Maor Gottlieb Signed-off-by: Moni Shoua Signed-off-by: Matan Barak Signed-off-by: Saeed Mahameed --- drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c | 18 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h |2 + drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 97 +++-- drivers/net/ethernet/mellanox/mlx5/core/fs_core.h |6 ++ include/linux/mlx5/mlx5_ifc.h | 31 +++- 5 files changed, 144 insertions(+), 10 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c index 5096f4f..d8b1195 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c @@ -38,6 +38,24 @@ #include "fs_cmd.h" #include "mlx5_core.h" +int mlx5_cmd_update_root_ft(struct mlx5_core_dev *dev, + struct mlx5_flow_table *ft) +{ + u32 in[MLX5_ST_SZ_DW(set_flow_table_root_in)]; + u32 out[MLX5_ST_SZ_DW(set_flow_table_root_out)]; + + memset(in, 0, sizeof(in)); + + MLX5_SET(set_flow_table_root_in, in, opcode, +MLX5_CMD_OP_SET_FLOW_TABLE_ROOT); + MLX5_SET(set_flow_table_root_in, in, table_type, ft->type); + MLX5_SET(set_flow_table_root_in, in, table_id, ft->id); + + memset(out, 0, sizeof(out)); + return mlx5_cmd_exec_check_status(dev, in, sizeof(in), out, + sizeof(out)); +} + int mlx5_cmd_create_flow_table(struct mlx5_core_dev *dev, enum fs_flow_table_type type, unsigned int level, unsigned int log_size, unsigned int *table_id) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h index f39304e..70d18ec 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h @@ -62,4 +62,6 @@ int mlx5_cmd_delete_fte(struct mlx5_core_dev *dev, struct mlx5_flow_table *ft, unsigned int index); +int mlx5_cmd_update_root_ft(struct mlx5_core_dev *dev, + struct mlx5_flow_table *ft); #endif diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c index c5a96e6..64bdb54 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c @@ -510,6 +510,29 @@ static struct mlx5_flow_table *find_prev_chained_ft(struct fs_prio *prio) return find_closest_ft(prio, true); } +static int update_root_ft_create(struct mlx5_flow_table *ft, struct fs_prio +*prio) +{ + struct mlx5_flow_root_namespace *root = find_root(&prio->node); + int min_level = INT_MAX; + int err; + + if (root->root_ft) + min_level = root->root_ft->level; + + if (ft->level >= min_level) + return 0; + + err = mlx5_cmd_update_root_ft(root->dev, ft); + if (err) + mlx5_core_warn(root->dev, "Update root flow table of id=%u failed\n", + ft->id); + else + root->root_ft = ft; + + return err; +} + struct mlx5_flow_table *mlx5_create_flow_table(struct mlx5_flow_namespace *ns, int prio, int max_fte) @@ -526,14 +549,15 @@ struct mlx5_flow_table *mlx5_create_flow_table(struct mlx5_flow_namespace *ns, return ERR_PTR(-ENODEV); } + mutex_lock(&root->chain_lock); fs_prio = find_prio(ns, prio); - if (!fs_prio) - return ERR_PTR(-EINVAL); - - lock_ref_node(&fs_prio->node); + if (!fs_prio) { + err = -EINVAL; + goto unlock_root; + } if (fs_prio->num_ft == fs_prio->max_ft) { err = -ENOSPC; - goto unlock_prio; + goto unlock_root; } ft = alloc_flow_table(find_next_free_level(fs_prio), @@ -541,7 +565,7 @@ struct mlx5_flow_table *mlx5_create_flow_table(struct mlx5_flow_namespace *ns, root->table_type); if (!ft) { err = -ENOMEM; - goto unlock_prio; +
[PATCH net-next V1 07/12] net/mlx5_core: Initialize namespaces only when supported by device
From: Maor Gottlieb Before we create the sub tree of a steering namespaces(kernel, bypass, leftovers) we check that the device has the required capabilities in order to create this subtree. Signed-off-by: Maor Gottlieb Signed-off-by: Moni Shoua Signed-off-by: Matan Barak Signed-off-by: Saeed Mahameed --- drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 70 ++-- 1 files changed, 49 insertions(+), 21 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c index e1282e8..96e287a 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c @@ -40,20 +40,17 @@ #define INIT_TREE_NODE_ARRAY_SIZE(...) (sizeof((struct init_tree_node[]){__VA_ARGS__}) /\ sizeof(struct init_tree_node)) -#define INIT_PRIO(min_level_val, max_ft_val,\ - ...) {.type = FS_TYPE_PRIO,\ +#define ADD_PRIO(min_level_val, max_ft_val, caps_val,\ +...) {.type = FS_TYPE_PRIO,\ .min_ft_level = min_level_val,\ .max_ft = max_ft_val,\ + .caps = caps_val,\ .children = (struct init_tree_node[]) {__VA_ARGS__},\ .ar_size = INIT_TREE_NODE_ARRAY_SIZE(__VA_ARGS__) \ } -#define ADD_PRIO(min_level_val, max_ft_val, ...)\ - INIT_PRIO(min_level_val, max_ft_val,\ - __VA_ARGS__)\ - #define ADD_FT_PRIO(max_ft_val, ...)\ - INIT_PRIO(0, max_ft_val,\ + ADD_PRIO(0, max_ft_val, {},\ __VA_ARGS__)\ #define ADD_NS(...) {.type = FS_TYPE_NAMESPACE,\ @@ -61,12 +58,26 @@ .ar_size = INIT_TREE_NODE_ARRAY_SIZE(__VA_ARGS__) \ } +#define INIT_CAPS_ARRAY_SIZE(...) (sizeof((long[]){__VA_ARGS__}) /\ + sizeof(long)) + +#define FS_CAP(cap) (__mlx5_bit_off(flow_table_nic_cap, cap)) + +#define FS_REQUIRED_CAPS(...) {.arr_sz = INIT_CAPS_ARRAY_SIZE(__VA_ARGS__), \ + .caps = (long[]) {__VA_ARGS__} } + #define KERNEL_MAX_FT 2 #define KENREL_MIN_LEVEL 2 + +struct node_caps { + size_t arr_sz; + long*caps; +}; static struct init_tree_node { enum fs_node_type type; struct init_tree_node *children; int ar_size; + struct node_caps caps; int min_ft_level; int prio; int max_ft; @@ -74,7 +85,7 @@ static struct init_tree_node { .type = FS_TYPE_NAMESPACE, .ar_size = 1, .children = (struct init_tree_node[]) { - ADD_PRIO(KENREL_MIN_LEVEL, 0, + ADD_PRIO(KENREL_MIN_LEVEL, 0, {}, ADD_NS(ADD_FT_PRIO(KERNEL_MAX_FT))), } }; @@ -1153,11 +1164,31 @@ static struct mlx5_flow_namespace *fs_create_namespace(struct fs_prio *prio) return ns; } -static int init_root_tree_recursive(int max_ft_level, struct init_tree_node *init_node, +#define FLOW_TABLE_BIT_SZ 1 +#define GET_FLOW_TABLE_CAP(dev, offset) \ + ((be32_to_cpu(*((__be32 *)(dev->hca_caps_cur[MLX5_CAP_FLOW_TABLE]) + \ + offset / 32)) >> \ + (32 - FLOW_TABLE_BIT_SZ - (offset & 0x1f))) & FLOW_TABLE_BIT_SZ) +static bool has_required_caps(struct mlx5_core_dev *dev, struct node_caps *caps) +{ + int i; + + for (i = 0; i < caps->arr_sz; i++) { + if (!GET_FLOW_TABLE_CAP(dev, caps->caps[i])) + return false; + } + return true; +} + +static int init_root_tree_recursive(struct mlx5_core_dev *dev, + struct init_tree_node *init_node, struct fs_node *fs_parent_node, struct init_tree_node *init_parent_node, int index) { + int max_ft_level = MLX5_CAP_FLOWTABLE(dev, + flow_table_properties_nic_receive. + max_ft_level); struct mlx5_flow_namespace *fs_ns; struct fs_prio *fs_prio; struct fs_node *base; @@ -1165,8 +1196,9 @@ static int init_root_tree_recursive(int max_ft_level, struct init_tree_node *ini int err; if (init_node->type == FS_TYPE_PRIO) { - if (init_node->min_ft_level > max_ft_level) - return -ENOTSUPP; + if ((init_node->min_ft_level > max_ft_level) || + !has_required_caps(dev, &init_node->caps)) + return 0; fs_get_obj(fs_ns, fs_parent_node); fs_prio = fs_create_prio(fs_ns, index, init_node->max_ft); @@ -1183,9 +1215,8 @@ static int init_root_tree_recursive(int max_ft_level, struct init_tree_node *ini return -EINVAL; } for (i = 0; i < init_node->ar_size; i++) { - err = init_root_tree_recursive(max_ft_level, -
Re: [PATCH v2] net: Add eth_platform_get_mac_address() helper.
On (01/06/16 16:33), David Miller wrote: > > Signed-off-by: David S. Miller Acked-by: Sowmini Varadhan > > As promised back in November, I'm commiting this into net-next > now. I'd work on the conversions of existing drivers, but I'd > rather someone with access to the hardware to so. I'll look into this. --Sowmini -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] net: Add eth_platform_get_mac_address() helper.
A repeating pattern in drivers has become to use OF node information and, if not found, platform specific host information to extract the ethernet address for a given device. Currently this is done with a call to of_get_mac_address() and then some ifdef'd stuff for SPARC. Consolidate this into a portable routine, and provide the arch_get_platform_mac_address() weak function hook for all architectures to implement if they want. Signed-off-by: David S. Miller --- As promised back in November, I'm commiting this into net-next now. I'd work on the conversions of existing drivers, but I'd rather someone with access to the hardware to so. arch/sparc/kernel/idprom.c | 7 +++ include/linux/etherdevice.h | 3 +++ net/ethernet/eth.c | 31 +++ 3 files changed, 41 insertions(+) diff --git a/arch/sparc/kernel/idprom.c b/arch/sparc/kernel/idprom.c index 6bd7501..f95dd11 100644 --- a/arch/sparc/kernel/idprom.c +++ b/arch/sparc/kernel/idprom.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include @@ -60,6 +61,12 @@ static void __init display_system_type(unsigned char machtype) { } #endif + +unsigned char *arch_get_platform_mac_address(void) +{ + return idprom->id_ethaddr; +} + /* Calculate the IDPROM checksum (xor of the data bytes). */ static unsigned char __init calc_idprom_cksum(struct idprom *idprom) { diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h index eb049c6..37ff4a6 100644 --- a/include/linux/etherdevice.h +++ b/include/linux/etherdevice.h @@ -29,6 +29,9 @@ #include #ifdef __KERNEL__ +struct device; +int eth_platform_get_mac_address(struct device *dev, u8 *mac_addr); +unsigned char *arch_get_platform_get_mac_address(void); u32 eth_get_headlen(void *data, unsigned int max_len); __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev); extern const struct header_ops eth_header_ops; diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c index 9e63f25..1038717 100644 --- a/net/ethernet/eth.c +++ b/net/ethernet/eth.c @@ -52,6 +52,8 @@ #include #include #include +#include +#include #include #include #include @@ -485,3 +487,32 @@ static int __init eth_offload_init(void) } fs_initcall(eth_offload_init); + +unsigned char * __weak arch_get_platform_mac_address(void) +{ + return NULL; +} + +int eth_platform_get_mac_address(struct device *dev, u8 *mac_addr) +{ + const unsigned char *addr; + struct device_node *dp; + + if (dev_is_pci(dev)) + dp = pci_device_to_OF_node(to_pci_dev(dev)); + else + dp = dev->of_node; + + addr = NULL; + if (dp) + addr = of_get_mac_address(dp); + if (!addr) + addr = arch_get_platform_mac_address(); + + if (!addr) + return -ENODEV; + + ether_addr_copy(mac_addr, addr); + return 0; +} +EXPORT_SYMBOL(eth_platform_get_mac_address); -- 2.4.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next] bpf: cleanup bpf_prog_run_{save,clear}_cb helpers
Move the details behind the cb[] access into a small helper to decouple and make them generic for bpf_prog_run_save_cb()/bpf_prog_run_clear_cb() that was introduced via commit ff936a04e5f2 ("bpf: fix cb access in socket filter programs"). Also add a comment to better clarify what is done in bpf_skb_cb(). Signed-off-by: Daniel Borkmann --- include/linux/filter.h | 39 +-- 1 file changed, 29 insertions(+), 10 deletions(-) diff --git a/include/linux/filter.h b/include/linux/filter.h index 294c3cd..30fa395 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -350,25 +350,43 @@ struct sk_filter { #define BPF_PROG_RUN(filter, ctx) (*filter->bpf_func)(ctx, filter->insnsi) +#define BPF_SKB_CB_LEN QDISC_CB_PRIV_LEN + +static inline u8 *bpf_skb_cb(struct sk_buff *skb) +{ + /* eBPF programs may read/write skb->cb[] area to transfer meta +* data between tail calls. Since this also needs to work with +* tc, that scratch memory is mapped to qdisc_skb_cb's data area. +* +* In some socket filter cases, the cb unfortunately needs to be +* saved/restored so that protocol specific skb->cb[] data won't +* be lost. In any case, due to unpriviledged eBPF programs +* attached to sockets, we need to clear the bpf_skb_cb() area +* to not leak previous contents to user space. +*/ + BUILD_BUG_ON(FIELD_SIZEOF(struct __sk_buff, cb) != BPF_SKB_CB_LEN); + BUILD_BUG_ON(FIELD_SIZEOF(struct __sk_buff, cb) != +FIELD_SIZEOF(struct qdisc_skb_cb, data)); + + return qdisc_skb_cb(skb)->data; +} + static inline u32 bpf_prog_run_save_cb(const struct bpf_prog *prog, struct sk_buff *skb) { - u8 *cb_data = qdisc_skb_cb(skb)->data; - u8 saved_cb[QDISC_CB_PRIV_LEN]; + u8 *cb_data = bpf_skb_cb(skb); + u8 cb_saved[BPF_SKB_CB_LEN]; u32 res; - BUILD_BUG_ON(FIELD_SIZEOF(struct __sk_buff, cb) != -QDISC_CB_PRIV_LEN); - if (unlikely(prog->cb_access)) { - memcpy(saved_cb, cb_data, sizeof(saved_cb)); - memset(cb_data, 0, sizeof(saved_cb)); + memcpy(cb_saved, cb_data, sizeof(cb_saved)); + memset(cb_data, 0, sizeof(cb_saved)); } res = BPF_PROG_RUN(prog, skb); if (unlikely(prog->cb_access)) - memcpy(cb_data, saved_cb, sizeof(saved_cb)); + memcpy(cb_data, cb_saved, sizeof(cb_saved)); return res; } @@ -376,10 +394,11 @@ static inline u32 bpf_prog_run_save_cb(const struct bpf_prog *prog, static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog, struct sk_buff *skb) { - u8 *cb_data = qdisc_skb_cb(skb)->data; + u8 *cb_data = bpf_skb_cb(skb); if (unlikely(prog->cb_access)) - memset(cb_data, 0, QDISC_CB_PRIV_LEN); + memset(cb_data, 0, BPF_SKB_CB_LEN); + return BPF_PROG_RUN(prog, skb); } -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net] Driver: Vmxnet3: Fix regression caused by 5738a09
From: Shrikrishna Khare Date: Wed, 6 Jan 2016 10:44:27 -0800 > Reported-by: Bingkuo Liu > Signed-off-by: Shrikrishna Khare Applied, thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 2/2] geneve: break dependency to network drivers
On 06.01.2016 22:01, Jesse Gross wrote: On Wed, Jan 6, 2016 at 12:25 PM, Hannes Frederic Sowa wrote: The refreshes from each module are completely synchronous and don't get interleaved, so as long as the driver is correctly handling the locking internally rtnl lock shouldn't be needed. But as I don't know too much about driver developing I can revisit this. As a advantage I see that the driver developers don't need to worry about the rtnl lock at all when adding new events. Is this realistic? I don't think that there is much savings to be had by avoiding RTNL since the majority of interactions that the driver has with the stack involve holding it anyways. In order to do this safely without RTNL we need to have a lock in each driver. I don't think that this is safely handled in all cases today and is likely to get worse in the future. I also noticed that Geneve actually doesn't hold any special lock while calling into drivers from geneve_get_rx_port() so it is de-facto relying on RTNL. All other operations in the Geneve driver are protected by RTNL currently, so we would need to introduce a new lock to handle this as well. In effect, it seems like people are implicitly assuming that these operations are covered by RTNL since most similar things are. Okay, on top of the v1 version I will check all drivers and add necessary rtnl_locks. Hopefully it works out and I don't have to defer calls into working queues in the drivers first. Thanks, Hannes -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation
On Wed, Jan 06, 2016 at 11:11:41PM +0300, Konstantin Khlebnikov wrote: > On Wed, Jan 6, 2016 at 10:59 PM, Cong Wang wrote: > > On Wed, Jan 6, 2016 at 11:15 AM, Konstantin Khlebnikov > > wrote: > >> Looks like this happens because ip_options_fragment() relies on > >> correct ip options length in ip control block in skb. But in > >> ip_finish_output_gso() control block in segments is reused by > >> skb_gso_segment(). following ip_fragment() sees some garbage. > >> > >> In my case there was no ip options but length becomes non-zero and > >> ip_options_fragment() picked some bytes from payload and decides to > >> fill huge range with IPOPT_NOOP (1). One of that ones flipped nr_frags > >> in skb_shared_info at the end of data =) > >> > > > > Hmm, it looks like SKB_GSO_CB should be cleared after skb_gso_segment() > > since all the gso information should be saved in shared_info after it > > finishes. > > > > Does a memset(0) on SKB_GSO_CB after skb_gso_segment() work as well? > > This will break present logic around ip_options_fragment() - it clears > options from > second and following fragments. With zeroed cb it will do nothing. > > ip_options_fragment() can get required information directly from ip header but > it also resets fields in IPCB -- probably it should stay valid here > and somebody else will use it later. > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html I have hit this as well, this fixes it for me on an older kernel. Can you try it on latest kernel? diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index d8a1745..f44bc91 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -216,6 +216,7 @@ static int ip_finish_output_gso(struct sk_buff *skb) netdev_features_t features; struct sk_buff *segs; int ret = 0; + struct inet_skb_parm ipcb; if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb)) return ip_finish_output2(skb); @@ -227,6 +228,10 @@ static int ip_finish_output_gso(struct sk_buff *skb) * 2) skb arrived via virtio-net, we thus get TSO/GSO skbs directly * from host network stack. */ + /* We need to save IPCB here because skb_gso_segment will use +* SKB_GSO_CB. +*/ + ipcb = *IPCB(skb); features = netif_skb_features(skb); segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK); if (IS_ERR_OR_NULL(segs)) { @@ -241,6 +246,7 @@ static int ip_finish_output_gso(struct sk_buff *skb) int err; segs->next = NULL; + *IPCB(segs) = ipcb; err = ip_fragment(segs, ip_finish_output2); if (err && ret == 0) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 2/2] geneve: break dependency to network drivers
On Wed, Jan 6, 2016 at 12:25 PM, Hannes Frederic Sowa wrote: > On 06.01.2016 20:52, Jesse Gross wrote: >> On Wed, Jan 6, 2016 at 10:48 AM, Hannes Frederic Sowa >> wrote: >>> On 06.01.2016 19:00, Jesse Gross wrote: Unfortunately, I don't think that we can assume that RTNL is held here. It actually is for the drivers that implement Geneve at this point but not in all cases for VXLAN. For example, ixgbe refreshes the offloads in a service task in addition to when it is opened. There's only a couple instances of things like this, so I guess it's probably not too hard to through and make sure that we hold RTNL in those cases. >>> >>> >>> >>> Hmm, I am tempted to switch over to the netevent atomic notifier chain >>> and >>> install those events there. It does not need rtnl lock at all, so we can >>> preserve the current semantics. What do you think? >> >> >> I think that holding RTNL while we do these updates is actually the >> right thing to do. The current situation of having calls from >> different protocols protected by different locks is not really a great >> model given that at the driver level these are usually shared data >> structures. RTNL is already held in the majority of cases already, so >> I think it is better to just convert the rest. > > > The refreshes from each module are completely synchronous and don't get > interleaved, so as long as the driver is correctly handling the locking > internally rtnl lock shouldn't be needed. But as I don't know too much about > driver developing I can revisit this. > > As a advantage I see that the driver developers don't need to worry about > the rtnl lock at all when adding new events. Is this realistic? I don't think that there is much savings to be had by avoiding RTNL since the majority of interactions that the driver has with the stack involve holding it anyways. In order to do this safely without RTNL we need to have a lock in each driver. I don't think that this is safely handled in all cases today and is likely to get worse in the future. I also noticed that Geneve actually doesn't hold any special lock while calling into drivers from geneve_get_rx_port() so it is de-facto relying on RTNL. All other operations in the Geneve driver are protected by RTNL currently, so we would need to introduce a new lock to handle this as well. In effect, it seems like people are implicitly assuming that these operations are covered by RTNL since most similar things are. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next] net: move ndo_features_check() close to ndo_start_xmit()
From: Eric Dumazet Date: Wed, 06 Jan 2016 06:53:50 -0800 > From: Eric Dumazet > > TX fast path uses ndo_start_xmit(), ndo_features_check() and > ndo_select_queue(). > > Move ndo_features_check() close to ndo_start_xmit() to increase > data locality. > > All "struct net_device_ops" should now be using C99 initializers. > > Signed-off-by: Eric Dumazet Looks good, applied, thanks Eric. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net 0/2] vxlan: Set a large MTU on ovs-created vxlan devices
From: David Wragg Date: Wed, 6 Jan 2016 13:33:04 + > Prior to 4.3, openvswitch vxlan vports could transmit vxlan packets of > any size, constrained only by the ability to transmit the resulting > UDP packets. 4.3 introduced vxlan netdevs corresponding to vxlan > vports. These netdevs have an MTU, which limits the size of a packet > that can be successfully vxlan-encapsulated. The default value for > this MTU is 1500, which is awkwardly small, and leads to a conspicuous > change in behaviour for userspace. > > These two patches set the MTU on openvswitch-crated vxlan devices to > be 65465 (the maximum IP packet size minus the vxlan-on-IPv6 > overhead), effectively restoring the behaviour prior to 4.3. In order > to accomplish this, the first patch removes the MTU constraint of 1500 > for vxlan netdevs without an underlying device. Is this really the right thing to do? Won't we get a lot of fragmentation by using such a large MTU, especially since you're making it the default for OVS setups? Things like path MTU discovery hinge strongly upon accurate MTU settings. Otherwise they won't function properly. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net: qmi_wwan: Add WeTelecom-WPD600N
From: Kristian Evensen Date: Wed, 6 Jan 2016 14:15:50 +0100 > From: Kristian Evensen > > The WeTelecom-WPD600N is an LTE module that, in addition to supporting most > "normal" bands, also supports LTE over 450MHz. Manual testing showed that > only interface number three replies to QMI messages. > > Cc: Bjørn Mork > Signed-off-by: Kristian Evensen Applied, thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next] ixgbe: Fix for RAR0 not being set to default MAC addr
On (01/06/16 12:05), Tushar Dave wrote: > commit c9f53e63c208 ("ixgbe: Refactor MAC address configuration code") > introduced code that doesn't set HW register RAR0 to default mac address > but FF:FF:FF:FF:FF:FF. Due to this, ixgbe HW discards all incoming packets > that doesn't have destination mac address equals to FF:FF:FF:FF:FF:FF. > > This commit sets RAR0 correctly to default HW mac address. > > Signed-off-by: Tushar Dave Tested-by: Sowmini Varadhan -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net] tcp: fix zero cwnd in tcp_cwnd_reduction
Patch 3759824da87b ("tcp: PRR uses CRB mode by default and SS mode conditionally") introduced a bug that cwnd may become 0 when both inflight and sndcnt are 0 (cwnd = inflight + sndcnt). This may lead to a div-by-zero if the connection starts another cwnd reduction phase by setting tp->prior_cwnd to the current cwnd (0) in tcp_init_cwnd_reduction(). To prevent this we skip PRR operation when nothing is acked or sacked. Then cwnd must be positive in all cases as long as ssthresh is positive: 1) The proportional reduction mode inflight > ssthresh > 0 2) The reduction bound mode a) inflight == ssthresh > 0 b) inflight < ssthresh sndcnt > 0 since newly_acked_sacked > 0 and inflight < ssthresh Therefore in all cases inflight and sndcnt can not both be 0. We check invalid tp->prior_cwnd to avoid potential div0 bugs. In reality this bug is triggered only with a sequence of less common events. For example, the connection is terminating an ECN-triggered cwnd reduction with an inflight 0, then it receives reordered/old ACKs or DSACKs from prior transmission (which acks nothing). Or the connection is in fast recovery stage that marks everything lost, but fails to retransmit due to local issues, then receives data packets from other end which acks nothing. Fixes: 3759824da87b ("tcp: PRR uses CRB mode by default and SS mode conditionally") Reported-by: Oleksandr Natalenko Signed-off-by: Yuchung Cheng Signed-off-by: Neal Cardwell Signed-off-by: Eric Dumazet --- net/ipv4/tcp_input.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 2d656ee..d4c5115 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2478,6 +2478,9 @@ static void tcp_cwnd_reduction(struct sock *sk, const int prior_unsacked, int newly_acked_sacked = prior_unsacked - (tp->packets_out - tp->sacked_out); + if (newly_acked_sacked <= 0 || WARN_ON_ONCE(!tp->prior_cwnd)) + return; + tp->prr_delivered += newly_acked_sacked; if (delta < 0) { u64 dividend = (u64)tp->snd_ssthresh * tp->prr_delivered + -- 2.6.0.rc2.230.g3dd15c0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch -next] fsl/fman: use the ALIGN() macro
On Wed, Jan 06, 2016 at 03:27:38PM -0500, David Miller wrote: > From: Dan Carpenter > Date: Wed, 6 Jan 2016 13:03:08 +0300 > > > diff --git a/drivers/net/ethernet/freescale/fman/fman_sp.c > > b/drivers/net/ethernet/freescale/fman/fman_sp.c > > index f9e7aa3..b527da1 100644 > > --- a/drivers/net/ethernet/freescale/fman/fman_sp.c > > +++ b/drivers/net/ethernet/freescale/fman/fman_sp.c > > @@ -92,11 +92,8 @@ int fman_sp_build_buffer_struct(struct > > fman_sp_int_context_data_copy * > > u32 tmp; > > > > /* Align start of internal context data to 16 byte */ > > - int_context_data_copy->ext_buf_offset = (u16) > > - ((buffer_prefix_content->priv_data_size & (OFFSET_UNITS - 1)) ? > > - ((buffer_prefix_content->priv_data_size + OFFSET_UNITS) & > > - ~(u16)(OFFSET_UNITS - 1)) : > > - buffer_prefix_content->priv_data_size); > > + int_context_data_copy->ext_buf_offset = > > + (u16)ALIGN(buffer_prefix_content->priv_data_size, 16); > > Why are you using '16' instead of OFFSET_UNITS? I made a brain fart. :/ I'll resend. Thanks. regards, dan carpenter -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] arm64: net: bpf: don't BUG() on large shifts
On Tue, Jan 05, 2016 at 09:55:58AM -0800, Alexei Starovoitov wrote: > this one is better to be addressed in verifier instead of eBPF JITs. > Please reject it in check_alu_op() instead. AFAICS the eBPF verifier is not called on the eBPF filters generated by the BPF->eBPF conversion in net/core/filter.c, so performing this check only in check_alu_op() will be insufficient. So I think we'd need to add this check to bpf_check_classic() too. Or am I missing something? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] mkiss: fix scribble on freed memory
Hello. On 1/6/2016 5:55 PM, Alan wrote: commit d79f16c046086f4fe0d42184a458e187464eb83e fixed a user triggerable scripts/checkpatch.pl now enforces certain format of the commit citing. scribble on free memory but added a new one which allows the user to scribble even more and user controlled data into freed space. As with 6pack we need to halt the queue before we free the buffers, because the transmit logic is not protected by the semaphore. Signed-off-by: Alan Cox [...] MBR, Sergei -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch -next] fsl/fman: fix the pause_time test
From: Dan Carpenter Date: Wed, 6 Jan 2016 12:58:09 +0300 > pause_time is unsigned so it can't be less than zero. The bug means > that we allow invalid pause-times. > > Fixes: 57ba4c9b56d8 ('fsl/fman: Add FMan MAC support') > Signed-off-by: Dan Carpenter Applied. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch -next] fsl/fman: double free on probe failure
From: Dan Carpenter Date: Wed, 6 Jan 2016 12:59:10 +0300 > "priv" is allocated with devm_kzalloc() so freeing it here with kfree() > will lead to a double free. > > Fixes: 3933961682a3 ('fsl/fman: Add FMan MAC driver') > Signed-off-by: Dan Carpenter Applied. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch -next] fsl/fman: use the ALIGN() macro
From: Dan Carpenter Date: Wed, 6 Jan 2016 13:03:08 +0300 > diff --git a/drivers/net/ethernet/freescale/fman/fman_sp.c > b/drivers/net/ethernet/freescale/fman/fman_sp.c > index f9e7aa3..b527da1 100644 > --- a/drivers/net/ethernet/freescale/fman/fman_sp.c > +++ b/drivers/net/ethernet/freescale/fman/fman_sp.c > @@ -92,11 +92,8 @@ int fman_sp_build_buffer_struct(struct > fman_sp_int_context_data_copy * > u32 tmp; > > /* Align start of internal context data to 16 byte */ > - int_context_data_copy->ext_buf_offset = (u16) > - ((buffer_prefix_content->priv_data_size & (OFFSET_UNITS - 1)) ? > - ((buffer_prefix_content->priv_data_size + OFFSET_UNITS) & > - ~(u16)(OFFSET_UNITS - 1)) : > - buffer_prefix_content->priv_data_size); > + int_context_data_copy->ext_buf_offset = > + (u16)ALIGN(buffer_prefix_content->priv_data_size, 16); Why are you using '16' instead of OFFSET_UNITS? Unless you can come up with a good reason, please use OFFSET_UNITS instead of a magic constant. Thanks! -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 2/2] geneve: break dependency to network drivers
Hi Jesse, hmpf, I saw your mail too late and send out another series just now. On 06.01.2016 20:52, Jesse Gross wrote: On Wed, Jan 6, 2016 at 10:48 AM, Hannes Frederic Sowa wrote: On 06.01.2016 19:00, Jesse Gross wrote: On Wed, Jan 6, 2016 at 7:41 AM, Hannes Frederic Sowa wrote: diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c index 24b077a32c1c9c..548925d1571cb1 100644 --- a/drivers/net/geneve.c +++ b/drivers/net/geneve.c +static int geneve_notifier(struct notifier_block *unused, + unsigned long event, void *ptr) +{ + struct net_device *dev = netdev_notifier_info_to_dev(ptr); + + switch (event) { + case NETDEV_REFRESH_OFFLOAD_VXLAN: + geneve_notify_refresh_netdev(dev); Presumably this should be NETDEV_REFRESH_OFFLOAD_GENEVE, not VXLAN. However, rather than having a notifier for each protocol, it seems like it might be cleaner to just have a single one that triggers for all protocols and drivers that don't have the corresponding NDO wouldn't get called, similar to what happens when the port gets added in the first place. Ah, thanks for noticing the typo. The reason why I went with several types is that I didn't want to change the behavior and wasn't sure if driver tested with reoccurring offload refreshes to the driver. What you described was my first patch but because I couldn't see if that works for all drivers I went this way. Hmm, I see what you mean but I think it should be safe. All drivers that have both Geneve and VXLAN offloads make calls to refresh them back to back, which is a pattern that I would expect to continue. In that case, having a single notifier that triggers multiple protocols would have the same effect and is simpler. My new series adds a new netdev_notifier which is atomic and can deal with non locked rtnl events. I thought maybe it would be useful for future callbacks, too. I am currently looking into ptp and reduce the dependency there, if possible. diff --git a/include/net/geneve.h b/include/net/geneve.h index e6c23dc765f7ec..36245115143652 100644 --- a/include/net/geneve.h +++ b/include/net/geneve.h [...] static inline void geneve_get_rx_port(struct net_device *netdev) { + call_netdevice_notifiers(NETDEV_REFRESH_OFFLOAD_GENEVE, netdev); } Unfortunately, I don't think that we can assume that RTNL is held here. It actually is for the drivers that implement Geneve at this point but not in all cases for VXLAN. For example, ixgbe refreshes the offloads in a service task in addition to when it is opened. There's only a couple instances of things like this, so I guess it's probably not too hard to through and make sure that we hold RTNL in those cases. Hmm, I am tempted to switch over to the netevent atomic notifier chain and install those events there. It does not need rtnl lock at all, so we can preserve the current semantics. What do you think? I think that holding RTNL while we do these updates is actually the right thing to do. The current situation of having calls from different protocols protected by different locks is not really a great model given that at the driver level these are usually shared data structures. RTNL is already held in the majority of cases already, so I think it is better to just convert the rest. The refreshes from each module are completely synchronous and don't get interleaved, so as long as the driver is correctly handling the locking internally rtnl lock shouldn't be needed. But as I don't know too much about driver developing I can revisit this. As a advantage I see that the driver developers don't need to worry about the rtnl lock at all when adding new events. Is this realistic? Thanks, Hannes -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next v2 3/3] geneve: use netdev_atomic notifier chain to remove dependency from drivers
Signed-off-by: Hannes Frederic Sowa --- drivers/net/geneve.c | 30 +++--- include/linux/netdevice.h | 1 + include/net/geneve.h | 6 ++ 3 files changed, 30 insertions(+), 7 deletions(-) diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c index 24b077a32c1c9c..e04a7678438999 100644 --- a/drivers/net/geneve.c +++ b/drivers/net/geneve.c @@ -1110,7 +1110,7 @@ static struct device_type geneve_type = { * supply the listening GENEVE udp ports. Callers are expected * to implement the ndo_add_geneve_port. */ -void geneve_get_rx_port(struct net_device *dev) +static void geneve_netdev_refresh_offload(struct net_device *dev) { struct net *net = dev_net(dev); struct geneve_net *gn = net_generic(net, geneve_net_id); @@ -1128,7 +1128,6 @@ void geneve_get_rx_port(struct net_device *dev) } rcu_read_unlock(); } -EXPORT_SYMBOL_GPL(geneve_get_rx_port); /* Initialize the device structure. */ static void geneve_setup(struct net_device *dev) @@ -1427,6 +1426,24 @@ static struct rtnl_link_ops geneve_link_ops __read_mostly = { .fill_info = geneve_fill_info, }; +static int geneve_atomic_event(struct notifier_block *unused, + unsigned long event, void *ptr) +{ + struct net_device *dev = ptr; + + switch (event) { + case NETDEV_OFFLOAD_REFRESH_GENEVE: + geneve_netdev_refresh_offload(dev); + break; + } + + return NOTIFY_DONE; +} + +static struct notifier_block geneve_atomic_notifier_block __read_mostly = { + .notifier_call = geneve_atomic_event, +}; + struct net_device *geneve_dev_create_fb(struct net *net, const char *name, u8 name_assign_type, u16 dst_port) { @@ -1502,11 +1519,17 @@ static int __init geneve_init_module(void) if (rc) goto out1; - rc = rtnl_link_register(&geneve_link_ops); + rc = register_netdev_atomic_notifier(&geneve_atomic_notifier_block); if (rc) goto out2; + rc = rtnl_link_register(&geneve_link_ops); + if (rc) + goto out3; + return 0; +out3: + unregister_netdev_atomic_notifier(&geneve_atomic_notifier_block); out2: unregister_pernet_subsys(&geneve_net_ops); out1: @@ -1517,6 +1540,7 @@ late_initcall(geneve_init_module); static void __exit geneve_cleanup_module(void) { rtnl_link_unregister(&geneve_link_ops); + unregister_netdev_atomic_notifier(&geneve_atomic_notifier_block); unregister_pernet_subsys(&geneve_net_ops); } module_exit(geneve_cleanup_module); diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index b2b943b1896c57..21cdb1afc5cd20 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2225,6 +2225,7 @@ int call_netdevice_notifiers(unsigned long val, struct net_device *dev); enum netdev_atomic_callback_type { NETDEV_OFFLOAD_REFRESH_VXLAN = 0x1UL, + NETDEV_OFFLOAD_REFRESH_GENEVE = 0x2UL, }; int register_netdev_atomic_notifier(struct notifier_block *nb); diff --git a/include/net/geneve.h b/include/net/geneve.h index e6c23dc765f7ec..e09476ed8feb20 100644 --- a/include/net/geneve.h +++ b/include/net/geneve.h @@ -62,13 +62,11 @@ struct genevehdr { struct geneve_opt options[]; }; -#if IS_ENABLED(CONFIG_GENEVE) -void geneve_get_rx_port(struct net_device *netdev); -#else static inline void geneve_get_rx_port(struct net_device *netdev) { + call_netdev_atomic_notifiers(NETDEV_OFFLOAD_REFRESH_GENEVE, +netdev); } -#endif #ifdef CONFIG_INET struct net_device *geneve_dev_create_fb(struct net *net, const char *name, -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next v2 2/3] vxlan: use netdev_atomic notifier chain to remove dependency from drivers
Signed-off-by: Hannes Frederic Sowa --- drivers/net/vxlan.c | 35 ++- include/linux/netdevice.h | 2 +- include/net/vxlan.h | 6 ++ 3 files changed, 33 insertions(+), 10 deletions(-) diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c index fecf7b6c732e96..47da2045ca1b19 100644 --- a/drivers/net/vxlan.c +++ b/drivers/net/vxlan.c @@ -2468,7 +2468,7 @@ static struct device_type vxlan_type = { * supply the listening VXLAN udp ports. Callers are expected * to implement the ndo_add_vxlan_port. */ -void vxlan_get_rx_port(struct net_device *dev) +static void vxlan_netdev_refresh_offload(struct net_device *dev) { struct vxlan_sock *vs; struct net *net = dev_net(dev); @@ -2488,7 +2488,6 @@ void vxlan_get_rx_port(struct net_device *dev) } spin_unlock(&vn->sock_lock); } -EXPORT_SYMBOL_GPL(vxlan_get_rx_port); /* Initialize the device structure. */ static void vxlan_setup(struct net_device *dev) @@ -3180,6 +3179,24 @@ static struct notifier_block vxlan_notifier_block __read_mostly = { .notifier_call = vxlan_lowerdev_event, }; +static int vxlan_atomic_event(struct notifier_block *unused, + unsigned long event, void *ptr) +{ + struct net_device *dev = ptr; + + switch (event) { + case NETDEV_OFFLOAD_REFRESH_VXLAN: + vxlan_netdev_refresh_offload(dev); + break; + } + + return NOTIFY_DONE; +} + +static struct notifier_block vxlan_atomic_notifier_block __read_mostly = { + .notifier_call = vxlan_atomic_event, +}; + static __net_init int vxlan_init_net(struct net *net) { struct vxlan_net *vn = net_generic(net, vxlan_net_id); @@ -3241,17 +3258,24 @@ static int __init vxlan_init_module(void) if (rc) goto out1; - rc = register_netdevice_notifier(&vxlan_notifier_block); + rc = register_netdev_atomic_notifier(&vxlan_atomic_notifier_block); if (rc) goto out2; - rc = rtnl_link_register(&vxlan_link_ops); + rc = register_netdevice_notifier(&vxlan_notifier_block); if (rc) goto out3; + rc = rtnl_link_register(&vxlan_link_ops); + if (rc) + goto out4; + return 0; -out3: + +out4: unregister_netdevice_notifier(&vxlan_notifier_block); +out3: + unregister_netdev_atomic_notifier(&vxlan_atomic_notifier_block); out2: unregister_pernet_subsys(&vxlan_net_ops); out1: @@ -3263,6 +3287,7 @@ late_initcall(vxlan_init_module); static void __exit vxlan_cleanup_module(void) { rtnl_link_unregister(&vxlan_link_ops); + unregister_netdev_atomic_notifier(&vxlan_atomic_notifier_block); unregister_netdevice_notifier(&vxlan_notifier_block); destroy_workqueue(vxlan_wq); unregister_pernet_subsys(&vxlan_net_ops); diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index fb6064c88fc259..b2b943b1896c57 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2224,7 +2224,7 @@ netdev_notifier_info_to_dev(const struct netdev_notifier_info *info) int call_netdevice_notifiers(unsigned long val, struct net_device *dev); enum netdev_atomic_callback_type { - PLACEHOLDER = 0x1UL, + NETDEV_OFFLOAD_REFRESH_VXLAN = 0x1UL, }; int register_netdev_atomic_notifier(struct notifier_block *nb); diff --git a/include/net/vxlan.h b/include/net/vxlan.h index 0fb86442544b26..cebe08569b5129 100644 --- a/include/net/vxlan.h +++ b/include/net/vxlan.h @@ -242,13 +242,11 @@ static inline netdev_features_t vxlan_features_check(struct sk_buff *skb, /* IPv6 header + UDP + VXLAN + Ethernet header */ #define VXLAN6_HEADROOM (40 + 8 + 8 + 14) -#if IS_ENABLED(CONFIG_VXLAN) -void vxlan_get_rx_port(struct net_device *netdev); -#else static inline void vxlan_get_rx_port(struct net_device *netdev) { + call_netdev_atomic_notifiers(NETDEV_OFFLOAD_REFRESH_VXLAN, +netdev); } -#endif static inline unsigned short vxlan_get_sk_family(struct vxlan_sock *vs) { -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next v2 0/3] netdev: add netdev atomic callchain to avoid driver depending on vxlan/geneve
This small series adds a netdev_atomic notifier chain so that drivers can callback to the core network stack requesting to reprogram the offloads without depending on the geneve or vxlan modules. * Result: $ cd drivers/net/ethernet/ $ find . -name '*.ko' | xargs modinfo | egrep '^depends:.*(vxlan|geneve)' | wc -l 0 Hannes Frederic Sowa (3): netdev: add atomic netdev callback chain vxlan: use netdev_atomic notifier chain to remove dependency from drivers geneve: use netdev_atomic notifier chain to remove dependency from drivers drivers/net/geneve.c | 30 +++--- drivers/net/vxlan.c | 35 ++- include/linux/netdevice.h | 10 ++ include/net/geneve.h | 6 ++ include/net/vxlan.h | 6 ++ net/core/dev.c| 24 6 files changed, 95 insertions(+), 16 deletions(-) -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next v2 1/3] netdev: add atomic netdev callback chain
This atomic notifier is later used by drivers to deliver callbacks to the core network stack without requireing rtnl lock to be held. Signed-off-by: Hannes Frederic Sowa --- include/linux/netdevice.h | 9 + net/core/dev.c| 24 2 files changed, 33 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index c20b814e46a072..fb6064c88fc259 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2223,6 +2223,15 @@ netdev_notifier_info_to_dev(const struct netdev_notifier_info *info) int call_netdevice_notifiers(unsigned long val, struct net_device *dev); +enum netdev_atomic_callback_type { + PLACEHOLDER = 0x1UL, +}; + +int register_netdev_atomic_notifier(struct notifier_block *nb); +int unregister_netdev_atomic_notifier(struct notifier_block *nb); +int call_netdev_atomic_notifiers(enum netdev_atomic_callback_type type, +struct net_device *dev); + extern rwlock_tdev_base_lock; /* Device list lock */ diff --git a/net/core/dev.c b/net/core/dev.c index 914b4a24c65436..e96248d955f1aa 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -7840,6 +7840,30 @@ define_netdev_printk_level(netdev_warn, KERN_WARNING); define_netdev_printk_level(netdev_notice, KERN_NOTICE); define_netdev_printk_level(netdev_info, KERN_INFO); +static ATOMIC_NOTIFIER_HEAD(netdev_atomic_notifier_chain); + +int register_netdev_atomic_notifier(struct notifier_block *nb) +{ + return atomic_notifier_chain_register(&netdev_atomic_notifier_chain, + nb); +} +EXPORT_SYMBOL_GPL(register_netdev_atomic_notifier); + +int unregister_netdev_atomic_notifier(struct notifier_block *nb) +{ + return atomic_notifier_chain_unregister(&netdev_atomic_notifier_chain, + nb); +} +EXPORT_SYMBOL_GPL(unregister_netdev_atomic_notifier); + +int call_netdev_atomic_notifiers(enum netdev_atomic_callback_type type, +struct net_device *dev) +{ + return atomic_notifier_call_chain(&netdev_atomic_notifier_chain, + type, dev); +} +EXPORT_SYMBOL_GPL(call_netdev_atomic_notifiers); + static void __net_exit netdev_exit(struct net *net) { kfree(net->dev_name_head); -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation
On Wed, Jan 6, 2016 at 10:59 PM, Cong Wang wrote: > On Wed, Jan 6, 2016 at 11:15 AM, Konstantin Khlebnikov > wrote: >> Looks like this happens because ip_options_fragment() relies on >> correct ip options length in ip control block in skb. But in >> ip_finish_output_gso() control block in segments is reused by >> skb_gso_segment(). following ip_fragment() sees some garbage. >> >> In my case there was no ip options but length becomes non-zero and >> ip_options_fragment() picked some bytes from payload and decides to >> fill huge range with IPOPT_NOOP (1). One of that ones flipped nr_frags >> in skb_shared_info at the end of data =) >> > > Hmm, it looks like SKB_GSO_CB should be cleared after skb_gso_segment() > since all the gso information should be saved in shared_info after it > finishes. > > Does a memset(0) on SKB_GSO_CB after skb_gso_segment() work as well? This will break present logic around ip_options_fragment() - it clears options from second and following fragments. With zeroed cb it will do nothing. ip_options_fragment() can get required information directly from ip header but it also resets fields in IPCB -- probably it should stay valid here and somebody else will use it later. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch -next] mlxsw: core: remove an unnecessary condition
From: Dan Carpenter Date: Wed, 6 Jan 2016 12:56:30 +0300 > We checked "err" on the lines before so we know it's zero here. > > These cause a static checker warning because checking known things can > indicate a bug. Maybe there is a missing assignment or we are checking > the wrong variable. > > Signed-off-by: Dan Carpenter Applied, thanks Dan. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] mkiss: fix scribble on freed memory
From: Alan Date: Wed, 06 Jan 2016 14:55:02 + > commit d79f16c046086f4fe0d42184a458e187464eb83e fixed a user triggerable > scribble on free memory but added a new one which allows the user to > scribble even more and user controlled data into freed space. > > As with 6pack we need to halt the queue before we free the buffers, because > the transmit logic is not protected by the semaphore. > > Signed-off-by: Alan Cox Applied and queued up for -stable, thanks Alan. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next] ixgbe: Fix for RAR0 not being set to default MAC addr
commit c9f53e63c208 ("ixgbe: Refactor MAC address configuration code") introduced code that doesn't set HW register RAR0 to default mac address but FF:FF:FF:FF:FF:FF. Due to this, ixgbe HW discards all incoming packets that doesn't have destination mac address equals to FF:FF:FF:FF:FF:FF. This commit sets RAR0 correctly to default HW mac address. Signed-off-by: Tushar Dave --- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index ea9537d..7e6ce13 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -9084,6 +9084,8 @@ skip_sriov: goto err_sw_init; } + /* Set hw->addr to permanent mac address*/ + ether_addr_copy(hw->mac.addr, hw->mac.perm_addr); ixgbe_mac_set_default_filter(adapter); setup_timer(&adapter->service_timer, &ixgbe_service_timer, -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 net-next] ethernet/atheros/alx: sanitize buffer sizing and padding
From: Jarod Wilson Date: Wed, 6 Jan 2016 09:36:37 -0500 > This is based on the work done by Przemek Rudy in bug 70761 at > bugzilla.kernel.org, but with some work done to disentagle and clarify > things a bit. > > Similar to Przemek's work and other drivers, we're adding a padding of 16 > here, but we're also disentangling mtu size calculations from max buffer > size calculations a bit, and adding ETH_HLEN to the value written into > ALX_MTU. Hopefully, with a bit more consistency and clarity, things behave > better here. Sadly, I can only test in my alx-driven E2200, which worked > just fine before this patch. > > In comment #58 of bug 70761, Eugene A. Shatokhin reports that this patch > does help considerably for a ROSA Linux user of his with an AR8162 network > adapter when patched into a 4.1.x-based kernel, with several days of > normal operation where wired network previously wasn't usable without > setting MTU to 9000 as a work-around. > > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=70761 > CC: "Eugene A. Shatokhin" > CC: Przemek Rudy > CC: Jay Cliburn > CC: Chris Snook > CC: netdev@vger.kernel.org > Signed-off-by: Jarod Wilson > --- > v2: remove superfluous parens around raw_mtu, pointed out by davem Looks great, applied, thanks Jarod. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64
Tom Herbert writes: > Also, we don't do anything special for alignment, unaligned > accesses on x86 do not appear to be a performance issue. This is not true on Atom CPUs. Also on most CPUs there is still a larger penalty when crossing cache lines. > Verified correctness by testing arbitrary length buffer filled with > random data. For each buffer I compared the computed checksum > using the original algorithm for each possible alignment (0-7 bytes). > > Checksum performance: > > Isolating old and new implementation for some common cases: You forgot to state the CPU. The results likely depend heavily on the micro architecture. The original C code was optimized for K8 FWIW. Overall your assembler looks similar to the C code, except for the jump table. Jump table has the disadvantage that it is much harder to branch predict, with a large penalty if it's mispredicted. I would expect it to be slower for cases where the length changes frequently. Did you benchmark that case? -Andi -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/1] net: possible use after free in dst_release
From: Eric Dumazet Date: Wed, 06 Jan 2016 05:57:24 -0800 > On Wed, 2016-01-06 at 00:18 -0800, Francesco Ruggeri wrote: >> dst_release should not access dst->flags after decrementing >> __refcnt to 0. The dst_entry may be in dst_busy_list and >> dst_gc_task may dst_destroy it before dst_release gets a chance >> to access dst->flags. >> >> Signed-off-by: Francesco Ruggeri >> --- >> net/core/dst.c | 3 ++- >> 1 file changed, 2 insertions(+), 1 deletion(-) >> >> diff --git a/net/core/dst.c b/net/core/dst.c >> index e6dc772..a1656e3 100644 >> --- a/net/core/dst.c >> +++ b/net/core/dst.c >> @@ -301,12 +301,13 @@ void dst_release(struct dst_entry *dst) >> { >> if (dst) { >> int newrefcnt; >> +unsigned short nocache = dst->flags & DST_NOCACHE; >> >> newrefcnt = atomic_dec_return(&dst->__refcnt); >> if (unlikely(newrefcnt < 0)) >> net_warn_ratelimited("%s: dst:%p refcnt:%d\n", >> __func__, dst, newrefcnt); >> -if (!newrefcnt && unlikely(dst->flags & DST_NOCACHE)) >> +if (!newrefcnt && unlikely(nocache)) >> call_rcu(&dst->rcu_head, dst_destroy_rcu); >> } >> } > > You're right. > > My previous fix (commit d69bbf88c8d0) was okay only for DST_NOCACHE dst. > > Fixes: d69bbf88c8d0 ("net: fix a race in dst_release()") > Fixes: 27b75c95f10d ("net: avoid RCU for NOCACHE dst") > Acked-by: Eric Dumazet Applied and queued up for -stable, thanks everyone! -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation
On Wed, Jan 6, 2016 at 11:15 AM, Konstantin Khlebnikov wrote: > Looks like this happens because ip_options_fragment() relies on > correct ip options length in ip control block in skb. But in > ip_finish_output_gso() control block in segments is reused by > skb_gso_segment(). following ip_fragment() sees some garbage. > > In my case there was no ip options but length becomes non-zero and > ip_options_fragment() picked some bytes from payload and decides to > fill huge range with IPOPT_NOOP (1). One of that ones flipped nr_frags > in skb_shared_info at the end of data =) > Hmm, it looks like SKB_GSO_CB should be cleared after skb_gso_segment() since all the gso information should be saved in shared_info after it finishes. Does a memset(0) on SKB_GSO_CB after skb_gso_segment() work as well? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 2/2] geneve: break dependency to network drivers
On Wed, Jan 6, 2016 at 10:48 AM, Hannes Frederic Sowa wrote: > On 06.01.2016 19:00, Jesse Gross wrote: >> >> On Wed, Jan 6, 2016 at 7:41 AM, Hannes Frederic Sowa >> wrote: >>> >>> diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c >>> index 24b077a32c1c9c..548925d1571cb1 100644 >>> --- a/drivers/net/geneve.c >>> +++ b/drivers/net/geneve.c >>> +static int geneve_notifier(struct notifier_block *unused, >>> + unsigned long event, void *ptr) >>> +{ >>> + struct net_device *dev = netdev_notifier_info_to_dev(ptr); >>> + >>> + switch (event) { >>> + case NETDEV_REFRESH_OFFLOAD_VXLAN: >>> + geneve_notify_refresh_netdev(dev); >> >> >> Presumably this should be NETDEV_REFRESH_OFFLOAD_GENEVE, not VXLAN. >> However, rather than having a notifier for each protocol, it seems >> like it might be cleaner to just have a single one that triggers for >> all protocols and drivers that don't have the corresponding NDO >> wouldn't get called, similar to what happens when the port gets added >> in the first place. > > > Ah, thanks for noticing the typo. > > The reason why I went with several types is that I didn't want to change the > behavior and wasn't sure if driver tested with reoccurring offload refreshes > to the driver. What you described was my first patch but because I couldn't > see if that works for all drivers I went this way. Hmm, I see what you mean but I think it should be safe. All drivers that have both Geneve and VXLAN offloads make calls to refresh them back to back, which is a pattern that I would expect to continue. In that case, having a single notifier that triggers multiple protocols would have the same effect and is simpler. >>> diff --git a/include/net/geneve.h b/include/net/geneve.h >>> index e6c23dc765f7ec..36245115143652 100644 >>> --- a/include/net/geneve.h >>> +++ b/include/net/geneve.h >> >> [...] >>> >>> static inline void geneve_get_rx_port(struct net_device *netdev) >>> { >>> + call_netdevice_notifiers(NETDEV_REFRESH_OFFLOAD_GENEVE, netdev); >>> } >> >> >> Unfortunately, I don't think that we can assume that RTNL is held >> here. It actually is for the drivers that implement Geneve at this >> point but not in all cases for VXLAN. For example, ixgbe refreshes the >> offloads in a service task in addition to when it is opened. There's >> only a couple instances of things like this, so I guess it's probably >> not too hard to through and make sure that we hold RTNL in those >> cases. > > > Hmm, I am tempted to switch over to the netevent atomic notifier chain and > install those events there. It does not need rtnl lock at all, so we can > preserve the current semantics. What do you think? I think that holding RTNL while we do these updates is actually the right thing to do. The current situation of having calls from different protocols protected by different locks is not really a great model given that at the driver level these are usually shared data structures. RTNL is already held in the majority of cases already, so I think it is better to just convert the rest. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html