Re: [PATCH net v2 2/2] udp: only do GSO if # of segs > 1

2019-10-02 Thread Duyck, Alexander H
On Wed, 2019-10-02 at 13:29 -0400, Josh Hunt wrote:
> Prior to this change an application sending <= 1MSS worth of data and
> enabling UDP GSO would fail if the system had SW GSO enabled, but the
> same send would succeed if HW GSO offload is enabled. In addition to this
> inconsistency the error in the SW GSO case does not get back to the
> application if sending out of a real device so the user is unaware of this
> failure.
> 
> With this change we only perform GSO if the # of segments is > 1 even
> if the application has enabled segmentation. I've also updated the
> relevant udpgso selftests.
> 
> Fixes: bec1f6f69736 ("udp: generate gso with UDP_SEGMENT")
> Signed-off-by: Josh Hunt 
> ---
>  net/ipv4/udp.c   | 11 +++
>  net/ipv6/udp.c   | 11 +++
>  tools/testing/selftests/net/udpgso.c | 16 
>  3 files changed, 18 insertions(+), 20 deletions(-)

Reviewed-by: Alexander Duyck 




Re: [PATCH RFC net-next] net: vlan/macvlan: count packets properly with gso

2018-11-29 Thread Duyck, Alexander H
On Thu, 2018-11-29 at 15:58 -0500, Debabrata Banerjee wrote:
> Fix packet count when using vlan/macvlan drivers with gso. Without this it
> is not possible to reconcile packet counts between underlying devices and
> these virtual devices. Additionally, the output looks wrong in a standalone
> way i.e. device MTU of 1500, 1 packet sent, 31856 bytes sent.
> 
> There are many other drivers that likely have a similar problem, although
> it is not clear how many of those could be used with gso. Perhaps all
> packet counting should be wrapped in a helper fn.
> 
> Signed-off-by: Debabrata Banerjee 
> ---
>  drivers/net/macvlan.c | 2 +-
>  net/8021q/vlan_core.c | 2 +-
>  net/8021q/vlan_dev.c  | 2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
> index fc8d5f1ee1ad..15e67a87f202 100644
> --- a/drivers/net/macvlan.c
> +++ b/drivers/net/macvlan.c
> @@ -566,7 +566,7 @@ static netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
>  
>   pcpu_stats = this_cpu_ptr(vlan->pcpu_stats);
>   u64_stats_update_begin(&pcpu_stats->syncp);
> - pcpu_stats->tx_packets++;
> + pcpu_stats->tx_packets += max_t(u16, 1, 
> skb_shinfo(skb)->gso_segs);
>   pcpu_stats->tx_bytes += len;
>   u64_stats_update_end(&pcpu_stats->syncp);
>   } else {
> diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
> index a313165e7a67..e85f6665d0ed 100644
> --- a/net/8021q/vlan_core.c
> +++ b/net/8021q/vlan_core.c
> @@ -62,7 +62,7 @@ bool vlan_do_receive(struct sk_buff **skbp)
>   rx_stats = this_cpu_ptr(vlan_dev_priv(vlan_dev)->vlan_pcpu_stats);
>  
>   u64_stats_update_begin(&rx_stats->syncp);
> - rx_stats->rx_packets++;
> + rx_stats->rx_packets += max_t(u16, 1, skb_shinfo(skb)->gso_segs);
>   rx_stats->rx_bytes += skb->len;
>   if (skb->pkt_type == PACKET_MULTICAST)
>   rx_stats->rx_multicast++;
> diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
> index b2d9c8f27cd7..b28e7535a0b9 100644
> --- a/net/8021q/vlan_dev.c
> +++ b/net/8021q/vlan_dev.c
> @@ -135,7 +135,7 @@ static netdev_tx_t vlan_dev_hard_start_xmit(struct 
> sk_buff *skb,
>  
>   stats = this_cpu_ptr(vlan->vlan_pcpu_stats);
>   u64_stats_update_begin(&stats->syncp);
> - stats->tx_packets++;
> + stats->tx_packets += max_t(u16, 1, skb_shinfo(skb)->gso_segs);
>   stats->tx_bytes += len;
>   u64_stats_update_end(&stats->syncp);
>   } else {

Instead of just checking for the max it might make more sense to do a
check using skb_is_gso, and then if true use gso_segs, otherwise just
default to 1.

Also your bytes are going to be totally messed up as well since the
headers are trimmed in the GSO frames. It might be worthwhile to just
have a branch based on skb_is_gso that sets the packets and bytes based
on the GSO values, and one that sets it for the default case.


smime.p7s
Description: S/MIME cryptographic signature


Re: [patch net-next 01/20] net: sched: add block bind/unbind notif. and extended block_get/put

2017-10-17 Thread Duyck, Alexander H
On Tue, 2017-10-17 at 22:05 +0200, Jiri Pirko wrote:
> From: Jiri Pirko 
> 
> Introduce new type of ndo_setup_tc message to propage binding/unbinding
> of a block to driver. Call this ndo whenever qdisc gets/puts a block.
> Alongside with this, there's need to propagate binder type from qdisc
> code down to the notifier. So introduce extended variants of
> block_get/put in order to pass this info.
> 
> Signed-off-by: Jiri Pirko 
> ---
>  include/linux/netdevice.h |  1 +
>  include/net/pkt_cls.h | 40 +
>  net/sched/cls_api.c   | 56 
> ---
>  3 files changed, 94 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 31bb301..062a4f5 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -771,6 +771,7 @@ typedef u16 (*select_queue_fallback_t)(struct net_device 
> *dev,
>  
>  enum tc_setup_type {
>   TC_SETUP_MQPRIO,
> + TC_SETUP_BLOCK,
>   TC_SETUP_CLSU32,
>   TC_SETUP_CLSFLOWER,
>   TC_SETUP_CLSMATCHALL,

I'm not a big fan of adding this to the middle of the enum. It will
make it harder for people that have to backport changes and such since
it is reordering values that are passed as a part of the kabi between
drivers and the kernel.

Also does this patch set really need to be 20 patches long? Seems like
you could have done this as a set of 8 and another of 12 since you need
about 8 patches to get to the point where you start pulling the code
out of the drivers.

- Alex

RE: [PATCH] net: export netdev_txq_to_tc to allow sch_mqprio to compile as module

2017-10-17 Thread Duyck, Alexander H
> -Original Message-
> From: Eric Dumazet [mailto:eduma...@google.com]
> Sent: Tuesday, October 17, 2017 4:21 AM
> To: Henrik Austad 
> Cc: netdev ; David S . Miller
> ; Daniel Borkmann ; David
> Ahern ; Duyck, Alexander H
> ; Willem de Bruijn ;
> John Fastabend ; tcharding ; LKML
> ; Henrik Austad ; Sanchez-
> Palencia, Jesus 
> Subject: Re: [PATCH] net: export netdev_txq_to_tc to allow sch_mqprio to
> compile as module
> 
> On Tue, Oct 17, 2017 at 3:10 AM, Henrik Austad  wrote:
> > In commit 32302902ff09 ("mqprio: Reserve last 32 classid values for HW
> > traffic classes and misc IDs") sch_mqprio started using
> > netdev_txq_to_tc to find the correct tc instead of dev->tc_to_txq[]
> >
> > However, when mqprio is compiled as a module, it cannot resolve the
> > symbol, leading to this error:
> >
> >  ERROR: "netdev_txq_to_tc" [net/sched/sch_mqprio.ko] undefined!
> >
> > This adds an EXPORT_SYMBOL() since the other user in the kernel
> > (netif_set_xps_queue) is also EXPORT_SYMBOL() (and not _GPL) or in a
> > sysfs-callback.
> >
> > Cc: Alexander Duyck 
> > Cc: Jesus Sanchez-Palencia 
> > Cc: David S. Miller 
> > Signed-off-by: Henrik Austad 
> 
> 
> Reviewed-by: Eric Dumazet 

This is identical to a patch I submitted yesterday when I got the report from 
the kbuild robot. I would say your description looks much better than mine 
though so I would be good with dropping my patch in favor of this one.

Acked-by: Alexander Duyck 


Re: [PATCH v2] gso: fix payload length when gso_size is zero

2017-10-06 Thread Duyck, Alexander H
On Fri, 2017-10-06 at 19:02 +0300, Alexey Kodanev wrote:
> When gso_size reset to zero for the tail segment in skb_segment(), later
> in ipv6_gso_segment(), __skb_udp_tunnel_segment() and gre_gso_segment()
> we will get incorrect results (payload length, pcsum) for that segment.
> inet_gso_segment() already has a check for gso_size before calculating
> payload.
> 
> The issue was found with LTP vxlan & gre tests over ixgbe NIC.
> 
> Fixes: 07b26c9454a2 ("gso: Support partial splitting at the frag_list 
> pointer")
> Signed-off-by: Alexey Kodanev 

Acked-by: Alexander Duyck 

> ---
> v2: also added skb_is_gso to gre_gso_segment() and __skb_udp_tunnel_segment()
> 
>  net/ipv4/gre_offload.c | 2 +-
>  net/ipv4/udp_offload.c | 2 +-
>  net/ipv6/ip6_offload.c | 2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c
> index d5cac99..8c72034 100644
> --- a/net/ipv4/gre_offload.c
> +++ b/net/ipv4/gre_offload.c
> @@ -98,7 +98,7 @@ static struct sk_buff *gre_gso_segment(struct sk_buff *skb,
>   greh = (struct gre_base_hdr *)skb_transport_header(skb);
>   pcsum = (__sum16 *)(greh + 1);
>  
> - if (gso_partial) {
> + if (gso_partial && skb_is_gso(skb)) {
>   unsigned int partial_adj;
>  
>   /* Adjust checksum to account for the fact that
> diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
> index 0932c85..6401574 100644
> --- a/net/ipv4/udp_offload.c
> +++ b/net/ipv4/udp_offload.c
> @@ -122,7 +122,7 @@ static struct sk_buff *__skb_udp_tunnel_segment(struct 
> sk_buff *skb,
>* will be using a length value equal to only one MSS sized
>* segment instead of the entire frame.
>*/
> - if (gso_partial) {
> + if (gso_partial && skb_is_gso(skb)) {
>   uh->len = htons(skb_shinfo(skb)->gso_size +
>   SKB_GSO_CB(skb)->data_offset +
>   skb->head - (unsigned char *)uh);
> diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
> index cdb3728..4a87f94 100644
> --- a/net/ipv6/ip6_offload.c
> +++ b/net/ipv6/ip6_offload.c
> @@ -105,7 +105,7 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff 
> *skb,
>  
>   for (skb = segs; skb; skb = skb->next) {
>   ipv6h = (struct ipv6hdr *)(skb_mac_header(skb) + nhoff);
> - if (gso_partial)
> + if (gso_partial && skb_is_gso(skb))
>   payload_len = skb_shinfo(skb)->gso_size +
> SKB_GSO_CB(skb)->data_offset +
> skb->head - (unsigned char *)(ipv6h + 1);


Re: [PATCH] ipv6: gso: fix payload length when gso_size is zero

2017-10-05 Thread Duyck, Alexander H
On Thu, 2017-10-05 at 20:06 +0300, Alexey Kodanev wrote:
> When gso_size reset to zero for the tail segment in skb_segment(), later
> in ipv6_gso_segment(), we will get incorrect payload_len for that segment.
> inet_gso_segment() already has a check for gso_size before calculating
> payload so fixing only IPv6 part.
> 
> The issue was found with LTP vxlan & gre tests over ixgbe NIC.
> 
> Fixes: 07b26c9454a2 ("gso: Support partial splitting at the frag_list 
> pointer")
> Signed-off-by: Alexey Kodanev 
> ---
>  net/ipv6/ip6_offload.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
> index cdb3728..4a87f94 100644
> --- a/net/ipv6/ip6_offload.c
> +++ b/net/ipv6/ip6_offload.c
> @@ -105,7 +105,7 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff 
> *skb,
>  
>   for (skb = segs; skb; skb = skb->next) {
>   ipv6h = (struct ipv6hdr *)(skb_mac_header(skb) + nhoff);
> - if (gso_partial)
> + if (gso_partial && skb_is_gso(skb))
>   payload_len = skb_shinfo(skb)->gso_size +
> SKB_GSO_CB(skb)->data_offset +
> skb->head - (unsigned char *)(ipv6h + 1);

So looking over this change it looks good to me. I'm just wondering if
you have looked at the code in __skb_udp_tunnel_segment or
gre_gso_segment? It seems like if you needed this change here you
should need to make similar changes to those functions as well. I'm
wondering if we just aren't seeing issues due to the segments already
being MSS sized before being handed to us for segmentation.

- Alex

Re: XDP redirect measurements, gotchas and tracepoints

2017-08-22 Thread Duyck, Alexander H
On Tue, 2017-08-22 at 11:17 -0700, John Fastabend wrote:
> On 08/22/2017 11:02 AM, Michael Chan wrote:
> > On Mon, Aug 21, 2017 at 12:25 PM, Jesper Dangaard Brouer
> >  wrote:
> > > 
> > > I'be been playing with the latest XDP_REDIRECT feature, that was
> > > accepted in net-next (for ixgbe), see merge commit[1].
> > >  [1] https://git.kernel.org/davem/net-next/c/6093ec2dc31
> > > 
> > 
> > Just catching on XDP_REDIRECT and I have a very basic question.  The
> > ingress device passes the XDP buffer to the egress device for XDP
> > redirect transmission.  When the egress device has transmitted the
> > packet, is it supposed to just free the buffer?  Or is it supposed to
> > be recycled?
> > 
> > In XDP_TX, the buffer is recycled back to the rx ring.
> > 
> 
> With XDP_REDIRECT we must "just free the buffer" in ixgbe this means
> page_frag_free() on the data. There is no way to know where the xdp
> buffer came from it could be a different NIC for example.
> 
> However with how ixgbe is coded up recycling will work as long as
> the memory is free'd before the driver ring tries to use it again. In
> normal usage this should be the case. And if we are over-running a device
> it doesn't really hurt to slow down the sender a bit.
> 
> I think this is a pretty good model, we could probably provide a set
> of APIs for drivers to use so that we get some consistency across
> vendors here, ala Jesper's page pool ideas.
> 
> (+Alex, for ixgbe details)
> 
> Thanks,
> John

I think you pretty much covered the inner workings for the ixgbe bits.

The only piece I would add is that the recycling trick normally only
works if the same interface/driver is doing both the Tx and the Rx. The
redirect code cannot assume that is the case and that is the reason why
it must always be freeing the traffic on clean-up.

- Alex

Re: [Intel-wired-lan] [i40e] regression on TCP stream and TCP maerts, kernel-4.12.0-0.rc2

2017-06-01 Thread Duyck, Alexander H
On Thu, 2017-06-01 at 12:14 +0200, Adrian Tomasov wrote:
> On Wed, 2017-05-31 at 14:42 -0700, Alexander Duyck wrote:
> > 
> > On Wed, May 31, 2017 at 6:48 AM, Adrian Tomasov 
> > wrote:
> > > 
> > > 
> > > On Tue, 2017-05-30 at 18:27 -0700, Alexander Duyck wrote:
> > > > 
> > > > 
> > > > On Tue, May 30, 2017 at 8:41 AM, Alexander Duyck
> > > >  wrote:
> > > > > 
> > > > > 
> > > > > 
> > > > > On Tue, May 30, 2017 at 6:43 AM, Adam Okuliar  > > > > com>
> > > > > wrote:
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > Hello,
> > > > > > 
> > > > > > we found regression on intel card(XL710) with i40e driver.
> > > > > > Regression is
> > > > > > about ~45%
> > > > > > on TCP_STREAM and TCP_MAERTS test for IPv4 and IPv6.
> > > > > > Regression
> > > > > > was first
> > > > > > visible in kernel-4.12.0-0.rc1.
> > > > > > 
> > > > > > More details about results you can see in uploaded images in
> > > > > > bugzilla. [0]
> > > > > > 
> > > > > > 
> > > > > > [0] https://bugzilla.kernel.org/show_bug.cgi?id=195923
> > > > > > 
> > > > > > 
> > > > > > Best regards, / S pozdravom,
> > > > > > 
> > > > > > Adrián Tomašov
> > > > > > Kernel Performance QE
> > > > > > atoma...@redhat.com
> > > > > 
> > > > > I have added the i40e driver maintainer and the intel-wired-lan
> > > > > mailing list so that we can make are developers aware of the
> > > > > issue.
> > > > > 
> > > > > Thanks.
> > > > > 
> > > > > - Alex
> > > > 
> > > > Adam,
> > > > 
> > > > We are having some issues trying to reproduce what you reported.
> > > > 
> > > > Can you provide some additional data. Specifically we would be
> > > > looking
> > > > for an "ethtool -i", and an "ethtool -S" for the port before and
> > > > after
> > > > the test. If you can attach it to the bugzilla that would be
> > > > appreciated.
> > > > 
> > > > Thanks.
> > > > 
> > > > - Alex
> > > 
> > > Hello Alex,
> > > 
> > > requested files are updated in bugzilla.
> > > 
> > > If you have any questions about testing feel free to ask.
> > > 
> > > 
> > > Best regards,
> > > 
> > > Adrian
> > 
> > So looking at the data I wonder if we don't have an MTU mismatch in
> > the network config. I notice the "after" has rx_length_errors being
> > reported. Recent changes made it so that i40e doesn't support jumbo
> > frames by default, whereas before we could. You might want to check
> > for that as that could cause the kind of performance issues you are
> > seeing.
> > 
> > - Alex
> 
> There isn't MTU mismatch. Traffic path is : server -> switch ->
> server. 
> 
> 
> Output from switch:
> 
> > show interfaces et-0/0/18
> Physical interface: et-0/0/18, Enabled, Physical link is Up
>   Interface index: 644, SNMP ifIndex: 538
>   Link-level type: Ethernet, MTU: 1514, Speed: 40Gbps, BPDU Error:
> None, MAC-REWRITE Error: None, Loopback: Disabled, Source filtering:
> Disabled, Flow control: Disabled, Media type: Fiber
>   Device flags   : Present Running
>   Interface flags: SNMP-Traps Internal: 0x4000
>   Link flags : None
>   CoS queues : 12 supported, 12 maximum usable queues
>   Current address: d4:04:ff:90:5a:4b, Hardware address:
> d4:04:ff:90:5a:4b
>   Last flapped   : 2017-06-01 10:09:32 CEST (01:21:29 ago)
>   Input rate : 432 bps (0 pps)
>   Output rate: 8336 bps (11 pps)
>   Active alarms  : None
>   Active defects : None
>   Interface transmit statistics: Disabled
> 
>   Logical interface et-0/0/18.0 (Index 552) (SNMP ifIndex 539)
> Flags: SNMP-Traps 0x24024000 Encapsulation: Ethernet-Bridge
> Input packets : 464041
> Output packets: 209210
> Protocol eth-switch, MTU: 1514
>   Flags: Is-Primary, Trunk-Mode
> 
> 
> MTU is same for all et-0/0/x interfaces. 
> 
> - Adrian

One thing you might try try doing is toggling the legacy-rx flag using
the "ethtool --show-priv-flags/--set-priv-flags" command to see if that
has any impact. That will help to rule things out as the most
significant change I can think of is the recent update of the Rx path
to support XDP.

Also one other thing you might try would be to use a fixed interrupt
moderation rate by locking things down using "ethtool -C" to disable
adaptive interrupt moderation and lock the Rx usecs and Tx usecs at
some predefined values. I seem to recall there have been some interrupt
moderation changes made recently that might be impacting the
performance.

Beyond that is there any chance you would be able to bisect the issue?
Unfortunately we haven't be able to reproduce it internally so anything
that would help us to narrow down the problem would be useful.

Thanks.

- Alex

Re: [Intel-wired-lan] [PATCH 0/4] Configuring traffic classes via new hardware offload mechanism in tc/mqprio

2017-05-22 Thread Duyck, Alexander H
On Mon, 2017-05-22 at 06:25 +0300, Or Gerlitz wrote:
> On Mon, May 22, 2017 at 1:35 AM, Alexander Duyck
>  wrote:
> > 
> > On Sat, May 20, 2017 at 2:15 PM, Or Gerlitz  wrote:
> > > 
> > > On Sat, May 20, 2017 at 3:58 AM, Amritha Nambiar
> > >  wrote:
> > > > 
> > > > The following series introduces a new harware offload mode in tc/mqprio
> > > 
> > > Wait, we have already a HW QoS model introduced by John F and Co
> > > couple of years ago,  right?
> > 
> > I assume you are referring to the ETS portion of DCBX? If so then yes
> > we have something there, but there is a fairly high level of
> > complexity and dependencies in order to enable that. What we have in
> > mind doesn't really fit with DCBX and the problem is the spec calls
> > out that you really have to have support for DCBX in order to make use
> > of ETS. What we are trying to do here is provide a lightweight way of
> > doing this configuration similar to how mqprio was already providing a
> > lightweight way of enabling multiple traffic classes.
> 
> UAPI wise, we are talking on DCB, not DCBX, right? the X-portion comes
> into play if some user-space entity run LLDP traffic and calls into
> the kernel to configure (say) ETS. If there are some issues to use the
> existing user-space tool (lldpad tool/daemon) to do DCB without X, one
> can/should fix them or introduce another/simpler tool that in a
> lightweight manner only configures things w.o exchanging.
> 
> So to clarify, the essence of this series is introducing a 2nd way to
> configure ETS and rate limit?

Sort of. Basically the idea is we can use several different approaches
to enable queue configuration and rate limits. So we are adding two
pieces of functionality.

The first block allows for configuring queue counts and layout.
Historically DCB/DCBX hasn't allowed us to specify that.

The second bit is that we added support for rate limiting. I am
actually basing it on what we had for SR-IOV rate limiting as that is
how this is working in i40e. However the basic attributes we are adding
should work for most ETS type use cases though it might need to use the
min rates instead of the max rates as we do in i40e.

> > 
> > > 
> > > Please elaborate in few sentence if you are extending it/replacing it,
> > > why we want to do that and what are the implications on existing
> > > applications UAPI wise.
> 
> > 
> > This is meant to be an extension of the existing structure. It can be
> > ignored by the driver and instead only have the basic hw offload
> > supported. In such a case the command will still return success but
> > any rate limits and queue configuration can be ignored. In the case of
> > the current implementation the queue configuration was already being
> > ignored so we opted to re-purpose the "hw" flag so that if you pass a
> > value greater than 1 and the driver doesn't recognize the value it has
> > the option to just ignore the extra bits it doesn't recognize and
> > return 1 as it did before for the hw flag in mqprio.
> 
> So the user asked to configure something and the kernel returned
> success but the config was not plugged to the hw? sounds to me like
> possible (probable) source of troubles and complaints.

That is possible, however the issue already existed. The queue
configuration could be specified with the mqprio configuration and be
totally ignored. I opted for just maintaining the existing behavior and
moving forward and providing some input via the ability to report what
"version" of the hardware offload we are supporting.

The plan is that legacy devices can fall back into the setup where they
support mode 1, however if they support hw mode 2 then we will fail to
initialize if we don't support something that is being requested.

> > 
> > > 
> > > Below you just say in the new mode Qos is configured with knobs XYZ --
> > > this is way not enough
> 
> > 
> > Can you clarify what you are asking for here? Amritha included an
> > example in patch 1 of a command line that enabled 2 traffic classes
> > with rate limits. Is there something more specific you are looking for?
> 
> you were referring to the questions I posted, so we can continue the
> discussion from here.
> ___
> Intel-wired-lan mailing list
> intel-wired-...@osuosl.org
> https://lists.osuosl.org/mailman/listinfo/intel-wired-lan


Re: [PATCH net v2] i40e/i40evf: proper update of the page_offset field

2017-05-15 Thread Duyck, Alexander H
On Mon, 2017-05-15 at 06:52 +0200, Björn Töpel wrote:
> From: Björn Töpel 
> 
> In f8b45b74cc62 ("i40e/i40evf: Use build_skb to build frames")
> i40e_build_skb updates the page_offset field with an incorrect offset,
> which can lead to data corruption. This patch updates page_offset
> correctly, by properly setting truesize.
> 
> Note that the bug only appears on architectures where PAGE_SIZE is
> 8192 or larger.
> 
> Fixes: f8b45b74cc62 ("i40e/i40evf: Use build_skb to build frames")
> Signed-off-by: Björn Töpel 

Acked-by: Alexander Duyck 

> ---
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 3 ++-
>  drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 3 ++-
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
> b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index 29321a6167a6..cd894f4023b1 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -1854,7 +1854,8 @@ static struct sk_buff *i40e_build_skb(struct i40e_ring 
> *rx_ring,
>  #if (PAGE_SIZE < 8192)
>   unsigned int truesize = i40e_rx_pg_size(rx_ring) / 2;
>  #else
> - unsigned int truesize = SKB_DATA_ALIGN(size);
> + unsigned int truesize = SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) +
> + SKB_DATA_ALIGN(I40E_SKB_PAD + size);
>  #endif
>   struct sk_buff *skb;
>  
> diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c 
> b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
> index dfe241a12ad0..12b02e530503 100644
> --- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
> @@ -1190,7 +1190,8 @@ static struct sk_buff *i40e_build_skb(struct i40e_ring 
> *rx_ring,
>  #if (PAGE_SIZE < 8192)
>   unsigned int truesize = i40e_rx_pg_size(rx_ring) / 2;
>  #else
> - unsigned int truesize = SKB_DATA_ALIGN(size);
> + unsigned int truesize = SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) +
> + SKB_DATA_ALIGN(I40E_SKB_PAD + size);
>  #endif
>   struct sk_buff *skb;
>  


RE: [PATCH net-next 8/9] ipvlan: improve compiler hints

2017-04-27 Thread Duyck, Alexander H
> -Original Message-
> From: Chiappero, Marco
> Sent: Thursday, April 27, 2017 7:52 AM
> To: netdev@vger.kernel.org
> Cc: David S . Miller ; Kirsher, Jeffrey T
> ; Duyck, Alexander H
> ; Grandhi, Sainath
> ; Mahesh Bandewar ;
> Chiappero, Marco 
> Subject: [PATCH net-next 8/9] ipvlan: improve compiler hints
> 
> Extend inlining and branch prediction hints.
> 
> Signed-off-by: Marco Chiappero 
> ---
>  drivers/net/ipvlan/ipvlan_core.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ipvlan/ipvlan_core.c 
> b/drivers/net/ipvlan/ipvlan_core.c
> index a9fc1b5..67e342d 100644
> --- a/drivers/net/ipvlan/ipvlan_core.c
> +++ b/drivers/net/ipvlan/ipvlan_core.c
> @@ -88,7 +88,7 @@ void ipvlan_ht_addr_del(struct ipvl_addr *addr)
>   hlist_del_init_rcu(&addr->hlnode);
>  }
> 
> -unsigned int ipvlan_mac_hash(const unsigned char *addr)
> +inline unsigned int ipvlan_mac_hash(const unsigned char *addr)
>  {
>   u32 hash = jhash_1word(__get_unaligned_cpu32(addr + 2),
>  ipvlan_jhash_secret);

I'm kind of surprised this isn't causing a problem with differing declarations 
between the declaration here and the declaration in ipvlan.h. Normally for 
inlining something like this you would change it to a "static inline" and move 
the entire declaration into the header file.

> @@ -505,7 +505,7 @@ static int ipvlan_xmit_mode_l2(struct sk_buff *skb,
> struct net_device *dev)
>   return ipvlan_rcv_int_frame(addr, &skb);
> 
>   skb = skb_share_check(skb, GFP_ATOMIC);
> - if (!skb)
> + if (unlikely(!skb))
>   return NET_XMIT_DROP;
> 
>   /* Packet definitely does not belong to any of the @@ -596,7
> +596,7 @@ static rx_handler_result_t ipvlan_handle_mode_l2(struct sk_buff
> **pskb,
>* when work-queue processes this frame. This is
>* achieved by returning RX_HANDLER_PASS.
>*/
> - if (nskb)
> + if (likely(nskb))
>   ipvlan_multicast_enqueue(port, nskb, false);
>   }
>   } else {
> --
> 2.9.3



RE: [Intel-wired-lan] [PATCH] igb: Allow to remove administratively set MAC on VFs

2017-04-04 Thread Duyck, Alexander H
> -Original Message-
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@lists.osuosl.org] On
> Behalf Of Corinna Vinschen
> Sent: Tuesday, April 4, 2017 8:11 AM
> To: intel-wired-...@lists.osuosl.org
> Cc: netdev@vger.kernel.org
> Subject: [Intel-wired-lan] [PATCH] igb: Allow to remove administratively set 
> MAC
> on VFs
> 
>   Before libvirt modifies the MAC address and vlan tag for an SRIOV VF
>   for use by a virtual machine (either using vfio device assignment or
>   macvtap passthru mode), it saves the current MAC address and vlan tag
>   so that it can reset them to their original value when the guest is
>   done.  Libvirt can't leave the VF MAC set to the value used by the
>   now-defunct guest since it may be started again later using a
>   different VF, but it certainly shouldn't just pick any random value,
>   either. So it saves the state of everything prior to using the VF, and
>   resets it to that.
> 
>   The igb driver initializes the MAC addresses of all VFs to
>   00:00:00:00:00:00, and reports that when asked (via an RTM_GETLINK
>   netlink message, also visible in the list of VFs in the output of "ip
>   link show"). But when libvirt attempts to restore the MAC address back
>   to 00:00:00:00:00:00 (using an RTM_SETLINK netlink message) the kernel
>   responds with "Invalid argument".
> 
>   Forbidding a reset back to the original value leaves the VF MAC at the
>   value set for the now-defunct virtual machine. Especially on a system
>   with NetworkManager enabled, this has very bad consequences, since
>   NetworkManager forces all interfacess to be IFF_UP all the time - if
>   the same virtual machine is restarted using a different VF (or even on
>   a different host), there will be multiple interfaces watching for
>   traffic with the same MAC address.
> 
>   To allow libvirt to revert to the original state, we need a way to
>   remove the administrative set MAC on a VF, to allow normal host
>   operation again, and to reset/overwrite the VF MAC via VF netdev.
> 
>   This patch implements the outlined scenario by allowing to set the
>   VF MAC to 00:00:00:00:00:00 via RTM_SETLINK on the PF.
>   igb_ndo_set_vf_mac resets the IGB_VF_FLAG_PF_SET_MAC flag to 0,
>   so it's possible to reset the VF MAC back to the original value via
>   the VF netdev.
> 
>   Note: Recent patches to libvirt allow for a workaround if the NIC
>   isn't capable of resetting the administrative MAC back to all 0, but
>   in theory the NIC should allow resetting the MAC in the fisr place.

Minor typo here. I assume you mean "first place".

> Signed-off-by: Corinna Vinschen 
> ---
>  drivers/net/ethernet/intel/igb/igb_main.c | 25 -
>  1 file changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c
> b/drivers/net/ethernet/intel/igb/igb_main.c
> index 26a821f..e7a61b1 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -8125,12 +8125,27 @@ static int igb_set_vf_mac(struct igb_adapter
> *adapter,  static int igb_ndo_set_vf_mac(struct net_device *netdev, int vf, u8
> *mac)  {
>   struct igb_adapter *adapter = netdev_priv(netdev);
> - if (!is_valid_ether_addr(mac) || (vf >= adapter->vfs_allocated_count))
> +
> + if (vf >= adapter->vfs_allocated_count)
> + return -EINVAL;

I would add an blank line here just for readability.

> + /* Setting the VF MAC to 0 reverts the IGB_VF_FLAG_PF_SET_MAC
> +flag and allows to overwrite the MAC via VF netdev.  This
> +is necessary to allow libvirt a way to restore the original
> +MAC after unbinding vfio-pci and reloading igbvf after shutting
> +down a VM. */

Minor coding style issue here. The "*/" should be on a separate line.

> + if (is_zero_ether_addr(mac)) {
> + adapter->vf_data[vf].flags &= ~IGB_VF_FLAG_PF_SET_MAC;
> + dev_info(&adapter->pdev->dev,
> +  "remove administratively set MAC on VF %d\n",
> +  vf);
> + } else if (is_valid_ether_addr (mac)) {
> + adapter->vf_data[vf].flags |= IGB_VF_FLAG_PF_SET_MAC;
> + dev_info(&adapter->pdev->dev, "setting MAC %pM on VF
> %d\n",
> +  mac, vf);
> + dev_info(&adapter->pdev->dev,
> +  "Reload the VF driver to make this change effective.");
> + } else
>   return -EINVAL;

Minor coding style issue here. The else should also have "{}" wrapping the 
statement. Generally if any one of the statements in an if/else series needs 
the braces they should all have the braces.

> - adapter->vf_data[vf].flags |= IGB_VF_FLAG_PF_SET_MAC;
> - dev_info(&adapter->pdev->dev, "setting MAC %pM on VF %d\n", mac,
> vf);
> - dev_info(&adapter->pdev->dev,
> -  "Reload the VF driver to make this change effective.");
>   if (test_bit(__IGB_DOWN, &adapter->state)) {

You migh

RE: [net 2/2] ixgbe: Limit use of 2K buffers on architectures with 256B or larger cache lines

2017-03-03 Thread Duyck, Alexander H
> -Original Message-
> From: David Laight [mailto:david.lai...@aculab.com]
> Sent: Friday, March 3, 2017 4:25 AM
> To: Kirsher, Jeffrey T ; da...@davemloft.net
> Cc: Duyck, Alexander H ;
> netdev@vger.kernel.org; nhor...@redhat.com; sassm...@redhat.com;
> jogre...@redhat.com
> Subject: RE: [net 2/2] ixgbe: Limit use of 2K buffers on architectures with 
> 256B
> or larger cache lines
> 
> From: Jeff Kirsher
> > Sent: 03 March 2017 02:25
> > From: Alexander Duyck 
> >
> > On architectures that have a cache line size larger than 64 Bytes we
> > start running into issues where the amount of headroom for the frame
> > starts shrinking.
> >
> > The size of skb_shared_info on a system with a 64B L1 cache line size
> > is 320.  This increases to 384 with a 128B cache line, and 512 with a
> > 256B cache line.
> 
> Perhaps some of the CACHE_LINE_ALIGNED markers don't actually need to
> force alignment with large line sizes?
> 
> I realise some things have hard requirements for cache alignment (eg non-
> coherent dma), but others are just there to limit the number of cache lines 
> read
> and/or dirtied.
> 
>   David

For our purposes I think this works well enough.  Basically we wanted to 
guarantee we have enough headroom for XDP.  In the case of the Mellanox drivers 
they are guaranteeing 256 if I recall correctly.

I have some follow-up patches for net-next that will make it so that we can 
just do a build-time test that will determine the padding size and allow us to 
always guaranteed at least NET_SKB_PAD + NET_IP_ALIGN.

- Alex




Re: [patch net-next 1/7] ipv4: fib: Only flush FIB aliases belonging to currently flushed table

2017-02-09 Thread Duyck, Alexander H
On Thu, 2017-02-09 at 10:28 +0100, Jiri Pirko wrote:
> From: Ido Schimmel 
> 
> In case the MAIN table is flushed and its trie is shared with the LOCAL
> table, then we might be flushing FIB aliases belonging to the latter.
> This can lead to FIB_ENTRY_DEL notifications sent with the wrong table
> ID.
> 
> The above doesn't affect current listeners, as the table ID is ignored
> during entry deletion, but this will change later in the patchset.
> 
> When flushing a particular table, skip any aliases belonging to a
> different one.
> 
> Signed-off-by: Ido Schimmel 
> Signed-off-by: Jiri Pirko 
> CC: Alexander Duyck 
> CC: Patrick McHardy 
> ---
>  net/ipv4/fib_trie.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index 2919d1a..5ef4596 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -1963,7 +1963,8 @@ int fib_table_flush(struct net *net, struct fib_table 
> *tb)
>   hlist_for_each_entry_safe(fa, tmp, &n->leaf, fa_list) {
>   struct fib_info *fi = fa->fa_info;
>  
> - if (!fi || !(fi->fib_flags & RTNH_F_DEAD)) {
> + if (!fi || !(fi->fib_flags & RTNH_F_DEAD) ||
> + tb->tb_id != fa->tb_id) {
>   slen = fa->fa_slen;
>   continue;
>   }

One change I might make if you end up having to do a v2 would be to
test for the table ID first.  It can end up saving a few cycles in the
whole flushing process since the table ID is in the fib alias instead
of having to dereference the fib info.

That being said, I am just being a bit nit-picky so the code itself is
functionally correct and there is nothing here that should cause any
issues.

Reviewed-by: Alexander Duyck 

RE: [net-next v2 02/19] i40e: simplify txd use count calculation

2016-12-07 Thread Duyck, Alexander H
> -Original Message-
> From: Eric Dumazet [mailto:eric.duma...@gmail.com]
> Sent: Wednesday, December 7, 2016 5:04 PM
> To: Duyck, Alexander H 
> Cc: Kirsher, Jeffrey T ; da...@davemloft.net;
> Williams, Mitch A ; netdev@vger.kernel.org;
> nhor...@redhat.com; sassm...@redhat.com; jogre...@redhat.com;
> guru.anbalag...@oracle.com
> Subject: Re: [net-next v2 02/19] i40e: simplify txd use count calculation
> 
> On Thu, 2016-12-08 at 00:35 +, Duyck, Alexander H wrote:
> 
> > Well there ends up being a few aspects to it.  First we don't need the
> > precision of a full 64b inverse multiplication, that is why we can get
> > away with multiple by 85 and shift.  The assumption is we should never
> > see a buffer larger than 64K for a TSO frame.  That being the case we
> > can do the same thing without having to use a 64b value which isn't an
> > option on 32b architectures.
> >
> > So basically what it comes down to is dealing with the "optimized for
> > size" kernel option, and 32b architectures not being able to do this.
> > Arguably both are corner cases but better to deal with them than take
> > a performance hit we don't have to.
> 
> ok ok ;)
> 
> Too bad the 65536 value is accepted, (is it ?) otherwise
> 
> unsigned int foo(unsigned short size)
> {
>   return size / 0x3000;
> }
> 
> -> generates the same kind of instructions, with maybe a better
> precision.
> 
> foo:
>   movzwl  %di, %eax
>   imull   $43691, %eax, %eax
>   shrl$29, %eax
>   ret

I haven't ever looked all that closely.  I'm assuming the frame size can 
probably exceed by at least ETH_HLEN - 1 since the IP header length can 
theoretically reach 64K - 1 before we are maxed out.

We don't really need the extra precision anyway.  We will be off by 1 most 
likely anyway in many cases since the actual hardware can handle up to 16K - 1 
in either the first and/or last buffer of any series since the actual limit is 
the 14 bits in the Tx descriptor for reporting the buffer length and we try to 
keep the split between buffers 4K aligned.

- Alex


RE: [net-next v2 02/19] i40e: simplify txd use count calculation

2016-12-07 Thread Duyck, Alexander H
> -Original Message-
> From: Eric Dumazet [mailto:eric.duma...@gmail.com]
> Sent: Wednesday, December 7, 2016 4:16 PM
> To: Kirsher, Jeffrey T 
> Cc: da...@davemloft.net; Williams, Mitch A ;
> netdev@vger.kernel.org; nhor...@redhat.com; sassm...@redhat.com;
> jogre...@redhat.com; guru.anbalag...@oracle.com; Duyck, Alexander H
> 
> Subject: Re: [net-next v2 02/19] i40e: simplify txd use count calculation
> 
> On Wed, 2016-12-07 at 14:19 -0800, Jeff Kirsher wrote:
> > From: Mitch Williams 
> >
> > The i40e_txd_use_count function was fast but confusing. In the
> > comments, it even admits that it's ugly. So replace it with a new
> > function that is
> > (very) slightly faster and has extensive commenting to help the
> > thicker among us (including the author, who will forget in a week)
> > understand how it works.
> >
> > Change-ID: Ifb533f13786a0bf39cb29f77969a5be2c83d9a87
> > Signed-off-by: Mitch Williams 
> > Signed-off-by: Alexander Duyck 
> > Tested-by: Andrew Bowers 
> > Signed-off-by: Jeff Kirsher 
> > ---
> >  drivers/net/ethernet/intel/i40e/i40e_txrx.h   | 45 
> > +-
> -
> >  drivers/net/ethernet/intel/i40evf/i40e_txrx.h | 45
> > +--
> >  2 files changed, 56 insertions(+), 34 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> > b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> > index de8550f..e065321 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> > @@ -173,26 +173,37 @@ static inline bool i40e_test_staterr(union
> > i40e_rx_desc *rx_desc,  #define I40E_MAX_DATA_PER_TXD_ALIGNED \
> > (I40E_MAX_DATA_PER_TXD & ~(I40E_MAX_READ_REQ_SIZE - 1))
> >
> > -/* This ugly bit of math is equivalent to DIV_ROUNDUP(size, X) where
> > X is
> > - * the value I40E_MAX_DATA_PER_TXD_ALIGNED.  It is needed due to the
> > fact
> > - * that 12K is not a power of 2 and division is expensive.  It is
> > used to
> > - * approximate the number of descriptors used per linear buffer.
> > Note
> > - * that this will overestimate in some cases as it doesn't account
> > for the
> > - * fact that we will add up to 4K - 1 in aligning the 12K buffer,
> > however
> > - * the error should not impact things much as large buffers usually
> > mean
> > - * we will use fewer descriptors then there are frags in an skb.
> > +/**
> > + * i40e_txd_use_count  - estimate the number of descriptors needed
> > +for Tx
> > + * @size: transmit request size in bytes
> > + *
> > + * Due to hardware alignment restrictions (4K alignment), we need to
> > + * assume that we can have no more than 12K of data per descriptor,
> > +even
> > + * though each descriptor can take up to 16K - 1 bytes of aligned memory.
> > + * Thus, we need to divide by 12K. But division is slow! Instead,
> > + * we decompose the operation into shifts and one relatively cheap
> > + * multiply operation.
> > + *
> > + * To divide by 12K, we first divide by 4K, then divide by 3:
> > + * To divide by 4K, shift right by 12 bits
> > + * To divide by 3, multiply by 85, then divide by 256
> > + * (Divide by 256 is done by shifting right by 8 bits)
> > + * Finally, we add one to round up. Because 256 isn't an exact
> > +multiple of
> > + * 3, we'll underestimate near each multiple of 12K. This is actually
> > +more
> > + * accurate as we have 4K - 1 of wiggle room that we can fit into the
> > +last
> > + * segment.  For our purposes this is accurate out to 1M which is
> > +orders of
> > + * magnitude greater than our largest possible GSO size.
> > + *
> > + * This would then be implemented as:
> > + * return (((size >> 12) * 85) >> 8) + 1;
> > + *
> > + * Since multiplication and division are commutative, we can reorder
> > + * operations into:
> > + * return ((size * 85) >> 20) + 1;
> >   */
> >  static inline unsigned int i40e_txd_use_count(unsigned int size)  {
> > -   const unsigned int max = I40E_MAX_DATA_PER_TXD_ALIGNED;
> > -   const unsigned int reciprocal = ((1ull << 32) - 1 + (max / 2)) / max;
> > -   unsigned int adjust = ~(u32)0;
> > -
> > -   /* if we rounded up on the reciprocal pull down the adjustment */
> > -   if ((max * reciprocal) > adjust)
> > -   adjust = ~(u32)(reciprocal - 1);
> > -
> > -   return (u32)u64)size * reciprocal) + adjust) >> 32);
> > +   return ((

Re: [net PATCH 2/2] ipv4: Drop suffix update from resize code

2016-12-05 Thread Duyck, Alexander H
On Mon, 2016-12-05 at 12:28 -0500, David Miller wrote:
> From: Robert Shearman 
> Date: Mon, 5 Dec 2016 15:05:18 +
> 
> > 
> > On 01/12/16 12:27, Alexander Duyck wrote:
> > > 
> > > It has been reported that update_suffix can be expensive when it is
> > > called
> > > on a large node in which most of the suffix lengths are the same.  The
> > > time
> > > required to add 200K entries had increased from around 3 seconds to
> > > almost
> > > 49 seconds.
> > > 
> > > In order to address this we need to move the code for updating the
> > > suffix
> > > out of resize and instead just have it handled in the cases where we
> > > are
> > > pushing a node that increases the suffix length, or will decrease the
> > > suffix length.
> > > 
> > > Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length")
> > > Reported-by: Robert Shearman 
> > > Signed-off-by: Alexander Duyck 
> > 
> > $ time sudo ip route restore < ~/allroutes
> > RTNETLINK answers: File exists
> > RTNETLINK answers: File exists
> > RTNETLINK answers: File exists
> > RTNETLINK answers: File exists
> 
> What are these errors all about?

I think it is the fact that he is trying to restore "all routes" and
some of the routes already exist such as those associated with his
default network interface.

- Alex

RE: [PATCH net] fib_trie: Avoid expensive update of suffix length when not required

2016-12-02 Thread Duyck, Alexander H
> -Original Message-
> From: Robert Shearman [mailto:rshea...@brocade.com]
> Sent: Friday, December 2, 2016 5:54 AM
> To: Alexander Duyck 
> Cc: David Miller ; Netdev ;
> Duyck, Alexander H 
> Subject: Re: [PATCH net] fib_trie: Avoid expensive update of suffix length 
> when
> not required
> 
> On 01/12/16 18:29, Alexander Duyck wrote:
> > On Wed, Nov 30, 2016 at 4:27 PM, Robert Shearman
>  wrote:
> >> On 29/11/16 23:14, Alexander Duyck wrote:
> >>> On Tue, Nov 29, 2016 at 9:46 AM, Robert Shearman
> >>> 
> >>>>
> >>>> Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix
> >>>> length")
> >>>> Signed-off-by: Robert Shearman 
> >>>
> >>>
> >>> So I am not entirely convinced this is a regression, I was wondering
> >>> what your numbers looked like before the patch you are saying this
> >>> fixes?  I recall I fixed a number of issues leading up to this so I
> >>> am just curious.
> >>
> >>
> >> At commit 21d1f11db0e2f20a549ad8322879507850544670 ("fib_trie:
> Remove
> >> checks for index >= tnode_child_length from tnode_get_child") which
> >> is the parent of 5405afd1a306:
> >>
> >> $ time sudo ip route restore < ~/allroutes RTNETLINK answers: File
> >> exists RTNETLINK answers: File exists RTNETLINK answers: File exists
> >> RTNETLINK answers: File exists
> >>
> >> real0m3.451s
> >> user0m0.184s
> >> sys 0m2.004s
> >>
> >>
> >> At commit 5405afd1a30620de8601436bae739c67c0bc7324 ("fib_trie: Add
> >> tracking value for suffix length"):
> >>
> >> $ time sudo ip route restore < ~/allroutes RTNETLINK answers: File
> >> exists RTNETLINK answers: File exists RTNETLINK answers: File exists
> >> RTNETLINK answers: File exists
> >>
> >> real0m48.624s
> >> user0m0.360s
> >> sys 0m46.960s
> >>
> >> So does that warrant a fixes tag for this patch?
> >
> > Yes, please include it in the patch description next time.
> >
> > I've been giving it thought and the fact is the patch series has
> > merit.  I just think it was being a bit too aggressive in terms of
> > some of the changes.  So placing any changes in put_child seem to be
> > the one piece in this that make this a bit ugly.
> 
> Does that imply the current updating of the parent's slen value should be
> removed from put_child then?

No, that is where it should be.  We want to leave the put_child code as is.  
That way all the code for halving and inflating nodes works correctly.

The only reason for having the update_suffix code update the parent was because 
before it was propagating the suffix length for increases as well.  Since we 
aren't using it for that anymore there isn't much point in having update_suffix 
touching the parent in the code below.
 
> I started out without doing the changes in put_child, but that meant the 
> changes
> to push the slen up through the trie were spread all over the place. This 
> seemed
> like the cleaner solution.

It actually makes it much messier.  The put_child call is used all over the 
place in many situations where it doesn't make sense to go through updating the 
entire trie.  The big issue is you can't walk up the trie if you are working on 
assembling a node off on the side that you plan to insert into the trie later.  
The only places we need to worry about updating the suffix are when we have 
added a new leaf/suffix, or when we have deleted a leaf/suffix.  That is why in 
the patches I submitted I only update in those two places.

> >>>> +   }
> >>>> +}
> >>>> +
> >>>>  /* Add a child at position i overwriting the old value.
> >>>>   * Update the value of full_children and empty_children.
> >>>> + *
> >>>> + * The suffix length of the parent node and the rest of the tree
> >>>> + is
> >>>> + * updated (if required) when adding/replacing a node, but is
> >>>> + caller's
> >>>> + * responsibility on removal.
> >>>>   */
> >>>>  static void put_child(struct key_vector *tn, unsigned long i,
> >>>>   struct key_vector *n) @@ -447,8 +461,8 @@
> >>>> static void put_child(struct key_vector *tn, unsigned long i,
> >>>> else if (!wasfull && isfull)
> >>>> tn_info(tn)->full_children++;
> >&

Re: [net PATCH 1/2] ipv4: Restore fib_trie_flush_external function and fix call ordering

2016-11-15 Thread Duyck, Alexander H
On Tue, 2016-11-15 at 22:07 +0100, Jiri Pirko wrote:
> Tue, Nov 15, 2016 at 10:01:03PM CET, alexander.h.du...@intel.com wrote:
> > 
> > On Tue, 2016-11-15 at 21:52 +0100, Jiri Pirko wrote:
> > > 
> > > Tue, Nov 15, 2016 at 09:49:02PM CET, alexander.h.du...@intel.com wrote:
> > > > 
> > > > 
> > > > On Tue, 2016-11-15 at 21:31 +0100, Jiri Pirko wrote:
> > > > > 
> > > > > 
> > > > > Tue, Nov 15, 2016 at 09:29:09PM CET, alexander.h.du...@intel.com 
> > > > > wrote:
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > On Tue, 2016-11-15 at 20:51 +0100, Jiri Pirko wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > Tue, Nov 15, 2016 at 11:46:06AM CET, alexander.h.du...@intel.com 
> > > > > > > wrote:
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > The patch that removed the FIB offload infrastructure was a bit 
> > > > > > > > too
> > > > > > > > aggressive and also removed code needed to clean up us 
> > > > > > > > splitting the table
> > > > > > > > if additional rules were added.  Specifically the function
> > > > > > > > fib_trie_flush_external was called at the end of a new rule 
> > > > > > > > being added to
> > > > > > > > flush the foreign trie entries from the main trie.
> > > > > > > > 
> > > > > > > > I updated the code so that we only call fib_trie_flush_external 
> > > > > > > > on the main
> > > > > > > > table so that we flush the entries for local from main.  This 
> > > > > > > > way we don't
> > > > > > > > call it for every rule change which is what was happening 
> > > > > > > > previously.
> > > > > > > 
> > > > > > > Well, the function was introduced by:
> > > > > > > 
> > > > > > > commit 104616e74e0b464d449fdd2ee2f547d2fad71610
> > > > > > > Author: Scott Feldman 
> > > > > > > Date:   Thu Mar 5 21:21:16 2015 -0800
> > > > > > > 
> > > > > > > switchdev: don't support custom ip rules, for now
> > > > > > > 
> > > > > > > Keep switchdev FIB offload model simple for now and don't 
> > > > > > > allow custom ip
> > > > > > > rules.
> > > > > > > 
> > > > > > > Why this was not needed before? What changed in between:
> > > > > > > 104616e74e0b464d449fdd2ee2f547d2fad71610 ("switchdev: don't 
> > > > > > > support custom ip rules, for now")
> > > > > > > and
> > > > > > > 347e3b28c1ba2 ("switchdev: remove FIB offload infrastructure")
> > > > > > 
> > > > > > We collapsed the two tables into one in commit 0ddcf43d5d4a ("ipv4: 
> > > > > > FIB
> > > > > > Local/MAIN table collapse") which was submitted the next day.  Scott
> > > > > > and I were working on things at the same time and the
> > > > > > fib_table_flush_external function was something we had worked out 
> > > > > > that
> > > > > > would allow him to take care of his use case and me to take care of
> > > > > > cleaning up the tables after unmerging.
> > > > > 
> > > > > Okay. But please name the fuction differently, as it does not flush
> > > > > external. Thanks!
> > > > 
> > > > You and I have different meanings for "external".
> > > > 
> > > > In my case I am flushing entries that belong to a foreign "external"
> > > > table from the table specified. So by "external" I am referring to
> > > > entries that don't actually live in main, but actually reside in local.
> > > > If you take a look at fib_table_flush that gets rid of all entries,
> > > > fib_table_flush_external simply clears the foreign ones.
> > > > 
> > > > Also I'd rather maintain naming since it makes it easier if we need to
> > > > backport fixes.
> > > > 
> > > > Finally, the flag RTNH_F_EXTERNAL was renamed over a year ago in commit
> > > > 36583eb54d46c ("rename RTNH_F_EXTERNAL to RTNH_F_OFFLOAD") so there
> > > > isn't too much likelihood of this being confused for something that
> > > > handles offloaded entries.  If you take a look in net/ipv4/* after your
> > > > patch there isn't actually anything that references the word external
> > > > so the likelihood for any confusion is extremely low.
> > > 
> > > Okay. But if you can, please put a comment to this function in order to
> > > prevent future confusion. Thanks!
> > 
> > I'm not sure there is much left to confuse at this point.  The function
> > has gone from multi-purpose to single purpose so anyone that is messing
> > with this should only be doing so if they are messing with the unmerge
> > functionality.
> > 
> > If anything it would be more confusing to refer to functionality that
> > this function doesn't support in the comments.  All this function does
> > is flush foreign/external objects from the tree.
> > 
> > I'm willing to review a patch if you have a suggestion for a comment
> > that would work.  I just want to avoid confusing people by referring to
> > code and functionality that is no longer relevent.
> 
> Perhaps I was the only one confused. Fair enough. Thanks Alex.

No, I would agree the original sharing of the function was a bit
confusing, and that was mostly due to the timeline of things with both
Scott and I rebasing our

Re: [net PATCH 1/2] ipv4: Restore fib_trie_flush_external function and fix call ordering

2016-11-15 Thread Duyck, Alexander H
On Tue, 2016-11-15 at 21:52 +0100, Jiri Pirko wrote:
> Tue, Nov 15, 2016 at 09:49:02PM CET, alexander.h.du...@intel.com wrote:
> > 
> > On Tue, 2016-11-15 at 21:31 +0100, Jiri Pirko wrote:
> > > 
> > > Tue, Nov 15, 2016 at 09:29:09PM CET, alexander.h.du...@intel.com wrote:
> > > > 
> > > > 
> > > > On Tue, 2016-11-15 at 20:51 +0100, Jiri Pirko wrote:
> > > > > 
> > > > > 
> > > > > Tue, Nov 15, 2016 at 11:46:06AM CET, alexander.h.du...@intel.com 
> > > > > wrote:
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > The patch that removed the FIB offload infrastructure was a bit too
> > > > > > aggressive and also removed code needed to clean up us splitting 
> > > > > > the table
> > > > > > if additional rules were added.  Specifically the function
> > > > > > fib_trie_flush_external was called at the end of a new rule being 
> > > > > > added to
> > > > > > flush the foreign trie entries from the main trie.
> > > > > > 
> > > > > > I updated the code so that we only call fib_trie_flush_external on 
> > > > > > the main
> > > > > > table so that we flush the entries for local from main.  This way 
> > > > > > we don't
> > > > > > call it for every rule change which is what was happening 
> > > > > > previously.
> > > > > 
> > > > > Well, the function was introduced by:
> > > > > 
> > > > > commit 104616e74e0b464d449fdd2ee2f547d2fad71610
> > > > > Author: Scott Feldman 
> > > > > Date:   Thu Mar 5 21:21:16 2015 -0800
> > > > > 
> > > > > switchdev: don't support custom ip rules, for now
> > > > > 
> > > > > Keep switchdev FIB offload model simple for now and don't allow 
> > > > > custom ip
> > > > > rules.
> > > > > 
> > > > > Why this was not needed before? What changed in between:
> > > > > 104616e74e0b464d449fdd2ee2f547d2fad71610 ("switchdev: don't support 
> > > > > custom ip rules, for now")
> > > > > and
> > > > > 347e3b28c1ba2 ("switchdev: remove FIB offload infrastructure")
> > > > 
> > > > We collapsed the two tables into one in commit 0ddcf43d5d4a ("ipv4: FIB
> > > > Local/MAIN table collapse") which was submitted the next day.  Scott
> > > > and I were working on things at the same time and the
> > > > fib_table_flush_external function was something we had worked out that
> > > > would allow him to take care of his use case and me to take care of
> > > > cleaning up the tables after unmerging.
> > > 
> > > Okay. But please name the fuction differently, as it does not flush
> > > external. Thanks!
> > 
> > You and I have different meanings for "external".
> > 
> > In my case I am flushing entries that belong to a foreign "external"
> > table from the table specified. So by "external" I am referring to
> > entries that don't actually live in main, but actually reside in local.
> > If you take a look at fib_table_flush that gets rid of all entries,
> > fib_table_flush_external simply clears the foreign ones.
> > 
> > Also I'd rather maintain naming since it makes it easier if we need to
> > backport fixes.
> > 
> > Finally, the flag RTNH_F_EXTERNAL was renamed over a year ago in commit
> > 36583eb54d46c ("rename RTNH_F_EXTERNAL to RTNH_F_OFFLOAD") so there
> > isn't too much likelihood of this being confused for something that
> > handles offloaded entries.  If you take a look in net/ipv4/* after your
> > patch there isn't actually anything that references the word external
> > so the likelihood for any confusion is extremely low.
> 
> Okay. But if you can, please put a comment to this function in order to
> prevent future confusion. Thanks!

I'm not sure there is much left to confuse at this point.  The function
has gone from multi-purpose to single purpose so anyone that is messing
with this should only be doing so if they are messing with the unmerge
functionality.

If anything it would be more confusing to refer to functionality that
this function doesn't support in the comments.  All this function does
is flush foreign/external objects from the tree.

I'm willing to review a patch if you have a suggestion for a comment
that would work.  I just want to avoid confusing people by referring to
code and functionality that is no longer relevent.

- Alex

Re: [net PATCH 1/2] ipv4: Restore fib_trie_flush_external function and fix call ordering

2016-11-15 Thread Duyck, Alexander H
On Tue, 2016-11-15 at 21:31 +0100, Jiri Pirko wrote:
> Tue, Nov 15, 2016 at 09:29:09PM CET, alexander.h.du...@intel.com wrote:
> > 
> > On Tue, 2016-11-15 at 20:51 +0100, Jiri Pirko wrote:
> > > 
> > > Tue, Nov 15, 2016 at 11:46:06AM CET, alexander.h.du...@intel.com wrote:
> > > > 
> > > > 
> > > > The patch that removed the FIB offload infrastructure was a bit too
> > > > aggressive and also removed code needed to clean up us splitting the 
> > > > table
> > > > if additional rules were added.  Specifically the function
> > > > fib_trie_flush_external was called at the end of a new rule being added 
> > > > to
> > > > flush the foreign trie entries from the main trie.
> > > > 
> > > > I updated the code so that we only call fib_trie_flush_external on the 
> > > > main
> > > > table so that we flush the entries for local from main.  This way we 
> > > > don't
> > > > call it for every rule change which is what was happening previously.
> > > 
> > > Well, the function was introduced by:
> > > 
> > > commit 104616e74e0b464d449fdd2ee2f547d2fad71610
> > > Author: Scott Feldman 
> > > Date:   Thu Mar 5 21:21:16 2015 -0800
> > > 
> > > switchdev: don't support custom ip rules, for now
> > > 
> > > Keep switchdev FIB offload model simple for now and don't allow 
> > > custom ip
> > > rules.
> > > 
> > > Why this was not needed before? What changed in between:
> > > 104616e74e0b464d449fdd2ee2f547d2fad71610 ("switchdev: don't support 
> > > custom ip rules, for now")
> > > and
> > > 347e3b28c1ba2 ("switchdev: remove FIB offload infrastructure")
> > 
> > We collapsed the two tables into one in commit 0ddcf43d5d4a ("ipv4: FIB
> > Local/MAIN table collapse") which was submitted the next day.  Scott
> > and I were working on things at the same time and the
> > fib_table_flush_external function was something we had worked out that
> > would allow him to take care of his use case and me to take care of
> > cleaning up the tables after unmerging.
> 
> Okay. But please name the fuction differently, as it does not flush
> external. Thanks!

You and I have different meanings for "external".

In my case I am flushing entries that belong to a foreign "external"
table from the table specified. So by "external" I am referring to
entries that don't actually live in main, but actually reside in local.
If you take a look at fib_table_flush that gets rid of all entries,
fib_table_flush_external simply clears the foreign ones.

Also I'd rather maintain naming since it makes it easier if we need to
backport fixes.

Finally, the flag RTNH_F_EXTERNAL was renamed over a year ago in commit
36583eb54d46c ("rename RTNH_F_EXTERNAL to RTNH_F_OFFLOAD") so there
isn't too much likelihood of this being confused for something that
handles offloaded entries.  If you take a look in net/ipv4/* after your
patch there isn't actually anything that references the word external
so the likelihood for any confusion is extremely low.

- Alex


Re: [net PATCH 1/2] ipv4: Restore fib_trie_flush_external function and fix call ordering

2016-11-15 Thread Duyck, Alexander H
On Tue, 2016-11-15 at 20:51 +0100, Jiri Pirko wrote:
> Tue, Nov 15, 2016 at 11:46:06AM CET, alexander.h.du...@intel.com wrote:
> > 
> > The patch that removed the FIB offload infrastructure was a bit too
> > aggressive and also removed code needed to clean up us splitting the table
> > if additional rules were added.  Specifically the function
> > fib_trie_flush_external was called at the end of a new rule being added to
> > flush the foreign trie entries from the main trie.
> > 
> > I updated the code so that we only call fib_trie_flush_external on the main
> > table so that we flush the entries for local from main.  This way we don't
> > call it for every rule change which is what was happening previously.
> 
> Well, the function was introduced by:
> 
> commit 104616e74e0b464d449fdd2ee2f547d2fad71610
> Author: Scott Feldman 
> Date:   Thu Mar 5 21:21:16 2015 -0800
> 
> switchdev: don't support custom ip rules, for now
> 
> Keep switchdev FIB offload model simple for now and don't allow custom ip
> rules.
> 
> Why this was not needed before? What changed in between:
> 104616e74e0b464d449fdd2ee2f547d2fad71610 ("switchdev: don't support custom ip 
> rules, for now")
> and
> 347e3b28c1ba2 ("switchdev: remove FIB offload infrastructure")

We collapsed the two tables into one in commit 0ddcf43d5d4a ("ipv4: FIB
Local/MAIN table collapse") which was submitted the next day.  Scott
and I were working on things at the same time and the
fib_table_flush_external function was something we had worked out that
would allow him to take care of his use case and me to take care of
cleaning up the tables after unmerging.

- Alex


Re: [net] 2ab9fb18c4: kernel BUG at include/linux/skbuff.h:1935!

2016-11-14 Thread Duyck, Alexander H
On Mon, 2016-11-14 at 07:49 +0800, kernel test robot wrote:
> FYI, we noticed the following commit:
> 
> https://github.com/0day-ci/linux 
> Eric-Dumazet/net-__skb_flow_dissect-must-cap-its-return-value/20161110-080839
> commit 2ab9fb18c46b91b16a0f0f329336d3be9fc32deb ("net: __skb_flow_dissect() 
> must cap its return value")
> 
> in testcase: kbuild
> with following parameters:
> 
>   runtime: 300s
>   nr_task: 50%
>   cpufreq_governor: performance
> 
> 
> 
> 
> on test machine: 8 threads Intel(R) Atom(TM) CPU  C2750  @ 2.40GHz with 16G 
> memory
> 
> caused below changes:
> 
> 
> +---+++
> > 
> >   | cdb26d3387 | 
> > 2ab9fb18c4 |
> +---+++
> > 
> > boot_successes| 10 | 3  
> > |
> > boot_failures | 0  | 9  
> > |
> > kernel_BUG_at_include/linux/skbuff.h  | 0  | 8  
> > |
> > invalid_opcode:#[##]SMP   | 0  | 8  
> > |
> > RIP:eth_type_trans| 0  | 8  
> > |
> > Kernel_panic-not_syncing:Fatal_exception_in_interrupt | 0  | 5  
> > |
> > WARNING:at_fs/sysfs/dir.c:#sysfs_warn_dup | 0  | 1  
> > |
> > calltrace:parport_pc_init | 0  | 1  
> > |
> > calltrace:SyS_finit_module| 0  | 1  
> > |
> > WARNING:at_lib/kobject.c:#kobject_add_internal| 0  | 1  
> > |
> +---+++
> 
> 
> 
> [   20.491020] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [   20.502988] Sending DHCP requests .
> [   20.506729] [ cut here ]
> [   20.511369] kernel BUG at include/linux/skbuff.h:1935!
> [   20.517893] invalid opcode:  [#1] SMP
> [   20.521902] Modules linked in:
> [   20.524979] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 
> 4.9.0-rc3-00286-g2ab9fb1 #1
> [   20.532463] Hardware name: Supermicro SYS-5018A-TN4/A1SAi, BIOS 1.1a 
> 08/27/2015
> [   20.539768] task: 8804456c2480 task.stack: c9000192
> [   20.545684] RIP: 0010:[]  [] 
> eth_type_trans+0xe8/0x140
> [   20.553972] RSP: 0018:88047fd03db8  EFLAGS: 00010297
> [   20.559283] RAX: 0158 RBX: 88047d8ae600 RCX: 
> 1073
> [   20.566415] RDX: 88047bf07dc0 RSI: 88047d8a4000 RDI: 
> 88047dac0f00
> [   20.573546] RBP: 88047fd03e20 R08: 88047d8a4000 R09: 
> 0800
> [   20.580678] R10: 88047bf07ec0 R11: ea0011f6e400 R12: 
> 88047dac0f00
> [   20.587810] R13: 880457413000 R14: c90002129000 R15: 
> 015e
> [   20.594946] FS:  () GS:88047fd0() 
> knlGS:
> [   20.603032] CS:  0010 DS:  ES:  CR0: 80050033
> [   20.608775] CR2: 7fffadfb4ef0 CR3: 00047ee07000 CR4: 
> 001006e0
> [   20.615906] Stack:
> [   20.617927]  816905a7 ea0011f6e400 ea08 
> 88047d8ae450
> [   20.625403]  88047d8ae400 00400166 ea0011f6e400 
> 
> [   20.632873]  0040  88047d8ae450 
> 88047d8b1140
> [   20.640352] Call Trace:
> [   20.642805]   
> [   20.644740]  [] ? igb_clean_rx_irq+0x6a7/0x7d0
> [   20.650760]  [] igb_poll+0x382/0x700
> [   20.655904]  [] ? timerqueue_add+0x59/0xb0
> [   20.661564]  [] net_rx_action+0x217/0x360
> [   20.667137]  [] __do_softirq+0x104/0x2ab
> [   20.672624]  [] irq_exit+0xf1/0x100
> [   20.677673]  [] do_IRQ+0x54/0xd0
> [   20.682466]  [] common_interrupt+0x8c/0x8c
> [   20.688123]   
> [   20.690054]  [] ? cpuidle_enter_state+0x122/0x2e0
> [   20.696333]  [] cpuidle_enter+0x17/0x20
> [   20.701733]  [] call_cpuidle+0x23/0x40
> [   20.707045]  [] cpu_startup_entry+0x114/0x200
> [   20.712964]  [] start_secondary+0x107/0x130
> [   20.718708] Code: 00 04 00 00 c9 c3 48 33 86 70 03 00 00 48 c1 e0 10 48 85 
> c0 0f b6 87 90 00 00 00 75 28 83 e0 f8 83 c8 01 88 87 90 00 00 00 eb 82 <0f> 
> 0b 0f b6 87 90 00 00 00 83 e0 f8 83 c8 03 88 87 90 00 00 00 
> [   20.738722] RIP  [] eth_type_trans+0xe8/0x140
> [   20.744662]  RSP 
> [   20.748160] ---[ end trace 153440bf1ca2e6fc ]---
> [   20.748165] [ cut here ]
> 
> 
> To reproduce:
> 
> git clone 
> git://git.kernel.org/pub/scm/linux/kernel/git/wfg/lkp-tests.git
> cd lkp-tests
> bin/lkp install job.yaml  # job file is attached in this email
> bin/lkp run job.yaml
> 
> 
> 
> Thanks,
> Kernel Test Robot


So I am trying to reproduce this but need some additional data as just
copying the config apparently is

RE: [mm PATCH v2 01/26] swiotlb: Drop unused functions swiotlb_map_sg and swiotlb_unmap_sg

2016-11-03 Thread Duyck, Alexander H
> -Original Message-
> From: Christoph Hellwig [mailto:h...@infradead.org]
> Sent: Thursday, November 3, 2016 7:46 AM
> To: Konrad Rzeszutek Wilk 
> Cc: Christoph Hellwig ; Duyck, Alexander H
> ; linux...@kvack.org; akpm@linux-
> foundation.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: Re: [mm PATCH v2 01/26] swiotlb: Drop unused functions
> swiotlb_map_sg and swiotlb_unmap_sg
> 
> On Thu, Nov 03, 2016 at 10:29:52AM -0400, Konrad Rzeszutek Wilk wrote:
> > Somehow I thought you wanted to put them through your tree (which is
> > why I acked them).
> >
> > I can take them and also the first couple of Alexander through my
> > tree. Or if it makes it simpler - they can go through the -mm tree?
> 
> I don't have a tree for it, so I kinda expected you to pick it up.
> But I'm also fine with you just Acking the version from Alex and having him
> funnel it through whatever tree he wants to get his patches in through.

For the first 3 patches in my series I am fine with them being pulled into the 
swiotlb tree.  So if you want to pull Christoph's two patches, and then drop my 
duplicate patch and instead pull the next 2 I could submit a v3 of my series 
without the swiotlb patches in it.

At this point I have redone my series so that I technically don't have anything 
with a hard dependency on the DMA_ATTR_SKIP_CPU_SYNC actually doing anything 
yet.  My plan is to get this all into Linus's tree first via whatever tree I 
can get these patches pulled into and once I have all that I will start 
updating drivers in net-next.

Thanks.

- Alex


Re: [PATCH RFC] ixgbe: ixgbe_atr() must check if network header is available in headlen

2016-10-17 Thread Duyck, Alexander H
On Sat, 2016-10-15 at 17:31 -0400, Sowmini Varadhan wrote:
> For some Tx paths (e.g., tpacket_snd()), ixgbe_atr may be
> passed down an sk_buff that has the network and transport
> header in the paged data, so it needs to make sure these
> headers are available in the headlen bytes to calculate the
> l4_proto.
> 
> This patch bails out if the headlen is "too short", and does
> not attempt to call skb_header_pointer() to get the needed
> bytes: the assumption is that the caller should set things
> up properly if the l4_proto based tx steering is desired.
> 
> > Signed-off-by: Sowmini Varadhan 
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |9 +
>  1 files changed, 9 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
> b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index a244d9a..0868de1 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -7632,6 +7632,7 @@ static void ixgbe_atr(struct ixgbe_ring *ring,
> >     struct sk_buff *skb;
> >     __be16 vlan_id;
> >     int l4_proto;
> > +   int min_hdr_size = 0;
>  
> >     /* if ring doesn't have a interrupt vector, cannot perform ATR */
> >     if (!q_vector)
> @@ -7650,6 +7651,14 @@ static void ixgbe_atr(struct ixgbe_ring *ring,
>  
> >     /* snag network header to get L4 type and address */
> >     skb = first->skb;
> > +   if (first->protocol == htons(ETH_P_IP))
> > +   min_hdr_size = sizeof(struct iphdr) +
> > +      sizeof(struct tcphdr);
> > +   else if (first->protocol == htons(ETH_P_IPV6))
> > +   min_hdr_size = sizeof(struct ipv6hdr) +
> > +      sizeof(struct tcphdr);
> > +   if (min_hdr_size && skb_headlen(skb) < ETH_HLEN + min_hdr_size)
> > +   return;
> >     hdr.network = skb_network_header(skb);
> >     if (skb->encapsulation &&
> >     first->protocol == htons(ETH_P_IP) &&

So this doesn't really cover all the cases necessary.  There end up
being essentially 3 spots where we need to perform checks to verify the
header size.

The first one is inside the checks for skb->encapsulation, ETH_P_IP,
and IPPROTO_UDP.  We could probably just verify that skb_tail_pointer
is greater than skb_transport_header + (8 + 8 + 14), the minimum size
needed to support VxLAN.

The second block where we need to perform this check would be after
this check once the network header has been updated.  There we need to
verify that hdr.network + 40 is less than or equal to skb_tail_pointer.
 That covers both IPv4 w/ TCP or IPv6.

The third check that needs to be performed is to verify that
hdr.network + hlen is greater than or equal to skb_tail_pointer - 20.
That is needed to verify we have enough room for the tcp header data to
be pulled.

- Alex


RE: bug in ixgbe_atr

2016-10-14 Thread Duyck, Alexander H


> -Original Message-
> From: Sowmini Varadhan [mailto:sowmini.varad...@oracle.com]
> Sent: Thursday, October 13, 2016 8:49 PM
> To: Duyck, Alexander H 
> Cc: netdev@vger.kernel.org
> Subject: Re: bug in ixgbe_atr
> 
> On (10/14/16 02:06), Duyck, Alexander H wrote:
> > > + case ETH_P_IP:
> > > + skb_header_pointer(skb, ETH_HLEN, sizeof (struct iphdr),
> > > +&ip_hdr);
> > >   /* access ihl as u8 to avoid unaligned access on ia64 */
> > > - hlen = (hdr.network[0] & 0x0F) << 2;
> > > - l4_proto = hdr.ipv4->protocol;
> > > + hlen = ip_hdr.ipv4.ihl << 2;
> > > + l4_proto = ip_hdr.ipv4.protocol;
> > >   break;
>   :
> > The problem is this will break other stuff, for example I have seen
> > the ihl access actually cause problems with unaligned accesses as some
> > architectures decide to pull it as a u32 and then mask it.
> 
> Yes, I noticed that u8 comment for ia64.. if that's the only issue here, we 
> could
> just reset hdr.network to &ip_hdr..
> 
> However, I suspect the above patch is probably not going to work for the vlan
> case (it was just a first-pass hack)

I kind of figured that.  Ideally we only wat to pick out the pieces we need.  I 
would prefer to avoid skb_header_pointer if possible since we only need a few 
parts of the header and don't really need to copy the whole thing.

> > My advice would be to keep this simple.  Add a check to make sure we
> > have room for at least skb_headlen(skb) - 40  >= hrd.raw - skb->data.
> 
> I don't parse that- the hdr union in ixgbe_atr doesnt have a ->raw field. Can 
> you
> explain?

Sorry I was thinking of a different piece of code.  In the case of the atr code 
it would be hdr.network, not hdr.raw.  Basically the thought was to validate 
that there is enough data in skb_headlen() that we can verify that from where 
the network header should be we have at least 40 bytes of data as that would be 
the minimum needed for a TCP header and an IPv4 header, or just an IPv6 header. 
 We would probably need a separate follow-up for the TCP header after we 
validate network header.

> > Messing with the protocol bits will break stuff since there is support
> > for tunneling also floating around in here now.
> >
> > I believe we are planning on dropping this code in favor of
> > ndo_rx_flow_steer in the future.  If we do that then the whole problem
> > becomes moot.
> 
> Dropping it is fine with me I guess - maybe just return, if the
> skb_headlen() doesnt have enough bytes for a network header, i.e., skb_headlen
> is at least ETH_HLEN + sizeof (struct iphdr) for ETH_P_IP, or  ETH_HLEN + 
> sizeof
> (struct ipv6hdr) for ETH_P_IPV6?
> 
> --Sowmini

Right that is kind of what I was thinking.  If we validate that we have at 
least 40 before inspecting the network header, and at least 20 before we 
validate the TCP header that would work for me.

- Alex


RE: bug in ixgbe_atr

2016-10-13 Thread Duyck, Alexander H
> -Original Message-
> From: Sowmini Varadhan [mailto:sowmini.varad...@oracle.com]
> Sent: Thursday, October 13, 2016 6:44 PM
> To: Duyck, Alexander H ;
> netdev@vger.kernel.org
> Subject: bug in ixgbe_atr
> 
> When I was playing around with TPACKET_V2, I think I ran into a bug in
> ixgbe_atr(): if I get here with an sk_buff that has, e.g., just 14 bytes in 
> the
> header, then the skb_network_header is invalid. I found my kernel sometimes
> wandering off into  ipv6_find_hdr() in an attempt to get the l4_proto, and 
> then
> complaining that "IPv6 header not found\n"
> (this was an ipv4 packet).
> 
> I think we want to use skb_header_pointer in ixgbe_atr to get the network
> header itself.. I tried the patch below, and it works for simple (non-vlan, 
> basic)
> ethernet header, but probably needs more refinement to work for more
> complex encapsulations?
> 
> And other drivers may need a similar fix too, I've not checked yet.
> 
> --Sowmini
> 
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index a244d9a..be453c6 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -7627,6 +7627,10 @@ static void ixgbe_atr(struct ixgbe_ring *ring,
>   struct iphdr *ipv4;
>   struct ipv6hdr *ipv6;
>   } hdr;
> + union {
> + struct iphdr ipv4;
> + struct ipv6hdr ipv6;
> + } ip_hdr;
>   struct tcphdr *th;
>   unsigned int hlen;
>   struct sk_buff *skb;
> @@ -7667,13 +7671,15 @@ static void ixgbe_atr(struct ixgbe_ring *ring,
>   }
> 
>   /* Currently only IPv4/IPv6 with TCP is supported */
> - switch (hdr.ipv4->version) {
> - case IPVERSION:
> + switch (ntohs(first->protocol)) {
> + case ETH_P_IP:
> + skb_header_pointer(skb, ETH_HLEN, sizeof (struct iphdr),
> +&ip_hdr);
>   /* access ihl as u8 to avoid unaligned access on ia64 */
> - hlen = (hdr.network[0] & 0x0F) << 2;
> - l4_proto = hdr.ipv4->protocol;
> + hlen = ip_hdr.ipv4.ihl << 2;
> + l4_proto = ip_hdr.ipv4.protocol;
>   break;
> - case 6:
> + case ETH_P_IPV6:
>   hlen = hdr.network - skb->data;
>   l4_proto = ipv6_find_hdr(skb, &hlen, IPPROTO_TCP, NULL,
> NULL);
>   hlen -= hdr.network - skb->data;

The problem is this will break other stuff, for example I have seen the ihl 
access actually cause problems with unaligned accesses as some architectures 
decide to pull it as a u32 and then mask it.

My advice would be to keep this simple.  Add a check to make sure we have room 
for at least skb_headlen(skb) - 40  >= hrd.raw - skb->data.  Messing with the 
protocol bits will break stuff since there is support for tunneling also 
floating around in here now.

I believe we are planning on dropping this code in favor of ndo_rx_flow_steer 
in the future.  If we do that then the whole problem becomes moot.

- Alex


Re: [Intel-wired-lan] [PATCH 249/249] net:ethernet:intel:igb_main.c: Add Throttling disable option in order to decrease latency usually required by RT applications.

2016-09-07 Thread Duyck, Alexander H
On Tue, 2016-09-06 at 14:46 +, Amir Yihie wrote:
> From e433f5bedd49a65b7337d025f3f204f0f21ad5e5 Mon Sep 17 00:00:00
> 2001
> From: amiry 
> Date: Tue, 6 Sep 2016 14:58:38 +0300
> Subject: [PATCH 249/249] net:ethernet:intel:igb_main.c: Add
> Throttling disable
>  option in order to decrease latency usually required by RT
> applications.
> 
> Add an option to Disable Throttling algorithm , which will result
> with minimum latency on received frames .
> mainly targets for RT application that look for low larceny and less
> cares about throughput .
> How to use :
> On insmod , add parameter : InterruptThrottleRate=0
> 
> This patch is currently against a linux 4.1.27-rt31 kernel.
> 
> Signed-off-by: Amir Yihie 
> ---
>  drivers/net/ethernet/intel/igb/igb_main.c | 11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c
> b/drivers/net/ethernet/intel/igb/igb_main.c
> index 7dfbcde..9480850 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -251,6 +251,14 @@ static int debug = -1;
>  module_param(debug, int, 0);
>  MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)");
>  
> +/* Interrupt Throttle Rate (interrupts/sec)
> + *
> + * Valid Range: 0=off, 3=dynamic conservative
> + */
> +static int InterruptThrottleRate = 3;
> +module_param(InterruptThrottleRate, int, 0);
> +MODULE_PARM_DESC(InterruptThrottleRate, "InterruptThrottleRate
> (0=off, 3=dynamic conservative)");
> +

This is creating redundant functionality.  The same can be done via the
ethtool command "ethtool -C  rx-usecs 0".  Is there some reason
why you couldn't use ethtool?

>  struct igb_reg_info {
> u32 ofs;
> char *name;
> @@ -2966,6 +2974,9 @@ static int igb_sw_init(struct igb_adapter
> *adapter)
> adapter->rx_itr_setting = IGB_DEFAULT_ITR;
> adapter->tx_itr_setting = IGB_DEFAULT_ITR;
>  
> +   if (InterruptThrottleRate == 0)
> +   adapter->rx_itr_setting = InterruptThrottleRate;
> +
> /* set default work limits */
> adapter->tx_work_limit = IGB_DEFAULT_TX_WORK;

If you are going to use InterruptThrottleRate you might as well use it
fully instead of making it a boolean value with 0 having one meaning
and the default of 3 being another.

Thanks.

- Alex