Re: [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers

2017-05-22 Thread Tom Herbert
On Mon, May 22, 2017 at 1:42 PM, Jesper Dangaard Brouer
 wrote:
> On Mon, 22 May 2017 08:39:35 +0200
> Jesper Dangaard Brouer  wrote:
>
>> On Sun, 21 May 2017 15:10:29 -0700
>> Tom Herbert  wrote:
>>
>> > On Sun, May 21, 2017 at 9:04 AM, Jesper Dangaard Brouer
>> >  wrote:
>> > > On Sat, 20 May 2017 09:16:09 -0700
>> > > Tom Herbert  wrote:
>> > >
>> > >> > +/* XDP rxhash have an associated type, which is related to the RSS
>> > >> > + * (Receive Side Scaling) standard, but NIC HW have different mapping
>> > >> > + * and support. Thus, create mapping that is interesting for XDP.  
>> > >> > XDP
>> > >> > + * would primarly want insight into L3 and L4 protocol info.
>> > >> > + *
>> > >> > + * TODO: Likely need to get extended with "L3_IPV6_EX" due RSS 
>> > >> > standard
>> > >> > + *
>> > >> > + * The HASH_TYPE will be returned from bpf helper as the top 32-bit 
>> > >> > of
>> > >> > + * the 64-bit rxhash (internally type stored in xdp_buff->flags).
>> > >> > + */
>> > >> > +#define XDP_HASH(x)((x) & ((1ULL << 32)-1))
>> > >> > +#define XDP_HASH_TYPE(x)   ((x) >> 32)
>> > >> > +
>> > >> > +#define XDP_HASH_TYPE_L3_SHIFT 0
>> > >> > +#define XDP_HASH_TYPE_L3_BITS  3
>> > >> > +#define XDP_HASH_TYPE_L3_MASK  ((1ULL << XDP_HASH_TYPE_L3_BITS)-1)
>> > >> > +#define XDP_HASH_TYPE_L3(x)((x) & XDP_HASH_TYPE_L3_MASK)
>> > >> > +enum {
>> > >> > +   XDP_HASH_TYPE_L3_IPV4 = 1,
>> > >> > +   XDP_HASH_TYPE_L3_IPV6,
>> > >> > +};
>> > >> > +
>> > >> > +#define XDP_HASH_TYPE_L4_SHIFT XDP_HASH_TYPE_L3_BITS
>> > >> > +#define XDP_HASH_TYPE_L4_BITS  5
>> > >> > +#define XDP_HASH_TYPE_L4_MASK
>> > >> >   \
>> > >> > +   (((1ULL << XDP_HASH_TYPE_L4_BITS)-1) << 
>> > >> > XDP_HASH_TYPE_L4_SHIFT)
>> > >> > +#define XDP_HASH_TYPE_L4(x)((x) & XDP_HASH_TYPE_L4_MASK)
>> > >> > +enum {
>> > >> > +   _XDP_HASH_TYPE_L4_TCP = 1,
>> > >> > +   _XDP_HASH_TYPE_L4_UDP,
>> > >> > +};
>> > >> > +#define XDP_HASH_TYPE_L4_TCP (_XDP_HASH_TYPE_L4_TCP << 
>> > >> > XDP_HASH_TYPE_L4_SHIFT)
>> > >> > +#define XDP_HASH_TYPE_L4_UDP (_XDP_HASH_TYPE_L4_UDP << 
>> > >> > XDP_HASH_TYPE_L4_SHIFT)
>> > >> > +
>> > >> Hi Jesper,
>> > >>
>> > >> Why do we need these indicators for protocol specific hash? It seems
>> > >> like L4 and L3 is useful differentiation and protocol agnostic (I'm
>> > >> still holding out hope that SCTP will be deployed some day ;-) )
>> > >
>> > > I'm not sure I understood the question fully, but let me try to answer
>> > > anyway.  To me it seems obvious that you would want to know the
>> > > protocol/L4 type, as this helps avoid hash collisions between UDP and
>> > > TCP flows.  I can easily imagine someone constructing an UDP packet
>> > > that could hash collide with a given TCP flow.
>> > >
>> > > And yes, i40 support matching SCTP, and we will create a
>> > > XDP_HASH_TYPE_L4_SCTP when adding XDP rxhash support for that driver.
>> > >
>> > But where would this information be used? We don't save it in skbuff,
>> > don't use it in RPS, RFS. RSS doesn't use it for packet steering so
>> > the hash collision problem already exists at the device level. If
>> > there is a collision problem between two protocols then maybe hash
>> > should be over 5-tuple instead...
>>
>> One use-case (I heard at a customer) was that they had (web-)servers
>> that didn't serve any UDP traffic, thus they simply block/drop all
>> incoming UDP on the service NIC (as an ACL in the switch). (The servers
>> own DNS lookups and NTP goes through the management NIC to internal
>> DNS/NTP servers).
>>
>> Another use-case: Inside an XDP/bpf program is can be used for
>> splitting protocol processing, into different tail calls, before even
>> touching packet-data.  I can imagine the bpf TCP handling code is
>> larger, thus an optimization is to have a separate tail call for the
>> UDP protocol handling.  One could also transfer/queue all TCP traffic
>> to other CPU(s) like RPS, just without touching packet memory.
>>
>>
>> This info is saved in the skb, but due to space constrains, it is
>> reduced to a single bit, namely skb->l4_hash, iif some
>> RSS-proto/XDP_HASH_TYPE_L4_* bit was set.  And the network stack do use
>> and react on this.
>
> I also want to mention another real-customer use-case.  Some
> deployments have a VXLAN based networks, but some NICs cannot
> understand VXLAN do cannot do proper RSS rx-hashing, which resulted in
> bad CPU scaling as all VXLAN packets gets delivered to the same CPU.
>
They need to turn on RSS for UDP I think.

> Thus, I would like to implement recalculation of the RXHASH in XDP, as
> that would save me implementing yet another extension to the flow
> dissector, that the kernel would have to carry forever, while this is
> just a matter of NIC hashing getting improved.
>
> With the extra L3 and L4 info, I'm assuming that 

Re: [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers

2017-05-22 Thread Jesper Dangaard Brouer
On Mon, 22 May 2017 08:39:35 +0200
Jesper Dangaard Brouer  wrote:

> On Sun, 21 May 2017 15:10:29 -0700
> Tom Herbert  wrote:
> 
> > On Sun, May 21, 2017 at 9:04 AM, Jesper Dangaard Brouer
> >  wrote:  
> > > On Sat, 20 May 2017 09:16:09 -0700
> > > Tom Herbert  wrote:
> > >
> > >> > +/* XDP rxhash have an associated type, which is related to the RSS
> > >> > + * (Receive Side Scaling) standard, but NIC HW have different mapping
> > >> > + * and support. Thus, create mapping that is interesting for XDP.  XDP
> > >> > + * would primarly want insight into L3 and L4 protocol info.
> > >> > + *
> > >> > + * TODO: Likely need to get extended with "L3_IPV6_EX" due RSS 
> > >> > standard
> > >> > + *
> > >> > + * The HASH_TYPE will be returned from bpf helper as the top 32-bit of
> > >> > + * the 64-bit rxhash (internally type stored in xdp_buff->flags).
> > >> > + */
> > >> > +#define XDP_HASH(x)((x) & ((1ULL << 32)-1))
> > >> > +#define XDP_HASH_TYPE(x)   ((x) >> 32)
> > >> > +
> > >> > +#define XDP_HASH_TYPE_L3_SHIFT 0
> > >> > +#define XDP_HASH_TYPE_L3_BITS  3
> > >> > +#define XDP_HASH_TYPE_L3_MASK  ((1ULL << XDP_HASH_TYPE_L3_BITS)-1)
> > >> > +#define XDP_HASH_TYPE_L3(x)((x) & XDP_HASH_TYPE_L3_MASK)
> > >> > +enum {
> > >> > +   XDP_HASH_TYPE_L3_IPV4 = 1,
> > >> > +   XDP_HASH_TYPE_L3_IPV6,
> > >> > +};
> > >> > +
> > >> > +#define XDP_HASH_TYPE_L4_SHIFT XDP_HASH_TYPE_L3_BITS
> > >> > +#define XDP_HASH_TYPE_L4_BITS  5
> > >> > +#define XDP_HASH_TYPE_L4_MASK 
> > >> >  \
> > >> > +   (((1ULL << XDP_HASH_TYPE_L4_BITS)-1) << XDP_HASH_TYPE_L4_SHIFT)
> > >> > +#define XDP_HASH_TYPE_L4(x)((x) & XDP_HASH_TYPE_L4_MASK)
> > >> > +enum {
> > >> > +   _XDP_HASH_TYPE_L4_TCP = 1,
> > >> > +   _XDP_HASH_TYPE_L4_UDP,
> > >> > +};
> > >> > +#define XDP_HASH_TYPE_L4_TCP (_XDP_HASH_TYPE_L4_TCP << 
> > >> > XDP_HASH_TYPE_L4_SHIFT)
> > >> > +#define XDP_HASH_TYPE_L4_UDP (_XDP_HASH_TYPE_L4_UDP << 
> > >> > XDP_HASH_TYPE_L4_SHIFT)
> > >> > +
> > >> Hi Jesper,
> > >>
> > >> Why do we need these indicators for protocol specific hash? It seems
> > >> like L4 and L3 is useful differentiation and protocol agnostic (I'm
> > >> still holding out hope that SCTP will be deployed some day ;-) )
> > >
> > > I'm not sure I understood the question fully, but let me try to answer
> > > anyway.  To me it seems obvious that you would want to know the
> > > protocol/L4 type, as this helps avoid hash collisions between UDP and
> > > TCP flows.  I can easily imagine someone constructing an UDP packet
> > > that could hash collide with a given TCP flow.
> > >
> > > And yes, i40 support matching SCTP, and we will create a
> > > XDP_HASH_TYPE_L4_SCTP when adding XDP rxhash support for that driver.
> > >
> > But where would this information be used? We don't save it in skbuff,
> > don't use it in RPS, RFS. RSS doesn't use it for packet steering so
> > the hash collision problem already exists at the device level. If
> > there is a collision problem between two protocols then maybe hash
> > should be over 5-tuple instead...  
> 
> One use-case (I heard at a customer) was that they had (web-)servers
> that didn't serve any UDP traffic, thus they simply block/drop all
> incoming UDP on the service NIC (as an ACL in the switch). (The servers
> own DNS lookups and NTP goes through the management NIC to internal
> DNS/NTP servers).
> 
> Another use-case: Inside an XDP/bpf program is can be used for
> splitting protocol processing, into different tail calls, before even
> touching packet-data.  I can imagine the bpf TCP handling code is
> larger, thus an optimization is to have a separate tail call for the
> UDP protocol handling.  One could also transfer/queue all TCP traffic
> to other CPU(s) like RPS, just without touching packet memory.
> 
> 
> This info is saved in the skb, but due to space constrains, it is
> reduced to a single bit, namely skb->l4_hash, iif some
> RSS-proto/XDP_HASH_TYPE_L4_* bit was set.  And the network stack do use
> and react on this.
 
I also want to mention another real-customer use-case.  Some
deployments have a VXLAN based networks, but some NICs cannot
understand VXLAN do cannot do proper RSS rx-hashing, which resulted in
bad CPU scaling as all VXLAN packets gets delivered to the same CPU.

Thus, I would like to implement recalculation of the RXHASH in XDP, as
that would save me implementing yet another extension to the flow
dissector, that the kernel would have to carry forever, while this is
just a matter of NIC hashing getting improved.

With the extra L3 and L4 info, I'm assuming that XDP_HASH_TYPE_L3(x)
and XDP_HASH_TYPE_L4(x) will be zero for VXLAN as the NIC cannot
identify this.  Thus, I can at an early stage know which packets needs
to get a new rxhash.

I've seen a similar problem with Q-in-Q double tagged VLANs, failing
the RSS-hash 

Re: [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers

2017-05-22 Thread Jesper Dangaard Brouer
On Sun, 21 May 2017 15:10:29 -0700
Tom Herbert  wrote:

> On Sun, May 21, 2017 at 9:04 AM, Jesper Dangaard Brouer
>  wrote:
> > On Sat, 20 May 2017 09:16:09 -0700
> > Tom Herbert  wrote:
> >  
> >> > +/* XDP rxhash have an associated type, which is related to the RSS
> >> > + * (Receive Side Scaling) standard, but NIC HW have different mapping
> >> > + * and support. Thus, create mapping that is interesting for XDP.  XDP
> >> > + * would primarly want insight into L3 and L4 protocol info.
> >> > + *
> >> > + * TODO: Likely need to get extended with "L3_IPV6_EX" due RSS standard
> >> > + *
> >> > + * The HASH_TYPE will be returned from bpf helper as the top 32-bit of
> >> > + * the 64-bit rxhash (internally type stored in xdp_buff->flags).
> >> > + */
> >> > +#define XDP_HASH(x)((x) & ((1ULL << 32)-1))
> >> > +#define XDP_HASH_TYPE(x)   ((x) >> 32)
> >> > +
> >> > +#define XDP_HASH_TYPE_L3_SHIFT 0
> >> > +#define XDP_HASH_TYPE_L3_BITS  3
> >> > +#define XDP_HASH_TYPE_L3_MASK  ((1ULL << XDP_HASH_TYPE_L3_BITS)-1)
> >> > +#define XDP_HASH_TYPE_L3(x)((x) & XDP_HASH_TYPE_L3_MASK)
> >> > +enum {
> >> > +   XDP_HASH_TYPE_L3_IPV4 = 1,
> >> > +   XDP_HASH_TYPE_L3_IPV6,
> >> > +};
> >> > +
> >> > +#define XDP_HASH_TYPE_L4_SHIFT XDP_HASH_TYPE_L3_BITS
> >> > +#define XDP_HASH_TYPE_L4_BITS  5
> >> > +#define XDP_HASH_TYPE_L4_MASK  \
> >> > +   (((1ULL << XDP_HASH_TYPE_L4_BITS)-1) << XDP_HASH_TYPE_L4_SHIFT)
> >> > +#define XDP_HASH_TYPE_L4(x)((x) & XDP_HASH_TYPE_L4_MASK)
> >> > +enum {
> >> > +   _XDP_HASH_TYPE_L4_TCP = 1,
> >> > +   _XDP_HASH_TYPE_L4_UDP,
> >> > +};
> >> > +#define XDP_HASH_TYPE_L4_TCP (_XDP_HASH_TYPE_L4_TCP << 
> >> > XDP_HASH_TYPE_L4_SHIFT)
> >> > +#define XDP_HASH_TYPE_L4_UDP (_XDP_HASH_TYPE_L4_UDP << 
> >> > XDP_HASH_TYPE_L4_SHIFT)
> >> > +  
> >> Hi Jesper,
> >>
> >> Why do we need these indicators for protocol specific hash? It seems
> >> like L4 and L3 is useful differentiation and protocol agnostic (I'm
> >> still holding out hope that SCTP will be deployed some day ;-) )  
> >
> > I'm not sure I understood the question fully, but let me try to answer
> > anyway.  To me it seems obvious that you would want to know the
> > protocol/L4 type, as this helps avoid hash collisions between UDP and
> > TCP flows.  I can easily imagine someone constructing an UDP packet
> > that could hash collide with a given TCP flow.
> >
> > And yes, i40 support matching SCTP, and we will create a
> > XDP_HASH_TYPE_L4_SCTP when adding XDP rxhash support for that driver.
> >  
> But where would this information be used? We don't save it in skbuff,
> don't use it in RPS, RFS. RSS doesn't use it for packet steering so
> the hash collision problem already exists at the device level. If
> there is a collision problem between two protocols then maybe hash
> should be over 5-tuple instead...

One use-case (I heard at a customer) was that they had (web-)servers
that didn't serve any UDP traffic, thus they simply block/drop all
incoming UDP on the service NIC (as an ACL in the switch). (The servers
own DNS lookups and NTP goes through the management NIC to internal
DNS/NTP servers).

Another use-case: Inside an XDP/bpf program is can be used for
splitting protocol processing, into different tail calls, before even
touching packet-data.  I can imagine the bpf TCP handling code is
larger, thus an optimization is to have a separate tail call for the
UDP protocol handling.  One could also transfer/queue all TCP traffic
to other CPU(s) like RPS, just without touching packet memory.


This info is saved in the skb, but due to space constrains, it is
reduced to a single bit, namely skb->l4_hash, iif some
RSS-proto/XDP_HASH_TYPE_L4_* bit was set.  And the network stack do use
and react on this.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers

2017-05-21 Thread John Fastabend
On 05/21/2017 08:21 PM, Alexei Starovoitov wrote:
> On Sun, May 21, 2017 at 05:55:50PM +0200, Jesper Dangaard Brouer wrote:
>>> And it looks useful to me, but
>>
>>> 1. i'm worried that we'd be relying on something that mellanox didn't
>>>  implement in their drivers before. Was it tested and guarnteed to
>>>  exist in the future revisions of firmware? Is it cx4 or cx4-lx or cx5
>>>  feature?
>>
>> It is not a hidden mlx5 or specific feature.  Due to the Microsoft RSS
>> standard/requirements[2] most NICs actually implement this.
>>
>> [2] 
>> https://docs.microsoft.com/en-us/windows-hardware/drivers/network/rss-hashing-types
> 
> ...
> 
>>> 2. but the main concern that it is mellanox only feature. At least I cannot
>>> see anything like this in broadcom and intel nics
>>
>> All the drivers I looked at have support for an RSS hash type.
>> Including Broadcom[3] and Intel. Just grep after NETIF_F_RXHASH, and
>> follow data-structs.  The Intel i40 NIC have the most elaborate rss type
>> system (it can e.g. tell if this was SCTP).
>>
>> [3] 
>> http://elixir.free-electrons.com/linux/latest/source/drivers/net/ethernet/broadcom/bnx2x/bnx2x_hsi.h#L4198
> 
> yes and bnxt too.
> msft spec requires RSS to be configured in these different ways, but
> it doesn't mean that HW descriptor will have 'is_v4' and 'is_v6' bits set.
> imo this is mlx specific behavior.
> If you want to piggy back on msft spec and make linux rss to be configurable
> the same way, I guess that's fine, but imo it's orthogonal to xdp.
> 
>>> How about exposing 'struct mlx5_cqe64 *' to XDP programs as-is?
>>> We can make sure that XDP program does read only access into it and
>>> it will see cqe->rss_hash_result, cqe->rss_hash_type and everything else
>>> in there, but this will not be uapi and it will be pretty obvious
>>> to program authors that their programs are vendor specific.
>>
>> This sounds EXTREMELY dangerous to me... IHMO this will lead to vendor
>> lock-in.  As BPF program authors will become dependent on vendor
>> specific features, and their program are no longer portable to run on
>> other NICs.
>>
>> How are you going to avoid vendor lock-in with this model?
> 
> It looked to me that that was the intent of your patch set, hence
> counter proposal to make it much simpler.
> I'm not going to use vendor specific features. The proposal
> to expose hw rx descriptor as-is is for people who desperately want
> that info without burdening core xdp with it.
> 
>>> 'not uapi' here means that mellanox is free to change their HW descriptor
>>> and its contents as they wish.
>>
>> Hmmm... IMHO directly exposing the HW descriptor to userspace, will
>> limit vendors ability to change its contents.
> 
> kprobes can already look at hw rx descriptor.
> if somebody really wants to look into it, they have a way to do it already:
> - add kprobe to mlx5e_handle_rx_cqe(), look into cqe, store the outcome on a 
> side
> - use that info in the xdp program
> All I proposed is to make it first class citizen and avoid kprobe.
> 

Another solution is to have hardware prepend meta-data to the front of the
packet and have the XDP program read it out. Of course the hardware and
XDP program need to be in sync at this point, but it works today assuming
a mechanism to program hardware exists.

The nice part of the above is you push all the complexity of feature negotiation
and hardware initialization out of XDP core completely.

This would be my preferred solution, except I'm not sure if some hardware
would have issue with this.

.John



Re: [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers

2017-05-21 Thread Alexei Starovoitov
On Sun, May 21, 2017 at 05:55:50PM +0200, Jesper Dangaard Brouer wrote:
> > And it looks useful to me, but
> 
> > 1. i'm worried that we'd be relying on something that mellanox didn't
> >  implement in their drivers before. Was it tested and guarnteed to
> >  exist in the future revisions of firmware? Is it cx4 or cx4-lx or cx5
> >  feature?
> 
> It is not a hidden mlx5 or specific feature.  Due to the Microsoft RSS
> standard/requirements[2] most NICs actually implement this.
> 
> [2] 
> https://docs.microsoft.com/en-us/windows-hardware/drivers/network/rss-hashing-types

...

> > 2. but the main concern that it is mellanox only feature. At least I cannot
> > see anything like this in broadcom and intel nics
> 
> All the drivers I looked at have support for an RSS hash type.
> Including Broadcom[3] and Intel. Just grep after NETIF_F_RXHASH, and
> follow data-structs.  The Intel i40 NIC have the most elaborate rss type
> system (it can e.g. tell if this was SCTP).
> 
> [3] 
> http://elixir.free-electrons.com/linux/latest/source/drivers/net/ethernet/broadcom/bnx2x/bnx2x_hsi.h#L4198

yes and bnxt too.
msft spec requires RSS to be configured in these different ways, but
it doesn't mean that HW descriptor will have 'is_v4' and 'is_v6' bits set.
imo this is mlx specific behavior.
If you want to piggy back on msft spec and make linux rss to be configurable
the same way, I guess that's fine, but imo it's orthogonal to xdp.

> > How about exposing 'struct mlx5_cqe64 *' to XDP programs as-is?
> > We can make sure that XDP program does read only access into it and
> > it will see cqe->rss_hash_result, cqe->rss_hash_type and everything else
> > in there, but this will not be uapi and it will be pretty obvious
> > to program authors that their programs are vendor specific.
> 
> This sounds EXTREMELY dangerous to me... IHMO this will lead to vendor
> lock-in.  As BPF program authors will become dependent on vendor
> specific features, and their program are no longer portable to run on
> other NICs.
> 
> How are you going to avoid vendor lock-in with this model?

It looked to me that that was the intent of your patch set, hence
counter proposal to make it much simpler.
I'm not going to use vendor specific features. The proposal
to expose hw rx descriptor as-is is for people who desperately want
that info without burdening core xdp with it.

> > 'not uapi' here means that mellanox is free to change their HW descriptor
> > and its contents as they wish.
> 
> Hmmm... IMHO directly exposing the HW descriptor to userspace, will
> limit vendors ability to change its contents.

kprobes can already look at hw rx descriptor.
if somebody really wants to look into it, they have a way to do it already:
- add kprobe to mlx5e_handle_rx_cqe(), look into cqe, store the outcome on a 
side
- use that info in the xdp program
All I proposed is to make it first class citizen and avoid kprobe.



Re: [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers

2017-05-21 Thread Tom Herbert
On Sun, May 21, 2017 at 9:04 AM, Jesper Dangaard Brouer
 wrote:
> On Sat, 20 May 2017 09:16:09 -0700
> Tom Herbert  wrote:
>
>> > +/* XDP rxhash have an associated type, which is related to the RSS
>> > + * (Receive Side Scaling) standard, but NIC HW have different mapping
>> > + * and support. Thus, create mapping that is interesting for XDP.  XDP
>> > + * would primarly want insight into L3 and L4 protocol info.
>> > + *
>> > + * TODO: Likely need to get extended with "L3_IPV6_EX" due RSS standard
>> > + *
>> > + * The HASH_TYPE will be returned from bpf helper as the top 32-bit of
>> > + * the 64-bit rxhash (internally type stored in xdp_buff->flags).
>> > + */
>> > +#define XDP_HASH(x)((x) & ((1ULL << 32)-1))
>> > +#define XDP_HASH_TYPE(x)   ((x) >> 32)
>> > +
>> > +#define XDP_HASH_TYPE_L3_SHIFT 0
>> > +#define XDP_HASH_TYPE_L3_BITS  3
>> > +#define XDP_HASH_TYPE_L3_MASK  ((1ULL << XDP_HASH_TYPE_L3_BITS)-1)
>> > +#define XDP_HASH_TYPE_L3(x)((x) & XDP_HASH_TYPE_L3_MASK)
>> > +enum {
>> > +   XDP_HASH_TYPE_L3_IPV4 = 1,
>> > +   XDP_HASH_TYPE_L3_IPV6,
>> > +};
>> > +
>> > +#define XDP_HASH_TYPE_L4_SHIFT XDP_HASH_TYPE_L3_BITS
>> > +#define XDP_HASH_TYPE_L4_BITS  5
>> > +#define XDP_HASH_TYPE_L4_MASK  \
>> > +   (((1ULL << XDP_HASH_TYPE_L4_BITS)-1) << XDP_HASH_TYPE_L4_SHIFT)
>> > +#define XDP_HASH_TYPE_L4(x)((x) & XDP_HASH_TYPE_L4_MASK)
>> > +enum {
>> > +   _XDP_HASH_TYPE_L4_TCP = 1,
>> > +   _XDP_HASH_TYPE_L4_UDP,
>> > +};
>> > +#define XDP_HASH_TYPE_L4_TCP (_XDP_HASH_TYPE_L4_TCP << 
>> > XDP_HASH_TYPE_L4_SHIFT)
>> > +#define XDP_HASH_TYPE_L4_UDP (_XDP_HASH_TYPE_L4_UDP << 
>> > XDP_HASH_TYPE_L4_SHIFT)
>> > +
>> Hi Jesper,
>>
>> Why do we need these indicators for protocol specific hash? It seems
>> like L4 and L3 is useful differentiation and protocol agnostic (I'm
>> still holding out hope that SCTP will be deployed some day ;-) )
>
> I'm not sure I understood the question fully, but let me try to answer
> anyway.  To me it seems obvious that you would want to know the
> protocol/L4 type, as this helps avoid hash collisions between UDP and
> TCP flows.  I can easily imagine someone constructing an UDP packet
> that could hash collide with a given TCP flow.
>
> And yes, i40 support matching SCTP, and we will create a
> XDP_HASH_TYPE_L4_SCTP when adding XDP rxhash support for that driver.
>
But where would this information be used? We don't save it in skbuff,
don't use it in RPS, RFS. RSS doesn't use it for packet steering so
the hash collision problem already exists at the device level. If
there is a collision problem between two protocols then maybe hash
should be over 5-tuple instead...

Tom

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer


Re: [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers

2017-05-21 Thread Jesper Dangaard Brouer
On Sat, 20 May 2017 09:16:09 -0700
Tom Herbert  wrote:

> > +/* XDP rxhash have an associated type, which is related to the RSS
> > + * (Receive Side Scaling) standard, but NIC HW have different mapping
> > + * and support. Thus, create mapping that is interesting for XDP.  XDP
> > + * would primarly want insight into L3 and L4 protocol info.
> > + *
> > + * TODO: Likely need to get extended with "L3_IPV6_EX" due RSS standard
> > + *
> > + * The HASH_TYPE will be returned from bpf helper as the top 32-bit of
> > + * the 64-bit rxhash (internally type stored in xdp_buff->flags).
> > + */
> > +#define XDP_HASH(x)((x) & ((1ULL << 32)-1))
> > +#define XDP_HASH_TYPE(x)   ((x) >> 32)
> > +
> > +#define XDP_HASH_TYPE_L3_SHIFT 0
> > +#define XDP_HASH_TYPE_L3_BITS  3
> > +#define XDP_HASH_TYPE_L3_MASK  ((1ULL << XDP_HASH_TYPE_L3_BITS)-1)
> > +#define XDP_HASH_TYPE_L3(x)((x) & XDP_HASH_TYPE_L3_MASK)
> > +enum {
> > +   XDP_HASH_TYPE_L3_IPV4 = 1,
> > +   XDP_HASH_TYPE_L3_IPV6,
> > +};
> > +
> > +#define XDP_HASH_TYPE_L4_SHIFT XDP_HASH_TYPE_L3_BITS
> > +#define XDP_HASH_TYPE_L4_BITS  5
> > +#define XDP_HASH_TYPE_L4_MASK  \
> > +   (((1ULL << XDP_HASH_TYPE_L4_BITS)-1) << XDP_HASH_TYPE_L4_SHIFT)
> > +#define XDP_HASH_TYPE_L4(x)((x) & XDP_HASH_TYPE_L4_MASK)
> > +enum {
> > +   _XDP_HASH_TYPE_L4_TCP = 1,
> > +   _XDP_HASH_TYPE_L4_UDP,
> > +};
> > +#define XDP_HASH_TYPE_L4_TCP (_XDP_HASH_TYPE_L4_TCP << 
> > XDP_HASH_TYPE_L4_SHIFT)
> > +#define XDP_HASH_TYPE_L4_UDP (_XDP_HASH_TYPE_L4_UDP << 
> > XDP_HASH_TYPE_L4_SHIFT)
> > +  
> Hi Jesper,
> 
> Why do we need these indicators for protocol specific hash? It seems
> like L4 and L3 is useful differentiation and protocol agnostic (I'm
> still holding out hope that SCTP will be deployed some day ;-) )

I'm not sure I understood the question fully, but let me try to answer
anyway.  To me it seems obvious that you would want to know the
protocol/L4 type, as this helps avoid hash collisions between UDP and
TCP flows.  I can easily imagine someone constructing an UDP packet
that could hash collide with a given TCP flow.

And yes, i40 support matching SCTP, and we will create a
XDP_HASH_TYPE_L4_SCTP when adding XDP rxhash support for that driver.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers

2017-05-21 Thread Jesper Dangaard Brouer
On Fri, 19 May 2017 20:07:52 -0700
Alexei Starovoitov  wrote:

> On Thu, May 18, 2017 at 05:41:48PM +0200, Jesper Dangaard Brouer wrote:
> >  
> > +/* XDP rxhash have an associated type, which is related to the RSS
> > + * (Receive Side Scaling) standard, but NIC HW have different mapping
> > + * and support. Thus, create mapping that is interesting for XDP.  XDP
> > + * would primarly want insight into L3 and L4 protocol info.
> > + *
> > + * TODO: Likely need to get extended with "L3_IPV6_EX" due RSS standard
> > + *
> > + * The HASH_TYPE will be returned from bpf helper as the top 32-bit of
> > + * the 64-bit rxhash (internally type stored in xdp_buff->flags).
> > + */
> > +#define XDP_HASH(x)((x) & ((1ULL << 32)-1))
> > +#define XDP_HASH_TYPE(x)   ((x) >> 32)
> > +
> > +#define XDP_HASH_TYPE_L3_SHIFT 0
> > +#define XDP_HASH_TYPE_L3_BITS  3
> > +#define XDP_HASH_TYPE_L3_MASK  ((1ULL << XDP_HASH_TYPE_L3_BITS)-1)
> > +#define XDP_HASH_TYPE_L3(x)((x) & XDP_HASH_TYPE_L3_MASK)
> > +enum {
> > +   XDP_HASH_TYPE_L3_IPV4 = 1,
> > +   XDP_HASH_TYPE_L3_IPV6,
> > +};
> > +
> > +#define XDP_HASH_TYPE_L4_SHIFT XDP_HASH_TYPE_L3_BITS
> > +#define XDP_HASH_TYPE_L4_BITS  5
> > +#define XDP_HASH_TYPE_L4_MASK  
> > \
> > +   (((1ULL << XDP_HASH_TYPE_L4_BITS)-1) << XDP_HASH_TYPE_L4_SHIFT)
> > +#define XDP_HASH_TYPE_L4(x)((x) & XDP_HASH_TYPE_L4_MASK)
> > +enum {
> > +   _XDP_HASH_TYPE_L4_TCP = 1,
> > +   _XDP_HASH_TYPE_L4_UDP,
> > +};
> > +#define XDP_HASH_TYPE_L4_TCP (_XDP_HASH_TYPE_L4_TCP << 
> > XDP_HASH_TYPE_L4_SHIFT)
> > +#define XDP_HASH_TYPE_L4_UDP (_XDP_HASH_TYPE_L4_UDP << 
> > XDP_HASH_TYPE_L4_SHIFT)  
> 
> imo this is dangerous territory.
> As far as I can see this information doesn't exist in the current drivers at 
> all
> and you're enabling it in the patch 5 via fancy:
> +   u32 ht = (mlx5_htype_l4_to_xdp[((cht & CQE_RSS_HTYPE_L4) >> 6)] | \
> + mlx5_htype_l3_to_xdp[((cht & CQE_RSS_HTYPE_IP) >> 2)]);
> 
> It's pretty cool that you've discovered this hidden mlx5 feature
> Did you find it in some hw spec ?

The Mellanox ConnectX-4/mlx5 spec is actually open, see:
[1] 
http://www.mellanox.com/page/products_dyn?product_family=204=connectx_4_en_card
and follow link to "Programming Manual (PRM)".


> And it looks useful to me, but

> 1. i'm worried that we'd be relying on something that mellanox didn't
>  implement in their drivers before. Was it tested and guarnteed to
>  exist in the future revisions of firmware? Is it cx4 or cx4-lx or cx5
>  feature?

It is not a hidden mlx5 or specific feature.  Due to the Microsoft RSS
standard/requirements[2] most NICs actually implement this.

[2] 
https://docs.microsoft.com/en-us/windows-hardware/drivers/network/rss-hashing-types


> 2. but the main concern that it is mellanox only feature. At least I cannot
> see anything like this in broadcom and intel nics

All the drivers I looked at have support for an RSS hash type.
Including Broadcom[3] and Intel. Just grep after NETIF_F_RXHASH, and
follow data-structs.  The Intel i40 NIC have the most elaborate rss type
system (it can e.g. tell if this was SCTP).

[3] 
http://elixir.free-electrons.com/linux/latest/source/drivers/net/ethernet/broadcom/bnx2x/bnx2x_hsi.h#L4198
 
> In the very beginning we discussed that XDP programs should be as
> generic as possible and HW independent while at the same time we want
> to expose HW specific features to XDP programs.
>
> So I'm totally fine to expose this fancy hw hash and ipv4 vs v6 and
> tcp vs udp flags to xdp programs somehow, but I'm completely against
> making it into uapi.
> 
> How about exposing 'struct mlx5_cqe64 *' to XDP programs as-is?
> We can make sure that XDP program does read only access into it and
> it will see cqe->rss_hash_result, cqe->rss_hash_type and everything else
> in there, but this will not be uapi and it will be pretty obvious
> to program authors that their programs are vendor specific.

This sounds EXTREMELY dangerous to me... IHMO this will lead to vendor
lock-in.  As BPF program authors will become dependent on vendor
specific features, and their program are no longer portable to run on
other NICs.

How are you going to avoid vendor lock-in with this model?


> 'not uapi' here means that mellanox is free to change their HW descriptor
> and its contents as they wish.

Hmmm... IMHO directly exposing the HW descriptor to userspace, will
limit vendors ability to change its contents.


> Also no copies and bit conversions will be necessary, so the cost will
> be zero to programs that don't use it and we wouldn't need to change
> verifier to discover access to this stuff.

I'm not sure this would work out well, as we would need to keep the
CQE descriptor memory around longer.


The longer term plan with having RXHASH for XDP is to allow
implementing RPS (Receive Packet Steering) without touching packet
memory.  Which 

Re: [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers

2017-05-20 Thread Tom Herbert
On Thu, May 18, 2017 at 8:41 AM, Jesper Dangaard Brouer
 wrote:
> Introducing a new XDP feature and associated bpf helper bpf_xdp_rxhash.
>
> The rxhash and type allow filtering on packets without touching
> packet memory.  The performance difference on my system with a
> 100 Gbit/s mlx5 NIC is 12Mpps to 19Mpps.
>
> TODO: desc RXHASH and associated type, and how XDP choose to map
> and export these to bpf_prog's.
>
> TODO: desc how this interacts with XDP driver features system.
> ---
>  include/linux/filter.h  |   31 -
>  include/linux/netdev_features.h |4 ++
>  include/uapi/linux/bpf.h|   56 +-
>  kernel/bpf/verifier.c   |3 ++
>  net/core/dev.c  |   16 -
>  net/core/filter.c   |   73 
> +++
>  samples/bpf/bpf_helpers.h   |2 +
>  tools/include/uapi/linux/bpf.h  |   10 +
>  8 files changed, 190 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 9a7786db14fa..33a254ccd47d 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -413,7 +413,8 @@ struct bpf_prog {
> locked:1,   /* Program image locked? */
> gpl_compatible:1, /* Is filter GPL 
> compatible? */
> cb_access:1,/* Is control block accessed? 
> */
> -   dst_needed:1;   /* Do we need dst entry? */
> +   dst_needed:1,   /* Do we need dst entry? */
> +   xdp_rxhash_needed:1;/* Req XDP RXHASH support 
> */
> kmemcheck_bitfield_end(meta);
> enum bpf_prog_type  type;   /* Type of BPF program */
> u32 len;/* Number of filter blocks */
> @@ -444,12 +445,40 @@ struct bpf_skb_data_end {
> void *data_end;
>  };
>
> +/* Kernel internal xdp_buff->flags */
> +#define XDP_CTX_F_RXHASH_TYPE_MASK XDP_HASH_TYPE_MASK
> +#define XDP_CTX_F_RXHASH_TYPE_BITS XDP_HASH_TYPE_BITS
> +#define XDP_CTX_F_RXHASH_SW(1ULL <<  XDP_CTX_F_RXHASH_TYPE_BITS)
> +#define XDP_CTX_F_RXHASH_HW(1ULL << 
> (XDP_CTX_F_RXHASH_TYPE_BITS+1))
> +
>  struct xdp_buff {
> void *data;
> void *data_end;
> void *data_hard_start;
> +   u64 flags;
> +   u32 rxhash;
>  };
>
> +/* helper functions for driver setting rxhash */
> +static inline void
> +xdp_record_hash(struct xdp_buff *xdp, u32 hash, u32 type)
> +{
> +   xdp->flags |= XDP_CTX_F_RXHASH_HW;
> +   xdp->flags |= type & XDP_CTX_F_RXHASH_TYPE_MASK;
> +   xdp->rxhash = hash;
> +}
> +
> +static inline void
> +xdp_set_skb_hash(struct xdp_buff *xdp, struct sk_buff *skb)
> +{
> +   if (likely(xdp->flags & (XDP_CTX_F_RXHASH_HW|XDP_CTX_F_RXHASH_SW))) {
> +   bool is_sw = !!(xdp->flags | XDP_CTX_F_RXHASH_SW);
> +   bool is_l4 = !!(xdp->flags & XDP_HASH_TYPE_L4_MASK);
> +
> +   __skb_set_hash(skb, xdp->rxhash, is_sw, is_l4);
> +   }
> +}
> +
>  /* compute the linear packet data range [data, data_end) which
>   * will be accessed by cls_bpf, act_bpf and lwt programs
>   */
> diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
> index ff81ee231410..4b50e8c606c5 100644
> --- a/include/linux/netdev_features.h
> +++ b/include/linux/netdev_features.h
> @@ -219,11 +219,13 @@ enum {
>  /* XDP driver flags */
>  enum {
> XDP_DRV_F_ENABLED_BIT,
> +   XDP_DRV_F_RXHASH_BIT,
>  };
>
>  #define __XDP_DRV_F_BIT(bit)   ((netdev_features_t)1 << (bit))
>  #define __XDP_DRV_F(name)  __XDP_DRV_F_BIT(XDP_DRV_F_##name##_BIT)
>  #define XDP_DRV_F_ENABLED  __XDP_DRV_F(ENABLED)
> +#define XDP_DRV_F_RXHASH   __XDP_DRV_F(RXHASH)
>
>  /* XDP driver MUST support these features, else kernel MUST reject
>   * bpf_prog to guarantee safe access to data structures
> @@ -233,7 +235,7 @@ enum {
>  /* Some XDP features are under development. Based on bpf_prog loading
>   * detect if kernel feature can be activated.
>   */
> -#define XDP_DRV_FEATURES_DEVEL 0
> +#define XDP_DRV_FEATURES_DEVEL XDP_DRV_F_RXHASH
>
>  /* Some XDP features are optional, like action return code, as they
>   * are handled safely runtime.
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 945a1f5f63c5..1d9d3a46217d 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -482,6 +482,9 @@ union bpf_attr {
>   * Get the owner uid of the socket stored inside sk_buff.
>   * @skb: pointer to skb
>   * Return: uid of the socket owner on success or overflowuid if failed.
> + *
> + * u64 bpf_xdp_rxhash(xdp_md, new_hash, type, flags)
> + * TODO: MISSING DESC
>   */
>  #define __BPF_FUNC_MAPPER(FN)  \
> FN(unspec), \
> @@ -531,7 +534,8 @@ union 

Re: [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers

2017-05-19 Thread Jakub Kicinski
On Fri, 19 May 2017 20:34:00 -0700, Alexei Starovoitov wrote:
> On Fri, May 19, 2017 at 08:21:47PM -0700, Jakub Kicinski wrote:
> > On Fri, 19 May 2017 20:07:52 -0700, Alexei Starovoitov wrote:  
> > > How about exposing 'struct mlx5_cqe64 *' to XDP programs as-is?
> > > We can make sure that XDP program does read only access into it and
> > > it will see cqe->rss_hash_result, cqe->rss_hash_type and everything else
> > > in there, but this will not be uapi and it will be pretty obvious
> > > to program authors that their programs are vendor specific.
> > > 'not uapi' here means that mellanox is free to change their HW descriptor
> > > and its contents as they wish.  
> > 
> > Hm..  Would that mean we have to teach the verifier about all possible
> > drivers and their metadata structures (i.e. sizes thereof).  And add an
> > UAPI enum of known drivers?  
> 
> why? no uapi other than a pointer to this hw rx descriptor.
> Different sizeof(hw_rx_descriptor) is not a problem.
> We deal with it already in tracing. All tracepoints have different
> sizeof(*ctx), yet the safety is preserved.

Ack, quick read of tracing code reveals this indeed should work.


Re: [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers

2017-05-19 Thread Alexei Starovoitov
On Fri, May 19, 2017 at 08:21:47PM -0700, Jakub Kicinski wrote:
> On Fri, 19 May 2017 20:07:52 -0700, Alexei Starovoitov wrote:
> > How about exposing 'struct mlx5_cqe64 *' to XDP programs as-is?
> > We can make sure that XDP program does read only access into it and
> > it will see cqe->rss_hash_result, cqe->rss_hash_type and everything else
> > in there, but this will not be uapi and it will be pretty obvious
> > to program authors that their programs are vendor specific.
> > 'not uapi' here means that mellanox is free to change their HW descriptor
> > and its contents as they wish.
> 
> Hm..  Would that mean we have to teach the verifier about all possible
> drivers and their metadata structures (i.e. sizes thereof).  And add an
> UAPI enum of known drivers?

why? no uapi other than a pointer to this hw rx descriptor.
Different sizeof(hw_rx_descriptor) is not a problem.
We deal with it already in tracing. All tracepoints have different
sizeof(*ctx), yet the safety is preserved.

> Other idea I floated in early days was to standardize the fields but
> let the driver "JIT" the accesses to look at the right offset of the
> right structure.  Admittedly that would be a lot more work.

'standardize the fields' sounds nice, but failed here already.
As far as I can see the meaning of packet 'hash' is quite different
across the drivers and 'hash' is just a beginning.
I hope we can standardize on 'csum' field and make it checksum_complete,
but so far out of 10+G nics only mlx5 and nfp do it in hw.
We need it at least for mlx4, but it can only fake it via expensive math.



Re: [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers

2017-05-19 Thread Jakub Kicinski
On Fri, 19 May 2017 20:07:52 -0700, Alexei Starovoitov wrote:
> How about exposing 'struct mlx5_cqe64 *' to XDP programs as-is?
> We can make sure that XDP program does read only access into it and
> it will see cqe->rss_hash_result, cqe->rss_hash_type and everything else
> in there, but this will not be uapi and it will be pretty obvious
> to program authors that their programs are vendor specific.
> 'not uapi' here means that mellanox is free to change their HW descriptor
> and its contents as they wish.

Hm..  Would that mean we have to teach the verifier about all possible
drivers and their metadata structures (i.e. sizes thereof).  And add an
UAPI enum of known drivers?

Other idea I floated in early days was to standardize the fields but
let the driver "JIT" the accesses to look at the right offset of the
right structure.  Admittedly that would be a lot more work.


Re: [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers

2017-05-19 Thread Alexei Starovoitov
On Thu, May 18, 2017 at 05:41:48PM +0200, Jesper Dangaard Brouer wrote:
>  
> +/* XDP rxhash have an associated type, which is related to the RSS
> + * (Receive Side Scaling) standard, but NIC HW have different mapping
> + * and support. Thus, create mapping that is interesting for XDP.  XDP
> + * would primarly want insight into L3 and L4 protocol info.
> + *
> + * TODO: Likely need to get extended with "L3_IPV6_EX" due RSS standard
> + *
> + * The HASH_TYPE will be returned from bpf helper as the top 32-bit of
> + * the 64-bit rxhash (internally type stored in xdp_buff->flags).
> + */
> +#define XDP_HASH(x)  ((x) & ((1ULL << 32)-1))
> +#define XDP_HASH_TYPE(x) ((x) >> 32)
> +
> +#define XDP_HASH_TYPE_L3_SHIFT   0
> +#define XDP_HASH_TYPE_L3_BITS3
> +#define XDP_HASH_TYPE_L3_MASK((1ULL << XDP_HASH_TYPE_L3_BITS)-1)
> +#define XDP_HASH_TYPE_L3(x)  ((x) & XDP_HASH_TYPE_L3_MASK)
> +enum {
> + XDP_HASH_TYPE_L3_IPV4 = 1,
> + XDP_HASH_TYPE_L3_IPV6,
> +};
> +
> +#define XDP_HASH_TYPE_L4_SHIFT   XDP_HASH_TYPE_L3_BITS
> +#define XDP_HASH_TYPE_L4_BITS5
> +#define XDP_HASH_TYPE_L4_MASK
> \
> + (((1ULL << XDP_HASH_TYPE_L4_BITS)-1) << XDP_HASH_TYPE_L4_SHIFT)
> +#define XDP_HASH_TYPE_L4(x)  ((x) & XDP_HASH_TYPE_L4_MASK)
> +enum {
> + _XDP_HASH_TYPE_L4_TCP = 1,
> + _XDP_HASH_TYPE_L4_UDP,
> +};
> +#define XDP_HASH_TYPE_L4_TCP (_XDP_HASH_TYPE_L4_TCP << 
> XDP_HASH_TYPE_L4_SHIFT)
> +#define XDP_HASH_TYPE_L4_UDP (_XDP_HASH_TYPE_L4_UDP << 
> XDP_HASH_TYPE_L4_SHIFT)

imo this is dangerous territory.
As far as I can see this information doesn't exist in the current drivers at all
and you're enabling it in the patch 5 via fancy:
+   u32 ht = (mlx5_htype_l4_to_xdp[((cht & CQE_RSS_HTYPE_L4) >> 6)] | \
+ mlx5_htype_l3_to_xdp[((cht & CQE_RSS_HTYPE_IP) >> 2)]);

It's pretty cool that you've discovered this hidden mlx5 feature
Did you find it in some hw spec ?
And it looks useful to me, but
1. i'm worried that we'd be relying on something that mellanox didn't
 implement in their drivers before. Was it tested and guarnteed to exist
 in the future revisions of firmware? Is it cx4 or cx4-lx or cx5 feature?
2. but the main concern that it is mellanox only feature. At least I cannot
see anything like this in broadcom and intel nics

In the very beginning we discussed that XDP programs should be as generic as
possible and HW independent while at the same time we want to expose HW
specific features to XDP programs.
So I'm totally fine to expose this fancy hw hash and ipv4 vs v6 and tcp vs udp
flags to xdp programs somehow, but I'm completely against making it into uapi.

How about exposing 'struct mlx5_cqe64 *' to XDP programs as-is?
We can make sure that XDP program does read only access into it and
it will see cqe->rss_hash_result, cqe->rss_hash_type and everything else
in there, but this will not be uapi and it will be pretty obvious
to program authors that their programs are vendor specific.
'not uapi' here means that mellanox is free to change their HW descriptor
and its contents as they wish.
Also no copies and bit conversions will be necessary, so the cost will
be zero to programs that don't use it and we wouldn't need to change
verifier to discover access to this stuff.

Note that bpf programs already have access to all in-kernel data structures
on the tracing side, so this is nothing new and tracing program authors
got used to structures changing from kernel to kernel.
XDP program authors can do the same for vendor specific bits while we
keep core XDP uapi generic across all nics.



Re: [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers

2017-05-19 Thread Jesper Dangaard Brouer
On Thu, 18 May 2017 17:41:48 +0200
Jesper Dangaard Brouer  wrote:

> Introducing a new XDP feature and associated bpf helper bpf_xdp_rxhash.
> 
> The rxhash and type allow filtering on packets without touching
> packet memory.  The performance difference on my system with a
> 100 Gbit/s mlx5 NIC is 12Mpps to 19Mpps.

The XDP/bpf program I use (called xdp_rxhash) for testing this feature
is available via my github repo here:

 
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_rxhash_kern.c
 
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_rxhash_user.c

The cmdline output looks like:

$ sudo ./xdp_rxhash --dev mlx5p2 --sec 2 --notouch

xdp-action ppspps-human-readable mem  
XDP_ABORTED19694682   19,694,682 2.000205  no_touch
XDP_DROP   0  0  2.000205  no_touch
XDP_PASS   10 10 2.000205  no_touch
XDP_TX 0  0  2.000205  no_touch
rx_total   19694701   19,694,701 2.000205  no_touch

hash_type:L3   ppspps-human-readable sample-period
Unknown0  0  2.000205
IPv4   19694725   19,694,725 2.000205
IPv6   0  0  2.000205

hash_type:L4   ppspps-human-readable sample-period
Unknown10 10 2.000205
TCP0  0  2.000205
UDP19694697   19,694,697 2.000205

^CInterrupted: Removing XDP program on ifindex:5 device:mlx5p2


The 10 pps XDP_PASS is a ping command I rand at the same time. Notice
how these ping-ICMP packets were categorized as L4=Unknown and L3=IPv4.
The L4 categorization is usually UDP or TCP, but looking at driver-code
it seems some HW support detecting other L4 types, like ICMP, SCTP, etc.


$ sudo taskset -c 4 ping -i 0.1 -c 1 198.18.100.1 -c 100 -q
PING 198.18.100.1 (198.18.100.1) 56(84) bytes of data.

--- 198.18.100.1 ping statistics ---
100 packets transmitted, 100 received, 0% packet loss, time 10294ms
rtt min/avg/max/mdev = 0.010/0.071/0.113/0.043 ms


p.s. xdp-newbies can via this commit find the links back to the netdev
kernel RFC patches/mails:
 https://github.com/netoptimizer/prototype-kernel/commit/9647e1b563970

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


[RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers

2017-05-18 Thread Jesper Dangaard Brouer
Introducing a new XDP feature and associated bpf helper bpf_xdp_rxhash.

The rxhash and type allow filtering on packets without touching
packet memory.  The performance difference on my system with a
100 Gbit/s mlx5 NIC is 12Mpps to 19Mpps.

TODO: desc RXHASH and associated type, and how XDP choose to map
and export these to bpf_prog's.

TODO: desc how this interacts with XDP driver features system.
---
 include/linux/filter.h  |   31 -
 include/linux/netdev_features.h |4 ++
 include/uapi/linux/bpf.h|   56 +-
 kernel/bpf/verifier.c   |3 ++
 net/core/dev.c  |   16 -
 net/core/filter.c   |   73 +++
 samples/bpf/bpf_helpers.h   |2 +
 tools/include/uapi/linux/bpf.h  |   10 +
 8 files changed, 190 insertions(+), 5 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 9a7786db14fa..33a254ccd47d 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -413,7 +413,8 @@ struct bpf_prog {
locked:1,   /* Program image locked? */
gpl_compatible:1, /* Is filter GPL compatible? 
*/
cb_access:1,/* Is control block accessed? */
-   dst_needed:1;   /* Do we need dst entry? */
+   dst_needed:1,   /* Do we need dst entry? */
+   xdp_rxhash_needed:1;/* Req XDP RXHASH support */
kmemcheck_bitfield_end(meta);
enum bpf_prog_type  type;   /* Type of BPF program */
u32 len;/* Number of filter blocks */
@@ -444,12 +445,40 @@ struct bpf_skb_data_end {
void *data_end;
 };
 
+/* Kernel internal xdp_buff->flags */
+#define XDP_CTX_F_RXHASH_TYPE_MASK XDP_HASH_TYPE_MASK
+#define XDP_CTX_F_RXHASH_TYPE_BITS XDP_HASH_TYPE_BITS
+#define XDP_CTX_F_RXHASH_SW(1ULL <<  XDP_CTX_F_RXHASH_TYPE_BITS)
+#define XDP_CTX_F_RXHASH_HW(1ULL << (XDP_CTX_F_RXHASH_TYPE_BITS+1))
+
 struct xdp_buff {
void *data;
void *data_end;
void *data_hard_start;
+   u64 flags;
+   u32 rxhash;
 };
 
+/* helper functions for driver setting rxhash */
+static inline void
+xdp_record_hash(struct xdp_buff *xdp, u32 hash, u32 type)
+{
+   xdp->flags |= XDP_CTX_F_RXHASH_HW;
+   xdp->flags |= type & XDP_CTX_F_RXHASH_TYPE_MASK;
+   xdp->rxhash = hash;
+}
+
+static inline void
+xdp_set_skb_hash(struct xdp_buff *xdp, struct sk_buff *skb)
+{
+   if (likely(xdp->flags & (XDP_CTX_F_RXHASH_HW|XDP_CTX_F_RXHASH_SW))) {
+   bool is_sw = !!(xdp->flags | XDP_CTX_F_RXHASH_SW);
+   bool is_l4 = !!(xdp->flags & XDP_HASH_TYPE_L4_MASK);
+
+   __skb_set_hash(skb, xdp->rxhash, is_sw, is_l4);
+   }
+}
+
 /* compute the linear packet data range [data, data_end) which
  * will be accessed by cls_bpf, act_bpf and lwt programs
  */
diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index ff81ee231410..4b50e8c606c5 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -219,11 +219,13 @@ enum {
 /* XDP driver flags */
 enum {
XDP_DRV_F_ENABLED_BIT,
+   XDP_DRV_F_RXHASH_BIT,
 };
 
 #define __XDP_DRV_F_BIT(bit)   ((netdev_features_t)1 << (bit))
 #define __XDP_DRV_F(name)  __XDP_DRV_F_BIT(XDP_DRV_F_##name##_BIT)
 #define XDP_DRV_F_ENABLED  __XDP_DRV_F(ENABLED)
+#define XDP_DRV_F_RXHASH   __XDP_DRV_F(RXHASH)
 
 /* XDP driver MUST support these features, else kernel MUST reject
  * bpf_prog to guarantee safe access to data structures
@@ -233,7 +235,7 @@ enum {
 /* Some XDP features are under development. Based on bpf_prog loading
  * detect if kernel feature can be activated.
  */
-#define XDP_DRV_FEATURES_DEVEL 0
+#define XDP_DRV_FEATURES_DEVEL XDP_DRV_F_RXHASH
 
 /* Some XDP features are optional, like action return code, as they
  * are handled safely runtime.
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 945a1f5f63c5..1d9d3a46217d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -482,6 +482,9 @@ union bpf_attr {
  * Get the owner uid of the socket stored inside sk_buff.
  * @skb: pointer to skb
  * Return: uid of the socket owner on success or overflowuid if failed.
+ *
+ * u64 bpf_xdp_rxhash(xdp_md, new_hash, type, flags)
+ * TODO: MISSING DESC
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -531,7 +534,8 @@ union bpf_attr {
FN(xdp_adjust_head),\
FN(probe_read_str), \
FN(get_socket_cookie),  \
-   FN(get_socket_uid),
+   FN(get_socket_uid), \
+   FN(xdp_rxhash),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function