RE: XDP seeking input from NIC hardware vendors

2016-07-07 Thread Fastabend, John R
Hi Jesper,

I have done some previous work on proprietary systems where we used hardware to 
do the classification/parsing, then passed a cookie to the software, which used 
the cookie to look up a program to run on the packet. When your programs are 
structured as a bunch of parsing followed by some actions, this can provide real 
performance benefits. Also, a lot of existing hardware supports this today, 
assuming you use headers the hardware "knows" about. It's a natural model for 
hardware that uses a parser followed by tcam/cam/sram/etc lookup tables.
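A minimal sketch of that dispatch step (all names and the cookie encoding here are hypothetical, just to illustrate the model of "hardware parses, software looks up the program"):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: the NIC's parser/TCAM stage classifies the
 * packet and hands the host a small cookie; the host only looks up
 * which program to run instead of re-parsing the headers. */

typedef int (*pkt_prog)(const uint8_t *pkt, size_t len);

static int drop_prog(const uint8_t *pkt, size_t len)
{
    (void)pkt; (void)len;
    return 0;                    /* e.g. an XDP_DROP-style verdict */
}

static int pass_prog(const uint8_t *pkt, size_t len)
{
    (void)pkt; (void)len;
    return 1;                    /* e.g. an XDP_PASS-style verdict */
}

/* Dense program table indexed by the hardware-assigned cookie. */
static pkt_prog prog_table[4] = { drop_prog, pass_prog, NULL, NULL };

static int run_for_cookie(uint32_t cookie, const uint8_t *pkt, size_t len)
{
    if (cookie >= 4 || !prog_table[cookie])
        return -1;               /* unknown class: fall back to a full parse */
    return prog_table[cookie](pkt, len);
}
```

The parsing cost is paid once in hardware; software does an O(1) table lookup per packet.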

If the goal is to just separate XDP traffic from non-XDP traffic you could 
accomplish this with a combination of SR-IOV/macvlan to separate the device 
queues into multiple netdevs and then run XDP on just one of the netdevs. Then 
use flow director (ethtool) or 'tc cls_u32/flower' to steer traffic to the 
netdev. This is how we support multiple networking stacks on one device, by the 
way; it is called the bifurcated driver. It's not too far a stretch to think 
we could offload some simple XDP programs to program the splitting of traffic 
instead of cls_u32/flower/flow_director, and then you would have a stack of XDP 
programs: one running in hardware and a set running on the queues in software.
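As a purely illustrative command sketch of that splitting (device names, addresses, and queue numbers are made up, and the available steering features depend on the NIC; this is a config fragment, not meant to run unmodified):

```shell
# Enable ntuple filters, then use flow director to pin one flow
# to a dedicated RX queue:
ethtool -K eth0 ntuple on
ethtool -N eth0 flow-type tcp4 dst-ip 192.168.1.1 dst-port 80 action 8

# Alternatively, split the device itself, e.g. a macvlan on top of
# the physical port, and run XDP on only one of the resulting netdevs:
ip link add link eth0 name xdp0 type macvlan mode private
ip link set xdp0 up
```

The 'tc cls_u32/flower' route mentioned above is a third option for expressing the same split.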

The other interesting thing would be to do more than just packet steering but 
actually run a more complete XDP program. Netronome supports this, right? The 
question I have, though, is: is this a stack of XDP programs, one or more 
designated for hardware and some running in software, perhaps with some 
annotation in the program so the hardware JIT knows where to place programs, or 
do we expect the JIT itself to try and decide what is best to offload? I think 
the easiest to start with is to annotate the programs.

Also, as far as I know, a lot of hardware can stick extra data on the front or 
end of a packet, so you could push metadata calculated by the program here in a 
generic way without having to extend XDP-defined metadata structures. Another 
option is to DMA the metadata to a specified address. With this metadata only 
the consumer/producer XDP programs have to agree on the format, no one else.
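A sketch of what such an agreed-upon prefix could look like (the struct layout, magic value, and helper below are hypothetical, purely to show the producer/consumer contract):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical metadata stuck on the front of the frame by a producer
 * (an XDP program or the NIC itself) and peeled off by the consumer.
 * Only those two programs need to agree on this layout. */
struct xdp_md_hint {
    uint16_t magic;      /* sanity check that the prefix is really ours */
    uint16_t l4_proto;   /* e.g. 6 for TCP */
    uint32_t l4_offset;  /* byte offset of the L4 header in the frame */
} __attribute__((packed));

#define MD_MAGIC 0xa1b2

/* Consumer side: validate and strip the hint, returning a pointer to
 * the real packet data (or NULL if no hint is present). */
static const uint8_t *consume_hint(const uint8_t *buf, size_t len,
                                   struct xdp_md_hint *out)
{
    if (len < sizeof(*out))
        return NULL;
    memcpy(out, buf, sizeof(*out));   /* avoid alignment assumptions */
    if (out->magic != MD_MAGIC)
        return NULL;
    return buf + sizeof(*out);        /* packet data starts here */
}
```

The magic field is one way to keep a consumer from misreading a frame that never got a hint.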

FWIW I was hoping to get some data to show performance overhead vs. how deep we 
parse into the packets. I just won't have time to get to it for a while, but 
that could tell us how much perf gain the hardware could provide.

Thanks,
John

-Original Message-
From: Jesper Dangaard Brouer [mailto:bro...@redhat.com] 
Sent: Thursday, July 7, 2016 3:43 AM
To: iovisor-...@lists.iovisor.org
Cc: bro...@redhat.com; Brenden Blanco ; Alexei 
Starovoitov ; Rana Shahout ; 
Ari Saha ; Tariq Toukan ; Or Gerlitz 
; netdev@vger.kernel.org; Simon Horman 
; Simon Horman ; Jakub Kicinski 
; Edward Cree ; Fastabend, 
John R 
Subject: XDP seeking input from NIC hardware vendors


Would it make sense from a hardware point of view, to split the XDP eBPF 
program into two stages.

 Stage-1: Filter (restricted eBPF / no-helper calls)
 Stage-2: Program

Then the HW can choose to offload stage-1 "filter", and keep the likely more 
advanced stage-2 on the kernel side.  Do HW vendors see a benefit of this 
approach?


The generic problem I'm trying to solve is parsing. E.g. the first step in 
every XDP program will be to parse the packet data, in order to determine 
whether this is a packet the XDP program should process.

Actions from stage-1 "filter" program:
 - DROP (like XDP_DROP, early drop)
 - PASS (like XDP_PASS, normal netstack)
 - MATCH (call stage-2, likely carry-over opaque return code)

The MATCH action should likely carry over an opaque return code that makes 
sense for the stage-2 program, e.g. a proto id and/or data offset.
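A stage-1 filter along these lines could be modeled in plain C (the action names, the match policy, and the opaque encoding below are illustrative assumptions, not part of any existing API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stage-1 verdicts mirroring the proposal:
 * DROP/PASS end processing; MATCH hands an opaque value to stage-2. */
enum stage1_action { FILTER_DROP, FILTER_PASS, FILTER_MATCH };

struct stage1_result {
    enum stage1_action act;
    uint32_t opaque;    /* e.g. (l4 proto << 16) | l4 byte offset */
};

/* Restricted parse -- no helper calls, no loops -- over Ethernet + IPv4,
 * matching only TCP for stage-2 and passing everything else to the stack. */
static struct stage1_result stage1_filter(const uint8_t *pkt, size_t len)
{
    struct stage1_result r = { FILTER_PASS, 0 };
    if (len < 14 + 20)
        return r;
    uint16_t ethertype = (uint16_t)((pkt[12] << 8) | pkt[13]);
    if (ethertype != 0x0800)            /* not IPv4: normal netstack */
        return r;
    uint8_t ihl = pkt[14] & 0x0f;       /* IPv4 header length, 32-bit words */
    uint8_t proto = pkt[14 + 9];        /* L4 protocol field */
    if (proto != 6)                     /* stage-2 only wants TCP */
        return r;
    r.act = FILTER_MATCH;
    r.opaque = ((uint32_t)proto << 16) | (14u + ihl * 4u);
    return r;
}
```

The body above is exactly the kind of straight-line, helper-free code a hardware parser stage could plausibly absorb, leaving stage-2 to start at the offset carried in the opaque value.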

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
--



Re: [net-next PATCH 7/7] net: ixgbe: add support for tc_u32 offload

2016-02-09 Thread Fastabend, John R
[...]

>>
>> If you leave ht and order off the tc cli I believe 'tc' just
>> picks some semi-arbitrary ones for you. I've been in the habit
>> of always specifying them even for software filters.
>>
> 
> The default table id is essentially 0x800. Default bucket is 0.
> "order" essentially is the filter id. And given you can link tables
> (Nice work John!); essentially the ht:bucket:nodeid is an "address" to
> a specific filter on a specific table and, when it makes sense, a specific
> hash bucket. Some other way to look at it is as a way to construct
> a mapping to a TCAM key.
> What John is doing is essentially taking the nodeid and trying to use
> it as a priority. In other words, the abstraction is reduced to a linked
> list in which the ordering is how the list is traversed.
> It may work in this case, but i am for being able to explicitly specify
> priorities.

Sorry for bombing you with emails, Jamal. Another thing to note is ixgbe
doesn't support hash tables explicitly but our other devices do. So
when a hash node is created we can map that onto a hardware block
and actually do the hash.
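The ht:bucket:nodeid "address" Jamal describes maps directly onto the 32-bit u32 handle; the split below reproduces the macros from the kernel's include/uapi/linux/pkt_cls.h (a 12-bit table id, 8-bit bucket, and 12-bit node id, which is also where the 2^12 limits come from):

```c
#include <assert.h>
#include <stdint.h>

/* u32 handle layout, as in include/uapi/linux/pkt_cls.h:
 * bits 20-31: hash table id, bits 12-19: bucket, bits 0-11: node id. */
#define TC_U32_HTID(h)     ((h) & 0xFFF00000u)
#define TC_U32_USERHTID(h) (TC_U32_HTID(h) >> 20)
#define TC_U32_HASH(h)     (((h) >> 12) & 0xFFu)
#define TC_U32_NODE(h)     ((h) & 0xFFFu)
```

So the tc command 'ht 800: order 1' names handle 800::1, i.e. raw handle 0x80000001, which decomposes into table 0x800, bucket 0, node 1.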

> 
> cheers,
> jamal
> 


Re: [net-next PATCH 7/7] net: ixgbe: add support for tc_u32 offload

2016-02-09 Thread Fastabend, John R
[...]

>> Ah, I should have annotated this in the commit msg. I turn the feature
>> off by default; to enable it the user needs to run
>>
>>  # ethtool -K ethx hw-tc-offload on
>>
>> this is just a habit of mine to leave new features off by default for
>> a bit until I work out some of the kinks. For example I found a case
>> today where if you build loops into your u32 graph the hardware tables
>> can get out of sync with the software tables. This is sort of an extreme
>> corner case; I'm not sure anyone would really use u32 this way, but it is valid
>> and the hardware should abort correctly.
> Yeh - that is nice :) But I was just pointing out on a small typo which I
> think you have.
> The new case will never happen. You compare: (features & NETIF_F_NTUPLE) == 
> NETIF_F_HW_TC
> Also the comment before the switch should be modified.

Aha, nice catch. My scripts were enabling both ntuple and hw-tc-offload
for testing compatibility issues. I wonder if there is a bug somewhere
else that checks that code, though, because it was definitely
getting offloaded.
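For what it's worth, the dead case Jamal spotted can be demonstrated in isolation (the flag values here are stand-ins, not the real NETIF_F_* bits, but the logic of the bug is the same):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in feature bits, one bit per feature as in netdev_features_t. */
#define F_NTUPLE (1u << 0)
#define F_HW_TC  (1u << 1)

/* Mirrors the ixgbe hunk: switching on (features & F_NTUPLE) can only
 * ever yield 0 or F_NTUPLE, so a 'case F_HW_TC:' label is dead code. */
static int ntuple_case_taken(uint32_t features)
{
    switch (features & F_NTUPLE) {
    case F_NTUPLE:
        return 1;
    case F_HW_TC:   /* never reached: masked out above */
        return 2;
    default:
        return 0;
    }
}
```

Setting only the hw-tc bit falls through to the default, which is why enabling both flags in a test script can hide the problem.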

Good catch again thanks.

> 
>>
>> Thanks,
>> John
>>


Re: [net-next PATCH 0/7] tc offload for cls_u32 on ixgbe

2016-02-09 Thread Fastabend, John R
On 2/4/2016 5:12 AM, Jamal Hadi Salim wrote:
> 
> On 16-02-03 01:48 PM, Fastabend, John R wrote:
> 
> BTW: For the record John, I empathize with you that we need to
> move. Please have patience - we are close; lets just get this resolved
> in Seville. I like your patches a lot and would love to just have
> your patches pushed in, but the challenges with community is being able
> to reach some middle ground. We are not as bad as some of the standards
> organizations. I am sure we'll get this resolved by end of next week
> if not, I am 100% in agreement that some form of your patches (and Amir's)
> need to go in and then we can refactor as needed.

Agreed, although I'm a bit worried we are starting to talk about a
single hardware IR. This discussion has always failed in my experience.

> 
>>> 1) "priorities" for filters and some form of "index" for actions
>>> is needed. I think index (which tends to be a 32-bit value) is what
>>> Amir's patches referred to as "cookie" - or at least something some
>>> hardware can use to query the action with. Priorities may be implicit in
>>> the order in which they are added. And the idea of appending vs
>>> exclusivity vs replace (which netlink already supports)
>>> is important to worry about (TCAMs tend to assume an append mode
>>> for example).
>>
>> The code denotes add/del/replace already. I'm not sure why a TCAM
>> would assume an append mode, but OK, maybe that is some API you have;
>> the APIs I use don't have these semantics.
>>
> 
> Basically most hardware (or i should say driver implementations of
> mostly TCAMS) allow you to add exactly the same filter as many times
> as you want. They dont really look at what you want to filter on
> and then scream "conflict". IOW, you (user) are responsible for
> conflict resolution at the filter level. The driver sees this blob
> and requests for some index/key from the hardware then just adds it.
> You can then use this key/index to delete/replace etc.
> This is what i meant by "append" mode.
> However if a classifier implementation cares about filter ambiguity
> resolution, then priorities are used. We need to worry about the
> bigger picture.
> 

Sure, in other classifiers it's used, but it's not needed in this set;
I planned to add it later.

> 
>> For this series using cls_u32 the handle gives you everything you need
>> to put entries in the right table and row. Namely the ht # and order #
>> from 'tc'.
> 
> True - but with a caveat. There are only 2^12 max tables you can
> have for example and up to 2^12 filters per bucket etc.
> 

This is a software limitation as well, right? If it hasn't shown up
as a limitation on the software side, why would it be an issue here?
Do you have more than 2^12 tables on your devices? If so, I guess we
can tack on another 32 bits somewhere.

>> Take a look at u32_change and u32_classify its the handle
>> that places the filter into the list and the handle that is matched in
>> classify. We should place the filters in the hardware in the same order
>> that is used by u32_change.
>>
> 
> I can see some parallels, but:
> The nodeid in itself is insufficient for two reasons:
> You can't have more than 2^12 filters per bucket;
> and the nodeid then takes two meanings: a) it is an id;
> b) it specifies the order in which things are looked up.
> 
> I think you need to take the u32 address and map it to something in your
> hardware. But at the same time it is important to have the abstraction
> closely emulate your hardware.
> 

IMO the hardware/interface must preserve the same ordering of
filters/hash tables/etc. How it does that mapping should be
a driver concern, and it can always abort if it fails.

>> Also, I ran a few tests and can't see how priority works in u32; maybe you
>> can shed some light, but as best I can tell it doesn't have any effect
>> on rule execution.
>>
> 
> True.
> u32 doesn't care because it will give you a nodeid if you don't specify
> one, i.e. conflict resolution is mapped to you not specifying exactly
> the same ht:bkt:nodeid more than once. And if you will let the
> kernel do it for you (as I am assuming you are saying your hardware
> will) then there is no need.

Yep. I'm faithfully offloading u32 here, not changing anything, except
I do have to abort in some cases with the simpler devices. fm10k, for
example, can model hash nodes with divisors > 1.

> 
>>>
>>> 2) I like the u32 approach where it makes sense; but sometimes it
>>> doesn't make sense from a usability pov. I work with some ASICs
>>> that have 10 tuples that are fixed. Yes, a user can descri

Re: [net-next PATCH 3/7] net: sched: add cls_u32 offload hooks for netdevs

2016-02-09 Thread Fastabend, John R
On 2/4/2016 5:18 AM, Amir Vadai wrote:
> On Wed, Feb 03, 2016 at 01:28:37AM -0800, John Fastabend wrote:
>> This patch allows netdev drivers to consume cls_u32 offloads via
>> the ndo_setup_tc ndo op.
>>
>> This work aligns with how network drivers have been doing qdisc
>> offloads for mqprio.
>>
>> Signed-off-by: John Fastabend 
>> ---

[...]

>> +enum {
>> +TC_CLSU32_NEW_KNODE,
> TC_CLSU32_NEW_KNODE is never used

Aha, yep, that snuck in there. In a follow-up patch for the fm10k devices,
where we can support hash tables (e.g. divisor > 1), I use it. Although
on closer inspection I need to check that the divisor == 1 on ixgbe or
else abort, because we can get out of sync if software expects hash
tables here.

Thanks, nice catch.

> 
> [...]
> 


Re: [net-next PATCH 0/7] tc offload for cls_u32 on ixgbe

2016-02-09 Thread Fastabend, John R
On 2/4/2016 3:19 PM, Pablo Neira Ayuso wrote:
> On Thu, Feb 04, 2016 at 10:16:56AM +0100, Jiri Pirko wrote:
>> Wed, Feb 03, 2016 at 10:27:32AM CET, john.fastab...@gmail.com wrote:
>>>
>>> Also by adding get_parse_graph and set_parse_graph attributes as
>>> in my previous flow_api work we can build programmable devices
>>> and programmatically learn when rules can or can not be loaded
>>> into the hardware. Again future work.
>>>
>>> Any comments/feedback appreciated.

Sorry if you get this twice; it doesn't look like my original response
made it to netdev, and the charger of the laptop I replied on blew up.

>>
>> I like this being a thin and elegant solution. However, ~2 years ago when I
>> pushed openvswitch kernel datapath offload patchset, people were yelling
>> at me that it is not generic enough solution, that tc has to be able
>> to use the api (Jamal :)), nftables as well.
>

The other problem with OVS is that if you have the capability to do
wildcard lookups (e.g. TCAM/SRAM/etc.) then offloading the exact-match
table in OVS is a really inefficient use of the resource. You
really want to load the megaflow table into hardware. I just don't
think it's a good scheme for what you want.

> I would be glad to join this debate during NetDev 1.1 too.
>

Great.

> I think we should provide a solution that allows people to use both
> tc and nftables; this would require a bit of generic infrastructure on
> top of it so we don't restrict users to one single solution; in other
> words, we allow the user to select their own poison.
>
>> Now this patch is making offload strictly tc-based and nobody seems to
>> care :) I do. I think that we might try to find some generic middle
>> layer.

If we can build a universal model for 'tc' and 'nftables', shouldn't we
unify them higher in the stack? It doesn't make sense to me for the
driver folks to try and create a unified model for two subsystems
if we don't think it's worthwhile in software as well.

>
> I agree and I'll be happy to help to push this ahead. Let's try to sit
> and get together to resolve this.

Great.

>
> See you soon.
>



Re: [net-next PATCH 7/7] net: ixgbe: add support for tc_u32 offload

2016-02-04 Thread Fastabend, John R
On 2/3/2016 11:30 PM, Amir Vadai wrote:
> On Wed, Feb 03, 2016 at 01:29:59AM -0800, John Fastabend wrote:
>> This adds initial support for offloading the u32 tc classifier. This
>> initial implementation only implements a few base matches and actions
>> to illustrate the use of the infrastructure patches.
>>
>> However it is an interesting subset because it handles the u32 next
>> hdr logic to correctly map tcp packets from ip headers using the ihl
>> and protocol fields. After this is accepted we can extend the match
>> and action fields easily by updating the model header file.
>>
>> Also only the drop action is supported initially.
>>
>> Here is a short test script,
>>
>>  #tc qdisc add dev eth4 ingress
>>  #tc filter add dev eth4 parent ffff: protocol ip \
>>  u32 ht 800: order 1 \
>>  match ip dst 15.0.0.1/32 match ip src 15.0.0.2/32 action drop
>>
>> <-- hardware has dst/src ip match rule installed -->
>>
>>  #tc filter del dev eth4 parent ffff: prio 49152
>>  #tc filter add dev eth4 parent ffff: protocol ip prio 99 \
>>  handle 1: u32 divisor 1
>>  #tc filter add dev eth4 protocol ip parent ffff: prio 99 \
>>  u32 ht 800: order 1 link 1: \
>>  offset at 0 mask 0f00 shift 6 plus 0 eat match ip protocol 6 ff
>>  #tc filter add dev eth4 parent ffff: protocol ip \
>>  u32 ht 1: order 3 match tcp src 23  action drop
>>
>> <-- hardware has tcp src port rule installed -->
>>
>>  #tc qdisc del dev eth4 parent ffff:
>>
>> <-- hardware cleaned up -->
>>
>> Signed-off-by: John Fastabend 
>> ---
>>  drivers/net/ethernet/intel/ixgbe/ixgbe.h |3 
>>  drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c |6 -
>>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c|  196 ++
>>  3 files changed, 198 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h 
>> b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
>> index 4b9156c..09c2d9b 100644
>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> 
> [...]
> 
>> @@ -8277,6 +8465,7 @@ static int ixgbe_set_features(struct net_device 
>> *netdev,
>>   */
>>  switch (features & NETIF_F_NTUPLE) {
>>  case NETIF_F_NTUPLE:
>> +case NETIF_F_HW_TC:
>>  /* turn off ATR, enable perfect filters and reset */
>>  if (!(adapter->flags & IXGBE_FLAG_FDIR_PERFECT_CAPABLE))
>>  need_reset = true;
> 
> I think you have a bug here. I don't see how the NETIF_F_HW_TC case will
> happen after masking 'features' out.
> 

Ah, I should have annotated this in the commit msg. I turn the feature
off by default; to enable it the user needs to run

# ethtool -K ethx hw-tc-offload on

this is just a habit of mine to leave new features off by default for
a bit until I work out some of the kinks. For example I found a case
today where if you build loops into your u32 graph the hardware tables
can get out of sync with the software tables. This is sort of an extreme
corner case; I'm not sure anyone would really use u32 this way, but it is valid
and the hardware should abort correctly.

Thanks,
John



Re: [net-next PATCH 7/7] net: ixgbe: add support for tc_u32 offload

2016-02-03 Thread Fastabend, John R
On 2/3/2016 4:46 AM, Jamal Hadi Salim wrote:
[...]


>>>
>>> What are you doing w.r.t priorities? Are the filters processed by the
>>> order of the priorities?
>>>

In the same order as software processes filters. I tried to faithfully
translate the u32 classify and u32_change loops into hardware. If I
missed some case it's a bug and I'll fix it. My reading and experiments
with u32 indicate that ht:bucket:node builds a handle, and this is how we
order rules.

When I test this I create a veth pair and create the same rule set on
the veth pair as on my hardware netdev. Then I inject an skb into both
the veth and the hardware netdev; if you get different results it's a bug.

>>
>> The rules are put in order by the handles which is populated in
>> my command above such that 'ht 1: order 3' gives handle 1::3 and
>> 'ht 800: order 1' gives 800::1. Take a look at this block in cls_u32
>>
>>  if (err == 0) {
>>  struct tc_u_knode __rcu **ins;
>>  struct tc_u_knode *pins;
>>
>>  ins = &ht->ht[TC_U32_HASH(handle)];
>>  for (pins = rtnl_dereference(*ins); pins;
>>   ins = &pins->next, pins = rtnl_dereference(*ins))
>>  if (TC_U32_NODE(handle) <
>> TC_U32_NODE(pins->handle))
>>  break;
>>
>>  RCU_INIT_POINTER(n->next, pins);
>>  rcu_assign_pointer(*ins, n);
>>  u32_replace_hw_knode(tp, n);
>>  *arg = (unsigned long)n;
>>  return 0;
>>
>>
>> If you leave ht and order off the tc cli I believe 'tc' just
>> picks some semi-arbitrary ones for you. I've been in the habit
>> of always specifying them even for software filters.
>>
> 
> The default table id is essentially 0x800. Default bucket is 0.
> "order" essentially is the filter id. And given you can link tables
> (Nice work John!); essentially the ht:bucket:nodeid is an "address" to
> a specific filter on a specific table and, when it makes sense, a specific
> hash bucket. Some other way to look at it is as a way to construct
> a mapping to a TCAM key.

Sure, as long as your mapping works/looks like the software-only case,
it doesn't matter how you do the hardware mapping, and my guess is this
is going to be highly device-specific. Even with the handful of devices
I have, it looks different depending on the device.

Note that if you don't do the table links there really is no safe
way to get into the headers beyond the IP header, because you need the
nexthdr stuff. Also, just to stick this out there, I have a patch to let
cls_u32 start processing at the Ethernet header instead of the
transport header, similar to how BPF works. This negative-offset business
doesn't really work, as best I can tell.

> What John is doing is essentially taking the nodeid and trying to use
> it as a priority. In other words, the abstraction is reduced to a linked
> list in which the ordering is how the list is traversed.
> It may work in this case, but i am for being able to explicitly specify
> priorities.

But that doesn't exist in software today. If users want explicit order
today they build the ht/nodeid out correctly, so just use that. If you
are working on a TCAM, just use the nodeid; that should be equivalent to
priority.
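The "reduced to a linked list" view can be made concrete. This is a standalone sketch of the insertion loop from the u32_change snippet quoted earlier in the thread (simplified: no RCU, no hash buckets), which keeps filters sorted by ascending node id so that traversal order acts as the priority:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TC_U32_NODE(h) ((h) & 0xFFFu)

struct knode {
    uint32_t handle;
    struct knode *next;
};

/* Insert before the first node with a larger node id, mirroring the
 * u32_change loop: the list stays sorted by TC_U32_NODE(handle), so
 * classification walks filters in node-id order. */
static void insert_sorted(struct knode **head, struct knode *n)
{
    struct knode **ins = head;
    while (*ins && TC_U32_NODE((*ins)->handle) <= TC_U32_NODE(n->handle))
        ins = &(*ins)->next;
    n->next = *ins;
    *ins = n;
}
```

Whatever order the rules are added in, the resulting list (and hence the hardware that mirrors it) sees them by node id.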

And although ixgbe supports only a single table, the mapping on more
complex devices will take it onto multiple tables if that optimizes
things.

Note I need to do a couple of fixes on my existing code: one, to abort when
given a bad ifindex, and two, to hard abort when a hashtable is given
a higher-order rule that I can't support. Both are fairly small tweaks
to the ixgbe code, not to the infrastructure.

> 
> cheers,
> jamal
> 


Re: [net-next PATCH 0/7] tc offload for cls_u32 on ixgbe

2016-02-03 Thread Fastabend, John R
On 2/3/2016 4:21 AM, Jamal Hadi Salim wrote:
> On 16-02-03 05:31 AM, Or Gerlitz wrote:
>> On 2/3/2016 12:21 PM, John Fastabend wrote:
>>> Thanks, we will need at least a v2 to fixup some build errors
>>> with various compile flags caught by build_bot and missed by me.
>> Hi John,
>>
>> You didn't mark that as RFC... but we said this direction/approach yet
>> to be talked @ netdev next-week, so.. can you clarify?

Yeah, I think this set of patches is ready, and it really is what we've
been talking about doing for two or three conferences now. Also, I don't
know where "we" decided to talk about this @ netdev or how that became
a prereq for patches. I think this is the correct approach and I am not
seeing any contentious pieces here, so why not consider it for inclusion?
Do you have some issue with the approach? I don't recall any when we
talked about this last time.

>>
>> I suggest not to rush in asking to pull this; let's have the tc workshop
>> beforehand...
>>

Rush? We've been talking about it for a year+.

> 
> Yes, the tc workshop is a good place for this.
> I think we can spill some of it into the switchdev workshop (which is a
> nice flow since that happens later).
> 

Sure, but this patch set is really the basic stuff; we should move on to
some of the more interesting pieces. I have ~30 or so patches behind
this that do the fun stuff like resource allocation, capabilities,
support for divisor > 1, some new actions, etc.

> Some comments:
> 1) "priorities" for filters and some form of "index" for actions
> is needed. I think index (which tends to be a 32-bit value) is what
> Amir's patches referred to as "cookie" - or at least something some
> hardware can use to query the action with. Priorities may be implicit in
> the order in which they are added. And the idea of appending vs
> exclusivity vs replace (which netlink already supports)
> is important to worry about (TCAMs tend to assume an append mode
> for example).

The code denotes add/del/replace already. I'm not sure why a TCAM
would assume an append mode, but OK, maybe that is some API you have;
the APIs I use don't have these semantics.

For this series using cls_u32 the handle gives you everything you need
to put entries in the right table and row. Namely the ht # and order #
from 'tc'. Take a look at u32_change and u32_classify its the handle
that places the filter into the list and the handle that is matched in
classify. We should place the filters in the hardware in the same order
that is used by u32_change.

Also, I ran a few tests and can't see how priority works in u32; maybe you
can shed some light, but as best I can tell it doesn't have any effect
on rule execution.

> 
> 2) I like the u32 approach where it makes sense; but sometimes it
> doesn't make sense from a usability pov. I work with some ASICs
> that have 10 tuples that are fixed. Yes, a user can describe a policy
> with u32, but flower would be more usable (both programmatically
> and from the cli).

Sure, so create a set of offload hooks for flower; we don't need only
one hardware classifier any more than we would want a single software
classifier. I'll send out my flower patches when I get to a real system;
I'm on a corporate laptop at the moment.

> 
> 3) The concept of "hook address" is important to be able to express.
> Amir's patches seemed to miss that (and John brought it up in an
> email). It could be as simple as ifindex + hookid. With ifindex of
> 0 meaning all ports and maybe hookid of 0 meaning all hooks.
> Hook semantics are as mentioned by John (as it stands right now,
> ingress/egress).
> 

Again, I'm trying to faithfully implement what we have in software
and load that into the hardware. The handle today gives the ingress/egress
hook. If you want an all-ports hook we should add it to 'tc' software
first and then push that to the hardware, not create magic hardware
bits. See, I've drunk the Kool-Aid: software first, then hardware.

> 4) Why are we forsaking switchdev John?
> This is certainly re-usable beyond NICs and SRIOV.
> 

Sure, and switchdev can use it just like they use fdb_add and friends.
I just don't want to require switchdev infrastructure on things that
really are not switches. I think Amir indicated he would take a try
at the switchdev integration. If not, I'm willing to do it, but it
doesn't block this series in any way imo.

> 5)What happened to being both able to hardware and/or software?

Follow-up patch once we get the basic infrastructure in place, with
the big feature-flag bit. I have a patch I'm testing for this now,
but again I want to move in logical and somewhat minimal sets.

> 
> Anyways, I think Seville would be a blast! Come one, come all.
> 

I'll be there, but let's be sure to follow up on this online. I
know folks are following this who won't be at Seville, and I don't
see any reason to block these patches and stall the thread for a
week or more.

> cheers,
> jamal
> 
>