RE: XDP seeking input from NIC hardware vendors
Hi Jesper,

I have done some previous work on proprietary systems where we used hardware to do the classification/parsing, then passed a cookie to the software, which used the cookie to look up a program to run on the packet. When your programs are structured as a bunch of parsing followed by some actions, this can provide real performance benefits. Also, a lot of existing hardware supports this today, assuming you use headers the hardware "knows" about. It's a natural model for hardware that uses a parser followed by tcam/cam/sram/etc lookup tables.

If the goal is just to separate XDP traffic from non-XDP traffic, you could accomplish this with a combination of SR-IOV/macvlan to separate the device queues into multiple netdevs, and then run XDP on just one of the netdevs. Then use flow director (ethtool) or 'tc cls_u32/flower' to steer traffic to that netdev. This is how we support multiple networking stacks on one device, by the way; it is called the bifurcated driver. It's not too far of a stretch to think we could offload some simple XDP programs to do the splitting of traffic instead of cls_u32/flower/flow_director, and then you would have a stack of XDP programs: one running in hardware and a set running on the queues in software.

The other interesting thing would be to do more than just packet steering and actually run a more complete XDP program. Netronome supports this, right? The question I have, though, is: is this a stack of XDP programs, one or more designated for hardware and some running in software, perhaps with some annotation in the program so the hardware JIT knows where to place programs? Or do we expect the JIT itself to try and decide what is best to offload? I think the easiest thing to start with is to annotate the programs.

Also, as far as I know, a lot of hardware can stick extra data on the front or end of a packet, so you could push metadata calculated by the program there in a generic way, without having to extend the XDP-defined metadata structures.
Another option is to DMA the metadata to a specified address. With this metadata, the consumer/producer XDP programs have to agree on the format, but no one else does.

FWIW, I was hoping to get some data showing performance overhead vs. how deep we parse into the packets. I just won't have time to get to it for a while, but that could tell us how much perf gain the hardware could provide.

Thanks,
John

-----Original Message-----
From: Jesper Dangaard Brouer [mailto:bro...@redhat.com]
Sent: Thursday, July 7, 2016 3:43 AM
To: iovisor-...@lists.iovisor.org
Cc: bro...@redhat.com; Brenden Blanco; Alexei Starovoitov; Rana Shahout; Ari Saha; Tariq Toukan; Or Gerlitz; netdev@vger.kernel.org; Simon Horman; Simon Horman; Jakub Kicinski; Edward Cree; Fastabend, John R
Subject: XDP seeking input from NIC hardware vendors

Would it make sense, from a hardware point of view, to split the XDP eBPF program into two stages?

 Stage-1: Filter (restricted eBPF / no helper calls)
 Stage-2: Program

Then the HW can choose to offload the stage-1 "filter", and keep the likely more advanced stage-2 on the kernel side. Do HW vendors see a benefit in this approach?

The generic problem I'm trying to solve is parsing. E.g. the first step in every XDP program will be to parse the packet data, in order to determine if this is a packet the XDP program should process.

Actions from the stage-1 "filter" program:
 - DROP (like XDP_DROP, early drop)
 - PASS (like XDP_PASS, normal netstack)
 - MATCH (call stage-2, likely carry over an opaque return code)

The MATCH action should likely carry over an opaque return code that makes sense for the stage-2 program, e.g. proto id and/or data offset.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
Re: [net-next PATCH 7/7] net: ixgbe: add support for tc_u32 offload
[...]

>> If you leave ht and order off the tc cli, I believe 'tc' just
>> picks some semi-arbitrary ones for you. I've been in the habit
>> of always specifying them, even for software filters.
>
> The default table id is essentially 0x800. The default bucket is 0.
> "order" essentially is the filter id. And given you can link tables
> (nice work, John!), the ht:bucket:nodeid is an "address" to
> a specific filter on a specific table and, when it makes sense, a specific
> hash bucket. Another way to look at it is as a way to construct
> a mapping to a TCAM key.
> What John is doing is essentially taking the nodeid and trying to use
> it as a priority. In other words, the abstraction is reduced to a linked
> list in which the ordering is how the list is traversed.
> It may work in this case, but I am for being able to explicitly specify
> priorities.

Sorry for bombing you with emails, Jamal. Another thing to note is that ixgbe doesn't support hash tables explicitly, but our other devices do. So when a hash node is created we can map that onto a hardware block and actually do the hash.

>
> cheers,
> jamal
>
Re: [net-next PATCH 7/7] net: ixgbe: add support for tc_u32 offload
[...]

>> Ah, I should have annotated this in the commit msg. I turn the feature
>> off by default; to enable it the user needs to run
>>
>> # ethtool -K ethx hw-tc-offload on
>>
>> This is just a habit of mine, to leave new features off by default for
>> a bit until I work out some of the kinks. For example, I found a case
>> today where, if you build loops into your u32 graph, the hardware tables
>> can get out of sync with the software tables. This is a sort of extreme
>> corner case; I'm not sure anyone would really use u32 this way, but it is
>> valid and the hardware should abort correctly.
>
> Yeh - that is nice :) But I was just pointing out a small typo which I
> think you have.
> The new case will never happen. You compare:
> (features & NETIF_F_NTUPLE) == NETIF_F_HW_TC
> Also the comment before the switch should be modified.

Aha, nice catch; my scripts were enabling both ntuple and hw-tc-offload for testing compatibility issues. I wonder if there is a bug somewhere else that checks that code, though, because it was definitely getting offloaded. Good catch again, thanks.

>
>> Thanks,
>> John
>>
Re: [net-next PATCH 0/7] tc offload for cls_u32 on ixgbe
On 2/4/2016 5:12 AM, Jamal Hadi Salim wrote:
> On 16-02-03 01:48 PM, Fastabend, John R wrote:
>
> BTW: For the record, John, I empathize with you that we need to
> move. Please have patience - we are close; let's just get this resolved
> in Seville. I like your patches a lot and would love to just have
> your patches pushed in, but the challenge with community is being able
> to reach some middle ground. We are not as bad as some of the standards
> organizations. I am sure we'll get this resolved by the end of next week;
> if not, I am 100% in agreement that some form of your patches (and Amir's)
> need to go in, and then we can refactor as needed.

Agreed, although I'm a bit worried we are starting to talk about a single hardware IR. That discussion has always failed in my experience.

>>> 1) "priorities" for filters and some form of "index" for actions
>>> is needed. I think index (which tends to be a 32-bit value) is what
>>> Amir's patches referred to as "cookie" - or at least something some
>>> hardware can use to query the action with. Priorities may be implicit in
>>> the order in which they are added. And the idea of appending vs
>>> exclusivity vs replace (which netlink already supports)
>>> is important to worry about (TCAMs tend to assume an append mode,
>>> for example).
>>
>> The code denotes add/del/replace already. I'm not sure why a TCAM
>> would assume an append mode, but OK, maybe that is some API you have;
>> the APIs I use don't have these semantics.
>
> Basically most hardware (or I should say driver implementations of
> mostly TCAMs) allows you to add exactly the same filter as many times
> as you want. They don't really look at what you want to filter on
> and then scream "conflict". IOW, you (the user) are responsible for
> conflict resolution at the filter level. The driver sees this blob,
> requests some index/key from the hardware, and then just adds it.
> You can then use this key/index to delete/replace etc.
> This is what I meant by "append" mode.
> However, if a classifier implementation cares about filter ambiguity
> resolution, then priorities are used. We need to worry about the
> bigger picture.

Sure, in other classifiers it's used, but it's not needed in this set; I planned to add it later.

>> For this series using cls_u32, the handle gives you everything you need
>> to put entries in the right table and row, namely the ht # and order #
>> from 'tc'.
>
> True - but with a caveat. There are only 2^12 max tables you can
> have, for example, and up to 2^12 filters per bucket, etc.

This is a software limitation as well, right? If it hasn't shown up as a limitation on the software side, why would it be an issue here? Do you have more than 2^12 tables on your devices? If so, I guess we can tack on another 32 bits somewhere.

>> Take a look at u32_change and u32_classify: it's the handle
>> that places the filter into the list and the handle that is matched in
>> classify. We should place the filters in the hardware in the same order
>> that is used by u32_change.
>
> I can see some parallels, but:
> The nodeid in itself is insufficient, for two reasons:
> you can't have more than 2^12 filters per bucket;
> and the nodeid then takes two meanings: a) it is an id,
> b) it specifies the order in which things are looked up.
>
> I think you need to take the u32 address and map it to something in your
> hardware. But at the same time it is important to have the abstraction
> closely emulate your hardware.

IMO the hardware/interface must preserve the same ordering of filters/hash tables/etc. How it does that mapping should be a driver concern, and it can always abort if it fails.

>> Also, I ran a few tests and can't see how priority works in u32. Maybe
>> you can shed some light, but as best I can tell it doesn't have any
>> effect on rule execution.
>
> True.
> u32 doesn't care, because it will give you a nodeid if you don't specify
> one.
> i.e. conflict resolution is mapped to you not specifying exactly
> the same ht:bkt:nodeid more than once. And if you will let the
> kernel do it for you (as I am assuming you are saying your hardware
> will), then there is no need.

Yep. Faithfully offloading u32 here, not changing anything, except I do have to abort in some cases with the simpler devices. fm10k, for example, can model hash nodes with divisors > 1.

>>> 2) I like the u32 approach where it makes sense, but sometimes it
>>> doesn't make sense from a usability pov. I work with some ASICs
>>> that have 10 tuples that are fixed. Yes, a user can describe
Re: [net-next PATCH 3/7] net: sched: add cls_u32 offload hooks for netdevs
On 2/4/2016 5:18 AM, Amir Vadai wrote:
> On Wed, Feb 03, 2016 at 01:28:37AM -0800, John Fastabend wrote:
>> This patch allows netdev drivers to consume cls_u32 offloads via
>> the ndo_setup_tc ndo op.
>>
>> This work aligns with how network drivers have been doing qdisc
>> offloads for mqprio.
>>
>> Signed-off-by: John Fastabend
>> ---
[...]
>> +enum {
>> +	TC_CLSU32_NEW_KNODE,
>
> TC_CLSU32_NEW_KNODE is never used

Aha, yep, that snuck in there. In a follow-up patch for the fm10k devices, where we can support hash tables (e.g. divisor > 1), I use it. Although on closer inspection I need to check that the divisor == 1 on ixgbe, or else abort, because we can get out of sync if software expects hash tables here.

Thanks, nice catch.

>
> [...]
>
Re: [net-next PATCH 0/7] tc offload for cls_u32 on ixgbe
On 2/4/2016 3:19 PM, Pablo Neira Ayuso wrote:
> On Thu, Feb 04, 2016 at 10:16:56AM +0100, Jiri Pirko wrote:
>> Wed, Feb 03, 2016 at 10:27:32AM CET, john.fastab...@gmail.com wrote:
>>>
>>> Also, by adding get_parse_graph and set_parse_graph attributes as
>>> in my previous flow_api work, we can build programmable devices
>>> and programmatically learn when rules can or cannot be loaded
>>> into the hardware. Again, future work.
>>>
>>> Any comments/feedback appreciated.

Sorry if you get this twice; it doesn't look like my original response made it to netdev, and the charger on the laptop I replied from blew up.

>>
>> I like this; it is a thin and elegant solution. However, ~2 years ago
>> when I pushed the openvswitch kernel datapath offload patchset, people
>> were yelling at me that it is not a generic enough solution, that tc
>> has to be able to use the api (Jamal :)), nftables as well.

The other problem with OVS is that if you have the capability to do wildcard lookups (e.g. TCAM/SRAM/etc), then offloading the exact-match table in OVS is a really inefficient use of the resource. You really want to load the megaflow table into hardware. I just don't think it's a good scheme for what you want.

> I would be glad to join this debate during NetDev 1.1 too.

Great.

> I think we should provide a solution that allows people to use both
> tc and nftables. This would require a bit of generic infrastructure on
> top of it so we don't restrict users to one single solution; in other
> words, we allow the user to select their own poison.
>
>> Now this patch is making offload strictly tc-based and nobody seems to
>> care :) I do. I think that we might try to find some generic middle
>> layer.

If we can build a universal model for 'tc' and 'nftables', we should unify them higher in the stack. It doesn't make sense to me for the driver folks to try to create a unified model for two subsystems if we don't think it's worthwhile in software as well.

> I agree and I'll be happy to help to push this ahead.
> Let's try to sit
> and get together to resolve this.

Great.

>
> See you soon.
>
Re: [net-next PATCH 7/7] net: ixgbe: add support for tc_u32 offload
On 2/3/2016 11:30 PM, Amir Vadai wrote:
> On Wed, Feb 03, 2016 at 01:29:59AM -0800, John Fastabend wrote:
>> This adds initial support for offloading the u32 tc classifier. This
>> initial implementation only implements a few base matches and actions
>> to illustrate the use of the infrastructure patches.
>>
>> However, it is an interesting subset because it handles the u32 next
>> hdr logic to correctly map tcp packets from ip headers using the ihl
>> and protocol fields. After this is accepted we can extend the match
>> and action fields easily by updating the model header file.
>>
>> Also, only the drop action is supported initially.
>>
>> Here is a short test script,
>>
>> #tc qdisc add dev eth4 ingress
>> #tc filter add dev eth4 parent ffff: protocol ip \
>>	u32 ht 800: order 1 \
>>	match ip dst 15.0.0.1/32 match ip src 15.0.0.2/32 action drop
>>
>> <-- hardware has dst/src ip match rule installed -->
>>
>> #tc filter del dev eth4 parent ffff: prio 49152
>> #tc filter add dev eth4 parent ffff: protocol ip prio 99 \
>>	handle 1: u32 divisor 1
>> #tc filter add dev eth4 protocol ip parent ffff: prio 99 \
>>	u32 ht 800: order 1 link 1: \
>>	offset at 0 mask 0f00 shift 6 plus 0 eat match ip protocol 6 ff
>> #tc filter add dev eth4 parent ffff: protocol ip \
>>	u32 ht 1: order 3 match tcp src 23 ffff action drop
>>
>> <-- hardware has tcp src port rule installed -->
>>
>> #tc qdisc del dev eth4 parent ffff:
>>
>> <-- hardware cleaned up -->
>>
>> Signed-off-by: John Fastabend
>> ---
>>  drivers/net/ethernet/intel/ixgbe/ixgbe.h         |    3
>>  drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c |    6 -
>>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c    |  196 ++
>>  3 files changed, 198 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
>> b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
>> index 4b9156c..09c2d9b 100644
>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
>
> [...]
>
>> @@ -8277,6 +8465,7 @@ static int ixgbe_set_features(struct net_device
>> *netdev,
>>	 */
>>	switch (features & NETIF_F_NTUPLE) {
>>	case NETIF_F_NTUPLE:
>> +	case NETIF_F_HW_TC:
>>		/* turn off ATR, enable perfect filters and reset */
>>		if (!(adapter->flags & IXGBE_FLAG_FDIR_PERFECT_CAPABLE))
>>			need_reset = true;
>
> I think you have a bug here. I don't see how the NETIF_F_HW_TC case will
> happen after masking 'features' out.

Ah, I should have annotated this in the commit msg. I turn the feature off by default; to enable it the user needs to run

 # ethtool -K ethx hw-tc-offload on

This is just a habit of mine, to leave new features off by default for a bit until I work out some of the kinks. For example, I found a case today where, if you build loops into your u32 graph, the hardware tables can get out of sync with the software tables. This is a sort of extreme corner case; I'm not sure anyone would really use u32 this way, but it is valid and the hardware should abort correctly.

Thanks,
John
Re: [net-next PATCH 7/7] net: ixgbe: add support for tc_u32 offload
On 2/3/2016 4:46 AM, Jamal Hadi Salim wrote:
[...]
>>>
>>> What are you doing w.r.t. priorities? Are the filters processed in the
>>> order of the priorities?

In the same order as software processes filters. I tried to faithfully translate the u32_classify and u32_change loops into hardware. If I missed some case, it's a bug and I'll fix it. My reading of, and experiments with, u32 indicate the ht:bucket:node build a handle, and this is how we order rules. When I test this, I create a veth pair and create the same rule set on the veth pair as on my hardware netdev. Then inject an skb into both the veth and the hardware netdev; if you get different results, it's a bug.

>>
>> The rules are put in order by the handles, which are populated in
>> my command above such that 'ht 1: order 3' gives handle 1::3 and
>> 'ht 800: order 1' gives 800::1. Take a look at this block in cls_u32:
>>
>>	if (err == 0) {
>>		struct tc_u_knode __rcu **ins;
>>		struct tc_u_knode *pins;
>>
>>		ins = &ht->ht[TC_U32_HASH(handle)];
>>		for (pins = rtnl_dereference(*ins); pins;
>>		     ins = &pins->next, pins = rtnl_dereference(*ins))
>>			if (TC_U32_NODE(handle) < TC_U32_NODE(pins->handle))
>>				break;
>>
>>		RCU_INIT_POINTER(n->next, pins);
>>		rcu_assign_pointer(*ins, n);
>>		u32_replace_hw_knode(tp, n);
>>		*arg = (unsigned long)n;
>>		return 0;
>>
>> If you leave ht and order off the tc cli, I believe 'tc' just
>> picks some semi-arbitrary ones for you. I've been in the habit
>> of always specifying them, even for software filters.
>>
>
> The default table id is essentially 0x800. The default bucket is 0.
> "order" essentially is the filter id. And given you can link tables
> (nice work, John!), the ht:bucket:nodeid is an "address" to
> a specific filter on a specific table and, when it makes sense, a specific
> hash bucket. Another way to look at it is as a way to construct
> a mapping to a TCAM key.
Sure, as long as your mapping works/looks like the software-only case, it doesn't matter how you do the hardware mapping, and my guess is this is going to be highly device specific. Even with the handful of devices I have, it looks different depending on the device. Note, if you don't do the table links, there really is no safe way to get into the headers beyond the IP header, because you need the nexthdr stuff.

Also, just to stick this out there, I have a patch to let cls_u32 start processing at the ethernet header instead of the transport header, similar to how bpf works. This negative offset business doesn't really work, as best I can tell.

> What John is doing is essentially taking the nodeid and trying to use
> it as a priority. In other words, the abstraction is reduced to a linked
> list in which the ordering is how the list is traversed.
> It may work in this case, but I am for being able to explicitly specify
> priorities.

But that doesn't exist in software today. If users want explicit order today, they build the ht and nodeid out correctly, so just use that. If you are working on a TCAM, just use the nodeid; that should be equivalent to priority. And although the ixgbe supports only a single table, the mapping on more complex devices will take it onto multiple tables if that optimizes things.

Note, I need to do a couple of fixes on my existing code: one, to abort when given a bad ifindex, and two, to hard abort when a hashtable is given a higher-order rule that I can't support. Both are fairly small tweaks to the ixgbe code, not to the infrastructure.

>
> cheers,
> jamal
>
Re: [net-next PATCH 0/7] tc offload for cls_u32 on ixgbe
On 2/3/2016 4:21 AM, Jamal Hadi Salim wrote:
> On 16-02-03 05:31 AM, Or Gerlitz wrote:
>> On 2/3/2016 12:21 PM, John Fastabend wrote:
>>> Thanks, we will need at least a v2 to fix up some build errors
>>> with various compile flags caught by build_bot and missed by me.
>>
>> Hi John,
>>
>> You didn't mark that as RFC... but we said this direction/approach has
>> yet to be talked about @ netdev next week, so... can you clarify?

Yeah, I think this set of patches is ready, and it really is what we've been talking about doing for two or three conferences now. Also, I don't know where "we" decided to talk about this @ netdev, or how that became a prereq for patches. I think this is the correct approach and I am not seeing any contentious pieces here, so why not consider it for inclusion? Do you have some issue with the approach? I don't recall any when we talked about this last time.

>>
>> I suggest not to rush in asking to pull this; let's have the tc
>> workshop beforehand...

Rush? We've been talking about it for a year+.

>
> Yes, the tc workshop is a good place for this.
> I think we can spill some of it into the switchdev workshop (which is a
> nice flow since that happens later).

Sure, but this patch set is really the basic stuff; we should move on to some of the more interesting pieces. I have ~30 or so patches behind this that do the fun stuff like resource allocation, capabilities, support for divisor > 1, some new actions, etc.

> Some comments:
> 1) "priorities" for filters and some form of "index" for actions
> is needed. I think index (which tends to be a 32-bit value) is what
> Amir's patches referred to as "cookie" - or at least something some
> hardware can use to query the action with. Priorities may be implicit in
> the order in which they are added. And the idea of appending vs
> exclusivity vs replace (which netlink already supports)
> is important to worry about (TCAMs tend to assume an append mode,
> for example).

The code denotes add/del/replace already.
I'm not sure why a TCAM would assume an append mode, but OK, maybe that is some API you have; the APIs I use don't have these semantics.

For this series using cls_u32, the handle gives you everything you need to put entries in the right table and row, namely the ht # and order # from 'tc'. Take a look at u32_change and u32_classify: it's the handle that places the filter into the list and the handle that is matched in classify. We should place the filters in the hardware in the same order that is used by u32_change.

Also, I ran a few tests and can't see how priority works in u32. Maybe you can shed some light, but as best I can tell it doesn't have any effect on rule execution.

>
> 2) I like the u32 approach where it makes sense, but sometimes it
> doesn't make sense from a usability pov. I work with some ASICs
> that have 10 tuples that are fixed. Yes, a user can describe a policy
> with u32, but flower would be more usable (both programmatically and
> on the cli).

Sure, so create a set of offload hooks for flower; we don't need only one hardware classifier any more than we would want a single software classifier. I'll send out my flower patches when I get to a real system; I'm on a corporate laptop at the moment.

>
> 3) The concept of a "hook address" is important to be able to express.
> Amir's patches seemed to miss that (and John brought it up in an
> email). It could be as simple as ifindex + hookid, with an ifindex of
> 0 meaning all ports and maybe a hookid of 0 meaning all hooks.
> Hook semantics are as mentioned by John (as it stands right now,
> in/egress).

Again, I'm trying to faithfully implement what we have in software and load that into the hardware. The handle today gives the ingress/egress hook. If you want an all-ports hook, we should add it to 'tc' software first and then push that to the hardware, not create magic hardware bits. See, I've drunk the cool aid: software first, then hardware.

> 4) Why are we forsaking switchdev, John?
> This is certainly re-usable beyond NICs and SR-IOV.

Sure, and switchdev can use it just like they use fdb_add and friends. I just don't want to require switchdev infrastructure on things that really are not switches. I think Amir indicated he would take a try at the switchdev integration. If not, I'm willing to do it, but it doesn't block this series in any way, imo.

> 5) What happened to being able to do both hardware and/or software?

Follow-up patch, once we get the basic infrastructure in place, with the big feature flag bit. I have a patch I'm testing for this now, but again, I want to move in logical and somewhat minimal sets.

>
> Anyways, I think Seville would be a blast! Come one, come all.

I'll be there, but let's be sure to follow up with this online. I know folks are following this who won't be at Seville, and I don't see any reason to block these patches and stop the thread for a week or more.

> cheers,
> jamal
>
>