Re: [dpdk-dev] [PATCH v2 0/2] AVX2 Vectorized Rx/Tx functions for i40e

2018-01-09 Thread John Fastabend
On 01/09/2018 06:32 AM, Bruce Richardson wrote:
> This patch adds an AVX2 vectorized path to the i40e driver, based on the
> existing SSE4.2 version. Using AVX2 instructions gives better performance
> than the SSE version, though the percentage increase depends on the exact
> settings used. For example:
> 

Hi Bruce,

Just curious, can you provide some hints on the percent increase in at least
some representative cases? I'm just trying to get a sense of whether this is
5%, 10%, 20%, or more... I know mileage will vary depending on system, setup,
configuration, etc.

Thanks,
John

> * Using 16B rather than 32B descriptors gives the biggest benefit since
>   2 descriptors at a time can be read, rather than just 1 when 32B ones
>   are used.
> * Bigger burst sizes for RX give improved performance - while we see an
>   improvement with testpmd with the default burst size of 32, burst sizes
>   of up to 128 give further improvements
> * In my testing, most of the improvement comes from faster processing on
>   the RX path, though the improved TX also gives benefit.
> 
> This has been tested on a system with CPU: "Intel(R) Xeon(R) Gold 6154 CPU
> @ 3.00GHz", and I've focused on testing with Rx ring sizes of approx 1k -
> generally --rxd=1024 and --txd=512, rather than the defaults which tend to
> give poorer zero-loss performance due to the smaller amount of buffering.
> 
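For context, a testpmd invocation along these lines would exercise the ring and burst sizes described above; the core list and memory-channel count are placeholders, not taken from Bruce's setup:

```shell
# Hypothetical testpmd run matching the settings described above;
# -l (cores) and -n (memory channels) are placeholders for your system.
./build/app/testpmd -l 1-3 -n 4 -- -i \
    --rxd=1024 --txd=512 --burst=128
```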
> V2:
> * Fixed incorrect config variable reference in makefile
> * Added missing stub function for when vector drivers are disabled
> * Added missing references to the new functions when checking for vector
>   code paths, e.g. for ring tear-down
> 
> Bruce Richardson (2):
>   net/i40e: add AVX2 Tx function
>   net/i40e: add AVX2 Rx function
> 
>  drivers/net/i40e/Makefile |  19 +
>  drivers/net/i40e/i40e_rxtx.c  |  66 ++-
>  drivers/net/i40e/i40e_rxtx.h  |   6 +
>  drivers/net/i40e/i40e_rxtx_vec_avx2.c | 792 ++++++++++++++++++++++++++++++++
>  4 files changed, 880 insertions(+), 3 deletions(-)
>  create mode 100644 drivers/net/i40e/i40e_rxtx_vec_avx2.c
> 



Re: [dpdk-dev] [PATCH v8 00/25] Support VFD on i40e

2017-01-13 Thread John Fastabend
On 17-01-11 06:51 AM, JOSHI, KAUSTUBH  (KAUSTUBH) wrote:
> Also, the kernel drivers have no concept of passing VF messages to upstream 
> "decision making” (or policy enforcement) software like VFd.
> 
> On Jan 11, 2017, at 9:49 AM, Kaustubh Joshi <kaust...@research.att.com> wrote:
> 
> When Alex from our team started working on Niantic last year, the following 
> were the list of gaps in the kernel drivers we had a need to fill:
> 

Thanks for the list, it's nice to see concrete examples. I can
provide the latest upstream status for what it's worth.

> Direct traffic to VF based on more than one outer VLAN tags

The kernel supports this but the ixgbe/i40e drivers need to pull in
code to enable it.

> Optionally strip on ingress (to PF) and insert on egress VLAN tag

This is just a PVID per VF, right? This should be working now on
ixgbe, though I would have to check.

> Disable/enable MAC and VLAN anti spoofing separately

Needs kernel patch.
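For reference, the per-VF VLAN insert/strip above and today's combined anti-spoof knob are both reachable through iproute2; the interface name and VF index below are examples:

```shell
# PVID-style VLAN: strip the tag on ingress to the VF, insert on egress
ip link set dev enp3s0f0 vf 0 vlan 100

# Anti-spoofing today is a single switch covering both MAC and VLAN,
# which is why toggling them separately needs a kernel patch as noted above
ip link set dev enp3s0f0 vf 0 spoofchk off
```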

> Mirror traffic from one VF to the other

The kernel supports this but the driver is missing the feature.

> Enable/Disable local switching per VF

Local switching? I'm guessing this means you want all traffic to
go to the network regardless of MAC/etc. A sort of per-port VEPA mode?

> Collect statistics per VF pkts/octets in/out

Under development.

> Enable/disable Mcast/unknown unicast per VF

Under development.

> Manage up to 8 TC per VF with one strict priority queue
> Manage per VF per TC bandwidth allocations

Needs kernel patch.

> Manage LACP status visibility to the VFs (for NIC teaming using SRIOV)

I've not even thought to do this yet, so it's a good catch.

> 
> Most of these are VF management functions, and there is no standardized way 
> to do VF management in the kernel drivers. Besides, most of the use-cases 
> around SRIOV need DPDK in the VF anyway (so the target communities are 
> aligned) and the PF DPDK driver for ixgbe already existed, so it made sense 
> to add them there - no forking of the PF driver was involved and there is no 
> additional duplicate code.
> 

So I won't argue against that; we already have DPDK, the VFs run DPDK,
and updating is a problem per the other email. But I think we can at
least get kernel support for the above.

Thanks,
John

> Cheers
> 
> KJ
> 
> 
> On Jan 11, 2017, at 6:03 AM, Vincent Jardin <vincent.jar...@6wind.com> wrote:
> 
> Please can you list the gaps of the Kernel API?
> 
> Thank you,
> Vincent
> 
> 
> On 11 January 2017 3:59:45 AM, "JOSHI, KAUSTUBH  (KAUSTUBH)" <kaust...@research.att.com> wrote:
> 
> Hi Vincent,
> 
> Greetings! Jumping into this debate a bit late, but let me share our point of 
> view based on how we are using this code within AT&T for our NFV cloud.
> 
> Actually, we first started with trying to do the configuration within the 
> kernel drivers as you suggest, but quickly realized that besides the 
> practical problem of kernel upstreaming being a much more arduous road (which 
> can be overcome), the bigger problem was that there is no standardization in 
> the configuration interfaces for the NICs in the kernel community. So 
> different drivers do things differently and expose different settings, and no 
> forum exists to drive towards such standardization. This was leading to 
> vendors having to maintain patched versions of drivers for doing PF 
> configuration, which is not a desirable situation.
> 
> So, to build a portable (to multiple NICs) SRIOV VF manager like VFd, DPDK 
> seemed like a good forum with some hope of driving towards a standard set 
> of interfaces, without having to worry about a lot of legacy baggage and 
> old hardware. Especially since DPDK already takes on the role of configuring 
> NICs for the data plane functions anyway - both PF and VF drivers will have 
> to be included for data plane usage anyway - we viewed that adding VF config 
> options would not cause any forking, but simply flesh out the DPDK drivers and 
> their interfaces to be more complete. These APIs could be optional, so new 
> vendors aren’t obligated to add them.
> 
> Furthermore, allowing VF config using the DPDK PF driver also has the side 
> benefit of allowing a complete SRIOV system (both VF and PF) to be built 
> entirely with DPDK, also making version alignment easier.
> 
> We started with Niantic, which already had PF and VF drivers, and things have 
> worked out very well with it. However, we would like VFd to be a multi-NIC 
> vendor agnostic VF management tool, which is why we’ve been asking for making 
> the PF config APIs richer.
> 
> Regards
> 
> KJ
> 
> 
> On Jan 10, 2017, at 3:23 PM, Vincent Jardin <vincent.jar...@6wind.com> wrote:
> 
> Nope. First one needs to assess whether DPDK should be intensively used to 
> become a PF, knowing Linux can do the job. The Linux kernel community does not 
> like the forking of kernel drivers, and I tend to agree that we should not keep 
> duplicating options that can be solved with the Linux kernel.
> 
> Best regards,
> Vincent
> 
> 

Re: [dpdk-dev] [PATCH v5 00/29] Support VFD and DPDK PF + kernel VF on i40e

2017-01-13 Thread John Fastabend
On 17-01-11 07:47 PM, Chen, Jing D wrote:
> Hi, Vincent,
> 
>> -Original Message-
>> From: Vincent JARDIN [mailto:vincent.jar...@6wind.com]
>> Sent: Tuesday, January 10, 2017 9:30 PM
>> To: Scott Daniels ; dev@dpdk.org
>> Cc: kaust...@research.att.com; az5...@att.com; Chen, Jing D
>> 
>> Subject: Re: [dpdk-dev] [PATCH v5 00/29] Support VFD and DPDK PF + kernel VF 
>> on
>> i40e
>>
>> Hi Scott,
>>
>> On 04/01/2017 at 22:09, Scott Daniels wrote:
>>>  With holidays we are a bit late with our thoughts, but would like to
>>>  toss them into the mix.
>>
>> Same, I hope I am not missing emails. I do appreciate your arguments, it
>> provides lot of light. See below,
>>
>>>  The original NAK is understandable, however having the ability to
>>>  configure the PF via DPDK is advantageous for several reasons:
>>>
>>>  1) While some functions may be duplicated and/or available from the kernel
>>>  driver, it is often not possible to introduce new kernel drivers into
>>>  production without a large amount of additional testing of the entire
>>>  platform which can cause a significant delay when introducing a DPDK based
>>>  product.  If the PF control is a part of the DPDK environment, then only
>>>  the application needs to pass the operational testing before deployment; a
>>>  much more simple task.
>>
>> So we agree: you confirm that you foresee the benefits of using DPDK to
>> *bypass the role of the Kernel being the PF* of reference for the
>> hypervisor.
>>
>>>  2) If the driver changes are upstreamed into the kernel proper, the
>>>  difficulty of operational readiness testing increases as a new kernel is
>>>  introduced, and further undermines the ability to quickly and easily
>>>  release a DPDK based application into production.  While the application
>>>  may eventually fall back on driver and/or kernel support, this could be
>>>  years away.
>>
>> I do agree with the benefits of the agilities and the upsides it brings.
>> But they are other options to get the same agility without creating a
>> fragmentation of PFs.
>>
>> For example, you do not have to update the whole kernel, you can just
>> update the PF kernel module that can be upgraded with the latest needed
>> features.
>>
>>>  3) As DPDK is being used to configure the NIC, it just seems to make
>>>  sense, for consistency, that the configuration capabilities should include
>>>  the ability to configure the PF as is proposed.
>>
>>  From this perspective, the kernel modules are fine for the PF: most
>> kernels of hypervisors support it without the need to upgrade their kernels.
>>
>> To summarize, I understand that you need a flexible way to upgrade PF
>> features without touching/changing the kernel. So let's check the kernel
>> module option? VFD brings some interesting capabilities, could it be a
>> way to push and stimulate the i40e features instead of using DPDK?
>>
>>https://sourceforge.net/projects/e1000/files/i40e%20stable/
>> for instance could be better stimulated.
> 

I figured I would weigh in with my $0.02.

> May I ask what if DPDK VF need a new extension function from PF?
> Then, we'll have to build kernel community expertise and submit
> patch to Linux PF and wait for merge.

Correct, but it is the same story on the DPDK repository as well. Either
way a patch must be generated, tested, submitted, and then wait until it
is merged. At this point I think even the code is very similar, sans
style and naming conventions, for these types of features.

> Your proposal indicates DPDK community submitters will have to
> ask Linux community to authorize if we'll have any requirements
> in DPDK VF.

I don't think this is really much of a problem; in most cases the same
company maintains both the DPDK and Linux kernel drivers, so hopefully
everyone is aligned.

> Comparing fragmentation, the extra dependency worries me most.
> Can you imagine how long it will take for any VF feature to get
> ready?
> 
> 

The Linux kernel releases every 6-8 weeks, roughly, depending on the
release. In networking-stack land, features must be submitted for a
given kernel release before its first RC.

So although there may be benefits to pushing these features into
DPDK vs. the Linux kernel, I don't think the speed of actually getting
the code merged is one of them.

To the point above about out-of-tree drivers: most vendors do keep
out-of-tree drivers with the latest features and fixes that can be
supported on older kernels. Unfortunately, many of the interfaces
like the type mentioned here would require kernel updates outside
the drivers to add message handlers and driver callbacks for the
new config messages. So that may not be viable, at least in the
short term, unless we restructure some of the kernel so that
out-of-tree drivers can provide them.

So I'm sympathetic to argument (1) here. If _not_ updating the
kernel is a hard requirement, then we are a bit stuck from using
existing/future Linux features. Further, this is sort of a rolling
problem as well. Today we are missing feature X and we add it and

Re: [dpdk-dev] [PATCH v2 00/25] Generic flow API (rte_flow)

2017-01-04 Thread John Fastabend
[...]

>>> Well, it should be easier to answer if you have a specific use-case in mind
>>> you would like to support but that cannot be expressed with the API as
>>> defined in [1], in which case please share it with the community.
>>
>> A use-case would be implementing OvS DPIF flow offload using this API.
> 
> OK, OVS has been mentioned several times in this thread and my understanding
> is that rte_flow seems to accommodate most of its needs according to people
> familiar with it. Perhaps ML archives can answer the remaining questions you
> may have about combining rte_flow with OVS.

For what it's worth, I reviewed this and believe it should be sufficient
to support the OVS SR-IOV offload use case with action/classifier extensions
but without any fundamental changes to the design. We built a prototype
OVS offload on top of another API we dubbed Flow-API a year-plus ago, and
there seems to be a 1:1 mapping between that older API and the one now in
DPDK, so I'm happy. The missing things seem to fit nicely into extensions.

Also, I believe the partial pre-classify use cases should be easily handled
as well, although I'm not as familiar with the bit-level details of that
implementation.

At some point capability discovery will be useful, but we certainly don't
need it in the first iteration, and rte_flow doesn't preclude these types
of extensions, so that is good.

By the way, thanks for doing this work, Adrien. Glad to see it being
accepted and drivers picking it up.

Thanks,
John

> 
>>> [1] http://dpdk.org/ml/archives/dev/2016-December/052954.html
>>> [2] http://dpdk.org/ml/archives/dev/2016-December/052975.html
> 
> [3] http://dpdk.org/doc/guides/prog_guide/rte_flow.html
> 



Re: [dpdk-dev] KNI Questions

2016-12-15 Thread John Fastabend
On 16-12-15 09:16 AM, Stephen Hemminger wrote:
> On Thu, 15 Dec 2016 11:53:59 +
> Ferruh Yigit  wrote:
> 
>> Hi Stephen,
>>
>> <...>
>>
>>>
>>> Which raises a couple of questions:
>>>  1. Why is DPDK still keeping KNI support for Intel specific ethtool 
>>> functionality.
>>> This always breaks, is code bloat, and means a 3rd copy of base code 
>>> (Linux, DPDK PMD, + KNI)  
>>
>> I agree with your comments related to the ethtool functionality,
>> but right now that is functionality that people may be using; I think
>> we should not remove it without providing an alternative to it.
>>
>>>
>>>  2. Why is KNI not upstream?
>>> If not acceptable due to security or supportablity then why does it 
>>> still exist?  
>>
>> I believe you are one of the most knowledgeable people on the mailing
>> list on upstreaming; any support is welcome.
> 
> It should be upstreamable but I doubt it would make it past the maintainer,
> mostly because it supports DPDK, which he is not in favor of, but also because
> it is a specialized interface only usable by DPDK, i.e. not general 
> infrastructure.
> 

I was looking to see if we could get more or less the same interface put
into either af_packet or vhost directly. It would work nicely with the XDP
patches, where we want to forward a packet into userspace without having
to build an skb, etc. So it wouldn't be _just_ a DPDK interface at that
point.

I was hoping to look into it in Jan/Feb; I need to wrap a few other
things up first.

.John


[dpdk-dev] [RFC v2] Generic flow director/filtering/classification API

2016-08-22 Thread John Fastabend
On 16-08-19 12:32 PM, Adrien Mazarguil wrote:
> Hi All,
> 
> Thanks to many for the positive and constructive feedback I've received so
> far. Here is the updated specification (v0.7) at last.
> 
> I've attempted to address as many comments as possible but could not
> process them all just yet. A new section "Future evolutions" has been
> added for the remaining topics.
> 
> This series adds rte_flow.h to the DPDK tree. Next time I will attempt to
> convert the specification as a documentation commit part of the patchset
> and actually implement API functions.
> 
> I think including the entire document here makes it easier to annotate on
> the ML, apologies in advance for the resulting traffic.
> 
> Finally I'm off for the next two weeks, do not expect replies from me in
> the meantime.
> 

Hopefully on vacation :)

[...]


> .. raw:: pdf
> 
>PageBreak
> 
> +---+
> | UDP payload matching  |
> +===+===+
> | 0 | Ethernet  |
> +---+---+
> | 1 | IPv4  |
> +---+---+
> | 2 | UDP   |
> +---+-+--+--+---+
> | 3 | RAW | ``spec`` | ``relative`` | 1 |
> |   | |  +--+---+
> |   | |  | ``search``   | 1 |
> |   | |  +--+---+
> |   | |  | ``offset``   | 10|
> |   | |  +--+---+
> |   | |  | ``limit``| 0 |
> |   | |  +--+---+
> |   | |  | ``length``   | 3 |
> |   | |  +--+---+
> |   | |  | ``pattern``  | "foo" |
> +---+-+--+--+---+
> | 4 | RAW | ``spec`` | ``relative`` | 1 |
> |   | |  +--+---+
> |   | |  | ``search``   | 0 |
> |   | |  +--+---+
> |   | |  | ``offset``   | 20|
> |   | |  +--+---+
> |   | |  | ``limit``| 0 |
> |   | |  +--+---+
> |   | |  | ``length``   | 3 |
> |   | |  +--+---+
> |   | |  | ``pattern``  | "bar" |
> +---+-+--+--+---+
> | 5 | RAW | ``spec`` | ``relative`` | 1 |
> |   | |  +--+---+
> |   | |  | ``search``   | 0 |
> |   | |  +--+---+
> |   | |  | ``offset``   | -29   |
> |   | |  +--+---+
> |   | |  | ``limit``| 0 |
> |   | |  +--+---+
> |   | |  | ``length``   | 3 |
> |   | |  +--+---+
> |   | |  | ``pattern``  | "baz" |
> +---+-+--+--+---+
> 

Just an observation: if you made 'offset' specifiable as an embedded RAW
field, so that the offset could point at a header-length field, this would
be fully generic. Although I guess it's not practical; as far as I know,
no hardware would support the most general case.

> This translates to:
> 
> - Locate "foo" at least 10 bytes deep inside UDP payload.
> - Locate "bar" after "foo" plus 20 bytes.
> - Locate "baz" after "bar" minus 29 bytes.
> 
> Such a packet may be represented as follows (not to scale)::
> 
>  0 >= 10 B   == 20 B
>  |  |<->| |<->|
>  |  |   | |   |
>  |-|--|-|-|-|-|---|-|--|
>  | ETH | IPv4 | UDP | ... | baz | foo | . | bar |  |
>  |-|--|-|-|-|-|---|-|--|
>   | |
>   |<--->|
>   == 29 B
> 

[...]

> 
> Future evolutions
> =
> 
> - Describing dedicated testpmd commands to control and validate this API.
> 
> - A method to optimize generic flow rules with specific pattern items and
>   action types generated on the fly by PMDs. DPDK will assign negative
>   numbers to these in order to not collide with the existing types. See
>   `Negative types`_.

Great thanks. As long as we build the core layer to support this then it
looks good to me.

> 
> - Adding specific egress pattern items and actions as described in `Traffic
>   direction`_.
> 
> - Optional software fallback when PMDs are unable to handle requested flow
>   rules so applications do not have to implement their own.

This is an interesting block. Would you presumably build this using the
existing support in DPDK or propose something else?

> 
> - Ranges in addition to bit-masks. Ranges are more generic in many ways as
>   they interpre

[dpdk-dev] [RFC v2] ethdev: introduce generic flow API

2016-08-22 Thread John Fastabend
On 16-08-19 12:32 PM, Adrien Mazarguil wrote:
> This new API supersedes all the legacy filter types described in
> rte_eth_ctrl.h. It is slightly higher level and as a result relies more on
> PMDs to process and validate flow rules.
> 
> It has the following benefits:
> 
> - A unified API is easier to program for, applications do not have to be
>   written for a specific filter type which may or may not be supported by
>   the underlying device.
> 
> - The behavior of a flow rule is the same regardless of the underlying
>   device, applications do not need to be aware of hardware quirks.
> 
> - Extensible by design, API/ABI breakage should rarely occur if at all.
> 
> - Documentation is self-standing, no need to look up elsewhere.
> 
> The existing filter types will be deprecated and removed in the near
> future.
> 
> Note that it is not complete yet. This commit only provides the header
> file. The specification is provided separately, see below.
> 
> HTML version:
>  https://rawgit.com/6WIND/rte_flow/master/rte_flow.html
> 
> PDF version:
>  https://rawgit.com/6WIND/rte_flow/master/rte_flow.pdf
> 
> Git tree:
>  https://github.com/6WIND/rte_flow
> 
> Signed-off-by: Adrien Mazarguil 
> ---

Hi Adrien,

[...]

> +
> +/**
> + * Flow rule attributes.
> + *
> + * Priorities are set on two levels: per group and per rule within groups.
> + *
> + * Lower values denote higher priority, the highest priority for both levels
> + * is 0, so that a rule with priority 0 in group 8 is always matched after a
> + * rule with priority 8 in group 0.
> + *
> + * Although optional, applications are encouraged to group similar rules as
> + * much as possible to fully take advantage of hardware capabilities
> + * (e.g. optimized matching) and work around limitations (e.g. a single
> + * pattern type possibly allowed in a given group).
> + *
> + * Group and priority levels are arbitrary and up to the application, they
> + * do not need to be contiguous nor start from 0, however the maximum number
> + * varies between devices and may be affected by existing flow rules.
> + *

Another pattern I just want to note, which I think can be covered, is
mapping rules between groups.

The idea is that if we build a "tunnel-endpoint" group, a rule in the
tunnel-endpoint might map onto a "switch" group. In this case the
"switch" group match should depend on a rule in the
"tunnel-endpoint" group, meaning the TEP selects the switch. I believe
this can be done with a metadata action.

My idea is to create a rule in "tunnel-endpoint" group that has a
match based on TEP address and then an action "send-to-switch group" +
"metadata set 0x1234". Then in the "switch group" add a match "metadata
eq 0x1234" this allows linking groups together.

It certainly doesn't all need to be in the first iteration of this
series, but do you think this is reasonable as a TODO/future extension?
And if we standardize around group IDs, the semantics should be
consistent for at least the set of NICs that support tunnel endpoints and
multiple switches.

Any thoughts?


.John


[dpdk-dev] [RFC] Generic flow director/filtering/classification API

2016-08-10 Thread John Fastabend
On 16-08-10 06:37 AM, Adrien Mazarguil wrote:
> On Tue, Aug 09, 2016 at 02:47:44PM -0700, John Fastabend wrote:
>> On 16-08-04 06:24 AM, Adrien Mazarguil wrote:
>>> On Wed, Aug 03, 2016 at 12:11:56PM -0700, John Fastabend wrote:
> [...]
>>>> The problem is keeping priorities in order and/or possibly breaking
>>>> rules apart (e.g. you have an L2 table and an L3 table) becomes very
>>>> complex to manage at driver level. I think its easier for the
>>>> application which has some context to do this. The application "knows"
>>>> if its a router for example will likely be able to pack rules better
>>>> than a PMD will.
>>>
>>> I don't think most applications know they are L2 or L3 routers. They may not
>>> know more than the pattern provided to the PMD, which may indeed end at a L2
>>> or L3 protocol. If the application simply chooses a table based on this
>>> information, then the PMD could have easily done the same.
>>>
>>
>> But when we start thinking about encap/decap then it's natural to start
>> using this interface to implement various forwarding dataplanes. And one
>> common way to organize a switch is into a TEP, router, switch
>> (mac/vlan), ACL tables, etc. In fact we see this topology starting to
>> show up in the NICs now.
>>
>> Further each table may be "managed" by a different entity. In which
>> case the software will want to manage the physical and virtual networks
>> separately.
>>
>> It doesn't make sense to me to require a software aggregator object to
>> marshal the rules into a flat table then for a PMD to split them apart
>> again.
> 
> OK, my point was mostly about handling basic cases easily and making sure
> applications do not have to bother with petty HW details when they do not
> want to, yet still get maximum performance by having the PMD make the most
> appropriate choices automatically.
> 
> You've convinced me that in many cases PMDs won't be able to optimize
> efficiently and that conscious applications will know better. The API has to
> provide the ability to do so. I think it's fine as long as it is not
> mandatory.
> 

Great. I also agree that making the table feature _not_ mandatory will be
helpful for many use cases. I'm just making sure we get all the use cases
I know of covered.

>>> I understand the issue is what happens when applications really want to
>>> define e.g. L2/L3/L2 rules in this specific order (or any ordering that
>>> cannot be satisfied by HW due to table constraints).
>>>
>>> By exposing tables, in such a case applications should move all rules from
>>> L2 to a L3 table themselves (assuming this is even supported) to guarantee
>>> ordering between rules, or fail to add them. This is basically what the PMD
>>> could have done, possibly in a more efficient manner in my opinion.
>>
>> I disagree with the more efficient comment :)
>>
>> If the software layer is working on L2/TEP/ACL/router layers merging
>> them just to pull them back apart is not going to be more efficient.
> 
> Moving flow rules around cannot be efficient by definition, however I think
> that attempting to describe table capabilities may be as complicated as
> describing HW bit-masking features. Applications may get it wrong as a
> result while a PMD would not make any mistake.
> 
> Your use case is valid though, if the application already groups rules, then
> sharing this information with the PMD would make sense from a performance
> standpoint.
> 
>>> Let's assume two opposite scenarios for this discussion:
>>>
>>> - App #1 is a command-line interface directly mapped to flow rules, which
>>>   basically gets slow random input from users depending on how they want to
>>>   configure their traffic. All rules differ considerably (L2, L3, L4, some
>>>   with incomplete bit-masks, etc). All in all, few but complex rules with
>>>   specific priorities.
>>>
>>
>> Agree with this and in this case the application should be behind any
>> network physical/virtual and not giving rules like encap/decap/etc. This
>> application either sits on the physical function and "owns" the hardware
>> resource or sits behind a virtual switch.
>>
>>
>>> - App #2 is something like OVS, creating and deleting a large number of very
>>>   specific (without incomplete bit-masks) and mostly identical
>>>   single-priority rules automatically and very frequently.
>>>
>>
>> Maybe for OVS but not all virtual switches a

[dpdk-dev] [RFC] Generic flow director/filtering/classification API

2016-08-10 Thread John Fastabend
On 16-08-10 04:02 AM, Adrien Mazarguil wrote:
> On Tue, Aug 09, 2016 at 02:24:26PM -0700, John Fastabend wrote:
> [...]
>>> Just an idea, could some kind of meta pattern items specifying time
>>> constraints for a rule address this issue? Say, how long (cycles/ms) the PMD
>>> may take to query/apply/delete the rule. If it cannot be guaranteed, the
>>> rule cannot be created. Applications could maintain statistic counters about
>>> failed rules to determine if performance issues are caused by the inability
>>> to create them.
>>
>> It seems a bit heavy to me to have each PMD driver implement
>> something like this. But it would be interesting to explore, probably
>> after the basic support is implemented though.
> 
> OK, let's keep this for later.
> 
> [...]
>>> Such a pattern could be generated from a separate function before feeding it
>>> to rte_flow_create(), or translated by the PMD afterwards assuming a
>>> separate meta item such as RAW_END exists to signal the end of a RAW layer.
>>> Of course processing this would be more expensive.
>>>
>>
>> Or the supported parse graph could be fetched from the hardware with the
>> values for each protocol so that the programming interface is the same.
>> The well known protocols could keep the 'enum values' in the header
>> rte_flow_item_type enum so that users would not be required to do
>> the parse graph but for new or experimental protocols we could query
>> the parse graph and get the programming pattern matching id for them.
>>
>> The normal flow would be unchanged but we don't get stuck upgrading
>> everything to add our own protocol. So the flow would be,
>>
>>  rte_get_parse_graph(graph);
>>  flow_item_proto = is_my_proto_supported(graph);
>>
>>  pattern = build_flow_match(flow_item_proto, value, mask);
>>  action = build_action();
>>  rte_flow_create(my_port, pattern, action);
>>
>> The only change to the API proposed to support this would be to allow
>> unsupported RTE_FLOW_ values to be pushed to the hardware and define
>> a range of values that are reserved for use by the parse graph discover.
>>
>> This would not have to be any more expensive.
> 
> Makes sense. Unless made entirely out of RAW items however the ensuing
> pattern would not be portable across DPDK ports, instances and versions if
> dumped in binary form for later use.
> 

Right.

> Since those would have be recognized by PMDs and applications regardless of
> the API version, I suggest making generated item types negative (enums are
> signed, let's use that).

That works; then the normal positive enums maintain the list of
known/accepted protocols.

> 
> DPDK would have to maintain a list of expended values to avoid collisions
> between PMDs. A method should be provided to release them.

The 'middle layer' could have a non-public API for PMDs to get new
values; call it get_flow_type_item_id() or something.

> 
> [...]
>> hmm for performance reasons building an entire graph up using RAW items
>> seems to be a bit heavy. Another alternative to the above parse graph
>> notion would be to allow users to add RAW node definitions at init time
>> and have the PMD give a ID back for those. Then the new node could be
>> used just like any other RTE_FLOW_ITEM_TYPE in a pattern.
>>
>> Something like,
>>
>>  ret_flow_item_type_foo = rte_create_raw_node(foo_raw_pattern)
>>  ret_flow_item_type_bar = rte_create_raw_node(bar_raw_pattern)
>>
>> then allow ret_flow_item_type_{foo|bar} to be used in subsequent
>> pattern matching items. And if the hardware can not support this return
>> an error from the initial rte_create_raw_node() API call.
>>
>> Do any either of those proposals sound like reasonable extensions?
> 
> Both seem acceptable in my opinion as they fit in the described API. However
> I think it would be better for this function to spit out a pattern made of
> any number of items instead of a single new item type. That way, existing
> fixed items can be reused as well, the entire pattern may even become
> portable as a result, it could be considered as a method to optimize a
> RAW pattern.
> 
> The RAW approach has the advantage of not requiring much additional code in
> the public API besides a couple of function declarations. A proper full
> blown graph would require a lot more as described in your original
> reply. Not sure which is better.
> 
> Either way they won't be part of the initial specification but it looks like
> they can be added later without affecting the basics.
> 

Right i

[dpdk-dev] [RFC] Generic flow director/filtering/classification API

2016-08-09 Thread John Fastabend
On 16-08-04 06:24 AM, Adrien Mazarguil wrote:
> On Wed, Aug 03, 2016 at 12:11:56PM -0700, John Fastabend wrote:
>> [...]
>>
>>>>>>>> The proposal looks very good.  It satisfies most of the features
>>>>>>>> supported by Chelsio NICs.  We are looking for suggestions on exposing
>>>>>>>> more additional features supported by Chelsio NICs via this API.
>>>>>>>>
>>>>>>>> Chelsio NICs have two regions in which filters can be placed -
>>>>>>>> Maskfull and Maskless regions.  As their names imply, maskfull region
>>>>>>>> can accept masks to match a range of values; whereas, maskless region
>>>>>>>> don't accept any masks and hence perform a more strict exact-matches.
>>>>>>>> Filters without masks can also be placed in maskfull region.  By
>>>>>>>> default, maskless region have higher priority over the maskfull region.
>>>>>>>> However, the priority between the two regions is configurable.
>>>>>>>
>>>>>>> I understand this configuration affects the entire device. Just to be 
>>>>>>> clear,
>>>>>>> assuming some filters are already configured, are they affected by a 
>>>>>>> change
>>>>>>> of region priority later?
>>>>>>>
>>>>>>
>>>>>> Both the regions exist at the same time in the device.  Each filter can
>>>>>> either belong to maskfull or the maskless region.
>>>>>>
>>>>>> The priority is configured at time of filter creation for every
>>>>>> individual filter and cannot be changed while the filter is still
>>>>>> active. If priority needs to be changed for a particular filter then,
>>>>>> it needs to be deleted first and re-created.
>>>>>
>>>>> Could you model this as two tables and add a table_id to the API? This
>>>>> way user space could populate the table it chooses. We would have to add
>>>>> some capabilities attributes to "learn" if tables support masks or not
>>>>> though.
>>>>>
>>>>
>>>> This approach sounds interesting.
>>>
>>> Now I understand the idea behind these tables, however from an application
>>> point of view I still think it's better if the PMD could take care of flow
>>> rules optimizations automatically. Think about it, PMDs have exactly a
>>> single kind of device they know perfectly well to manage, while applications
>>> want the best possible performance out of any device in the most generic
>>> fashion.
>>
>> The problem is that keeping priorities in order and/or possibly breaking
>> rules apart (e.g. you have an L2 table and an L3 table) becomes very
>> complex to manage at the driver level. I think it's easier for the
>> application, which has some context, to do this. An application that
>> "knows" it is, for example, a router will likely be able to pack rules
>> better than a PMD will.
> 
> I don't think most applications know they are L2 or L3 routers. They may not
> know more than the pattern provided to the PMD, which may indeed end at a L2
> or L3 protocol. If the application simply chooses a table based on this
> information, then the PMD could have easily done the same.
> 

But when we start thinking about encap/decap then it's natural to start
using this interface to implement various forwarding dataplanes. And one
common way to organize a switch is into a TEP, router, switch
(mac/vlan), ACL tables, etc. In fact we see this topology starting to
show up in the NICs now.

Further each table may be "managed" by a different entity. In which
case the software will want to manage the physical and virtual networks
separately.

It doesn't make sense to me to require a software aggregator object to
marshal the rules into a flat table only for a PMD to split them apart
again.

> I understand the issue is what happens when applications really want to
> define e.g. L2/L3/L2 rules in this specific order (or any ordering that
> cannot be satisfied by HW due to table constraints).
> 
> By exposing tables, in such a case applications should move all rules from
> L2 to a L3 table themselves (assuming this is even supported) to guarantee
> ordering between rules, or fail to add them. This is basically what the PMD
> could have done, possibly in a more efficient manner in my opinion.

I disagree with the more efficient comment :)

If the sof

[dpdk-dev] [RFC] Generic flow director/filtering/classification API

2016-08-09 Thread John Fastabend
[...]

>> I'm not sure I understand 'bit granularity' here. I would say we have
>> devices now that have rather strange restrictions due to hardware
>> implementation. Going forward we should get better hardware and a lot
>> of this will go away in my view. Yes this is a long term view and
>> doesn't help the current state. The overall point you are making is
>> that the sum of all these strange/odd bits in the hardware implementation
>> means capabilities queries are very difficult to guarantee, at least on
>> existing hardware, and I think you've convinced me. Thanks ;)
> 
> Precisely. By "bit granularity" I meant that while it is fairly easy to
> report whether bit-masking is supported on protocol fields such as MAC
> addresses at all, devices may have restrictions on the possible bit-masks,
> like they may only have an effect at byte level (0xff), may not allow
> specific bits (broadcast) or there even may be a fixed set of bit-masks to
> choose from.

Yep lots of strange hardware implementation voodoo here.

> 
> [...]
>>> I understand, however I think this approach may be too low-level to express
>>> all the possible combinations. This graph would have to include possible
>>> actions for each possible pattern, all while considering that some actions
>>> are not possible with some patterns and that there are exclusive actions.
>>>
>>
>> Really? You have hardware that has dependencies between the parser and
>> the supported actions? Ugh...
> 
> Not that I know of actually, even though we cannot rule out this
> possibility.
> 
> Here are the possible cases I have in mind with existing HW:
> 
> - Too many actions specified for a single rule, even though each of them is
>   otherwise supported.

Yep most hardware will have this restriction.

> 
> - Performing several encap/decap actions. None are defined in the initial
>   specification but these are already planned.
> 

Great this is certainly needed.

> - Assuming there is a single table from the application point of view
>   (separate discussion for the other thread), some actions may only be
>   possible with the right pattern item or meta item. Asking HW to perform
>   tunnel decap may only be safe if the pattern specifically matches that
>   protocol.
> 

Yep continue in other thread.

>> If the hardware has separate tables then we shouldn't try to have the
>> PMD flatten those into a single table because we will have no way of
>> knowing how to do that. (I'll respond to the other thread on this in
>> an attempt to not get to scattered).
> 
> OK, will reply there as well.
> 
>>> Also while memory consumption is not really an issue, such a graph may be
>>> huge. It could take a while for the PMD to update it when adding a rule
>>> impacting capabilities.
>>
>> Ugh... I wouldn't suggest updating the capabilities at runtime like
>> this. But I see your point: if the graph has to _guarantee_ correctness,
>> how does it represent a limited number of masks and other strange hw?
>> It's unfortunate the hardware isn't more regular.
>>
>> You have convinced me that guaranteed correctness via capabilities
>> is going to be difficult for many types of devices, although not all.
> 
> I'll just add that these capabilities also depend on side effects of
> configuration performed outside the scope of this API. The way queues are
> (re)initialized or offloads configured may affect them. RSS configuration is
> the most obvious example.
> 

OK.

[...]

>>
>> My concern is this non-determinism will create performance issues in
>> the network because when a flow may or may not be offloaded this can
>> have a rather significant impact on its performance. This can make
>> debugging network-wide performance miserable when at time X I get
>> performance X and then for whatever reason something degrades to
>> software and at time Y I get some performance Y << X. I suspect that
>> in general applications will bind tightly with hardware they know
>> works.
> 
> You are right, performance determinism is not taken into account at all, at
> least not yet. It should not be an issue at the beginning as long as the
> API has the ability to evolve later for applications that need it.
> 
> Just an idea, could some kind of meta pattern items specifying time
> constraints for a rule address this issue? Say, how long (cycles/ms) the PMD
> may take to query/apply/delete the rule. If it cannot be guaranteed, the
> rule cannot be created. Applications could maintain statistic counters about
> failed rules to determine if performance issues are caused by the inability
> to create them.

It seems a bit heavy to me to have each PMD driver implement
something like this. But it would be interesting to explore, probably
after the basic support is implemented.
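For what it's worth, such a timing meta item might look something like the following. This is purely exploratory: nothing like it exists in rte_flow, and the struct and helper names are invented.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical meta pattern item carrying timing constraints for rule
 * management, per the discussion above. A PMD would reject rule
 * creation when it cannot meet the stated budget. */
struct flow_item_meta_time {
	uint32_t max_apply_us;  /* budget for applying the rule */
	uint32_t max_query_us;  /* budget for querying the rule */
	uint32_t max_delete_us; /* budget for deleting the rule */
};

/* Returns 1 if the PMD's measured apply cost fits the budget. */
static int
meets_budget(const struct flow_item_meta_time *req, uint32_t pmd_apply_us)
{
	return pmd_apply_us <= req->max_apply_us;
}
```

An application could then count rules rejected on timing grounds, as suggested above, to distinguish capacity problems from performance problems.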

> 
> [...]
>>> For individual points:
>>>
>>> (i) should be doable with the query API without recompiling DPDK as well,
>>> the fact API/ABI breakage must be avoided being part of the requirements. If
>>> you think there is a problem regarding this, can you

[dpdk-dev] [RFC] Generic flow director/filtering/classification API

2016-08-03 Thread John Fastabend
[...]

>> The proposal looks very good.  It satisfies most of the features
>> supported by Chelsio NICs.  We are looking for suggestions on exposing
>> more additional features supported by Chelsio NICs via this API.
>>
>> Chelsio NICs have two regions in which filters can be placed -
>> Maskfull and Maskless regions.  As their names imply, maskfull region
>> can accept masks to match a range of values; whereas, maskless region
>> don't accept any masks and hence perform a more strict exact-matches.
>> Filters without masks can also be placed in maskfull region.  By
>> default, maskless region have higher priority over the maskfull region.
>> However, the priority between the two regions is configurable.
>
> I understand this configuration affects the entire device. Just to be 
> clear,
> assuming some filters are already configured, are they affected by a 
> change
> of region priority later?
>

 Both the regions exist at the same time in the device.  Each filter can
 either belong to maskfull or the maskless region.

 The priority is configured at time of filter creation for every
 individual filter and cannot be changed while the filter is still
 active. If priority needs to be changed for a particular filter then,
 it needs to be deleted first and re-created.
>>>
>>> Could you model this as two tables and add a table_id to the API? This
>>> way user space could populate the table it chooses. We would have to add
>>> some capabilities attributes to "learn" if tables support masks or not
>>> though.
>>>
>>
>> This approach sounds interesting.
> 
> Now I understand the idea behind these tables, however from an application
> point of view I still think it's better if the PMD could take care of flow
> rules optimizations automatically. Think about it, PMDs have exactly a
> single kind of device they know perfectly well to manage, while applications
> want the best possible performance out of any device in the most generic
> fashion.

The problem is that keeping priorities in order and/or possibly breaking
rules apart (e.g. you have an L2 table and an L3 table) becomes very
complex to manage at the driver level. I think it's easier for the
application, which has some context, to do this. An application that
"knows" it is, for example, a router will likely be able to pack rules
better than a PMD will.

> 
>>> I don't see how the PMD can sort this out in any meaningful way; it
>>> has to be exposed to the application, which has the intelligence to 'know'
>>> priorities between masked and non-masked filters. I'm sure you could come
>>> up with something, but it would be less than ideal in many cases I would
>>> guess, and we can't have the driver getting priorities wrong or we may
>>> not get the correct behavior.
> 
> It may be solved by having the PMD maintain a SW state to quickly know which
> rules are currently created and in what state the device is so basically the
> application doesn't have to perform this work.
> 
> This API allows applications to express basic needs such as "redirect
> packets matching this pattern to that queue". It must not deal with HW
> details and limitations in my opinion. If a request cannot be satisfied,
> then the rule cannot be created. No help from the application must be
> expected by PMDs, otherwise it opens the door to the same issues as the
> legacy filtering APIs.

This depends on the application and what/how it wants to manage the
device. If the application manages a pipeline with some set of tables,
then mapping this down to a single table, only for the PMD to unwind it
back into a multi-table topology, seems like a waste to me.

> 
> [...]
 Unfortunately, our maskfull region is extremely small too compared to
 maskless region.

>>>
>>> To me this means a userspace application would want to pack it
>>> carefully to get the full benefit. So you need some mechanism to specify
>>> the "region" hence the above table proposal.
>>>
>>
>> Right. Makes sense.
> 
> I do not agree, applications should not be aware of it. Note this case can
> be handled differently, so that rules do not have to be moved back and forth
> between both tables. If the first created rule requires a maskfull entry,
> then all subsequent rules will be entered into that table. Otherwise no
> maskfull entry can be created as long as there is one maskless entry. When
> either table is full, no more rules may be added. Would that work for you?
> 

It's not about mask vs no mask. The devices with multiple tables that I
have don't have this mask limitation. It's about how to optimally pack
the rules and who implements that logic. I think it's best done in the
application, where I have the context.

Is there a way to omit the table field if the PMD is expected to do
a best effort, and add the table field if the user wants explicit
control over table management? This would support both models. I at least
would like to have e
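One way to support both models at once might be a reserved table id meaning "let the PMD choose". This is only a sketch; the sentinel and helper names are invented, not part of any proposed API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: a sentinel table id lets the PMD do best-effort
 * placement, while an explicit id gives the application full control
 * over table management. */
#define FLOW_TABLE_ANY UINT8_MAX /* sentinel: PMD chooses the table */

/* Resolve the table a rule actually lands in. */
static uint8_t
resolve_table(uint8_t requested, uint8_t pmd_default)
{
	return requested == FLOW_TABLE_ANY ? pmd_default : requested;
}
```

Applications that manage their own pipeline would pass an explicit id; everyone else would pass the sentinel and keep today's single-table view.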

[dpdk-dev] [RFC] Generic flow director/filtering/classification API

2016-08-03 Thread John Fastabend
[...]

 Considering that allowed pattern/actions combinations cannot be known in
 advance and would result in an unpractically large number of capabilities 
 to
 expose, a method is provided to validate a given rule from the current
 device configuration state without actually adding it (akin to a "dry run"
 mode).
>>>
>>> Rather than have a query/validate process why did we jump over having an
>>> intermediate representation of the capabilities? Here you state it is
>>> unpractical but we know how to represent parse graphs and the drivers
>>> could report their supported parse graph via a single query to a middle
>>> layer.
>>>
>>> This will actually reduce the msg chatter imagine many applications at
>>> init time or in boundary cases where a large set of applications come
>>> online at once and start banging on the interface all at once seems less
>>> than ideal.
> 
> Well, I also thought about a kind of graph to represent capabilities but
> feared the extra complexity would not be worth the trouble, thus settled on
> the query idea. A couple more reasons:
> 
> - Capabilities evolve at the same time as devices are configured. For
>   example, if a device supports a single RSS context, then a single rule
>   with a RSS action may be created. The graph would have to be rewritten
>   accordingly and thus queried/parsed again by the application.

The graph would not help here because this is an action
restriction, not a parsing restriction. This is yet another query, to see
what actions are supported and how many of each action are supported.

   get_parse_graph - report the parsable fields
   get_actions - report the supported actions and possible num of each
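A get_actions-style query could report entries like the following. The struct, field names, and the checking helper are all hypothetical, sketched only to show the shape of the idea:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical capability record: one entry per supported action type,
 * with the maximum number of instances a single rule may carry. */
struct action_cap {
	uint32_t action_type;  /* e.g. an RTE_FLOW-style action id */
	uint32_t max_per_rule; /* how many of this action one rule allows */
};

/* Check a requested action count against the reported capability table.
 * Returns 1 if supported, 0 otherwise. */
static int
action_supported(const struct action_cap *caps, unsigned int n_caps,
		 uint32_t type, uint32_t count)
{
	for (unsigned int i = 0; i < n_caps; i++)
		if (caps[i].action_type == type)
			return count <= caps[i].max_per_rule;
	return 0;
}
```

A middle layer holding this table could reject over-long action lists before they ever reach the driver.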

> 
> - Expressing capabilities at bit granularity (say, for a matching pattern
>   item mask) is complex, there is no way to simplify the representation of
>   capabilities without either losing information or making the graph more
>   complex to parse than simply providing a flow rule from an application
>   point of view.
> 

I'm not sure I understand 'bit granularity' here. I would say we have
devices now that have rather strange restrictions due to hardware
implementation. Going forward we should get better hardware and a lot
of this will go away in my view. Yes this is a long term view and
doesn't help the current state. The overall point you are making is
that the sum of all these strange/odd bits in the hardware implementation
means capabilities queries are very difficult to guarantee, at least on
existing hardware, and I think you've convinced me. Thanks ;)

> With that in mind, I am not opposed to the idea, both methods could even
> coexist, with the query function eventually evolving to become a front-end
> to a capability graph. Just remember that I am only defining the
> fundamentals for the initial implementation, i.e. how rules are expressed as
> patterns/actions and the basic functions to manage them, ideally without
> having to redefine them ever.
> 

Agreed they should be able to coexist. So I can get my capabilities
queries as a layer on top of the API here.

>> A bit more details on possible interface for capabilities query,
>>
>> One way I've used to describe these graphs from driver to software
>> stacks is to use a set of structures to build the graph. For fixed
>> graphs this could just be a *.h file; for programmable hardware
>> (typically updated via a fw update on NICs) the driver can read the
>> parser details out of firmware and render the structures.
> 
> I understand, however I think this approach may be too low-level to express
> all the possible combinations. This graph would have to include possible
> actions for each possible pattern, all while considering that some actions
> are not possible with some patterns and that there are exclusive actions.
> 

Really? You have hardware that has dependencies between the parser and
the supported actions? Ugh...

If the hardware has separate tables then we shouldn't try to have the
PMD flatten those into a single table because we will have no way of
knowing how to do that. (I'll respond to the other thread on this in
an attempt to not get to scattered).

> Also while memory consumption is not really an issue, such a graph may be
> huge. It could take a while for the PMD to update it when adding a rule
> impacting capabilities.

Ugh... I wouldn't suggest updating the capabilities at runtime like
this. But I see your point: if the graph has to _guarantee_ correctness,
how does it represent a limited number of masks and other strange hw?
It's unfortunate the hardware isn't more regular.

You have convinced me that guaranteed correctness via capabilities
is going to be difficult for many types of devices, although not all.

[...]

>>
>> The cost doing all this is some additional overhead at init time. But
>> building generic function over this and having a set of predefined
>> uids for well-known protocols such ip, udp, tcp, etc helps. What you
>> get for the cost is a few things that I 

[dpdk-dev] [RFC] Generic flow director/filtering/classification API

2016-08-02 Thread John Fastabend
On 16-07-23 02:10 PM, John Fastabend wrote:
> On 16-07-21 12:20 PM, Adrien Mazarguil wrote:
>> Hi Jerin,
>>
>> Sorry, looks like I missed your reply. Please see below.
>>
> 
> Hi Adrien,
> 
> Sorry for a bit of a delay, but here are a few comments that may be worth considering.
> 
> To start with, I completely agree on the general problem statement and the
> nice summary of all the current models. Also a good start on this.
> 
>>
>> Considering that allowed pattern/actions combinations cannot be known in
>> advance and would result in an unpractically large number of capabilities to
>> expose, a method is provided to validate a given rule from the current
>> device configuration state without actually adding it (akin to a "dry run"
>> mode).
> 
> Rather than have a query/validate process why did we jump over having an
> intermediate representation of the capabilities? Here you state it is
> unpractical but we know how to represent parse graphs and the drivers
> could report their supported parse graph via a single query to a middle
> layer.
> 
> This will actually reduce the msg chatter imagine many applications at
> init time or in boundary cases where a large set of applications come
> online at once and start banging on the interface all at once seems less
> than ideal.
> 

A bit more details on possible interface for capabilities query,

One way I've used to describe these graphs from driver to software
stacks is to use a set of structures to build the graph. For fixed
graphs this could just be a *.h file; for programmable hardware
(typically updated via a fw update on NICs) the driver can read the
parser details out of firmware and render the structures.

I've done this two ways: one is to define all the fields in their
own structures using something like,

struct field {
	char *name;
	u32 uid;
	u32 bitwidth;
};

This gives a unique id (uid) for each field along with its
width and a user friendly name. The fields are organized into
headers via a header structure,

struct header_node {
	char *name;
	u32 uid;
	u32 *fields;
	struct parse_graph *jump;
};

Each node has a unique id and a list of fields, where 'fields' is a
list of field uids. It's also easy enough to embed the field struct in
the header_node if that is simpler; it's really a style question.

The 'struct parse_graph' gives the list of edges from this header node
to other header nodes, using a parse graph structure defined as:

struct parse_graph {
	struct field_reference ref;
	__u32 jump_uid;
};

Again as a matter of style you can embed the parse graph in the header
node as I did above or do it as its own object.

The field_reference below gives the id of the field and the value to
match, e.g. for the tuple (ipv4.protocol, 6), jump_uid would be the uid of TCP.

struct field_reference {
	__u32 header_uid;
	__u32 field_uid;
	__u32 mask_type;
	__u32 type;
	__u8  *value;
	__u8  *mask;
};
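As a concrete (and purely illustrative) use of these structures, here is a self-contained sketch of the (ipv4.protocol, 6) -> TCP edge described above, with uint32_t standing in for the kernel-style u32/__u32 and all uids invented for the example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative uids: 1 = ipv4 header node, 2 = tcp header node,
 * 10 = the ipv4.protocol field. A simplified field_reference (no masks)
 * keeps the example short. */
struct field_reference {
	uint32_t header_uid;
	uint32_t field_uid;
	const uint8_t *value;
};

struct parse_edge {
	struct field_reference ref;
	uint32_t jump_uid; /* header node reached when ref matches */
};

static const uint8_t IPPROTO_TCP_VAL = 6;

/* the (ipv4.protocol, 6) -> tcp edge from the example in the text */
static const struct parse_edge ipv4_to_tcp = {
	.ref = { .header_uid = 1, .field_uid = 10,
		 .value = &IPPROTO_TCP_VAL },
	.jump_uid = 2,
};

/* Resolve the next header uid for an observed field value; 0 = no edge. */
static uint32_t
next_header(const struct parse_edge *edges, size_t n,
	    uint32_t field_uid, uint8_t value)
{
	for (size_t i = 0; i < n; i++)
		if (edges[i].ref.field_uid == field_uid &&
		    *edges[i].ref.value == value)
			return edges[i].jump_uid;
	return 0;
}
```

A full graph would simply be an array of such edges per header node, which is what a driver would render from firmware.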

The cost of doing all this is some additional overhead at init time. But
building generic functions over this, and having a set of predefined
uids for well-known protocols such as ip, udp, tcp, etc., helps. What you
get for the cost is a few things that I think are worth it: (i) now
new protocols can be added/removed without recompiling DPDK; (ii) a
software package can use the capability query to verify the required
protocols are off-loadable vs a possibly large set of test queries;
(iii) when we do the programming of the device we can provide a tuple
(table-uid, header-uid, field-uid, value, mask, priority) and the
middle layer, "knowing" the above graph, can verify the command so
drivers only ever see "good" commands; and (iv) finally, it should be
faster in terms of cmds per second because the drivers can map the
tuple (table, header, field, priority) to a slot efficiently vs
parsing.

IMO points (iii) and (iv) will in practice make the code much simpler,
because we can maintain a common middle layer and not require parsing
by drivers, making each driver simpler by abstracting the work into a
common layer.

> Worse, in my opinion, it requires all drivers to write mostly duplicated
> validation code where a common layer could easily do this if every
> driver reported a common data structure representing its parse graph
> instead. The nice fallout of this initial effort upfront is that the driver
> no longer needs to do error handling/checking/etc. and can assume all
> rules are correct and valid. It makes driver code much simpler to
> support. And IMO at least by doing this we get some other nice benefits
> described below.
> 
> Another related question is about performance.
> 
>> Creation
>> 
>>
>> Creating a flow rule is similar to validating one, except the ru

[dpdk-dev] [RFC] Generic flow director/filtering/classification API

2016-07-25 Thread John Fastabend
On 16-07-25 04:32 AM, Rahul Lakkireddy wrote:
> Hi Adrien,
> 
> On Thursday, July 07/21/16, 2016 at 19:07:38 +0200, Adrien Mazarguil wrote:
>> Hi Rahul,
>>
>> Please see below.
>>
>> On Thu, Jul 21, 2016 at 01:43:37PM +0530, Rahul Lakkireddy wrote:
>>> Hi Adrien,
>>>
>>> The proposal looks very good.  It satisfies most of the features
>>> supported by Chelsio NICs.  We are looking for suggestions on exposing
>>> more additional features supported by Chelsio NICs via this API.
>>>
>>> Chelsio NICs have two regions in which filters can be placed -
>>> Maskfull and Maskless regions.  As their names imply, maskfull region
>>> can accept masks to match a range of values; whereas, maskless region
>>> don't accept any masks and hence perform a more strict exact-matches.
>>> Filters without masks can also be placed in maskfull region.  By
>>> default, maskless region have higher priority over the maskfull region.
>>> However, the priority between the two regions is configurable.
>>
>> I understand this configuration affects the entire device. Just to be clear,
>> assuming some filters are already configured, are they affected by a change
>> of region priority later?
>>
> 
> Both the regions exist at the same time in the device.  Each filter can
> either belong to maskfull or the maskless region.
> 
> The priority is configured at time of filter creation for every
> individual filter and cannot be changed while the filter is still
> active. If priority needs to be changed for a particular filter then,
> it needs to be deleted first and re-created.

Could you model this as two tables and add a table_id to the API? This
way user space could populate the table it chooses. We would have to add
some capabilities attributes to "learn" if tables support masks or not
though.

I don't see how the PMD can sort this out in any meaningful way; it
has to be exposed to the application, which has the intelligence to 'know'
priorities between masked and non-masked filters. I'm sure you could come
up with something, but it would be less than ideal in many cases I would
guess, and we can't have the driver getting priorities wrong or we may
not get the correct behavior.

> 
>>> Please suggest on how we can let the apps configure in which region
>>> filters must be placed and set the corresponding priority accordingly
>>> via this API.
>>
>> Okay. Applications, like customers, are always right.
>>
>> With this in mind, PMDs are not allowed to break existing flow rules, and
>> face two options when applications provide a flow specification that would
>> break an existing rule:
>>
>> - Refuse to create it (easiest).
>>
>> - Find a workaround to apply it anyway (possibly quite complicated).
>>
>> The reason you have these two regions is performance right? Otherwise I'd
>> just say put everything in the maskfull region.
>>
> 
> Unfortunately, our maskfull region is extremely small too compared to
> maskless region.
> 

To me this means a userspace application would want to pack it
carefully to get the full benefit. So you need some mechanism to specify
the "region" hence the above table proposal.

>> PMDs are allowed to rearrange existing rules or change device parameters as
>> long as existing constraints are satisfied. In my opinion it does not matter
>> which region has the highest default priority. Applications always want the
>> best performance so the first created rule should be in the most appropriate
>> region.
>>
>> If a subsequent rule requires it to be in the other region but the
>> application specified the wrong priority for this to work, then the PMD can
>> either choose to swap region priorities on the device (assuming it does not
>> affect other rules), or destroy and recreate the original rule in a way that
>> satisfies all constraints (i.e. moving conflicting rules from the maskless
>> region to the maskfull one).
>>
>> Going further, when subsequent rules get destroyed the PMD should ideally
>> move back maskfull rules back into the maskless region for better
>> performance.
>>
>> This is only a suggestion. PMDs have the right to say "no" at some point.
>>
> 
> Filter search and deletion are expensive operations and they need to be
> atomic in order to not affect existing traffic already hitting the
> filters.  Also, Chelsio hardware can support up to ~500,000 maskless
> region entries.  So, the cost of search becomes too high for the PMD
> when there are large number of filter entries present.
> 
> In my opinion, rather than the PMD deciding on priority by itself, it would
> be simpler if the apps have the flexibility to configure the
> priority between the regions themselves?
> 

Agreed, I think this will be a common problem, especially as the hardware
tables get bigger and support masks, where priority becomes important to
distinguish between multiple matching rules.

+1 for having a priority field.

>> More important in my opinion is to make sure applications can create a given
>> set of flow rules in any order. If ru

[dpdk-dev] [RFC] Generic flow director/filtering/classification API

2016-07-23 Thread John Fastabend
On 16-07-21 12:20 PM, Adrien Mazarguil wrote:
> Hi Jerin,
> 
> Sorry, looks like I missed your reply. Please see below.
> 

Hi Adrien,

Sorry for a bit of a delay, but here are a few comments that may be worth considering.

To start with, I completely agree on the general problem statement and the
nice summary of all the current models. Also a good start on this.

> 
> Considering that allowed pattern/actions combinations cannot be known in
> advance and would result in an unpractically large number of capabilities to
> expose, a method is provided to validate a given rule from the current
> device configuration state without actually adding it (akin to a "dry run"
> mode).

Rather than have a query/validate process, why did we jump over having an
intermediate representation of the capabilities? Here you state it is
impractical, but we know how to represent parse graphs, and the drivers
could report their supported parse graph via a single query to a middle
layer.

This will actually reduce the msg chatter. Imagine many applications at
init time, or boundary cases where a large set of applications come
online at once and start banging on the interface all at once; that
seems less than ideal.

Worse, in my opinion, it requires all drivers to write mostly duplicated
validation code where a common layer could easily do this if every
driver reported a common data structure representing its parse graph
instead. The nice fallout of this initial effort upfront is that the driver
no longer needs to do error handling/checking/etc. and can assume all
rules are correct and valid. It makes driver code much simpler to
support. And IMO at least by doing this we get some other nice benefits
described below.

Another related question is about performance.

> Creation
> 
> 
> Creating a flow rule is similar to validating one, except the rule is
> actually created.
> 
> ::
> 
>  struct rte_flow *
>  rte_flow_create(uint8_t port_id,
>  const struct rte_flow_pattern *pattern,
>  const struct rte_flow_actions *actions);

I gather this implies that each driver must parse the pattern/action
block and map it onto the hardware. How many rules per second can this
support? I've run into systems that expect a level of service somewhere
around 50k cmds per second. So bulking will help at the message level,
but it seems like a lot of overhead to unpack the pattern/action section.

One strategy I've used in other systems that worked relatively well
is this: if the query for the parse graph above returns a key for each node
in the graph, then a single lookup can map the key to a node. It's
unambiguous, and these operations simply become a table lookup.
So to be a bit more concrete, this changes the pattern structure in
rte_flow_create() into a <key, value> tuple where the key is known
from the initial parse graph query. If you reserve a set of well-defined
key values for well-known protocols like ethernet, ip, etc., then the
query model also works, but the middle layer catches errors in this case
and again the driver only gets known-good flows. So something like this,

  struct rte_flow_pattern {
	uint32_t priority;
	uint32_t key;
	uint32_t value_length;
	u8 *value;
  };

Also, if we have multiple tables, what do you think about adding a
table_id to the signature? Probably not needed in the first generation,
but it is likely useful for hardware with multiple tables, so that it
would be,

   rte_flow_create(uint8_t port_id, uint8_t table_id, ...);
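Putting the <key, value> pattern and the table_id together, a middle-layer validation step might look like the following sketch. The struct and helper are invented for illustration and are not the actual rte_flow API; the "well-known" key ids are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key-based pattern item, per the sketch above. Keys are
 * node ids from the parse-graph query (say 1 = ethernet, 2 = ipv4,
 * reserved by the middle layer). */
struct flow_pattern_kv {
	uint32_t priority;
	uint32_t key;          /* node id from the parse-graph query */
	uint32_t value_length;
	const uint8_t *value;
};

/* The middle layer validates keys against the reported graph, so the
 * driver only ever sees known-good flows. Returns 0 if every item's
 * key is known, -1 otherwise. */
static int
validate_pattern(const uint32_t *known_keys, size_t n_keys,
		 const struct flow_pattern_kv *items, size_t n_items)
{
	for (size_t i = 0; i < n_items; i++) {
		int found = 0;

		for (size_t k = 0; k < n_keys; k++)
			if (known_keys[k] == items[i].key)
				found = 1;
		if (!found)
			return -1;
	}
	return 0;
}
```

Because every item is just a (key, value) pair, the driver's own work reduces to the table lookup described above rather than full pattern parsing.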

Finally, one other problem we've had, which would be great to address
if we are doing a rewrite of the API, is adding new protocols to
already-deployed DPDK stacks. This is mostly a Linux distribution
problem, where you can't easily update DPDK.

In the prototype header linked in this document it seems adding new
headers requires adding a new enum in rte_flow_item_type, but there
is at least an attempt at a catch-all here,

>   /**
>* Matches a string of a given length at a given offset (in bytes),
>* or anywhere in the payload of the current protocol layer
>* (including L2 header if used as the first item in the stack).
>*
>* See struct rte_flow_item_raw.
>*/
>   RTE_FLOW_ITEM_TYPE_RAW,

Actually this is a nice implementation because it works after the
previous item in the stack, correct? So you can put it after "known"
variable-length headers like IP. The limitation is it can't get past
undefined variable-length headers. However, if you use the above
parse-graph reporting mechanism and the driver always reports its
largest supported graph, then we don't have this issue where new
hardware sku/ucode/etc. adds support for new headers but we have no
way to deploy it to existing software users without recompiling and
redeploying.

I looked at the git repo but I only saw the header definition; I guess
the implementation is TBD once there is enough agreement on the
interface?

Thanks,
John