We certainly should have this discussion; however, I believe what VPP really
wants is the ability to associate a managed prefix with packets, since that
is how the VLIB buffer is currently being used.

ODP packets have the notion of a configurable headroom area that precedes
and is contiguous with the initial packet segment. The
odp_packet_push_head() API is used to extend packets into this area, while
odp_packet_head() provides the address of the start of the headroom area
that is prefixed to the packet. Thus, per the ODP API spec:
odp_packet_head() + odp_packet_headroom() == odp_packet_data().
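
A minimal sketch of that invariant, assuming a packet freshly allocated from
a pool configured with sufficient headroom (prepend_prefix() below is a
hypothetical helper, not an ODP API):

#include <odp_api.h>
#include <assert.h>
#include <stdint.h>

/* Prepend 'len' bytes to 'pkt' out of its headroom and check the headroom
 * arithmetic. Returns the new start of packet data, or NULL if the
 * remaining headroom is too small. */
static void *prepend_prefix(odp_packet_t pkt, uint32_t len)
{
    if (len > odp_packet_headroom(pkt))
        return NULL; /* not enough headroom left */

    void *new_data = odp_packet_push_head(pkt, len);

    /* The spec relationship still holds after the push:
     * head + headroom == data */
    assert((uint8_t *)odp_packet_head(pkt) + odp_packet_headroom(pkt) ==
           (uint8_t *)odp_packet_data(pkt));

    return new_data;
}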

It's conceivable this relationship could be used by VPP; however, at present
ODP provides no guarantee that unallocated headroom is preserved across any
ODP packet operation. The user area, by contrast, is preserved, and is
copied as needed when packets are reallocated as part of more complex
packet operations. So we could conceivably formalize the notion of a
managed prefix that would provide the sort of address arithmetic that
VPP wants.
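
To make the idea concrete, here is a purely hypothetical sketch of what such
a managed prefix could look like from the application side. app_meta_t and
POOL_HEADROOM are illustrative assumptions, not existing ODP definitions,
and the reverse mapping is only safe if ODP were to guarantee that the
prefix is preserved across packet operations:

#include <odp_api.h>
#include <stdint.h>

/* Hypothetical application metadata kept at the start of the headroom,
 * i.e. at odp_packet_head(). Not part of any ODP API. */
typedef struct {
    uint64_t opaque[6]; /* e.g. the hot part of a vlib_buffer */
} app_meta_t;

/* Assumed fixed headroom configured at pool creation time. */
#define POOL_HEADROOM 128

static app_meta_t *meta_of(odp_packet_t pkt)
{
    /* Valid today: odp_packet_head() is the start of the headroom. */
    return (app_meta_t *)odp_packet_head(pkt);
}

static void *data_from_meta(app_meta_t *meta)
{
    /* The constant arithmetic VPP wants, relying on the spec relation
     * head + headroom == data and assuming the default headroom has not
     * been consumed; a managed-prefix API would have to guarantee that
     * this survives all packet operations. */
    return (uint8_t *)meta + POOL_HEADROOM;
}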

However, we really need to better understand the testing environment and
the measurements that are being done here before rushing to introduce new
APIs in the hope that they will provide performance benefits. In the data
cited above, it seems this particular test configuration is unable to
achieve 10Gb line rate under the best of circumstances, which means
something more fundamental is going on that needs to be understood first.



On Sat, Dec 9, 2017 at 6:53 AM, Francois Ozog <francois.o...@linaro.org>
wrote:

> I'd like to have SoC vendors' feedback on the discussion and on the
> following:
>
> The ODP user area is used by VPP to hold vlib_buffer information. VPP is
> just one of the applications that will use this simple mechanism, and that
> is certainly a design pattern.
>
> A user area can exist when an odp packet is in the host, that is, it has
> been received on a special queue.
> The odp_packet_t must be from a pool whose user_area size is big enough to
> hold such metadata.
>
> Today we have:
> void *odp_packet_user_area(odp_packet_t pkt);
> uint32_t odp_packet_user_area_size(odp_packet_t pkt);
>
> So the question is, can we have:
> odp_packet_t odp_packet_from_user_area(void *user_area);
>
>
>
> 1) I wonder, though, whether all user area calls shouldn't be associated
> with the mempool that was configured with the desired user_area size.
> 2) implementation assessment:
>
>    - Cavium and NXP have formed odp_packet_t as a physical address plus
>    additional bits, so their implementation can be optimized fairly well
>    by calculation if the user_area is allocated contiguously with the
>    packet buffer
>    - linux generic can have its own way of dealing with it.
>    - odp_dpdk can implement it the way Michal mentioned
>    - ODP with DDF can use calculation of odp_buffer_t the same way DPDK
>    does
>
>
> I strongly think that the proposed API is a must in the design pattern
> related to user_areas. That is, it is NOT specific to VPP.
>
> There cannot be a single way to implement it, due to the nature of ODP;
> every implementation has to deal with it in its own way.
>
> I guess it is not possible under some implementations to do it by
> calculation. In that case the new API imposes an implementation change that
> may be based on a hash table (because the user area may not be contiguous
> with the buffer). That looks bad, but at least it is a generic
> implementation. Alternatively, it may be said that this function can return
> NULL on some platforms. That would just prevent some applications from
> running on those platforms, which is also a design choice that would push
> the upper layers to add the information (odp_packet_t in our case) to their
> metadata (vlib_buffer in the case of VPP) to be compatible.
>
> FF
>
> On 8 December 2017 at 22:55, Bill Fischofer <bill.fischo...@linaro.org>
> wrote:
>
>> On Fri, Dec 8, 2017 at 2:49 PM, Honnappa Nagarahalli <
>> honnappa.nagaraha...@linaro.org> wrote:
>>
>> > On 8 December 2017 at 13:40, Bill Fischofer <bill.fischo...@linaro.org>
>> > wrote:
>> > >
>> > >
>> > > On Fri, Dec 8, 2017 at 1:06 PM, Honnappa Nagarahalli
>> > > <honnappa.nagaraha...@linaro.org> wrote:
>> > >>
>> > >> On 7 December 2017 at 22:36, Bill Fischofer <
>> bill.fischo...@linaro.org>
>> > >> wrote:
>> > >> >
>> > >> >
>> > >> > On Thu, Dec 7, 2017 at 10:12 PM, Honnappa Nagarahalli
>> > >> > <honnappa.nagaraha...@linaro.org> wrote:
>> > >> >>
>> > >> >> On 7 December 2017 at 17:36, Bill Fischofer <
>> > bill.fischo...@linaro.org>
>> > >> >> wrote:
>> > >> >> >
>> > >> >> >
>> > >> >> > On Thu, Dec 7, 2017 at 3:17 PM, Honnappa Nagarahalli
>> > >> >> > <honnappa.nagaraha...@linaro.org> wrote:
>> > >> >> >>
>> > >> >> >> This experiment clearly shows the need for providing an API in
>> > ODP.
>> > >> >> >>
>> > >> >> >> On ODP2.0 implementations such an API will be simple enough
>> > >> >> >> (constant
>> > >> >> >> subtraction), requiring no additional storage in VLIB.
>> > >> >> >>
>> > >> >> >> Michal, can you send a PR to ODP for the API so that we can
>> debate
>> > >> >> >> the
>> > >> >> >> feasibility of the API for Cavium/NXP platforms.
>> > >> >> >
>> > >> >> >
>> > >> >> > That's the point. An API that is tailored to a specific
>> > >> >> > implementation
>> > >> >> > or
>> > >> >> > application is not what ODP is about.
>> > >> >> >
>> > >> >> How are the requirements coming to ODP APIs currently? My
>> > >> >> understanding is that they come from OFP and Petri's requirements.
>> > >> >> Similarly, VPP is also an application of ODP. Recently, the Arm
>> > >> >> community (Arm and partners) prioritized the open source projects
>> > >> >> of importance and came up with a top 50 (or 100) list. If I
>> > >> >> remember correctly, VPP is among the top single digits (I am
>> > >> >> trying to get the exact details). So, it is an application of
>> > >> >> significant interest.
>> > >> >
>> > >> >
>> > >> > VPP is important, but what's important is for VPP to perform
>> > >> > significantly
>> > >> > better on at least one ODP implementation than it does today using
>> > DPDK.
>> > >> > If
>> > >> > we can't demonstrate that then there's no point to the ODP4VPP
>> > project.
>> > >> > That's not going to happen on x86 since we can assume that
>> VPP/DPDK is
>> > >> > optimal here since VPP has been tuned to DPDK internals. So we
>> need to
>> > >> > focus
>> > >> > the performance work on Arm SoC platforms that offer significant HW
>> > >> > acceleration capabilities that VPP can exploit via ODP4VPP.
>> > >>
>> > >> VPP can exploit these capabilities through DPDK as well (maybe a few
>> > >> APIs are missing, but they will be available soon), as Cavium/NXP
>> > >> platforms support DPDK.
>> > >
>> > >
>> > > If the goal is to be "just as good" as DPDK then we fail because a VPP
>> > > application doesn't see or care whether DPDK or ODP is running
>> underneath
>> > > it. The requirement is for VPP applications to run significantly (2x,
>> 4x,
>> > > etc.) better using ODP4VPP than DPDK. That won't come by fine-tuning
>> > what's
>> > > fundamentally the same code, but rather by eliminating entire
>> processing
>> > > steps by, for example, exploiting inline IPsec acceleration.
>> > >
>> > The point I am trying to make is that VPP can exploit inline IPsec
>> > acceleration through DPDK as well (if those APIs are not available in
>> > DPDK, they will be available soon). So, what use cases will we look at
>> > at that point? We need to be looking at all the use cases, and we need
>> > to be better at all of them.
>> >
>>
>> We know that DPDK is "chasing" ODP in this and other areas. My assumption,
>> however, is that Intel will never allow DPDK on Arm to be better than DPDK
>> on x86, for obvious marketing reasons. That's why we need to "push the
>> envelope" with ODP on Arm since we're under no such constraints.
>>
>>
>> >
>> > >>
>> > >>
>> > >> This API is the basic API that is required for any use case. I do not
>> > >> understand why this API is not required for IPsec acceleration in NXP.
>> > >> If we store odp_packet_t in the VLIB buffer, it will affect IPsec
>> > >> performance on the NXP platform as well.
>> > >
>> > >
>> > > That sounds like an assumption rather than a measurement. We should
>> let
>> > > Nikhil weigh in here about the key drivers to achieving best
>> performance
>> > on
>> > > NXP platforms.
>> > >
>> > Well, not exactly an assumption; based on working on similar
>> > optimizations, we have all done enough cache-line-related
>> > optimizations in Linux-Generic by now. We can do it again; Sachin has
>> > the code as well.
>> >
>>
>> That would be good to know.
>>
>>
>> >
>> > >>
>> > >>
>> > >> > This isn't one
>> > >> > of those. The claim is that with or without this change ODP4VPP on
>> x86
>> > >> > performs worse than VPP/DPDK on x86.
>> > >>
>> > >> That does not mean we do not work on increasing the performance of
>> > >> ODP4VPP on x86. This API will help us catch up on performance.
>> > >
>> > >
>> > > My point is that the best you can hope for on x86 is to be no
>> different
>> > than
>> > > DPDK. That's a fail, so such tuning isn't germane to ODP4VPP's
>> success.
>> > We
>> > > need to be focusing on exploiting HW acceleration paths, not trying to
>> > equal
>> > > DPDK SW paths. That's how ODP4VPP will show value.
>> >
>> > Not all Arm platforms have accelerators. There are server CPUs which
>> > are just CPU cores without any accelerators. On such platforms, it is
>> > only software, and every optimization becomes important. These
>> > optimizations are important for platforms with accelerators as well;
>> > why waste cycles?
>> >
>> > odp-dpdk is being used temporarily. Once ODP2.0 is available, the same
>> > solution applies for 2.0 as well.
>> >
>> > So, we need to take out DPDK and x86 from this discussion and have ODP
>> > and Arm to keep the discussion meaningful.
>> >
>>
>> Agreed, but we're not talking about ODP applications here. We're talking
>> about a specific DPDK application (VPP) that we're trying to port to ODP
>> with minimal change. That may or may not be feasible as a pure SW
>> exercise,
>> which is why we need to focus on the accelerated paths rather than trying
>> to turn ODP into a DPDK wannabe.
>>
>> The case for optimal ODP SW implementations is to promote ODP
>> applications,
>> which is not the case here since VPP is not an ODP application. It's
>> actually a DPDK application even though they claim not to be. If we were
>> to
>> turn it into an ODP application it would likely fare better, but that's
>> beyond the scope of this project.
>>
>>
>> >
>> > >
>> > >>
>> > >>
>> > >> >
>> > >> > Since VPP applications don't change if ODP4VPP is in the picture or
>> > not,
>> > >> > it
>> > >> > doesn't matter whether it's used on x86, so tuning ODP4VPP on x86
>> is
>> > at
>> > >> > best
>> > >> > of secondary importance. We just need at least one Arm platform on
>> > which
>> > >> > VPP
>> > >> > applications run dramatically better than without it.
>> > >>
>> > >> This is not tuning only for the x86 platform; it is tuning that
>> > >> would apply to any platform.
>> > >
>> > >
>> > > No, it isn't, because you don't know that other platforms have the same
>> > > brittle cache characteristics as this particular measured x86 case.
>> And
>> > even
>> > > here I don't think we've really explored what cache tuning is
>> available
>> > to
>> > > better exploit prefetching.
>> >
>> > One of the defined goals of ODP is to perform well on x86 as well.
>> > There are multiple ways to solve the problem, and this is one way that
>> > is showing better performance results. And the implementation that is
>> > being used in this test applies to ODP 2.0. This test is also proving
>> > that the implementation can do a better job at getting the handle to
>> > the packet.
>> >
>>
>> Again, the purpose of good ODP performance on x86 is to support ODP
>> applications. A well-written ODP application should perform as well on x86
>> as it would were it written as a DPDK application and dramatically better
>> on Arm platforms that have HW acceleration capabilities that have no x86
>> equivalents. That's the whole justification for the ODP project.
>>
>>
>> >
>> > >
>> > >>
>> > >>
>> > >> >
>> > >> >>
>> > >> >>
>> > >> >> >>
>> > >> >> >>
>> > >> >> >> On 7 December 2017 at 14:08, Bill Fischofer
>> > >> >> >> <bill.fischo...@linaro.org>
>> > >> >> >> wrote:
>> > >> >> >> > On Thu, Dec 7, 2017 at 12:22 PM, Michal Mazur
>> > >> >> >> > <michal.ma...@linaro.org>
>> > >> >> >> > wrote:
>> > >> >> >> >
>> > >> >> >> >> The native VPP+DPDK plugin knows the size of the rte_mbuf
>> > >> >> >> >> header and subtracts it from the vlib pointer:
>> > >> >> >> >>
>> > >> >> >> >> struct rte_mbuf *mb0 = rte_mbuf_from_vlib_buffer (b0);
>> > >> >> >> >> #define rte_mbuf_from_vlib_buffer(x) (((struct rte_mbuf *)x) - 1)
>> > >> >> >> >>
>> > >> >> >> >
>> > >> >> >> > No surprise that VPP is a DPDK application, but I thought
>> they
>> > >> >> >> > wanted
>> > >> >> >> > to
>> > >> >> >> > be
>> > >> >> >> > independent of DPDK. The problem is that ODP is never going
>> to
>> > >> >> >> > match
>> > >> >> >> > DPDK
>> > >> >> >> > at an ABI level on x86 so we can't be fixated on x86
>> performance
>> > >> >> >> > comparisons between ODP4VPP and VPP/DPDK.
>> > >> >> >> Any reason why we will not be able to match or exceed the
>> > >> >> >> performance?
>> > >> >> >
>> > >> >> >
>> > >> >> > It's not that ODP can't have good performance on x86; it's that
>> > >> >> > DPDK
>> > >> >> > encourages apps to be very dependent on DPDK implementation
>> details
>> > >> >> > such
>> > >> >> > as
>> > >> >> > seen here. ODP is not going to match DPDK internals so
>> applications
>> > >> >> > that
>> > >> >> > exploit such internals will always see a difference.
>> > >> >> >
>> > >> >> >>
>> > >> >> >>
>> > >> >> >> What we need to do is compare
>> > >> >> >> > ODP4VPP on Arm-based SoCs vs. "native VPP" that can't take
>> > >> >> >> > advantage
>> > >> >> >> > of
>> > >> >> >> > the
>> > >> >> >> > HW acceleration present on those platforms. That's how we
>> get to
>> > >> >> >> > show
>> > >> >> >> > dramatic differences. If ODP4VPP is only within a few percent
>> > >> >> >> > (plus
>> > >> >> >> > or
>> > >> >> >> > minus) of VPP/DPDK there's no point of doing the project at
>> all.
>> > >> >> >> >
>> > >> >> >> > So my advice would be to stash the handle in the VLIB buffer
>> for
>> > >> >> >> > now
>> > >> >> >> > and
>> > >> >> >> > focus on exploiting the native IPsec acceleration
>> capabilities
>> > >> >> >> > that
>> > >> >> >> > ODP
>> > >> >> >> > will permit.
>> > >> >> >> >
>> > >> >> >> >
>> > >> >> >> >> On 7 December 2017 at 19:02, Bill Fischofer
>> > >> >> >> >> <bill.fischo...@linaro.org>
>> > >> >> >> >> wrote:
>> > >> >> >> >>
>> > >> >> >> >>> Ping to others on the mailing list for opinions on this.
>> What
>> > >> >> >> >>> does
>> > >> >> >> >>> "native" VPP+DPDK get and how is this problem solved there?
>> > >> >> >> >>>
>> > >> >> >> >>> On Thu, Dec 7, 2017 at 11:55 AM, Michal Mazur
>> > >> >> >> >>> <michal.ma...@linaro.org>
>> > >> >> >> >>> wrote:
>> > >> >> >> >>>
>> > >> >> >> >>>> The _odp_packet_inline is common for all packets and takes
>> > >> >> >> >>>> up to two cachelines (it contains only offsets). Reading
>> > >> >> >> >>>> the pointer for each packet from VLIB would require
>> > >> >> >> >>>> fetching 10 million cachelines per second.
>> > >> >> >> >>>> Using prefetches does not help.
>> > >> >> >> >>>>
>> > >> >> >> >>>> On 7 December 2017 at 18:37, Bill Fischofer
>> > >> >> >> >>>> <bill.fischo...@linaro.org>
>> > >> >> >> >>>> wrote:
>> > >> >> >> >>>>
>> > >> >> >> >>>>> Yes, but _odp_packet_inline.udata is clearly not in the VLIB
>> VLIB
>> > >> >> >> >>>>> cache
>> > >> >> >> >>>>> line
>> > >> >> >> >>>>> either, so it's a separate cache line access. Are you
>> seeing
>> > >> >> >> >>>>> this
>> > >> >> >> >>>>> difference in real runs or microbenchmarks? Why isn't the
>> > >> >> >> >>>>> entire
>> > >> >> >> >>>>> VLIB being
>> > >> >> >> >>>>> prefetched at dispatch? Sequential prefetching should add
>> > >> >> >> >>>>> negligible
>> > >> >> >> >>>>> overhead.
>> > >> >> >> >>>>>
>> > >> >> >> >>>>> On Thu, Dec 7, 2017 at 11:13 AM, Michal Mazur
>> > >> >> >> >>>>> <michal.ma...@linaro.org>
>> > >> >> >> >>>>> wrote:
>> > >> >> >> >>>>>
>> > >> >> >> >>>>>> It seems that only the first cache line of the VLIB
>> > >> >> >> >>>>>> buffer is in L1; the new pointer can be placed only in
>> > >> >> >> >>>>>> the second cacheline.
>> > >> >> >> >>>>>> Using a constant offset between the user area and the
>> > >> >> >> >>>>>> ODP header I get 11 Mpps, with the pointer stored in the
>> > >> >> >> >>>>>> VLIB buffer only 10 Mpps, and with this new API 10.6 Mpps.
>> > >> >> >> >>>>>>
>> > >> >> >> >>>>>> On 7 December 2017 at 18:04, Bill Fischofer
>> > >> >> >> >>>>>> <bill.fischo...@linaro.org
>> > >> >> >> >>>>>> > wrote:
>> > >> >> >> >>>>>>
>> > >> >> >> >>>>>>> How would calling an API be better than referencing the
>> > >> >> >> >>>>>>> stored
>> > >> >> >> >>>>>>> data
>> > >> >> >> >>>>>>> yourself? A cache line reference is a cache line
>> > reference,
>> > >> >> >> >>>>>>> and
>> > >> >> >> >>>>>>> presumably
>> > >> >> >> >>>>>>> the VLIB buffer is already in L1 since it's your active
>> > >> >> >> >>>>>>> data.
>> > >> >> >> >>>>>>>
>> > >> >> >> >>>>>>> On Thu, Dec 7, 2017 at 10:45 AM, Michal Mazur <
>> > >> >> >> >>>>>>> michal.ma...@linaro.org> wrote:
>> > >> >> >> >>>>>>>
>> > >> >> >> >>>>>>>> Hi,
>> > >> >> >> >>>>>>>>
>> > >> >> >> >>>>>>>> For the odp4vpp plugin we need a new API function
>> > >> >> >> >>>>>>>> which, given a user area pointer, will return a
>> > >> >> >> >>>>>>>> pointer to the ODP packet buffer. It is needed when
>> > >> >> >> >>>>>>>> packets processed by VPP are sent back to ODP and only
>> > >> >> >> >>>>>>>> a pointer to the VLIB buffer data (stored inside the
>> > >> >> >> >>>>>>>> user area) is known.
>> > >> >> >> >>>>>>>>
>> > >> >> >> >>>>>>>> I have tried to store the ODP buffer pointer in the
>> > >> >> >> >>>>>>>> VLIB data, but reading it for every packet lowers
>> > >> >> >> >>>>>>>> performance by 800 kpps.
>> > >> >> >> >>>>>>>>
>> > >> >> >> >>>>>>>> For the odp-dpdk implementation it can look like:
>> > >> >> >> >>>>>>>>
>> > >> >> >> >>>>>>>> /** @internal Inline function @param uarea @return */
>> > >> >> >> >>>>>>>> static inline odp_packet_t
>> > >> >> >> >>>>>>>> _odp_packet_from_user_area(void *uarea)
>> > >> >> >> >>>>>>>> {
>> > >> >> >> >>>>>>>>         return (odp_packet_t)((uintptr_t)uarea -
>> > >> >> >> >>>>>>>>                               _odp_packet_inline.udata);
>> > >> >> >> >>>>>>>> }
>> > >> >> >> >>>>>>>>
>> > >> >> >> >>>>>>>> Please let me know what you think.
>> > >> >> >> >>>>>>>>
>> > >> >> >> >>>>>>>> Thanks,
>> > >> >> >> >>>>>>>> Michal
>> > >> >> >> >>>>>>>>
>> > >> >> >> >>>>>>>
>> > >> >> >> >>>>>>>
>> > >> >> >> >>>>>>
>> > >> >> >> >>>>>
>> > >> >> >> >>>>
>> > >> >> >> >>>
>> > >> >> >> >>
>> > >> >> >
>> > >> >> >
>> > >> >
>> > >> >
>> > >
>> > >
>> >
>>
>
>
>
> --
> François-Frédéric Ozog | *Director Linaro Networking Group*
> T: +33.67221.6485
> francois.o...@linaro.org | Skype: ffozog
>
>
