On Fri, Dec 8, 2017 at 1:06 PM, Honnappa Nagarahalli <
honnappa.nagaraha...@linaro.org> wrote:

> On 7 December 2017 at 22:36, Bill Fischofer <bill.fischo...@linaro.org>
> wrote:
> >
> >
> > On Thu, Dec 7, 2017 at 10:12 PM, Honnappa Nagarahalli
> > <honnappa.nagaraha...@linaro.org> wrote:
> >>
> >> On 7 December 2017 at 17:36, Bill Fischofer <bill.fischo...@linaro.org>
> >> wrote:
> >> >
> >> >
> >> > On Thu, Dec 7, 2017 at 3:17 PM, Honnappa Nagarahalli
> >> > <honnappa.nagaraha...@linaro.org> wrote:
> >> >>
> >> >> This experiment clearly shows the need for providing an API in ODP.
> >> >>
> >> >> On ODP2.0 implementations such an API will be simple enough (constant
> >> >> subtraction), requiring no additional storage in VLIB.
> >> >>
> >> Michal, can you send a PR to ODP for the API so that we can debate the
> >> feasibility of the API for Cavium/NXP platforms.
> >> >
> >> >
> >> > That's the point. An API that is tailored to a specific implementation
> >> > or application is not what ODP is about.
> >> >
> >> How are requirements reaching the ODP APIs currently? My understanding
> >> is that they come from OFP and Petri's requirements. Similarly, VPP is
> >> also an application of ODP. Recently, the Arm community (Arm and
> >> partners) prioritized the open source projects that matter most and came
> >> up with a list of the top 50 (or 100) projects. If I remember correctly,
> >> VPP is in the top single digits (I am trying to get the exact details).
> >> So it is an application of significant interest.
> >
> >
> > VPP is important, but what matters is for VPP to perform significantly
> > better on at least one ODP implementation than it does today using DPDK.
> > If we can't demonstrate that, there's no point to the ODP4VPP project.
> > That's not going to happen on x86, where we can assume VPP/DPDK is already
> > optimal because VPP has been tuned to DPDK internals. So we need to focus
> > the performance work on Arm SoC platforms that offer significant HW
> > acceleration capabilities that VPP can exploit via ODP4VPP.
>
> VPP can exploit these capabilities through DPDK as well (maybe a few
> APIs are missing, but they will be available soon), since Cavium/NXP
> platforms support DPDK.
>

If the goal is to be "just as good" as DPDK then we fail because a VPP
application doesn't see or care whether DPDK or ODP is running underneath
it. The requirement is for VPP applications to run significantly (2x, 4x,
etc.) better using ODP4VPP than DPDK. That won't come by fine-tuning what's
fundamentally the same code, but rather by eliminating entire processing
steps by, for example, exploiting inline IPsec acceleration.


>
> This API is a basic API that is required for any use case. I do not
> understand why this API is not required for IPsec acceleration on NXP.
> If we store the odp_packet_t in the VLIB buffer, it will hurt IPsec
> performance on the NXP platform as well.
>

That sounds like an assumption rather than a measurement. We should let
Nikhil weigh in here about the key drivers to achieving best performance on
NXP platforms.


>
> > This isn't one of those. The claim is that with or without this change
> > ODP4VPP on x86 performs worse than VPP/DPDK on x86.
>
> That does not mean we should not work on increasing the performance of
> ODP4VPP on x86. This API will help close the performance gap.
>

My point is that the best you can hope for on x86 is to be no different from
DPDK. That's a fail, so such tuning isn't germane to ODP4VPP's success. We
need to focus on exploiting HW acceleration paths, not on trying to equal
DPDK SW paths. That's how ODP4VPP will show value.


>
> >
> > Since VPP applications don't change whether ODP4VPP is in the picture or
> > not, it doesn't matter whether it's used on x86, so tuning ODP4VPP on x86
> > is of secondary importance at best. We just need at least one Arm platform
> > on which VPP applications run dramatically better than without it.
>
> This is not tuning for the x86 platform only; it is tuning that would
> apply to any platform.
>

No, it isn't, because you don't know that other platforms have the same
brittle cache characteristics as this particular measured x86 case. And even
here I don't think we've really explored what cache tuning is available to
better exploit prefetching.
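
As a concrete example of what I mean by cache tuning, here is a minimal
sketch (illustrative only, not odp4vpp code; it assumes 64-byte cache lines
and that the stored handle lives in the VLIB buffer's second cache line):

/* While working on packet i in the dispatch loop, prefetch the cache line
 * of packet i+1 that holds the stored odp_packet_t, so the later load is
 * not a cold miss. */
static inline void
prefetch_stored_handle (void *next_vlib_buf)
{
        __builtin_prefetch ((char *) next_vlib_buf + 64, 0 /* read */, 3);
}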


>
> >
> >>
> >>
> >> >>
> >> >>
> >> >> On 7 December 2017 at 14:08, Bill Fischofer
> >> >> <bill.fischo...@linaro.org> wrote:
> >> >> > On Thu, Dec 7, 2017 at 12:22 PM, Michal Mazur
> >> >> > <michal.ma...@linaro.org>
> >> >> > wrote:
> >> >> >
> >> >> >> The native VPP+DPDK plugin knows the size of the rte_mbuf header
> >> >> >> and subtracts it from the vlib buffer pointer.
> >> >> >>
> >> >> >> struct rte_mbuf *mb0 = rte_mbuf_from_vlib_buffer (b0);
> >> >> >> #define rte_mbuf_from_vlib_buffer(x) (((struct rte_mbuf *)x) - 1)
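
For anyone following along: this works because the dpdk plugin lays the
struct rte_mbuf header out immediately in front of the vlib_buffer_t in the
same buffer, so both conversions are constant pointer arithmetic with no
per-packet loads. A sketch of the assumed layout, plus the reverse-direction
macro as I read the VPP dpdk plugin sources (details may differ):

/*  | struct rte_mbuf | vlib_buffer_t | headroom + packet data ... |  */
#define vlib_buffer_from_rte_mbuf(x) ((vlib_buffer_t *) ((x) + 1))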
> >> >> >>
> >> >> >
> >> >> > No surprise that VPP is a DPDK application, but I thought they
> >> >> > wanted to be independent of DPDK. The problem is that ODP is never
> >> >> > going to match DPDK at an ABI level on x86 so we can't be fixated
> >> >> > on x86 performance comparisons between ODP4VPP and VPP/DPDK.
> >> >> Any reason why we will not be able to match or exceed the performance?
> >> >
> >> >
> >> > It's not that ODP can't have good performance on x86, it's that DPDK
> >> > encourages apps to be very dependent on DPDK implementation details
> >> > such as seen here. ODP is not going to match DPDK internals so
> >> > applications that exploit such internals will always see a difference.
> >> >
> >> >>
> >> >>
> >> >> > What we need to do is compare ODP4VPP on Arm-based SoCs vs. "native
> >> >> > VPP" that can't take advantage of the HW acceleration present on
> >> >> > those platforms. That's how we get to show dramatic differences. If
> >> >> > ODP4VPP is only within a few percent (plus or minus) of VPP/DPDK,
> >> >> > there's no point in doing the project at all.
> >> >> >
> >> >> > So my advice would be to stash the handle in the VLIB buffer for
> >> >> > now and focus on exploiting the native IPsec acceleration
> >> >> > capabilities that ODP will permit.
> >> >> >
> >> >> >
> >> >> >> On 7 December 2017 at 19:02, Bill Fischofer
> >> >> >> <bill.fischo...@linaro.org>
> >> >> >> wrote:
> >> >> >>
> >> >> >>> Ping to others on the mailing list for opinions on this. What
> >> >> >>> does "native" VPP+DPDK get and how is this problem solved there?
> >> >> >>>
> >> >> >>> On Thu, Dec 7, 2017 at 11:55 AM, Michal Mazur
> >> >> >>> <michal.ma...@linaro.org>
> >> >> >>> wrote:
> >> >> >>>
> >> >>>> The _odp_packet_inline structure is common to all packets and takes
> >> >>>> up to two cachelines (it contains only offsets). Reading the pointer
> >> >>>> for each packet from VLIB would require fetching 10 million
> >> >>>> cachelines per second. Using prefetches does not help.
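
To make that concrete: as I understand it, _odp_packet_inline is a single
global table of constant offsets shared by every packet, so the conversion
itself touches no per-packet memory. Illustrative shape only; the real
odp-dpdk definition differs in detail, and only the udata field is taken
from the code quoted further down:

#include <stdint.h>

typedef struct {
        uintptr_t udata;  /* constant offset from packet header to user area */
        /* ... other constant offsets, identical for every packet ... */
} _odp_packet_inline_offset_t;

extern const _odp_packet_inline_offset_t _odp_packet_inline;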
> >> >> >>>>
> >> >> >>>> On 7 December 2017 at 18:37, Bill Fischofer
> >> >> >>>> <bill.fischo...@linaro.org>
> >> >> >>>> wrote:
> >> >> >>>>
> >> >>>>> Yes, but _odp_packet_inline.udata is clearly not in the VLIB cache
> >> >>>>> line either, so it's a separate cache line access. Are you seeing
> >> >>>>> this difference in real runs or microbenchmarks? Why isn't the
> >> >>>>> entire VLIB being prefetched at dispatch? Sequential prefetching
> >> >>>>> should add negligible overhead.
> >> >> >>>>>
> >> >> >>>>> On Thu, Dec 7, 2017 at 11:13 AM, Michal Mazur
> >> >> >>>>> <michal.ma...@linaro.org>
> >> >> >>>>> wrote:
> >> >> >>>>>
> >> >>>>>> It seems that only the first cache line of the VLIB buffer is in
> >> >>>>>> L1; the new pointer can be placed only in the second cacheline.
> >> >>>>>> Using a constant offset between the user area and the ODP header
> >> >>>>>> I get 11 Mpps, with the pointer stored in the VLIB buffer only
> >> >>>>>> 10 Mpps, and with this new API 10.6 Mpps.
> >> >> >>>>>>
> >> >>>>>> On 7 December 2017 at 18:04, Bill Fischofer
> >> >>>>>> <bill.fischo...@linaro.org> wrote:
> >> >> >>>>>>
> >> >>>>>>> How would calling an API be better than referencing the stored
> >> >>>>>>> data yourself? A cache line reference is a cache line reference,
> >> >>>>>>> and presumably the VLIB buffer is already in L1 since it's your
> >> >>>>>>> active data.
> >> >> >>>>>>>
> >> >> >>>>>>> On Thu, Dec 7, 2017 at 10:45 AM, Michal Mazur <
> >> >> >>>>>>> michal.ma...@linaro.org> wrote:
> >> >> >>>>>>>
> >> >> >>>>>>>> Hi,
> >> >> >>>>>>>>
> >> >>>>>>>> For the odp4vpp plugin we need a new API function which, given a
> >> >>>>>>>> user area pointer, returns a pointer to the ODP packet buffer. It
> >> >>>>>>>> is needed when packets processed by VPP are sent back to ODP and
> >> >>>>>>>> only a pointer to the VLIB buffer data (stored inside the user
> >> >>>>>>>> area) is known.
> >> >> >>>>>>>>
> >> >>>>>>>> I have tried storing the ODP buffer pointer in the VLIB data,
> >> >>>>>>>> but reading it for every packet lowers performance by 800 kpps.
> >> >> >>>>>>>>
> >> >>>>>>>> For the odp-dpdk implementation it can look like:
> >> >>>>>>>>
> >> >>>>>>>> /** @internal Return the ODP packet whose user area starts at @p uarea */
> >> >>>>>>>> static inline odp_packet_t _odp_packet_from_user_area(void *uarea)
> >> >>>>>>>> {
> >> >>>>>>>>        return (odp_packet_t)((uintptr_t)uarea -
> >> >>>>>>>>                              _odp_packet_inline.udata);
> >> >>>>>>>> }
> >> >> >>>>>>>>
> >> >> >>>>>>>> Please let me know what you think.
> >> >> >>>>>>>>
> >> >> >>>>>>>> Thanks,
> >> >> >>>>>>>> Michal
> >> >> >>>>>>>>
> >> >> >>>>>>>
> >> >> >>>>>>>
> >> >> >>>>>>
> >> >> >>>>>
> >> >> >>>>
> >> >> >>>
> >> >> >>
> >> >
> >> >
> >
> >
>
