On Fri, Dec 8, 2017 at 2:49 PM, Honnappa Nagarahalli <
honnappa.nagaraha...@linaro.org> wrote:

> On 8 December 2017 at 13:40, Bill Fischofer <bill.fischo...@linaro.org>
> wrote:
> >
> >
> > On Fri, Dec 8, 2017 at 1:06 PM, Honnappa Nagarahalli
> > <honnappa.nagaraha...@linaro.org> wrote:
> >>
> >> On 7 December 2017 at 22:36, Bill Fischofer <bill.fischo...@linaro.org>
> >> wrote:
> >> >
> >> >
> >> > On Thu, Dec 7, 2017 at 10:12 PM, Honnappa Nagarahalli
> >> > <honnappa.nagaraha...@linaro.org> wrote:
> >> >>
> >> >> On 7 December 2017 at 17:36, Bill Fischofer <
> bill.fischo...@linaro.org>
> >> >> wrote:
> >> >> >
> >> >> >
> >> >> > On Thu, Dec 7, 2017 at 3:17 PM, Honnappa Nagarahalli
> >> >> > <honnappa.nagaraha...@linaro.org> wrote:
> >> >> >>
> >> >> >> This experiment clearly shows the need for providing an API in ODP.
> >> >> >>
> >> >> >> On ODP2.0 implementations such an API will be simple enough (constant
> >> >> >> subtraction), requiring no additional storage in VLIB.
> >> >> >>
> >> >> >> Michal, can you send a PR to ODP for the API so that we can debate the
> >> >> >> feasibility of the API for Cavium/NXP platforms.
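
For reference, the "constant subtraction" Honnappa describes above could look
roughly like the sketch below on an implementation that places the user area
at a fixed, build-time offset from the packet descriptor. This is only an
illustration; the macro, type, and offset value are hypothetical, not actual
ODP 2.0 code.

#include <stdint.h>

/* Hypothetical fixed layout: the user area always follows the packet
 * descriptor at an offset fixed when the pool is created. */
#define PKT_UAREA_OFFSET 128

/* Recover the packet descriptor from the user area pointer with a
 * compile-time constant subtraction; nothing extra is stored in VLIB. */
static inline void *pkt_hdr_from_uarea(void *uarea)
{
        return (void *)((uintptr_t)uarea - PKT_UAREA_OFFSET);
}
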
> >> >> >
> >> >> >
> >> >> > That's the point. An API that is tailored to a specific implementation
> >> >> > or application is not what ODP is about.
> >> >> >
> >> >> How are requirements coming into ODP APIs currently? My understanding
> >> >> is that they come from OFP and Petri's requirements. Similarly, VPP is
> >> >> also an application of ODP. Recently, the Arm community (Arm and
> >> >> partners) prioritized the open source projects that are of importance
> >> >> and came up with the top 50 (or 100) projects. If I remember correctly,
> >> >> VPP is in the top single digits (I am trying to get the exact details).
> >> >> So, it is an application of significant interest.
> >> >
> >> >
> >> > VPP is important, but what matters is for VPP to perform significantly
> >> > better on at least one ODP implementation than it does today using DPDK.
> >> > If we can't demonstrate that, then there's no point to the ODP4VPP
> >> > project. That's not going to happen on x86, since we can assume VPP/DPDK
> >> > is optimal there because VPP has been tuned to DPDK internals. So we need
> >> > to focus the performance work on Arm SoC platforms that offer significant
> >> > HW acceleration capabilities that VPP can exploit via ODP4VPP.
> >>
> >> VPP can exploit these capabilities through DPDK as well (maybe a few APIs
> >> are missing, but they will be available soon), since Cavium/NXP platforms
> >> support DPDK.
> >
> >
> > If the goal is to be "just as good" as DPDK, then we fail, because a VPP
> > application doesn't see or care whether DPDK or ODP is running underneath
> > it. The requirement is for VPP applications to run significantly (2x, 4x,
> > etc.) better using ODP4VPP than DPDK. That won't come from fine-tuning
> > what is fundamentally the same code, but rather from eliminating entire
> > processing steps by, for example, exploiting inline IPsec acceleration.
> >
> The point I am trying to make is that VPP can exploit inline IPsec
> acceleration through DPDK as well (if those APIs are not available in
> DPDK, they will be available soon). So, which use cases will we look at
> at that point? We need to be looking at all the use cases, and we need
> to be better at all of them.
>

We know that DPDK is "chasing" ODP in this and other areas. My assumption,
however, is that Intel will never allow DPDK on Arm to be better than DPDK
on x86, for obvious marketing reasons. That's why we need to "push the
envelope" with ODP on Arm since we're under no such constraints.


>
> >>
> >>
> >> This API is the basic API that is required for any use case. I do not
> >> understand why this API is not required for IPsec acceleration in NXP.
> >> If we store odp_packet_t in the VLIB buffer, it will affect IPsec
> >> performance on the NXP platform as well.
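
To make the trade-off concrete, the alternative being debated here, storing
the handle alongside the VLIB buffer and loading it back for every packet,
would look roughly like the sketch below. The struct and helper names are
illustrative only; the point is the extra per-packet load, which may miss in
cache.

#include <odp_api.h>

/* Illustrative only: per-buffer metadata kept alongside the VLIB buffer. */
typedef struct {
        odp_packet_t odp_pkt;   /* handle written once at receive time */
        /* ... other per-buffer fields ... */
} odp4vpp_buf_md_t;

static inline void store_handle(odp4vpp_buf_md_t *md, odp_packet_t pkt)
{
        md->odp_pkt = pkt;
}

static inline odp_packet_t load_handle(const odp4vpp_buf_md_t *md)
{
        /* One extra load (and possible cache miss) for every packet sent. */
        return md->odp_pkt;
}
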
> >
> >
> > That sounds like an assumption rather than a measurement. We should let
> > Nikhil weigh in here about the key drivers to achieving best performance
> > on NXP platforms.
> >
> Well, it is not exactly an assumption. Based on working on similar
> optimizations, we have all done enough cache-line-related optimization in
> Linux-Generic by now. We can do it again; Sachin has the code as well.
>

That would be good to know.


>
> >>
> >>
> >> > This isn't one of those. The claim is that with or without this change
> >> > ODP4VPP on x86 performs worse than VPP/DPDK on x86.
> >>
> >> That does not mean we should not work on increasing the performance of
> >> ODP4VPP on x86. This API will help close the performance gap.
> >
> >
> > My point is that the best you can hope for on x86 is to be no different
> > than DPDK. That's a fail, so such tuning isn't germane to ODP4VPP's
> > success. We need to be focusing on exploiting HW acceleration paths, not
> > trying to equal DPDK SW paths. That's how ODP4VPP will show value.
>
> Not all Arm platforms have accelerators. There are server CPUs which are
> just CPU cores without any accelerators. On such platforms it is software
> only, and every optimization becomes important. These optimizations are
> important for platforms with accelerators as well; why waste cycles?
>
> odp-dpdk is being used temporarily. Once ODP2.0 is available, the same
> solution applies to 2.0 as well.
>
> So, we need to take DPDK and x86 out of this discussion and keep it to ODP
> and Arm to keep the discussion meaningful.
>

Agreed, but we're not talking about ODP applications here. We're talking
about a specific DPDK application (VPP) that we're trying to port to ODP
with minimal change. That may or may not be feasible as a pure SW exercise,
which is why we need to focus on the accelerated paths rather than trying
to turn ODP into a DPDK wannabe.

The case for optimal ODP SW implementations is to promote ODP applications,
which is not the case here since VPP is not an ODP application. It's
actually a DPDK application even though they claim not to be. If we were to
turn it into an ODP application it would likely fare better, but that's
beyond the scope of this project.


>
> >
> >>
> >>
> >> >
> >> > Since VPP applications don't change whether ODP4VPP is in the picture or
> >> > not, it doesn't matter whether it's used on x86, so tuning ODP4VPP on x86
> >> > is at best of secondary importance. We just need at least one Arm
> >> > platform on which VPP applications run dramatically better than without
> >> > it.
> >>
> >> This is not tuning only for the x86 platform; it is tuning that would
> >> apply to any platform.
> >
> >
> > No, it isn't, because you don't know that other platforms have the same
> > brittle cache characteristics as this particular measured x86 case. And
> > even here I don't think we've really explored what cache tuning is
> > available to better exploit prefetching.
>
> One of the defined goals of ODP is to perform well on x86 as well. There
> are multiple ways to solve the problem, and this is one that is showing
> better performance results. The implementation being used in this test
> applies to ODP 2.0 too. This test also proves that the implementation can
> do a better job of getting the handle to the packet.
>

Again, the purpose of good ODP performance on x86 is to support ODP
applications. A well-written ODP application should perform as well on x86
as it would were it written as a DPDK application, and dramatically better
on Arm platforms with HW acceleration capabilities that have no x86
equivalents. That's the whole justification for the ODP project.


>
> >
> >>
> >>
> >> >
> >> >>
> >> >>
> >> >> >>
> >> >> >>
> >> >> >> On 7 December 2017 at 14:08, Bill Fischofer
> >> >> >> <bill.fischo...@linaro.org>
> >> >> >> wrote:
> >> >> >> > On Thu, Dec 7, 2017 at 12:22 PM, Michal Mazur
> >> >> >> > <michal.ma...@linaro.org>
> >> >> >> > wrote:
> >> >> >> >
> >> >> >> >> The native VPP+DPDK plugin knows the size of the rte_mbuf header
> >> >> >> >> and subtracts it from the vlib pointer:
> >> >> >> >>
> >> >> >> >> struct rte_mbuf *mb0 = rte_mbuf_from_vlib_buffer (b0);
> >> >> >> >> #define rte_mbuf_from_vlib_buffer(x) (((struct rte_mbuf *)x) - 1)
> >> >> >> >>
> >> >> >> >
> >> >> >> > No surprise that VPP is a DPDK application, but I thought they
> >> >> >> > wanted to be independent of DPDK. The problem is that ODP is never
> >> >> >> > going to match DPDK at an ABI level on x86, so we can't be fixated
> >> >> >> > on x86 performance comparisons between ODP4VPP and VPP/DPDK.
> >> >> >> Any reason why we will not be able to match or exceed the
> >> >> >> performance?
> >> >> >
> >> >> >
> >> >> > It's not that ODP can't have good performance on x86; it's that DPDK
> >> >> > encourages apps to be very dependent on DPDK implementation details,
> >> >> > such as seen here. ODP is not going to match DPDK internals, so
> >> >> > applications that exploit such internals will always see a difference.
> >> >> >
> >> >> >>
> >> >> >>
> >> >> >> > What we need to do is compare ODP4VPP on Arm-based SoCs vs. "native
> >> >> >> > VPP" that can't take advantage of the HW acceleration present on
> >> >> >> > those platforms. That's how we get to show dramatic differences. If
> >> >> >> > ODP4VPP is only within a few percent (plus or minus) of VPP/DPDK,
> >> >> >> > there's no point in doing the project at all.
> >> >> >> >
> >> >> >> > So my advice would be to stash the handle in the VLIB buffer for
> >> >> >> > now and focus on exploiting the native IPsec acceleration
> >> >> >> > capabilities that ODP will permit.
> >> >> >> >
> >> >> >> >
> >> >> >> >> On 7 December 2017 at 19:02, Bill Fischofer
> >> >> >> >> <bill.fischo...@linaro.org>
> >> >> >> >> wrote:
> >> >> >> >>
> >> >> >> >>> Ping to others on the mailing list for opinions on this. What
> >> >> >> >>> does
> >> >> >> >>> "native" VPP+DPDK get and how is this problem solved there?
> >> >> >> >>>
> >> >> >> >>> On Thu, Dec 7, 2017 at 11:55 AM, Michal Mazur
> >> >> >> >>> <michal.ma...@linaro.org>
> >> >> >> >>> wrote:
> >> >> >> >>>
> >> >> >> >>>> The _odp_packet_inline structure is common for all packets and
> >> >> >> >>>> takes up to two cache lines (it contains only offsets). Reading
> >> >> >> >>>> the pointer for each packet from VLIB would require fetching 10
> >> >> >> >>>> million cache lines per second. Using prefetches does not help.
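
To illustrate why the offset-based lookup stays cheap: the offsets live in
one small structure shared by every packet, so it stays hot in L1, whereas a
stored pointer lives in each packet's own VLIB area and must be fetched per
packet. A rough sketch of such a shared offsets table is below; the field
names and values are illustrative, loosely modeled on the udata offset used
in the code further down.

#include <stdint.h>

/* One process-wide table of fixed offsets, read by every packet and
 * therefore effectively always resident in L1. Values are placeholders. */
typedef struct {
        uint16_t udata;   /* offset from packet header to user area */
        uint16_t data;    /* offset from packet header to packet data */
} pkt_inline_offsets_t;

static const pkt_inline_offsets_t pkt_inline = { .udata = 64, .data = 192 };

/* Both directions are simple arithmetic on the same L1-resident constants. */
static inline void *uarea_from_pkt(uintptr_t pkt)
{
        return (void *)(pkt + pkt_inline.udata);
}

static inline uintptr_t pkt_from_uarea_ptr(void *uarea)
{
        return (uintptr_t)uarea - pkt_inline.udata;
}
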
> >> >> >> >>>>
> >> >> >> >>>> On 7 December 2017 at 18:37, Bill Fischofer
> >> >> >> >>>> <bill.fischo...@linaro.org>
> >> >> >> >>>> wrote:
> >> >> >> >>>>
> >> >> >> >>>>> Yes, but _odp_packet_inline.udata is clearly not in the VLIB
> >> >> >> >>>>> cache line either, so it's a separate cache line access. Are
> >> >> >> >>>>> you seeing this difference in real runs or microbenchmarks?
> >> >> >> >>>>> Why isn't the entire VLIB being prefetched at dispatch?
> >> >> >> >>>>> Sequential prefetching should add negligible overhead.
> >> >> >> >>>>>
> >> >> >> >>>>> On Thu, Dec 7, 2017 at 11:13 AM, Michal Mazur
> >> >> >> >>>>> <michal.ma...@linaro.org>
> >> >> >> >>>>> wrote:
> >> >> >> >>>>>
> >> >> >> >>>>>> It seems that only the first cache line of the VLIB buffer is
> >> >> >> >>>>>> in L1; a new pointer can be placed only in the second cache
> >> >> >> >>>>>> line. Using a constant offset between the user area and the
> >> >> >> >>>>>> ODP header I get 11 Mpps, with the pointer stored in the VLIB
> >> >> >> >>>>>> buffer only 10 Mpps, and with this new API 10.6 Mpps.
> >> >> >> >>>>>>
> >> >> >> >>>>>> On 7 December 2017 at 18:04, Bill Fischofer
> >> >> >> >>>>>> <bill.fischo...@linaro.org
> >> >> >> >>>>>> > wrote:
> >> >> >> >>>>>>
> >> >> >> >>>>>>> How would calling an API be better than referencing the
> >> >> >> >>>>>>> stored data yourself? A cache line reference is a cache line
> >> >> >> >>>>>>> reference, and presumably the VLIB buffer is already in L1
> >> >> >> >>>>>>> since it's your active data.
> >> >> >> >>>>>>>
> >> >> >> >>>>>>> On Thu, Dec 7, 2017 at 10:45 AM, Michal Mazur <
> >> >> >> >>>>>>> michal.ma...@linaro.org> wrote:
> >> >> >> >>>>>>>
> >> >> >> >>>>>>>> Hi,
> >> >> >> >>>>>>>>
> >> >> >> >>>>>>>> For the odp4vpp plugin we need a new API function which,
> >> >> >> >>>>>>>> given a user area pointer, will return a pointer to the ODP
> >> >> >> >>>>>>>> packet buffer. It is needed when packets processed by VPP
> >> >> >> >>>>>>>> are sent back to ODP and only a pointer to the VLIB buffer
> >> >> >> >>>>>>>> data (stored inside the user area) is known.
> >> >> >> >>>>>>>>
> >> >> >> >>>>>>>> I have tried to store the ODP buffer pointer in the VLIB
> >> >> >> >>>>>>>> data, but reading it for every packet lowers performance by
> >> >> >> >>>>>>>> 800 kpps.
> >> >> >> >>>>>>>>
> >> >> >> >>>>>>>> For the odp-dpdk implementation it can look like:
> >> >> >> >>>>>>>>
> >> >> >> >>>>>>>> /** @internal Inline function @param uarea @return */
> >> >> >> >>>>>>>> static inline odp_packet_t _odp_packet_from_user_area(void *uarea)
> >> >> >> >>>>>>>> {
> >> >> >> >>>>>>>>         return (odp_packet_t)((uintptr_t)uarea -
> >> >> >> >>>>>>>>                                _odp_packet_inline.udata);
> >> >> >> >>>>>>>> }
> >> >> >> >>>>>>>>
> >> >> >> >>>>>>>> Please let me know what you think.
> >> >> >> >>>>>>>>
> >> >> >> >>>>>>>> Thanks,
> >> >> >> >>>>>>>> Michal
> >> >> >> >>>>>>>>
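
For what it's worth, a transmit-side caller in the plugin could then look
roughly like the sketch below, assuming each vlib_buffer_t sits at the start
of its ODP packet user area (as the description above implies). Apart from
odp_pktout_send() and the proposed _odp_packet_from_user_area(), the names
are illustrative.

#include <odp_api.h>
#include <vlib/vlib.h>

/* Sketch of a TX burst using the proposed helper; no per-packet pointer is
 * read from the VLIB buffer, only a constant-offset computation per packet. */
static int odp4vpp_tx_burst(odp_pktout_queue_t pktout,
                            vlib_buffer_t *bufs[], int num)
{
        odp_packet_t pkts[num];
        int i;

        for (i = 0; i < num; i++)
                pkts[i] = _odp_packet_from_user_area(bufs[i]);

        /* odp_pktout_send() returns the number of packets actually sent. */
        return odp_pktout_send(pktout, pkts, num);
}
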
> >> >> >> >>>>>>>
> >> >> >> >>>>>>>
> >> >> >> >>>>>>
> >> >> >> >>>>>
> >> >> >> >>>>
> >> >> >> >>>
> >> >> >> >>
> >> >> >
> >> >> >
> >> >
> >> >
> >
> >
>
