Yes, the purpose of configurable headroom is very different from that of the user_area and of the user_pointer. ODP provides all three.
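To make that distinction concrete, here is a small illustration of how an application reaches each of the three mechanisms with the packet API as it stands today. This is just an illustration: esp_hdr_t and the flow argument are placeholder application types, and error handling is omitted.

#include <odp_api.h>
#include <stdint.h>

/* Placeholder upper-layer header type, for illustration only. */
typedef struct { uint32_t spi; uint32_t seq; } esp_hdr_t;

static void show_three_mechanisms(odp_packet_t pkt, void *flow)
{
    /* 1. Headroom: extend the packet forward to prepend a header
     *    (e.g. for IPsec encapsulation). */
    esp_hdr_t *esp = odp_packet_push_head(pkt, sizeof(esp_hdr_t));
    if (esp)
        esp->spi = 0; /* fill in the prepended header */

    /* 2. user_area: small per-packet metadata region, sized when the
     *    pool is created; this is where VPP places its vlib_buffer. */
    void *ua = odp_packet_user_area(pkt);
    if (odp_packet_user_area_size(pkt) >= sizeof(uint64_t))
        *(uint64_t *)ua = 0; /* upper-layer metadata lives here */

    /* 3. user_pointer: opaque reference to a possibly shared object,
     *    e.g. a flow context. */
    odp_packet_user_ptr_set(pkt, flow);
}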
Headroom can be used for many reasons, but the primary one is to prepend an additional protocol header, such as an IPsec header.

The user_area is a design pattern found in many frameworks (it is very frequent in the kernel), and there should be a way to get the "containing" upper-layer pointer back out of a user_area pointer. VPP assumes the underlying packet framework provides a user_area where it places its own metadata, and at some point it needs to convert that metadata pointer back into the underlying packet reference. VPP could have embedded such a reference in its own metadata, but it did not, both because of the cost (typically a 64-bit quantity) and because the user_area design pattern typically allocates the upper-layer metadata contiguously with the packet metadata, so the containing packet is recovered through arithmetic on the pointer (a rough sketch is appended as a PS at the bottom of this mail).

The user_pointer is different from the user_area for the following reasons:
- the user_area can be prefetched at the same time as the main packet metadata, so the latency to access the information is lower than when it is merely pointed to;
- the user_area is typically small, while the object behind a user_pointer can be arbitrarily big;
- the user_area has a one-to-one mapping to the upper layer, while the user_pointer may point to a shared object (for instance a "flow", or a "set of flows for the same network user").

Getting the original metadata back from the user_pointer may be irrelevant in the case of shared objects, so that design pattern does not provide such a conversion; typically the translation is done through a back reference contained in the object pointed to by the user_pointer.

FF

On 9 December 2017 at 22:50, Bill Fischofer <bill.fischo...@linaro.org> wrote: > We certainly should have this discussion, however I believe what VPP > really wants is the ability to associate a managed prefix with packets, as > this is how the VLIB buffer is currently being used. > > ODP packets have the notion of a configurable headroom area that precedes > and is contiguous with the initial packet segment. The > odp_packet_push_head() API is used to extend packets into this area while > odp_packet_head() provides the address of the start of the headroom area > that is prefixed to the packet. Thus, per the ODP API spec. > odp_packet_head() + odp_packet_headroom() == odp_packet_data(). > > It's conceivable this relationship could be used by VPP, however at > present ODP provides no guarantee that unallocated headroom is preserved > across any ODP packet operation. The user area, by contrast, is preserved > as well as copied when needed when packets are reallocated as part of more > complex packet operations. So it's conceivable we could formalize the > notion of a managed prefix that would provide the sort of arithmetic > management that VPP wants. > > However, we really need to better understand the testing environment and > the measurements that are being done here before rushing to introduce new > APIs in the hope that they will provide performance benefits. In the data > cited above, it seems this particular test configuration is unable to > achieve 10Gb line rate under the best of circumstances, which means > something more fundamental is going on that needs to be understood first. > > > > On Sat, Dec 9, 2017 at 6:53 AM, Francois Ozog <francois.o...@linaro.org> > wrote: > >> I'd like to have SoC vendors feedback on the discussion and on the >> following: >> >> This ODP user area is used by VPP to put a vlib_buffer information.
VPP >> is just one of the applications that will use this simple mechanism and >> that is certainly a design pattern. >> >> A user area can exist when an odp packet is in the host, that is it has >> been received on a special queue. >> The odp_packet_t must be from a pool whose user_area size is big enough >> to hold such meta-data >> >> today we have >> void *odp_packet_user_area(odp_packet_t pkt); >> uint32_t odp_packet_user_area_size(odp_packet_t pkt); >> >> So the question is, can we have: >> odp_packet_t pkt odp_packet_from_user_area(void * user_area); >> >> >> >> 1) I wonder though if all user area calls should't be associated to the >> mempool that was configured with the desired user_area size. >> 2) implementation assessment: >> >> - Cavium and NXP have formed odp_packet_t as physical address and >> additional bits. So their implementation can be fairly optimized by >> calculation if user_area is allocated consecutively to the packet buffer >> - linux generic can have its own way of dealing with it. >> - odp_dpdk can implement it the way Michal mentioned >> - ODP with DDF can use calculation of odp_buffer_t the same way DPDK >> does >> >> >> I strongly think that the proposed API is a must in the design pattern >> related to user_areas. That is, it is NOT specific to VPP. >> >> There cannot be a single way to implement way due to the ODP nature, >> every implementation have to deal with it in its own way. >> >> I guess it it is not possible under some implementations to do it by >> calculation. In that case the new API impose an implementation change that >> may be based on a hash table (because user area may not be contiguous to >> buffer). That looks bad but at least that is generic implementation. >> Alternatively, it may be said that this function can return NULL on some >> platforms. It will just prevent some applications to run on those >> platforms. Which is also a design choice that would push the upper layers >> to add the information (odp_packet_t in our case) in their meta data >> (blib_buffer in the case of VPP) to be compatible. >> >> FF >> >> On 8 December 2017 at 22:55, Bill Fischofer <bill.fischo...@linaro.org> >> wrote: >> >>> On Fri, Dec 8, 2017 at 2:49 PM, Honnappa Nagarahalli < >>> honnappa.nagaraha...@linaro.org> wrote: >>> >>> > On 8 December 2017 at 13:40, Bill Fischofer <bill.fischo...@linaro.org >>> > >>> > wrote: >>> > > >>> > > >>> > > On Fri, Dec 8, 2017 at 1:06 PM, Honnappa Nagarahalli >>> > > <honnappa.nagaraha...@linaro.org> wrote: >>> > >> >>> > >> On 7 December 2017 at 22:36, Bill Fischofer < >>> bill.fischo...@linaro.org> >>> > >> wrote: >>> > >> > >>> > >> > >>> > >> > On Thu, Dec 7, 2017 at 10:12 PM, Honnappa Nagarahalli >>> > >> > <honnappa.nagaraha...@linaro.org> wrote: >>> > >> >> >>> > >> >> On 7 December 2017 at 17:36, Bill Fischofer < >>> > bill.fischo...@linaro.org> >>> > >> >> wrote: >>> > >> >> > >>> > >> >> > >>> > >> >> > On Thu, Dec 7, 2017 at 3:17 PM, Honnappa Nagarahalli >>> > >> >> > <honnappa.nagaraha...@linaro.org> wrote: >>> > >> >> >> >>> > >> >> >> This experiment clearly shows the need for providing an API in >>> > ODP. >>> > >> >> >> >>> > >> >> >> On ODP2.0 implementations such an API will be simple enough >>> > >> >> >> (constant >>> > >> >> >> subtraction), requiring no additional storage in VLIB. >>> > >> >> >> >>> > >> >> >> Michal, can you send a PR to ODP for the API so that we can >>> debate >>> > >> >> >> the >>> > >> >> >> feasibility of the API for Cavium/NXP platforms. 
>>> > >> >> > >>> > >> >> > >>> > >> >> > That's the point. An API that is tailored to a specific >>> > >> >> > implementation >>> > >> >> > or >>> > >> >> > application is not what ODP is about. >>> > >> >> > >>> > >> >> How are the requirements coming to ODP APIs currently? My >>> > >> >> understanding is, it is coming from OFP and Petri's requirements. >>> > >> >> Similarly, VPP is also an application of ODP. Recently, Arm >>> community >>> > >> >> (Arm and partners) prioritized on the open source projects that >>> are >>> > of >>> > >> >> importance and came up with top 50 (or 100) projects. If I >>> remember >>> > >> >> correct VPP is among top single digits (I am trying to get the >>> exact >>> > >> >> details). So, it is an application of significant interest. >>> > >> > >>> > >> > >>> > >> > VPP is important, but what's important is for VPP to perform >>> > >> > significantly >>> > >> > better on at least one ODP implementation than it does today using >>> > DPDK. >>> > >> > If >>> > >> > we can't demonstrate that then there's no point to the ODP4VPP >>> > project. >>> > >> > That's not going to happen on x86 since we can assume that >>> VPP/DPDK is >>> > >> > optimal here since VPP has been tuned to DPDK internals. So we >>> need to >>> > >> > focus >>> > >> > the performance work on Arm SoC platforms that offer significant >>> HW >>> > >> > acceleration capabilities that VPP can exploit via ODP4VPP. >>> > >> >>> > >> VPP can exploit these capabilities through DPDK as well (may be few >>> > >> APIs are missing, but they will be available soon) as Cavium/NXP >>> > >> platforms support DPDK. >>> > > >>> > > >>> > > If the goal is to be "just as good" as DPDK then we fail because a >>> VPP >>> > > application doesn't see or care whether DPDK or ODP is running >>> underneath >>> > > it. The requirement is for VPP applications to run significantly >>> (2x, 4x, >>> > > etc.) better using ODP4VPP than DPDK. That won't come by fine-tuning >>> > what's >>> > > fundamentally the same code, but rather by eliminating entire >>> processing >>> > > steps by, for example, exploiting inline IPsec acceleration. >>> > > >>> > The point I am trying to make is, VPP can exploit inline IPsec >>> > acceleration through DPDK as well (if those APIs are not available in >>> > DPDK, they will be available soon). So, what use cases we will look at >>> > that point? We need to be looking at all the use cases and we need to >>> > be better at all them. >>> > >>> >>> We know that DPDK is "chasing" ODP in this and other areas. My >>> assumption, >>> however, is that Intel will never allow DPDK on Arm to be better than >>> DPDK >>> on x86, for obvious marketing reasons. That's why we need to "push the >>> envelope" with ODP on Arm since we're under no such constraints. >>> >>> >>> > >>> > >> >>> > >> >>> > >> This API is the basic API that is required for any use case. I do >>> not >>> > >> understand why this API is not required for IPsec acceleration in >>> NXP. >>> > >> If we store odp_packet_t in VLIB buffer, it will affect the >>> > >> performance of IPsec performance on NXP platform as well. >>> > > >>> > > >>> > > That sounds like an assumption rather than a measurement. We should >>> let >>> > > Nikhil weigh in here about the key drivers to achieving best >>> performance >>> > on >>> > > NXP platforms. >>> > > >>> > Well, not exactly assumption, based on working on similar >>> > optimizations, we all have done enough cache line related >>> > optimizations in Linux-Generic now. 
We can do it again, Sachin has the >>> > code as well. >>> > >>> >>> That would be good to know. >>> >>> >>> > >>> > >> >>> > >> >>> > >> This isn't one >>> > >> > of those. The claim is that with or without this change ODP4VPP >>> on x86 >>> > >> > performs worse than VPP/DPDK on x86. >>> > >> >>> > >> That does not mean, we do not work on increasing the performance of >>> > >> ODP4VPP on x86. This API will help catch up on the performance. >>> > > >>> > > >>> > > My point is that the best you can hope for on x86 is to be no >>> different >>> > than >>> > > DPDK. That's a fail, so such tuning isn't germane to ODP4VPP's >>> success. >>> > We >>> > > need to be focusing on exploiting HW acceleration paths, not trying >>> to >>> > equal >>> > > DPDK SW paths. That's how ODP4VPP will show value. >>> > >>> > Not all Arm platforms have accelerators. There are server CPUs which >>> > are just CPU cores without any accelerators. On such platforms, it is >>> > only software, and every optimization becomes important. These >>> > optimizations are important for platforms with accelerators as well, >>> > why waste cycles? >>> > >>> > odp-dpdk is being used temporarily. Once ODP2.0 is available, the same >>> > solution applies for 2.0 as well. >>> > >>> > So, we need to take out DPDK and x86 from this discussion and have ODP >>> > and Arm to keep the discussion meaningful. >>> > >>> >>> Agreed, but we're not talking about ODP applications here. We're talking >>> about a specific DPDK application (VPP) that we're trying to port to ODP >>> with minimal change. That may or may not be feasible as a pure SW >>> exercise, >>> which is why we need to focus on the accelerated paths rather than trying >>> to turn ODP into a DPDK wannabe. >>> >>> The case for optimal ODP SW implementations is to promote ODP >>> applications, >>> which is not the case here since VPP is not an ODP application. It's >>> actually a DPDK application even though they claim not to be. If we were >>> to >>> turn it into an ODP application it would likely fare better, but that's >>> beyond the scope of this project. >>> >>> >>> > >>> > > >>> > >> >>> > >> >>> > >> > >>> > >> > Since VPP applications don't change if ODP4VPP is in the picture >>> or >>> > not, >>> > >> > it >>> > >> > doesn't matter whether it's used on x86, so tuning ODP4VPP on x86 >>> is >>> > at >>> > >> > best >>> > >> > of secondary importance. We just need at least one Arm platform on >>> > which >>> > >> > VPP >>> > >> > applications run dramatically better than without it. >>> > >> >>> > >> This is not tuning for only x86 platform, it is a tuning that would >>> > >> apply to any platform. >>> > > >>> > > >>> > > No it isn't because you don't know that other platforms have the same >>> > > brittle cache characteristics as this particular measured x86 case. >>> And >>> > even >>> > > here I don't think we've really explored what cache tuning is >>> available >>> > to >>> > > better exploit prefetching. >>> > >>> > One of the defined goals of ODP is to perform well on x86 as well. >>> > There are multiple ways to solve the problem and this is one way which >>> > is showing better performance results. And the implementation that is >>> > being used in this test applies to ODP 2.0. This test is also proving >>> > that, implementation can do a better job at getting the handle to the >>> > packet. >>> > >>> >>> Again, the purpose of good ODP performance on x86 is to support ODP >>> applications. 
A well-written ODP application should perform as well on >>> x86 >>> as it would were it written as a DPDK application and dramatically better >>> on Arm platforms that have HW acceleration capabilities that have no x86 >>> equivalents. That's the whole justification for the ODP project. >>> >>> >>> > >>> > > >>> > >> >>> > >> >>> > >> > >>> > >> >> >>> > >> >> >>> > >> >> >> >>> > >> >> >> >>> > >> >> >> On 7 December 2017 at 14:08, Bill Fischofer >>> > >> >> >> <bill.fischo...@linaro.org> >>> > >> >> >> wrote: >>> > >> >> >> > On Thu, Dec 7, 2017 at 12:22 PM, Michal Mazur >>> > >> >> >> > <michal.ma...@linaro.org> >>> > >> >> >> > wrote: >>> > >> >> >> > >>> > >> >> >> >> Native VPP+DPDK plugin knows the size of rte_mbuf header >>> and >>> > >> >> >> >> subtracts >>> > >> >> >> >> it >>> > >> >> >> >> from the vlib pointer. >>> > >> >> >> >> >>> > >> >> >> >> struct rte_mbuf *mb0 = rte_mbuf_from_vlib_buffer (b0); >>> > >> >> >> >> #define rte_mbuf_from_vlib_buffer(x) (((struct rte_mbuf >>> *)x) - >>> > 1) >>> > >> >> >> >> >>> > >> >> >> > >>> > >> >> >> > No surprise that VPP is a DPDK application, but I thought >>> they >>> > >> >> >> > wanted >>> > >> >> >> > to >>> > >> >> >> > be >>> > >> >> >> > independent of DPDK. The problem is that ODP is never going >>> to >>> > >> >> >> > match >>> > >> >> >> > DPDK >>> > >> >> >> > at an ABI level on x86 so we can't be fixated on x86 >>> performance >>> > >> >> >> > comparisons between ODP4VPP and VPP/DPDK. >>> > >> >> >> Any reason why we will not be able to match or exceed the >>> > >> >> >> performance? >>> > >> >> > >>> > >> >> > >>> > >> >> > It's not that ODP can't have good performance on x86, it's that >>> > DPDK >>> > >> >> > encourages apps to be very dependent on DPDK implementation >>> details >>> > >> >> > such >>> > >> >> > as >>> > >> >> > seen here. ODP is not going to match DPDK internals so >>> applications >>> > >> >> > that >>> > >> >> > exploit such internals will always see a difference. >>> > >> >> > >>> > >> >> >> >>> > >> >> >> >>> > >> >> >> What we need to do is compare >>> > >> >> >> > ODP4VPP on Arm-based SoCs vs. "native VPP" that can't take >>> > >> >> >> > advantage >>> > >> >> >> > of >>> > >> >> >> > the >>> > >> >> >> > HW acceleration present on those platforms. That's how we >>> get to >>> > >> >> >> > show >>> > >> >> >> > dramatic differences. If ODP4VPP is only within a few >>> percent >>> > >> >> >> > (plus >>> > >> >> >> > or >>> > >> >> >> > minus) of VPP/DPDK there's no point of doing the project at >>> all. >>> > >> >> >> > >>> > >> >> >> > So my advice would be to stash the handle in the VLIB >>> buffer for >>> > >> >> >> > now >>> > >> >> >> > and >>> > >> >> >> > focus on exploiting the native IPsec acceleration >>> capabilities >>> > >> >> >> > that >>> > >> >> >> > ODP >>> > >> >> >> > will permit. >>> > >> >> >> > >>> > >> >> >> > >>> > >> >> >> >> On 7 December 2017 at 19:02, Bill Fischofer >>> > >> >> >> >> <bill.fischo...@linaro.org> >>> > >> >> >> >> wrote: >>> > >> >> >> >> >>> > >> >> >> >>> Ping to others on the mailing list for opinions on this. >>> What >>> > >> >> >> >>> does >>> > >> >> >> >>> "native" VPP+DPDK get and how is this problem solved >>> there? 
>>> > >> >> >> >>> >>> > >> >> >> >>> On Thu, Dec 7, 2017 at 11:55 AM, Michal Mazur >>> > >> >> >> >>> <michal.ma...@linaro.org> >>> > >> >> >> >>> wrote: >>> > >> >> >> >>> >>> > >> >> >> >>>> The _odp_packet_inline is common for all packets and >>> takes up >>> > >> >> >> >>>> to >>> > >> >> >> >>>> two >>> > >> >> >> >>>> cachelines (it contains only offsets). Reading pointer >>> for >>> > each >>> > >> >> >> >>>> packet from >>> > >> >> >> >>>> VLIB would require to fetch 10 million cachelines per >>> second. >>> > >> >> >> >>>> Using prefetches does not help. >>> > >> >> >> >>>> >>> > >> >> >> >>>> On 7 December 2017 at 18:37, Bill Fischofer >>> > >> >> >> >>>> <bill.fischo...@linaro.org> >>> > >> >> >> >>>> wrote: >>> > >> >> >> >>>> >>> > >> >> >> >>>>> Yes, but _odp_packet_inline.udate is clearly not in the >>> VLIB >>> > >> >> >> >>>>> cache >>> > >> >> >> >>>>> line >>> > >> >> >> >>>>> either, so it's a separate cache line access. Are you >>> seeing >>> > >> >> >> >>>>> this >>> > >> >> >> >>>>> difference in real runs or microbenchmarks? Why isn't >>> the >>> > >> >> >> >>>>> entire >>> > >> >> >> >>>>> VLIB being >>> > >> >> >> >>>>> prefetched at dispatch? Sequential prefetching should >>> add >>> > >> >> >> >>>>> negligible >>> > >> >> >> >>>>> overhead. >>> > >> >> >> >>>>> >>> > >> >> >> >>>>> On Thu, Dec 7, 2017 at 11:13 AM, Michal Mazur >>> > >> >> >> >>>>> <michal.ma...@linaro.org> >>> > >> >> >> >>>>> wrote: >>> > >> >> >> >>>>> >>> > >> >> >> >>>>>> It seems that only first cache line of VLIB buffer is >>> in >>> > L1, >>> > >> >> >> >>>>>> new >>> > >> >> >> >>>>>> pointer can be placed only in second cacheline. >>> > >> >> >> >>>>>> Using constant offset between user area and ODP header >>> i >>> > get >>> > >> >> >> >>>>>> 11 >>> > >> >> >> >>>>>> Mpps, >>> > >> >> >> >>>>>> with pointer stored in VLIB buffer only 10Mpps and with >>> > this >>> > >> >> >> >>>>>> new >>> > >> >> >> >>>>>> api >>> > >> >> >> >>>>>> 10.6Mpps. >>> > >> >> >> >>>>>> >>> > >> >> >> >>>>>> On 7 December 2017 at 18:04, Bill Fischofer >>> > >> >> >> >>>>>> <bill.fischo...@linaro.org >>> > >> >> >> >>>>>> > wrote: >>> > >> >> >> >>>>>> >>> > >> >> >> >>>>>>> How would calling an API be better than referencing >>> the >>> > >> >> >> >>>>>>> stored >>> > >> >> >> >>>>>>> data >>> > >> >> >> >>>>>>> yourself? A cache line reference is a cache line >>> > reference, >>> > >> >> >> >>>>>>> and >>> > >> >> >> >>>>>>> presumably >>> > >> >> >> >>>>>>> the VLIB buffer is already in L1 since it's your >>> active >>> > >> >> >> >>>>>>> data. >>> > >> >> >> >>>>>>> >>> > >> >> >> >>>>>>> On Thu, Dec 7, 2017 at 10:45 AM, Michal Mazur < >>> > >> >> >> >>>>>>> michal.ma...@linaro.org> wrote: >>> > >> >> >> >>>>>>> >>> > >> >> >> >>>>>>>> Hi, >>> > >> >> >> >>>>>>>> >>> > >> >> >> >>>>>>>> For odp4vpp plugin we need a new API function which, >>> > given >>> > >> >> >> >>>>>>>> user >>> > >> >> >> >>>>>>>> area >>> > >> >> >> >>>>>>>> pointer, will return a pointer to ODP packet buffer. >>> It >>> > is >>> > >> >> >> >>>>>>>> needed >>> > >> >> >> >>>>>>>> when >>> > >> >> >> >>>>>>>> packets processed by VPP are sent back to ODP and >>> only a >>> > >> >> >> >>>>>>>> pointer >>> > >> >> >> >>>>>>>> to >>> > >> >> >> >>>>>>>> VLIB >>> > >> >> >> >>>>>>>> buffer data (stored inside user area) is known. 
>>> > >> >> >> >>>>>>>> >>> > >> >> >> >>>>>>>> I have tried to store the ODP buffer pointer in VLIB >>> data >>> > >> >> >> >>>>>>>> but >>> > >> >> >> >>>>>>>> reading it >>> > >> >> >> >>>>>>>> for every packet lowers performance by 800kpps. >>> > >> >> >> >>>>>>>> >>> > >> >> >> >>>>>>>> For odp-dpdk implementation it can look like: >>> > >> >> >> >>>>>>>> /** @internal Inline function @param uarea @return */ >>> > >> >> >> >>>>>>>> static inline odp_packet_t >>> _odp_packet_from_user_area( >>> > void >>> > >> >> >> >>>>>>>> *uarea) >>> > >> >> >> >>>>>>>> { >>> > >> >> >> >>>>>>>> return (odp_packet_t)((uintptr_t)uarea - >>> > >> >> >> >>>>>>>> _odp_packet_inline.udata); >>> > >> >> >> >>>>>>>> } >>> > >> >> >> >>>>>>>> >>> > >> >> >> >>>>>>>> Please let me know what you think. >>> > >> >> >> >>>>>>>> >>> > >> >> >> >>>>>>>> Thanks, >>> > >> >> >> >>>>>>>> Michal >>> > >> >> >> >>>>>>>> >>> > >> >> >> >>>>>>> >>> > >> >> >> >>>>>>> >>> > >> >> >> >>>>>> >>> > >> >> >> >>>>> >>> > >> >> >> >>>> >>> > >> >> >> >>> >>> > >> >> >> >> >>> > >> >> > >>> > >> >> > >>> > >> > >>> > >> > >>> > > >>> > > >>> > >>> >> >> >> >> -- >> [image: Linaro] <http://www.linaro.org/> >> François-Frédéric Ozog | *Director Linaro Networking Group* >> T: +33.67221.6485 >> francois.o...@linaro.org | Skype: ffozog >> >> > -- [image: Linaro] <http://www.linaro.org/> François-Frédéric Ozog | *Director Linaro Networking Group* T: +33.67221.6485 francois.o...@linaro.org | Skype: ffozog
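PS: to make the proposal concrete, here is a rough sketch of the conversion being discussed. The API is the one mentioned above; the constant-offset body is only an illustration of the odp-dpdk style calculation that Michal quoted (user_area_offset is a hypothetical, implementation-defined constant, and the cast assumes pointer-style packet handles). Other implementations are free to compute the handle differently, or to return ODP_PACKET_INVALID when the conversion cannot be supported.

#include <odp_api.h>
#include <stddef.h>
#include <stdint.h>

/* Proposed API: given a pointer previously returned by
 * odp_packet_user_area(), recover the owning packet handle. */
odp_packet_t odp_packet_from_user_area(void *uarea);

/* Sketch of a constant-offset implementation, valid only when the user
 * area sits at a fixed offset from the packet descriptor.
 * user_area_offset is a hypothetical, implementation-defined constant. */
static const uintptr_t user_area_offset = 128; /* placeholder value */

odp_packet_t odp_packet_from_user_area(void *uarea)
{
    if (uarea == NULL)
        return ODP_PACKET_INVALID;

    return (odp_packet_t)((uintptr_t)uarea - user_area_offset);
}

/* VPP side (hypothetical helper): the vlib_buffer lives at the start of
 * the user area, so the odp_packet_t is recovered by arithmetic instead
 * of being stored inside the vlib_buffer itself. */
static inline odp_packet_t packet_from_vlib_buffer(void *vlib_buf)
{
    return odp_packet_from_user_area(vlib_buf);
}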