Might be off-topic, but do we need to check somewhere in the odp tests that
CPU frequency scaling is turned off? My laptop likes to play with frequencies,
and even warm-up does not help: it ramps up to maximum speed, then overheats
and the CPU frequency is lowered. If we count time in CPU cycles, the
measurements should give about the same number of cycles, but if we count
real time, the results may differ.
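
Something like this would show whether the two clocks disagree when the
governor changes the frequency. This is only a sketch against the public
ODP cycle/time APIs, not the actual odp_bench_packet code; measure_once()
and test_body are illustrative names:

#include <stdint.h>
#include <odp_api.h>

/* Time one invocation of a test body both in CPU cycles and in wall-clock
 * nanoseconds, so the two can be compared when frequency scaling is
 * suspected. */
static void measure_once(void (*test_body)(void), uint64_t *cycles,
                         uint64_t *ns)
{
        uint64_t c1, c2;
        odp_time_t t1, t2;

        c1 = odp_cpu_cycles();
        t1 = odp_time_local();

        test_body();

        t2 = odp_time_local();
        c2 = odp_cpu_cycles();

        /* odp_cpu_cycles_diff() handles counter wrap-around */
        *cycles = odp_cpu_cycles_diff(c2, c1);
        *ns     = odp_time_to_ns(odp_time_diff(t2, t1));
}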

Maxim.

On 15 January 2017 at 18:08, Bill Fischofer <bill.fischo...@linaro.org>
wrote:

> I've been playing around with odp_bench_packet and there are two
> problems with using this test:
>
> 1. The test performs multiple runs of each test, TEST_SIZE_RUN_COUNT
> times, which is good; however, it only reports the average cycle time
> for these runs. There is considerable variability when running in a
> non-isolated environment, so what we really want to report is not just
> the average times but also the minimum and maximum times encountered
> over those runs. From a microbenchmarking standpoint, the minimum
> times are of most interest, since those are the "pure" measures of the
> functions being tested, with the least amount of interference from
> other sources. Statistically speaking, with a large
> TEST_SIZE_RUN_COUNT the minimums should get close to the actual
> isolated time for each test without requiring the system to be fully
> isolated.
>
> 2. Within each test, the function being tested is performed
> TEST_REPEAT_COUNT times, which is set at 1000. This is bad: longer run
> times give more opportunity for "cut aways" and other interruptions
> that distort the measurement being made.
>
> I tried reducing TEST_REPEAT_COUNT to 1, but that doesn't seem to work
> (need to debug that further); a TEST_REPEAT_COUNT of 2 does work, and
> we'll consider that good enough for now. With TEST_REPEAT_COUNT set to
> 2 and TEST_SIZE_RUN_COUNT set to 1000, and adding the accumulation and
> reporting of min/max as well as average times, we get closer to what
> we're really trying to measure here.
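>
> The accumulation I added is roughly of this shape. This is an
> illustrative sketch only; report_run(), bench_fn and the output
> formatting don't match the actual odp_bench_packet code:
>
> #include <stdio.h>
> #include <inttypes.h>
> #include <odp_api.h>
>
> #define TEST_SIZE_RUN_COUNT 1000  /* mirrors the setting discussed above */
>
> /* Run one bench function TEST_SIZE_RUN_COUNT times and report the
>  * min/max/avg cycles per run. */
> static void report_run(const char *name, void (*bench_fn)(void))
> {
>         uint64_t min = UINT64_MAX, max = 0, total = 0;
>         int i;
>
>         for (i = 0; i < TEST_SIZE_RUN_COUNT; i++) {
>                 uint64_t c1 = odp_cpu_cycles();
>
>                 /* bench_fn() repeats the API under test
>                  * TEST_REPEAT_COUNT times internally */
>                 bench_fn();
>
>                 uint64_t c = odp_cpu_cycles_diff(odp_cpu_cycles(), c1);
>
>                 if (c < min)
>                         min = c;
>                 if (c > max)
>                         max = c;
>                 total += c;
>         }
>
>         printf("%-30s: %" PRIu64 " min %" PRIu64 " max %8.1f avg\n",
>                name, min, max, (double)total / TEST_SIZE_RUN_COUNT);
> }
>
> Note that the average printed here is raw per-run cycles; the averages
> in the tables below are scaled differently, as discussed in note 1
> further down.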
>
> So here are the results I'm seeing for the alloc/free tests. I'm only
> looking at the 64-byte runs here (after "warmup") as that's the most
> relevant to what we're trying to measure and tune in this case.
>
> For the "baseline" without packet references support added we see this:
>
> Packet length:     64 bytes
> ---------------------------
> bench_empty                   : 18 min 59 max     11.4 avg
> bench_packet_alloc            : 92 min 1230 max     57.0 avg
> bench_packet_alloc_multi      : 374 min 5440 max    205.0 avg
> bench_packet_free             : 54 min 1360 max     39.6 avg
> bench_packet_free_multi       : 178 min 36343 max    116.0 avg
> bench_packet_alloc_free       : 140 min 41347 max    123.0 avg
> bench_packet_alloc_free_multi : 540 min 41632 max    330.4 avg
>
> The original reference code (what's currently in api-next) shows this:
>
> Packet length:     64 bytes
> ---------------------------
> bench_empty                   : 18 min 788 max     11.5 avg
> bench_packet_alloc            : 93 min 1084 max     61.9 avg
> bench_packet_alloc_multi      : 416 min 6874 max    242.5 avg
> bench_packet_free             : 102 min 744 max     74.6 avg
> bench_packet_free_multi       : 951 min 1513 max    499.4 avg
> bench_packet_alloc_free       : 223 min 37317 max    164.6 avg
> bench_packet_alloc_free_multi : 1403 min 52753 max    795.1 avg
>
> With the v1 optimization patch applied:
>
> Packet length:     64 bytes
> ---------------------------
> bench_empty                   : 18 min 519 max     11.7 avg
> bench_packet_alloc            : 93 min 35733 max     97.2 avg
> bench_packet_alloc_multi      : 419 min 50064 max    249.9 avg
> bench_packet_free             : 66 min 35564 max     62.2 avg
> bench_packet_free_multi       : 538 min 717 max    283.2 avg
> bench_packet_alloc_free       : 178 min 40694 max    135.5 avg
> bench_packet_alloc_free_multi : 1000 min 41731 max    614.6 avg
>
> And finally, with the v3 optimization patch applied:
>
> Packet length:     64 bytes
> ---------------------------
> bench_empty                   : 18 min 305 max     11.6 avg
> bench_packet_alloc            : 93 min 1076 max     62.8 avg
> bench_packet_alloc_multi      : 413 min 122717 max    364.2 avg
> bench_packet_free             : 57 min 522 max     39.4 avg
> bench_packet_free_multi       : 268 min 1852 max    159.9 avg
> bench_packet_alloc_free       : 181 min 1760 max    128.2 avg
> bench_packet_alloc_free_multi : 812 min 34288 max    476.4 avg
>
> A few things to note:
>
> 1. The min/max reports are raw deltas from odp_cpu_cycles_diff() while
> the averages are computed results that are scaled differently. As a
> result, they cannot be compared directly with the min/max numbers,
> which is why we have anomalies like the "average" being lower than the
> "min". I should normalize these to be on the same scales but I haven't
> done that yet. So for now ignore the averages except to note that they
> move around a lot.
>
> 2. Bench_empty, which is the "do nothing" test used to calibrate the
> fixed overhead of these tests, reports the same minimums across all of
> these while the maxes and averages differ significantly. This proves
> the point about the need to use minimums for this sort of test.
>
> 3. The min for bench_packet_alloc is pretty much the same across all
> of these (92 vs 93). This is reassuring since as noted in my original
> response, the code deltas here are just a couple of extra metadata
> fields being initialized, so the widely differing averages are an
> artifact of the problems noted at the top of this post.
>
> 4. The bench_packet_alloc_multi test shows a bit more overhead (374
> vs. 413-419). This warrants further investigation, but I haven't
> done that yet. Note, however, that this is nowhere near the doubling
> that was originally reported.
>
> 5. The bench_packet_free test shows the problem originally identified
> with the new code (54 vs 102) but the optimizations (v1 = 66, v3 = 57)
> have brought that back down to what should be insignificant levels.
>
> 6. The bench_packet_free_multi test also shows the original problem
> (178 vs 951) as well as the improvements made by the optimizations (v1
> = 538, v3 = 268). There's probably still some room for improvement
> here, but clearly this is a tractable problem at this point.
>
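> As a rough sketch of the normalization mentioned in note 1 (assuming
> the averages are scaled by the repeat count; I still need to confirm
> the exact factor), it would be little more than:
>
>     double min_scaled = (double)min / TEST_REPEAT_COUNT;
>     double max_scaled = (double)max / TEST_REPEAT_COUNT;
>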
> I'll work my experiments with odp_bench_packet into a patch set that
> can be used to improve this very worthwhile framework.
>
>
> On Sat, Jan 14, 2017 at 2:02 PM, Bill Fischofer
> <bill.fischo...@linaro.org> wrote:
> > Please ignore v2. It had a memory leak that Matias' excellent
> > odp_bench_packet test found. A corrected v3 is posted at
> > http://patches.opendataplane.org/patch/7879/
> >
> > On Sat, Jan 14, 2017 at 10:04 AM, Bill Fischofer
> > <bill.fischo...@linaro.org> wrote:
> >> Hi Petri,
> >>
> >> I've posted v2 of the tuning patch at
> >> http://patches.opendataplane.org/patch/7878/ which adds optimization
> >> to the odp_packet_free_multi() routine as well as odp_packet_free().
> >> So both of these should now be very similar for freeing reference
> >> and non-reference packets.
> >>
> >> I'd still like access to the program that you were using to generate
> >> your cycle stats. The odp_scheduling test in the performance test dir
> >> seems too erratic from run to run to enable this sort of comparison.
> >>
> >> One restructure included in this patch should benefit non-reference
> >> paths as well as reference paths. I noticed that the
> >> buffer_free_multi() routine is prepared to handle buffer lists coming
> >> from different pools and does the checking for pool crossings itself.
> >> This is wasteful because that routine is used internally when
> >> dealing with multi-segment packets and these internal uses never mix
> >> buffers from different pools. The only place that can happen is in the
> >> external odp_packet_free_multi() and odp_buffer_free_multi() APIs, so
> >> I've moved that checking up to those APIs so buffer_free_multi() now
> >> just front-ends buffer_free_to_pool(). The optimized internal packet
> >> free routines now call buffer_free_to_pool() directly.
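> >>
> >> In rough pseudocode, the split looks something like this. It's a
> >> simplified sketch, not the actual patch: pool_of() is a stand-in for
> >> however the pool is looked up, and the buffer_free_to_pool()
> >> signature here is only illustrative:
> >>
> >> void odp_buffer_free_multi(const odp_buffer_t buf[], int num)
> >> {
> >>         pool_t *pool = pool_of(buf[0]);
> >>         int i, first = 0;
> >>
> >>         /* Pool-crossing scan lives here, in the external API only */
> >>         for (i = 1; i < num; i++) {
> >>                 pool_t *p = pool_of(buf[i]);
> >>
> >>                 if (p != pool) {
> >>                         /* Flush the run of buffers sharing 'pool' */
> >>                         buffer_free_to_pool(pool, &buf[first], i - first);
> >>                         pool  = p;
> >>                         first = i;
> >>                 }
> >>         }
> >>
> >>         buffer_free_to_pool(pool, &buf[first], num - first);
> >> }
> >>
> >> For the common internal case where all buffers come from one pool the
> >> inner branch never fires, so the cost is a single scan plus one call.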
> >>
> >> If you can confirm that these are heading in the right direction we
> >> can see what else might be needed from a tuning perspective.
> >>
> >> Thanks.
> >>
> >> On Thu, Jan 12, 2017 at 4:15 PM, Bill Fischofer
> >> <bill.fischo...@linaro.org> wrote:
> >>> I've posted patch http://patches.opendataplane.org/patch/7857/ as an
> >>> initial step towards resolving this issue. Petri: can you make your
> >>> test program available to me so I can try to reproduce your results
> >>> locally. I don't have quite the scale or controlled environment that
> >>> you do but this should save iteration steps.
> >>>
> >>> This patch adds a fastpath to odp_packet_free() that should more
> >>> closely match the pre-refs code for non-reference packets.
> >>> odp_packet_free_multi() is more difficult because there is no
> >>> guarantee that all of the packets passed to it are non-references, so
> >>> I'm not sure how to match the prior code exactly in that case. This
> >>> code just calls packet_free() for each of the buffers, so those
> >>> individual calls will be optimized, but the work is still done via
> >>> individual calls rather than a single batched call.
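> >>>
> >>> The shape of the fastpath is roughly this (a simplified sketch, not
> >>> the actual patch; packet_ref_count(), free_packet_segments() and
> >>> packet_free_with_refs() are illustrative stand-ins):
> >>>
> >>> void odp_packet_free(odp_packet_t pkt)
> >>> {
> >>>         odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
> >>>
> >>>         /* Single reference, no ref chain: take the original path */
> >>>         if (pkt_hdr->ref_hdr == NULL &&
> >>>             packet_ref_count(pkt_hdr) == 1) {
> >>>                 free_packet_segments(pkt_hdr);
> >>>                 return;
> >>>         }
> >>>
> >>>         /* Shared packet: fall back to the reference-aware path */
> >>>         packet_free_with_refs(pkt_hdr);
> >>> }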
> >>>
> >>> On the alloc side I'm not sure what could be going on, other than the
> >>> fact that supporting references requires additional packet metadata,
> >>> which probably spills onto an extra cache line compared to the
> >>> original code. As noted earlier, the alloc() changes simply
> >>> initialize this data, which shouldn't cost more than a few extra
> >>> cycles.
> >>>
> >>> Thanks.
> >>>
> >>> On Thu, Jan 12, 2017 at 7:22 AM, Bill Fischofer
> >>> <bill.fischo...@linaro.org> wrote:
> >>>> On Thu, Jan 12, 2017 at 6:58 AM, Bill Fischofer
> >>>> <bill.fischo...@linaro.org> wrote:
> >>>>> On Thu, Jan 12, 2017 at 6:52 AM, Savolainen, Petri (Nokia - FI/Espoo)
> >>>>> <petri.savolai...@nokia-bell-labs.com> wrote:
> >>>>>>
> >>>>>>> >> > Huge performance degradation. Numbers are now many times worse
> >>>>>>> >> > than before or after my optimizations. To me this shows that
> >>>>>>> >> > almost a complete rewrite (or revert) is needed.
> >>>>>>> >>
> >>>>>>> >> My guess is this is due to the atomics needed for reference
> >>>>>>> >> counting not being properly inlined internally. I know you did
> >>>>>>> >> similar optimizations for ticketlocks. Let me look into this and
> >>>>>>> >> post a patch that does the same for the atomics.
> >>>>>>> >
> >>>>>>> > Atomics are inlined already. Also atomic operations should not be
> >>>>>>> > required if there's a single reference.
> >>>>>>>
> >>>>>>> The ref_count still needs to be initialized on alloc and
> >>>>>>> decremented on free to see if the packet should really be freed.
> >>>>>>> The only way to know whether you're dealing with a single reference
> >>>>>>> or not is to check. Do you have a breakdown of where the hotspots
> >>>>>>> are? These numbers do seem high.
> >>>>>>>
> >>>>>>
> >>>>>> No, I didn't look into it any deeper.
> >>>>>
> >>>>> If we look at the comparative path length for packet_alloc(), the only
> >>>>> changes are to the internal routines packet_init() and
> >>>>> init_segments(). The rest of the code is identical.
> >>>>>
> >>>>> For packet_init() the delta is these two added lines:
> >>>>>
> >>>>> /* By default packet has no references */
> >>>>> pkt_hdr->unshared_len = len;
> >>>>> pkt_hdr->ref_hdr = NULL;
> >>>>>
> >>>>> For init_segments() the delta is similarly tiny:
> >>>>>
> >>>>> packet_ref_count_set(hdr, 1);
> >>>>>
> >>>>> Where this is an inline function that expands to
> >>>>> odp_atomic_init_u32(&hdr->ref_count, 1);
> >>>>>
> >>>>> The changes on the packet_free() side are similarly tiny. Basically
> >>>>> we decrement the ref_count and only actually free the buffer if the
> >>>>> ref_count is now zero.
> >>>>>
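> >>>>> In sketch form, the free-side delta is essentially just the
> >>>>> following, with buffer_free() standing in for the pre-existing free
> >>>>> path:
> >>>>>
> >>>>>         if (odp_atomic_fetch_dec_u32(&hdr->ref_count) == 1)
> >>>>>                 buffer_free(hdr);  /* last reference: really free it */
> >>>>>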
> >>>>> I don't see how these translate into a near doubling of cycles for
> >>>>> these tests. It seems there must be something else going on.
> >>>>
> >>>> I think I see some potentially larger deltas in the packet free path.
> >>>> I'll dig into this further and post a patch.
> >>>>
> >>>>>
> >>>>>>
> >>>>>> -Petri
>
