On Fri, Jul 15, 2022 at 07:16:17PM +0530, Jerin Jacob wrote:
> On Fri, Jul 15, 2022 at 6:42 PM Anatoly Burakov
> <anatoly.bura...@intel.com> wrote:
> >
> > Currently, there is no way to measure lcore busyness in a passive way,
> > without any modifications to the application. This patch adds a new EAL
> > API that will be able to passively track core busyness.
> >
> > The busyness is calculated by relying on the fact that most DPDK API's
> > will poll for packets. Empty polls can be counted as "idle", while
> > non-empty polls can be counted as busy. To measure lcore busyness, we
> > simply call the telemetry timestamping function with the number of polls
> > a particular code section has processed, and count the number of cycles
> > we've spent processing empty bursts. The more empty bursts we encounter,
> > the less cycles we spend in "busy" state, and the less core busyness
> > will be reported.
> >
> > In order for all of the above to work without modifications to the
> > application, the library code needs to be instrumented with calls to
> > the lcore telemetry busyness timestamping function. The following parts
> > of DPDK are instrumented with lcore telemetry calls:
> >
> > - All major driver API's:
> >   - ethdev
> >   - cryptodev
> >   - compressdev
> >   - regexdev
> >   - bbdev
> >   - rawdev
> >   - eventdev
> >   - dmadev
> > - Some additional libraries:
> >   - ring
> >   - distributor
> >
> > To avoid performance impact from having lcore telemetry support, a
> > global variable is exported by EAL, and a call to timestamping function
> > is wrapped into a macro, so that whenever telemetry is disabled, it only
> > takes one additional branch and no function calls are performed. It is
> > also possible to disable it at compile time by commenting out
> > RTE_LCORE_BUSYNESS from build config.
> >
> > This patch also adds a telemetry endpoint to report lcore busyness, as
> > well as telemetry endpoints to enable/disable lcore telemetry.
> >
> > Signed-off-by: Kevin Laatz <kevin.la...@intel.com>
> > Signed-off-by: Conor Walsh <conor.wa...@intel.com>
> > Signed-off-by: David Hunt <david.h...@intel.com>
> > Signed-off-by: Anatoly Burakov <anatoly.bura...@intel.com>
> 
> 
> It is a good feature. Thanks for this work.
> 

+1
Some follow-up comments inline below.

/Bruce

> 
> > ---
> >
> > Notes:
> >     We did a couple of quick smoke tests to see if this patch causes any 
> > performance
> >     degradation, and it seemed to have none that we could measure. 
> > Telemetry can be
> >     disabled at compile time via a config option, while at runtime it can be
> >     disabled, seemingly at a cost of one additional branch.
> >
> >     That said, our benchmarking efforts were admittedly not very rigorous, 
> > so
> >     comments welcome!
> 
> >
> > diff --git a/config/rte_config.h b/config/rte_config.h
> > index 46549cb062..583cb6f7a5 100644
> > --- a/config/rte_config.h
> > +++ b/config/rte_config.h
> > @@ -39,6 +39,8 @@
> >  #define RTE_LOG_DP_LEVEL RTE_LOG_INFO
> >  #define RTE_BACKTRACE 1
> >  #define RTE_MAX_VFIO_CONTAINERS 64
> > +#define RTE_LCORE_BUSYNESS 1
> 
> Please don't enable debug features in fastpath as default.
> 

I would disagree that this is a debug feature. The number of times I have
heard from DPDK users that they wish to get more visibility into what the
app is doing rather than seeing 100% cpu busy. Therefore, I'd see this as
enabled by default rather than disabled - unless it's shown to have a
performance regression.

That said, since this impacts multiple components and it's something that
end users might want to disable at build-time, I'd suggest moving it to
meson_options.txt file and have it as an official DPDK build-time option.

> > +#define RTE_LCORE_BUSYNESS_PERIOD 4000000ULL
> >
> > +
> > +#include <unistd.h>
> > +#include <limits.h>
> > +#include <string.h>
> > +
> > +#include <rte_common.h>
> > +#include <rte_cycles.h>
> > +#include <rte_errno.h>
> > +#include <rte_lcore.h>
> > +
> > +#ifdef RTE_LCORE_BUSYNESS
> 
> This clutter may not be required. Let it compile all cases.
> 
> > +#include <rte_telemetry.h>
> > +#endif
> > +
> > +int __rte_lcore_telemetry_enabled;
> > +
> > +#ifdef RTE_LCORE_BUSYNESS
> > +
> > +struct lcore_telemetry {
> > +       int busyness;
> > +       /**< Calculated busyness (gets set/returned by the API) */
> > +       int raw_busyness;
> > +       /**< Calculated busyness times 100. */
> > +       uint64_t interval_ts;
> > +       /**< when previous telemetry interval started */
> > +       uint64_t empty_cycles;
> > +       /**< empty cycle count since last interval */
> > +       uint64_t last_poll_ts;
> > +       /**< last poll timestamp */
> > +       bool last_empty;
> > +       /**< if last poll was empty */
> > +       unsigned int contig_poll_cnt;
> > +       /**< contiguous (always empty/non empty) poll counter */
> > +} __rte_cache_aligned;
> > +static struct lcore_telemetry telemetry_data[RTE_MAX_LCORE];
> 
> Allocate this from hugepage.
> 

Yes, whether or not it's allocated from hugepages, dynamic allocation would
be better than having static vars.

> > +
> > +void __rte_lcore_telemetry_timestamp(uint16_t nb_rx)
> > +{
> > +       const unsigned int lcore_id = rte_lcore_id();
> > +       uint64_t interval_ts, empty_cycles, cur_tsc, last_poll_ts;
> > +       struct lcore_telemetry *tdata = &telemetry_data[lcore_id];
> > +       const bool empty = nb_rx == 0;
> > +       uint64_t diff_int, diff_last;
> > +       bool last_empty;
> > +
> > +       last_empty = tdata->last_empty;
> > +
> > +       /* optimization: don't do anything if status hasn't changed */
> > +       if (last_empty == empty && tdata->contig_poll_cnt++ < 32)
> > +               return;
> > +       /* status changed or we're waiting for too long, reset counter */
> > +       tdata->contig_poll_cnt = 0;
> > +
> > +       cur_tsc = rte_rdtsc();
> > +
> > +       interval_ts = tdata->interval_ts;
> > +       empty_cycles = tdata->empty_cycles;
> > +       last_poll_ts = tdata->last_poll_ts;
> > +
> > +       diff_int = cur_tsc - interval_ts;
> > +       diff_last = cur_tsc - last_poll_ts;
> > +
> > +       /* is this the first time we're here? */
> > +       if (interval_ts == 0) {
> > +               tdata->busyness = LCORE_BUSYNESS_MIN;
> > +               tdata->raw_busyness = 0;
> > +               tdata->interval_ts = cur_tsc;
> > +               tdata->empty_cycles = 0;
> > +               tdata->contig_poll_cnt = 0;
> > +               goto end;
> > +       }
> > +
> > +       /* update the empty counter if we got an empty poll earlier */
> > +       if (last_empty)
> > +               empty_cycles += diff_last;
> > +
> > +       /* have we passed the interval? */
> > +       if (diff_int > RTE_LCORE_BUSYNESS_PERIOD) {
> 
> 
> I think, this function logic can be limited to just updating the
> timestamp in the ring buffer,
> and another control function of telemetry which runs in control core to do
> heavy lifting to reduce the performance impact on fast path,
> 
> > +               int raw_busyness;
> > +
> > +               /* get updated busyness value */
> > +               raw_busyness = calc_raw_busyness(tdata, empty_cycles, 
> > diff_int);
> > +
> > +               /* set a new interval, reset empty counter */
> > +               tdata->interval_ts = cur_tsc;
> > +               tdata->empty_cycles = 0;
> > +               tdata->raw_busyness = raw_busyness;
> > +               /* bring busyness back to 0..100 range, biased to round up 
> > */
> > +               tdata->busyness = (raw_busyness + 50) / 100;
> > +       } else
> > +               /* we may have updated empty counter */
> > +               tdata->empty_cycles = empty_cycles;
> > +
> > +end:
> > +       /* update status for next poll */
> > +       tdata->last_poll_ts = cur_tsc;
> > +       tdata->last_empty = empty;
> > +}
> > +
> > +#ifdef RTE_LCORE_BUSYNESS
> > +#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx)                    \
> > +       do {                                                    \
> > +               if (__rte_lcore_telemetry_enabled)              \
> 
> I think, rather than reading memory, Like Linux perf infrastructure,
> we can patch up the instruction steam as NOP vs Timestamp capture
> function.
> 
Surely that requires much more complicated tooling? How would that work in
this situation?

> 
> Also instead of changing all libraries, Maybe we can use
> "-finstrument-functions".
> and just mark the function with attribute.
> 
> Just 2c.
> 
> > +                       __rte_lcore_telemetry_timestamp(nb_rx); \
> > +       } while (0)
> > +#else
> > +#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx) \
> > +       while (0)
> > +#endif
> > +

Reply via email to