On 18.01.2018 20:21, O Mahony, Billy wrote:
> Hi All,
> 
> The thread gets updated before I can finish my review, so I'm just going to 
> jump into this thread rather than reply to the o/p with a fresh thread. I 
> hope that's ok.
> 
> I've run the patch and the issues seen with previous revisions are fixed.
> 
> Perhaps add an error message for stats-clear when the pmd indicated by the 
> -pmd flag does not exist.
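> Something like this in the PMD_INFO_CLEAR_STATS branch of
> dpif_netdev_pmd_info() maybe (rough sketch; 'pmd_found' is a hypothetical
> flag that the existing loop would set when it sees a pmd with the matching
> core_id):
> 
>     if (filter_on_pmd && !pmd_found) {
>         ds_put_format(&reply, "No PMD thread on core %u.\n", core_id);
>     }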
> 
> One or two other comments/questions inline, below.
> 
> Cheers,
> /Billy. 
> 
> 
>> -----Original Message-----
>> From: Jan Scheurich [mailto:jan.scheur...@ericsson.com]
>> Sent: Thursday, January 18, 2018 4:35 PM
>> To: Ilya Maximets <i.maxim...@samsung.com>; d...@openvswitch.org
>> Cc: ktray...@redhat.com; Stokes, Ian <ian.sto...@intel.com>; O Mahony,
>> Billy <billy.o.mah...@intel.com>
>> Subject: RE: [PATCH v7 2/3] dpif-netdev: Detailed performance stats for PMDs
>>
>> Hi Ilya,
>>
>> Thanks for your in-depth review and comments. Please find my answers
>> below.
>>
>> BR, Jan
>>
>>> -----Original Message-----
>>> From: Ilya Maximets [mailto:i.maxim...@samsung.com]
>>> Sent: Wednesday, 17 January, 2018 13:13
>>> To: Jan Scheurich <jan.scheur...@ericsson.com>; d...@openvswitch.org
>>> Cc: ktray...@redhat.com; ian.sto...@intel.com; billy.o.mah...@intel.com
>>> Subject: Re: [PATCH v7 2/3] dpif-netdev: Detailed performance stats for PMDs
>>>
>>> Not a full review.
>>> Comments inline.
>>>
>>> On 16.01.2018 04:51, Jan Scheurich wrote:
>>>> This patch instruments the dpif-netdev datapath to record detailed
>>>> statistics of what is happening in every iteration of a PMD thread.
>>>>
>>>> The collection of detailed statistics can be controlled by a new
>>>> Open_vSwitch configuration parameter "other_config:pmd-perf-metrics".
>>>> By default it is disabled. The run-time overhead, when enabled, is
>>>> on the order of 1%.
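>>>> It can be enabled at run time, for example:
>>>>
>>>>     ovs-vsctl set Open_vSwitch . other_config:pmd-perf-metrics=true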
>>>>
>>>> The covered metrics per iteration are:
>>>>   - cycles
>>>>   - packets
>>>>   - (rx) batches
>>>>   - packets/batch
>>>>   - max. vhostuser qlen
>>>>   - upcalls
>>>>   - cycles spent in upcalls
>>>>
>>>> This raw recorded data is used threefold:
>>>>
>>>> 1. In histograms for each of the following metrics:
>>>>    - cycles/iteration (log.)
>>>>    - packets/iteration (log.)
>>>>    - cycles/packet
>>>>    - packets/batch
>>>>    - max. vhostuser qlen (log.)
>>>>    - upcalls
>>>>    - cycles/upcall (log)
>>>>    The histogram bins are divided linearly or logarithmically.
>>>>
>>>> 2. A cyclic history of the above statistics for 999 iterations
>>>>
>>>> 3. A cyclic history of the cumulative/average values per millisecond
>>>>    wall clock for the last 1000 milliseconds:
>>>>    - number of iterations
>>>>    - avg. cycles/iteration
>>>>    - packets (Kpps)
>>>>    - avg. packets/batch
>>>>    - avg. max vhost qlen
>>>>    - upcalls
>>>>    - avg. cycles/upcall
>>>>
>>>> The gathered performance metrics can be printed at any time with the
>>>> new CLI command
>>>>
>>>> ovs-appctl dpif-netdev/pmd-perf-show [-nh] [-it iter_len] [-ms ms_len]
>>>>     [-pmd core | dp]
>>>>
>>>> The options are
>>>>
>>>> -nh:            Suppress the histograms
>>>> -it iter_len:   Display the last iter_len iteration stats
>>>> -ms ms_len:     Display the last ms_len millisecond stats
>>>> -pmd core:      Display only the specified PMD
>>>>
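>>>> For example, to display the stats of the last 5 iterations of the PMD on
>>>> core 1, without histograms:
>>>>
>>>> ovs-appctl dpif-netdev/pmd-perf-show -nh -it 5 -pmd 1
>>>>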
>>>> The performance statistics are reset with the existing
>>>> dpif-netdev/pmd-stats-clear command.
>>>>
>>>> The output always contains the following global PMD statistics,
>>>> similar to the pmd-stats-show command:
>>>>
>>>> Time: 15:24:55.270
>>>> Measurement duration: 1.008 s
>>>>
>>>> pmd thread numa_id 0 core_id 1:
>>>>
>>>>   Cycles:            2419034712  (2.40 GHz)
>>>>   Iterations:            572817  (1.76 us/it)
>>>>   - idle:                486808  (15.9 % cycles)
>>>>   - busy:                 86009  (84.1 % cycles)
>>>>   Packets:              2399607  (2381 Kpps, 848 cycles/pkt)
>>>>   Datapath passes:      3599415  (1.50 passes/pkt)
>>>>   - EMC hits:            336472  ( 9.3 %)
>>>>   - Megaflow hits:      3262943  (90.7 %, 1.00 subtbl lookups/hit)
>>>>   - Upcalls:                  0  ( 0.0 %, 0.0 us/upcall)
>>>>   - Lost upcalls:             0  ( 0.0 %)
>>>>
>>>> Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
>>>> ---
>>>>  NEWS                        |   3 +
>>>>  lib/automake.mk             |   1 +
>>>>  lib/dp-packet.h             |   1 +
>>>>  lib/dpif-netdev-perf.c      | 333 +++++++++++++++++++++++++++++++++++++++++++-
>>>>  lib/dpif-netdev-perf.h      | 239 +++++++++++++++++++++++++++++--
>>>>  lib/dpif-netdev.c           | 177 +++++++++++++++++++++--
>>>>  lib/netdev-dpdk.c           |  13 +-
>>>>  lib/netdev-dpdk.h           |  14 ++
>>>>  lib/netdev-dpif-unixctl.man | 113 +++++++++++++++
>>>>  manpages.mk                 |   2 +
>>>>  vswitchd/ovs-vswitchd.8.in  |  27 +---
>>>>  vswitchd/vswitch.xml        |  12 ++
>>>>  12 files changed, 881 insertions(+), 54 deletions(-)
>>>>  create mode 100644 lib/netdev-dpif-unixctl.man
>>>>
>>>> diff --git a/NEWS b/NEWS
>>>> index 2c28456..743528e 100644
>>>> --- a/NEWS
>>>> +++ b/NEWS
>>>> @@ -44,6 +44,9 @@ Post-v2.8.0
>>>>            if available (for OpenFlow 1.4+).
>>>>     - Userspace datapath:
>>>>       * Output packet batching support.
>>>> +     * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
>>>> +     * Detailed PMD performance metrics available with new command
>>>> +         ovs-appctl dpif-netdev/pmd-perf-show
>>>>     - vswitchd:
>>>>       * Datapath IDs may now be specified as 0x1 (etc.) instead of 16 
>>>> digits.
>>>>       * Configuring a controller, or unconfiguring all controllers, now 
>>>> deletes
>>>> diff --git a/lib/automake.mk b/lib/automake.mk
>>>> index 159319f..d07cbe9 100644
>>>> --- a/lib/automake.mk
>>>> +++ b/lib/automake.mk
>>>> @@ -468,6 +468,7 @@ MAN_FRAGMENTS += \
>>>>    lib/dpctl.man \
>>>>    lib/memory-unixctl.man \
>>>>    lib/netdev-dpdk-unixctl.man \
>>>> +  lib/netdev-dpif-unixctl.man \
>>>>    lib/ofp-version.man \
>>>>    lib/ovs.tmac \
>>>>    lib/service.man \
>>>> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
>>>> index b4b721c..3d65088 100644
>>>> --- a/lib/dp-packet.h
>>>> +++ b/lib/dp-packet.h
>>>> @@ -697,6 +697,7 @@ struct dp_packet_batch {
>>>>      size_t count;
>>>>      bool trunc; /* true if the batch needs truncate. */
>>>>      struct dp_packet *packets[NETDEV_MAX_BURST];
>>>> +
>>>>  };
>>>>
>>>>  static inline void
>>>> diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c
>>>> index f06991a..e0ef15d 100644
>>>> --- a/lib/dpif-netdev-perf.c
>>>> +++ b/lib/dpif-netdev-perf.c
>>>> @@ -15,6 +15,7 @@
>>>>   */
>>>>
>>>>  #include <config.h>
>>>> +#include <stdint.h>
>>>>
>>>>  #include "openvswitch/dynamic-string.h"
>>>>  #include "openvswitch/vlog.h"
>>>> @@ -23,10 +24,299 @@
>>>>
>>>>  VLOG_DEFINE_THIS_MODULE(pmd_perf);
>>>>
>>>> +#ifdef DPDK_NETDEV
>>>> +static uint64_t
>>>> +get_tsc_hz(void)
>>>> +{
>>>> +    return rte_get_tsc_hz();
>>>> +}
>>>> +#else
>>>> +/* This function is only invoked from PMD threads which depend on
>> DPDK.
>>>> + * A dummy function is sufficient when building without DPDK_NETDEV.
>> */
>>>> +static uint64_t
>>>> +get_tsc_hz(void)
>>>> +{
>>>> +    return 1;
>>>> +}
>>>> +#endif
>>>> +
>>>> +/* Histogram functions. */
>>>> +
>>>> +static void
>>>> +histogram_walls_set_lin(struct histogram *hist, uint32_t min, uint32_t max)
>>>> +{
>>>> +    int i;
>>>> +
>>>> +    ovs_assert(min < max);
>>>> +    for (i = 0; i < NUM_BINS-1; i++) {
>>>> +        hist->wall[i] = min + (i * (max - min)) / (NUM_BINS - 2);
>>>> +    }
>>>> +    hist->wall[NUM_BINS-1] = UINT32_MAX;
>>>> +}
>>>> +
>>>> +static void
>>>> +histogram_walls_set_log(struct histogram *hist, uint32_t min, uint32_t max)
>>>> +{
>>>> +    int i, start, bins, wall;
>>>> +    double log_min, log_max;
>>>> +
>>>> +    ovs_assert(min < max);
>>>> +    if (min > 0) {
>>>> +        log_min = log(min);
>>>> +        log_max = log(max);
>>>> +        start = 0;
>>>> +        bins = NUM_BINS - 1;
>>>> +    } else {
>>>> +        hist->wall[0] = 0;
>>>> +        log_min = log(1);
>>>> +        log_max = log(max);
>>>> +        start = 1;
>>>> +        bins = NUM_BINS - 2;
>>>> +    }
>>>> +    wall = start;
>>>> +    for (i = 0; i < bins; i++) {
>>>> +        /* Make sure each wall is monotonically increasing. */
>>>> +        wall = MAX(wall, exp(log_min + (i * (log_max - log_min)) / (bins-1)));
>>>> +        hist->wall[start + i] = wall++;
>>>> +    }
>>>> +    if (hist->wall[NUM_BINS-2] < max) {
>>>> +        hist->wall[NUM_BINS-2] = max;
>>>> +    }
>>>> +    hist->wall[NUM_BINS-1] = UINT32_MAX;
>>>> +}
>>>> +
>>>> +uint64_t
>>>> +histogram_samples(const struct histogram *hist)
>>>> +{
>>>> +    uint64_t samples = 0;
>>>> +
>>>> +    for (int i = 0; i < NUM_BINS; i++) {
>>>> +        samples += hist->bin[i];
>>>> +    }
>>>> +    return samples;
>>>> +}
>>>> +
>>>> +static void
>>>> +histogram_clear(struct histogram *hist)
>>>> +{
>>>> +    int i;
>>>> +
>>>> +    for (i = 0; i < NUM_BINS; i++) {
>>>> +        hist->bin[i] = 0;
>>>> +    }
>>>> +}
>>>> +
>>>> +static void
>>>> +history_init(struct history *h)
>>>> +{
>>>> +    memset(h, 0, sizeof(*h));
>>>> +}
>>>> +
>>>>  void
>>>>  pmd_perf_stats_init(struct pmd_perf_stats *s)
>>>>  {
>>>> -    memset(s, 0 , sizeof(*s));
>>>> +    memset(s, 0, sizeof(*s));
>>>> +    histogram_walls_set_log(&s->cycles, 500, 24000000);
>>>> +    histogram_walls_set_log(&s->pkts, 0, 1000);
>>>> +    histogram_walls_set_lin(&s->cycles_per_pkt, 100, 30000);
>>>> +    histogram_walls_set_lin(&s->pkts_per_batch, 0, 32);
>>>> +    histogram_walls_set_lin(&s->upcalls, 0, 30);
>>>> +    histogram_walls_set_log(&s->cycles_per_upcall, 1000, 1000000);
>>>> +    histogram_walls_set_log(&s->max_vhost_qfill, 0, 512);
>>>> +    s->start_ms = time_msec();
>>>> +}
>>>> +
>>>> +void
>>>> +pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
>>>> +                              double duration)
>>>> +{
>>>> +    uint64_t stats[PMD_N_STATS];
>>>> +    double us_per_cycle = 1000000.0 / get_tsc_hz();
>>>> +
>>>> +    if (duration == 0) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    pmd_perf_read_counters(s, stats);
>>>> +    uint64_t tot_cycles = stats[PMD_CYCLES_ITER_IDLE] +
>>>> +                          stats[PMD_CYCLES_ITER_BUSY];
>>>> +    uint64_t packets = stats[PMD_STAT_RECV];
>>>> +    uint64_t passes = stats[PMD_STAT_RECV] +
>>>> +                      stats[PMD_STAT_RECIRC];
>>>> +    uint64_t upcalls = stats[PMD_STAT_MISS];
>>>> +    uint64_t upcall_cycles = stats[PMD_CYCLES_UPCALL];
>>>> +    uint64_t tot_iter = histogram_samples(&s->pkts);
>>>> +    uint64_t idle_iter = s->pkts.bin[0];
>>>> +    uint64_t busy_iter = tot_iter >= idle_iter ? tot_iter - idle_iter : 0;
>>>> +
>>>> +    ds_put_format(str,
>>>> +            "  Cycles:          %12"PRIu64"  (%.2f GHz)\n"
>>>> +            "  Iterations:      %12"PRIu64"  (%.2f us/it)\n"
>>>> +            "  - idle:          %12"PRIu64"  (%4.1f %% cycles)\n"
>>>> +            "  - busy:          %12"PRIu64"  (%4.1f %% cycles)\n",
>>>> +            tot_cycles, (tot_cycles / duration) / 1E9,
>>>> +            tot_iter, tot_cycles * us_per_cycle / tot_iter,
>>>> +            idle_iter,
>>>> +            100.0 * stats[PMD_CYCLES_ITER_IDLE] / tot_cycles,
>>>> +            busy_iter,
>>>> +            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles);
>>>> +    if (packets > 0) {
>>>> +        ds_put_format(str,
>>>> +            "  Packets:         %12"PRIu64"  (%.0f Kpps, %.0f 
>>>> cycles/pkt)\n"
>>>> +            "  Datapath passes: %12"PRIu64"  (%.2f passes/pkt)\n"
>>>> +            "  - EMC hits:      %12"PRIu64"  (%4.1f %%)\n"
>>>> +            "  - Megaflow hits: %12"PRIu64"  (%4.1f %%, %.2f subtbl 
>>>> lookups/"
>>>> +                                                                     
>>>> "hit)\n"
>>>> +            "  - Upcalls:       %12"PRIu64"  (%4.1f %%, %.1f us/upcall)\n"
>>>> +            "  - Lost upcalls:  %12"PRIu64"  (%4.1f %%)\n"
>>>> +            "\n",
>>>> +            packets, (packets / duration) / 1000,
>>>> +            1.0 * stats[PMD_CYCLES_ITER_BUSY] / packets,
>>>> +            passes, packets ? 1.0 * passes / packets : 0,
>>>> +            stats[PMD_STAT_EXACT_HIT],
>>>> +            100.0 * stats[PMD_STAT_EXACT_HIT] / passes,
>>>> +            stats[PMD_STAT_MASKED_HIT],
>>>> +            100.0 * stats[PMD_STAT_MASKED_HIT] / passes,
>>>> +            stats[PMD_STAT_MASKED_HIT]
>>>> +            ? 1.0 * stats[PMD_STAT_MASKED_LOOKUP] / stats[PMD_STAT_MASKED_HIT]
>>>> +            : 0,
>>>> +            upcalls, 100.0 * upcalls / passes,
>>>> +            upcalls ? (upcall_cycles * us_per_cycle) / upcalls : 0,
>>>> +            stats[PMD_STAT_LOST],
>>>> +            100.0 * stats[PMD_STAT_LOST] / passes);
>>>> +    } else {
>>>> +        ds_put_format(str,
>>>> +                "  Packets:         %12"PRIu64"\n"
>>>> +                "\n",
>>>> +                0UL);
>>>> +    }
>>>> +}
>>>> +
>>>> +void
>>>> +pmd_perf_format_histograms(struct ds *str, struct pmd_perf_stats *s)
>>>> +{
>>>> +    int i;
>>>> +
>>>> +    ds_put_cstr(str, "Histograms\n");
>>>> +    ds_put_format(str,
>>>> +                  "   %-21s  %-21s  %-21s  %-21s  %-21s  %-21s  %-21s\n",
>>>> +                  "cycles/it", "packets/it", "cycles/pkt", "pkts/batch",
>>>> +                  "max vhost qlen", "upcalls/it", "cycles/upcall");
>>>> +    for (i = 0; i < NUM_BINS-1; i++) {
>>>> +        ds_put_format(str,
>>>> +            "   %-9d %-11"PRIu64"  %-9d %-11"PRIu64"  %-9d %-11"PRIu64
>>>> +            "  %-9d %-11"PRIu64"  %-9d %-11"PRIu64"  %-9d %-11"PRIu64
>>>> +            "  %-9d %-11"PRIu64"\n",
>>>> +            s->cycles.wall[i], s->cycles.bin[i],
>>>> +            s->pkts.wall[i],s->pkts.bin[i],
>>>> +            s->cycles_per_pkt.wall[i], s->cycles_per_pkt.bin[i],
>>>> +            s->pkts_per_batch.wall[i], s->pkts_per_batch.bin[i],
>>>> +            s->max_vhost_qfill.wall[i], s->max_vhost_qfill.bin[i],
>>>> +            s->upcalls.wall[i], s->upcalls.bin[i],
>>>> +            s->cycles_per_upcall.wall[i], s->cycles_per_upcall.bin[i]);
>>>> +    }
>>>> +    ds_put_format(str,
>>>> +                  "   %-9s %-11"PRIu64"  %-9s %-11"PRIu64"  %-9s 
>>>> %-11"PRIu64
>>>> +                  "  %-9s %-11"PRIu64"  %-9s %-11"PRIu64"  %-9s 
>>>> %-11"PRIu64
>>>> +                  "  %-9s %-11"PRIu64"\n",
>>>> +                  ">", s->cycles.bin[i],
>>>> +                  ">", s->pkts.bin[i],
>>>> +                  ">", s->cycles_per_pkt.bin[i],
>>>> +                  ">", s->pkts_per_batch.bin[i],
>>>> +                  ">", s->max_vhost_qfill.bin[i],
>>>> +                  ">", s->upcalls.bin[i],
>>>> +                  ">", s->cycles_per_upcall.bin[i]);
>>>> +    if (s->totals.iterations > 0) {
>>>> +        ds_put_cstr(str,
>>>> +                    "-----------------------------------------------------"
>>>> +                    "-----------------------------------------------------"
>>>> +                    "------------------------------------------------\n");
>>>> +        ds_put_format(str,
>>>> +                      "   %-21s  %-21s  %-21s  %-21s  %-21s  %-21s  
>>>> %-21s\n",
>>>> +                      "cycles/it", "packets/it", "cycles/pkt", 
>>>> "pkts/batch",
>>>> +                      "vhost qlen", "upcalls/it", "cycles/upcall");
>>>> +        ds_put_format(str,
>>>> +                      "   %-21"PRIu64"  %-21.5f  %-21"PRIu64
>>>> +                      "  %-21.5f  %-21.5f  %-21.5f  %-21"PRIu32"\n",
>>>> +                      s->totals.cycles / s->totals.iterations,
>>>> +                      1.0 * s->totals.pkts / s->totals.iterations,
>>>> +                      s->totals.pkts
>>>> +                          ? s->totals.busy_cycles / s->totals.pkts : 0,
>>>> +                      s->totals.batches
>>>> +                          ? 1.0 * s->totals.pkts / s->totals.batches : 0,
>>>> +                      1.0 * s->totals.max_vhost_qfill / s->totals.iterations,
>>>> +                      1.0 * s->totals.upcalls / s->totals.iterations,
>>>> +                      s->totals.upcalls
>>>> +                          ? s->totals.upcall_cycles / s->totals.upcalls : 0);
>>>> +    }
>>>> +}
>>>> +
>>>> +void
>>>> +pmd_perf_format_iteration_history(struct ds *str, struct pmd_perf_stats *s,
>>>> +                                  int n_iter)
>>>> +{
>>>> +    struct iter_stats *is;
>>>> +    size_t index;
>>>> +    int i;
>>>> +
>>>> +    if (n_iter == 0) {
>>>> +        return;
>>>> +    }
>>>> +    ds_put_format(str, "   %-17s   %-10s   %-10s   %-10s   %-10s   "
>>>> +                  "%-10s   %-10s   %-10s\n",
>>>> +                  "tsc", "cycles", "packets", "cycles/pkt", "pkts/batch",
>>>> +                  "vhost qlen", "upcalls", "cycles/upcall");
>>>> +    for (i = 1; i <= n_iter; i++) {
>>>> +        index = (s->iterations.idx + HISTORY_LEN - i) % HISTORY_LEN;
>>>> +        is = &s->iterations.sample[index];
>>>> +        ds_put_format(str,
>>>> +                      "   %-17"PRIu64"   %-11"PRIu64"  %-11"PRIu32
>>>> +                      "  %-11"PRIu64"  %-11"PRIu32"  %-11"PRIu32
>>>> +                      "  %-11"PRIu32"  %-11"PRIu32"\n",
>>>> +                      is->timestamp,
>>>> +                      is->cycles,
>>>> +                      is->pkts,
>>>> +                      is->pkts ? is->cycles / is->pkts : 0,
>>>> +                      is->batches ? is->pkts / is->batches : 0,
>>>> +                      is->max_vhost_qfill,
>>>> +                      is->upcalls,
>>>> +                      is->upcalls ? is->upcall_cycles / is->upcalls : 0);
>>>> +    }
>>>> +}
>>>> +
>>>> +void
>>>> +pmd_perf_format_ms_history(struct ds *str, struct pmd_perf_stats *s, int n_ms)
>>>> +{
>>>> +    struct iter_stats *is;
>>>> +    size_t index;
>>>> +    int i;
>>>> +
>>>> +    if (n_ms == 0) {
>>>> +        return;
>>>> +    }
>>>> +    ds_put_format(str,
>>>> +                  "   %-12s   %-10s   %-10s   %-10s   %-10s"
>>>> +                  "   %-10s   %-10s   %-10s   %-10s\n",
>>>> +                  "ms", "iterations", "cycles/it", "Kpps", "cycles/pkt",
>>>> +                  "pkts/batch", "vhost qlen", "upcalls", "cycles/upcall");
>>>> +    for (i = 1; i <= n_ms; i++) {
>>>> +        index = (s->milliseconds.idx + HISTORY_LEN - i) % HISTORY_LEN;
>>>> +        is = &s->milliseconds.sample[index];
>>>> +        ds_put_format(str,
>>>> +                      "   %-12"PRIu64"   %-11"PRIu32"  %-11"PRIu64
>>>> +                      "  %-11"PRIu32"  %-11"PRIu64"  %-11"PRIu32
>>>> +                      "  %-11"PRIu32"  %-11"PRIu32"  %-11"PRIu32"\n",
>>>> +                      is->timestamp,
>>>> +                      is->iterations,
>>>> +                      is->iterations ? is->cycles / is->iterations : 0,
>>>> +                      is->pkts,
>>>> +                      is->pkts ? is->busy_cycles / is->pkts : 0,
>>>> +                      is->batches ? is->pkts / is->batches : 0,
>>>> +                      is->iterations
>>>> +                          ? is->max_vhost_qfill / is->iterations : 0,
>>>> +                      is->upcalls,
>>>> +                      is->upcalls ? is->upcall_cycles / is->upcalls : 0);
>>>> +    }
>>>>  }
>>>>
>>>>  void
>>>> @@ -51,10 +341,49 @@ pmd_perf_read_counters(struct pmd_perf_stats *s,
>>>>      }
>>>>  }
>>>>
>>>> +/* This function is executed in the context of the PMD at the start of
>>>> + * a new iteration when requested through pmd_perf_stats_clear(). */
>>>>  void
>>>> -pmd_perf_stats_clear(struct pmd_perf_stats *s)
>>>> +pmd_perf_stats_clear__(struct pmd_perf_stats *s)
>>>>  {
>>>>      for (int i = 0; i < PMD_N_STATS; i++) {
>>>>          atomic_read_relaxed(&s->counters.n[i], &s->counters.zero[i]);
>>>>      }
>>>> +    memset(&s->current, 0 , sizeof(struct iter_stats));
>>>> +    memset(&s->totals, 0 , sizeof(struct iter_stats));
>>>> +    histogram_clear(&s->cycles);
>>>> +    histogram_clear(&s->pkts);
>>>> +    histogram_clear(&s->cycles_per_pkt);
>>>> +    histogram_clear(&s->upcalls);
>>>> +    histogram_clear(&s->cycles_per_upcall);
>>>> +    histogram_clear(&s->pkts_per_batch);
>>>> +    histogram_clear(&s->max_vhost_qfill);
>>>> +    history_init(&s->iterations);
>>>> +    history_init(&s->milliseconds);
>>>> +    s->start_ms = time_msec();
>>>> +    s->milliseconds.sample[0].timestamp = s->start_ms;
>>>> +    /* Clearing finished. */
>>>> +    s->clear = false;
>>>> +}
>>>> +
>>>> +/* This function must be called from outside the PMD thread to safely
>>>> + * clear the PMD stats at the start of the next iteration. It blocks the
>>>> + * caller until the stats are cleared. */
> [[BO'M]] Do we need the "it blocks until..." constraint? This is only called 
> via the cli, so once the flag is set the pmd will clear the stats on its next 
> iteration. And they should be long cleared before the next cli operation comes 
> to view them. In that case we can completely remove the nanosleep below. I 
> don't know if clear needs any atomic protection; I suspect volatile would be 
> enough.
>>>> +void
>>>> +pmd_perf_stats_clear(struct pmd_perf_stats *s)
>>>> +{
>>>> +    /* Request the PMD to clear its stats in pmd_perf_start_iteration(). */
>>>> +    s->clear = true;
>>>> +    /* Wait a number of milliseconds for the stats to be cleared. */
>>>> +    while (s->clear) {
>>>
>>> Handler thread will wait forever in case we have a PMD thread without rx
>>> queues. Also we're under '&dp_netdev_mutex' here; all the changes in
>>> configuration will get stuck.
>>
>> Is it ok if I state in the comment that it's the caller's responsibility to
>> make sure the PMD thread is executing when calling the function? I can also
>> add a max waiting time (e.g. 1 second) to avoid an eternal deadlock.
>>
>> Let's discuss further down if the invocation is safe.
>>
>>>
>>>> +        xnanosleep(1000 * 1000);
>>>> +    }
>>>> +}
>>>> +
>>>> +/* This function can be called from anywhere to clear the stats
>>>> + * of the non-pmd thread. */
>>>> +void
>>>> +non_pmd_perf_stats_clear(struct pmd_perf_stats *s)
>>>> +{
>>>> +    pmd_perf_stats_clear__(s);
>>>>  }
>>>> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
>>>> index 5993c25..7a89c40 100644
>>>> --- a/lib/dpif-netdev-perf.h
>>>> +++ b/lib/dpif-netdev-perf.h
>>>> @@ -38,10 +38,18 @@
>>>>  extern "C" {
>>>>  #endif
>>>>
>>>> -/* This module encapsulates data structures and functions to maintain PMD
>>>> - * performance metrics such as packet counters, execution cycles. It
>>>> - * provides a clean API for dpif-netdev to initialize, update and read and
>>>> +/* This module encapsulates data structures and functions to maintain basic PMD
>>>> + * performance metrics such as packet counters, execution cycles as well as
>>>> + * histograms and time series recording for more detailed PMD metrics.
>>>> + *
>>>> + * It provides a clean API for dpif-netdev to initialize, update and read and
>>>>   * reset these metrics.
>>>> + *
>>>> + * The basic set of PMD counters is implemented as atomic_uint64_t variables
>>>> + * to guarantee correct read also in 32-bit systems.
>>>> + *
>>>> + * The detailed PMD performance metrics are only supported on 64-bit systems
>>>> + * with atomic 64-bit read and store semantics for plain uint64_t counters.
>>>>   */
>>>>
>>>>  /* Set of counter types maintained in pmd_perf_stats. */
>>>> @@ -66,6 +74,7 @@ enum pmd_stat_type {
>>>>      PMD_STAT_SENT_BATCHES,  /* Number of batches sent. */
>>>>      PMD_CYCLES_ITER_IDLE,   /* Cycles spent in idle iterations. */
>>>>      PMD_CYCLES_ITER_BUSY,   /* Cycles spent in busy iterations. */
>>>> +    PMD_CYCLES_UPCALL,      /* Cycles spent processing upcalls. */
>>>>      PMD_N_STATS
>>>>  };
>>>>
>>>> @@ -81,18 +90,78 @@ struct pmd_counters {
>>>>      uint64_t zero[PMD_N_STATS];         /* Value at last _clear().  */
>>>>  };
>>>>
>>>> +/* Data structure to collect statistical distribution of an integer measurement
>>>> + * type in form of a histogram. The wall[] array contains the inclusive
>>>> + * upper boundaries of the bins, while the bin[] array contains the actual
>>>> + * counters per bin. The histogram walls are typically set automatically
>>>> + * using the functions provided below. */
>>>> +
>>>> +#define NUM_BINS 32             /* Number of histogram bins. */
>>>> +
>>>> +struct histogram {
>>>> +    uint32_t wall[NUM_BINS];
>>>> +    uint64_t bin[NUM_BINS];
>>>> +};
>>>> +
>>>> +/* Data structure to record detailed PMD execution metrics per iteration for
>>>> + * a history period of up to HISTORY_LEN iterations in a circular buffer.
>>>> + * Also used to record up to HISTORY_LEN millisecond averages/totals of these
>>>> + * metrics. */
>>>> +
>>>> +struct iter_stats {
>>>> +    uint64_t timestamp;         /* TSC or millisecond. */
>>>> +    uint64_t cycles;            /* Number of TSC cycles spent in it/ms. */
>>>> +    uint64_t busy_cycles;       /* Cycles spent in busy iterations in ms. */
>>>> +    uint32_t iterations;        /* Iterations in ms. */
>>>> +    uint32_t pkts;              /* Packets processed in iteration/ms. */
>>>> +    uint32_t upcalls;           /* Number of upcalls in iteration/ms. */
>>>> +    uint32_t upcall_cycles;     /* Cycles spent in upcalls in iteration/ms. */
>>>> +    uint32_t batches;           /* Number of rx batches in iteration/ms. */
>>>> +    uint32_t max_vhost_qfill;   /* Maximum fill level encountered in it/ms. */
>>>> +};
>>>> +
>>>> +#define HISTORY_LEN 1000        /* Length of recorded history
>>>> +                                   (iterations and ms). */
>>>> +#define DEF_HIST_SHOW 20        /* Default number of history samples to
>>>> +                                   display. */
>>>> +
>>>> +struct history {
>>>> +    uint64_t idx;               /* Next slot in history. */
>>>> +    struct iter_stats sample[HISTORY_LEN];
>>>> +};
>>>> +
>>>>  /* Container for all performance metrics of a PMD.
>>>>   * Part of the struct dp_netdev_pmd_thread. */
>>>>
>>>>  struct pmd_perf_stats {
>>>> -    /* Start of the current PMD iteration in TSC cycles.*/
>>>> -    uint64_t start_it_tsc;
>>>> +    /* Set by CLI thread to order clearing of PMD stats. */
>>>> +    volatile atomic_bool clear;
>>>
>>> I think, volatile + atomic is too much.
>>> Also, you're not using atomic operations on this variable.
>>
>> So would you prefer volatile bool or atomic_bool?
>>
>> I assume that bool is atomic on all architectures, so the use of atomic_bool
>> was more a way of documenting that. But then I should also use atomic
>> operations to be consistent. See also the other atomic_bool below.
>>
>>>
>>>> +    /* Start of the current performance measurement period. */
>>>> +    uint64_t start_ms;
>>>>      /* Latest TSC time stamp taken in PMD. */
>>>>      uint64_t last_tsc;
>>>> +    /* Used to space certain checks in time. */
>>>> +    uint64_t next_check_tsc;
>>>>      /* If non-NULL, outermost cycle timer currently running in PMD. */
>>>>      struct cycle_timer *cur_timer;
>>>>      /* Set of PMD counters with their zero offsets. */
>>>>      struct pmd_counters counters;
>>>> +    /* Statistics of the current iteration. */
>>>> +    struct iter_stats current;
>>>> +    /* Totals for the current millisecond. */
>>>> +    struct iter_stats totals;
>>>> +    /* Histograms for the PMD metrics. */
>>>> +    struct histogram cycles;
>>>> +    struct histogram pkts;
>>>> +    struct histogram cycles_per_pkt;
>>>> +    struct histogram upcalls;
>>>> +    struct histogram cycles_per_upcall;
>>>> +    struct histogram pkts_per_batch;
>>>> +    struct histogram max_vhost_qfill;
>>>> +    /* Iteration history buffer. */
>>>> +    struct history iterations;
>>>> +    /* Millisecond history buffer. */
>>>> +    struct history milliseconds;
>>>>  };
>>>>
>>>>  /* Support for accurate timing of PMD execution on TSC clock cycle level.
>>>> @@ -175,8 +244,15 @@ cycle_timer_stop(struct pmd_perf_stats *s,
>>>>      return now - timer->start;
>>>>  }
>>>>
>>>> +/* Functions to initialize and reset the PMD performance metrics. */
>>>> +
>>>>  void pmd_perf_stats_init(struct pmd_perf_stats *s);
>>>>  void pmd_perf_stats_clear(struct pmd_perf_stats *s);
>>>> +void non_pmd_perf_stats_clear(struct pmd_perf_stats *s);
>>>> +void pmd_perf_stats_clear__(struct pmd_perf_stats *s);
>>>> +
>>>> +/* Functions to read and update PMD counters. */
>>>> +
>>>>  void pmd_perf_read_counters(struct pmd_perf_stats *s,
>>>>                              uint64_t stats[PMD_N_STATS]);
>>>>
>>>> @@ -199,32 +275,175 @@ pmd_perf_update_counter(struct pmd_perf_stats *s,
>>>>      atomic_store_relaxed(&s->counters.n[counter], tmp);
>>>>  }
>>>>
>>>> +/* Functions to manipulate a sample history. */
>>>> +
>>>> +static inline void
>>>> +histogram_add_sample(struct histogram *hist, uint32_t val)
>>>> +{
>>>> +    /* TODO: Can do better with binary search? */
>>>> +    for (int i = 0; i < NUM_BINS-1; i++) {
>>>> +        if (val <= hist->wall[i]) {
>>>> +            hist->bin[i]++;
>>>> +            return;
>>>> +        }
>>>> +    }
>>>> +    hist->bin[NUM_BINS-1]++;
>>>> +}
>>>> +
>>>> +uint64_t histogram_samples(const struct histogram *hist);
>>>> +
>>>> +static inline struct iter_stats *
>>>> +history_current(struct history *h)
>>>> +{
>>>> +    return &h->sample[h->idx];
>>>> +}
>>>> +
>>>> +static inline struct iter_stats *
>>>> +history_next(struct history *h)
>>>> +{
>>>> +    struct iter_stats *next;
>>>> +
>>>> +    h->idx++;
>>>> +    if (h->idx == HISTORY_LEN) {
>>>> +        h->idx = 0;
>>>> +    }
>>>> +    next = &h->sample[h->idx];
>>>> +    memset(next, 0, sizeof(*next));
>>>> +    return next;
>>>> +}
>>>> +
>>>> +static inline struct iter_stats *
>>>> +history_store(struct history *h, struct iter_stats *is)
>>>> +{
>>>> +    if (is) {
>>>> +        h->sample[h->idx] = *is;
>>>> +    }
>>>> +    /* Advance the history pointer */
>>>> +    return history_next(h);
>>>> +}
>>>> +
>>>> +/* Functions recording PMD metrics per iteration. */
>>>> +
>>>>  static inline void
>>>>  pmd_perf_start_iteration(struct pmd_perf_stats *s)
>>>>  {
>>>> +    if (s->clear) {
>>>> +        /* Clear the PMD stats before starting next iteration. */
>>>> +        pmd_perf_stats_clear__(s);
>>>> +    }
>>>> +    /* Initialize the current interval stats. */
>>>> +    memset(&s->current, 0, sizeof(struct iter_stats));
>>>>      if (OVS_LIKELY(s->last_tsc)) {
>>>>          /* We assume here that last_tsc was updated immediately prior at
>>>>           * the end of the previous iteration, or just before the first
>>>>           * iteration. */
>>>> -        s->start_it_tsc = s->last_tsc;
>>>> +        s->current.timestamp = s->last_tsc;
>>>>      } else {
>>>>          /* In case last_tsc has never been set before. */
>>>> -        s->start_it_tsc = cycles_counter_update(s);
>>>> +        s->current.timestamp = cycles_counter_update(s);
>>>>      }
>>>>  }
>>>>
>>>>  static inline void
>>>> -pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets)
>>>> +pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
>>>> +                       int tx_packets, bool full_metrics)
>>>>  {
>>>> -    uint64_t cycles = cycles_counter_update(s) - s->start_it_tsc;
>>>> +    uint64_t now_tsc = cycles_counter_update(s);
>>>> +    struct iter_stats *cum_ms;
>>>> +    uint64_t cycles, cycles_per_pkt = 0;
>>>>
>>>> -    if (rx_packets > 0) {
>>>> +    if (OVS_UNLIKELY(s->current.timestamp == 0)) {
>>>> +        /* Stats were cleared during the ongoing iteration. */
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    cycles = now_tsc - s->current.timestamp;
>>>> +    s->current.cycles = cycles;
>>>> +    s->current.pkts = rx_packets;
>>>> +
>>>> +    if (rx_packets + tx_packets > 0) {
>>>>          pmd_perf_update_counter(s, PMD_CYCLES_ITER_BUSY, cycles);
>>>>      } else {
>>>>          pmd_perf_update_counter(s, PMD_CYCLES_ITER_IDLE, cycles);
>>>>      }
>>>> +    /* Add iteration samples to histograms. */
>>>> +    histogram_add_sample(&s->cycles, cycles);
>>>> +    histogram_add_sample(&s->pkts, rx_packets);
>>>> +
>>>> +    if (!full_metrics) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    s->counters.n[PMD_CYCLES_UPCALL] += s->current.upcall_cycles;
>>>> +
>>>> +    if (rx_packets > 0) {
>>>> +        cycles_per_pkt = cycles / rx_packets;
>>>> +        histogram_add_sample(&s->cycles_per_pkt, cycles_per_pkt);
>>>> +    }
>>>> +    if (s->current.batches > 0) {
>>>> +        histogram_add_sample(&s->pkts_per_batch,
>>>> +                             rx_packets / s->current.batches);
>>>> +    }
>>>> +    histogram_add_sample(&s->upcalls, s->current.upcalls);
>>>> +    if (s->current.upcalls > 0) {
>>>> +        histogram_add_sample(&s->cycles_per_upcall,
>>>> +                             s->current.upcall_cycles / 
>>>> s->current.upcalls);
>>>> +    }
>>>> +    histogram_add_sample(&s->max_vhost_qfill, s->current.max_vhost_qfill);
>>>> +
>>>> +    /* Add iteration samples to millisecond stats. */
>>>> +    cum_ms = history_current(&s->milliseconds);
>>>> +    cum_ms->iterations++;
>>>> +    cum_ms->cycles += cycles;
>>>> +    if (rx_packets > 0) {
>>>> +        cum_ms->busy_cycles += cycles;
>>>> +    }
>>>> +    cum_ms->pkts += s->current.pkts;
>>>> +    cum_ms->upcalls += s->current.upcalls;
>>>> +    cum_ms->upcall_cycles += s->current.upcall_cycles;
>>>> +    cum_ms->batches += s->current.batches;
>>>> +    cum_ms->max_vhost_qfill += s->current.max_vhost_qfill;
>>>> +
>>>> +    /* Store in iteration history. */
>>>> +    history_store(&s->iterations, &s->current);
>>>> +    if (now_tsc > s->next_check_tsc) {
>>>> +        /* Check if ms is completed and store in milliseconds history. */
>>>> +        uint64_t now = time_msec();
>>>> +        if (now != cum_ms->timestamp) {
>>>> +            /* Add ms stats to totals. */
>>>> +            s->totals.iterations += cum_ms->iterations;
>>>> +            s->totals.cycles += cum_ms->cycles;
>>>> +            s->totals.busy_cycles += cum_ms->busy_cycles;
>>>> +            s->totals.pkts += cum_ms->pkts;
>>>> +            s->totals.upcalls += cum_ms->upcalls;
>>>> +            s->totals.upcall_cycles += cum_ms->upcall_cycles;
>>>> +            s->totals.batches += cum_ms->batches;
>>>> +            s->totals.max_vhost_qfill += cum_ms->max_vhost_qfill;
>>>> +            cum_ms = history_next(&s->milliseconds);
>>>> +            cum_ms->timestamp = now;
>>>> +        }
>>>> +        s->next_check_tsc = now_tsc + 10000;
>>>> +    }
>>>>  }
>>>>
>>>> +/* Functions for formatting the output of commands. */
>>>> +
>>>> +struct pmd_perf_params {
>>>> +    int command_type;
>>>> +    bool histograms;
>>>> +    size_t iter_hist_len;
>>>> +    size_t ms_hist_len;
>>>> +};
>>>> +
>>>> +void pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
>>>> +                                   double duration);
>>>> +void pmd_perf_format_histograms(struct ds *str, struct pmd_perf_stats *s);
>>>> +void pmd_perf_format_iteration_history(struct ds *str,
>>>> +                                       struct pmd_perf_stats *s,
>>>> +                                       int n_iter);
>>>> +void pmd_perf_format_ms_history(struct ds *str, struct pmd_perf_stats *s,
>>>> +                                int n_ms);
>>>> +
>>>>  #ifdef  __cplusplus
>>>>  }
>>>>  #endif
>>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>>> index 48a8ebb..f5931bf 100644
>>>> --- a/lib/dpif-netdev.c
>>>> +++ b/lib/dpif-netdev.c
>>>> @@ -49,6 +49,7 @@
>>>>  #include "id-pool.h"
>>>>  #include "latch.h"
>>>>  #include "netdev.h"
>>>> +#include "netdev-provider.h"
>>>>  #include "netdev-vport.h"
>>>>  #include "netlink.h"
>>>>  #include "odp-execute.h"
>>>> @@ -281,6 +282,8 @@ struct dp_netdev {
>>>>
>>>>      /* Probability of EMC insertions is a factor of 'emc_insert_min'.*/
>>>>      OVS_ALIGNED_VAR(CACHE_LINE_SIZE) atomic_uint32_t emc_insert_min;
>>>> +    /* Enable collection of PMD performance metrics. */
>>>> +    ATOMIC(bool) pmd_perf_metrics;
>>>
>>> atomic_bool ?
>>
>> Is atomic_bool here an overkill too?
>>
>>>
>>>>
>>>>      /* Protects access to ofproto-dpif-upcall interface during revalidator
>>>>       * thread synchronization. */
>>>> @@ -712,6 +715,8 @@ static inline bool emc_entry_alive(struct emc_entry *ce);
>>>>  static void emc_clear_entry(struct emc_entry *ce);
>>>>
>>>>  static void dp_netdev_request_reconfigure(struct dp_netdev *dp);
>>>> +static inline bool
>>>> +pmd_perf_metrics_enabled(const struct dp_netdev_pmd_thread *pmd);
>>>>
>>>>  static void
>>>>  emc_cache_init(struct emc_cache *flow_cache)
>>>> @@ -795,7 +800,8 @@ get_dp_netdev(const struct dpif *dpif)
>>>>  enum pmd_info_type {
>>>>      PMD_INFO_SHOW_STATS,  /* Show how cpu cycles are spent. */
>>>>      PMD_INFO_CLEAR_STATS, /* Set the cycles count to 0. */
>>>> -    PMD_INFO_SHOW_RXQ     /* Show poll-lists of pmd threads. */
>>>> +    PMD_INFO_SHOW_RXQ,    /* Show poll lists of pmd threads. */
>>>> +    PMD_INFO_PERF_SHOW,   /* Show pmd performance details. */
>>>>  };
>>>>
>>>>  static void
>>>> @@ -886,6 +892,44 @@ pmd_info_show_stats(struct ds *reply,
>>>>                    stats[PMD_CYCLES_ITER_BUSY], total_packets);
>>>>  }
>>>>
>>>> +static void
>>>> +pmd_info_show_perf(struct ds *reply,
>>>> +                   struct dp_netdev_pmd_thread *pmd,
>>>> +                   struct pmd_perf_params *par)
>>>> +{
>>>> +    if (pmd->core_id != NON_PMD_CORE_ID) {
>>>> +        char *time_str =
>>>> +                xastrftime_msec("%H:%M:%S.###", time_wall_msec(), true);
>>>> +        long long now = time_msec();
>>>> +        double duration = (now - pmd->perf_stats.start_ms) / 1000.0;
>>>> +
>>>> +        ds_put_cstr(reply, "\n");
>>>> +        ds_put_format(reply, "Time: %s\n", time_str);
>>>> +        ds_put_format(reply, "Measurement duration: %.3f s\n", duration);
>>>> +        ds_put_cstr(reply, "\n");
>>>> +        format_pmd_thread(reply, pmd);
>>>> +        ds_put_cstr(reply, "\n");
>>>> +        pmd_perf_format_overall_stats(reply, &pmd->perf_stats, duration);
>>>> +        if (pmd_perf_metrics_enabled(pmd)) {
>>>> +            if (par->histograms) {
>>>> +                ds_put_cstr(reply, "\n");
>>>> +                pmd_perf_format_histograms(reply, &pmd->perf_stats);
>>>> +            }
>>>> +            if (par->iter_hist_len > 0) {
>>>> +                ds_put_cstr(reply, "\n");
>>>> +                pmd_perf_format_iteration_history(reply, &pmd->perf_stats,
>>>> +                        par->iter_hist_len);
>>>> +            }
>>>> +            if (par->ms_hist_len > 0) {
>>>> +                ds_put_cstr(reply, "\n");
>>>> +                pmd_perf_format_ms_history(reply, &pmd->perf_stats,
>>>> +                        par->ms_hist_len);
>>>> +            }
>>>> +        }
>>>> +        free(time_str);
>>>> +    }
>>>> +}
>>>> +
>>>>  static int
>>>>  compare_poll_list(const void *a_, const void *b_)
>>>>  {
>>>> @@ -1088,9 +1132,15 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char *argv[],
>>>>          if (type == PMD_INFO_SHOW_RXQ) {
>>>>              pmd_info_show_rxq(&reply, pmd);
>>>>          } else if (type == PMD_INFO_CLEAR_STATS) {
>>>> -            pmd_perf_stats_clear(&pmd->perf_stats);
>>>> +            if (pmd->core_id == NON_PMD_CORE_ID) {
>>>> +                non_pmd_perf_stats_clear(&pmd->perf_stats);
>>>> +            } else {
>>>> +                pmd_perf_stats_clear(&pmd->perf_stats);
>>>> +            }
>>>
>>> Can we call clear() only for PMD threads here? We're not counting stats for
>>> non-pmd anyway.
>>
>> The non-pmd thread does have the basic set of PMD counters for the
>> purpose of pmd-stats-show which need to be cleared at pmd-stats-clear.
>>
>> Coming back to the discussion of invoking the blocking
>> pmd_perf_stats_clear(&pmd->perf_stats) here:
>>
>> This code fragment is executed in the following loop in
>> dpif_netdev_pmd_info():
>>
>>     sorted_poll_thread_list(dp, &pmd_list, &n);
>>     for (size_t i = 0; i < n; i++) {
>>         struct dp_netdev_pmd_thread *pmd = pmd_list[i];
>>         if (!pmd) {
>>             break;
>>         }
>>         if (filter_on_pmd && pmd->core_id != core_id) {
>>             continue;
>>         }
>>         if (type == PMD_INFO_SHOW_RXQ) {
>>             pmd_info_show_rxq(&reply, pmd);
>>         } else if (type == PMD_INFO_CLEAR_STATS) {
>>             if (pmd->core_id == NON_PMD_CORE_ID) {
>>                 non_pmd_perf_stats_clear(&pmd->perf_stats);
>>             } else {
>>                 pmd_perf_stats_clear(&pmd->perf_stats);
>>             }
>>         } else if (type == PMD_INFO_SHOW_STATS) {
>>             pmd_info_show_stats(&reply, pmd);
>>         } else if (type == PMD_INFO_PERF_SHOW) {
>>             pmd_info_show_perf(&reply, pmd, (struct pmd_perf_params *)aux);
>>         }
>>     }
>>     free(pmd_list);
>>
>> We are iterating over a sorted list of pmd pointers from the cmap
>> dp->poll_threads under the protection of the global dp_netdev_mutex lock.
>> Each pmd struct in this cmap (except for the NON_PMD_CORE_ID) is running
>> a pmd thread, because the thread is created immediately after inserting the
>> pmd into dp->poll_threads in reconfigure_pmd_threads() and ended
>> immediately before removing the pmd from that cmap in
>> dp_netdev_del_pmd(). Can either of these happen while the
>> dp_netdev_mutex is held? If not, then there is not even a potential race
>> condition.
>>
>> So I think we can be sure that the pmd thread is running when calling
>> pmd_perf_stats_clear() here. Do you agree?

The main issue here is that the PMD thread could sleep and never call
pmd_perf_start_iteration(). The PMD thread could wait for reload inside the
following loop if it has no rx queues to poll:

    if (!poll_cnt) {
        while (seq_read(pmd->reload_seq) == pmd->last_reload_seq) {
            seq_wait(pmd->reload_seq, pmd->last_reload_seq);
            poll_block();
        }
        lc = UINT_MAX;
    }

And the main thread will wait forever.


About volatiles and atomics:
'volatile' only forces the compiler to emit actual loads/stores. It doesn't
protect against reordering of operations by the compiler, and definitely not
against reordering of memory operations by the CPU.

Let's look at the pmd_perf_stats_clear__() function:

    void
    pmd_perf_stats_clear__(struct pmd_perf_stats *s)
    {
        for (int i = 0; i < PMD_N_STATS; i++) {
            atomic_read_relaxed(&s->counters.n[i], &s->counters.zero[i]);
        }
        memset(&s->current, 0 , sizeof(struct iter_stats));
        memset(&s->totals, 0 , sizeof(struct iter_stats));
        histogram_clear(&s->cycles);
        histogram_clear(&s->pkts);
        histogram_clear(&s->cycles_per_pkt);
        histogram_clear(&s->upcalls);
        histogram_clear(&s->cycles_per_upcall);
        histogram_clear(&s->pkts_per_batch);
        histogram_clear(&s->max_vhost_qfill);
        history_init(&s->iterations);
        history_init(&s->milliseconds);
        s->start_ms = time_msec();
        s->milliseconds.sample[0].timestamp = s->start_ms;
        s->log_begin_it = UINT64_MAX;
        s->log_end_it = UINT64_MAX;
        /* Clearing finished. */
        s->clear = false;
    }

Let's assume that 'clear' is volatile.
's->clear = false;' is the last operation in the code, but the compiler is
free to move it anywhere in this function relative to the non-volatile
accesses. This would break the logic.
To avoid this reordering a compiler barrier is required, but that will not
help with CPU reordering. In this case, for non-x86 architectures, a real
smp write barrier is required.
A mutex would be a much better solution.
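
E.g. roughly like this (sketch; 'stats_mutex' is a hypothetical new field,
and the PMD side would have to take the same mutex while updating the stats):

    ovs_mutex_lock(&s->stats_mutex);
    pmd_perf_stats_clear__(s);
    ovs_mutex_unlock(&s->stats_mutex);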

Let's assume that 'clear' is atomic.
An atomic provides protection from compiler reordering just like volatile.
But to prevent reordering on the CPU we'd have to use release-acquire memory
ordering for the atomic operations. Isn't that already a mutex?
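
For a correct release-acquire pairing the code would have to look roughly
like this (sketch, not a complete fix):

    /* PMD thread, publishing the cleared stats at the end of
     * pmd_perf_stats_clear__(): */
    atomic_store_explicit(&s->clear, false, memory_order_release);

    /* Main thread, waiting for the PMD to finish clearing: */
    bool clearing;
    do {
        atomic_read_explicit(&s->clear, &clearing, memory_order_acquire);
    } while (clearing);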

Conclusion:
Since you want to protect your shared data from access by other threads, you
have to use a mutex or not protect it at all (atomics and volatiles are
effectively useless for this use case unless you combine them with memory
barriers or heavy memory-ordering qualifiers).
Given that you read the stats directly (non-atomically and without volatile
access), you will never have accurate data anyway. From that point of view,
trying to protect the data clearing just to have a slightly more accurate
stats-show output looks strange.

Some reading:
https://www.kernel.org/doc/Documentation/process/volatile-considered-harmful.rst


About the performance impact of relaxed atomic operations:
On most modern architectures relaxed load/store operations on aligned 32-bit
integers simply become ordinary operations without any memory fences. This
means that an rcu_get-style operation like get_vid() costs nothing.
IMHO, it's clear that we should use atomic_bool for 'pmd_perf_metrics' to
avoid any issues with strange architectures. The compiler will do all the
dirty work for us without performance impact in the general case.


>>
>>>
>>>>          } else if (type == PMD_INFO_SHOW_STATS) {
>>>>              pmd_info_show_stats(&reply, pmd);
>>>> +        } else if (type == PMD_INFO_PERF_SHOW) {
>>>> +            pmd_info_show_perf(&reply, pmd, (struct pmd_perf_params *)aux);
>>>>          }
>>>>      }
>>>>      free(pmd_list);
>>>> @@ -1100,6 +1150,48 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char *argv[],
>>>>      unixctl_command_reply(conn, ds_cstr(&reply));
>>>>      ds_destroy(&reply);
>>>>  }
>>>> +
>>>> +static void
>>>> +pmd_perf_show_cmd(struct unixctl_conn *conn, int argc,
>>>> +                          const char *argv[],
>>>> +                          void *aux OVS_UNUSED)
>>>> +{
>>>> +    struct pmd_perf_params par;
>>>> +    long int it_hist = 0, ms_hist = 0;
>>>> +    par.histograms = true;
>>>> +
>>>> +    while (argc > 1) {
>>>> +        if (!strcmp(argv[1], "-nh")) {
>>>> +            par.histograms = false;
>>>> +            argc -= 1;
>>>> +            argv += 1;
>>>> +        } else if (!strcmp(argv[1], "-it") && argc > 2) {
>>>> +            it_hist = strtol(argv[2], NULL, 10);
>>>> +            if (it_hist < 0) {
>>>> +                it_hist = 0;
>>>> +            } else if (it_hist > HISTORY_LEN) {
>>>> +                it_hist = HISTORY_LEN;
>>>> +            }
>>>> +            argc -= 2;
>>>> +            argv += 2;
>>>> +        } else if (!strcmp(argv[1], "-ms") && argc > 2) {
>>>> +            ms_hist = strtol(argv[2], NULL, 10);
>>>> +            if (ms_hist < 0) {
>>>> +                ms_hist = 0;
>>>> +            } else if (ms_hist > HISTORY_LEN) {
>>>> +                ms_hist = HISTORY_LEN;
>>>> +            }
>>>> +            argc -= 2;
>>>> +            argv += 2;
>>>> +        } else {
>>>> +            break;
>>>> +        }
>>>> +    }
>>>> +    par.iter_hist_len = it_hist;
>>>> +    par.ms_hist_len = ms_hist;
>>>> +    par.command_type = PMD_INFO_PERF_SHOW;
>>>> +    dpif_netdev_pmd_info(conn, argc, argv, &par);
>>>> +}
>>>>
>>
>>>>  static int
>>>>  dpif_netdev_init(void)
>>>> @@ -1117,6 +1209,12 @@ dpif_netdev_init(void)
>>>>      unixctl_command_register("dpif-netdev/pmd-rxq-show", "[-pmd core] [dp]",
>>>>                               0, 3, dpif_netdev_pmd_info,
>>>>                               (void *)&poll_aux);
>>>> +    unixctl_command_register("dpif-netdev/pmd-perf-show",
>>>> +                             "[-nh] [-it iter-history-len]"
>>>> +                             " [-ms ms-history-len]"
>>>> +                             " [-pmd core | dp]",
>>>> +                             0, 7, pmd_perf_show_cmd,
>>>> +                             NULL);
>>>>      unixctl_command_register("dpif-netdev/pmd-rxq-rebalance", "[dp]",
>>>>                               0, 1, dpif_netdev_pmd_rebalance,
>>>>                               NULL);
>>>> @@ -3003,6 +3101,18 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>>>>          }
>>>>      }
>>>>
>>>> +    bool perf_enabled = smap_get_bool(other_config, "pmd-perf-metrics", false);
>>>> +    bool cur_perf_enabled;
>>>> +    atomic_read_relaxed(&dp->pmd_perf_metrics, &cur_perf_enabled);
>>>> +    if (perf_enabled != cur_perf_enabled) {
>>>> +        atomic_store_relaxed(&dp->pmd_perf_metrics, perf_enabled);
>>>> +        if (perf_enabled) {
>>>> +            VLOG_INFO("PMD performance metrics collection enabled");
>>>> +        } else {
>>>> +            VLOG_INFO("PMD performance metrics collection disabled");
>>>> +        }
>>>> +    }
>>>> +
>>>>      return 0;
>>>>  }
>>>>
>>>> @@ -3139,6 +3249,20 @@ dp_netdev_rxq_set_cycles(struct dp_netdev_rxq *rx,
>>>>     atomic_store_relaxed(&rx->cycles[type], cycles);
>>>>  }
>>>>
>>>> +static inline bool
>>>> +pmd_perf_metrics_enabled(const struct dp_netdev_pmd_thread *pmd)
>>>> +{
>>>> +    /* If stores and reads of 64-bit integers are not atomic, the
>>>> +     * full PMD performance metrics are not available as locked
>>>> +     * access to 64 bit integers would be prohibitively expensive. */
>>>> +    if (sizeof(uint64_t) > sizeof(void *)) {
>>>> +        return false;
>>>> +    }
>>>> +    bool pmd_perf_enabled;
>>>> +    atomic_read_relaxed(&pmd->dp->pmd_perf_metrics, &pmd_perf_enabled);
>>>> +    return pmd_perf_enabled;
>>>> +}
>>>> +
>>>>  static void
>>>>  dp_netdev_rxq_add_cycles(struct dp_netdev_rxq *rx,
>>>>                           enum rxq_cycles_counter_type type,
>>>> @@ -3247,10 +3371,11 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
>>>>                             struct dp_netdev_rxq *rxq,
>>>>                             odp_port_t port_no)
>>>>  {
>>>> +    struct pmd_perf_stats *s = &pmd->perf_stats;
>>>>      struct dp_packet_batch batch;
>>>>      struct cycle_timer timer;
>>>>      int error;
>>>> -    int batch_cnt = 0, output_cnt = 0;
>>>> +    int batch_cnt = 0;
>>>>      uint64_t cycles;
>>>>
>>>>      /* Measure duration for polling and processing rx burst. */
>>>> @@ -3264,15 +3389,34 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
>>>>          /* At least one packet received. */
>>>>          *recirc_depth_get() = 0;
>>>>          pmd_thread_ctx_time_update(pmd);
>>>> -
>>>>          batch_cnt = batch.count;
>>>> +        if (pmd_perf_metrics_enabled(pmd)) {
>>>> +            /* Update batch histogram. */
>>>> +            s->current.batches++;
>>>> +            histogram_add_sample(&s->pkts_per_batch, batch_cnt);
>>>> +            /* Update the maximum Rx queue fill level. */
>>>> +            int dev_type = netdev_dpdk_get_type(
>>>> +                                netdev_rxq_get_netdev(rxq->rx));
>>>> +            if (dev_type == DPDK_DEV_VHOST) {
>>>> +                /* Check queue fill level for vhostuser ports. */
>>>> +                uint32_t qfill = batch_cnt;
>>>> +                if (OVS_UNLIKELY(batch_cnt == NETDEV_MAX_BURST)) {
>>>> +                    /* Likely more packets in rxq. */
>>>> +                    qfill += netdev_rxq_length(rxq->rx);
>>>> +                }
>>>> +                if (qfill > s->current.max_vhost_qfill) {
>>>> +                    s->current.max_vhost_qfill = qfill;
>>>> +                }
>>>> +            }
>>>> +        }
>>>
>>>
>>> Oh. This is still ugly.
>>> My suggestion:
>>> Remove the implementation of 'netdev_rxq_length' for all ports except
>>> vhostuser*. After that you'll be able to check the type of netdev just
>>> by checking the result of this function, and move all the netdev-dpdk
>>> internals back to netdev-dpdk:
>>> -----------------------------------------------------------------------
>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>> index b493216..83446f7 100644
>>> --- a/lib/dpif-netdev.c
>>> +++ b/lib/dpif-netdev.c
>>> @@ -3400,17 +3400,14 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
>>>              s->current.batches++;
>>>              histogram_add_sample(&s->pkts_per_batch, batch_cnt);
>>>              /* Update the maximum Rx queue fill level. */
>>> -            int dev_type = netdev_dpdk_get_type(
>>> -                                netdev_rxq_get_netdev(rxq->rx));
>>> -            if (dev_type == DPDK_DEV_VHOST) {
>>> -                /* Check queue fill level for vhostuser ports. */
>>> -                uint32_t qfill = batch_cnt;
>>> -                if (OVS_UNLIKELY(batch_cnt == NETDEV_MAX_BURST)) {
>>> -                    /* Likely more packets in rxq. */
>>> -                    qfill += netdev_rxq_length(rxq->rx);
>>> -                }
>>> -                if (qfill > s->current.max_vhost_qfill) {
>>> -                    s->current.max_vhost_qfill = qfill;
>>> +            if (OVS_UNLIKELY(batch_cnt == NETDEV_MAX_BURST)) {
>>> +                /* Likely more packets in rxq. */
>>> +                int res = netdev_rxq_length(rxq->rx);
>>> +
>>> +                /* Currently, netdev_rxq_length() implemented only for vhost
>>> +                 * ports. So, we don't need to check netdev type here. */
>>> +                if (res > 0 && batch_cnt + res > s->current.max_vhost_qfill) {
>>> +                    s->current.max_vhost_qfill = batch_cnt + res;
>>>                  }
>>>              }
>>>          }
>>> -----------------------------------------------------------------------
>>>
>>> 'netdev_rxq_length()' is not used anywhere else anyway and probably never
>>> will be.
>>>
>>> Suggestion 2:
>>> How about renaming "max_vhost_qfill" --> "max_qfill"? This would make the
>>> code more generic. In that case we could remove the explanations from the
>>> code above.
>>>
>>
>> I am afraid this little "ugliness" is unavoidable. It is the explicit purpose
>> of the PMD performance metrics and supervision feature to monitor only the
>> vhostuser queue length:
>> 1. Physical port rx queues are much longer (2K default vs 128 with stock
>> Qemu and virtio drivers), so mixing them would obfuscate the key data we
>> are interested in.
>> 2. There is little value in supervising phy port rx queues for excessive
>> queue fill level, as the NICs provide rx drop counters, in contrast to
>> vhostuser (until hopefully a future virtio standard provides a drop counter
>> and virtio drivers make use of it to report dropped tx packets).
>> 3. Removing the explicit netdev type check in dpif-netdev as you suggest
>> would base the correct working of the PMD performance metrics and
>> supervision on design decisions taken in the netdev complex. Somebody
>> implementing rxq_length() for the dpdk_class or dpdk_ring_class for
>> whatever reason would break PMD performance metrics inadvertently. I
>> don't think we want such implicit coupling, and we don't want to
>> specifically reserve this netdev rxq operation for the purpose of PMD
>> metrics either.

We can leave a warning inside dpdk_class describing the situation, to prevent
breakage if someone wants to implement rxq_length() for it. But I don't see
any reasonable use case for that.

>>
> [[BO'M]] I was going to agree with the point about not breaking the netdev 
> abstraction here. But those points convince me that in this case we do need 
> to peek behind the abstract interface and only call rxq_length for vhost 
> queues.
> 
>>>> +        /* Process packet batch. */
>>>>          dp_netdev_input(pmd, &batch, port_no);
>>>>
>>>>          /* Assign processing cycles to rx queue. */
>>>>          cycles = cycle_timer_stop(&pmd->perf_stats, &timer);
>>>>          dp_netdev_rxq_add_cycles(rxq, RXQ_CYCLES_PROC_CURR, cycles);
>>>>
>>>> -        output_cnt = dp_netdev_pmd_flush_output_packets(pmd, false);
>>>> +        dp_netdev_pmd_flush_output_packets(pmd, false);
>>>>      } else {
>>>>          /* Discard cycles. */
>>>>          cycle_timer_stop(&pmd->perf_stats, &timer);
>>>> @@ -3286,7 +3430,7 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
>>>>
>>>>      pmd->ctx.last_rxq = NULL;
>>>>
>>>> -    return batch_cnt + output_cnt;
>>>> +    return batch_cnt;
>>>>  }
>>>>
>>>>  static struct tx_port *
>>>> @@ -4119,22 +4263,23 @@ reload:
>>>>
>>>>      cycles_counter_update(s);
>>>>      for (;;) {
>>>> -        uint64_t iter_packets = 0;
>>>> +        uint64_t rx_packets = 0, tx_packets = 0;
>>>>
>>>>          pmd_perf_start_iteration(s);
>>>> +
>>>>          for (i = 0; i < poll_cnt; i++) {
>>>>              process_packets =
>>>>                  dp_netdev_process_rxq_port(pmd, poll_list[i].rxq,
>>>>                                             poll_list[i].port_no);
>>>> -            iter_packets += process_packets;
>>>> +            rx_packets += process_packets;
>>>>          }
>>>>
>>>> -        if (!iter_packets) {
>>>> +        if (!rx_packets) {
>>>>              /* We didn't receive anything in the process loop.
>>>>               * Check if we need to send something.
>>>>               * There was no time updates on current iteration. */
>>>>              pmd_thread_ctx_time_update(pmd);
>>>> -            iter_packets += dp_netdev_pmd_flush_output_packets(pmd, false);
>>>> +            tx_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
>>>>          }
>>>>
>>>>          if (lc++ > 1024) {
>>>> @@ -4153,7 +4298,8 @@ reload:
>>>>                  break;
>>>>              }
>>>>          }
>>>> -        pmd_perf_end_iteration(s, iter_packets);
>>>> +        pmd_perf_end_iteration(s, rx_packets, tx_packets,
>>>> +                               pmd_perf_metrics_enabled(pmd));
>>>>      }
>>>>
>>>>      poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
>>>> @@ -5050,6 +5196,7 @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
>>>>      struct match match;
>>>>      ovs_u128 ufid;
>>>>      int error;
>>>> +    uint64_t cycles = cycles_counter_update(&pmd->perf_stats);
>>>>
>>>>      match.tun_md.valid = false;
>>>>      miniflow_expand(&key->mf, &match.flow);
>>>> @@ -5103,6 +5250,14 @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
>>>>          ovs_mutex_unlock(&pmd->flow_mutex);
>>>>          emc_probabilistic_insert(pmd, key, netdev_flow);
>>>>      }
>>>> +    if (pmd_perf_metrics_enabled(pmd)) {
>>>> +        /* Update upcall stats. */
>>>> +        cycles = cycles_counter_update(&pmd->perf_stats) - cycles;
>>>> +        struct pmd_perf_stats *s = &pmd->perf_stats;
>>>> +        s->current.upcalls++;
>>>> +        s->current.upcall_cycles += cycles;
>>>> +        histogram_add_sample(&s->cycles_per_upcall, cycles);
>>>> +    }
>>>>      return error;
>>>>  }
>>>>
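BTW, the measurement pattern added here is just a TSC delta around the
slow path. Stripped of the OVS helpers it boils down to something like
this (an illustrative, x86-specific sketch only; cycles_counter_update()
is the actual helper used in the patch):

    #include <stdint.h>
    #include <x86intrin.h>

    /* Time an arbitrary callback in TSC cycles (sketch). */
    static uint64_t
    measure_cycles(void (*fn)(void *), void *arg)
    {
        uint64_t start = __rdtsc();   /* Read the time-stamp counter. */
        fn(arg);                      /* E.g. the upcall slow path. */
        return __rdtsc() - start;     /* Cycles spent in fn(). */
    }

The delta is then accumulated into current.upcall_cycles and fed into the
cycles_per_upcall histogram, as the hunk above shows.
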
>>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>>> index 4200556..7a8fdc2 100644
>>>> --- a/lib/netdev-dpdk.c
>>>> +++ b/lib/netdev-dpdk.c
>>>> @@ -36,6 +36,7 @@
>>>>  #include <rte_mbuf.h>
>>>>  #include <rte_meter.h>
>>>>  #include <rte_pci.h>
>>>> +#include <rte_version.h>
>>>>  #include <rte_vhost.h>
>>>>  #include <rte_version.h>
>>>>
>>>> @@ -188,11 +189,6 @@ enum { DPDK_RING_SIZE = 256 };
>>>>  BUILD_ASSERT_DECL(IS_POW2(DPDK_RING_SIZE));
>>>>  enum { DRAIN_TSC = 200000ULL };
>>>>
>>>> -enum dpdk_dev_type {
>>>> -    DPDK_DEV_ETH = 0,
>>>> -    DPDK_DEV_VHOST = 1,
>>>> -};
>>>> -
>>>>  /* Quality of Service */
>>>>
>>>>  /* An instance of a QoS configuration.  Always associated with a particular
>>>> @@ -843,6 +839,13 @@ netdev_dpdk_cast(const struct netdev *netdev)
>>>>      return CONTAINER_OF(netdev, struct netdev_dpdk, up);
>>>>  }
>>>>
>>>> +enum dpdk_dev_type
>>>> +netdev_dpdk_get_type(const struct netdev *netdev)
>>>> +{
>>>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>> +    return dev->type;
>>>> +}
>>>> +
>>>>  static struct netdev *
>>>>  netdev_dpdk_alloc(void)
>>>>  {
>>>> diff --git a/lib/netdev-dpdk.h b/lib/netdev-dpdk.h
>>>> index b7d02a7..2b357db 100644
>>>> --- a/lib/netdev-dpdk.h
>>>> +++ b/lib/netdev-dpdk.h
>>>> @@ -22,11 +22,18 @@
>>>>  #include "openvswitch/compiler.h"
>>>>
>>>>  struct dp_packet;
>>>> +struct netdev;
>>>> +
>>>> +enum dpdk_dev_type {
>>>> +    DPDK_DEV_ETH = 0,
>>>> +    DPDK_DEV_VHOST = 1,
>>>> +};
>>>>
>>>>  #ifdef DPDK_NETDEV
>>>>
>>>>  void netdev_dpdk_register(void);
>>>>  void free_dpdk_buf(struct dp_packet *);
>>>> +enum dpdk_dev_type netdev_dpdk_get_type(const struct netdev *netdev);
>>>>
>>>>  #else
>>>>
>>>> @@ -41,6 +48,13 @@ free_dpdk_buf(struct dp_packet *buf OVS_UNUSED)
>>>>      /* Nothing */
>>>>  }
>>>>
>>>> +static inline enum dpdk_dev_type
>>>> +netdev_dpdk_get_type(const struct netdev *netdev OVS_UNUSED)
>>>> +{
>>>> +    /* Nothing to do. Return value zero to make compiler happy. */
>>>> +    return DPDK_DEV_ETH;
>>>> +}
>>>> +
>>>>  #endif
>>>>
>>>>  #endif /* netdev-dpdk.h */
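One nice property of the inline stub above: callers can stay free of
#ifdef DPDK_NETDEV, e.g. (a hypothetical helper, just to illustrate the
pattern):

    #include <stdbool.h>
    #include "netdev-dpdk.h"

    static bool
    netdev_is_vhost(const struct netdev *netdev)
    {
        /* Compiles with or without DPDK_NETDEV; without it the stub
         * always returns DPDK_DEV_ETH and this folds to 'false'. */
        return netdev_dpdk_get_type(netdev) == DPDK_DEV_VHOST;
    }
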
>>>> diff --git a/lib/netdev-dpif-unixctl.man b/lib/netdev-dpif-unixctl.man
>>>> new file mode 100644
>>>> index 0000000..53f4c51
>>>> --- /dev/null
>>>> +++ b/lib/netdev-dpif-unixctl.man
>>>> @@ -0,0 +1,113 @@
>>>> +.SS "DPIF-NETDEV COMMANDS"
>>>> +These commands are used to expose internal information (mostly statistics)
>>>> +about the "dpif-netdev" userspace datapath. If there is only one datapath
>>>> +(as is often the case, unless \fBdpctl/\fR commands are used), the \fIdp\fR
>>>> +argument can be omitted. By default the commands present data for all pmd
>>>> +threads in the datapath. By specifying the "-pmd core" option one can
>>>> +filter the output for a single pmd in the datapath.
>>>> +.
>>>> +.IP "\fBdpif-netdev/pmd-stats-show\fR [\fB-pmd\fR \fIcore\fR]
>> [\fIdp\fR]"
>>>> +Shows performance statistics for one or all pmd threads of the datapath
>>>> +\fIdp\fR. The special thread "main" sums up the statistics of every non
>> pmd
>>>> +thread.
>>>> +
>>>> +The sum of "emc hits", "masked hits" and "miss" is the number of
>>>> +packet lookups performed by the datapath. Beware that a recirculated
>>>> +packet experiences one additional lookup per recirculation, so there may
>>>> +be more lookups than forwarded packets in the datapath.
>>>> +
>>>> +Cycles are counted using the TSC or similar facilities (when available on
>>>> +the platform). The duration of one cycle depends on the processing
>>>> +platform.
>>>> +
>>>> +"idle cycles" refers to cycles spent in PMD iterations not forwarding any
>>>> +any packets. "processing cycles" refers to cycles spent in PMD iterations
>>>> +forwarding at least one packet, including the cost for polling, processing
>> and
>>>> +transmitting said packets.
>>>> +
>>>> +To reset these counters use \fBdpif-netdev/pmd-stats-clear\fR.
>>>> +.
>>>> +.IP "\fBdpif-netdev/pmd-perf-show\fR [\fB-nh\fR] [\fB-it\fR
>> \fIiter_len\fR] \
>>>> +[\fB-ms\fR \fIms_len\fR] [\fB-pmd\fR \fIcore\fR] [\fIdp\fR]"
>>>> +Shows detailed performance metrics for one or all pmds threads of the
>>>> +user space datapath.
>>>> +
>>>> +The collection of detailed statistics can be controlled by the
>>>> +configuration parameter "other_config:pmd-perf-metrics". By default it
>>>> +is disabled. The run-time overhead, when enabled, is in the order of 1%.
>>>> +
>>>> +The covered metrics per iteration are:
>>>> +  - used cycles
>>>> +  - forwarded packets
>>>> +  - number of rx batches
>>>> +  - packets/rx batch
>>>> +  - max. vhostuser queue fill level
>>>> +  - number of upcalls
>>>> +  - cycles spent in upcalls
>>>> +
>>>> +This raw recorded data is used threefold:
>>>> +
>>>> +1. In histograms for each of the following metrics:
>>>> +   - cycles/iteration (logarithmic)
>>>> +   - packets/iteration (logarithmic)
>>>> +   - cycles/packet
>>>> +   - packets/batch
>>>> +   - max. vhostuser qlen (logarithmic)
>>>> +   - upcalls
>>>> +   - cycles/upcall (logarithmic)
>>>> +   The histogram bins are divided linearly or logarithmically.
>>>> +
>>>> +2. A cyclic history of the above metrics for 1024 iterations
>>>> +
>>>> +3. A cyclic history of the cumulative/average values per millisecond
>>>> +   wall clock for the last 1024 milliseconds:
>>>> +   - number of iterations
>>>> +   - avg. cycles/iteration
>>>> +   - packets (Kpps)
>>>> +   - avg. packets/batch
>>>> +   - avg. max vhost qlen
>>>> +   - upcalls
>>>> +   - avg. cycles/upcall
>>>> +
>>>> +The command options are
>>>> +
>>>> +    \fB-nh\fR:                  Suppress the histograms
>>>> +    \fB-it\fR \fIiter_len\fR:   Display the last iter_len iteration stats
>>>> +    \fB-ms\fR \fIms_len\fR:     Display the last ms_len millisecond stats
>>>> +
>>>> +The output always contains the following global PMD statistics:
>>>> +
>>>> +Time: 15:24:55.270
>>>> +.br
>>>> +Measurement duration: 1.008 s
>>>> +
>>>> +pmd thread numa_id 0 core_id 1:
>>>> +
>>>> +  Cycles:            2419034712  (2.40 GHz)
>>>> +  Iterations:            572817  (1.76 us/it)
>>>> +  - idle:                486808  (15.9 % cycles)
>>>> +  - busy:                 86009  (84.1 % cycles)
>>>> +  Packets:              2399607  (2381 Kpps, 848 cycles/pkt)
>>>> +  Datapath passes:      3599415  (1.50 passes/pkt)
>>>> +  - EMC hits:            336472  ( 9.3 %)
>>>> +  - Megaflow hits:      3262943  (90.7 %, 1.00 subtbl lookups/hit)
>>>> +  - Upcalls:                  0  ( 0.0 %, 0.0 us/upcall)
>>>> +  - Lost upcalls:             0  ( 0.0 %)
>>>> +
>>>> +Here "Packets" actually reflects the number of packets forwarded by
>> the
>>>> +datapath. "Datapath passes" matches the number of packet lookups as
>>>> +reported by the \fBdpif-netdev/pmd-stats-show\fR command.
>>>> +
>>>> +To reset the counters and start a new measurement use
>>>> +\fBdpif-netdev/pmd-stats-clear\fR.
>>>> +.
>>>> +.IP "\fBdpif-netdev/pmd-stats-clear\fR [\fIdp\fR]"
>>>> +Resets to zero the per pmd thread performance numbers shown by the
>>>> +\fBdpif-netdev/pmd-stats-show\fR and \fBdpif-netdev/pmd-perf-show\fR
>>>> +commands.
>>>> +It will NOT reset datapath or bridge statistics, only the values shown by
>>>> +the above commands.
>>>> +.
>>>> +.IP "\fBdpif-netdev/pmd-rxq-show\fR [\fB-pmd\fR \fIcore\fR]
>> [\fIdp\fR]"
>>>> +For one or all pmd threads of the datapath \fIdp\fR show the list of
>> queue-ids
>>>> +with port names, which this thread polls.
>>>> +.
>>>> +.IP "\fBdpif-netdev/pmd-rxq-rebalance\fR [\fIdp\fR]"
>>>> +Reassigns rxqs to pmds in the datapath \fIdp\fR based on their current
>>>> +usage.
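
Side note on the logarithmic histograms described above: binning by the
position of the most significant bit is the usual approach, roughly like
this (an illustrative sketch; the actual bin layout in the series may
differ):

    #include <stdint.h>

    #define NUM_BINS 32

    /* Map a sample to a logarithmic bin: bin 0 covers [0,2),
     * bin i (i >= 1) covers [2^i, 2^(i+1)). */
    static inline int
    hist_log_bin(uint64_t val)
    {
        int bin = 0;

        while (val > 1 && bin < NUM_BINS - 1) {
            val >>= 1;
            bin++;
        }
        return bin;
    }
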
>>>> diff --git a/manpages.mk b/manpages.mk
>>>> index 351155f..9af2fa8 100644
>>>> --- a/manpages.mk
>>>> +++ b/manpages.mk
>>>> @@ -256,6 +256,7 @@ vswitchd/ovs-vswitchd.8: \
>>>>    lib/dpctl.man \
>>>>    lib/memory-unixctl.man \
>>>>    lib/netdev-dpdk-unixctl.man \
>>>> +  lib/netdev-dpif-unixctl.man \
>>>>    lib/service.man \
>>>>    lib/ssl-bootstrap.man \
>>>>    lib/ssl.man \
>>>> @@ -272,6 +273,7 @@ lib/daemon.man:
>>>>  lib/dpctl.man:
>>>>  lib/memory-unixctl.man:
>>>>  lib/netdev-dpdk-unixctl.man:
>>>> +lib/netdev-dpif-unixctl.man:
>>>>  lib/service.man:
>>>>  lib/ssl-bootstrap.man:
>>>>  lib/ssl.man:
>>>> diff --git a/vswitchd/ovs-vswitchd.8.in b/vswitchd/ovs-vswitchd.8.in
>>>> index 80e5f53..7e4714a 100644
>>>> --- a/vswitchd/ovs-vswitchd.8.in
>>>> +++ b/vswitchd/ovs-vswitchd.8.in
>>>> @@ -256,32 +256,7 @@ type).
>>>>  ..
>>>>  .so lib/dpctl.man
>>>>  .
>>>> -.SS "DPIF-NETDEV COMMANDS"
>>>> -These commands are used to expose internal information (mostly statistics)
>>>> -about the ``dpif-netdev'' userspace datapath. If there is only one datapath
>>>> -(as is often the case, unless \fBdpctl/\fR commands are used), the \fIdp\fR
>>>> -argument can be omitted.
>>>> -.IP "\fBdpif-netdev/pmd-stats-show\fR [\fIdp\fR]"
>>>> -Shows performance statistics for each pmd thread of the datapath \fIdp\fR.
>>>> -The special thread ``main'' sums up the statistics of every non pmd thread.
>>>> -The sum of ``emc hits'', ``masked hits'' and ``miss'' is the number of
>>>> -packets received by the datapath.  Cycles are counted using the TSC or similar
>>>> -facilities (when available on the platform).  To reset these counters use
>>>> -\fBdpif-netdev/pmd-stats-clear\fR. The duration of one cycle depends on the
>>>> -measuring infrastructure. ``idle cycles'' refers to cycles spent polling
>>>> -devices but not receiving any packets. ``processing cycles'' refers to cycles
>>>> -spent polling devices and successfully receiving packets, plus the cycles
>>>> -spent processing said packets.
>>>> -.IP "\fBdpif-netdev/pmd-stats-clear\fR [\fIdp\fR]"
>>>> -Resets to zero the per pmd thread performance numbers shown by the
>>>> -\fBdpif-netdev/pmd-stats-show\fR command.  It will NOT reset datapath or
>>>> -bridge statistics, only the values shown by the above command.
>>>> -.IP "\fBdpif-netdev/pmd-rxq-show\fR [\fIdp\fR]"
>>>> -For each pmd thread of the datapath \fIdp\fR shows list of queue-ids with
>>>> -port names, which this thread polls.
>>>> -.IP "\fBdpif-netdev/pmd-rxq-rebalance\fR [\fIdp\fR]"
>>>> -Reassigns rxqs to pmds in the datapath \fIdp\fR based on their current usage.
>>>> -.
>>>> +.so lib/netdev-dpif-unixctl.man
>>>>  .so lib/netdev-dpdk-unixctl.man
>>>>  .so ofproto/ofproto-dpif-unixctl.man
>>>>  .so ofproto/ofproto-unixctl.man
>>>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>>>> index 61fb7b1..3aa8e8e 100644
>>>> --- a/vswitchd/vswitch.xml
>>>> +++ b/vswitchd/vswitch.xml
>>>> @@ -375,6 +375,18 @@
>>>>          </p>
>>>>        </column>
>>>>
>>>> +      <column name="other_config" key="pmd-perf-metrics"
>>>> +              type='{"type": "boolean"}'>
>>>> +        <p>
>>>> +          Enables recording of detailed PMD performance metrics for
>>>> +          analysis and trouble-shooting. This can have a performance
>>>> +          impact in the order of 1%.
>>>> +        </p>
>>>> +        <p>
>>>> +          Defaults to false but can be changed at any time.
>>>> +        </p>
>>>> +      </column>
>>>> +
>>>>        <column name="other_config" key="n-handler-threads"
>>>>                type='{"type": "integer", "minInteger": 1}'>
>>>>          <p>
>>>>
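
For completeness, the new knob is toggled like any other other_config key,
e.g.:

    ovs-vsctl set Open_vSwitch . other_config:pmd-perf-metrics=true

and can be switched off again at run-time with =false.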