Re: [ovs-dev] [PATCH v12 3/3] dpif-netdev: Detection and logging of suspicious PMD iterations

2018-04-20 Thread O Mahony, Billy
Acked-by: Billy O'Mahony 

> -Original Message-
> From: Jan Scheurich [mailto:jan.scheur...@ericsson.com]
> Sent: Thursday, April 19, 2018 6:41 PM
> To: d...@openvswitch.org
> Cc: ktray...@redhat.com; Stokes, Ian ;
> i.maxim...@samsung.com; O Mahony, Billy ; Jan
> Scheurich 
> Subject: [PATCH v12 3/3] dpif-netdev: Detection and logging of suspicious PMD
> iterations
> 
> This patch enhances dpif-netdev-perf to detect iterations with suspicious
> statistics according to the following criteria:
> 
> - iteration lasts longer than US_THR microseconds (default 250).
>   This can be used to capture events where a PMD is blocked or
>   interrupted for such a period of time that there is a risk for
>   dropped packets on any of its Rx queues.
> 
> - max vhost qlen exceeds a threshold Q_THR (default 128). This can
>   be used to infer virtio queue overruns and dropped packets inside
>   a VM, which are not visible in OVS otherwise.
> 
> Such suspicious iterations can be logged together with their iteration 
> statistics
> to be able to correlate them to packet drop or other events outside OVS.
> 
> A new command is introduced to enable/disable logging at run-time and to
> adjust the above thresholds for suspicious iterations:
> 
> ovs-appctl dpif-netdev/pmd-perf-log-set on | off
> [-b before] [-a after] [-e|-ne] [-us usec] [-q qlen]
> 
> Turn logging on or off at run-time (on|off).
> 
> -b before:  The number of iterations before the suspicious iteration to
> be logged (default 5).
> -a after:   The number of iterations after the suspicious iteration to
> be logged (default 5).
> -e: Extend logging interval if another suspicious iteration is
> detected before logging occurs.
> -ne:Do not extend logging interval (default).
> -q qlen:Suspicious vhost queue fill level threshold. Increase this
> to 512 if the Qemu supports 1024 virtio queue length.
> (default 128).
> -us usec:   change the duration threshold for a suspicious iteration
> (default 250 us).
> 
> Note: Logging of suspicious iterations itself consumes a considerable amount 
> of
> processing cycles of a PMD which may be visible in the iteration history. In 
> the
> worst case this can lead OVS to detect another suspicious iteration caused by
> logging.
> 
> If more than 100 iterations around a suspicious iteration have been logged 
> once,
> OVS falls back to the safe default values (-b 5/-a 5/-ne) to avoid that 
> logging
> itself causes continuos further logging.
> 
> Signed-off-by: Jan Scheurich 
> Acked-by: Billy O'Mahony 
> ---
>  NEWS|   2 +
>  lib/dpif-netdev-perf.c  | 223
> 
>  lib/dpif-netdev-perf.h  |  21 +
>  lib/dpif-netdev-unixctl.man |  59 
>  lib/dpif-netdev.c   |   5 +
>  5 files changed, 310 insertions(+)
> 
> diff --git a/NEWS b/NEWS
> index a665c7f..7259492 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -27,6 +27,8 @@ Post-v2.9.0
>   * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single
> PMD
>   * Detailed PMD performance metrics available with new command
>   ovs-appctl dpif-netdev/pmd-perf-show
> + * Supervision of PMD performance metrics and logging of suspicious
> +   iterations
> 
>  v2.9.0 - 19 Feb 2018
>  
> diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c index
> caa0e27..47ce2c2 100644
> --- a/lib/dpif-netdev-perf.c
> +++ b/lib/dpif-netdev-perf.c
> @@ -25,6 +25,24 @@
> 
>  VLOG_DEFINE_THIS_MODULE(pmd_perf);
> 
> +#define ITER_US_THRESHOLD 250   /* Warning threshold for iteration duration
> +   in microseconds. */
> +#define VHOST_QUEUE_FULL 128/* Size of the virtio TX queue. */
> +#define LOG_IT_BEFORE 5 /* Number of iterations to log before
> +   suspicious iteration. */
> +#define LOG_IT_AFTER 5  /* Number of iterations to log after
> +   suspicious iteration. */
> +
> +bool log_enabled = false;
> +bool log_extend = false;
> +static uint32_t log_it_before = LOG_IT_BEFORE; static uint32_t
> +log_it_after = LOG_IT_AFTER; static uint32_t log_us_thr =
> +ITER_US_THRESHOLD; uint32_t log_q_thr = VHOST_QUEUE_FULL; uint64_t
> +iter_cycle_threshold;
> +
> +static struct vlog_rate_limit latency_rl = VLOG_RATE_LIMIT_INIT(600,
> +600);
> +
>  #ifdef DPDK_NETDEV
>  static uint64_t
>  get_tsc_hz(void)
> @@ -141,6 +159,10 @@ pmd_perf_stats_init(struct pmd_perf_stats *s)
>  histogram_walls_set_log(&s->max_vhost_qfill, 0, 512);
>  s->iteration_cnt = 0;
>  s->start_ms = time_msec();
> +s->log_susp_it = UINT32_MAX;
> +s->log_begin_it = UINT32_MAX;
> +s->log_end_it = UINT32_MAX;
> +s->log_reason = NULL;
>  }
> 
>  void
> @@ -391,6 +413,10 @@ pmd_perf_stats_clear_lock(struct pmd_perf_stats *s)
>  history_init(&s->milliseconds);
>  s->st

[ovs-dev] [PATCH v12 3/3] dpif-netdev: Detection and logging of suspicious PMD iterations

2018-04-20 Thread Jan Scheurich
This patch enhances dpif-netdev-perf to detect iterations with
suspicious statistics according to the following criteria:

- iteration lasts longer than US_THR microseconds (default 250).
  This can be used to capture events where a PMD is blocked or
  interrupted for such a period of time that there is a risk for
  dropped packets on any of its Rx queues.

- max vhost qlen exceeds a threshold Q_THR (default 128). This can
  be used to infer virtio queue overruns and dropped packets inside
  a VM, which are not visible in OVS otherwise.

Such suspicious iterations can be logged together with their iteration
statistics to be able to correlate them to packet drop or other events
outside OVS.

A new command is introduced to enable/disable logging at run-time and
to adjust the above thresholds for suspicious iterations:

ovs-appctl dpif-netdev/pmd-perf-log-set on | off
[-b before] [-a after] [-e|-ne] [-us usec] [-q qlen]

Turn logging on or off at run-time (on|off).

-b before:  The number of iterations before the suspicious iteration to
be logged (default 5).
-a after:   The number of iterations after the suspicious iteration to
be logged (default 5).
-e: Extend logging interval if another suspicious iteration is
detected before logging occurs.
-ne:Do not extend logging interval (default).
-q qlen:Suspicious vhost queue fill level threshold. Increase this
to 512 if the Qemu supports 1024 virtio queue length.
(default 128).
-us usec:   change the duration threshold for a suspicious iteration
(default 250 us).

Note: Logging of suspicious iterations itself consumes a considerable
amount of processing cycles of a PMD which may be visible in the iteration
history. In the worst case this can lead OVS to detect another
suspicious iteration caused by logging.

If more than 100 iterations around a suspicious iteration have been
logged once, OVS falls back to the safe default values (-b 5/-a 5/-ne)
to avoid that logging itself causes continuos further logging.

Signed-off-by: Jan Scheurich 
Acked-by: Billy O'Mahony 
---
 NEWS|   2 +
 lib/dpif-netdev-perf.c  | 223 
 lib/dpif-netdev-perf.h  |  21 +
 lib/dpif-netdev-unixctl.man |  59 
 lib/dpif-netdev.c   |   5 +
 5 files changed, 310 insertions(+)

diff --git a/NEWS b/NEWS
index a665c7f..7259492 100644
--- a/NEWS
+++ b/NEWS
@@ -27,6 +27,8 @@ Post-v2.9.0
  * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
  * Detailed PMD performance metrics available with new command
  ovs-appctl dpif-netdev/pmd-perf-show
+ * Supervision of PMD performance metrics and logging of suspicious
+   iterations
 
 v2.9.0 - 19 Feb 2018
 
diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c
index caa0e27..47ce2c2 100644
--- a/lib/dpif-netdev-perf.c
+++ b/lib/dpif-netdev-perf.c
@@ -25,6 +25,24 @@
 
 VLOG_DEFINE_THIS_MODULE(pmd_perf);
 
+#define ITER_US_THRESHOLD 250   /* Warning threshold for iteration duration
+   in microseconds. */
+#define VHOST_QUEUE_FULL 128/* Size of the virtio TX queue. */
+#define LOG_IT_BEFORE 5 /* Number of iterations to log before
+   suspicious iteration. */
+#define LOG_IT_AFTER 5  /* Number of iterations to log after
+   suspicious iteration. */
+
+bool log_enabled = false;
+bool log_extend = false;
+static uint32_t log_it_before = LOG_IT_BEFORE;
+static uint32_t log_it_after = LOG_IT_AFTER;
+static uint32_t log_us_thr = ITER_US_THRESHOLD;
+uint32_t log_q_thr = VHOST_QUEUE_FULL;
+uint64_t iter_cycle_threshold;
+
+static struct vlog_rate_limit latency_rl = VLOG_RATE_LIMIT_INIT(600, 600);
+
 #ifdef DPDK_NETDEV
 static uint64_t
 get_tsc_hz(void)
@@ -141,6 +159,10 @@ pmd_perf_stats_init(struct pmd_perf_stats *s)
 histogram_walls_set_log(&s->max_vhost_qfill, 0, 512);
 s->iteration_cnt = 0;
 s->start_ms = time_msec();
+s->log_susp_it = UINT32_MAX;
+s->log_begin_it = UINT32_MAX;
+s->log_end_it = UINT32_MAX;
+s->log_reason = NULL;
 }
 
 void
@@ -391,6 +413,10 @@ pmd_perf_stats_clear_lock(struct pmd_perf_stats *s)
 history_init(&s->milliseconds);
 s->start_ms = time_msec();
 s->milliseconds.sample[0].timestamp = s->start_ms;
+s->log_susp_it = UINT32_MAX;
+s->log_begin_it = UINT32_MAX;
+s->log_end_it = UINT32_MAX;
+s->log_reason = NULL;
 /* Clearing finished. */
 s->clear = false;
 ovs_mutex_unlock(&s->clear_mutex);
@@ -442,6 +468,7 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int 
rx_packets,
 uint64_t now_tsc = cycles_counter_update(s);
 struct iter_stats *cum_ms;
 uint64_t cycles, cycles_per_pkt = 0;
+char *reason = NULL;
 
 cycles = now_tsc - s->start_tsc;
 s->current.timestamp = s->iteration_