On Fri, Dec 09, 2022 at 03:32:06PM +0000, Kevin Traynor wrote:
> On 04/12/2022 11:16, Cheng Li wrote:
> > Currently, pmd_rebalance_dry_run() calculates the overall variance of
> > all pmds regardless of their numa location. The overall result may
> > hide an imbalance in an individual numa.
> >
>
> Hi. Thanks, the idea looks like a good one. I didn't test yet; a couple
> of comments on the code below.
Hi Kevin, thanks for the review.

> > Consider the following case. Numa 0 is free because the VMs on numa 0
> > are not sending pkts, while numa 1 is busy. Within numa 1, the pmd
> > workloads are not balanced. Obviously, moving 500 kpps of workload
> > from pmd 126 to pmd 62 will make numa 1 much more balanced. For
> > numa 1 the variance improvement will be almost 100%, because after
> > rebalance each pmd in numa 1 holds the same workload (variance ~= 0).
> > But the overall variance improvement is only about 20%, which may not
> > trigger auto_lb.
> >
> > ```
> > numa_id core_id kpps
> >       0      30    0
> >       0      31    0
> >       0      94    0
> >       0      95    0
> >       1     126 1500
> >       1     127 1000
> >       1      63 1000
> >       1      62  500
> > ```
> >
> > As auto_lb doesn't work if any cross_numa rxq exists, it means that
>
> That's not fully true. It can run with cross-numa in very limited
> circumstances, where there are only PMDs active on one numa.
>
> It doesn't diminish the idea here, but it's best not to write that
> blanket statement as it may confuse.

Makes sense, I will reword the description.

> > auto_lb only balances rxq assignment within each numa. So it makes
> > more sense to calculate the variance improvement per numa node.
> >
> > Signed-off-by: Cheng Li <lic...@chinatelecom.cn>
> > ---
> >  lib/dpif-netdev.c | 90 +++++++++++++++++++++++++++++--------------------------
> >  1 file changed, 47 insertions(+), 43 deletions(-)
> >
>
> I think at least some docs would need to be updated to say it's
> variance per NUMA.

After going through the doc directory, I found two places:
1. Documentation/topics/dpdk/pmd.rst and 2.
NEWS.

> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > index 2c08a71..6a53f13 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -6076,39 +6076,33 @@ rxq_scheduling(struct dp_netdev *dp)
> >  static uint64_t variance(uint64_t a[], int n);
> >  static uint64_t
> > -sched_numa_list_variance(struct sched_numa_list *numa_list)
> > +sched_numa_variance(struct sched_numa *numa)
> >  {
> > -    struct sched_numa *numa;
> >      uint64_t *percent_busy = NULL;
> > -    unsigned total_pmds = 0;
> >      int n_proc = 0;
> >      uint64_t var;
> >
> > -    HMAP_FOR_EACH (numa, node, &numa_list->numas) {
> > -        total_pmds += numa->n_pmds;
> > -        percent_busy = xrealloc(percent_busy,
> > -                                total_pmds * sizeof *percent_busy);
> > +    percent_busy = xmalloc(numa->n_pmds * sizeof *percent_busy);
> >
> > -        for (unsigned i = 0; i < numa->n_pmds; i++) {
> > -            struct sched_pmd *sched_pmd;
> > -            uint64_t total_cycles = 0;
> > +    for (unsigned i = 0; i < numa->n_pmds; i++) {
> > +        struct sched_pmd *sched_pmd;
> > +        uint64_t total_cycles = 0;
> >
> > -            sched_pmd = &numa->pmds[i];
> > -            /* Exclude isolated PMDs from variance calculations. */
> > -            if (sched_pmd->isolated == true) {
> > -                continue;
> > -            }
> > -            /* Get the total pmd cycles for an interval. */
> > -            atomic_read_relaxed(&sched_pmd->pmd->intrvl_cycles, &total_cycles);
> > +        sched_pmd = &numa->pmds[i];
> > +        /* Exclude isolated PMDs from variance calculations. */
> > +        if (sched_pmd->isolated == true) {
> > +            continue;
> > +        }
> > +        /* Get the total pmd cycles for an interval. */
> > +        atomic_read_relaxed(&sched_pmd->pmd->intrvl_cycles, &total_cycles);
> >
> > -            if (total_cycles) {
> > -                /* Estimate the cycles to cover all intervals. */
> > -                total_cycles *= PMD_INTERVAL_MAX;
> > -                percent_busy[n_proc++] = (sched_pmd->pmd_proc_cycles * 100)
> > -                                             / total_cycles;
> > -            } else {
> > -                percent_busy[n_proc++] = 0;
> > -            }
> > +        if (total_cycles) {
> > +            /* Estimate the cycles to cover all intervals. */
> > +            total_cycles *= PMD_INTERVAL_MAX;
> > +            percent_busy[n_proc++] = (sched_pmd->pmd_proc_cycles * 100)
> > +                                         / total_cycles;
> > +        } else {
> > +            percent_busy[n_proc++] = 0;
> >          }
> >      }
> >      var = variance(percent_busy, n_proc);
> > @@ -6182,6 +6176,7 @@ pmd_rebalance_dry_run(struct dp_netdev *dp)
> >      struct sched_numa_list numa_list_est;
> >      bool thresh_met = false;
> >      uint64_t current_var, estimate_var;
> > +    struct sched_numa *numa_cur, *numa_est;
> >      uint64_t improvement = 0;
> >
> >      VLOG_DBG("PMD auto load balance performing dry run.");
> > @@ -6200,24 +6195,33 @@ pmd_rebalance_dry_run(struct dp_netdev *dp)
> >          sched_numa_list_count(&numa_list_est) == 1) {
> >
> >          /* Calculate variances. */
> > -        current_var = sched_numa_list_variance(&numa_list_cur);
> > -        estimate_var = sched_numa_list_variance(&numa_list_est);
> > -
> > -        if (estimate_var < current_var) {
> > -            improvement = ((current_var - estimate_var) * 100) / current_var;
> > -        }
> > -        VLOG_DBG("Current variance %"PRIu64" Estimated variance %"PRIu64".",
> > -                 current_var, estimate_var);
> > -        VLOG_DBG("Variance improvement %"PRIu64"%%.", improvement);
> > -
> > -        if (improvement >= dp->pmd_alb.rebalance_improve_thresh) {
> > -            thresh_met = true;
> > -            VLOG_DBG("PMD load variance improvement threshold %u%% "
> > -                     "is met.", dp->pmd_alb.rebalance_improve_thresh);
> > -        } else {
> > -            VLOG_DBG("PMD load variance improvement threshold "
> > -                     "%u%% is not met.",
> > -                     dp->pmd_alb.rebalance_improve_thresh);
> > +        HMAP_FOR_EACH (numa_cur, node, &numa_list_cur.numas) {
> > +            numa_est = sched_numa_list_lookup(&numa_list_est,
> > +                                              numa_cur->numa_id);
> > +            if (!numa_est) {
> > +                continue;
> > +            }
> > +            current_var = sched_numa_variance(numa_cur);
> > +            estimate_var = sched_numa_variance(numa_est);
> > +            if (estimate_var < current_var) {
> > +                improvement = ((current_var - estimate_var) * 100)
> > +                              / current_var;
> > +            }
> > +            VLOG_DBG("Numa node %d. Variance improvement %"PRIu64"%%. Current"
> > +                     " variance %"PRIu64" Estimated variance %"PRIu64".",
> > +                     numa_cur->numa_id, improvement,
> > +                     current_var, estimate_var);
>
> Not sure the reason for changing the logging order? Just the NUMA
> node needs to be added.

I didn't mean to change the logging order; I just wanted to make the
mapping from numa id to improvement/variance values clear, so I chose
a simple way and merged 'numa id', 'improvement' and 'variance value'
into one log message. Would you prefer the variance values first, then
the improvement? Like below:

"Numa node <I>. Current variance <A> Estimated variance <B>. Variance
improvement <P>."

> > +
> > +            if (improvement >= dp->pmd_alb.rebalance_improve_thresh) {
> > +                VLOG_DBG("PMD load variance improvement threshold %u%% "
> > +                         "is met.", dp->pmd_alb.rebalance_improve_thresh);
> > +                thresh_met = true;
> > +                break;
>
> I think it's better to remove the break here and move the "result"
> logging outside the for loop. Just set thresh_met = true at this
> point. So then the logging looks like:
>
> "PMD auto load balance dry run"
> "Numa node x. Current y...Estimate z...Improvement n"
> "Numa node x. Current y...Estimate z...Improvement n"
> "PMD load improvement of m is met." or "...not met"
>
> For debug it would be important to continue the run on the other
> NUMA even if the threshold is already met, so removing the break
> would allow this.

It makes sense, will update in v2.

> > +            } else {
> > +                VLOG_DBG("PMD load variance improvement threshold "
> > +                         "%u%% is not met.",
> > +                         dp->pmd_alb.rebalance_improve_thresh);
> > +            }
> >          }
> >      } else {
> >          VLOG_DBG("PMD auto load balance detected cross-numa polling with "

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev