On Fri, Dec 09, 2022 at 03:32:06PM +0000, Kevin Traynor wrote:
> On 04/12/2022 11:16, Cheng Li wrote:
> > Currently, pmd_rebalance_dry_run() calculates the overall variance of
> > all pmds regardless of their numa location. The overall result may
> > hide an imbalance in an individual numa.
> >
>
> Hi. Thanks, the idea looks like a good one. I didn't test yet; a couple
> of comments on the code below.
Hi Kevin, thanks for the review.

> > Consider the following case. Numa 0 is free because the VMs on numa 0
> > are not sending pkts, while numa 1 is busy. Within numa 1, the pmd
> > workloads are not balanced. Obviously, moving 500 kpps of workload
> > from pmd 126 to pmd 62 will make numa 1 much more balanced. For
> > numa 1 the variance improvement will be almost 100%, because after
> > rebalance each pmd in numa 1 holds the same workload (variance ~= 0).
> > But the overall variance improvement is only about 20%, which may not
> > trigger auto_lb.
> >
> > ```
> > numa_id core_id kpps
> >       0      30    0
> >       0      31    0
> >       0      94    0
> >       0      95    0
> >       1     126 1500
> >       1     127 1000
> >       1      63 1000
> >       1      62  500
> > ```
> >
> > As auto_lb doesn't work if any cross_numa rxq exists, it means that
>
> That's not fully true. It can run with cross-numa in very limited
> circumstances, where there are only PMDs active on one numa.
>
> It doesn't diminish the idea here, but it's best not to write that
> blanket statement as it may confuse.

Makes sense, I will reword the description.

> > auto_lb only balances rxq assignment within each numa. So it makes
> > more sense to calculate the variance improvement per numa node.
> >
> > Signed-off-by: Cheng Li <lic...@chinatelecom.cn>
> > ---
> >  lib/dpif-netdev.c | 90 +++++++++++++++++++++++++++++--------------------------
> >  1 file changed, 47 insertions(+), 43 deletions(-)
> >
>
> I think at least some docs would need to be updated to say it's
> variance per NUMA.

After going through the doc directory, I found two places:
1. Documentation/topics/dpdk/pmd.rst and 2.
NEWS.

> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > index 2c08a71..6a53f13 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -6076,39 +6076,33 @@ rxq_scheduling(struct dp_netdev *dp)
> >  static uint64_t variance(uint64_t a[], int n);
> >  static uint64_t
> > -sched_numa_list_variance(struct sched_numa_list *numa_list)
> > +sched_numa_variance(struct sched_numa *numa)
> >  {
> > -    struct sched_numa *numa;
> >      uint64_t *percent_busy = NULL;
> > -    unsigned total_pmds = 0;
> >      int n_proc = 0;
> >      uint64_t var;
> >
> > -    HMAP_FOR_EACH (numa, node, &numa_list->numas) {
> > -        total_pmds += numa->n_pmds;
> > -        percent_busy = xrealloc(percent_busy,
> > -                                total_pmds * sizeof *percent_busy);
> > +    percent_busy = xmalloc(numa->n_pmds * sizeof *percent_busy);
> >
> > -        for (unsigned i = 0; i < numa->n_pmds; i++) {
> > -            struct sched_pmd *sched_pmd;
> > -            uint64_t total_cycles = 0;
> > +    for (unsigned i = 0; i < numa->n_pmds; i++) {
> > +        struct sched_pmd *sched_pmd;
> > +        uint64_t total_cycles = 0;
> >
> > -            sched_pmd = &numa->pmds[i];
> > -            /* Exclude isolated PMDs from variance calculations. */
> > -            if (sched_pmd->isolated == true) {
> > -                continue;
> > -            }
> > -            /* Get the total pmd cycles for an interval. */
> > -            atomic_read_relaxed(&sched_pmd->pmd->intrvl_cycles, &total_cycles);
> > +        sched_pmd = &numa->pmds[i];
> > +        /* Exclude isolated PMDs from variance calculations. */
> > +        if (sched_pmd->isolated == true) {
> > +            continue;
> > +        }
> > +        /* Get the total pmd cycles for an interval. */
> > +        atomic_read_relaxed(&sched_pmd->pmd->intrvl_cycles, &total_cycles);
> >
> > -            if (total_cycles) {
> > -                /* Estimate the cycles to cover all intervals. */
> > -                total_cycles *= PMD_INTERVAL_MAX;
> > -                percent_busy[n_proc++] = (sched_pmd->pmd_proc_cycles * 100)
> > -                                             / total_cycles;
> > -            } else {
> > -                percent_busy[n_proc++] = 0;
> > -            }
> > +        if (total_cycles) {
> > +            /* Estimate the cycles to cover all intervals. */
> > +            total_cycles *= PMD_INTERVAL_MAX;
> > +            percent_busy[n_proc++] = (sched_pmd->pmd_proc_cycles * 100)
> > +                                         / total_cycles;
> > +        } else {
> > +            percent_busy[n_proc++] = 0;
> >          }
> >      }
> >      var = variance(percent_busy, n_proc);
> > @@ -6182,6 +6176,7 @@ pmd_rebalance_dry_run(struct dp_netdev *dp)
> >      struct sched_numa_list numa_list_est;
> >      bool thresh_met = false;
> >      uint64_t current_var, estimate_var;
> > +    struct sched_numa *numa_cur, *numa_est;
> >      uint64_t improvement = 0;
> >
> >      VLOG_DBG("PMD auto load balance performing dry run.");
> > @@ -6200,24 +6195,33 @@ pmd_rebalance_dry_run(struct dp_netdev *dp)
> >          sched_numa_list_count(&numa_list_est) == 1) {
> >
> >          /* Calculate variances. */
> > -        current_var = sched_numa_list_variance(&numa_list_cur);
> > -        estimate_var = sched_numa_list_variance(&numa_list_est);
> > -
> > -        if (estimate_var < current_var) {
> > -            improvement = ((current_var - estimate_var) * 100) / current_var;
> > -        }
> > -        VLOG_DBG("Current variance %"PRIu64" Estimated variance %"PRIu64".",
> > -                 current_var, estimate_var);
> > -        VLOG_DBG("Variance improvement %"PRIu64"%%.", improvement);
> > -
> > -        if (improvement >= dp->pmd_alb.rebalance_improve_thresh) {
> > -            thresh_met = true;
> > -            VLOG_DBG("PMD load variance improvement threshold %u%% "
> > -                     "is met.", dp->pmd_alb.rebalance_improve_thresh);
> > -        } else {
> > -            VLOG_DBG("PMD load variance improvement threshold "
> > -                     "%u%% is not met.",
> > -                     dp->pmd_alb.rebalance_improve_thresh);
> > +        HMAP_FOR_EACH (numa_cur, node, &numa_list_cur.numas) {
> > +            numa_est = sched_numa_list_lookup(&numa_list_est,
> > +                                              numa_cur->numa_id);
> > +            if (!numa_est) {
> > +                continue;
> > +            }
> > +            current_var = sched_numa_variance(numa_cur);
> > +            estimate_var = sched_numa_variance(numa_est);
> > +            if (estimate_var < current_var) {
> > +                improvement = ((current_var - estimate_var) * 100)
> > +                              / current_var;
> > +            }
> > +            VLOG_DBG("Numa node %d. Variance improvement %"PRIu64"%%. Current"
> > +                     " variance %"PRIu64" Estimated variance %"PRIu64".",
> > +                     numa_cur->numa_id, improvement,
> > +                     current_var, estimate_var);
>
> Not sure the reason for changing the logging order? Just the NUMA
> node needs to be added.

I didn't mean to change the logging order; I just wanted to make the
mapping from numa id to improvement/variance values clear, so I chose
a simple way and merged 'numa id', 'improvement' and 'variance value'
into one log message. Would you prefer the variance values first, then
the improvement? Like below:

"Numa node <I>. Current variance <A> Estimated variance <B>. Variance
improvement <P>."

> > +
> > +            if (improvement >= dp->pmd_alb.rebalance_improve_thresh) {
> > +                VLOG_DBG("PMD load variance improvement threshold %u%% "
> > +                         "is met.", dp->pmd_alb.rebalance_improve_thresh);
> > +                thresh_met = true;
> > +                break;
>
> I think it's better to remove the break here and move the "result"
> logging outside the for loop. Just set thresh_met = true at this
> point. So then the logging looks like:
>
> "PMD auto load balance dry run"
> "Numa node x. Current y...Estimate z...Improvement n"
> "Numa node x. Current y...Estimate z...Improvement n"
> "PMD load improvement of m is met." or "...not met"
>
> For debug it would be important to continue the run on the other
> NUMA even if the threshold is already met, so removing the break
> would allow this.

It makes sense, will update in v2.

> > +            } else {
> > +                VLOG_DBG("PMD load variance improvement threshold "
> > +                         "%u%% is not met.",
> > +                         dp->pmd_alb.rebalance_improve_thresh);
> > +            }
> >          }
> >      } else {
> >          VLOG_DBG("PMD auto load balance detected cross-numa polling with "

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev