On 08/22/2018 08:32 AM, Eelco Chaudron wrote:
>
>
> On 21 Aug 2018, at 19:15, Kevin Traynor wrote:
>
>> Prior to OVS 2.9 automatic assignment of Rxqs to PMDs
>> (i.e. CPUs) was done by assigning Rxqs in an ascending
>> port/queue order, round robined across the available
>> PMDs.
>>
>> That was changed in OVS 2.9 to order the Rxqs by the
>> measured processing cycles for each, in order to try
>> and keep the busiest Rxqs on different PMDs.
>>
>> For the most part the new scheme should be better, but
>> there could be situations where a user prefers a
>> port/queue scheme because Rxqs from a single port are
>> more likely to be spread across multiple cores, and/or
>> traffic is very bursty/unpredictable.
>>
>> Add the option to have a port/queue based assignment.
>
> Some small comments below… Rest looks good.
>
Thanks for reviewing

> //Eelco
>
>> Signed-off-by: Kevin Traynor <ktray...@redhat.com>
>> ---
>>  Documentation/topics/dpdk/pmd.rst |  34 +++++++++--
>>  NEWS                              |   2 +
>>  lib/dpif-netdev.c                 | 115 +++++++++++++++++++++++++++-----------
>>  tests/pmd.at                      |  15 ++++-
>>  vswitchd/vswitch.xml              |  23 ++++++++
>>  5 files changed, 147 insertions(+), 42 deletions(-)
>>
>> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
>> index 5f0671e..e8628cc 100644
>> --- a/Documentation/topics/dpdk/pmd.rst
>> +++ b/Documentation/topics/dpdk/pmd.rst
>> @@ -113,10 +113,15 @@ means that this thread will only poll the *pinned* Rx queues.
>>
>>  If ``pmd-rxq-affinity`` is not set for Rx queues, they will be assigned to PMDs
>> -(cores) automatically. Where known, the processing cycles that have been stored
>> -for each Rx queue will be used to assign Rx queue to PMDs based on a round
>> -robin of the sorted Rx queues. For example, take the following example, where
>> -there are five Rx queues and three cores - 3, 7, and 8 - available and the
>> -measured usage of core cycles per Rx queue over the last interval is seen to
>> -be:
>> +(cores) automatically.
>> +
>> +The algorithm used to automatically assign Rxqs to PMDs can be set by::
>> +
>> +  $ ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=<assignment>
>> +
>> +By default, ``cycles`` assignment is used where the Rxqs will be ordered by
>> +their measured processing cycles, and then be assigned in descending order to
>> +PMDs based on a round robin up/down walk of the PMDs.
>> +For example, where there are five Rx queues and three cores - 3, 7, and 8 -
>> +available and the measured usage of core cycles per Rx queue over the last
>> +interval is seen to be:
>>
>>  - Queue #0: 30%
>> @@ -132,4 +137,21 @@ The Rx queues will be assigned to the cores in the following order::
>>      Core 8: Q3 (60%) | Q0 (30%)
>>
>> +Alternatively, ``portqueue`` assignment can be used, where the Rxqs are
>> +ordered by their port/queue number, and then assigned in ascending order to
>> +PMDs based on a round robin of the PMDs. This algorithm was used by default
>> +prior to OVS 2.9. For example, given the following ports and queues:
>> +
>> +- Port #0 Queue #0 (P0Q0)
>> +- Port #0 Queue #1 (P0Q1)
>> +- Port #1 Queue #0 (P1Q0)
>> +- Port #1 Queue #1 (P1Q1)
>> +- Port #1 Queue #2 (P1Q2)
>> +
>> +The Rx queues will be assigned to the cores in the following order::
>> +
>> +  Core 3: P0Q0 | P1Q1
>> +  Core 7: P0Q1 | P1Q2
>> +  Core 8: P1Q0 |
>> +
>>  To see the current measured usage history of PMD core cycles for each Rx
>>  queue::
>> diff --git a/NEWS b/NEWS
>> index 8987f9a..9e809d1 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -5,4 +5,6 @@ Post-v2.10.0
>>       - The environment variable OVS_CTL_TIMEOUT, if set, is now used
>>         as the default timeout for control utilities.
>> +     - DPDK:
>> +       * Add option for port/queue based Rxq to PMD assignment.
>>
>>
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>> index 7f836bb..507906c 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -342,4 +342,7 @@ struct dp_netdev {
>>      struct id_pool *tx_qid_pool;
>>      struct ovs_mutex tx_qid_pool_mutex;
>> +    /* Stores whether an rxq's used cycles should be
>> +     * used to influence assignment to pmds.
>> +     */
>> +    atomic_bool pmd_rxq_assign_cyc;
>>
>>      /* Protects the access of the 'struct dp_netdev_pmd_thread'
>> @@ -1491,4 +1494,5 @@ create_dp_netdev(const char *name, const struct dpif_class *class,
>>      atomic_init(&dp->emc_insert_min, DEFAULT_EM_FLOW_INSERT_MIN);
>>      atomic_init(&dp->tx_flush_interval, DEFAULT_TX_FLUSH_INTERVAL);
>> +    atomic_init(&dp->pmd_rxq_assign_cyc, true);
>>
>>      cmap_init(&dp->poll_threads);
>> @@ -3717,4 +3721,6 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>>      struct dp_netdev *dp = get_dp_netdev(dpif);
>>      const char *cmask = smap_get(other_config, "pmd-cpu-mask");
>> +    const char *pmd_rxq_assign = smap_get_def(other_config, "pmd-rxq-assign",
>> +                                              "cycles");
>>      unsigned long long insert_prob =
>>          smap_get_ullong(other_config, "emc-insert-inv-prob",
>> @@ -3779,4 +3785,20 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>>          }
>>      }
>> +
>> +    bool pmd_rxq_assign_cyc = !strcmp(pmd_rxq_assign, "cycles");
>> +    if (pmd_rxq_assign_cyc || !strcmp(pmd_rxq_assign, "portqueue")) {
>> +        bool cur_pmd_rxq_assign_cyc;
>> +        atomic_read_relaxed(&dp->pmd_rxq_assign_cyc, &cur_pmd_rxq_assign_cyc);
>> +        if (pmd_rxq_assign_cyc != cur_pmd_rxq_assign_cyc) {
>> +            atomic_store_relaxed(&dp->pmd_rxq_assign_cyc, pmd_rxq_assign_cyc);
>> +            if (pmd_rxq_assign_cyc) {
>> +                VLOG_INFO("Rxq to PMD assignment mode: "
>> +                          "Sort queues by processing cycles.");
>> +            } else {
>> +                VLOG_INFO("Rxq to PMD assignment mode: "
>> +                          "Sort queues by port/queue number.");
>> +            }
>> +        }
>> +    }
>>      return 0;
>>  }
>> @@ -4249,27 +4271,40 @@ rr_numa_list_populate(struct dp_netdev *dp, struct rr_numa_list *rr)
>>  }
>>
>> -/* Returns the next pmd from the numa node in
>> - * incrementing or decrementing order. */
>> +/* Returns the next pmd from the numa node.
>> + *
>> + * If updown is 'true' it will alternate between
>> + * selecting the next pmd in either an up or down
>> + * walk, switching between up/down each time the
>> + * min or max core is reached. e.g. 1,2,3,3,2,1,1,2...
>> + *
>> + * If updown is 'false' it will select the next pmd
>> + * in ascending order, wrapping around when max core
>> + * is reached. e.g. 1,2,3,1,2,3,1,2... */
>>  static struct dp_netdev_pmd_thread *
>> -rr_numa_get_pmd(struct rr_numa *numa)
>> +rr_numa_get_pmd(struct rr_numa *numa, bool updown)
>>  {
>> -    int numa_idx = numa->cur_index;
>> +    int numa_idx;
>>
>> -    if (numa->idx_inc == true) {
>> -        /* Incrementing through list of pmds. */
>> -        if (numa->cur_index == numa->n_pmds-1) {
>> -            /* Reached the last pmd. */
>> -            numa->idx_inc = false;
>> +    if (updown) {
>> +        numa_idx = numa->cur_index;
>> +        if (numa->idx_inc == true) {
>> +            /* Incrementing through list of pmds. */
>> +            if (numa->cur_index == numa->n_pmds-1) {
>> +                /* Reached the last pmd. */
>> +                numa->idx_inc = false;
>> +            } else {
>> +                numa->cur_index++;
>> +            }
>>          } else {
>> -            numa->cur_index++;
>> +            /* Decrementing through list of pmds. */
>> +            if (numa->cur_index == 0) {
>> +                /* Reached the first pmd. */
>> +                numa->idx_inc = true;
>> +            } else {
>> +                numa->cur_index--;
>> +            }
>>          }
>>      } else {
>> -        /* Decrementing through list of pmds. */
>> -        if (numa->cur_index == 0) {
>> -            /* Reached the first pmd. */
>> -            numa->idx_inc = true;
>> -        } else {
>> -            numa->cur_index--;
>> -        }
>> +        numa_idx = numa->cur_index++ % numa->n_pmds;
>>      }
>
> This looks like a big change, how about the following:
>
> @@ -4259,7 +4259,11 @@ rr_numa_get_pmd(struct rr_numa *numa)
>              /* Incrementing through list of pmds. */
>              if (numa->cur_index == numa->n_pmds-1) {
>                  /* Reached the last pmd.
>                  */
> -                numa->idx_inc = false;
> +                if (updown) {
> +                    numa->idx_inc = false;
> +                } else {
> +                    numa->cur_index = 0;
> +                }
>              } else {
>

I had thought it could be simpler by using the previously used LOC for that
case, but perhaps it is better to do as suggested and not have a mix of
whether the mask is required or not. Changed to the suggested approach.

>
>>      return numa->pmds[numa_idx];
>> @@ -4323,7 +4358,4 @@ compare_rxq_cycles(const void *a, const void *b)
>>   * pmds to unpinned queues.
>>   *
>> - * If 'pinned' is false queues will be sorted by processing cycles they are
>> - * consuming and then assigned to pmds in round robin order.
>> - *
>>   * The function doesn't touch the pmd threads, it just stores the assignment
>>   * in the 'pmd' member of each rxq. */
>> @@ -4338,5 +4370,7 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>>      struct rr_numa *numa = NULL;
>>      int numa_id;
>> +    bool assign_cyc;
>>
>> +    atomic_read_relaxed(&dp->pmd_rxq_assign_cyc, &assign_cyc);
>>      HMAP_FOR_EACH (port, node, &dp->ports) {
>>          if (!netdev_is_pmd(port->netdev)) {
>> @@ -4368,10 +4402,13 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>>                  rxqs = xrealloc(rxqs, sizeof *rxqs * (n_rxqs + 1));
>>              }
>> -            /* Sum the queue intervals and store the cycle history. */
>> -            for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
>> -                cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i);
>> -            }
>> -            dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST, cycle_hist);
>>
>> +            if (assign_cyc) {
>> +                /* Sum the queue intervals and store the cycle history. */
>> +                for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
>> +                    cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i);
>> +                }
>> +                dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST,
>> +                                         cycle_hist);
>> +            }
>>              /* Store the queue.
>>               */
>>              rxqs[n_rxqs++] = q;
>> @@ -4380,5 +4417,5 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>>      }
>>
>> -    if (n_rxqs > 1) {
>> +    if (n_rxqs > 1 && assign_cyc) {
>>          /* Sort the queues in order of the processing cycles
>>           * they consumed during their last pmd interval. */
>> @@ -4404,5 +4441,5 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>>                  continue;
>>              }
>> -            rxqs[i]->pmd = rr_numa_get_pmd(non_local_numa);
>> +            rxqs[i]->pmd = rr_numa_get_pmd(non_local_numa, assign_cyc);
>>              VLOG_WARN("There's no available (non-isolated) pmd thread "
>>                        "on numa node %d. Queue %d on port \'%s\' will "
>> @@ -4413,11 +4450,21 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>>                        rxqs[i]->pmd->core_id, rxqs[i]->pmd->numa_id);
>>          } else {
>> -            rxqs[i]->pmd = rr_numa_get_pmd(numa);
>> -            VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
>> -                      "rx queue %d (measured processing cycles %"PRIu64").",
>> -                      rxqs[i]->pmd->core_id, numa_id,
>> -                      netdev_rxq_get_name(rxqs[i]->rx),
>> -                      netdev_rxq_get_queue_id(rxqs[i]->rx),
>> -                      dp_netdev_rxq_get_cycles(rxqs[i], RXQ_CYCLES_PROC_HIST));
>> +            rxqs[i]->pmd = rr_numa_get_pmd(numa, assign_cyc);
>> +            if (assign_cyc) {
>> +                VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
>> +                          "rx queue %d "
>> +                          "(measured processing cycles %"PRIu64").",
>> +                          rxqs[i]->pmd->core_id, numa_id,
>> +                          netdev_rxq_get_name(rxqs[i]->rx),
>> +                          netdev_rxq_get_queue_id(rxqs[i]->rx),
>> +                          dp_netdev_rxq_get_cycles(rxqs[i],
>> +                                                   RXQ_CYCLES_PROC_HIST));
>> +            } else {
>> +                VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
>> +                          "rx queue %d.", rxqs[i]->pmd->core_id, numa_id,
>> +                          netdev_rxq_get_name(rxqs[i]->rx),
>> +                          netdev_rxq_get_queue_id(rxqs[i]->rx));
>> +            }
>> +
>>          }
>>      }
>> diff --git a/tests/pmd.at b/tests/pmd.at
>> index 4cae6c8..7cbaf41 100644
>> --- a/tests/pmd.at
>> +++ b/tests/pmd.at
>> @@ -62,5 +62,6 @@
>> m4_define([CHECK_PMD_THREADS_CREATED], [
>>
>>  m4_define([SED_NUMA_CORE_PATTERN], ["s/\(numa_id \)[[0-9]]*\( core_id \)[[0-9]]*:/\1<cleared>\2<cleared>:/"])
>> -m4_define([SED_NUMA_CORE_QUEUE_PATTERN], ["s/1 2 5 6/<group>/;s/0 3 4 7/<group>/"])
>> +m4_define([SED_NUMA_CORE_QUEUE_CYC_PATTERN], ["s/1 2 5 6/<group>/;s/0 3 4 7/<group>/"])
>> +m4_define([SED_NUMA_CORE_QUEUE_PQ_PATTERN], ["s/1 3 5 7/<group>/;s/0 2 4 6/<group>/"])
>>  m4_define([DUMMY_NUMA], [--dummy-numa="0,0,0,0"])
>>
>> @@ -146,9 +147,19 @@ pmd thread numa_id <cleared> core_id <cleared>:
>>  ])
>>
>> +AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=cycles])
>
> Should you not do a rebalance here? You are “theoretically” changing the
> config at this point.
>

It won't be needed here or below now that re-assignment will automatically
happen when the assignment mode changes.

>>  TMP=$(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])
>>  AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x3])
>>  CHECK_PMD_THREADS_CREATED([2], [], [+$TMP])
>>
>> -AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed ':a;/AVAIL$/{N;s/\n//;ba;}' | parse_pmd_rxq_show_group | sed SED_NUMA_CORE_QUEUE_PATTERN], [0], [dnl
>> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed ':a;/AVAIL$/{N;s/\n//;ba;}' | parse_pmd_rxq_show_group | sed SED_NUMA_CORE_QUEUE_CYC_PATTERN], [0], [dnl
>> +port: p0 queue-id: <group>
>> +port: p0 queue-id: <group>
>> +])
>> +
>> +AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=portqueue])
>> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-rebalance], [0], [dnl
>> +pmd rxq rebalance requested.
>> +])
>> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed ':a;/AVAIL$/{N;s/\n//;ba;}' | parse_pmd_rxq_show_group | sed SED_NUMA_CORE_QUEUE_PQ_PATTERN], [0], [dnl
>>  port: p0 queue-id: <group>
>>  port: p0 queue-id: <group>
>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>> index 0cd8520..b5789a5 100644
>> --- a/vswitchd/vswitch.xml
>> +++ b/vswitchd/vswitch.xml
>> @@ -2810,4 +2810,27 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
>>        </column>
>>
>> +      <column name="other_config" key="pmd-rxq-assign"
>> +              type='{"type": "string"}'>
>> +        <p>
>> +          Specifies how RX queues are automatically assigned to CPU cores.
>> +          Options:
>> +          <code>cycles</code> - Sort queues in descending order of
>> +          measured processing cycles before assigning round robin
>> +          to CPUs.
>> +          <code>portqueue</code> - Sort queues in ascending order
>> +          of port/queue number before assigning round robin to CPUs.
>> +        </p>
>> +        <p>
>> +          The default value is <code>cycles</code>.
>> +        </p>
>> +        <p>
>> +          Note that this will not automatically perform a reschedule.
>> +          When the mode is set to <code>portqueue</code> the
>> +          <code>ovs-appctl dpif-netdev/pmd-rxq-rebalance</code> command will
>> +          not change the Rxq to PMD assignment, as the Port/Queue/PMDs config
>> +          will be unchanged from the last reconfiguration.
>
> The last part about the rebalance in portqueue mode is a bit confusing. I
> initially thought you meant when setting it to portqueue mode, but it's
> actually the case where it is already set to portqueue mode and rebalance
> is run again.
>

You're right, and the last sentence was not really needed. Anyway, the
operation has changed in v2 and I've replaced this paragraph.

thanks,
Kevin.

>> +        </p>
>> +      </column>
>> +
>>      <column name="options" key="vhost-server-path"
>>              type='{"type": "string"}'>
>> --
>> 1.8.3.1

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev