On 08/27/2018 04:04 PM, Ilya Maximets wrote:
> On 27.08.2018 17:19, Kevin Traynor wrote:
>> On 08/27/2018 02:30 PM, Ilya Maximets wrote:
>>> On 25.08.2018 13:00, Kevin Traynor wrote:
>>>> Prior to OVS 2.9 automatic assignment of Rxqs to PMDs
>>>> (i.e. CPUs) was done by round-robin.
>>>>
>>>> That was changed in OVS 2.9 to ordering the Rxqs based on
>>>> their measured processing cycles. This was to assign the
>>>> busiest Rxqs to different PMDs, improving aggregate
>>>> throughput.
>>>>
>>>> For the most part the new scheme should be better, but
>>>> there could be situations where a user prefers a simple
>>>> round-robin scheme because Rxqs from a single port are
>>>> more likely to be spread across multiple PMDs, and/or
>>>> traffic is very bursty/unpredictable.
>>>>
>>>> Add 'pmd-rxq-assign' config to allow a user to select
>>>> round-robin based assignment.
>>>>
>>>> Signed-off-by: Kevin Traynor <ktray...@redhat.com>
>>>> Acked-by: Eelco Chaudron <echau...@redhat.com>
>>>> ---
>>>>
>>>> V3:
>>>> - Rolled in some style and vswitch.xml changes (Ilya)
>>>> - Set cycles mode by default on wrong config (Ilya)
>>>>
>>>> V2:
>>>> - simplified nextpmd change (Eelco)
>>>> - removed confusing doc sentence (Eelco)
>>>> - fixed commit msg (Ilya)
>>>> - made change in pmd-rxq-assign value also perform re-assignment (Ilya)
>>>> - renamed to roundrobin mode (Ilya)
>>>> - moved vswitch.xml changes to right config section (Ilya)
>>>> - comment/log updates
>>>> - moved NEWS update to separate patch as it's been changing on master
>>>>
>>>>  Documentation/topics/dpdk/pmd.rst | 33 +++++++++++++---
>>>>  lib/dpif-netdev.c                 | 83 +++++++++++++++++++++++++++++----------
>>>>  tests/pmd.at                      | 12 +++++-
>>>>  vswitchd/vswitch.xml              | 24 +++++++++++
>>>>  4 files changed, 123 insertions(+), 29 deletions(-)
>>>>
>>>> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
>>>> index 5f0671e..dd9172d 100644
>>>> --- a/Documentation/topics/dpdk/pmd.rst
>>>> +++ b/Documentation/topics/dpdk/pmd.rst
>>>> @@ -113,10 +113,15 @@ means that this thread will only poll the *pinned* Rx queues.
>>>>  
>>>>  If ``pmd-rxq-affinity`` is not set for Rx queues, they will be assigned to PMDs
>>>> -(cores) automatically. Where known, the processing cycles that have been stored
>>>> -for each Rx queue will be used to assign Rx queue to PMDs based on a round
>>>> -robin of the sorted Rx queues. For example, take the following example, where
>>>> -there are five Rx queues and three cores - 3, 7, and 8 - available and the
>>>> -measured usage of core cycles per Rx queue over the last interval is seen to
>>>> -be:
>>>> +(cores) automatically.
>>>> +
>>>> +The algorithm used to automatically assign Rxqs to PMDs can be set by::
>>>> +
>>>> +    $ ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=<assignment>
>>>> +
>>>> +By default, ``cycles`` assignment is used where the Rxqs will be ordered by
>>>> +their measured processing cycles, and then be evenly assigned in descending
>>>> +order to PMDs based on an up/down walk of the PMDs. For example, where there
>>>> +are five Rx queues and three cores - 3, 7, and 8 - available and the measured
>>>> +usage of core cycles per Rx queue over the last interval is seen to be:
>>>>  
>>>>  - Queue #0: 30%
>>>> @@ -132,4 +137,20 @@ The Rx queues will be assigned to the cores in the following order::
>>>>      Core 8: Q3 (60%) | Q0 (30%)
>>>>  
>>>> +Alternatively, ``roundrobin`` assignment can be used, where the Rxqs are
>>>> +assigned to PMDs in a round-robined fashion. This algorithm was used by
>>>> +default prior to OVS 2.9. For example, given the following ports and queues:
>>>> +
>>>> +- Port #0 Queue #0 (P0Q0)
>>>> +- Port #0 Queue #1 (P0Q1)
>>>> +- Port #1 Queue #0 (P1Q0)
>>>> +- Port #1 Queue #1 (P1Q1)
>>>> +- Port #1 Queue #2 (P1Q2)
>>>> +
>>>> +The Rx queues may be assigned to the cores in the following order::
>>>> +
>>>> +    Core 3: P0Q0 | P1Q1
>>>> +    Core 7: P0Q1 | P1Q2
>>>> +    Core 8: P1Q0 |
>>>> +
>>>>  To see the current measured usage history of PMD core cycles for each Rx queue::
>>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>>> index 7f836bb..8f004c5 100644
>>>> --- a/lib/dpif-netdev.c
>>>> +++ b/lib/dpif-netdev.c
>>>> @@ -342,4 +342,6 @@ struct dp_netdev {
>>>>      struct id_pool *tx_qid_pool;
>>>>      struct ovs_mutex tx_qid_pool_mutex;
>>>> +    /* Use measured cycles for rxq to pmd assignment. */
>>>> +    bool pmd_rxq_assign_cyc;
>>>>  
>>>>      /* Protects the access of the 'struct dp_netdev_pmd_thread'
>>>> @@ -1493,4 +1495,5 @@ create_dp_netdev(const char *name, const struct dpif_class *class,
>>>>  
>>>>      cmap_init(&dp->poll_threads);
>>>> +    dp->pmd_rxq_assign_cyc = true;
>>>>  
>>>>      ovs_mutex_init(&dp->tx_qid_pool_mutex);
>>>> @@ -3717,4 +3720,6 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>>>>      struct dp_netdev *dp = get_dp_netdev(dpif);
>>>>      const char *cmask = smap_get(other_config, "pmd-cpu-mask");
>>>> +    const char *pmd_rxq_assign = smap_get_def(other_config, "pmd-rxq-assign",
>>>> +                                              "cycles");
>>>>      unsigned long long insert_prob =
>>>>          smap_get_ullong(other_config, "emc-insert-inv-prob",
>>>> @@ -3779,4 +3784,18 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>>>>          }
>>>>      }
>>>> +
>>>> +    bool pmd_rxq_assign_cyc = !strcmp(pmd_rxq_assign, "cycles");
>>>> +    if (!pmd_rxq_assign_cyc && strcmp(pmd_rxq_assign, "roundrobin")) {
>>>> +        VLOG_WARN("Unsupported Rxq to PMD assignment mode: "
>>>> +                  "defaulting to 'cycles'.");
>>>
>>> Have you missed the actual printing of the wrong value?
>>> Or is that what you wanted?
>>>
>>
>> It was what I intended. If someone puts in junk, I didn't want to be
>> adding it to the log.
>>
>
> Many functions in OVS print the wrong arguments, for example LACP mode,
> bond_mode, port type, getopt, VLAN mode, tc_set_policy and others.
> IMHO, it's not bad to print what exactly the issue is, but I will
> not insist.
>
> One thing is that the colon makes me feel like you wanted to print
> the exact string, but missed it. I think you should either replace
> the colon with a period, or add a real value.
>
Yeah, I can see how the colon might mislead. I got rid of the colon and
mentioned pmd-rxq-assign, so the user knows which setting it is (similar
to dpdk-init).

Kevin.

>>>> +        pmd_rxq_assign_cyc = true;
>>>> +        pmd_rxq_assign = "cycles";
>>>> +    }
>>>> +    if (dp->pmd_rxq_assign_cyc != pmd_rxq_assign_cyc) {
>>>> +        dp->pmd_rxq_assign_cyc = pmd_rxq_assign_cyc;
>>>> +        VLOG_INFO("Rxq to PMD assignment mode changed to: \'%s\'.",
>>>> +                  pmd_rxq_assign);
>>>> +        dp_netdev_request_reconfigure(dp);
>>>> +    }
>>>>      return 0;
>>>>  }
>>>> @@ -4249,8 +4268,16 @@ rr_numa_list_populate(struct dp_netdev *dp, struct rr_numa_list *rr)
>>>>  }
>>>>  
>>>> -/* Returns the next pmd from the numa node in
>>>> - * incrementing or decrementing order. */
>>>> +/*
>>>> + * Returns the next pmd from the numa node.
>>>> + *
>>>> + * If 'updown' is 'true' it will alternate between selecting the next pmd in
>>>> + * either an up or down walk, switching between up/down when the first or last
>>>> + * core is reached. e.g. 1,2,3,3,2,1,1,2...
>>>> + *
>>>> + * If 'updown' is 'false' it will select the next pmd wrapping around when last
>>>> + * core reached. e.g. 1,2,3,1,2,3,1,2...
>>>> + */
>>>>  static struct dp_netdev_pmd_thread *
>>>> -rr_numa_get_pmd(struct rr_numa *numa)
>>>> +rr_numa_get_pmd(struct rr_numa *numa, bool updown)
>>>>  {
>>>>      int numa_idx = numa->cur_index;
>>>> @@ -4260,5 +4287,9 @@ rr_numa_get_pmd(struct rr_numa *numa)
>>>>          if (numa->cur_index == numa->n_pmds-1) {
>>>>              /* Reached the last pmd. */
>>>> -            numa->idx_inc = false;
>>>> +            if (updown) {
>>>> +                numa->idx_inc = false;
>>>> +            } else {
>>>> +                numa->cur_index = 0;
>>>> +            }
>>>>          } else {
>>>>              numa->cur_index++;
>>>> @@ -4323,7 +4354,4 @@ compare_rxq_cycles(const void *a, const void *b)
>>>>   * pmds to unpinned queues.
>>>>   *
>>>> - * If 'pinned' is false queues will be sorted by processing cycles they are
>>>> - * consuming and then assigned to pmds in round robin order.
>>>> - *
>>>>   * The function doesn't touch the pmd threads, it just stores the assignment
>>>>   * in the 'pmd' member of each rxq. */
>>>> @@ -4338,4 +4366,5 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>>>>      struct rr_numa *numa = NULL;
>>>>      int numa_id;
>>>> +    bool assign_cyc = dp->pmd_rxq_assign_cyc;
>>>>  
>>>>      HMAP_FOR_EACH (port, node, &dp->ports) {
>>>> @@ -4368,10 +4397,13 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>>>>                  rxqs = xrealloc(rxqs, sizeof *rxqs * (n_rxqs + 1));
>>>>              }
>>>> -            /* Sum the queue intervals and store the cycle history. */
>>>> -            for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
>>>> -                cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i);
>>>> -            }
>>>> -            dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST, cycle_hist);
>>>>  
>>>> +            if (assign_cyc) {
>>>> +                /* Sum the queue intervals and store the cycle history. */
>>>> +                for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
>>>> +                    cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i);
>>>> +                }
>>>> +                dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST,
>>>> +                                         cycle_hist);
>>>> +            }
>>>>              /* Store the queue. */
>>>>              rxqs[n_rxqs++] = q;
>>>> @@ -4380,5 +4412,5 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>>>>      }
>>>>  
>>>> -    if (n_rxqs > 1) {
>>>> +    if (n_rxqs > 1 && assign_cyc) {
>>>>          /* Sort the queues in order of the processing cycles
>>>>           * they consumed during their last pmd interval. */
>>>> @@ -4404,5 +4436,5 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>>>>              continue;
>>>>          }
>>>> -        rxqs[i]->pmd = rr_numa_get_pmd(non_local_numa);
>>>> +        rxqs[i]->pmd = rr_numa_get_pmd(non_local_numa, assign_cyc);
>>>>          VLOG_WARN("There's no available (non-isolated) pmd thread "
>>>>                    "on numa node %d. Queue %d on port \'%s\' will "
>>>> @@ -4413,11 +4445,20 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>>>>                    rxqs[i]->pmd->core_id, rxqs[i]->pmd->numa_id);
>>>>      } else {
>>>> -        rxqs[i]->pmd = rr_numa_get_pmd(numa);
>>>> -        VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
>>>> -                  "rx queue %d (measured processing cycles %"PRIu64").",
>>>> -                  rxqs[i]->pmd->core_id, numa_id,
>>>> -                  netdev_rxq_get_name(rxqs[i]->rx),
>>>> -                  netdev_rxq_get_queue_id(rxqs[i]->rx),
>>>> -                  dp_netdev_rxq_get_cycles(rxqs[i], RXQ_CYCLES_PROC_HIST));
>>>> +        rxqs[i]->pmd = rr_numa_get_pmd(numa, assign_cyc);
>>>> +        if (assign_cyc) {
>>>> +            VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
>>>> +                      "rx queue %d "
>>>> +                      "(measured processing cycles %"PRIu64").",
>>>> +                      rxqs[i]->pmd->core_id, numa_id,
>>>> +                      netdev_rxq_get_name(rxqs[i]->rx),
>>>> +                      netdev_rxq_get_queue_id(rxqs[i]->rx),
>>>> +                      dp_netdev_rxq_get_cycles(rxqs[i],
>>>> +                                               RXQ_CYCLES_PROC_HIST));
>>>> +        } else {
>>>> +            VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
>>>> +                      "rx queue %d.", rxqs[i]->pmd->core_id, numa_id,
>>>> +                      netdev_rxq_get_name(rxqs[i]->rx),
>>>> +                      netdev_rxq_get_queue_id(rxqs[i]->rx));
>>>> +        }
>>>>      }
>>>>  }
>>>> diff --git a/tests/pmd.at b/tests/pmd.at
>>>> index 4cae6c8..1f952f3 100644
>>>> --- a/tests/pmd.at
>>>> +++ b/tests/pmd.at
>>>> @@ -62,5 +62,6 @@ m4_define([CHECK_PMD_THREADS_CREATED], [
>>>>  
>>>>  m4_define([SED_NUMA_CORE_PATTERN], ["s/\(numa_id \)[[0-9]]*\( core_id \)[[0-9]]*:/\1<cleared>\2<cleared>:/"])
>>>> -m4_define([SED_NUMA_CORE_QUEUE_PATTERN], ["s/1 2 5 6/<group>/;s/0 3 4 7/<group>/"])
>>>> +m4_define([SED_NUMA_CORE_QUEUE_CYC_PATTERN], ["s/1 2 5 6/<group>/;s/0 3 4 7/<group>/"])
>>>> +m4_define([SED_NUMA_CORE_QUEUE_PQ_PATTERN], ["s/1 3 5 7/<group>/;s/0 2 4 6/<group>/"])
>>>>  m4_define([DUMMY_NUMA], [--dummy-numa="0,0,0,0"])
>>>>  
>>>> @@ -146,9 +147,16 @@ pmd thread numa_id <cleared> core_id <cleared>:
>>>>  ])
>>>>  
>>>> +AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=cycles])
>>>>  TMP=$(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])
>>>>  AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x3])
>>>>  CHECK_PMD_THREADS_CREATED([2], [], [+$TMP])
>>>>  
>>>> -AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed ':a;/AVAIL$/{N;s/\n//;ba;}' | parse_pmd_rxq_show_group | sed SED_NUMA_CORE_QUEUE_PATTERN], [0], [dnl
>>>> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed ':a;/AVAIL$/{N;s/\n//;ba;}' | parse_pmd_rxq_show_group | sed SED_NUMA_CORE_QUEUE_CYC_PATTERN], [0], [dnl
>>>> +port: p0 queue-id: <group>
>>>> +port: p0 queue-id: <group>
>>>> +])
>>>> +
>>>> +AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=roundrobin])
>>>> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed ':a;/AVAIL$/{N;s/\n//;ba;}' | parse_pmd_rxq_show_group | sed SED_NUMA_CORE_QUEUE_PQ_PATTERN], [0], [dnl
>>>>  port: p0 queue-id: <group>
>>>>  port: p0 queue-id: <group>
>>>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>>>> index 0cd8520..2439b24 100644
>>>> --- a/vswitchd/vswitch.xml
>>>> +++ b/vswitchd/vswitch.xml
>>>> @@ -425,4 +425,28 @@
>>>>        </column>
>>>>  
>>>> +      <column name="other_config" key="pmd-rxq-assign"
>>>> +              type='{"type": "string",
>>>> +                     "enum": ["set", ["cycles", "roundrobin"]]}'>
>>>> +        <p>
>>>> +          Specifies how RX queues will be automatically assigned to CPU cores.
>>>> +          Options:
>>>> +          <dl>
>>>> +            <dt><code>cycles</code></dt>
>>>> +            <dd>Rxqs will be sorted by order of measured processing cycles
>>>> +                before being assigned to CPU cores.</dd>
>>>> +            <dt><code>roundrobin</code></dt>
>>>> +            <dd>Rxqs will be round-robined across CPU cores.</dd>
>>>> +          </dl>
>>>> +        </p>
>>>> +        <p>
>>>> +          The default value is <code>cycles</code>.
>>>> +        </p>
>>>> +        <p>
>>>> +          Changing this value will affect an automatic re-assignment of Rxqs to
>>>> +          CPUs. Note: Rxqs mapped to CPU cores with
>>>> +          <code>pmd-rxq-affinity</code> are unaffected.
>>>> +        </p>
>>>> +      </column>
>>>> +
>>>>        <column name="other_config" key="n-handler-threads"
>>>>                type='{"type": "integer", "minInteger": 1}'>
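For reference, the difference between the two walk orders that the patch's
rr_numa_get_pmd() implements can be seen in a minimal, self-contained C
sketch. The names 'struct walk' and 'next_pmd()' below are illustrative
only, not OVS code; the turn-around/wrap-around logic mirrors the hunk
quoted above.

    #include <stdbool.h>
    #include <stdio.h>

    struct walk {
        int n_pmds;     /* Number of non-isolated PMDs on the NUMA node. */
        int cur_index;  /* PMD index that the next call will return. */
        bool idx_inc;   /* True while walking up the PMD list. */
    };

    /* Returns the index of the next PMD to assign an Rxq to.  With
     * 'updown' true the walk bounces between the first and last PMD
     * (0,1,2,2,1,0,0,1,...); with 'updown' false it wraps around
     * (0,1,2,0,1,2,...). */
    static int
    next_pmd(struct walk *w, bool updown)
    {
        int idx = w->cur_index;

        if (w->idx_inc) {
            if (w->cur_index == w->n_pmds - 1) {
                /* Reached the last PMD: turn around or wrap to the start. */
                if (updown) {
                    w->idx_inc = false;
                } else {
                    w->cur_index = 0;
                }
            } else {
                w->cur_index++;
            }
        } else {
            if (w->cur_index == 0) {
                /* Reached the first PMD: start walking up again. */
                w->idx_inc = true;
            } else {
                w->cur_index--;
            }
        }
        return idx;
    }

    int
    main(void)
    {
        struct walk cyc = { .n_pmds = 3, .cur_index = 0, .idx_inc = true };
        struct walk rr  = { .n_pmds = 3, .cur_index = 0, .idx_inc = true };

        for (int i = 0; i < 8; i++) {
            printf("rxq %d -> 'cycles' pmd %d, 'roundrobin' pmd %d\n",
                   i, next_pmd(&cyc, true), next_pmd(&rr, false));
        }
        return 0;
    }

With three PMDs this prints the up/down order 0,1,2,2,1,0,... used by
'cycles' mode, so the busiest Rxqs, assigned first, land on distinct
cores, alongside the wrap-around order 0,1,2,0,1,2,... used by
'roundrobin' mode.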