On Wed, Jun 08, 2022 at 03:58:31PM +0530, Anurag Agarwal wrote:
> From: Jan Scheurich <jan.scheur...@ericsson.com>
>
> Today dpif-netdev considers PMD threads on a non-local NUMA node for
> automatic assignment of the rxqs of a port only if there are no local,
> non-isolated PMDs.
>
> On typical servers with both physical ports on one NUMA node, this often
> leaves the PMDs on the other NUMA node under-utilized, wasting CPU
> resources. The alternative, to manually pin the rxqs to PMDs on remote
> NUMA nodes, also has drawbacks as it limits OVS' ability to auto
> load-balance the rxqs.
>
> This patch introduces a new interface configuration option to allow ports
> to be automatically polled by PMDs on any NUMA node:
>
> ovs-vsctl set interface <Name> other_config:cross-numa-polling=true
>
> The group assignment algorithm now has the ability to select the lowest
> loaded PMD on any NUMA node, not just the local NUMA node on which the
> rxq of the port resides.
>
> If this option is not present or set to false, the legacy behaviour
> applies.
>
> Co-authored-by: Anurag Agarwal <anurag.agar...@ericsson.com>
> Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
> Signed-off-by: Anurag Agarwal <anurag.agar...@ericsson.com>
> ---
>
> Changes in this patch:
> - Addressed comments from Kevin Traynor
>
> Please refer to this thread for an earlier discussion on this topic:
> https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392310.html
> ---
>  Documentation/topics/dpdk/pmd.rst |  23 ++++++
>  lib/dpif-netdev.c                 | 130 ++++++++++++++++++++++--------
>  tests/pmd.at                      |  38 +++++++++
>  vswitchd/vswitch.xml              |  20 +++++
>  4 files changed, 177 insertions(+), 34 deletions(-)
>
> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
> index b259cc8b3..387f962d1 100644
> --- a/Documentation/topics/dpdk/pmd.rst
> +++ b/Documentation/topics/dpdk/pmd.rst
> @@ -99,6 +99,25 @@ core cycles for each Rx queue::
>
>      $ ovs-appctl dpif-netdev/pmd-rxq-show
>
> +Normally, Rx queues are assigned to PMD threads automatically. By default
> +OVS only assigns Rx queues to PMD threads executing on the same NUMA
> +node in order to avoid unnecessary latency for accessing packet buffers
> +across the NUMA boundary. Typically this overhead is higher for vhostuser
> +ports than for physical ports due to the packet copy that is done for all
> +rx packets.
> +
> +On NUMA servers with physical ports only on one NUMA node, the NUMA-local
> +polling policy can lead to an under-utilization of the PMD threads on the
> +remote NUMA node. For the overall OVS performance it may in such cases be
> +beneficial to utilize the spare capacity and allow polling of a physical
> +port's rxqs across NUMA nodes despite the overhead involved.
> +The policy can be set per port with the following configuration option::
> +
> +    $ ovs-vsctl set Interface <iface> \
> +        other_config:cross-numa-polling=true|false
> +
> +The default value is false.
> +
>  .. note::
>
>      A history of one minute is recorded and shown for each Rx queue to allow
>      for
> @@ -115,6 +134,10 @@ core cycles for each Rx queue::
>
>      A ``overhead`` statistics is shown per PMD: it represents the number of
>      cycles inherently consumed by the OVS PMD processing loop.
>
> +.. versionchanged:: 2.18.0
> +
> +    Added the interface parameter ``other_config:cross-numa-polling``
> +
>  Rx queue to PMD assignment takes place whenever there are configuration changes
>  or can be triggered by using::
>
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index ff57b3961..86f88964b 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -467,6 +467,7 @@ struct dp_netdev_port {
>      char *type;                 /* Port type as requested by user. */
>      char *rxq_affinity_list;    /* Requested affinity of rx queues. */
>      enum txq_req_mode txq_requested_mode;
> +    bool cross_numa_polling;    /* If true cross polling will be enabled. */
>  };
>
>  static bool dp_netdev_flow_ref(struct dp_netdev_flow *);
> @@ -2101,6 +2102,7 @@ port_create(const char *devname, const char *type,
>      port->sf = NULL;
>      port->emc_enabled = true;
>      port->need_reconfigure = true;
> +    port->cross_numa_polling = false;
>      ovs_mutex_init(&port->txq_used_mutex);
>
>      *portp = port;
> @@ -5013,6 +5015,7 @@ dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t port_no,
>      bool emc_enabled = smap_get_bool(cfg, "emc-enable", true);
>      const char *tx_steering_mode = smap_get(cfg, "tx-steering");
>      enum txq_req_mode txq_mode;
> +    bool cross_numa_polling = smap_get_bool(cfg, "cross-numa-polling", false);
>
>      ovs_rwlock_wrlock(&dp->port_rwlock);
>      error = get_port_by_number(dp, port_no, &port);
> @@ -5020,6 +5023,14 @@ dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t port_no,
>          goto unlock;
>      }
>
> +    if (cross_numa_polling != port->cross_numa_polling) {
> +        port->cross_numa_polling = cross_numa_polling;
> +        VLOG_INFO("%s:cross-numa-polling has been %s.",
> +                  netdev_get_name(port->netdev),
> +                  cross_numa_polling ? "enabled" : "disabled");
> +        dp_netdev_request_reconfigure(dp);
> +    }
> +
>      if (emc_enabled != port->emc_enabled) {
>          struct dp_netdev_pmd_thread *pmd;
>          struct ds ds = DS_EMPTY_INITIALIZER;
> @@ -5751,7 +5762,7 @@ compare_rxq_cycles(const void *a, const void *b)
>
>  static bool
>  sched_pmd_new_lowest(struct sched_pmd *current_lowest, struct sched_pmd *pmd,
> -                     bool has_proc)
> +                     int port_numa_id, bool has_proc)
>  {
>      uint64_t current_num, pmd_num;
>
> @@ -5767,16 +5778,22 @@ sched_pmd_new_lowest(struct sched_pmd *current_lowest, struct sched_pmd *pmd,
>          pmd_num = pmd->n_rxq;
>      }
>
> -    if (pmd_num < current_num) {
> -        return true;
> +    if (pmd_num != current_num) {
> +        return (pmd_num < current_num) ? true : false;
>      }
> -    return false;
> +
> +    /* In case of a tie, select the new PMD only if the existing PMD
> +     * assigned is from non-local NUMA.
> +     */
> +    return (current_lowest->numa->numa_id != port_numa_id &&
> +            pmd->numa->numa_id == port_numa_id) ? true : false;
> +
>  }
>
>  static struct sched_pmd *
>  sched_pmd_get_lowest(struct sched_numa *numa, bool has_cyc)
>  {
> -    struct sched_pmd *lowest_sched_pmd = NULL;
> +    struct sched_pmd *lowest_pmd = NULL;
>
>      for (unsigned i = 0; i < numa->n_pmds; i++) {
>          struct sched_pmd *sched_pmd;
> @@ -5785,11 +5802,11 @@ sched_pmd_get_lowest(struct sched_numa *numa, bool has_cyc)
>          if (sched_pmd->isolated) {
>              continue;
>          }
> -        if (sched_pmd_new_lowest(lowest_sched_pmd, sched_pmd, has_cyc)) {
> -            lowest_sched_pmd = sched_pmd;
> +        if (sched_pmd_new_lowest(lowest_pmd, sched_pmd, INT_MAX, has_cyc)) {
> +            lowest_pmd = sched_pmd;
>          }
>      }
> -    return lowest_sched_pmd;
> +    return lowest_pmd;
>  }
>
>  /*
> @@ -5891,6 +5908,32 @@ get_rxq_cyc_log(char *a, enum sched_assignment_type algo, uint64_t cycles)
>      return a;
>  }
>
> +static struct sched_pmd *
> +sched_pmd_all_numa_get_lowest(struct sched_numa_list *numa_list,
> +                              int port_numa_id, bool has_proc) {
> +    int n_numa;
> +    struct sched_numa *numa = NULL;
> +    struct sched_numa *last_numa = NULL;
> +    struct sched_pmd *lowest_pmd = NULL;
> +    struct sched_pmd *pmd;
> +
> +    n_numa = sched_numa_list_count(numa_list);
> +    /* For all numas. */
> +    for (int i = 0; i < n_numa; i++) {
> +        last_numa = numa;
> +        numa = sched_numa_list_next(numa_list, last_numa);
> +
> +        /* Get the lowest pmd per numa. */
> +        pmd = sched_pmd_get_lowest(numa, has_proc);
> +
> +        /* Check if it's the lowest pmd for all numas. */
> +        if (sched_pmd_new_lowest(lowest_pmd, pmd, port_numa_id, has_proc)) {
> +            lowest_pmd = pmd;
> +        }
> +    }
> +    return lowest_pmd;
> +}
> +
>  static void
>  sched_numa_list_schedule(struct sched_numa_list *numa_list,
>                           struct dp_netdev *dp,
> @@ -5991,6 +6034,7 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
>          struct sched_pmd *sched_pmd = NULL;
>          struct sched_numa *numa;
>          int port_numa_id;
> +        bool cross_numa = rxq->port->cross_numa_polling;
It may be better to schedule the non-cross-numa rxqs first and the
cross-numa rxqs afterwards. Otherwise, one numa may end up holding much
more load, because the non-cross-numa rxqs that are scheduled later are
pinned to their local numa and cannot be balanced away from it.

>          uint64_t proc_cycles;
>          char rxq_cyc_log[MAX_RXQ_CYC_STRLEN];
>
> @@ -6000,33 +6044,40 @@
>              start_logged = true;
>          }
>
> -        /* Store the cycles for this rxq as we will log these later. */
> +        /* Store the cycles for this rxq as we will need these later. */
>          proc_cycles = dp_netdev_rxq_get_cycles(rxq, RXQ_CYCLES_PROC_HIST);
>
>          port_numa_id = netdev_get_numa_id(rxq->port->netdev);
>
> -        /* Select numa. */
> -        numa = sched_numa_list_lookup(numa_list, port_numa_id);
> -
> -        /* Check if numa has no PMDs or no non-isolated PMDs. */
> -        if (!numa || !sched_numa_noniso_pmd_count(numa)) {
> -            /* Unable to use this numa to find a PMD. */
> -            numa = NULL;
> -            /* Find any numa with available PMDs. */
> -            for (int j = 0; j < n_numa; j++) {
> -                numa = sched_numa_list_next(numa_list, last_cross_numa);
> -                last_cross_numa = numa;
> -                if (sched_numa_noniso_pmd_count(numa)) {
> -                    break;
> -                }
> +        if (cross_numa && algo == SCHED_GROUP) {
> +            /* cross_numa polling enabled so find lowest loaded pmd across
> +             * all numas. */
> +            sched_pmd = sched_pmd_all_numa_get_lowest(numa_list, port_numa_id,
> +                                                      proc_cycles);
> +        } else {
> +            /* Select numa. */
> +            numa = sched_numa_list_lookup(numa_list, port_numa_id);
> +
> +            /* Check if numa has no PMDs or no non-isolated PMDs. */
> +            if (!numa || !sched_numa_noniso_pmd_count(numa)) {
> +                /* Unable to use this numa to find a PMD. */
>                  numa = NULL;
> +                /* Find any numa with available PMDs. */
> +                for (int j = 0; j < n_numa; j++) {
> +                    numa = sched_numa_list_next(numa_list, last_cross_numa);
> +                    last_cross_numa = numa;
> +                    if (sched_numa_noniso_pmd_count(numa)) {
> +                        break;
> +                    }
> +                    numa = NULL;
> +                }
>              }
> -        }
>
> -        if (numa) {
> -            /* Select the PMD that should be used for this rxq. */
> -            sched_pmd = sched_pmd_next(numa, algo,
> -                                       proc_cycles ? true : false);
> +            if (numa) {
> +                /* Select the PMD that should be used for this rxq. */
> +                sched_pmd = sched_pmd_next(numa, algo,
> +                                           proc_cycles ? true : false);
> +            }
>          }
>
>          /* Check that a pmd has been selected. */
> @@ -6036,12 +6087,23 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
>              pmd_numa_id = sched_pmd->numa->numa_id;
>              /* Check if selected pmd numa matches port numa. */
>              if (pmd_numa_id != port_numa_id) {
> -                VLOG(level, "There's no available (non-isolated) pmd thread "
> -                            "on numa node %d. Port \'%s\' rx queue %d will "
> -                            "be assigned to a pmd on numa node %d. "
> -                            "This may lead to reduced performance.",
> -                     port_numa_id, netdev_rxq_get_name(rxq->rx),
> -                     netdev_rxq_get_queue_id(rxq->rx), pmd_numa_id);
> +                if (cross_numa) {
> +                    VLOG(level, "Cross-numa polling has been selected for "
> +                                "Port \'%s\' rx queue %d on numa node %d. "
> +                                "It will be assigned to a pmd on numa node "
> +                                "%d. This may lead to reduced performance.",
> +                         netdev_rxq_get_name(rxq->rx),
> +                         netdev_rxq_get_queue_id(rxq->rx), port_numa_id,
> +                         pmd_numa_id);
> +
> +                } else {
> +                    VLOG(level, "There's no available (non-isolated) pmd "
> +                                "thread on numa node %d. Port \'%s\' rx queue "
> +                                "%d will be assigned to a pmd on numa node "
> +                                "%d. This may lead to reduced performance.",
> +                         port_numa_id, netdev_rxq_get_name(rxq->rx),
> +                         netdev_rxq_get_queue_id(rxq->rx), pmd_numa_id);
> +                }
>              }
>              VLOG(level, "Core %2u on numa node %d assigned port \'%s\' "
>                          "rx queue %d%s.",
> diff --git a/tests/pmd.at b/tests/pmd.at
> index e6b173dab..5d2ec1382 100644
> --- a/tests/pmd.at
> +++ b/tests/pmd.at
> @@ -579,6 +579,44 @@ icmp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10
>  OVS_VSWITCHD_STOP
>  AT_CLEANUP
>
> +AT_SETUP([PMD - enable cross numa polling])
> +OVS_VSWITCHD_START([add-port br0 p0 \
> +                    -- set Interface p0 type=dummy-pmd options:n_rxq=4 \
> +                    -- set Interface p0 options:numa_id=0 \
> +                    -- set Open_vSwitch . other_config:pmd-cpu-mask=0x3 \
> +                    -- set open_vswitch . other_config:pmd-rxq-assign=group],
> +                   [], [], [--dummy-numa 0,1])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 action=controller])
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
> +0
> +])
> +
> +dnl Enable cross numa polling and check numa ids
> +TMP=$(($(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
> +PATTERN="cross-numa-polling has been enabled"
> +AT_CHECK([ovs-vsctl set Interface p0 other_config:cross-numa-polling=true])
> +OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "$PATTERN"])
> +
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
> +0
> +1
> +])
> +
> +dnl Disable cross numa polling and check numa ids
> +TMP=$(($(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
> +PATTERN="cross-numa-polling has been disabled"
> +AT_CHECK([ovs-vsctl set Interface p0 other_config:cross-numa-polling=false])
> +OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "$PATTERN"])
> +
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
> +0
> +])
> +
> +OVS_VSWITCHD_STOP(["/|WARN|/d"])
> +AT_CLEANUP
> +
> +
>  AT_SETUP([PMD - change numa node])
>  OVS_VSWITCHD_START(
>    [add-port br0 p1 -- set Interface p1 type=dummy-pmd ofport_request=1 options:n_rxq=2 -- \
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index cc1dd77ec..c2e948bd9 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -3286,6 +3286,26 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
>          </p>
>        </column>
>
> +      <column name="other_config" key="cross-numa-polling"
> +              type='{"type": "boolean"}'>
> +        <p>
> +          Specifies if the RX queues of the port can be automatically assigned
> +          to PMD threads on any NUMA node or only on the local NUMA node of
> +          the port.
> +        </p>
> +        <p>
> +          Polling of physical ports from a non-local PMD thread incurs some
> +          performance penalty due to the access to packet data across the
> +          NUMA barrier. This option can still increase the overall performance
> +          if it allows better utilization of those non-local PMD threads.
> +          It is most useful together with the auto load-balancing of RX queues
> +          <ref table="Open_vSwitch" column="other_config" key="pmd-auto-lb"/>.
> +        </p>
> +        <p>
> +          Defaults to false.
> +        </p>
> +      </column>
> +
>        <column name="options" key="xdp-mode"
>                type='{"type": "string",
>                       "enum": ["set", ["best-effort", "native-with-zerocopy",
> --
> 2.25.1
>
> _______________________________________________
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev