On Wed, Jun 08, 2022 at 03:58:31PM +0530, Anurag Agarwal wrote:
> From: Jan Scheurich <jan.scheur...@ericsson.com>
> 
> Today dpif-netdev considers PMD threads on a non-local NUMA node for automatic
> assignment of the rxqs of a port only if there are no local, non-isolated PMDs.
> 
> On typical servers with both physical ports on one NUMA node, this often
> leaves the PMDs on the other NUMA node under-utilized, wasting CPU resources.
> The alternative, to manually pin the rxqs to PMDs on remote NUMA nodes, also
> has drawbacks as it limits OVS' ability to auto load-balance the rxqs.
> 
> This patch introduces a new interface configuration option to allow ports to
> be automatically polled by PMDs on any NUMA node:
> 
> ovs-vsctl set interface <Name> other_config:cross-numa-polling=true
> 
> The group assignment algorithm can now select the lowest-loaded PMD on any
> NUMA node, not just the local NUMA node of the port's rxq.
> 
> If this option is not present or set to false, legacy behaviour applies.
> 
> Co-authored-by: Anurag Agarwal <anurag.agar...@ericsson.com>
> Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
> Signed-off-by: Anurag Agarwal <anurag.agar...@ericsson.com>
> ---
> 
> Changes in this patch:
> - Addressed comments from Kevin Traynor
> 
> Please refer to this thread for an earlier discussion on this topic:
> https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392310.html
> ---
>  Documentation/topics/dpdk/pmd.rst |  23 ++++++
>  lib/dpif-netdev.c                 | 130 ++++++++++++++++++++++--------
>  tests/pmd.at                      |  38 +++++++++
>  vswitchd/vswitch.xml              |  20 +++++
>  4 files changed, 177 insertions(+), 34 deletions(-)
> 
> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
> index b259cc8b3..387f962d1 100644
> --- a/Documentation/topics/dpdk/pmd.rst
> +++ b/Documentation/topics/dpdk/pmd.rst
> @@ -99,6 +99,25 @@ core cycles for each Rx queue::
>  
>      $ ovs-appctl dpif-netdev/pmd-rxq-show
>  
> +Normally, Rx queues are assigned to PMD threads automatically.  By default,
> +OVS only assigns Rx queues to PMD threads executing on the same NUMA
> +node in order to avoid unnecessary latency for accessing packet buffers
> +across the NUMA boundary.  Typically this overhead is higher for vhostuser
> +ports than for physical ports due to the packet copy that is done for all
> +Rx packets.
> +
> +On NUMA servers with physical ports only on one NUMA node, the NUMA-local
> +polling policy can lead to an under-utilization of the PMD threads on the
> +remote NUMA node.  For the overall OVS performance it may in such cases be
> +beneficial to utilize the spare capacity and allow polling of a physical
> +port's rxqs across NUMA nodes despite the overhead involved.
> +The policy can be set per port with the following configuration option::
> +
> +    $ ovs-vsctl set Interface <iface> \
> +        other_config:cross-numa-polling=true|false
> +
> +The default value is false.
> +
>  .. note::
>  
>     A history of one minute is recorded and shown for each Rx queue to allow for
> @@ -115,6 +134,10 @@ core cycles for each Rx queue::
>     A ``overhead`` statistics is shown per PMD: it represents the number of
>     cycles inherently consumed by the OVS PMD processing loop.
>  
> +.. versionchanged:: 2.18.0
> +
> +   Added the interface parameter ``other_config:cross-numa-polling``
> +
>  Rx queue to PMD assignment takes place whenever there are configuration changes
>  or can be triggered by using::
>  
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index ff57b3961..86f88964b 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -467,6 +467,7 @@ struct dp_netdev_port {
>      char *type;                 /* Port type as requested by user. */
>      char *rxq_affinity_list;    /* Requested affinity of rx queues. */
>      enum txq_req_mode txq_requested_mode;
> +    bool cross_numa_polling;    /* If true, cross-NUMA polling is enabled. */
>  };
>  
>  static bool dp_netdev_flow_ref(struct dp_netdev_flow *);
> @@ -2101,6 +2102,7 @@ port_create(const char *devname, const char *type,
>      port->sf = NULL;
>      port->emc_enabled = true;
>      port->need_reconfigure = true;
> +    port->cross_numa_polling = false;
>      ovs_mutex_init(&port->txq_used_mutex);
>  
>      *portp = port;
> @@ -5013,6 +5015,7 @@ dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t port_no,
>      bool emc_enabled = smap_get_bool(cfg, "emc-enable", true);
>      const char *tx_steering_mode = smap_get(cfg, "tx-steering");
>      enum txq_req_mode txq_mode;
> +    bool cross_numa_polling = smap_get_bool(cfg, "cross-numa-polling",
> +                                            false);
>  
>      ovs_rwlock_wrlock(&dp->port_rwlock);
>      error = get_port_by_number(dp, port_no, &port);
> @@ -5020,6 +5023,14 @@ dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t port_no,
>          goto unlock;
>      }
>  
> +    if (cross_numa_polling != port->cross_numa_polling) {
> +        port->cross_numa_polling = cross_numa_polling;
> +        VLOG_INFO("%s: cross-numa-polling has been %s.",
> +                  netdev_get_name(port->netdev),
> +                  cross_numa_polling ? "enabled" : "disabled");
> +        dp_netdev_request_reconfigure(dp);
> +    }
> +
>      if (emc_enabled != port->emc_enabled) {
>          struct dp_netdev_pmd_thread *pmd;
>          struct ds ds = DS_EMPTY_INITIALIZER;
> @@ -5751,7 +5762,7 @@ compare_rxq_cycles(const void *a, const void *b)
>  
>  static bool
>  sched_pmd_new_lowest(struct sched_pmd *current_lowest, struct sched_pmd *pmd,
> -                     bool has_proc)
> +                     int port_numa_id, bool has_proc)
>  {
>      uint64_t current_num, pmd_num;
>  
> @@ -5767,16 +5778,22 @@ sched_pmd_new_lowest(struct sched_pmd *current_lowest, struct sched_pmd *pmd,
>          pmd_num = pmd->n_rxq;
>      }
>  
> -    if (pmd_num < current_num) {
> -        return true;
> +    if (pmd_num != current_num) {
> +        return pmd_num < current_num;
>      }
> -    return false;
> +
> +    /* In case of a tie, prefer the new PMD only if it is on the
> +     * port-local NUMA and the current lowest is not. */
> +    return current_lowest->numa->numa_id != port_numa_id &&
> +           pmd->numa->numa_id == port_numa_id;
>  }
>  
>  static struct sched_pmd *
>  sched_pmd_get_lowest(struct sched_numa *numa, bool has_cyc)
>  {
> -    struct sched_pmd *lowest_sched_pmd = NULL;
> +    struct sched_pmd *lowest_pmd = NULL;
>  
>      for (unsigned i = 0; i < numa->n_pmds; i++) {
>          struct sched_pmd *sched_pmd;
> @@ -5785,11 +5802,11 @@ sched_pmd_get_lowest(struct sched_numa *numa, bool has_cyc)
>          if (sched_pmd->isolated) {
>              continue;
>          }
> -        if (sched_pmd_new_lowest(lowest_sched_pmd, sched_pmd, has_cyc)) {
> -            lowest_sched_pmd = sched_pmd;
> +        if (sched_pmd_new_lowest(lowest_pmd, sched_pmd, INT_MAX, has_cyc)) {
> +            lowest_pmd = sched_pmd;
>          }
>      }
> -    return lowest_sched_pmd;
> +    return lowest_pmd;
>  }
>  
>  /*
> @@ -5891,6 +5908,32 @@ get_rxq_cyc_log(char *a, enum sched_assignment_type algo, uint64_t cycles)
>      return a;
>  }
>  
> +static struct sched_pmd *
> +sched_pmd_all_numa_get_lowest(struct sched_numa_list *numa_list,
> +                              int port_numa_id, bool has_proc)
> +{
> +    int n_numa;
> +    struct sched_numa *numa = NULL;
> +    struct sched_numa *last_numa = NULL;
> +    struct sched_pmd *lowest_pmd = NULL;
> +    struct sched_pmd *pmd;
> +
> +    n_numa = sched_numa_list_count(numa_list);
> +    /* For all numas. */
> +    for (int i = 0; i < n_numa; i++) {
> +        last_numa = numa;
> +        numa = sched_numa_list_next(numa_list, last_numa);
> +
> +        /* Get the lowest pmd per numa. */
> +        pmd = sched_pmd_get_lowest(numa, has_proc);
> +
> +        /* Check if it's the lowest pmd for all numas. */
> +        if (sched_pmd_new_lowest(lowest_pmd, pmd, port_numa_id, has_proc)) {
> +            lowest_pmd = pmd;
> +        }
> +    }
> +    return lowest_pmd;
> +}
> +
>  static void
>  sched_numa_list_schedule(struct sched_numa_list *numa_list,
>                           struct dp_netdev *dp,
> @@ -5991,6 +6034,7 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
>          struct sched_pmd *sched_pmd = NULL;
>          struct sched_numa *numa;
>          int port_numa_id;
> +        bool cross_numa = rxq->port->cross_numa_polling;

It may be better to schedule the non-cross-numa rxqs first and the cross-numa
rxqs afterwards.  Otherwise one NUMA node may end up carrying much more load,
because the non-cross-numa rxqs scheduled later have no choice of node.

>          uint64_t proc_cycles;
>          char rxq_cyc_log[MAX_RXQ_CYC_STRLEN];
>  
> @@ -6000,33 +6044,40 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
>              start_logged = true;
>          }
>  
> -        /* Store the cycles for this rxq as we will log these later. */
> +        /* Store the cycles for this rxq as we will need these later. */
>          proc_cycles = dp_netdev_rxq_get_cycles(rxq, RXQ_CYCLES_PROC_HIST);
>  
>          port_numa_id = netdev_get_numa_id(rxq->port->netdev);
>  
> -        /* Select numa. */
> -        numa = sched_numa_list_lookup(numa_list, port_numa_id);
> -
> -        /* Check if numa has no PMDs or no non-isolated PMDs. */
> -        if (!numa || !sched_numa_noniso_pmd_count(numa)) {
> -            /* Unable to use this numa to find a PMD. */
> -            numa = NULL;
> -            /* Find any numa with available PMDs. */
> -            for (int j = 0; j < n_numa; j++) {
> -                numa = sched_numa_list_next(numa_list, last_cross_numa);
> -                last_cross_numa = numa;
> -                if (sched_numa_noniso_pmd_count(numa)) {
> -                    break;
> -                }
> +        if (cross_numa && algo == SCHED_GROUP) {
> +            /* Cross-numa polling enabled, so find the lowest loaded pmd
> +             * across all numas. */
> +            sched_pmd = sched_pmd_all_numa_get_lowest(numa_list, port_numa_id,
> +                                                      proc_cycles ? true : false);
> +        } else {
> +            /* Select numa. */
> +            numa = sched_numa_list_lookup(numa_list, port_numa_id);
> +
> +            /* Check if numa has no PMDs or no non-isolated PMDs. */
> +            if (!numa || !sched_numa_noniso_pmd_count(numa)) {
> +                /* Unable to use this numa to find a PMD. */
>                  numa = NULL;
> +                /* Find any numa with available PMDs. */
> +                for (int j = 0; j < n_numa; j++) {
> +                    numa = sched_numa_list_next(numa_list, last_cross_numa);
> +                    last_cross_numa = numa;
> +                    if (sched_numa_noniso_pmd_count(numa)) {
> +                        break;
> +                    }
> +                    numa = NULL;
> +                }
>              }
> -        }
>  
> -        if (numa) {
> -            /* Select the PMD that should be used for this rxq. */
> -            sched_pmd = sched_pmd_next(numa, algo,
> -                                       proc_cycles ? true : false);
> +            if (numa) {
> +                /* Select the PMD that should be used for this rxq. */
> +                sched_pmd = sched_pmd_next(numa, algo,
> +                                           proc_cycles ? true : false);
> +            }
>          }
>  
>          /* Check that a pmd has been selected. */
> @@ -6036,12 +6087,23 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
>              pmd_numa_id = sched_pmd->numa->numa_id;
>              /* Check if selected pmd numa matches port numa. */
>              if (pmd_numa_id != port_numa_id) {
> -                VLOG(level, "There's no available (non-isolated) pmd thread "
> -                            "on numa node %d. Port \'%s\' rx queue %d will "
> -                            "be assigned to a pmd on numa node %d. "
> -                            "This may lead to reduced performance.",
> -                            port_numa_id, netdev_rxq_get_name(rxq->rx),
> -                            netdev_rxq_get_queue_id(rxq->rx), pmd_numa_id);
> +                if (cross_numa) {
> +                    VLOG(level, "Cross-numa polling has been selected for "
> +                                "port \'%s\' rx queue %d on numa node %d. "
> +                                "It will be assigned to a pmd on numa node "
> +                                "%d. This may lead to reduced performance.",
> +                                netdev_rxq_get_name(rxq->rx),
> +                                netdev_rxq_get_queue_id(rxq->rx),
> +                                port_numa_id, pmd_numa_id);
> +                } else {
> +                    VLOG(level, "There's no available (non-isolated) pmd "
> +                                "thread on numa node %d. Port \'%s\' rx queue "
> +                                "%d will be assigned to a pmd on numa node "
> +                                "%d. This may lead to reduced performance.",
> +                                port_numa_id, netdev_rxq_get_name(rxq->rx),
> +                                netdev_rxq_get_queue_id(rxq->rx), pmd_numa_id);
> +                }
>              }
>              VLOG(level, "Core %2u on numa node %d assigned port \'%s\' "
>                          "rx queue %d%s.",
> diff --git a/tests/pmd.at b/tests/pmd.at
> index e6b173dab..5d2ec1382 100644
> --- a/tests/pmd.at
> +++ b/tests/pmd.at
> @@ -579,6 +579,44 @@ icmp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10
>  OVS_VSWITCHD_STOP
>  AT_CLEANUP
>  
> +AT_SETUP([PMD - enable cross numa polling])
> +OVS_VSWITCHD_START([add-port br0 p0 \
> +                    -- set Interface p0 type=dummy-pmd options:n_rxq=4 \
> +                    -- set Interface p0 options:numa_id=0 \
> +                    -- set Open_vSwitch . other_config:pmd-cpu-mask=0x3 \
> +                    -- set Open_vSwitch . other_config:pmd-rxq-assign=group],
> +                   [], [], [--dummy-numa 0,1])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 action=controller])
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
> +0
> +])
> +
> +dnl Enable cross numa polling and check numa ids
> +TMP=$(($(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
> +PATTERN="cross-numa-polling has been enabled"
> +AT_CHECK([ovs-vsctl set Interface p0 other_config:cross-numa-polling=true])
> +OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "$PATTERN"])
> +
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
> +0
> +1
> +])
> +
> +dnl Disable cross numa polling and check numa ids
> +TMP=$(($(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
> +PATTERN="cross-numa-polling has been disabled"
> +AT_CHECK([ovs-vsctl set Interface p0 other_config:cross-numa-polling=false])
> +OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "$PATTERN"])
> +
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show | cut -f 3 -d ' ' | sort | uniq], [0], [dnl
> +0
> +])
> +
> +OVS_VSWITCHD_STOP(["/|WARN|/d"])
> +AT_CLEANUP
> +
> +
>  AT_SETUP([PMD - change numa node])
>  OVS_VSWITCHD_START(
>    [add-port br0 p1 -- set Interface p1 type=dummy-pmd ofport_request=1 options:n_rxq=2 -- \
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index cc1dd77ec..c2e948bd9 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -3286,6 +3286,26 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
>          </p>
>        </column>
>  
> +      <column name="other_config" key="cross-numa-polling"
> +                                  type='{"type": "boolean"}'>
> +        <p>
> +          Specifies if the RX queues of the port can be automatically
> +          assigned to PMD threads on any NUMA node or only on the local
> +          NUMA node of the port.
> +        </p>
> +        <p>
> +          Polling of physical ports from a non-local PMD thread incurs some
> +          performance penalty due to the access to packet data across the
> +          NUMA barrier. This option can still increase the overall
> +          performance if it allows better utilization of those non-local
> +          PMD threads. It is most useful together with the auto
> +          load-balancing of RX queues, see
> +          <ref table="Open_vSwitch" column="other_config" key="pmd-auto-lb"/>.
> +        </p>
> +        <p>
> +          Defaults to false.
> +        </p>
> +      </column>
> +
>        <column name="options" key="xdp-mode"
>                type='{"type": "string",
>                       "enum": ["set", ["best-effort", "native-with-zerocopy",
> -- 
> 2.25.1
> 
> _______________________________________________
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev