> On 05/18/2018 03:30 PM, Ian Stokes wrote:
> > This commit re-introduces the concept of shared mempools as the
> > default memory model for DPDK devices. Per port mempools are still
> > available but must be enabled explicitly by a user.
> >
> > OVS previously used a shared mempool model for ports with the same MTU
> > and socket configuration. This was replaced by a per port mempool
> > model to address issues flagged by users such as:
> >
> > https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html
> >
> > However the per port model potentially requires an increase in memory
> > resource requirements to support the same number of ports and
> > configuration as the shared port model.
> >
> > This is considered a blocking factor for current deployments of OVS
> > when upgrading to future OVS releases as a user may have to
> > redimension memory for the same deployment configuration. This may not
> > be possible for users.
> >
> 
> Hi Ian, thanks for addressing this. Few comments below...
> 
> > This commit resolves the issue by re-introducing shared mempools as
> > the default memory behaviour in OVS DPDK but also refactors the memory
> > configuration code to allow for per port mempools.
> 
> > This patch adds a new global config option, per-port-mp-enabled, that
> > controls the enablement of per port mempools for DPDK devices.
> >
> >     ovs-vsctl set Open_vSwitch . other_config:per-port-mp-enabled=true
> >
> 
> It doesn't need 'enabled' in the name. If you say 'per-port-mempool=true/false',
> then it's already obvious that it's enabled.
> Actually, I wonder if 'per-port-memory' would be better, as it doesn't
> require a user to know what a DPDK mempool is.
> 

Agree, will change in the v2.

> > This value defaults to false; to enable per port mempool support, this
> > field should be set to true when setting other global parameters on
> > init (such as "dpdk-socket-mem", for example). Changing the value at
> > runtime is not supported, and requires restarting the vswitch daemon.
> >
> > The mempool sweep functionality is also replaced with the sweep
> > functionality from OVS 2.9 found in commits
> >
> > c77f692 (netdev-dpdk: Free mempool only when no in-use mbufs.)
> > a7fb0a4 (netdev-dpdk: Add mempool reuse/free debug.)
> >
> > As this patch is RFC there are a number of TO-DOs including adding a
> > memory calculation section to the documentation for both models. This
> > is expected to be completed in the v1 after RFC.
> >
> > Signed-off-by: Ian Stokes <ian.sto...@intel.com>
> > ---
> >  Documentation/automake.mk            |   1 +
> >  Documentation/topics/dpdk/index.rst  |   1 +
> >  Documentation/topics/dpdk/memory.rst |  67 +++++++
> >  NEWS                                 |   1 +
> >  lib/dpdk-stub.c                      |   6 +
> >  lib/dpdk.c                           |  13 ++
> >  lib/dpdk.h                           |   1 +
> >  lib/netdev-dpdk.c                    | 326 +++++++++++++++++++++--------------
> >  vswitchd/vswitch.xml                 |  16 ++
> >  9 files changed, 305 insertions(+), 127 deletions(-)
> >  create mode 100644 Documentation/topics/dpdk/memory.rst
> >
> > diff --git a/Documentation/automake.mk b/Documentation/automake.mk
> > index 683ca14..14c2189 100644
> > --- a/Documentation/automake.mk
> > +++ b/Documentation/automake.mk
> > @@ -36,6 +36,7 @@ DOC_SOURCE = \
> >     Documentation/topics/dpdk/index.rst \
> >     Documentation/topics/dpdk/bridge.rst \
> >     Documentation/topics/dpdk/jumbo-frames.rst \
> > +   Documentation/topics/dpdk/memory.rst \
> >     Documentation/topics/dpdk/pdump.rst \
> >     Documentation/topics/dpdk/phy.rst \
> >     Documentation/topics/dpdk/pmd.rst \
> > diff --git a/Documentation/topics/dpdk/index.rst b/Documentation/topics/dpdk/index.rst
> > index 181f61a..cf24a7b 100644
> > --- a/Documentation/topics/dpdk/index.rst
> > +++ b/Documentation/topics/dpdk/index.rst
> > @@ -40,3 +40,4 @@ The DPDK Datapath
> >     /topics/dpdk/qos
> >     /topics/dpdk/pdump
> >     /topics/dpdk/jumbo-frames
> > +   /topics/dpdk/memory
> > diff --git a/Documentation/topics/dpdk/memory.rst b/Documentation/topics/dpdk/memory.rst
> > new file mode 100644
> > index 0000000..1198067
> > --- /dev/null
> > +++ b/Documentation/topics/dpdk/memory.rst
> > @@ -0,0 +1,67 @@
> > +..
> > +        Copyright 2018, Intel, Inc.
> > +
> > +      Licensed under the Apache License, Version 2.0 (the "License"); you may
> > +      not use this file except in compliance with the License. You may obtain
> > +      a copy of the License at
> > +
> > +          http://www.apache.org/licenses/LICENSE-2.0
> > +
> > +      Unless required by applicable law or agreed to in writing, software
> > +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
> > +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
> > +      License for the specific language governing permissions and limitations
> > +      under the License.
> > +
> > +      Convention for heading levels in Open vSwitch documentation:
> > +
> > +      =======  Heading 0 (reserved for the title in a document)
> > +      -------  Heading 1
> > +      ~~~~~~~  Heading 2
> > +      +++++++  Heading 3
> > +      '''''''  Heading 4
> > +
> > +      Avoid deeper levels because they do not render well.
> > +
> > +=========================
> > +DPDK Device Memory Models
> > +=========================
> > +
> > +DPDK device memory can be allocated in one of two ways in OVS DPDK,
> > +shared mempools or per port mempools. The specifics of both are
> > +detailed below.
> > +
> > +Shared Mempools
> > +---------------
> > +
> > +By default OVS DPDK uses a shared mempool model. This means that
> > +multiple ports can share the same mempool. For example, when a port is
> > +added it will have a given MTU and socket ID associated with it. If a
> > +mempool has been created previously for an existing port that has the
> > +same MTU and socket ID, that mempool is used for both ports. If there
> > +is no existing mempool supporting these parameters then a new mempool is
> > +created.
> > +
> > +
> > +Per Port Mempools
> > +-----------------
> > +
> > +In the per port mempool model, mempools are created per device and
> > +are not shared. The benefit of this is a more transparent memory
> > +model where mempools will not be exhausted by other DPDK devices.
> > +However this comes at a potential increase in cost for memory
> > +dimensioning for a given deployment. Users should be aware of the
> > +memory requirements for their deployment before using this model and
> > +allocate the required hugepage memory.
> > +
> > +Per port mempool support may be enabled via a global config value,
> > +``per-port-mp-enabled``. Setting this to true enables the per port
> > +mempool model for all DPDK devices in OVS::
> > +
> > +   $ ovs-vsctl set Open_vSwitch . other_config:per-port-mp-enabled=true
> > +
> > +.. important::
> > +
> > +    Changing this value requires restarting the daemon.
> > +
> 
> It's a good general guide, but the daemon does not need to be restarted
> as long as this is set before dpdk-init=true.

Good point. On a separate note, I think the same would apply to the vhost IOMMU
option? (Its documentation/commit message also requests a daemon restart.)
Will look into it.
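
For reference, the ordering being described (illustration only, assuming
dpdk-init has not been set yet) would be something like:

    $ ovs-vsctl --no-wait set Open_vSwitch . other_config:per-port-mp-enabled=true
    $ ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true

i.e. set the option before dpdk-init and no restart should be needed.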

> 
> > +.. todo::
> > +
> > +   Add memory calculation section for both mempool models.
> > diff --git a/NEWS b/NEWS
> > index ec548b0..c9991cf 100644
> > --- a/NEWS
> > +++ b/NEWS
> > @@ -30,6 +30,7 @@ Post-v2.9.0
> >       * New 'check-dpdk' Makefile target to run a new system testsuite.
> >         See Testing topic for the details.
> >       * Add LSC interrupt support for DPDK physical devices.
> > +     * Support both shared and per port mempools for DPDK devices.
> >     - Userspace datapath:
> >       * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
> >       * Detailed PMD performance metrics available with new command
> > diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c
> > index 041cd0c..d2a9718 100644
> > --- a/lib/dpdk-stub.c
> > +++ b/lib/dpdk-stub.c
> > @@ -55,6 +55,12 @@ dpdk_vhost_iommu_enabled(void)
> >      return false;
> >  }
> >
> > +bool
> > +dpdk_per_port_mempool_enabled(void)
> > +{
> > +    return false;
> > +}
> > +
> >  void
> >  print_dpdk_version(void)
> >  {
> > diff --git a/lib/dpdk.c b/lib/dpdk.c
> > index 00dd974..e251970 100644
> > --- a/lib/dpdk.c
> > +++ b/lib/dpdk.c
> > @@ -43,6 +43,8 @@ static FILE *log_stream = NULL;       /* Stream for DPDK log redirection */
> >
> >  static char *vhost_sock_dir = NULL;   /* Location of vhost-user sockets */
> >  static bool vhost_iommu_enabled = false; /* Status of vHost IOMMU support */
> > +static bool per_port_mp_enabled = false; /* Status of per port mempool
> > +                                          * support */
> >
> >  static int
> >  process_vhost_flags(char *flag, const char *default_val, int size,
> > @@ -354,6 +356,11 @@ dpdk_init__(const struct smap *ovs_other_config)
> >      VLOG_INFO("IOMMU support for vhost-user-client %s.",
> >                 vhost_iommu_enabled ? "enabled" : "disabled");
> >
> > +    per_port_mp_enabled = smap_get_bool(ovs_other_config,
> > +                                        "per-port-mp-enabled", false);
> > +    VLOG_INFO("Per port mempool for DPDK devices %s.",
> > +               per_port_mp_enabled ? "enabled" : "disabled");
> > +
> >      argv = grow_argv(&argv, 0, 1);
> >      argc = 1;
> >      argv[0] = xstrdup(ovs_get_program_name());
> > @@ -498,6 +505,12 @@ dpdk_vhost_iommu_enabled(void)
> >      return vhost_iommu_enabled;
> >  }
> >
> > +bool
> > +dpdk_per_port_mempool_enabled(void)
> > +{
> > +    return per_port_mp_enabled;
> > +}
> > +
> >  void
> >  dpdk_set_lcore_id(unsigned cpu)
> >  {
> > diff --git a/lib/dpdk.h b/lib/dpdk.h
> > index b041535..243385c 100644
> > --- a/lib/dpdk.h
> > +++ b/lib/dpdk.h
> > @@ -38,6 +38,7 @@ void dpdk_init(const struct smap *ovs_other_config);
> >  void dpdk_set_lcore_id(unsigned cpu);
> >  const char *dpdk_get_vhost_sock_dir(void);
> >  bool dpdk_vhost_iommu_enabled(void);
> > +bool dpdk_per_port_mempool_enabled(void);
> >  void print_dpdk_version(void);
> >
> >  #endif /* dpdk.h */
> > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> > index 87152a7..cda2ace 100644
> > --- a/lib/netdev-dpdk.c
> > +++ b/lib/netdev-dpdk.c
> > @@ -91,13 +91,24 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> >  #define NETDEV_DPDK_MBUF_ALIGN      1024
> >  #define NETDEV_DPDK_MAX_PKT_LEN     9728
> >
> > -/* Min number of packets in the mempool.  OVS tries to allocate a mempool with
> > - * roughly estimated number of mbufs: if this fails (because the system doesn't
> > - * have enough hugepages) we keep halving the number until the allocation
> > - * succeeds or we reach MIN_NB_MBUF */
> > +/* Max and min number of packets in the mempool.  OVS tries to allocate a
> > + * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have
> > + * enough hugepages) we keep halving the number until the allocation succeeds
> > + * or we reach MIN_NB_MBUF */
> > +
> > +#define MAX_NB_MBUF          (4096 * 64)
> >  #define MIN_NB_MBUF          (4096 * 4)
> >  #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
> >
> > +/* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
> > +BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF / MIN_NB_MBUF)
> > +                  == 0);
> > +
> > +/* The smallest possible NB_MBUF that we're going to try should be a multiple
> > + * of MP_CACHE_SZ. This is advised by DPDK documentation. */
> > +BUILD_ASSERT_DECL((MAX_NB_MBUF / ROUND_DOWN_POW2(MAX_NB_MBUF / MIN_NB_MBUF))
> > +                  % MP_CACHE_SZ == 0);
> > +
> >  /*
> >   * DPDK XSTATS Counter names definition
> >   */
> > @@ -296,13 +307,14 @@ static struct ovs_list dpdk_list OVS_GUARDED_BY(dpdk_mutex)
> >  static struct ovs_mutex dpdk_mp_mutex OVS_ACQ_AFTER(dpdk_mutex)
> >      = OVS_MUTEX_INITIALIZER;
> >
> > -/* Contains all 'struct dpdk_mp's. */
> 
> Not strictly needed, but comment is still valid

This was lost in rebasing I think, will return it in v2.

> 
> > -static struct ovs_list dpdk_mp_free_list OVS_GUARDED_BY(dpdk_mp_mutex)
> > -    = OVS_LIST_INITIALIZER(&dpdk_mp_free_list);
> > +static struct ovs_list dpdk_mp_list OVS_GUARDED_BY(dpdk_mp_mutex)
> > +    = OVS_LIST_INITIALIZER(&dpdk_mp_list);
> >
> > -/* Wrapper for a mempool released but not yet freed. */
> >  struct dpdk_mp {
> >       struct rte_mempool *mp;
> > +     int mtu;
> > +     int socket_id;
> > +     int refcount;
> >       struct ovs_list list_node OVS_GUARDED_BY(dpdk_mp_mutex);
> >   };
> >
> > @@ -381,7 +393,7 @@ struct netdev_dpdk {
> >
> >      PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
> >          struct ovs_mutex mutex OVS_ACQ_AFTER(dpdk_mutex);
> > -        struct rte_mempool *mp;
> > +        struct dpdk_mp *dpdk_mp;
> >
> >          /* virtio identifier for vhost devices */
> >          ovsrcu_index vid;
> > @@ -526,88 +538,70 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp OVS_UNUSED,
> >      dp_packet_init_dpdk((struct dp_packet *) pkt, pkt->buf_len);
> >  }
> >
> > -static int
> > -dpdk_mp_full(const struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
> > -{
> > -    unsigned ring_count;
> > -    /* This logic is needed because rte_mempool_full() is not guaranteed to
> > -     * be atomic and mbufs could be moved from mempool cache --> mempool ring
> > -     * during the call. However, as no mbufs will be taken from the mempool
> > -     * at this time, we can work around it by also checking the ring entries
> > -     * separately and ensuring that they have not changed.
> > -     */
> > -    ring_count = rte_mempool_ops_get_count(mp);
> > -    if (rte_mempool_full(mp) && rte_mempool_ops_get_count(mp) == ring_count) {
> > -        return 1;
> > -    }
> > -
> > -    return 0;
> > -}
> > -
> > -/* Free unused mempools. */
> > -static void
> > -dpdk_mp_sweep(void)
> > +/* Calculating the required number of mbufs differs depending on the
> > + * mempool model being used. Check if per port mempools are in use before
> > + * calculating.
> > + */
> > +static uint32_t
> > +dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
> >  {
> > -    struct dpdk_mp *dmp, *next;
> > +    uint32_t n_mbufs;
> >
> > -    ovs_mutex_lock(&dpdk_mp_mutex);
> > -    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
> > -        if (dpdk_mp_full(dmp->mp)) {
> > -            VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name);
> > -            ovs_list_remove(&dmp->list_node);
> > -            rte_mempool_free(dmp->mp);
> > -            rte_free(dmp);
> > +    if (!per_port_mp) {
> > +        /* Shared mempools are being used.
> > +         * XXX: this is a really rough method of provisioning memory.
> 
> XXX causes warnings in checkpatch (found that out on rte_mempool_full
> patch)
> 

Yes, I spotted this myself before sending the patch :). The guideline agreed
regarding 'XXX' in patches is that it's allowed in order to flag future work
(and to make clear it's not just something the author missed).

There was a fair bit of discussion on the ML previously regarding the memory
calculations, so I expect they will change in the future, possibly a reduction
or removal of the txq-specific calculations for the per port implementation.

For the v1 here I kept the comment just to highlight that it could be reworked
in the future.
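
To put rough numbers on the two models as they stand in this RFC (example only,
a port with 2 rxqs and 2 txqs of 2048 descriptors each, MTU >= ETHER_MTU):

    shared:   n_mbufs = MAX_NB_MBUF = 4096 * 64                        = 262144
    per port: n_mbufs = 2 * 2048 + 2 * 2048
                        + MIN(RTE_MAX_LCORE, 2) * NETDEV_MAX_BURST (64)
                        + MIN_NB_MBUF (16384)                          = 24640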
 
> > +         * It's impossible to determine what the exact memory requirements are
> > +         * when the number of ports and rxqs that utilize a particular mempool
> > +         * can change dynamically at runtime. For now, use this rough
> > +         * heuristic.
> > +         */
> > +        if (mtu >= ETHER_MTU) {
> > +            n_mbufs = MAX_NB_MBUF;
> > +        } else {
> > +            n_mbufs = MIN_NB_MBUF;
> >          }
> > +    } else {
> > +        /* Per port mempools are being used.
> > +         * XXX: rough estimation of number of mbufs required for this port:
> > +         * <packets required to fill the device rxqs>
> > +         * + <packets that could be stuck on other ports txqs>
> > +         * + <packets in the pmd threads>
> > +         * + <additional memory for corner cases>
> > +         */
> > +        n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
> > +                  + dev->requested_n_txq * dev->requested_txq_size
> > +                  + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
> > +                  + MIN_NB_MBUF;
> >      }
> > -    ovs_mutex_unlock(&dpdk_mp_mutex);
> > -}
> > -
> > -/* Ensure a mempool will not be freed. */
> > -static void
> > -dpdk_mp_do_not_free(struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
> > -{
> > -    struct dpdk_mp *dmp, *next;
> >
> > -    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
> > -        if (dmp->mp == mp) {
> > -            VLOG_DBG("Removing mempool \"%s\" from free list", dmp->mp->name);
> > -            ovs_list_remove(&dmp->list_node);
> > -            rte_free(dmp);
> > -            break;
> > -        }
> > -    }
> > +    return n_mbufs;
> >  }
> >
> > -/* Returns a valid pointer when either of the following is true:
> > - *  - a new mempool was just created;
> > - *  - a matching mempool already exists. */
> > -static struct rte_mempool *
> > -dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
> > +static struct dpdk_mp *
> > +dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
> >  {
> >      char mp_name[RTE_MEMPOOL_NAMESIZE];
> >      const char *netdev_name = netdev_get_name(&dev->up);
> >      int socket_id = dev->requested_socket_id;
> >      uint32_t n_mbufs;
> >      uint32_t hash = hash_string(netdev_name, 0);
> > -    struct rte_mempool *mp = NULL;
> > -
> > -    /*
> > -     * XXX: rough estimation of number of mbufs required for this port:
> > -     * <packets required to fill the device rxqs>
> > -     * + <packets that could be stuck on other ports txqs>
> > -     * + <packets in the pmd threads>
> > -     * + <additional memory for corner cases>
> > -     */
> > -    n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
> > -              + dev->requested_n_txq * dev->requested_txq_size
> > -              + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
> > -              + MIN_NB_MBUF;
> > +    struct dpdk_mp *dmp = NULL;
> > +    int ret;
> > +
> > +    dmp = dpdk_rte_mzalloc(sizeof *dmp);
> > +    if (!dmp) {
> > +        return NULL;
> > +    }
> > +    dmp->socket_id = socket_id;
> > +    dmp->mtu = mtu;
> > +    dmp->refcount = 1;
> > +
> > +    n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
> >
> > -    ovs_mutex_lock(&dpdk_mp_mutex);
> >      do {
> >          /* Full DPDK memory pool name must be unique and cannot be
> >           * longer than RTE_MEMPOOL_NAMESIZE. */
> > -        int ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
> > +        ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
> >                             "ovs%08x%02d%05d%07u",
> >                             hash, socket_id, mtu, n_mbufs);
> 
> You're using the hash of one dev in the name, but that mempool can later be
> used by another dev, and may even no longer be used by the original dev.
> OTOH no one is going to know what the hash means anyway, so it shouldn't
> cause any extra confusion when debugging.

I agree, it's just to ensure a unique name for the mempool in the per port case
and has no impact on the shared model.

I'll add a comment in the v2 to explain this and make clear that it's expected.
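
For reference, with the "ovs%08x%02d%05d%07u" format a port whose name hashes
to (say) 0x3a2b1c4d, on socket 0 with MTU 1500 and 262144 mbufs, would give a
mempool name like:

    ovs3a2b1c4d00015000262144

so the hash only records which port originally triggered the creation.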

> 
> >          if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) {
> > @@ -623,102 +617,179 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
> >                    netdev_name, n_mbufs, socket_id,
> >                    dev->requested_n_rxq, dev->requested_n_txq);
> >
> > -        mp = rte_pktmbuf_pool_create(mp_name, n_mbufs, MP_CACHE_SZ,
> > -                 sizeof (struct dp_packet) - sizeof (struct rte_mbuf),
> > -                 MBUF_SIZE(mtu) - sizeof(struct dp_packet), socket_id);
> > +        dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs,
> > +                                          MP_CACHE_SZ,
> > +                                          sizeof (struct dp_packet)
> > +                                          - sizeof (struct rte_mbuf),
> > +                                          MBUF_SIZE(mtu)
> > +                                          - sizeof(struct dp_packet),
> > +                                          socket_id);
> >
> > -        if (mp) {
> > +        if (dmp->mp) {
> >              VLOG_DBG("Allocated \"%s\" mempool with %u mbufs",
> >                       mp_name, n_mbufs);
> >              /* rte_pktmbuf_pool_create has done some initialization of the
> > -             * rte_mbuf part of each dp_packet. Some OvS specific fields
> > -             * of the packet still need to be initialized by
> > -             * ovs_rte_pktmbuf_init. */
> > -            rte_mempool_obj_iter(mp, ovs_rte_pktmbuf_init, NULL);
> > +             * rte_mbuf part of each dp_packet, while ovs_rte_pktmbuf_init
> > +             * initializes some OVS specific fields of dp_packet.
> > +             */
> > +            rte_mempool_obj_iter(dmp->mp, ovs_rte_pktmbuf_init, NULL);
> > +            return dmp;
> >          } else if (rte_errno == EEXIST) {
> >              /* A mempool with the same name already exists.  We just
> >               * retrieve its pointer to be returned to the caller. */
> > -            mp = rte_mempool_lookup(mp_name);
> > +            dmp->mp = rte_mempool_lookup(mp_name);
> >              /* As the mempool create returned EEXIST we can expect the
> >               * lookup has returned a valid pointer.  If for some reason
> >               * that's not the case we keep track of it. */
> >              VLOG_DBG("A mempool with name \"%s\" already exists at
> %p.",
> > -                     mp_name, mp);
> > -            /* Ensure this reused mempool will not be freed. */
> > -            dpdk_mp_do_not_free(mp);
> 
> The mempool you are about to reuse could have a refcount of 0 and be about to
> be freed by a sweep. So you would need something like the function above
> before giving up dpdk_mp_mutex. Maybe you could increase the refcount for it
> now and re-adjust later if you need to.
> 

So my thinking here was that we run the sweep in
netdev_dpdk_mempool_configure() just prior to calling dpdk_mp_create().

Running the sweep beforehand should have removed any mempool with a refcount of
0 from the dpdk_mp_list. The list itself is mutex guarded, so if there is an
existing mempool I thought its refcount would still be 1, but maybe I missed
something.

Do you think it could be modified between the sweep call and reaching this
point?

> > +                     mp_name, dmp->mp);
> >          } else {
> >              VLOG_ERR("Failed mempool \"%s\" create request of %u
> mbufs",
> >                       mp_name, n_mbufs);
> >          }
> > -    } while (!mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= MIN_NB_MBUF);
> > +    } while (!dmp->mp && rte_errno == ENOMEM && (n_mbufs /= 2) >=
> > + MIN_NB_MBUF);
> >
> > -    ovs_mutex_unlock(&dpdk_mp_mutex);
> > -    return mp;
> > +    rte_free(dmp);
> > +    return NULL;
> >  }
> >
> > -/* Release an existing mempool. */
> > +static int
> > +dpdk_mp_full(const struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
> > +{
> > +    unsigned ring_count;
> > +    /* This logic is needed because rte_mempool_full() is not guaranteed to
> > +     * be atomic and mbufs could be moved from mempool cache --> mempool ring
> > +     * during the call. However, as no mbufs will be taken from the mempool
> > +     * at this time, we can work around it by also checking the ring entries
> > +     * separately and ensuring that they have not changed.
> > +     */
> > +    ring_count = rte_mempool_ops_get_count(mp);
> > +    if (rte_mempool_full(mp) && rte_mempool_ops_get_count(mp) == ring_count) {
> > +        return 1;
> > +    }
> > +
> > +    return 0;
> 
> You'll need to rebase because of the changes here. It won't make any
> difference to your operation though.
> 
> > +}
> > +
> > +/* Free unused mempools. */
> >  static void
> > -dpdk_mp_release(struct rte_mempool *mp)
> > +dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex)
> >  {
> > -    if (!mp) {
> > -        return;
> > +    struct dpdk_mp *dmp, *next;
> > +
> > +    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_list) {
> > +        if (!dmp->refcount && dpdk_mp_full(dmp->mp)) {
> > +            VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name);
> > +            ovs_list_remove(&dmp->list_node);
> > +            rte_mempool_free(dmp->mp);
> > +            rte_free(dmp);
> > +        }
> >      }
> > +}
> > +
> > +static struct dpdk_mp *
> > +dpdk_mp_get(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
> > +{
> > +    struct dpdk_mp *dmp;
> > +    bool reuse = false;
> >
> >      ovs_mutex_lock(&dpdk_mp_mutex);
> > -    if (dpdk_mp_full(mp)) {
> > -        VLOG_DBG("Freeing mempool \"%s\"", mp->name);
> > -        rte_mempool_free(mp);
> > -    } else {
> > -        struct dpdk_mp *dmp;
> > +    /* Check if shared mempools are being used, if so check existing mempools
> > +     * to see if reuse is possible. */
> > +    if (!per_port_mp) {
> > +        LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
> > +            if (dmp->socket_id == dev->requested_socket_id
> > +                && dmp->mtu == mtu) {
> > +                VLOG_DBG("Reusing mempool \"%s\"", dmp->mp->name);
> > +                dmp->refcount++;
> > +                reuse = true;
> > +                break;
> > +            }
> > +        }
> > +    }
> > +    /* Sweep mempools after reuse or before create. */
> > +    dpdk_mp_sweep();
> >
> > -        dmp = dpdk_rte_mzalloc(sizeof *dmp);
> > +    if (!reuse) {
> > +        dmp = dpdk_mp_create(dev, mtu, per_port_mp);
> >          if (dmp) {
> > -            dmp->mp = mp;
> > -            ovs_list_push_back(&dpdk_mp_free_list, &dmp->list_node);
> > +            ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
> >          }
> >      }
> > +
> >      ovs_mutex_unlock(&dpdk_mp_mutex);
> > +
> > +    return dmp;
> >  }
> >
> > -/* Tries to allocate a new mempool - or re-use an existing one where
> > - * appropriate - on requested_socket_id with a size determined by
> > - * requested_mtu and requested Rx/Tx queues.
> > - * On success - or when re-using an existing mempool - the new configuration
> > - * will be applied.
> > +/* Decrement reference to a mempool. */
> > +static void
> > +dpdk_mp_put(struct dpdk_mp *dmp)
> > +{
> > +    if (!dmp) {
> > +        return;
> > +    }
> > +
> > +    ovs_mutex_lock(&dpdk_mp_mutex);
> > +    ovs_assert(dmp->refcount);
> > +    dmp->refcount--;
> > +    ovs_mutex_unlock(&dpdk_mp_mutex);
> > +}
> > +
> > +/* Depending on the memory model being used this function tries
> > + * identify and reuse an existing mempool or tries to allocate new
> > + * mempool on requested_socket_id with mbuf size corresponding to
> > + * requested_mtu. On success new configuration will be applied.
> >   * On error, device will be left unchanged. */
> >  static int
> >  netdev_dpdk_mempool_configure(struct netdev_dpdk *dev)
> >      OVS_REQUIRES(dev->mutex)
> >  {
> >      uint32_t buf_size = dpdk_buf_size(dev->requested_mtu);
> > -    struct rte_mempool *mp;
> > +    struct dpdk_mp *mp;
> >      int ret = 0;
> > +    bool per_port_mp = dpdk_per_port_mempool_enabled();
> >
> > -    dpdk_mp_sweep();
> > +    /* With shared mempools we do not need to configure a mempool if the MTU
> > +     * and socket ID have not changed, the previous configuration is still
> > +     * valid so return 0 */
> > +    if (dev->mtu == dev->requested_mtu
> > +        && dev->socket_id == dev->requested_socket_id
> > +        && (!per_port_mp)) {
> 
> You can remove the brackets, and it's probably more intuitive to check the
> memory model first.

Yes, with your later comment regarding the refactor based on the memory model
I can address this.
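
Something like this for the check itself, I think (sketch of the reordered
condition only):

    if (!per_port_mp && dev->mtu == dev->requested_mtu
        && dev->socket_id == dev->requested_socket_id) {
        /* Shared mempools: the existing configuration is still valid. */
        return ret;
    }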

> 
> > +        return ret;
> > +    }
> >
> > -    mp = dpdk_mp_create(dev, FRAME_LEN_TO_MTU(buf_size));
> > +    mp = dpdk_mp_get(dev, FRAME_LEN_TO_MTU(buf_size), per_port_mp);
> >      if (!mp) {
> >          VLOG_ERR("Failed to create memory pool for netdev "
> >                   "%s, with MTU %d on socket %d: %s\n",
> >                   dev->up.name, dev->requested_mtu, dev->requested_socket_id,
> > -                 rte_strerror(rte_errno));
> > -        ret = rte_errno;
> > +        rte_strerror(rte_errno));
> > +        return rte_errno;
> >      } else {
> > -        /* If a new MTU was requested and its rounded value equals the one
> > -         * that is currently used, then the existing mempool is returned. */
> > -        if (dev->mp != mp) {
> > -            /* A new mempool was created, release the previous one. */
> > -            dpdk_mp_release(dev->mp);
> > -        } else {
> > -            ret = EEXIST;
> > +        /* Check for any pre-existing dpdk_mp for the device */
> > +        if (dev->dpdk_mp != NULL) {
> 
> Not sure you need this check, dpdk_mp_put will handle NULL
> 

Yes, dpdk_mp_put() will handle NULL, but the if statement below that compares
the mempools accesses the mempool pointers inside the dpdk_mp structs. When a
device is newly created there is no previous dpdk_mp, so accessing
dev->dpdk_mp->mp would cause a segfault.
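
i.e. some form of the guard has to stay, e.g. (sketch only, collapsing the two
checks into one condition):

    /* dev->dpdk_mp is NULL for a newly created device, so guard the
     * dereference before comparing the underlying mempools. */
    if (dev->dpdk_mp && dev->dpdk_mp->mp != mp->mp) {
        dpdk_mp_put(dev->dpdk_mp);
    }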

> > +            /* If a new MTU was requested and its rounded value equals the
> > +             * one that is currently used, then the existing mempool is
> > +             * returned. */
> > +            if (dev->dpdk_mp->mp != mp->mp) {
> > +                /* A new mempool was created, release the previous one. */
> > +                dpdk_mp_put(dev->dpdk_mp);
> > +            } else {
> > +                /* If the mempool already exists in the current dpdk_mp then
> > +                 * we need to ensure dpdk_mp that was just created is freed
> > +                 * in the next sweep as it will not be used. */
> > +                dpdk_mp_put(mp);
> 
> This doesn't look right - it will likely lead to the mempool you are using
> being freed also. I think you just want to free the dpdk_mp struct, but
> not the actual mempool.

Oops, good catch, you're correct: we only need to free the dpdk_mp in this case,
not the mempool itself.

At this point the dpdk_mp has already been added to the list, so it may make
more sense to do this work in dpdk_mp_create() instead, to avoid touching the
list at all for the EEXIST case. I'll look into this for the v2.
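
Roughly the kind of thing I have in mind (untested sketch; the helper name is
made up, and the EEXIST branch in dpdk_mp_create() would call it before the new
wrapper ever reaches dpdk_mp_list):

    /* Sketch only: reuse the dpdk_mp that already owns 'mp' and free just the
     * duplicate wrapper 'dup' - the rte_mempool itself is never freed here. */
    static struct dpdk_mp *
    dpdk_mp_adopt_existing(struct rte_mempool *mp, struct dpdk_mp *dup)
        OVS_REQUIRES(dpdk_mp_mutex)
    {
        struct dpdk_mp *dmp;

        LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
            if (dmp->mp == mp) {
                dmp->refcount++;
                rte_free(dup);      /* drop the extra struct dpdk_mp only */
                return dmp;
            }
        }

        return dup;                 /* no existing owner found, keep the wrapper */
    }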
> 
> > +                ret = EEXIST;
> 
> Setting ret, but returning 0 below
> 
> > +            }
> >          }
> > -        dev->mp = mp;
> > +        dev->dpdk_mp = mp;
> >          dev->mtu = dev->requested_mtu;
> >          dev->socket_id = dev->requested_socket_id;
> >          dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
> >      }
> >
> > -    return ret;
> > +    return 0;
> >  }
> >
> 
> I know there were lots of paths already, but adding two different schemes
> is making the code here and in create even more confusing. I think there are
> a few subtle bugs above, and it's probably down to the complexity. I wonder
> if you could use per_port_mp a bit more so that at least the paths for each
> scheme are more segregated? When reviewing, it's hard to tell whether a code
> branch would be taken with one scheme or the other, or both.

+1, I'm looking at refactoring around per_port_mp in the v2. The original
intention was to keep the code footprint as small as possible, but for
readability and clarity on the code paths I'd agree.
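
Roughly the shape I have in mind for the v2 (untested sketch;
dpdk_mp_get_shared() is a made-up helper name, and each helper would also take
care of adding any new entry to dpdk_mp_list):

    static struct dpdk_mp *
    dpdk_mp_get(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
    {
        struct dpdk_mp *dmp;

        ovs_mutex_lock(&dpdk_mp_mutex);
        dpdk_mp_sweep();
        dmp = per_port_mp
              ? dpdk_mp_create(dev, mtu, true)   /* always a private mempool */
              : dpdk_mp_get_shared(dev, mtu);    /* reuse by MTU/socket or create */
        ovs_mutex_unlock(&dpdk_mp_mutex);

        return dmp;
    }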

Thanks
Ian
> 
> Kevin.
> 
> >  static void
> > @@ -835,7 +906,8 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
> >
> >          for (i = 0; i < n_rxq; i++) {
> >              diag = rte_eth_rx_queue_setup(dev->port_id, i, dev->rxq_size,
> > -                                          dev->socket_id, NULL, dev->mp);
> > +                                          dev->socket_id, NULL,
> > +                                          dev->dpdk_mp->mp);
> >              if (diag) {
> >                  VLOG_INFO("Interface %s unable to setup rxq(%d): %s",
> >                            dev->up.name, i, rte_strerror(-diag));
> > @@ -922,7 +994,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
> >      memcpy(dev->hwaddr.ea, eth_addr.addr_bytes, ETH_ADDR_LEN);
> >      rte_eth_link_get_nowait(dev->port_id, &dev->link);
> >
> > -    mbp_priv = rte_mempool_get_priv(dev->mp);
> > +    mbp_priv = rte_mempool_get_priv(dev->dpdk_mp->mp);
> >      dev->buf_size = mbp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM;
> >
> >      /* Get the Flow control configuration for DPDK-ETH */
> > @@ -1176,7 +1248,7 @@ common_destruct(struct netdev_dpdk *dev)
> >      OVS_EXCLUDED(dev->mutex)
> >  {
> >      rte_free(dev->tx_q);
> > -    dpdk_mp_release(dev->mp);
> > +    dpdk_mp_put(dev->dpdk_mp);
> >
> >      ovs_list_remove(&dev->list_node);
> >      free(ovsrcu_get_protected(struct ingress_policer *,
> > @@ -1928,7 +2000,7 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
> >          return EAGAIN;
> >      }
> >
> > -    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->mp,
> > +    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->dpdk_mp->mp,
> >                                      (struct rte_mbuf **) batch->packets,
> >                                      NETDEV_MAX_BURST);
> >      if (!nb_rx) {
> > @@ -2167,7 +2239,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
> >              continue;
> >          }
> >
> > -        pkts[txcnt] = rte_pktmbuf_alloc(dev->mp);
> > +        pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
> >          if (OVS_UNLIKELY(!pkts[txcnt])) {
> >              dropped += cnt - i;
> >              break;
> > @@ -3047,7 +3119,7 @@ netdev_dpdk_get_mempool_info(struct unixctl_conn *conn,
> >          ovs_mutex_lock(&dev->mutex);
> >          ovs_mutex_lock(&dpdk_mp_mutex);
> >
> > -        rte_mempool_dump(stream, dev->mp);
> > +        rte_mempool_dump(stream, dev->dpdk_mp->mp);
> >
> >          ovs_mutex_unlock(&dpdk_mp_mutex);
> >          ovs_mutex_unlock(&dev->mutex);
> > @@ -3742,7 +3814,7 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
> >
> >      err = netdev_dpdk_mempool_configure(dev);
> >      if (!err) {
> > -        /* A new mempool was created. */
> > +        /* A new mempool was created or re-used. */
> >          netdev_change_seq_changed(&dev->up);
> >      } else if (err != EEXIST){
> >          return err;
> > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> > index 7ab90d5..95520af 100644
> > --- a/vswitchd/vswitch.xml
> > +++ b/vswitchd/vswitch.xml
> > @@ -359,6 +359,22 @@
> >          </p>
> >        </column>
> >
> > +      <column name="other_config" key="per-port-mp-enabled"
> > +              type='{"type": "boolean"}'>
> > +        <p>
> > +          By default OVS DPDK uses a shared mempool memory model wherein
> > +          devices that have the same MTU and socket values can share the same
> > +          mempool. Setting this value to <code>true</code> changes this
> > +          behaviour. Per port mempools allow DPDK devices to use a private
> > +          mempool per device. This can provide greater transparency as
> > +          regards memory usage but potentially at the cost of greater memory
> > +          requirements.
> > +        </p>
> > +        <p>
> > +          Changing this value requires restarting the daemon.
> > +        </p>
> > +      </column>
> > +
> >        <column name="other_config" key="tx-flush-interval"
> >                type='{"type": "integer",
> >                       "minInteger": 0, "maxInteger": 1000000}'>
> >

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
