Hi Ian,

Thanks for bringing this forward.
I've tested this on current master by re-configuring MTUs of existing ports, adding new ports, deleting existing ones, etc. It seems to be behaving as expected:

- On the shared model, it allocates a new mempool only if the MTU changes and otherwise reuses the already existing one, and it frees a mempool only once it isn't being used by any port;
- On the per-port model, it allocates a new mempool for every port, and it frees the mempools when the ports are destroyed or the MTU is changed.

There's a catch in how mempools are freed, though; I have a comment on that in-line below*. (I see Kevin has sent a review as well, so I'll refrain from touching on the same points, except to re-iterate.)

On 18/05/2018 15:30, Ian Stokes wrote: > This commit re-introduces the concept of shared mempools as the default > memory model for DPDK devices. Per port mempools are still available but > must be enabled explicitly by a user. > > OVS previously used a shared mempool model for ports with the same MTU > and socket configuration. This was replaced by a per port mempool model > to address issues flagged by users such as: > > https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html > > However the per port model potentially requires an increase in memory > resource requirements to support the same number of ports and configuration > as the shared port model. > > This is considered a blocking factor for current deployments of OVS > when upgrading to future OVS releases as a user may have to redimension > memory for the same deployment configuration. This may not be possible for > users. > > This commit resolves the issue by re-introducing shared mempools as > the default memory behaviour in OVS DPDK but also refactors the memory > configuration code to allow for per port mempools. > > This patch adds a new global config option, per-port-mp-enabled, that > controls the enablement of per port mempools for DPDK devices. > > ovs-vsctl set Open_vSwitch . 
other_config:per-port-mp-enabled=true > > This value defaults to false; to enable per port mempool support, > this field should be set to true when setting other global parameters > on init (such as "dpdk-socket-mem", for example). Changing the value at > runtime is not supported, and requires restarting the vswitch > daemon. > > The mempool sweep functionality is also replaced with the > sweep functionality from OVS 2.9 found in commits > > c77f692 (netdev-dpdk: Free mempool only when no in-use mbufs.) > a7fb0a4 (netdev-dpdk: Add mempool reuse/free debug.) > > As this patch is RFC there are a number of TO-DOs including adding a > memory calculation section to the documentation for both models. This is > expected to be completed in the v1 after RFC. > > Signed-off-by: Ian Stokes <ian.sto...@intel.com> > --- > Documentation/automake.mk | 1 + > Documentation/topics/dpdk/index.rst | 1 + > Documentation/topics/dpdk/memory.rst | 67 +++++++ > NEWS | 1 + > lib/dpdk-stub.c | 6 + > lib/dpdk.c | 13 ++ > lib/dpdk.h | 1 + > lib/netdev-dpdk.c | 326 > +++++++++++++++++++++-------------- > vswitchd/vswitch.xml | 16 ++ > 9 files changed, 305 insertions(+), 127 deletions(-) > create mode 100644 Documentation/topics/dpdk/memory.rst > > diff --git a/Documentation/automake.mk b/Documentation/automake.mk > index 683ca14..14c2189 100644 > --- a/Documentation/automake.mk > +++ b/Documentation/automake.mk > @@ -36,6 +36,7 @@ DOC_SOURCE = \ > Documentation/topics/dpdk/index.rst \ > Documentation/topics/dpdk/bridge.rst \ > Documentation/topics/dpdk/jumbo-frames.rst \ > + Documentation/topics/dpdk/memory.rst \ > Documentation/topics/dpdk/pdump.rst \ > Documentation/topics/dpdk/phy.rst \ > Documentation/topics/dpdk/pmd.rst \ > diff --git a/Documentation/topics/dpdk/index.rst > b/Documentation/topics/dpdk/index.rst > index 181f61a..cf24a7b 100644 > --- a/Documentation/topics/dpdk/index.rst > +++ b/Documentation/topics/dpdk/index.rst > @@ -40,3 +40,4 @@ The DPDK Datapath > /topics/dpdk/qos > 
/topics/dpdk/pdump > /topics/dpdk/jumbo-frames > + /topics/dpdk/memory > diff --git a/Documentation/topics/dpdk/memory.rst > b/Documentation/topics/dpdk/memory.rst > new file mode 100644 > index 0000000..1198067 > --- /dev/null > +++ b/Documentation/topics/dpdk/memory.rst > @@ -0,0 +1,67 @@ > +.. > + Copyright 2018, Intel, Inc. > + > + Licensed under the Apache License, Version 2.0 (the "License"); you may > + not use this file except in compliance with the License. You may obtain > + a copy of the License at > + > + http://www.apache.org/licenses/LICENSE-2.0 > + > + Unless required by applicable law or agreed to in writing, software > + distributed under the License is distributed on an "AS IS" BASIS, > WITHOUT > + WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See > the > + License for the specific language governing permissions and limitations > + under the License. > + > + Convention for heading levels in Open vSwitch documentation: > + > + ======= Heading 0 (reserved for the title in a document) > + ------- Heading 1 > + ~~~~~~~ Heading 2 > + +++++++ Heading 3 > + ''''''' Heading 4 > + > + Avoid deeper levels because they do not render well. > + > +========================= > +DPDK Device Memory Models > +========================= > + > +DPDK device memory can be allocated in one of two ways in OVS DPDK, > +shared mempools or per port mempools. The specifics of both are detailed > +below. > + > +Shared Mempools > +--------------- > + > +By Default OVS DPDK uses a shared mempool model. This means that multiple > +ports can share the same mempool. For example when a port is added it will > +have a given MTU and socket ID associated with it. If a mempool has been > +created previously for an existing port that has the same MTU and socket ID, > +that mempool is used for both ports. If there is no existing mempool > +supporting these parameters then a new mempool is created. 
> + > + > +Per Port Mempools > +----------------- > + > +In the per port mempool model, mempools are created per device and are not > +shared. The benefit of this is a more transparent memory model where mempools > +will not be exhausted by other DPDK devices. However this comes at a > potential > +increase in cost for memory dimensioning for a given deployment. Users should > +be aware of the memory requirements for their deployment before using this > +model and allocate the required hugepage memory. > + > +Per port mempool support may be enabled via a global config value, > +```per-port-mp-enabled```. Setting this to true enables the per port mempool > +model for all DPDK devices in OVS:: > + > + $ ovs-vsctl set Open_vSwitch . other_config:per-port-mp-enabled=true > + > +.. important:: > + > + Changing this value requires restarting the daemon. > + > +.. todo:: > + > + Add memory calculation section for both mempool models. > diff --git a/NEWS b/NEWS > index ec548b0..c9991cf 100644 > --- a/NEWS > +++ b/NEWS > @@ -30,6 +30,7 @@ Post-v2.9.0 > * New 'check-dpdk' Makefile target to run a new system testsuite. > See Testing topic for the details. > * Add LSC interrupt support for DPDK physical devices. > + * Support both shared and per port mempools for DPDK devices. 
> - Userspace datapath: > * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single > PMD > * Detailed PMD performance metrics available with new command > diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c > index 041cd0c..d2a9718 100644 > --- a/lib/dpdk-stub.c > +++ b/lib/dpdk-stub.c > @@ -55,6 +55,12 @@ dpdk_vhost_iommu_enabled(void) > return false; > } > > +bool > +dpdk_per_port_mempool_enabled(void) > +{ > + return false; > +} > + > void > print_dpdk_version(void) > { > diff --git a/lib/dpdk.c b/lib/dpdk.c > index 00dd974..e251970 100644 > --- a/lib/dpdk.c > +++ b/lib/dpdk.c > @@ -43,6 +43,8 @@ static FILE *log_stream = NULL; /* Stream for DPDK > log redirection */ > > static char *vhost_sock_dir = NULL; /* Location of vhost-user sockets */ > static bool vhost_iommu_enabled = false; /* Status of vHost IOMMU support */ > +static bool per_port_mp_enabled = false; /* Status of per port mempool > + * support */ > > static int > process_vhost_flags(char *flag, const char *default_val, int size, > @@ -354,6 +356,11 @@ dpdk_init__(const struct smap *ovs_other_config) > VLOG_INFO("IOMMU support for vhost-user-client %s.", > vhost_iommu_enabled ? "enabled" : "disabled"); > > + per_port_mp_enabled = smap_get_bool(ovs_other_config, > + "per-port-mp-enabled", false); > + VLOG_INFO("Per port mempool for DPDK devices %s.", > + per_port_mp_enabled ? 
"enabled" : "disabled"); > + > argv = grow_argv(&argv, 0, 1); > argc = 1; > argv[0] = xstrdup(ovs_get_program_name()); > @@ -498,6 +505,12 @@ dpdk_vhost_iommu_enabled(void) > return vhost_iommu_enabled; > } > > +bool > +dpdk_per_port_mempool_enabled(void) > +{ > + return per_port_mp_enabled; > +} > + > void > dpdk_set_lcore_id(unsigned cpu) > { > diff --git a/lib/dpdk.h b/lib/dpdk.h > index b041535..243385c 100644 > --- a/lib/dpdk.h > +++ b/lib/dpdk.h > @@ -38,6 +38,7 @@ void dpdk_init(const struct smap *ovs_other_config); > void dpdk_set_lcore_id(unsigned cpu); > const char *dpdk_get_vhost_sock_dir(void); > bool dpdk_vhost_iommu_enabled(void); > +bool dpdk_per_port_mempool_enabled(void); > void print_dpdk_version(void); > > #endif /* dpdk.h */ > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c > index 87152a7..cda2ace 100644 > --- a/lib/netdev-dpdk.c > +++ b/lib/netdev-dpdk.c > @@ -91,13 +91,24 @@ static struct vlog_rate_limit rl = > VLOG_RATE_LIMIT_INIT(5, 20); > #define NETDEV_DPDK_MBUF_ALIGN 1024 > #define NETDEV_DPDK_MAX_PKT_LEN 9728 > > -/* Min number of packets in the mempool. OVS tries to allocate a mempool > with > - * roughly estimated number of mbufs: if this fails (because the system > doesn't > - * have enough hugepages) we keep halving the number until the allocation > - * succeeds or we reach MIN_NB_MBUF */ > +/* Max and min number of packets in the mempool. 
OVS tries to allocate a > + * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have > + * enough hugepages) we keep halving the number until the allocation succeeds > + * or we reach MIN_NB_MBUF */ > + > +#define MAX_NB_MBUF (4096 * 64) > #define MIN_NB_MBUF (4096 * 4) > #define MP_CACHE_SZ RTE_MEMPOOL_CACHE_MAX_SIZE > > +/* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */ > +BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF / MIN_NB_MBUF) > + == 0); > + > +/* The smallest possible NB_MBUF that we're going to try should be a multiple > + * of MP_CACHE_SZ. This is advised by DPDK documentation. */ > +BUILD_ASSERT_DECL((MAX_NB_MBUF / ROUND_DOWN_POW2(MAX_NB_MBUF / MIN_NB_MBUF)) > + % MP_CACHE_SZ == 0); > + > /* > * DPDK XSTATS Counter names definition > */ > @@ -296,13 +307,14 @@ static struct ovs_list dpdk_list > OVS_GUARDED_BY(dpdk_mutex) > static struct ovs_mutex dpdk_mp_mutex OVS_ACQ_AFTER(dpdk_mutex) > = OVS_MUTEX_INITIALIZER; > > -/* Contains all 'struct dpdk_mp's. */ > -static struct ovs_list dpdk_mp_free_list OVS_GUARDED_BY(dpdk_mp_mutex) > - = OVS_LIST_INITIALIZER(&dpdk_mp_free_list); > +static struct ovs_list dpdk_mp_list OVS_GUARDED_BY(dpdk_mp_mutex) > + = OVS_LIST_INITIALIZER(&dpdk_mp_list); > > -/* Wrapper for a mempool released but not yet freed. 
*/ > struct dpdk_mp { > struct rte_mempool *mp; > + int mtu; > + int socket_id; > + int refcount; > struct ovs_list list_node OVS_GUARDED_BY(dpdk_mp_mutex); > }; > > @@ -381,7 +393,7 @@ struct netdev_dpdk { > > PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1, > struct ovs_mutex mutex OVS_ACQ_AFTER(dpdk_mutex); > - struct rte_mempool *mp; > + struct dpdk_mp *dpdk_mp; > > /* virtio identifier for vhost devices */ > ovsrcu_index vid; > @@ -526,88 +538,70 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp OVS_UNUSED, > dp_packet_init_dpdk((struct dp_packet *) pkt, pkt->buf_len); > } > > -static int > -dpdk_mp_full(const struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex) > -{ > - unsigned ring_count; > - /* This logic is needed because rte_mempool_full() is not guaranteed to > - * be atomic and mbufs could be moved from mempool cache --> mempool ring > - * during the call. However, as no mbufs will be taken from the mempool > - * at this time, we can work around it by also checking the ring entries > - * separately and ensuring that they have not changed. > - */ > - ring_count = rte_mempool_ops_get_count(mp); > - if (rte_mempool_full(mp) && rte_mempool_ops_get_count(mp) == ring_count) > { > - return 1; > - } > - > - return 0; > -} > - > -/* Free unused mempools. */ > -static void > -dpdk_mp_sweep(void) > +/* Calculating the required number of mbufs differs depending on the > + * mempool model being used. Check if per port mempools are in use before > + * calculating. 
> + */ > +static uint32_t > +dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp) > { > - struct dpdk_mp *dmp, *next; > + uint32_t n_mbufs; > > - ovs_mutex_lock(&dpdk_mp_mutex); > - LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) { > - if (dpdk_mp_full(dmp->mp)) { > - VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name); > - ovs_list_remove(&dmp->list_node); > - rte_mempool_free(dmp->mp); > - rte_free(dmp); > + if (!per_port_mp) { > + /* Shared mempools are being used. > + * XXX: this is a really rough method of provisioning memory. > + * It's impossible to determine what the exact memory requirements > are > + * when the number of ports and rxqs that utilize a particular > mempool > + * can change dynamically at runtime. For now, use this rough > + * heurisitic. > + */ > + if (mtu >= ETHER_MTU) { > + n_mbufs = MAX_NB_MBUF; Is this going to be configurable in the future? > + } else { > + n_mbufs = MIN_NB_MBUF; > } > + } else { > + /* Per port mempools are being used. > + * XXX: rough estimation of number of mbufs required for this port: > + * <packets required to fill the device rxqs> > + * + <packets that could be stuck on other ports txqs> > + * + <packets in the pmd threads> > + * + <additional memory for corner cases> > + */ > + n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size > + + dev->requested_n_txq * dev->requested_txq_size > + + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * > NETDEV_MAX_BURST > + + MIN_NB_MBUF; > } > - ovs_mutex_unlock(&dpdk_mp_mutex); > -} > - > -/* Ensure a mempool will not be freed. 
*/ > -static void > -dpdk_mp_do_not_free(struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex) > -{ > - struct dpdk_mp *dmp, *next; > > - LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) { > - if (dmp->mp == mp) { > - VLOG_DBG("Removing mempool \"%s\" from free list", > dmp->mp->name); > - ovs_list_remove(&dmp->list_node); > - rte_free(dmp); > - break; > - } > - } > + return n_mbufs; > } > > -/* Returns a valid pointer when either of the following is true: > - * - a new mempool was just created; > - * - a matching mempool already exists. */ > -static struct rte_mempool * > -dpdk_mp_create(struct netdev_dpdk *dev, int mtu) > +static struct dpdk_mp * > +dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp) > { > char mp_name[RTE_MEMPOOL_NAMESIZE]; > const char *netdev_name = netdev_get_name(&dev->up); > int socket_id = dev->requested_socket_id; > uint32_t n_mbufs; > uint32_t hash = hash_string(netdev_name, 0); > - struct rte_mempool *mp = NULL; > - > - /* > - * XXX: rough estimation of number of mbufs required for this port: > - * <packets required to fill the device rxqs> > - * + <packets that could be stuck on other ports txqs> > - * + <packets in the pmd threads> > - * + <additional memory for corner cases> > - */ > - n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size > - + dev->requested_n_txq * dev->requested_txq_size > - + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST > - + MIN_NB_MBUF; > + struct dpdk_mp *dmp = NULL; > + int ret; > + > + dmp = dpdk_rte_mzalloc(sizeof *dmp); > + if (!dmp) { > + return NULL; > + } > + dmp->socket_id = socket_id; > + dmp->mtu = mtu; > + dmp->refcount = 1; > + > + n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp); > > - ovs_mutex_lock(&dpdk_mp_mutex); > do { > /* Full DPDK memory pool name must be unique and cannot be > * longer than RTE_MEMPOOL_NAMESIZE. 
*/ > - int ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE, > + ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE, > "ovs%08x%02d%05d%07u", > hash, socket_id, mtu, n_mbufs); > if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) { > @@ -623,102 +617,179 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu) > netdev_name, n_mbufs, socket_id, > dev->requested_n_rxq, dev->requested_n_txq); > > - mp = rte_pktmbuf_pool_create(mp_name, n_mbufs, MP_CACHE_SZ, > - sizeof (struct dp_packet) - sizeof (struct rte_mbuf), > - MBUF_SIZE(mtu) - sizeof(struct dp_packet), socket_id); > + dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs, > + MP_CACHE_SZ, > + sizeof (struct dp_packet) > + - sizeof (struct rte_mbuf), > + MBUF_SIZE(mtu) > + - sizeof(struct dp_packet), > + socket_id); > > - if (mp) { > + if (dmp->mp) { > VLOG_DBG("Allocated \"%s\" mempool with %u mbufs", > mp_name, n_mbufs); > /* rte_pktmbuf_pool_create has done some initialization of the > - * rte_mbuf part of each dp_packet. Some OvS specific fields > - * of the packet still need to be initialized by > - * ovs_rte_pktmbuf_init. */ > - rte_mempool_obj_iter(mp, ovs_rte_pktmbuf_init, NULL); > + * rte_mbuf part of each dp_packet, while ovs_rte_pktmbuf_init > + * initializes some OVS specific fields of dp_packet. > + */ > + rte_mempool_obj_iter(dmp->mp, ovs_rte_pktmbuf_init, NULL); > + return dmp; > } else if (rte_errno == EEXIST) { > /* A mempool with the same name already exists. We just > * retrieve its pointer to be returned to the caller. */ > - mp = rte_mempool_lookup(mp_name); > + dmp->mp = rte_mempool_lookup(mp_name); > /* As the mempool create returned EEXIST we can expect the > * lookup has returned a valid pointer. If for some reason > * that's not the case we keep track of it. */ > VLOG_DBG("A mempool with name \"%s\" already exists at %p.", > - mp_name, mp); > - /* Ensure this reused mempool will not be freed. 
*/ > - dpdk_mp_do_not_free(mp); > + mp_name, dmp->mp); > } else { > VLOG_ERR("Failed mempool \"%s\" create request of %u mbufs", > mp_name, n_mbufs); > } Just a thought, but now with shared memory where n_mbufs are initially set to 4096*64, one can see this error log printed a few times before the memory gets allocated. We could potentially demote this to a WARN and write a more friendly message and only print the error below, before returning to the caller (at that point we surely couldn't allocate the mempool). > - } while (!mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= MIN_NB_MBUF); > + } while (!dmp->mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= > MIN_NB_MBUF); > > - ovs_mutex_unlock(&dpdk_mp_mutex); > - return mp; > + rte_free(dmp); > + return NULL; > } > > -/* Release an existing mempool. */ > +static int > +dpdk_mp_full(const struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex) > +{ > + unsigned ring_count; > + /* This logic is needed because rte_mempool_full() is not guaranteed to > + * be atomic and mbufs could be moved from mempool cache --> mempool ring > + * during the call. However, as no mbufs will be taken from the mempool > + * at this time, we can work around it by also checking the ring entries > + * separately and ensuring that they have not changed. > + */ > + ring_count = rte_mempool_ops_get_count(mp); > + if (rte_mempool_full(mp) && rte_mempool_ops_get_count(mp) == ring_count) > { > + return 1; > + } > + > + return 0; > +} > + > +/* Free unused mempools. 
*/ > static void > -dpdk_mp_release(struct rte_mempool *mp) > +dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex) > { > - if (!mp) { > - return; > + struct dpdk_mp *dmp, *next; > + > + LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_list) { > + if (!dmp->refcount && dpdk_mp_full(dmp->mp)) { > + VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name); > + ovs_list_remove(&dmp->list_node); > + rte_mempool_free(dmp->mp); > + rte_free(dmp); > + } > } > +} > + > +static struct dpdk_mp * > +dpdk_mp_get(struct netdev_dpdk *dev, int mtu, bool per_port_mp) > +{ > + struct dpdk_mp *dmp; > + bool reuse = false; > > ovs_mutex_lock(&dpdk_mp_mutex); > - if (dpdk_mp_full(mp)) { > - VLOG_DBG("Freeing mempool \"%s\"", mp->name); > - rte_mempool_free(mp); > - } else { > - struct dpdk_mp *dmp; > + /* Check if shared mempools are being used, if so check existing mempools > + * to see if reuse is possible. */ > + if (!per_port_mp) { > + LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) { > + if (dmp->socket_id == dev->requested_socket_id > + && dmp->mtu == mtu) { > + VLOG_DBG("Reusing mempool \"%s\"", dmp->mp->name); > + dmp->refcount++; > + reuse = true; > + break; > + } > + } > + } > + /* Sweep mempools after reuse or before create. */ > + dpdk_mp_sweep(); *Should dpdk_mp_sweep() be called when destroying an interface as well, in common_destruct()? While testing, I noticed that if I add a port and then delete that same port, its mempool stays allocated until another port is added, since dpdk_mp_sweep() is only called on the next call to dpdk_mp_get(). This could create problems in the per-port model, where one deletes a number of ports and then can't add any others because there are (hanging) mempools still holding memory. 
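To illustrate, something along these lines is what I have in mind (an untested sketch on my side, not compiled against the patch; since dpdk_mp_sweep() is annotated OVS_REQUIRES(dpdk_mp_mutex), the lock would have to be taken around the call):

```c
/* In common_destruct(), after dropping our reference: */
dpdk_mp_put(dev->dpdk_mp);

/* Sweep now so a deleted port's mempool can be reclaimed immediately,
 * instead of lingering until the next dpdk_mp_get().  If the pool
 * still has in-flight mbufs at this point, dpdk_mp_full() will simply
 * skip it and a later sweep will pick it up. */
ovs_mutex_lock(&dpdk_mp_mutex);
dpdk_mp_sweep();
ovs_mutex_unlock(&dpdk_mp_mutex);
```

The lock order should be fine there, as dpdk_mp_mutex is already declared OVS_ACQ_AFTER(dpdk_mutex).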
> > - dmp = dpdk_rte_mzalloc(sizeof *dmp); > + if (!reuse) { > + dmp = dpdk_mp_create(dev, mtu, per_port_mp); > if (dmp) { > - dmp->mp = mp; > - ovs_list_push_back(&dpdk_mp_free_list, &dmp->list_node); > + ovs_list_push_back(&dpdk_mp_list, &dmp->list_node); > } > } > + > ovs_mutex_unlock(&dpdk_mp_mutex); > + > + return dmp; > } > > -/* Tries to allocate a new mempool - or re-use an existing one where > - * appropriate - on requested_socket_id with a size determined by > - * requested_mtu and requested Rx/Tx queues. > - * On success - or when re-using an existing mempool - the new configuration > - * will be applied. > +/* Decrement reference to a mempool. */ > +static void > +dpdk_mp_put(struct dpdk_mp *dmp) > +{ > + if (!dmp) { > + return; > + } > + > + ovs_mutex_lock(&dpdk_mp_mutex); > + ovs_assert(dmp->refcount); > + dmp->refcount--; > + ovs_mutex_unlock(&dpdk_mp_mutex); > +} > + > +/* Depending on the memory model being used this function tries > + * identify and reuse an existing mempool or tries to allocate new > + * mempool on requested_socket_id with mbuf size corresponding to > + * requested_mtu. On success new configuration will be applied. > * On error, device will be left unchanged. */ > static int > netdev_dpdk_mempool_configure(struct netdev_dpdk *dev) > OVS_REQUIRES(dev->mutex) > { > uint32_t buf_size = dpdk_buf_size(dev->requested_mtu); > - struct rte_mempool *mp; > + struct dpdk_mp *mp; > int ret = 0; > + bool per_port_mp = dpdk_per_port_mempool_enabled(); > > - dpdk_mp_sweep(); > + /* With shared mempools we do not need to configure a mempool if the MTU > + * and socket ID have not changed, the previous configuration is still > + * valid so return 0 */ > + if (dev->mtu == dev->requested_mtu > + && dev->socket_id == dev->requested_socket_id > + && (!per_port_mp)) { I also find that moving the `(!per_port_mp)` condition to the beginning improves readability here. 
It even goes hand in hand with your comment just above - "With shared mempools we do not ...". > + return ret; > + } > > - mp = dpdk_mp_create(dev, FRAME_LEN_TO_MTU(buf_size)); > + mp = dpdk_mp_get(dev, FRAME_LEN_TO_MTU(buf_size), per_port_mp); > if (!mp) { > VLOG_ERR("Failed to create memory pool for netdev " > "%s, with MTU %d on socket %d: %s\n", > dev->up.name, dev->requested_mtu, dev->requested_socket_id, > - rte_strerror(rte_errno)); > - ret = rte_errno; > + rte_strerror(rte_errno)); Missing indentation here. > + return rte_errno; > } else { > - /* If a new MTU was requested and its rounded value equals the one > - * that is currently used, then the existing mempool is returned. */ > - if (dev->mp != mp) { > - /* A new mempool was created, release the previous one. */ > - dpdk_mp_release(dev->mp); > - } else { > - ret = EEXIST; > + /* Check for any pre-existing dpdk_mp for the device */ > + if (dev->dpdk_mp != NULL) { > + /* If a new MTU was requested and its rounded value equals the > + * one that is currently used, then the existing mempool is > + * returned. */ > + if (dev->dpdk_mp->mp != mp->mp) { > + /* A new mempool was created, release the previous one. */ > + dpdk_mp_put(dev->dpdk_mp); > + } else { > + /* If the mempool already exists in the current dpdk_mp then > + * we need to ensure dpdk_mp that was just created is freed > in > + * the next sweep as it will not be used. */ The code path around the `else` block here will only be called when `!per_port_mp`. This is because `dpdk_mp_get()` will only return an already existing mempool when using the shared model. Otherwise a new one is always returned, and thus the `if (dev->dpdk_mp->mp != mp->mp)` will be true. Am I reading this right? If so then refactoring this a bit to differentiate on `per_port_mp` might help readability - this is in line with Kevin's comment about making this piece a bit more readable. 
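For what it's worth, the shape I'm picturing is roughly the below (again, a rough, untested sketch just to show the idea of branching on `per_port_mp` up front):

```c
/* Rough sketch only -- not compiled against the patch. */
if (!per_port_mp && dev->dpdk_mp && dev->dpdk_mp->mp == mp->mp) {
    /* Shared model only: dpdk_mp_get() handed back the mempool we
     * already hold, so drop the reference it took on our behalf. */
    dpdk_mp_put(mp);
    ret = EEXIST;
} else {
    /* A new mempool was created (always the case with per-port
     * mempools); release the previous one, if any. */
    dpdk_mp_put(dev->dpdk_mp);
}
dev->dpdk_mp = mp;
```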
On the same note, the comment above misled me into thinking that the allocated `mp` is being freed, which would result in an error since the same `mp` is then assigned below. Instead, what it actually does is decrement the refcount in struct dpdk_mp, which might then be freed on the next dpdk_mp_sweep() if `refcount` reaches 0. But that won't happen on the shared model unless no port is using the mempool. Thanks, Tiago. > + dpdk_mp_put(mp); > + ret = EEXIST; > + } > } > - dev->mp = mp; > + dev->dpdk_mp = mp; > dev->mtu = dev->requested_mtu; > dev->socket_id = dev->requested_socket_id; > dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu); > } > > - return ret; > + return 0; > } > > static void > @@ -835,7 +906,8 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int > n_rxq, int n_txq) > > for (i = 0; i < n_rxq; i++) { > diag = rte_eth_rx_queue_setup(dev->port_id, i, dev->rxq_size, > - dev->socket_id, NULL, dev->mp); > + dev->socket_id, NULL, > + dev->dpdk_mp->mp); > if (diag) { > VLOG_INFO("Interface %s unable to setup rxq(%d): %s", > dev->up.name, i, rte_strerror(-diag)); > @@ -922,7 +994,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev) > memcpy(dev->hwaddr.ea, eth_addr.addr_bytes, ETH_ADDR_LEN); > rte_eth_link_get_nowait(dev->port_id, &dev->link); > > - mbp_priv = rte_mempool_get_priv(dev->mp); > + mbp_priv = rte_mempool_get_priv(dev->dpdk_mp->mp); > dev->buf_size = mbp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM; > > /* Get the Flow control configuration for DPDK-ETH */ > @@ -1176,7 +1248,7 @@ common_destruct(struct netdev_dpdk *dev) > OVS_EXCLUDED(dev->mutex) > { > rte_free(dev->tx_q); > - dpdk_mp_release(dev->mp); > + dpdk_mp_put(dev->dpdk_mp); > > ovs_list_remove(&dev->list_node); > free(ovsrcu_get_protected(struct ingress_policer *, > @@ -1928,7 +2000,7 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq, > return EAGAIN; > } > > - nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->mp, > + nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->dpdk_mp->mp, > (struct rte_mbuf 
**) batch->packets, > NETDEV_MAX_BURST); > if (!nb_rx) { > @@ -2167,7 +2239,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct > dp_packet_batch *batch) > continue; > } > > - pkts[txcnt] = rte_pktmbuf_alloc(dev->mp); > + pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp); > if (OVS_UNLIKELY(!pkts[txcnt])) { > dropped += cnt - i; > break; > @@ -3047,7 +3119,7 @@ netdev_dpdk_get_mempool_info(struct unixctl_conn *conn, > ovs_mutex_lock(&dev->mutex); > ovs_mutex_lock(&dpdk_mp_mutex); > > - rte_mempool_dump(stream, dev->mp); > + rte_mempool_dump(stream, dev->dpdk_mp->mp); > > ovs_mutex_unlock(&dpdk_mp_mutex); > ovs_mutex_unlock(&dev->mutex); > @@ -3742,7 +3814,7 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev) > > err = netdev_dpdk_mempool_configure(dev); > if (!err) { > - /* A new mempool was created. */ > + /* A new mempool was created or re-used. */ > netdev_change_seq_changed(&dev->up); > } else if (err != EEXIST){ > return err; > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml > index 7ab90d5..95520af 100644 > --- a/vswitchd/vswitch.xml > +++ b/vswitchd/vswitch.xml > @@ -359,6 +359,22 @@ > </p> > </column> > > + <column name="other_config" key="per-port-mp-enabled" > + type='{"type": "boolean"}'> > + <p> > + By default OVS DPDK uses a shared mempool memory model wherein > + devices that have the same MTU and socket values can share the same > + mempool. Setting this value to <code>true</code> changes this > + behaviour. Per port mempools allow DPDK devices to use a private > + mempool per device. This can provide greater transapraency as > + regards memory usage but potentially at the cost of greater memory > + requirements. > + </p> > + <p> > + Changing this value requires restarting the daemon. 
> + </p> > + </column> > + > <column name="other_config" key="tx-flush-interval" > type='{"type": "integer", > "minInteger": 0, "maxInteger": 1000000}'> > _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev