> On 05/18/2018 03:30 PM, Ian Stokes wrote: > > This commit re-introduces the concept of shared mempools as the > > default memory model for DPDK devices. Per port mempools are still > > available but must be enabled explicitly by a user. > > > > OVS previously used a shared mempool model for ports with the same MTU > > and socket configuration. This was replaced by a per port mempool > > model to address issues flagged by users such as: > > > > https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/0425 > > 60.html > > > > However the per port model potentially requires an increase in memory > > resource requirements to support the same number of ports and > > configuration as the shared port model. > > > > This is considered a blocking factor for current deployments of OVS > > when upgrading to future OVS releases as a user may have to > > redimension memory for the same deployment configuration. This may not > > be possible for users. > > > > Hi Ian, thanks for addressing this. Few comments below... > > > This commit resolves the issue by re-introducing shared mempools as > > the default memory behaviour in OVS DPDK but also refactors the memory > > configuration code to allow for per port mempools. > > > This patch adds a new global config option, per-port-mp-enabled, that > > controls the enablement of per port mempools for DPDK devices. > > > > ovs-vsctl set Open_vSwitch . other_config:per-port-mp-enabled=true > > > > It doesn't need enabled in the name. If say 'per-port-mempool=true/false', > then it's already obvious it's enabled. > Actually, I wonder if 'per-port-memory' would be better as it doesn't > require a user to know what a DPDK mempool is. >
Agree, will change in the v2. > > This value defaults to false; to enable per port mempool support, this > > field should be set to true when setting other global parameters on > > init (such as "dpdk-socket-mem", for example). Changing the value at > > runtime is not supported, and requires restarting the vswitch daemon. > > > > The mempool sweep functionality is also replaced with the sweep > > functionality from OVS 2.9 found in commits > > > > c77f692 (netdev-dpdk: Free mempool only when no in-use mbufs.) > > a7fb0a4 (netdev-dpdk: Add mempool reuse/free debug.) > > > > As this patch is RFC there are a number of TO-DOs including adding a > > memory calculation section to the documentation for both models. This > > is expected to be completed in the v1 after RFC. > > > > Signed-off-by: Ian Stokes <ian.sto...@intel.com> > > --- > > Documentation/automake.mk | 1 + > > Documentation/topics/dpdk/index.rst | 1 + > > Documentation/topics/dpdk/memory.rst | 67 +++++++ > > NEWS | 1 + > > lib/dpdk-stub.c | 6 + > > lib/dpdk.c | 13 ++ > > lib/dpdk.h | 1 + > > lib/netdev-dpdk.c | 326 +++++++++++++++++++++------- > ------- > > vswitchd/vswitch.xml | 16 ++ > > 9 files changed, 305 insertions(+), 127 deletions(-) create mode > > 100644 Documentation/topics/dpdk/memory.rst > > > > diff --git a/Documentation/automake.mk b/Documentation/automake.mk > > index 683ca14..14c2189 100644 > > --- a/Documentation/automake.mk > > +++ b/Documentation/automake.mk > > @@ -36,6 +36,7 @@ DOC_SOURCE = \ > > Documentation/topics/dpdk/index.rst \ > > Documentation/topics/dpdk/bridge.rst \ > > Documentation/topics/dpdk/jumbo-frames.rst \ > > + Documentation/topics/dpdk/memory.rst \ > > Documentation/topics/dpdk/pdump.rst \ > > Documentation/topics/dpdk/phy.rst \ > > Documentation/topics/dpdk/pmd.rst \ > > diff --git a/Documentation/topics/dpdk/index.rst > > b/Documentation/topics/dpdk/index.rst > > index 181f61a..cf24a7b 100644 > > --- a/Documentation/topics/dpdk/index.rst > > +++ 
b/Documentation/topics/dpdk/index.rst > > @@ -40,3 +40,4 @@ The DPDK Datapath > > /topics/dpdk/qos > > /topics/dpdk/pdump > > /topics/dpdk/jumbo-frames > > + /topics/dpdk/memory > > diff --git a/Documentation/topics/dpdk/memory.rst > > b/Documentation/topics/dpdk/memory.rst > > new file mode 100644 > > index 0000000..1198067 > > --- /dev/null > > +++ b/Documentation/topics/dpdk/memory.rst > > @@ -0,0 +1,67 @@ > > +.. > > + Copyright 2018, Intel, Inc. > > + > > + Licensed under the Apache License, Version 2.0 (the "License"); > you may > > + not use this file except in compliance with the License. You may > obtain > > + a copy of the License at > > + > > + http://www.apache.org/licenses/LICENSE-2.0 > > + > > + Unless required by applicable law or agreed to in writing, > software > > + distributed under the License is distributed on an "AS IS" BASIS, > WITHOUT > > + WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. > See the > > + License for the specific language governing permissions and > limitations > > + under the License. > > + > > + Convention for heading levels in Open vSwitch documentation: > > + > > + ======= Heading 0 (reserved for the title in a document) > > + ------- Heading 1 > > + ~~~~~~~ Heading 2 > > + +++++++ Heading 3 > > + ''''''' Heading 4 > > + > > + Avoid deeper levels because they do not render well. > > + > > +========================= > > +DPDK Device Memory Models > > +========================= > > + > > +DPDK device memory can be allocated in one of two ways in OVS DPDK, > > +shared mempools or per port mempools. The specifics of both are > > +detailed below. > > + > > +Shared Mempools > > +--------------- > > + > > +By Default OVS DPDK uses a shared mempool model. This means that > > +multiple ports can share the same mempool. For example when a port is > > +added it will have a given MTU and socket ID associated with it. 
If a > > +mempool has been created previously for an existing port that has the > > +same MTU and socket ID, that mempool is used for both ports. If there > > +is no existing mempool supporting these parameters then a new mempool > is created. > > + > > + > > +Per Port Mempools > > +----------------- > > + > > +In the per port mempool model, mempools are created per device and > > +are not shared. The benefit of this is a more transparent memory > > +model where mempools will not be exhausted by other DPDK devices. > > +However this comes at a potential increase in cost for memory > > +dimensioning for a given deployment. Users should be aware of the > > +memory requirements for their deployment before using this model and > allocate the required hugepage memory. > > + > > +Per port mempool support may be enabled via a global config value, > > +```per-port-mp-enabled```. Setting this to true enables the per port > > +mempool model for all DPDK devices in OVS:: > > + > > + $ ovs-vsctl set Open_vSwitch . > > + other_config:per-port-mp-enabled=true > > + > > +.. important:: > > + > > + Changing this value requires restarting the daemon. > > + > > It's a good general guide, but the daemon does not need to be restarted > once it's set before dpdk-init=true. Good point. On a separate note, I think this would also apply to the vhost iommu option? (It also requests a daemon restart in documentation/commit messages). Will look into it. > > > +.. todo:: > > + > > + Add memory calculation section for both mempool models. > > diff --git a/NEWS b/NEWS > > index ec548b0..c9991cf 100644 > > --- a/NEWS > > +++ b/NEWS > > @@ -30,6 +30,7 @@ Post-v2.9.0 > > * New 'check-dpdk' Makefile target to run a new system testsuite. > > See Testing topic for the details. > > * Add LSC interrupt support for DPDK physical devices. > > + * Support both shared and per port mempools for DPDK devices.
> > - Userspace datapath: > > * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a > single PMD > > * Detailed PMD performance metrics available with new command > > diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c index 041cd0c..d2a9718 > > 100644 > > --- a/lib/dpdk-stub.c > > +++ b/lib/dpdk-stub.c > > @@ -55,6 +55,12 @@ dpdk_vhost_iommu_enabled(void) > > return false; > > } > > > > +bool > > +dpdk_per_port_mempool_enabled(void) > > +{ > > + return false; > > +} > > + > > void > > print_dpdk_version(void) > > { > > diff --git a/lib/dpdk.c b/lib/dpdk.c > > index 00dd974..e251970 100644 > > --- a/lib/dpdk.c > > +++ b/lib/dpdk.c > > @@ -43,6 +43,8 @@ static FILE *log_stream = NULL; /* Stream for > DPDK log redirection */ > > > > static char *vhost_sock_dir = NULL; /* Location of vhost-user sockets > */ > > static bool vhost_iommu_enabled = false; /* Status of vHost IOMMU > > support */ > > +static bool per_port_mp_enabled = false; /* Status of per port mempool > > + * support */ > > > > static int > > process_vhost_flags(char *flag, const char *default_val, int size, @@ > > -354,6 +356,11 @@ dpdk_init__(const struct smap *ovs_other_config) > > VLOG_INFO("IOMMU support for vhost-user-client %s.", > > vhost_iommu_enabled ? "enabled" : "disabled"); > > > > + per_port_mp_enabled = smap_get_bool(ovs_other_config, > > + "per-port-mp-enabled", false); > > + VLOG_INFO("Per port mempool for DPDK devices %s.", > > + per_port_mp_enabled ? 
"enabled" : "disabled"); > > + > > argv = grow_argv(&argv, 0, 1); > > argc = 1; > > argv[0] = xstrdup(ovs_get_program_name()); @@ -498,6 +505,12 @@ > > dpdk_vhost_iommu_enabled(void) > > return vhost_iommu_enabled; > > } > > > > +bool > > +dpdk_per_port_mempool_enabled(void) > > +{ > > + return per_port_mp_enabled; > > +} > > + > > void > > dpdk_set_lcore_id(unsigned cpu) > > { > > diff --git a/lib/dpdk.h b/lib/dpdk.h > > index b041535..243385c 100644 > > --- a/lib/dpdk.h > > +++ b/lib/dpdk.h > > @@ -38,6 +38,7 @@ void dpdk_init(const struct smap *ovs_other_config); > > void dpdk_set_lcore_id(unsigned cpu); const char > > *dpdk_get_vhost_sock_dir(void); bool dpdk_vhost_iommu_enabled(void); > > +bool dpdk_per_port_mempool_enabled(void); > > void print_dpdk_version(void); > > > > #endif /* dpdk.h */ > > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index > > 87152a7..cda2ace 100644 > > --- a/lib/netdev-dpdk.c > > +++ b/lib/netdev-dpdk.c > > @@ -91,13 +91,24 @@ static struct vlog_rate_limit rl = > VLOG_RATE_LIMIT_INIT(5, 20); > > #define NETDEV_DPDK_MBUF_ALIGN 1024 > > #define NETDEV_DPDK_MAX_PKT_LEN 9728 > > > > -/* Min number of packets in the mempool. OVS tries to allocate a > > mempool with > > - * roughly estimated number of mbufs: if this fails (because the > > system doesn't > > - * have enough hugepages) we keep halving the number until the > > allocation > > - * succeeds or we reach MIN_NB_MBUF */ > > +/* Max and min number of packets in the mempool. 
OVS tries to > > +allocate a > > + * mempool with MAX_NB_MBUF: if this fails (because the system > > +doesn't have > > + * enough hugepages) we keep halving the number until the allocation > > +succeeds > > + * or we reach MIN_NB_MBUF */ > > + > > +#define MAX_NB_MBUF (4096 * 64) > > #define MIN_NB_MBUF (4096 * 4) > > #define MP_CACHE_SZ RTE_MEMPOOL_CACHE_MAX_SIZE > > > > +/* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */ > > +BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF / > MIN_NB_MBUF) > > + == 0); > > + > > +/* The smallest possible NB_MBUF that we're going to try should be a > > +multiple > > + * of MP_CACHE_SZ. This is advised by DPDK documentation. */ > > +BUILD_ASSERT_DECL((MAX_NB_MBUF / ROUND_DOWN_POW2(MAX_NB_MBUF / > MIN_NB_MBUF)) > > + % MP_CACHE_SZ == 0); > > + > > /* > > * DPDK XSTATS Counter names definition > > */ > > @@ -296,13 +307,14 @@ static struct ovs_list dpdk_list > > OVS_GUARDED_BY(dpdk_mutex) static struct ovs_mutex dpdk_mp_mutex > OVS_ACQ_AFTER(dpdk_mutex) > > = OVS_MUTEX_INITIALIZER; > > > > -/* Contains all 'struct dpdk_mp's. */ > > Not strictly needed, but comment is still valid This was lost in rebasing I think, will return it in v2. > > > -static struct ovs_list dpdk_mp_free_list OVS_GUARDED_BY(dpdk_mp_mutex) > > - = OVS_LIST_INITIALIZER(&dpdk_mp_free_list); > > +static struct ovs_list dpdk_mp_list OVS_GUARDED_BY(dpdk_mp_mutex) > > + = OVS_LIST_INITIALIZER(&dpdk_mp_list); > > > > -/* Wrapper for a mempool released but not yet freed. 
*/ struct > > dpdk_mp { > > struct rte_mempool *mp; > > + int mtu; > > + int socket_id; > > + int refcount; > > struct ovs_list list_node OVS_GUARDED_BY(dpdk_mp_mutex); > > }; > > > > @@ -381,7 +393,7 @@ struct netdev_dpdk { > > > > PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1, > > struct ovs_mutex mutex OVS_ACQ_AFTER(dpdk_mutex); > > - struct rte_mempool *mp; > > + struct dpdk_mp *dpdk_mp; > > > > /* virtio identifier for vhost devices */ > > ovsrcu_index vid; > > @@ -526,88 +538,70 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp > OVS_UNUSED, > > dp_packet_init_dpdk((struct dp_packet *) pkt, pkt->buf_len); } > > > > -static int > > -dpdk_mp_full(const struct rte_mempool *mp) > > OVS_REQUIRES(dpdk_mp_mutex) -{ > > - unsigned ring_count; > > - /* This logic is needed because rte_mempool_full() is not > guaranteed to > > - * be atomic and mbufs could be moved from mempool cache --> > mempool ring > > - * during the call. However, as no mbufs will be taken from the > mempool > > - * at this time, we can work around it by also checking the ring > entries > > - * separately and ensuring that they have not changed. > > - */ > > - ring_count = rte_mempool_ops_get_count(mp); > > - if (rte_mempool_full(mp) && rte_mempool_ops_get_count(mp) == > ring_count) { > > - return 1; > > - } > > - > > - return 0; > > -} > > - > > -/* Free unused mempools. */ > > -static void > > -dpdk_mp_sweep(void) > > +/* Calculating the required number of mbufs differs depending on the > > + * mempool model being used. Check if per port mempools are in use > > +before > > + * calculating. 
> > + */ > > +static uint32_t > > +dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool > > +per_port_mp) > > { > > - struct dpdk_mp *dmp, *next; > > + uint32_t n_mbufs; > > > > - ovs_mutex_lock(&dpdk_mp_mutex); > > - LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) { > > - if (dpdk_mp_full(dmp->mp)) { > > - VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name); > > - ovs_list_remove(&dmp->list_node); > > - rte_mempool_free(dmp->mp); > > - rte_free(dmp); > > + if (!per_port_mp) { > > + /* Shared mempools are being used. > > + * XXX: this is a really rough method of provisioning memory. > > XXX causes warnings in checkpatch (found that out on rte_mempool_full > patch) > Yes, I spotted this myself before sending the patch :). The guideline agreed as regards 'XXX' in patches is it's allowed so as to flag future work (and ensure it's not just something the author missed). There was a fair bit of discussion as regards the memory calculations on the ML previously so I expect it will change in the future, possibly a reduction or removal of the tx queues specific calculations for the per port implementation. For the v1 here I kept the comment just to highlight it could be reworked in the future. > > + * It's impossible to determine what the exact memory > requirements are > > + * when the number of ports and rxqs that utilize a particular > mempool > > + * can change dynamically at runtime. For now, use this rough > > + * heurisitic. > > + */ > > + if (mtu >= ETHER_MTU) { > > + n_mbufs = MAX_NB_MBUF; > > + } else { > > + n_mbufs = MIN_NB_MBUF; > > } > > + } else { > > + /* Per port mempools are being used. 
> > + * XXX: rough estimation of number of mbufs required for this > port: > > + * <packets required to fill the device rxqs> > > + * + <packets that could be stuck on other ports txqs> > > + * + <packets in the pmd threads> > > + * + <additional memory for corner cases> > > + */ > > + n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size > > + + dev->requested_n_txq * dev->requested_txq_size > > + + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * > NETDEV_MAX_BURST > > + + MIN_NB_MBUF; > > } > > - ovs_mutex_unlock(&dpdk_mp_mutex); > > -} > > - > > -/* Ensure a mempool will not be freed. */ -static void > > -dpdk_mp_do_not_free(struct rte_mempool *mp) > > OVS_REQUIRES(dpdk_mp_mutex) -{ > > - struct dpdk_mp *dmp, *next; > > > > - LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) { > > - if (dmp->mp == mp) { > > - VLOG_DBG("Removing mempool \"%s\" from free list", dmp->mp- > >name); > > - ovs_list_remove(&dmp->list_node); > > - rte_free(dmp); > > - break; > > - } > > - } > > + return n_mbufs; > > } > > > > -/* Returns a valid pointer when either of the following is true: > > - * - a new mempool was just created; > > - * - a matching mempool already exists. 
*/ -static struct > > rte_mempool * -dpdk_mp_create(struct netdev_dpdk *dev, int mtu) > > +static struct dpdk_mp * > > +dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp) > > { > > char mp_name[RTE_MEMPOOL_NAMESIZE]; > > const char *netdev_name = netdev_get_name(&dev->up); > > int socket_id = dev->requested_socket_id; > > uint32_t n_mbufs; > > uint32_t hash = hash_string(netdev_name, 0); > > - struct rte_mempool *mp = NULL; > > - > > - /* > > - * XXX: rough estimation of number of mbufs required for this port: > > - * <packets required to fill the device rxqs> > > - * + <packets that could be stuck on other ports txqs> > > - * + <packets in the pmd threads> > > - * + <additional memory for corner cases> > > - */ > > - n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size > > - + dev->requested_n_txq * dev->requested_txq_size > > - + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * > NETDEV_MAX_BURST > > - + MIN_NB_MBUF; > > + struct dpdk_mp *dmp = NULL; > > + int ret; > > + > > + dmp = dpdk_rte_mzalloc(sizeof *dmp); > > + if (!dmp) { > > + return NULL; > > + } > > + dmp->socket_id = socket_id; > > + dmp->mtu = mtu; > > + dmp->refcount = 1; > > + > > + n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp); > > > > - ovs_mutex_lock(&dpdk_mp_mutex); > > do { > > /* Full DPDK memory pool name must be unique and cannot be > > * longer than RTE_MEMPOOL_NAMESIZE. */ > > - int ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE, > > + ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE, > > "ovs%08x%02d%05d%07u", > > hash, socket_id, mtu, n_mbufs); > > you're using the hash of one dev in the name, but that mempool can later > be used by other dev, and even be no longer used by the original dev. > OTOH no one is going to know what the hash means anyway, so it shouldn't > confuse any more on debug. I agree, it's just to ensure a unique name for the mempool in per port case and would not have an impact for the shared model. 
I'll put a comment to explain this and that it's expected in the v2. > > > if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) { @@ -623,102 > > +617,179 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu) > > netdev_name, n_mbufs, socket_id, > > dev->requested_n_rxq, dev->requested_n_txq); > > > > - mp = rte_pktmbuf_pool_create(mp_name, n_mbufs, MP_CACHE_SZ, > > - sizeof (struct dp_packet) - sizeof (struct rte_mbuf), > > - MBUF_SIZE(mtu) - sizeof(struct dp_packet), socket_id); > > + dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs, > > + MP_CACHE_SZ, > > + sizeof (struct dp_packet) > > + - sizeof (struct rte_mbuf), > > + MBUF_SIZE(mtu) > > + - sizeof(struct dp_packet), > > + socket_id); > > > > - if (mp) { > > + if (dmp->mp) { > > VLOG_DBG("Allocated \"%s\" mempool with %u mbufs", > > mp_name, n_mbufs); > > /* rte_pktmbuf_pool_create has done some initialization of > the > > - * rte_mbuf part of each dp_packet. Some OvS specific > fields > > - * of the packet still need to be initialized by > > - * ovs_rte_pktmbuf_init. */ > > - rte_mempool_obj_iter(mp, ovs_rte_pktmbuf_init, NULL); > > + * rte_mbuf part of each dp_packet, while > ovs_rte_pktmbuf_init > > + * initializes some OVS specific fields of dp_packet. > > + */ > > + rte_mempool_obj_iter(dmp->mp, ovs_rte_pktmbuf_init, NULL); > > + return dmp; > > } else if (rte_errno == EEXIST) { > > /* A mempool with the same name already exists. We just > > * retrieve its pointer to be returned to the caller. */ > > - mp = rte_mempool_lookup(mp_name); > > + dmp->mp = rte_mempool_lookup(mp_name); > > /* As the mempool create returned EEXIST we can expect the > > * lookup has returned a valid pointer. If for some reason > > * that's not the case we keep track of it. */ > > VLOG_DBG("A mempool with name \"%s\" already exists at > %p.", > > - mp_name, mp); > > - /* Ensure this reused mempool will not be freed. 
*/ > > - dpdk_mp_do_not_free(mp); > > The mempool you are about to reuse could have a refcount of 0 and about to > be freed by a sweep. So you would need something like the function above > before giving up dpdk_mp_mutex. Maybe you could increase the refcount for > it now and re-adjust later if you need to. > So my thinking here was that we run the sweep in netdev_dpdk_mempool_configure() just prior to calling dpdk_mp_create(). Running the sweep prior should have removed any mempool with refcount 0 in the dpdk_mp_list. The list itself is mutex guarded, if there is an existing mempool then the associated refcount would still be 1 I thought but maybe I missed something. Do you think it's the case it could be modified between the sweep call and reaching this point? > > + mp_name, dmp->mp); > > } else { > > VLOG_ERR("Failed mempool \"%s\" create request of %u > mbufs", > > mp_name, n_mbufs); > > } > > - } while (!mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= > MIN_NB_MBUF); > > + } while (!dmp->mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= > > + MIN_NB_MBUF); > > > > - ovs_mutex_unlock(&dpdk_mp_mutex); > > - return mp; > > + rte_free(dmp); > > + return NULL; > > } > > > > -/* Release an existing mempool. */ > > +static int > > +dpdk_mp_full(const struct rte_mempool *mp) > > +OVS_REQUIRES(dpdk_mp_mutex) { > > + unsigned ring_count; > > + /* This logic is needed because rte_mempool_full() is not > guaranteed to > > + * be atomic and mbufs could be moved from mempool cache --> > mempool ring > > + * during the call. However, as no mbufs will be taken from the > mempool > > + * at this time, we can work around it by also checking the ring > entries > > + * separately and ensuring that they have not changed. > > + */ > > + ring_count = rte_mempool_ops_get_count(mp); > > + if (rte_mempool_full(mp) && rte_mempool_ops_get_count(mp) == > ring_count) { > > + return 1; > > + } > > + > > + return 0; > > you'll need to rebase because of the changes here. 
Won't make any > difference to your operation though > > > +} > > + > > +/* Free unused mempools. */ > > static void > > -dpdk_mp_release(struct rte_mempool *mp) > > +dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex) > > { > > - if (!mp) { > > - return; > > + struct dpdk_mp *dmp, *next; > > + > > + LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_list) { > > + if (!dmp->refcount && dpdk_mp_full(dmp->mp)) { > > + VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name); > > + ovs_list_remove(&dmp->list_node); > > + rte_mempool_free(dmp->mp); > > + rte_free(dmp); > > + } > > } > > +} > > + > > +static struct dpdk_mp * > > +dpdk_mp_get(struct netdev_dpdk *dev, int mtu, bool per_port_mp) { > > + struct dpdk_mp *dmp; > > + bool reuse = false; > > > > ovs_mutex_lock(&dpdk_mp_mutex); > > - if (dpdk_mp_full(mp)) { > > - VLOG_DBG("Freeing mempool \"%s\"", mp->name); > > - rte_mempool_free(mp); > > - } else { > > - struct dpdk_mp *dmp; > > + /* Check if shared mempools are being used, if so check existing > mempools > > + * to see if reuse is possible. */ > > + if (!per_port_mp) { > > + LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) { > > + if (dmp->socket_id == dev->requested_socket_id > > + && dmp->mtu == mtu) { > > + VLOG_DBG("Reusing mempool \"%s\"", dmp->mp->name); > > + dmp->refcount++; > > + reuse = true; > > + break; > > + } > > + } > > + } > > + /* Sweep mempools after reuse or before create. */ > > + dpdk_mp_sweep(); > > > > - dmp = dpdk_rte_mzalloc(sizeof *dmp); > > + if (!reuse) { > > + dmp = dpdk_mp_create(dev, mtu, per_port_mp); > > if (dmp) { > > - dmp->mp = mp; > > - ovs_list_push_back(&dpdk_mp_free_list, &dmp->list_node); > > + ovs_list_push_back(&dpdk_mp_list, &dmp->list_node); > > } > > } > > + > > ovs_mutex_unlock(&dpdk_mp_mutex); > > + > > + return dmp; > > } > > > > -/* Tries to allocate a new mempool - or re-use an existing one where > > - * appropriate - on requested_socket_id with a size determined by > > - * requested_mtu and requested Rx/Tx queues. 
> > - * On success - or when re-using an existing mempool - the new > > configuration > > - * will be applied. > > +/* Decrement reference to a mempool. */ static void > > +dpdk_mp_put(struct dpdk_mp *dmp) { > > + if (!dmp) { > > + return; > > + } > > + > > + ovs_mutex_lock(&dpdk_mp_mutex); > > + ovs_assert(dmp->refcount); > > + dmp->refcount--; > > + ovs_mutex_unlock(&dpdk_mp_mutex); } > > + > > +/* Depending on the memory model being used this function tries > > + * identify and reuse an existing mempool or tries to allocate new > > + * mempool on requested_socket_id with mbuf size corresponding to > > + * requested_mtu. On success new configuration will be applied. > > * On error, device will be left unchanged. */ static int > > netdev_dpdk_mempool_configure(struct netdev_dpdk *dev) > > OVS_REQUIRES(dev->mutex) > > { > > uint32_t buf_size = dpdk_buf_size(dev->requested_mtu); > > - struct rte_mempool *mp; > > + struct dpdk_mp *mp; > > int ret = 0; > > + bool per_port_mp = dpdk_per_port_mempool_enabled(); > > > > - dpdk_mp_sweep(); > > + /* With shared mempools we do not need to configure a mempool if > the MTU > > + * and socket ID have not changed, the previous configuration is > still > > + * valid so return 0 */ > > + if (dev->mtu == dev->requested_mtu > > + && dev->socket_id == dev->requested_socket_id > > + && (!per_port_mp)) { > > can remove brackets and probably more intuitive to check the memory model > first Yes, with your comment later as regards the refactor based on the memory model I can address this.
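To illustrate the reordered shortcut, here is a reduced standalone sketch (the `dev_cfg` struct and the `mempool_config_unchanged()`/`check()` helpers are hypothetical stand-ins for the relevant `netdev_dpdk` fields, not code from the patch):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reduced view of the netdev_dpdk fields the shortcut
 * inspects. */
struct dev_cfg {
    int mtu;
    int requested_mtu;
    int socket_id;
    int requested_socket_id;
};

/* Returns true when the current mempool configuration can be kept.
 * Checking the memory model first keeps the two schemes segregated:
 * the shortcut only ever applies to the shared model. */
static bool
mempool_config_unchanged(const struct dev_cfg *dev, bool per_port_mp)
{
    if (per_port_mp) {
        return false;   /* Per port model never takes the shortcut. */
    }
    return dev->mtu == dev->requested_mtu
           && dev->socket_id == dev->requested_socket_id;
}

/* Small wrapper so the behaviour is easy to exercise. */
static bool
check(int mtu, int rmtu, int sock, int rsock, bool per_port_mp)
{
    struct dev_cfg dev = { mtu, rmtu, sock, rsock };
    return mempool_config_unchanged(&dev, per_port_mp);
}
```

Branching on the model up front also makes it obvious which code path each scheme takes, which is the readability concern raised above.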
> > > + return ret; > > + } > > > > - mp = dpdk_mp_create(dev, FRAME_LEN_TO_MTU(buf_size)); > > + mp = dpdk_mp_get(dev, FRAME_LEN_TO_MTU(buf_size), per_port_mp); > > if (!mp) { > > VLOG_ERR("Failed to create memory pool for netdev " > > "%s, with MTU %d on socket %d: %s\n", > > dev->up.name, dev->requested_mtu, dev- > >requested_socket_id, > > - rte_strerror(rte_errno)); > > - ret = rte_errno; > > + rte_strerror(rte_errno)); > > + return rte_errno; > > } else { > > - /* If a new MTU was requested and its rounded value equals the > one > > - * that is currently used, then the existing mempool is > returned. */ > > - if (dev->mp != mp) { > > - /* A new mempool was created, release the previous one. */ > > - dpdk_mp_release(dev->mp); > > - } else { > > - ret = EEXIST; > > + /* Check for any pre-existing dpdk_mp for the device */ > > + if (dev->dpdk_mp != NULL) { > > Not sure you need this check, dpdk_mp_put will handle NULL > Yes dpdk_mp_put will handle NULL, but the if statement below for comparison of mempools attempts to access the mempools which are part of the dpdk_mp struct. In the case that a device is newly created there is no previous dpdk_mp and accessing dpdk_mp->mp will cause a segfault. > > + /* If a new MTU was requested and its rounded value equals > the > > + * one that is currently used, then the existing mempool is > > + * returned. */ > > + if (dev->dpdk_mp->mp != mp->mp) { > > + /* A new mempool was created, release the previous one. > */ > > + dpdk_mp_put(dev->dpdk_mp); > > + } else { > > + /* If the mempool already exists in the current dpdk_mp > then > > + * we need to ensure dpdk_mp that was just created is > freed in > > + * the next sweep as it will not be used. */ > > + dpdk_mp_put(mp); > > This doesn't look right - it will likely lead to the mempool you are using > being freed also. I think you just want to free the dpdk_mp struct, but > not the actual mempool.
Oops, good catch, you're correct, we only need to free the dpdk_mp in this case but not the mempool itself. At this point the dpdk_mp has already been added to the list; it could make more sense to do this work at the dpdk_mp_create() function instead to avoid messing with the list altogether for the EEXIST case. I'll look into this for the v2. > > > + ret = EEXIST; > > Setting ret, but returning 0 below > > > + } > > } > > - dev->mp = mp; > > + dev->dpdk_mp = mp; > > dev->mtu = dev->requested_mtu; > > dev->socket_id = dev->requested_socket_id; > > dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu); > > } > > > > - return ret; > > + return 0; > > } > > > > I know there were lots of paths already, but adding two different schemes > is making the code here and in create even more confusing. I think there's > a few subtle bugs above, and it's probably down to the complexity. I > wonder could you use the per_port_mp a bit more so at least the paths for > each scheme are more segregated? When reviewing, it's hard to understand > if a code branch would happen with one scheme or the other, or both. +1, I'm looking at refactoring based on this per_port_mp in the v2. The original intention was to keep the code footprint as small as possible but for readability and clarity on code paths I'd agree. Thanks Ian > > Kevin.
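To make the refcounting discussed in this exchange concrete, here is a toy standalone model (the `toy_mp` struct and helpers are illustrative only; the real code uses `struct dpdk_mp` with `rte_mempool` pointers and a mutex-guarded list):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy wrapper mirroring struct dpdk_mp: a pool identity plus refcount.
 * The sweep frees a pool only once its wrapper's refcount is zero. */
struct toy_mp {
    int pool_id;
    int refcount;
};

/* Shared-model reuse: bump the refcount when a matching wrapper exists,
 * otherwise allocate a fresh wrapper with refcount 1. */
static struct toy_mp *
toy_mp_get(struct toy_mp *existing, int pool_id)
{
    if (existing && existing->pool_id == pool_id) {
        existing->refcount++;
        return existing;
    }
    struct toy_mp *dmp = malloc(sizeof *dmp);
    assert(dmp);
    dmp->pool_id = pool_id;
    dmp->refcount = 1;
    return dmp;
}

/* Drop one reference; the sweep would free only at refcount 0. */
static void
toy_mp_put(struct toy_mp *dmp)
{
    assert(dmp->refcount > 0);
    dmp->refcount--;
}

/* Scenario: create, reuse, then one port releases its reference.
 * Returns the surviving refcount; it must stay above zero so a sweep
 * cannot free the pool the remaining port still uses. */
static int
reuse_then_release(void)
{
    struct toy_mp *a = toy_mp_get(NULL, 42);
    struct toy_mp *b = toy_mp_get(a, 42);   /* Same pool: reuse path. */
    assert(a == b);
    toy_mp_put(b);                          /* One user goes away. */
    int rc = a->refcount;
    free(a);
    return rc;
}
```

In this model, the EEXIST bug flagged above corresponds to calling `toy_mp_put()` on the shared wrapper instead of just freeing the duplicate wrapper: that drives the refcount toward zero and lets the sweep free a pool that is still in use.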
> > > static void > > @@ -835,7 +906,8 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, > > int n_rxq, int n_txq) > > > > for (i = 0; i < n_rxq; i++) { > > diag = rte_eth_rx_queue_setup(dev->port_id, i, dev- > >rxq_size, > > - dev->socket_id, NULL, dev- > >mp); > > + dev->socket_id, NULL, > > + dev->dpdk_mp->mp); > > if (diag) { > > VLOG_INFO("Interface %s unable to setup rxq(%d): %s", > > dev->up.name, i, rte_strerror(-diag)); @@ > > -922,7 +994,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev) > > memcpy(dev->hwaddr.ea, eth_addr.addr_bytes, ETH_ADDR_LEN); > > rte_eth_link_get_nowait(dev->port_id, &dev->link); > > > > - mbp_priv = rte_mempool_get_priv(dev->mp); > > + mbp_priv = rte_mempool_get_priv(dev->dpdk_mp->mp); > > dev->buf_size = mbp_priv->mbuf_data_room_size - > > RTE_PKTMBUF_HEADROOM; > > > > /* Get the Flow control configuration for DPDK-ETH */ @@ -1176,7 > > +1248,7 @@ common_destruct(struct netdev_dpdk *dev) > > OVS_EXCLUDED(dev->mutex) > > { > > rte_free(dev->tx_q); > > - dpdk_mp_release(dev->mp); > > + dpdk_mp_put(dev->dpdk_mp); > > > > ovs_list_remove(&dev->list_node); > > free(ovsrcu_get_protected(struct ingress_policer *, @@ -1928,7 > > +2000,7 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq, > > return EAGAIN; > > } > > > > - nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->mp, > > + nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->dpdk_mp->mp, > > (struct rte_mbuf **) batch- > >packets, > > NETDEV_MAX_BURST); > > if (!nb_rx) { > > @@ -2167,7 +2239,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, > struct dp_packet_batch *batch) > > continue; > > } > > > > - pkts[txcnt] = rte_pktmbuf_alloc(dev->mp); > > + pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp); > > if (OVS_UNLIKELY(!pkts[txcnt])) { > > dropped += cnt - i; > > break; > > @@ -3047,7 +3119,7 @@ netdev_dpdk_get_mempool_info(struct unixctl_conn > *conn, > > ovs_mutex_lock(&dev->mutex); > > ovs_mutex_lock(&dpdk_mp_mutex); > > > > - rte_mempool_dump(stream, dev->mp); > > + 
rte_mempool_dump(stream, dev->dpdk_mp->mp); > > > > ovs_mutex_unlock(&dpdk_mp_mutex); > > ovs_mutex_unlock(&dev->mutex); @@ -3742,7 +3814,7 @@ > > dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev) > > > > err = netdev_dpdk_mempool_configure(dev); > > if (!err) { > > - /* A new mempool was created. */ > > + /* A new mempool was created or re-used. */ > > netdev_change_seq_changed(&dev->up); > > } else if (err != EEXIST){ > > return err; > > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml index > > 7ab90d5..95520af 100644 > > --- a/vswitchd/vswitch.xml > > +++ b/vswitchd/vswitch.xml > > @@ -359,6 +359,22 @@ > > </p> > > </column> > > > > + <column name="other_config" key="per-port-mp-enabled" > > + type='{"type": "boolean"}'> > > + <p> > > + By default OVS DPDK uses a shared mempool memory model > wherein > > + devices that have the same MTU and socket values can share > the same > > + mempool. Setting this value to <code>true</code> changes this > > + behaviour. Per port mempools allow DPDK devices to use a > private > > + mempool per device. This can provide greater transparency as > > + regards memory usage but potentially at the cost of greater > memory > > + requirements. > > + </p> > > + <p> > > + Changing this value requires restarting the daemon. > > + </p> > > + </column> > > + > > <column name="other_config" key="tx-flush-interval" > > type='{"type": "integer", > > "minInteger": 0, "maxInteger": 1000000}'> > > _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
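As a closing illustration of the two sizing schemes discussed in the thread, the heuristic in `dpdk_calculate_mbufs()` can be reproduced standalone. The `NETDEV_MAX_BURST` and `RTE_MAX_LCORE` values below are representative placeholders (the real constants come from OVS and the DPDK build configuration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_NB_MBUF (4096 * 64)
#define MIN_NB_MBUF (4096 * 4)
#define ETHER_MTU 1500
/* Representative values; not taken from this thread. */
#define NETDEV_MAX_BURST 32
#define RTE_MAX_LCORE 128

static int
min_int(int a, int b)
{
    return a < b ? a : b;
}

/* Mirrors the patch's heuristic: shared mempools use a fixed size keyed
 * only on MTU, while per port mempools are sized from the device's own
 * rxq/txq configuration plus headroom for in-flight packets. */
static uint32_t
calc_mbufs(int mtu, int n_rxq, int rxq_size, int n_txq, int txq_size,
           bool per_port_mp)
{
    if (!per_port_mp) {
        /* Shared model: rough provisioning by MTU class. */
        return mtu >= ETHER_MTU ? MAX_NB_MBUF : MIN_NB_MBUF;
    }
    /* Per port model: fill the rxqs, allow for packets stuck on txqs,
     * in PMD threads, and a corner-case reserve. */
    return (uint32_t) (n_rxq * rxq_size + n_txq * txq_size
                       + min_int(RTE_MAX_LCORE, n_rxq) * NETDEV_MAX_BURST
                       + MIN_NB_MBUF);
}
```

Note that in the shared model the count may still be halved repeatedly down to `MIN_NB_MBUF` if allocation fails with ENOMEM, which is what the `BUILD_ASSERT_DECL` divisibility checks in the patch guard.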