Hi,


Thanks, everyone, for re-opening the discussion around the new packet mempool 
handling for 2.10.



Before we agree on what to actually implement, I’d like to summarize my 
understanding of the requirements that have been discussed so far. Based on 
those I want to share some thoughts about how we can best address these 
requirements.



Requirements:



R1 (Backward compatibility):

The new mempool handling shall function at least as well as the OVS 2.9 
design base for any given OVS-DPDK configuration: hugepage memory, PMDs, 
ports, queues, MTU sizes, traffic flows. This is to ensure that we can 
upgrade OVS in existing deployments without risk of breaking anything.



R2 (Dimensioning for static deployments):

It shall be possible for an operator to calculate the amount of memory needed 
for packet mempools in a given static (maximum) configuration (PMDs, ethernet 
ports and queues, maximum number of vhost ports, MTU sizes) to reserve 
sufficient hugepages for OVS.



R3 (Safe operation):

If the mempools are dimensioned correctly, it shall not be possible for OVS 
to run out of mbufs for packet processing.



R4 (Minimal footprint):

The packet mempool size needed for safe operation of OVS should be as small as 
possible.



R5 (Dynamic mempool allocation):

It should be possible to automatically adjust the size of packet mempools at 
run-time when the OVS configuration changes, e.g. adding PMDs, adding ports, 
adding rx/tx queues, or changing a port's MTU size. (Note: Shrinking the 
mempools when the configuration is reduced is less important.)



Actual maximum mbuf consumption in OVS DPDK:


  1.  Phy rx queues: Sum over dpdk dev: (dev->requested_n_rxq * 
dev->requested_rxq_size)
Note: Normally the number of rx queues should not exceed the number of PMDs.
  2.  Phy tx queues: Sum over dpdk dev: (#active tx queues (=#PMDs) * 
dev->requested_txq_size)

Note 1: These are hogged because the DPDK PMDs release transmitted mbufs 
lazily.
Note 2: The mbufs stored in a tx queue can originate from any port.

  3.  One rx batch per PMD during processing: #PMDs * 32
  4.  One batch per active tx queue for time-based batching: 32 * #devs * #PMDs



Assuming rx/tx queue size of 2K for physical ports and #rx queues = #PMDs 
(RSS), the upper limit for the used mbufs would be



(*1*)     #dpdk devs * #PMDs * 4K   +  (#dpdk devs + #vhost devs) * #PMDs * 32  
+  #PMDs * 32



Examples:

  *   With a typical NFVI deployment (2 DPDK devs, 4 PMDs, 128 vhost devs ) 
this yields  32K + 17K = 49K mbufs
  *   For a large NFVI deployment (4 DPDK devs, 8 PMDs, 256 vhost devs ) this 
would yield  128K + 66K = 194K mbufs

Roughly 1/3rd of the total mbufs are hogged in dpdk dev rx queues. The 
remaining 2/3rds are populated with an arbitrary mix of mbufs from all sources.
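
As a sanity check, here is a minimal C sketch of formula (*1*). The function
name and the example values are mine; the constants simply restate the
assumptions above (2K rx/tx descriptors per phy queue, batches of 32 mbufs):

#include <stdio.h>

#define QUEUE_SIZE 2048   /* rx and tx descriptors per phy queue */
#define BATCH_SIZE 32     /* mbufs per rx batch / tx batching buffer */

/* Worst-case mbuf usage per formula (*1*), assuming #rx queues == #PMDs. */
static unsigned long
worst_case_mbufs(unsigned dpdk_devs, unsigned vhost_devs, unsigned pmds)
{
    unsigned long phy_queues = (unsigned long) dpdk_devs * pmds * 2 * QUEUE_SIZE;
    unsigned long tx_batches = (unsigned long) (dpdk_devs + vhost_devs) * pmds * BATCH_SIZE;
    unsigned long rx_batches = (unsigned long) pmds * BATCH_SIZE;

    return phy_queues + tx_batches + rx_batches;
}

int
main(void)
{
    /* The two example deployments above. */
    printf("typical: %lu mbufs\n", worst_case_mbufs(2, 128, 4));   /* ~49K  */
    printf("large:   %lu mbufs\n", worst_case_mbufs(4, 256, 8));   /* ~194K */
    return 0;
}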



Legacy shared mempool handling up to OVS 2.9:


  *   One mempool per NUMA node and per MTU size range in use.
  *   Each mempool is created with the largest of 256K, 128K, 64K, 32K or 16K 
mbufs that DPDK can allocate at mempool creation time.
  *   Each mempool is shared among all ports on its NUMA node with an MTU in 
its range.
     *   All rx queues of a port share the same mempool



The legacy code trivially satisfies R1. Its strength is that the mempools are 
shared, which avoids the bloat of dedicated per-port mempools implied by the 
handling on master (see below).



Apart from that it does not fulfill any of the requirements.

  *   It swallows all available hugepage memory to allocate up to 256K mbufs 
per NUMA node, even though that is far more than typically needed (violating 
R4).

  *   The actual size of the created mempools depends on the order of creation 
and the hugepage memory available. Early mempools are over-dimensioned, later 
mempools may be under-dimensioned. Operation is not at all safe (violating R3).
  *   It doesn’t provide any help for the operator to dimension and reserve 
hugepages for OVS (violating R2)
  *   The only dynamic behavior is that additional mempools for new MTU size 
ranges are created when they are needed. Due to the greedy initial allocation 
these late allocations are likely to fail (violating R5).



My take is that even though the shared mempool concept is good, the legacy 
mempool handling should not be kept as is.



Mempool per port scheme (currently implemented on master):



From the above mbuf utilization calculation it is clear that only the dpdk rx 
queues are populated exclusively with mbufs from the port’s mempool. All other 
places are populated with mbufs from all ports, in the case of tx queues 
typically not even their own. As it is not possible to predict the assignment 
of rx queues to PMDs and the flow of packets between ports, safety requirement 
R3 implies that each port mempool must be dimensioned for the worst case, i.e.



[#PMDs * 2K ] +  #dpdk devs * #PMDs * 2K   +  (#dpdk devs + #vhost devs) * 
#PMDs * 32  +  #PMDs * 32



Even though the first term [#PMDs * 2K] is only needed for physical ports, 
this effectively multiplies the total number of mbufs needed (*1*) by the 
number of ports (dpdk and vhost) in the system.
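
To make the scaling explicit, here is a small sketch of the per-port worst
case. Names and values are illustrative; exact counts differ slightly from
the rounded figures in the examples below:

#include <stdio.h>

#define QUEUE_SIZE 2048
#define BATCH_SIZE 32

/* Worst-case mempool size for one port; a phy port additionally needs the
 * mbufs for its own rx queues (#PMDs * 2K). */
static unsigned long
per_port_worst_case(unsigned dpdk_devs, unsigned vhost_devs, unsigned pmds,
                    int is_phy_port)
{
    unsigned long own_rxqs = is_phy_port ? (unsigned long) pmds * QUEUE_SIZE : 0;
    unsigned long phy_txqs = (unsigned long) dpdk_devs * pmds * QUEUE_SIZE;
    unsigned long batches  = (unsigned long) (dpdk_devs + vhost_devs + 1) * pmds * BATCH_SIZE;

    return own_rxqs + phy_txqs + batches;
}

int
main(void)
{
    /* Typical NFVI deployment: 2 dpdk devs, 128 vhost devs, 4 PMDs. */
    printf("phy port mempool:   %lu mbufs\n", per_port_worst_case(2, 128, 4, 1)); /* ~41K */
    printf("vhost port mempool: %lu mbufs\n", per_port_worst_case(2, 128, 4, 0)); /* ~33K */
    return 0;
}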



Examples:

  *   With a typical NFVI deployment (2 DPDK devs, 4 PMDs, 128 vhost devs ) 
this yields
2 * (24K + 17K) + 128 * (16K + 17K) =  4306K mbufs
  *   For a large NFVI deployment (4 DPDK devs, 8 PMDs, 256 vhost devs ) this 
would yield
4 * (80K + 66K) + 256 * (64K + 17K) =  21320K mbufs



The total mempool size required for safe operation is ridiculously high. Any 
attempt to bring the per-port mempool model on par with the memory 
consumption of a properly dimensioned shared mempool scheme will be inherently 
unsafe. This clearly indicates that a per-port mempool model is not adequate. 
The current per-port mempool scheme on master should be removed.



One mempool per MTU range:



A similar argument as for the per-port mempool above also holds for the per-MTU 
range mempools used in the 2.9 design base. As the mbufs received on a port 
with a given MTU can be sent to any port in the system, each MTU range mempool 
must be dimensioned to a large fraction of the maximum total number of mbufs in 
use (*1*):  2/3rds + the number of rx queue descriptors for that MTU range.



Already with 2 different mbuf sizes (e.g. for MTU 9000 on phy ports and MTU 
1500 on vhu ports), dimensioning each MTU-mempool safely can require more 
memory in total than using a single mempool of the maximum needed mbuf size for 
all ports.



To address R4 (minimal footprint) we could simplify the solution and give up 
the concept of one mempool per MTU range. There are three options:

  *   Configure an mbuf size for the single mempool, which then implies an 
upper limit on the configurable MTU per port (see the sketch after this list).
  *   Replace the mempool with another mempool of larger mbufs when a port is 
configured with MTU that would not fit.
  *   Use the multi-segment mbuf approach (Intel WiP patch) to satisfy MTU 
sizes that do not fit the fixed mbuf-size.
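
As a rough illustration of the first option, a hypothetical check that a
requested port MTU still fits a fixed mbuf data room; the overhead constants
are assumptions, not the exact values OVS uses:

#include <stdbool.h>

#define MBUF_DATA_ROOM  (3 * 1024)    /* configurable common mbuf size, e.g. 3KB */
#define L2_OVERHEAD     (14 + 4 + 4)  /* Ethernet header + VLAN tag + CRC (assumed) */

/* Reject a port MTU that would not fit into a single fixed-size mbuf. */
static bool
mtu_fits_mbuf(unsigned mtu)
{
    return mtu + L2_OVERHEAD <= MBUF_DATA_ROOM;
}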



Per PMD mempools:



The following arguments suggest that a mempool per PMD, allocated on the 
PMD’s NUMA node, might make good sense:



  *   The total mbufs in use by OVS cleanly partitions into subsets per PMD:
     *   Packets hogged in dpdk rx queues are naturally owned by the PMD 
polling the rx queues
     *   Each PMD typically has its dedicated dpdk tx queue, so that all mbufs 
hogged in that tx queue are owned by the PMD.
(In the unusual case of shared tx queues we still need to assume the worst case 
that all mbufs belong to a single PMD.)
     *   Also the mbufs in flight and in tx batching buffers are owned by the 
PMD.


With the same assumptions as above, the amount of mbufs in use by a single PMD 
is bounded by


(*2*)                     #dpdk devs * 4K   +  (#dpdk devs + #vhost devs) * 32  
+  32


  *   For best performance mbufs being processed by a PMD thread should be 
local to the PMD’s NUMA socket. This is especially important for tx to 
vhostuser due to copying of entire packet content.

Today this is not the case for dpdk rx queues polled by remote PMDs (through rx 
queue pinning). All rx queues of a dpdk port are tied to a mempool on the NIC’s 
NUMA node. The “Fujitsu patch” presented at the OVS Conference 2016 showed that 
the performance of a remote PMD can be significantly improved by assigning a 
mempool local to the PMD for the pinned dpdk rx queue. In this case the DMA 
engine of the NIC takes care of the QPI bus transfer and the PMD is not 
burdened. DPDK supports this model, as the mempool for eth devices is 
configurable per rx queue, not per port.
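
To illustrate that DPDK already supports this, below is a minimal sketch
(outside of OVS, with hypothetical naming and sizing) that creates a mempool
on the polling PMD's NUMA node and attaches it to one specific rx queue:

#include <rte_errno.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define MBUF_DATA_ROOM  (3 * 1024 + RTE_PKTMBUF_HEADROOM)  /* assumed 3KB payload */

/* Attach a PMD-local mempool to a pinned dpdk rx queue. */
static int
setup_pinned_rxq(const char *mp_name, uint16_t port_id, uint16_t rxq_id,
                 uint16_t n_rxd, int pmd_socket_id, unsigned n_mbufs)
{
    /* The mempool lives on the polling PMD's NUMA node, not the NIC's. */
    struct rte_mempool *mp = rte_pktmbuf_pool_create(mp_name, n_mbufs,
                                                     256 /* per-lcore cache */,
                                                     0, MBUF_DATA_ROOM,
                                                     pmd_socket_id);
    if (!mp) {
        return -rte_errno;
    }

    /* DPDK takes the mempool per rx queue, so a remotely polled queue can use
     * the remote PMD's pool while the NIC DMA engine handles the QPI transfer. */
    return rte_eth_rx_queue_setup(port_id, rxq_id, n_rxd,
                                  pmd_socket_id, NULL, mp);
}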

  *   Using the above dimensioning formula, requirements R1 to R4 could be 
fulfilled by a mempool per PMD in the same way as by per-NUMA mempools globally 
shared by all PMDs on that NUMA node. Requirement R5 (dynamic allocation) would 
to some extent be fulfilled as well, as mempools could be added/deleted 
dynamically when PMDs are added to or removed from OVS.



Conclusions:



I would suggest aiming for a new mempool handling along the following lines:



  *   Create mempools per PMD based on the above formula (*2*) using reasonable 
hard-coded default bounds for #dpdk devs (e.g. 8) and #vhost devs (256) such 
that the total memory remains below the 2.9 legacy.
     *   Improvement: make these bounds configurable.

  *   Use the “Fujitsu patch approach” and assign the dpdk rx queue to the 
mempool of the polling PMD.

  *   Avoid the complexity and memory waste of multiple mempools per PMD for 
different MTU sizes.
Use one configurable common mbuf size (default e.g. 3*1024 bytes (3KB), 
covering the most common MTU sizes) and multi-segment mbufs to handle larger 
port MTUs. For optimal jumbo frame performance, users could configure 10KB 
mbufs at the price of more memory.



Assuming 8 PMDs, 8 dpdk devs, 256 vhost devs, 2K descriptors per dpdk rx/tx 
queue and 3KB mbuf size, the resulting overall hugepage memory requirements for 
packet mempools would be:



                8 * 4K  + 264 * 32 + 32 mbufs  =  41K mbufs          per PMD

                4 PMD * 41K mbufs/PMD * 3KB  ~=  512 MB        per NUMA node 
with equal 4:4 PMD distribution
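
The same sizing as a small C sketch of formula (*2*), again with illustrative
names; the bounds mirror the defaults proposed above:

#include <stdio.h>

#define MAX_DPDK_DEVS   8          /* assumed default bound */
#define MAX_VHOST_DEVS  256        /* assumed default bound */
#define QUEUE_SIZE      2048       /* rx/tx descriptors per phy queue */
#define BATCH_SIZE      32
#define MBUF_SIZE       (3 * 1024) /* common 3KB mbuf size */

/* Per-PMD mempool size per formula (*2*). */
static unsigned long
per_pmd_mbufs(void)
{
    return (unsigned long) MAX_DPDK_DEVS * 2 * QUEUE_SIZE           /* phy rx + tx queues */
           + (MAX_DPDK_DEVS + MAX_VHOST_DEVS) * BATCH_SIZE          /* tx batching buffers */
           + BATCH_SIZE;                                            /* rx batch in flight  */
}

int
main(void)
{
    unsigned long mbufs = per_pmd_mbufs();   /* ~41K mbufs per PMD */
    unsigned pmds_per_numa = 4;              /* 8 PMDs split 4:4 across NUMA nodes */

    printf("per PMD:       %lu mbufs\n", mbufs);
    printf("per NUMA node: ~%lu MB\n",
           mbufs * pmds_per_numa * MBUF_SIZE / (1024UL * 1024));  /* ~483 MB raw, ~512 MB as rounded above */
    return 0;
}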



So the typical 1GB of hugepage memory per NUMA socket in an OVS deployment 
should be more than sufficient to cover the memory requirement of the proposed 
default mempool scheme, even for large NFVI deployments. Assigning 2GB per 
NUMA node would already cover the memory needed for unsegmented 9KB jumbo 
frames.



For better compatibility with OVS 2.9 in small test setups we could consider 
keeping a scheme that successively reduces the above default #mbufs per PMD 
mempool until the mempools fit the available hugepage memory. In that case a 
WARN message should be logged whenever the created mempools are not sufficient 
to handle the actual DPDK datapath configuration safely.





Comments are welcome!



Jan



> >> Hi all,

> >>

> >> Now seems a good time to kick start this conversation again as there's a 
> >> few patches floating around for mempools on master and

> 2.9.

> >> I'm happy to work on a solution for this but before starting I'd like to 
> >> agree on the requirements so we're all comfortable with the

> solution.

> >>

> >

> > Thanks for kicking it off Ian. FWIW, the freeing fix code can work with

> > both schemes below. I already have that between the patches for

> > different branches. It should be straightforward to change to cover both

> > in same code. I can help with that if needed.

>

> Agree, there is no much difference between mempool models for freeing fix.

>

> >

> >> I see two use cases above, static and dynamic. Each have their own 
> >> requirements (I'm keeping OVS 2.10 in mind here as it's an

> issue we need to resolve).

> >>

> >> Static environment

> >> 1. For a given deployment, the 2.10 the mempool design should use the same 
> >> or less memory as the shared mempool design of

> 2.9.

> >> 2. Memory pool size can depend on static datapath configurations, but the 
> >> previous provisioning used in OVS 2.9 is acceptable also.

> >>

> >> I think the shared mempool model suits the static environment, it's a 
> >> rough way of provisioning memory but it works for the

> majority involved in the discussion to date.

> >>

> >> Dynamic environment

> >> 1. Mempool size should not depend on dynamic characteristics (number of 
> >> PMDs, number of ports etc.), this leads to frequent

> traffic interrupts.

> >

> > If that is wanted I think you need to distinguish between port related

> > dynamic characteristics and non-port related. At present the per port

> > scheme depends on number of rx/tx queues and the size of rx/tx queues.

> > Also, txq's depends on number of PMDs. All of which can be changed

> > dynamically.

>

> Changing of the mempool size is too heavy operation. We should

> avoid it somehow as long as possible.

>

> It'll be cool to have some kind of dynamic mempool resize API from the

> DPDK, but there is no such concepts right now. Maybe it'll be good if

> DPDK API will allow to add more than one mempool for a device. Such API

> could allow us to dynamically increase/decrease the total amount of

> memory available for a single port. We should definitely think about

> something like this in the future.

>

> >

> >> 2. Due to the dynamic environment, it's preferable for clear visibility of 
> >> memory usage for ports (Sharing mempools violates this).

> >>

> >> The current per port model suits the dynamic environment.

> >>

> >> I'd like to propose for 2.10 that we implement a model to allow both:

> >>

> >> * When adding a port the shared mempool model would be the default 
> >> behavior. This would satisfy users moving from previous

> OVS releases to 2.10 as memory requirements would be in line with what was 
> previously expected and no new options/arguments

> are needed.

> >>

> >

> > +1

>

> It's OK for me too.

>

> >

> >> * Per port mempool is available but must be requested by a user, it would 
> >> require a new option argument when adding a port.

> >

> > I'm not sure there needs to be an option *per port*. The implication is

> > that some mempools would be created exclusively for a single port, while

> > others would be available to share and this would operate at the same time.

> >

> > I think a user would either have an unknown or high number of ports and

> > are ok with provisioning the amount of memory for shared mempools, or

> > they know they will have only a few ports and can benefit from using

> > less memory.

>

> Unknown/big but limited number of ports could also be a scenario for

> separate mempool model, especially for dynamic case.

>

> >

> > Although, while it is desirable to reduce memory usage, I've never

> > actually heard anyone complaining about the amount of memory needed for

> > shared mempools and requesting it to be reduced.

>

> I agree that per-port option looks like more than users could need.

> Maybe global config will be better.

>

> There is one more thing: Users like OpenStack are definitely "dynamic".

> Addition of the new special parameter will require them to modify their

> code to have more or less manageable memory consumption.

>

> P.S. Meanwhile, I will be out of office until May 3 and will not be able

>      to respond to emails.

>

> >

> > I don't think it would be particularly difficult to have both schemes

> > operating at the same time because you could use mempool names to

> > differentiate (some with unique port related name, some with a general

> > name) and mostly treat them the same, but just not sure that it's really

> > needed.

> >

> >> This would be an advanced feature as its mempool size can depend on port 
> >> configuration, users need to understand this &

> mempool concepts in general before using this. A bit of work to be done here 
> in the docs to make this clear how memory

> requirements are calculated etc.

> >>

> >> Before going into solution details I'd like to get people's opinions. 
> >> There's a few different ways to implement this, but in general

> would the above be acceptable? I think with some smart design we could 
> minimize the code impact so that both approaches share as

> much as possible.

> >>

> >> Ian

> >>