Re: [ovs-discuss] Mempool issue for OVS 2.9

2018-01-26 Thread Ilya Maximets
On 26.01.2018 15:00, Stokes, Ian wrote:
> Hi All,
> 
> Recently an issue was raised regarding the move from a single shared mempool 
> model that was in place up to OVS 2.8, to a mempool per port model introduced 
> in 2.9.
> 
> https://mail.openvswitch.org/pipermail/ovs-discuss/2018-January/046021.html
> 
> The per port mempool model was introduced in September 5th with commit 
> d555d9bded to allow fine grain control on a per port case of memory usage.
> 
> In the 'common/shared mempool' model, ports sharing a similar MTU would all 
> share the same buffer mempool (e.g. the most common example of this being 
> that all ports are by default created with a 1500B MTU, and as such share the 
> same mbuf mempool).
> 
> This approach had some drawbacks however. For example, with the shared memory 
> pool model a user could exhaust the shared memory pool (for instance by 
> requesting a large number of RXQs for a port), this would cause the vSwitch 
> to crash as any remaining ports would not have the required memory to 
> function. This bug was discovered and reported to the community in late 2016 
> https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html.
> 
> The per port mempool patch aimed to avoid such issues by allocating a 
> separate buffer mempool to each port.
> 
> An issue has been flagged on ovs-discuss, whereby memory dimensions provided 
> for a given number of ports on OvS 2.6-2.8 may be insufficient to support the 
> same number of ports in OvS 2.9, on account of the per-port mempool model 
> without re-dimensioning extra memory. The effect of this is use case 
> dependent (number of ports, RXQs, MTU settings, number of PMDs etc.) The 
> previous 'common-pool' model was rudimentary in estimating the mempool size 
> and was marked as something that was to be improved upon. The memory 
> allocation calculation for per port model was modified to take the possible 
> configuration factors mentioned into account.
> 
> It's unfortunate that this came to light so close to the release code freeze 
> - but better late than never as it is a valid problem to be resolved.
> 
> I wanted to highlight some options to the community as I don't think the next 
> steps should be taken in isolation due to the impact this feature has.
> 
> There are a number of possibilities for the 2.9 release.
> 
> (i) Revert the mempool per port patches and return to the shared mempool 
> model. There are a number of features and refactoring in place on top of the 
> change so this will not be a simple revert. I'm investigating what exactly is 
> involved with this currently.

This looks like a bad thing to do. Shared memory pools have their own issues and
hide the real memory usage of each port. Also, reverting seems almost impossible;
we'd have to re-implement it from scratch.

> (ii) Leave the per port mempool implementation as is, flag to users that 
> memory requirements have increased. Extra memory may have to be provided on a 
> per use case basis.

That's a good point. I prefer this option if possible.

> (iii) Reduce the amount of memory allocated per mempool per port. An RFC to 
> this effect was submitted by Kevin but on follow up the feeling is that it 
> does not resolve the issue adequately.

IMHO, we should not reduce the mempool size by hardcoding a small value. Different
installations have their own requirements because of different numbers of HW NICs
and CPU cores. I believe that 20-30K is a normal per-port size for now.

How about adding an upper limit for the mempool size, configurable at boot time?
See the code below for an example (not tested):
---
diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c
index 3602180..ed54e06 100644
--- a/lib/dpdk-stub.c
+++ b/lib/dpdk-stub.c
@@ -54,3 +54,9 @@ dpdk_vhost_iommu_enabled(void)
 {
     return false;
 }
+
+uint32_t
+dpdk_mempool_size_limit(void)
+{
+    return UINT32_MAX;
+}
diff --git a/lib/dpdk.c b/lib/dpdk.c
index 6710d10..219e412 100644
--- a/lib/dpdk.c
+++ b/lib/dpdk.c
@@ -43,6 +43,9 @@ static FILE *log_stream = NULL;   /* Stream for DPDK log redirection */
 static char *vhost_sock_dir = NULL;   /* Location of vhost-user sockets */
 static bool vhost_iommu_enabled = false; /* Status of vHost IOMMU support */
 
+/* Maximum number of mbufs in a single memory pool. */
+static uint32_t per_port_mempool_size_limit = UINT32_MAX;
+
 static int
 process_vhost_flags(char *flag, const char *default_val, int size,
                     const struct smap *ovs_other_config,
@@ -313,6 +316,7 @@ dpdk_init__(const struct smap *ovs_other_config)
     int err = 0;
     cpu_set_t cpuset;
     char *sock_dir_subcomponent;
+    uint64_t size_u64;
 
     log_stream = fopencookie(NULL, "w+", dpdk_log_func);
     if (log_stream == NULL) {
@@ -351,6 +355,14 @@ dpdk_init__(const struct smap *ovs_other_config)
     VLOG_INFO("IOMMU support for vhost-user-client %s.",
               vhost_iommu_enabled ?

Re: [ovs-discuss] Mempool issue for OVS 2.9

2018-01-26 Thread Kevin Traynor
On 01/26/2018 03:16 PM, Ilya Maximets wrote:
> On 26.01.2018 15:00, Stokes, Ian wrote:
>> Hi All,
>>
>> Recently an issue was raised regarding the move from a single shared mempool 
>> model that was in place up to OVS 2.8, to a mempool per port model 
>> introduced in 2.9.
>>
>> https://mail.openvswitch.org/pipermail/ovs-discuss/2018-January/046021.html
>>
>> The per port mempool model was introduced in September 5th with commit 
>> d555d9bded to allow fine grain control on a per port case of memory usage.
>>
>> In the 'common/shared mempool' model, ports sharing a similar MTU would all 
>> share the same buffer mempool (e.g. the most common example of this being 
>> that all ports are by default created with a 1500B MTU, and as such share 
>> the same mbuf mempool).
>>
>> This approach had some drawbacks however. For example, with the shared 
>> memory pool model a user could exhaust the shared memory pool (for instance 
>> by requesting a large number of RXQs for a port), this would cause the 
>> vSwitch to crash as any remaining ports would not have the required memory 
>> to function. This bug was discovered and reported to the community in late 
>> 2016 
>> https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html.
>>
>> The per port mempool patch aimed to avoid such issues by allocating a 
>> separate buffer mempool to each port.
>>
>> An issue has been flagged on ovs-discuss, whereby memory dimensions provided 
>> for a given number of ports on OvS 2.6-2.8 may be insufficient to support 
>> the same number of ports in OvS 2.9, on account of the per-port mempool 
>> model without re-dimensioning extra memory. The effect of this is use case 
>> dependent (number of ports, RXQs, MTU settings, number of PMDs etc.) The 
>> previous 'common-pool' model was rudimentary in estimating the mempool size 
>> and was marked as something that was to be improved upon. The memory 
>> allocation calculation for per port model was modified to take the possible 
>> configuration factors mentioned into account.
>>
>> It's unfortunate that this came to light so close to the release code freeze 
>> - but better late than never as it is a valid problem to be resolved.
>>
>> I wanted to highlight some options to the community as I don't think the 
>> next steps should be taken in isolation due to the impact this feature has.
>>
>> There are a number of possibilities for the 2.9 release.
>>
>> (i) Revert the mempool per port patches and return to the shared mempool 
>> model. There are a number of features and refactoring in place on top of the 
>> change so this will not be a simple revert. I'm investigating what exactly 
>> is involved with this currently.
> 
> This looks like a bad thing to do. Shared memory pools have their own issues
> and hide the real memory usage of each port. Also, reverting seems almost
> impossible; we'd have to re-implement it from scratch.
>>> (ii) Leave the per port mempool implementation as is, flag to users
that memory requirements have increased. Extra memory may have to be
provided on a per use case basis.
> 
> That's a good point. I prefer this option if possible.

With 4 PMDs and 4 rxqs per port, a user would only be able to add 7 ports
per socket now, or 9 ports with 1 rxq. Of course a user may have some
headroom in the memory they have allocated, but that can't be assumed. It
will surely mean some users' setups will not work after upgrade - and then
they will have to debug why.

If we just reduce the MIN_NB_MBUF to 8192, it would make it 10 ports/4
rxqs, or 13 ports/1 rxqs without additional memory. It still gives
24K/18K buffers per port.
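
For context, a rough back-of-envelope sketch of the arithmetic behind these
per-port figures; the queue sizes, tx queue count and per-mbuf footprint below
are assumptions for illustration, not the exact calculation in netdev-dpdk.c:

#include <stdio.h>

#define RXQ_SIZE       2048u   /* assumed rx descriptors per queue */
#define TXQ_SIZE       2048u   /* assumed tx descriptors per queue */
#define N_TXQ          4u      /* assumed: one tx queue per PMD */
#define MBUF_FOOTPRINT 3072u   /* rough bytes per 1500B-MTU mbuf */

static unsigned
mbufs_per_port(unsigned n_rxq, unsigned min_nb_mbuf)
{
    return n_rxq * RXQ_SIZE + N_TXQ * TXQ_SIZE + min_nb_mbuf;
}

int
main(void)
{
    const unsigned min_nb_mbuf[] = { 16384u, 8192u };  /* current vs. proposed */

    for (int i = 0; i < 2; i++) {
        for (unsigned n_rxq = 1; n_rxq <= 4; n_rxq += 3) {
            unsigned n = mbufs_per_port(n_rxq, min_nb_mbuf[i]);

            printf("MIN_NB_MBUF=%-5u n_rxq=%u: %u mbufs (~%u MB per port)\n",
                   min_nb_mbuf[i], n_rxq, n,
                   n * MBUF_FOOTPRINT / (1024u * 1024u));
        }
    }
    return 0;
}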

What do you think?

> 
>> (iii) Reduce the amount of memory allocated per mempool per port. An RFC to 
>> this effect was submitted by Kevin but on follow up the feeling is that it 
>> does not resolve the issue adequately.
> 

Right, we discussed it in the other thread. It fixes some of the missed
factors for worst case, but Pradeep pointed out there are more. To
account for them all would be difficult and mean a lot of extra mbufs
per port.

> IMHO, we should not reduce the mempool size by hardcoding a small value.
> Different installations have their own requirements because of different
> numbers of HW NICs and CPU cores. I believe that 20-30K is a normal
> per-port size for now.
> 

It would not be too far off that range with the suggested change above.

> How about adding an upper limit for the mempool size, configurable at boot time?

I don't think an OVS user would know how to calculate the number of
mbufs. Even if they did, setting it would amount to accepting that
they may run out of mbufs on a port.

> See the code below for an example (not tested):
> ---
> diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c
> index 3602180..ed54e06 100644
> --- a/lib/dpdk-stub.c
> +++ b/lib/dpdk-stub.c
> @@ -54,3 +54,9 @@ dpdk_vhost_iommu_enabled(void)
>  {
>      return false;
>  }

Re: [ovs-discuss] Mempool issue for OVS 2.9

2018-01-26 Thread Ilya Maximets
On 26.01.2018 18:47, Kevin Traynor wrote:
> On 01/26/2018 03:16 PM, Ilya Maximets wrote:
>> On 26.01.2018 15:00, Stokes, Ian wrote:
>>> Hi All,
>>>
>>> Recently an issue was raised regarding the move from a single shared 
>>> mempool model that was in place up to OVS 2.8, to a mempool per port model 
>>> introduced in 2.9.
>>>
>>> https://mail.openvswitch.org/pipermail/ovs-discuss/2018-January/046021.html
>>>
>>> The per port mempool model was introduced in September 5th with commit 
>>> d555d9bded to allow fine grain control on a per port case of memory usage.
>>>
>>> In the 'common/shared mempool' model, ports sharing a similar MTU would all 
>>> share the same buffer mempool (e.g. the most common example of this being 
>>> that all ports are by default created with a 1500B MTU, and as such share 
>>> the same mbuf mempool).
>>>
>>> This approach had some drawbacks however. For example, with the shared 
>>> memory pool model a user could exhaust the shared memory pool (for instance 
>>> by requesting a large number of RXQs for a port), this would cause the 
>>> vSwitch to crash as any remaining ports would not have the required memory 
>>> to function. This bug was discovered and reported to the community in late 
>>> 2016 
>>> https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html.
>>>
>>> The per port mempool patch aimed to avoid such issues by allocating a 
>>> separate buffer mempool to each port.
>>>
>>> An issue has been flagged on ovs-discuss, whereby memory dimensions 
>>> provided for a given number of ports on OvS 2.6-2.8 may be insufficient to 
>>> support the same number of ports in OvS 2.9, on account of the per-port 
>>> mempool model without re-dimensioning extra memory. The effect of this is 
>>> use case dependent (number of ports, RXQs, MTU settings, number of PMDs 
>>> etc.) The previous 'common-pool' model was rudimentary in estimating the 
>>> mempool size and was marked as something that was to be improved upon. The 
>>> memory allocation calculation for per port model was modified to take the 
>>> possible configuration factors mentioned into account.
>>>
>>> It's unfortunate that this came to light so close to the release code 
>>> freeze - but better late than never as it is a valid problem to be resolved.
>>>
>>> I wanted to highlight some options to the community as I don't think the 
>>> next steps should be taken in isolation due to the impact this feature has.
>>>
>>> There are a number of possibilities for the 2.9 release.
>>>
>>> (i) Revert the mempool per port patches and return to the shared mempool 
>>> model. There are a number of features and refactoring in place on top of 
>>> the change so this will not be a simple revert. I'm investigating what 
>>> exactly is involved with this currently.
>>
>> This looks like a bad thing to do. Shared memory pools have their own issues
>> and hide the real memory usage of each port. Also, reverting seems almost
>> impossible; we'd have to re-implement it from scratch.
 (ii) Leave the per port mempool implementation as is, flag to users
> that memory requirements have increased. Extra memory may have to be
> provided on a per use case basis.
>>
>> That's a good point. I prefer this option if possible.
> 
> With 4 PMDs and 4 rxqs per port, a user would only be able to add 7 ports
> per socket now, or 9 ports with 1 rxq. Of course a user may have some
> headroom in the memory they have allocated, but that can't be assumed. It
> will surely mean some users' setups will not work after upgrade - and then
> they will have to debug why.
> 
> If we just reduce the MIN_NB_MBUF to 8192, it would make it 10 ports/4
> rxqs, or 13 ports/1 rxqs without additional memory. It still gives
> 24K/18K buffers per port.
> 
> What do you think?

You're talking mostly about physical ports. But what about virtual ones?
They will have only ~10K mbufs.

In my usual installation I have a bond of 2 HW NICs with 4096 TX
descriptors each, so 8192 mbufs can be consumed by the HW TX queues
alone. With some fluctuations, possible sends to different TX queues
on the HW, and possible sends to other HW NICs, I'll have an almost
100% probability of mempool exhaustion even in that simple
installation. And there are more complicated installations too
(4 bonded NICs, for example).
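
A minimal sketch of that exhaustion arithmetic, assuming 2 bonded NICs with
4096 TX descriptors each and a vhost port pool of roughly 10K mbufs (all
values are assumptions for illustration):

#include <stdio.h>

int
main(void)
{
    const unsigned n_hw_nics = 2;      /* bonded physical ports */
    const unsigned txq_desc = 4096;    /* TX descriptors per NIC */
    const unsigned vhost_pool = 10240; /* assumed per-vhost-port mempool */

    /* mbufs that can sit in the HW TX rings before being freed. */
    unsigned pinned = n_hw_nics * txq_desc;

    printf("Up to %u mbufs pinned in HW TX rings vs. a %u-mbuf vhost pool\n",
           pinned, vhost_pool);
    return 0;
}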

> 
>>
>>> (iii) Reduce the amount of memory allocated per mempool per port. An RFC to 
>>> this effect was submitted by Kevin but on follow up the feeling is that it 
>>> does not resolve the issue adequately.
>>
> 
> Right, we discussed it in the other thread. It fixes some of the missed
> factors for worst case, but Pradeep pointed out there are more. To
> account for them all would be difficult and mean a lot of extra mbufs
> per port.
> 
>> IMHO, we should not reduce the mempool size by hardcoding a small value.
>> Different installations have their own requirements because of different
>> numbers of HW NICs and CPU cores. I believe that 20-30K is a normal size per-por

Re: [ovs-discuss] Mempool issue for OVS 2.9

2018-01-26 Thread Stokes, Ian
> -Original Message-
> From: Kevin Traynor [mailto:ktray...@redhat.com]
> Sent: Friday, January 26, 2018 3:48 PM
> To: Ilya Maximets ; Stokes, Ian
> ; ovs-discuss@openvswitch.org
> Cc: Flavio Leitner ; Loftus, Ciara
> ; Kavanagh, Mark B ;
> Jan Scheurich (jan.scheur...@ericsson.com) ;
> Ben Pfaff (b...@ovn.org) ; acon...@redhat.com; Venkatesan
> Pradeep 
> Subject: Re: Mempool issue for OVS 2.9
> 
> On 01/26/2018 03:16 PM, Ilya Maximets wrote:
> > On 26.01.2018 15:00, Stokes, Ian wrote:
> >> Hi All,
> >>
> >> Recently an issue was raised regarding the move from a single shared
> mempool model that was in place up to OVS 2.8, to a mempool per port model
> introduced in 2.9.
> >>
> >> https://mail.openvswitch.org/pipermail/ovs-discuss/2018-January/04602
> >> 1.html
> >>
> >> The per port mempool model was introduced in September 5th with commit
> d555d9bded to allow fine grain control on a per port case of memory usage.
> >>
> >> In the 'common/shared mempool' model, ports sharing a similar MTU would
> all share the same buffer mempool (e.g. the most common example of this
> being that all ports are by default created with a 1500B MTU, and as such
> share the same mbuf mempool).
> >>
> >> This approach had some drawbacks however. For example, with the shared
> memory pool model a user could exhaust the shared memory pool (for
> instance by requesting a large number of RXQs for a port), this would
> cause the vSwitch to crash as any remaining ports would not have the
> required memory to function. This bug was discovered and reported to the
> community in late 2016 https://mail.openvswitch.org/pipermail/ovs-
> discuss/2016-September/042560.html.
> >>
> >> The per port mempool patch aimed to avoid such issues by allocating a
> separate buffer mempool to each port.
> >>
> >> An issue has been flagged on ovs-discuss, whereby memory dimensions
> provided for a given number of ports on OvS 2.6-2.8 may be insufficient to
> support the same number of ports in OvS 2.9, on account of the per-port
> mempool model without re-dimensioning extra memory. The effect of this is
> use case dependent (number of ports, RXQs, MTU settings, number of PMDs
> etc.) The previous 'common-pool' model was rudimentary in estimating the
> mempool size and was marked as something that was to be improved upon. The
> memory allocation calculation for per port model was modified to take the
> possible configuration factors mentioned into account.
> >>
> >> It's unfortunate that this came to light so close to the release code
> freeze - but better late than never as it is a valid problem to be
> resolved.
> >>
> >> I wanted to highlight some options to the community as I don't think
> the next steps should be taken in isolation due to the impact this feature
> has.
> >>
> >> There are a number of possibilities for the 2.9 release.
> >>
> >> (i) Revert the mempool per port patches and return to the shared
> mempool model. There are a number of features and refactoring in place on
> top of the change so this will not be a simple revert. I'm investigating
> what exactly is involved with this currently.
> >
> > This looks like a bad thing to do. Shared memory pools have their own
> > issues and hide the real memory usage of each port. Also, reverting
> > seems almost impossible; we'd have to re-implement it from scratch.

I would agree, reverting isn't as straightforward an option due to the number
of commits that were introduced in relation to the per-port mempool feature
over time (listed below for completeness).

netdev-dpdk: Create separate memory pool for each port: d555d9b
netdev-dpdk: fix management of pre-existing mempools:b6b26021d
netdev-dpdk: Fix mempool names to reflect socket id: f06546a
netdev-dpdk: skip init for existing mempools: 837c176
netdev-dpdk: manage failure in mempool name creation: 65056fd 
netdev-dpdk: Reword mp_size as n_mbufs: ad9b5b9
netdev-dpdk: Rename dpdk_mp_put as dpdk_mp_free: a08a115
netdev-dpdk: Fix mp_name leak on snprintf failure: ec6edc8
netdev-dpdk: Fix dpdk_mp leak in case of EEXIST: 173ef76
netdev-dpdk: Factor out struct dpdk_mp: 24e78f9
netdev-dpdk: Remove unused MAX_NB_MBUF: bc57ed9
netdev-dpdk: Fix mempool creation with large MTU: af5b0da
netdev-dpdk: Add debug appctl to get mempool information: be48173

Although a lot of these are fixes/formatting, we would have to introduce a new
series and re-introduce the shared model by removing the per-port specifics.
This would also require extensive testing.

Functionality such as the mempool debug tool would also be impacted (I'm not
sure how valuable it would be for a single shared mempool).

I can look at putting an RFC for this together so as to keep the option on the 
table if it's preferred within the 2.9 time window.

> >>> (ii) Leave the per port mempool implementation as is, flag to users
> that memory requirements have increased. Extra memory may have to be
> provided on a per use case basis.
> >
> > That's a good point. I prefer this optio

Re: [ovs-discuss] Mempool issue for OVS 2.9

2018-01-26 Thread Flavio Leitner
On Fri, Jan 26, 2018 at 05:16:02PM +0300, Ilya Maximets wrote:
> On 26.01.2018 15:00, Stokes, Ian wrote:
> > Hi All,
> > 
> > Recently an issue was raised regarding the move from a single shared 
> > mempool model that was in place up to OVS 2.8, to a mempool per port model 
> > introduced in 2.9.
> > 
> > https://mail.openvswitch.org/pipermail/ovs-discuss/2018-January/046021.html
> > 
> > The per port mempool model was introduced in September 5th with commit 
> > d555d9bded to allow fine grain control on a per port case of memory usage.
> > 
> > In the 'common/shared mempool' model, ports sharing a similar MTU would all 
> > share the same buffer mempool (e.g. the most common example of this being 
> > that all ports are by default created with a 1500B MTU, and as such share 
> > the same mbuf mempool).
> > 
> > This approach had some drawbacks however. For example, with the shared 
> > memory pool model a user could exhaust the shared memory pool (for instance 
> > by requesting a large number of RXQs for a port), this would cause the 
> > vSwitch to crash as any remaining ports would not have the required memory 
> > to function. This bug was discovered and reported to the community in late 
> > 2016 
> > https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html.
> > 
> > The per port mempool patch aimed to avoid such issues by allocating a 
> > separate buffer mempool to each port.
> > 
> > An issue has been flagged on ovs-discuss, whereby memory dimensions 
> > provided for a given number of ports on OvS 2.6-2.8 may be insufficient to 
> > support the same number of ports in OvS 2.9, on account of the per-port 
> > mempool model without re-dimensioning extra memory. The effect of this is 
> > use case dependent (number of ports, RXQs, MTU settings, number of PMDs 
> > etc.) The previous 'common-pool' model was rudimentary in estimating the 
> > mempool size and was marked as something that was to be improved upon. The 
> > memory allocation calculation for per port model was modified to take the 
> > possible configuration factors mentioned into account.
> > 
> > It's unfortunate that this came to light so close to the release code 
> > freeze - but better late than never as it is a valid problem to be resolved.
> > 
> > I wanted to highlight some options to the community as I don't think the 
> > next steps should be taken in isolation due to the impact this feature has.
> > 
> > There are a number of possibilities for the 2.9 release.
> > 
> > (i) Revert the mempool per port patches and return to the shared mempool 
> > model. There are a number of features and refactoring in place on top of 
> > the change so this will not be a simple revert. I'm investigating what 
> > exactly is involved with this currently.
> 
> This looks like a bad thing to do. Shared memory pools have their own issues
> and hide the real memory usage of each port. Also, reverting seems almost
> impossible; we'd have to re-implement it from scratch.

Agreed.

> > (ii) Leave the per port mempool implementation as is, flag to users that 
> > memory requirements have increased. Extra memory may have to be provided on 
> > a per use case basis.
> 
> That's a good point. I prefer this option if possible.

I was told we have ways to handle this change outside of OVS, which is
not ideal but doable.

It would be nice to have a doc explaining how to estimate the new
amount of memory required though.

 
> > (iii) Reduce the amount of memory allocated per mempool per port. An RFC to 
> > this effect was submitted by Kevin but on follow up the feeling is that it 
> > does not resolve the issue adequately.
> 
> IMHO, we should not reduce the mempool size by hardcoding a small value.
> Different installations have their own requirements because of different
> numbers of HW NICs and CPU cores. I believe that 20-30K is a normal
> per-port size for now.
> 
> How about adding an upper limit for the mempool size, configurable at boot time?
> See the code below for an example (not tested):

Thanks for the quick patch.  I see two problems:
1) Existing environments might stop working after an upgrade due to
insufficient memory.  In this case this solution would help, but then
one could simply increase the amount of hugepages and move on, as it
requires manual intervention in the same way.

2) New installations would need to follow the new documentation, and
I'm not sure how this would help here.

My concern is that this is another tunable to consider, and one most
probably to be misused, unless we think the number of buffers is
overestimated now.


> ---
> diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c
> index 3602180..ed54e06 100644
> --- a/lib/dpdk-stub.c
> +++ b/lib/dpdk-stub.c
> @@ -54,3 +54,9 @@ dpdk_vhost_iommu_enabled(void)

[...]

> > (iv) Introduce a feature to allow users to configure mempool as
> > shared or on a per port basis: This would be the best o

Re: [ovs-discuss] Mempool issue for OVS 2.9

2018-01-26 Thread Jan Scheurich
> -Original Message-
> From: Stokes, Ian [mailto:ian.sto...@intel.com]
> Sent: Friday, 26 January, 2018 13:01
> To: ovs-discuss@openvswitch.org
> Cc: Kevin Traynor ; Flavio Leitner ; 
> Ilya Maximets (i.maxim...@samsung.com)
> ; Loftus, Ciara ; Kavanagh, 
> Mark B ; Jan Scheurich
> ; Ben Pfaff (b...@ovn.org) ; 
> acon...@redhat.com; Venkatesan Pradeep
> 
> Subject: Mempool issue for OVS 2.9
> 
> Hi All,
> 
> Recently an issue was raised regarding the move from a single shared mempool 
> model that was in place up to OVS 2.8, to a mempool per
> port model introduced in 2.9.
> 
> https://mail.openvswitch.org/pipermail/ovs-discuss/2018-January/046021.html
> 
> The per port mempool model was introduced in September 5th with commit 
> d555d9bded to allow fine grain control on a per port case of
> memory usage.
> 
> In the 'common/shared mempool' model, ports sharing a similar MTU would all 
> share the same buffer mempool (e.g. the most common
> example of this being that all ports are by default created with a 1500B MTU, 
> and as such share the same mbuf mempool).
> 
> This approach had some drawbacks however. For example, with the shared memory 
> pool model a user could exhaust the shared memory
> pool (for instance by requesting a large number of RXQs for a port), this 
> would cause the vSwitch to crash as any remaining ports would
> not have the required memory to function. This bug was discovered and 
> reported to the community in late 2016
> https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html.
> 
> The per port mempool patch aimed to avoid such issues by allocating a 
> separate buffer mempool to each port.
> 
> An issue has been flagged on ovs-discuss, whereby memory dimensions provided 
> for a given number of ports on OvS 2.6-2.8 may be
> insufficient to support the same number of ports in OvS 2.9, on account of 
> the per-port mempool model without re-dimensioning extra
> memory. The effect of this is use case dependent (number of ports, RXQs, MTU 
> settings, number of PMDs etc.) The previous 'common-
> pool' model was rudimentary in estimating the mempool size and was marked as 
> something that was to be improved upon. The memory
> allocation calculation for per port model was modified to take the possible 
> configuration factors mentioned into account.
> 
> It's unfortunate that this came to light so close to the release code freeze 
> - but better late than never as it is a valid problem to be
> resolved.
> 
> I wanted to highlight some options to the community as I don't think the next 
> steps should be taken in isolation due to the impact this
> feature has.
> 
> There are a number of possibilities for the 2.9 release.
> 
> (i) Revert the mempool per port patches and return to the shared mempool 
> model. There are a number of features and refactoring in
> place on top of the change so this will not be a simple revert. I'm 
> investigating what exactly is involved with this currently.

The shared mempool concept has been working fairly well in our NFVI systems for
a long time. One can argue that it is too simplistic (as it does not take the
number of ports and queue lengths into account at all) and may be insufficient
in specific scenarios, but at least it can operate with a reasonable amount of
memory.

Out of the very many vhostuser ports, only very few actually carry a significant
amount of traffic. Reserving the theoretical worst-case amount of buffers for
every port separately is complete overkill.

> (ii) Leave the per port mempool implementation as is, flag to users that 
> memory requirements have increased. Extra memory may have
> to be provided on a per use case basis.

[Jan] From my point of view this is a blocker for the roll-out of OVS 2.9 in
existing NFVI cloud systems. On these systems the amount of huge pages allocated
to OVS is fixed by zero-day configuration and there is no easy way to change
that memory allocation as part of a SW upgrade procedure.

As such servers host a significant number of vhostuser ports and rx queues (on
the order of 100 or more), the new per-port mempool scheme will likely cause OVS
to fail after a SW upgrade.

> (iii) Reduce the amount of memory allocated per mempool per port. An RFC to 
> this effect was submitted by Kevin but on follow up the
> feeling is that it does not resolve the issue adequately.

If we can't get the shared mempool back into 2.9 from day one, this might be
an emergency measure to support a reasonable number of ports with typically
deployed huge-page assignments.

> (iv) Introduce a feature to allow users to configure mempool as shared or on 
> a per port basis: This would be the best of both worlds but
> given the proximity to the 2.9 freeze I don't think it's feasible by the end 
> of January.

Yes, too late now. But something that should be backported to 2.9 as soon as we 
have it on master.

I think that we should aim for the combination of adaptive shared mempools as 
default and e

Re: [ovs-discuss] Mempool issue for OVS 2.9

2018-01-26 Thread Venkatesan Pradeep
Response marked [Pradeep]

Thanks,

Pradeep

-Original Message-
From: Jan Scheurich 
Sent: Friday, January 26, 2018 10:26 PM
To: Stokes, Ian ; ovs-discuss@openvswitch.org
Cc: Kevin Traynor ; Flavio Leitner ; Ilya 
Maximets (i.maxim...@samsung.com) ; Loftus, Ciara 
; Kavanagh, Mark B ; Ben 
Pfaff (b...@ovn.org) ; acon...@redhat.com; Venkatesan Pradeep 

Subject: RE: Mempool issue for OVS 2.9

> -Original Message-
> From: Stokes, Ian [mailto:ian.sto...@intel.com]
> Sent: Friday, 26 January, 2018 13:01
> To: ovs-discuss@openvswitch.org
> Cc: Kevin Traynor ; Flavio Leitner 
> ; Ilya Maximets (i.maxim...@samsung.com) 
> ; Loftus, Ciara ; 
> Kavanagh, Mark B ; Jan Scheurich 
> ; Ben Pfaff (b...@ovn.org) ; 
> acon...@redhat.com; Venkatesan Pradeep 
> 
> Subject: Mempool issue for OVS 2.9
> 
> Hi All,
> 
> Recently an issue was raised regarding the move from a single shared 
> mempool model that was in place up to OVS 2.8, to a mempool per port model 
> introduced in 2.9.
> 
> https://mail.openvswitch.org/pipermail/ovs-discuss/2018-January/046021
> .html
> 
> The per port mempool model was introduced in September 5th with commit 
> d555d9bded to allow fine grain control on a per port case of memory usage.
> 
> In the 'common/shared mempool' model, ports sharing a similar MTU 
> would all share the same buffer mempool (e.g. the most common example of this 
> being that all ports are by default created with a 1500B MTU, and as such 
> share the same mbuf mempool).
> 
> This approach had some drawbacks however. For example, with the shared 
> memory pool model a user could exhaust the shared memory pool (for 
> instance by requesting a large number of RXQs for a port), this would 
> cause the vSwitch to crash as any remaining ports would not have the required 
> memory to function. This bug was discovered and reported to the community in 
> late 2016 
> https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html.
> 
> The per port mempool patch aimed to avoid such issues by allocating a 
> separate buffer mempool to each port.
> 
> An issue has been flagged on ovs-discuss, whereby memory dimensions 
> provided for a given number of ports on OvS 2.6-2.8 may be 
> insufficient to support the same number of ports in OvS 2.9, on 
> account of the per-port mempool model without re-dimensioning extra 
> memory. The effect of this is use case dependent (number of ports, RXQs, MTU 
> settings, number of PMDs etc.) The previous 'common- pool' model was 
> rudimentary in estimating the mempool size and was marked as something that 
> was to be improved upon. The memory allocation calculation for per port model 
> was modified to take the possible configuration factors mentioned into 
> account.
> 
> It's unfortunate that this came to light so close to the release code 
> freeze - but better late than never as it is a valid problem to be resolved.
> 
> I wanted to highlight some options to the community as I don't think 
> the next steps should be taken in isolation due to the impact this feature 
> has.
> 
> There are a number of possibilities for the 2.9 release.
> 
> (i) Revert the mempool per port patches and return to the shared 
> mempool model. There are a number of features and refactoring in place on top 
> of the change so this will not be a simple revert. I'm investigating what 
> exactly is involved with this currently.

The shared mempool concept has been working fairly well in our NFVI systems for
a long time. One can argue that it is too simplistic (as it does not take the
number of ports and queue lengths into account at all) and may be insufficient
in specific scenarios, but at least it can operate with a reasonable amount of
memory.

Out of the very many vhostuser ports, only very few actually carry a significant
amount of traffic. Reserving the theoretical worst-case amount of buffers for
every port separately is complete overkill.

> (ii) Leave the per port mempool implementation as is, flag to users 
> that memory requirements have increased. Extra memory may have to be provided 
> on a per use case basis.

[Jan] From my point of view this is a blocker for the roll-out of OVS 2.9 in
existing NFVI cloud systems. On these systems the amount of huge pages allocated
to OVS is fixed by zero-day configuration and there is no easy way to change
that memory allocation as part of a SW upgrade procedure.

As such servers host a significant number of vhostuser ports and rx queues (on
the order of 100 or more), the new per-port mempool scheme will likely cause OVS
to fail after a SW upgrade.

> (iii) Reduce the amount of memory allocated per mempool per port. An 
> RFC to this effect was submitted by Kevin but on follow up the feeling is 
> that it does not resolve the issue adequately.

If we can't get the shared mempool back into 2.9 from day one, this might be
an emergency measure to support a reasonable number of ports with typically
deployed huge-page assignments.

[Pradeep] Th

Re: [ovs-discuss] Mempool issue for OVS 2.9

2018-01-28 Thread Venkatesan Pradeep
Some more observations

Regards,

Pradeep

> -Original Message-
> From: ovs-discuss-boun...@openvswitch.org [mailto:ovs-discuss-
> boun...@openvswitch.org] On Behalf Of Venkatesan Pradeep
> Sent: Friday, January 26, 2018 11:04 PM
> To: Jan Scheurich ; Stokes, Ian
> ; ovs-discuss@openvswitch.org
> Cc: Ilya Maximets (i.maxim...@samsung.com) ;
> Flavio Leitner 
> Subject: Re: [ovs-discuss] Mempool issue for OVS 2.9
> 
> Response marked [Pradeep]
> 
> Thanks,
> 
> Pradeep
> 
> -Original Message-
> From: Jan Scheurich
> Sent: Friday, January 26, 2018 10:26 PM
> To: Stokes, Ian ; ovs-discuss@openvswitch.org
> Cc: Kevin Traynor ; Flavio Leitner ;
> Ilya Maximets (i.maxim...@samsung.com) ;
> Loftus, Ciara ; Kavanagh, Mark B
> ; Ben Pfaff (b...@ovn.org) ;
> acon...@redhat.com; Venkatesan Pradeep
> 
> Subject: RE: Mempool issue for OVS 2.9
> 
> > -Original Message-
> > From: Stokes, Ian [mailto:ian.sto...@intel.com]
> > Sent: Friday, 26 January, 2018 13:01
> > To: ovs-discuss@openvswitch.org
> > Cc: Kevin Traynor ; Flavio Leitner
> > ; Ilya Maximets (i.maxim...@samsung.com)
> > ; Loftus, Ciara ;
> > Kavanagh, Mark B ; Jan Scheurich
> > ; Ben Pfaff (b...@ovn.org) ;
> > acon...@redhat.com; Venkatesan Pradeep
> > 
> > Subject: Mempool issue for OVS 2.9
> >
> > Hi All,
> >
> > Recently an issue was raised regarding the move from a single shared
> > mempool model that was in place up to OVS 2.8, to a mempool per port
> model introduced in 2.9.
> >
> > https://mail.openvswitch.org/pipermail/ovs-discuss/2018-January/046021
> > .html
> >
> > The per port mempool model was introduced in September 5th with commit
> > d555d9bded to allow fine grain control on a per port case of memory usage.
> >
> > In the 'common/shared mempool' model, ports sharing a similar MTU
> > would all share the same buffer mempool (e.g. the most common example of
> this being that all ports are by default created with a 1500B MTU, and as such
> share the same mbuf mempool).
> >
> > This approach had some drawbacks however. For example, with the shared
> > memory pool model a user could exhaust the shared memory pool (for
> > instance by requesting a large number of RXQs for a port), this would
> > cause the vSwitch to crash as any remaining ports would not have the
> required memory to function. This bug was discovered and reported to the
> community in late 2016 https://mail.openvswitch.org/pipermail/ovs-
> discuss/2016-September/042560.html.
> >
> > The per port mempool patch aimed to avoid such issues by allocating a
> separate buffer mempool to each port.
> >
> > An issue has been flagged on ovs-discuss, whereby memory dimensions
> > provided for a given number of ports on OvS 2.6-2.8 may be
> > insufficient to support the same number of ports in OvS 2.9, on
> > account of the per-port mempool model without re-dimensioning extra
> > memory. The effect of this is use case dependent (number of ports, RXQs,
> MTU settings, number of PMDs etc.) The previous 'common- pool' model was
> rudimentary in estimating the mempool size and was marked as something
> that was to be improved upon. The memory allocation calculation for per port
> model was modified to take the possible configuration factors mentioned into
> account.
> >
> > It's unfortunate that this came to light so close to the release code
> > freeze - but better late than never as it is a valid problem to be resolved.
> >
> > I wanted to highlight some options to the community as I don't think
> > the next steps should be taken in isolation due to the impact this feature 
> > has.
> >
> > There are a number of possibilities for the 2.9 release.
> >
> > (i) Revert the mempool per port patches and return to the shared
> > mempool model. There are a number of features and refactoring in place on
> top of the change so this will not be a simple revert. I'm investigating what
> exactly is involved with this currently.
> 
> The shared mempool concept has been working fairly well in our NFVI systems
> for a long time. One can argue that it is too simplistic (as it does not at 
> all take
> the number of ports and queue lengths into account) and may be insufficient in
> specific scenarios but at least it can operate with a reasonable amount of
> memory.
> 
> Out of the very many vhostuser ports only very few actually carry significant
> amount of traffic. To reserve the theoretical worst case amount of buffers for
> every port separately is complete overkill.
> 
> > (ii

Re: [ovs-discuss] Mempool issue for OVS 2.9

2018-01-29 Thread Ilya Maximets
On 29.01.2018 11:50, Jan Scheurich wrote:
>> -Original Message-
>> From: Ilya Maximets [mailto:i.maxim...@samsung.com]
>> Sent: Monday, 29 January, 2018 09:35
>> To: Jan Scheurich ; Venkatesan Pradeep 
>> ; Stokes, Ian
>> ; d...@openvswitch.org
>> Cc: Kevin Traynor ; Flavio Leitner ; 
>> Loftus, Ciara ; Kavanagh, Mark B
>> ; Ben Pfaff (b...@ovn.org) ; 
>> acon...@redhat.com; disc...@openvswitch.org
>> Subject: Re: Mempool issue for OVS 2.9
>>
>> On 29.01.2018 11:19, Jan Scheurich wrote:
>>> Hi,
>>>
>>> I'd like to take one step back and look at how many mbufs we actually
>>> need.
>>>
>>> Today mbufs are consumed in the following places:
>>>
>>>  1. Rx queues of **physical** dpdk ports: dev->requested_n_rxq * 
>>> dev->requested_rxq_size
>>> Note 1: These mbufs are hogged up at all times.
>>> Note 2: There is little point in configuring more rx queues per phy 
>>> port than there are PMDs to poll them.
>>> Note 3: The rx queues of vhostuser ports exist as virtqueues in the 
>>> guest and do not hog mbufs.
>>>  2. One batch per PMD during processing: #PMD * NETDEV_MAX_BURST
>>>  3. One batch per tx queue with time-based tx batching: 
>>> dev->requested_n_txq * NETDEV_MAX_BURST
>>>  4. Tx queues of **physical** ports: dev->requested_n_txq * expected peak 
>>> tx queue fill level
>>> Note 1:  The maximum of 2K mbufs per tx queue can only be reached if 
>>> the OVS transmit rate exceeds the line rate for a long time.
>> This can only happen for large packets and when the traffic originates from 
>> VMs on the compute node. This would be a case of under-
>> dimensioning and packets would be dropped in any case. Excluding that 
>> scenario, a typical peak tx queue fill level would be when all
>> PMDs transmit a full batch at the same time: #PMDs * NETDEV_MAX_BURST.
>>
>> Above assumption is wrong. Just look at ixgbe driver:
>> drivers/net/ixgbe/ixgbe_rxtx.c: tx_xmit_pkts():
>>
>>/*
>> * Begin scanning the H/W ring for done descriptors when the
>> * number of available descriptors drops below tx_free_thresh.  For
>> * each done descriptor, free the associated buffer.
>> */
>>if (txq->nb_tx_free < txq->tx_free_thresh)
>>┊   ixgbe_tx_free_bufs(txq);
>>
>> The default value of 'tx_free_thresh' is 32. So, if I configure the number
>> of TX descriptors to 4096, the driver will start to free mbufs only once it
>> has more than 4063 mbufs inside its TX queue, no matter how frequently the
>> send() function is called.
> 
> OK, but that doesn't change my general argument. The mbufs hogged in the tx 
> side of the phy port driver are coming from all ports (least likely the port 
> itself). Considering them in dimensioning the port's private mempool is 
> conceptually wrong. In my simplified dimensioning formula below I have 
> already assumed full occupancy of the tx queue for phy ports. The second key 
> observation is that vhostuser ports do not hog mbufs at all. And vhost zero 
> copy doesn't change that.

The formula below may be good for a static environment. I want to change the
number of PMD threads dynamically in my deployments, and that works in the
current per-port model and with an oversized shared pool. If we try to reduce
the memory consumption of the shared pool, we'll have to reconfigure all the
devices each time we change the number of PMD threads. That would be really bad.
So, the size of the memory pool should not depend on dynamic characteristics of
the datapath or of other ports, to avoid unexpected interruptions in traffic
flows after random configuration changes. Of course, it could depend on
characteristics of the port itself in the per-port model. In the shared mempool
model the size should only depend on the static datapath configuration.

> 
> BTW, is there any reason why phy drivers should free tx mbufs only when the
> tx ring is close to becoming full? I understand the need to free them in
> batches for performance reasons, but is there no cheap possibility to do
> this earlier?
> 
>>
>>> Note 2: Vhostuser ports do not store mbufs in tx queues due to copying 
>>> to virtio descriptors
>>>
>>>
>>> For concreteness let us use an example of a typical, rather large OVS 
>>> deployment in an NFVI cloud:
>>>
>>>   * Two cores with 4 PMDs per NUMA socket using HT.
>>>   * Two physical ports using RSS over 4 rx queues to enable load-sharing 
>>> over the 4 local PMDs and 9 tx queues (8 PMDs plus non PMD)
>>>   * 100 vhostuser ports with a varying number of rx and tx queue pairs (128 
>>> in total).
>>>
>>>
>>> In the above example deployments this translates into
>>>
>>>  1. 4 * 2K = 8K mbufs per physical port (16K in total)
>>>  2. 8 * 32 = 256 mbufs total
>>>  3. (128 + 2*9) * 32 = 4672 mbufs in total
>>>  4. 9 * 32 = 288 mbufs per physical port (Adding some safety margin, a 
>>> total of 2K mbufs)
>>>
>>> ---
>>> Total : 23K mbufs
>>>
>>> This is way lower than the size of the earlier shared mempool (256K mbufs

Re: [ovs-discuss] Mempool issue for OVS 2.9

2018-01-29 Thread Ilya Maximets
On 29.01.2018 11:19, Jan Scheurich wrote:
> Hi,
>  
> I'd like to take one step back and look at how many mbufs we actually
> need.
>  
> Today mbufs are consumed in the following places:
> 
>  1. Rx queues of **physical** dpdk ports: dev->requested_n_rxq * 
> dev->requested_rxq_size
> Note 1: These mbufs are hogged up at all times.
> Note 2: There is little point in configuring more rx queues per phy port 
> than there are PMDs to poll them.
> Note 3: The rx queues of vhostuser ports exist as virtqueues in the guest 
> and do not hog mbufs.
>  2. One batch per PMD during processing: #PMD * NETDEV_MAX_BURST
>  3. One batch per tx queue with time-based tx batching: dev->requested_n_txq 
> * NETDEV_MAX_BURST
>  4. Tx queues of **physical** ports: dev->requested_n_txq * expected peak tx 
> queue fill level
> Note 1:  The maximum of 2K mbufs per tx queue can only be reached if the 
> OVS transmit rate exceeds the line rate for a long time. This can only happen 
> for large packets and when the traffic originates from VMs on the compute 
> node. This would be a case of under-dimensioning and packets would be dropped 
> in any case. Excluding that scenario, a typical peak tx queue fill level 
> would be when all PMDs transmit a full batch at the same time: #PMDs * 
> NETDEV_MAX_BURST.

Above assumption is wrong. Just look at ixgbe driver:
drivers/net/ixgbe/ixgbe_rxtx.c: tx_xmit_pkts():

   /*
* Begin scanning the H/W ring for done descriptors when the
* number of available descriptors drops below tx_free_thresh.  For
* each done descriptor, free the associated buffer.
*/
   if (txq->nb_tx_free < txq->tx_free_thresh)
   ┊   ixgbe_tx_free_bufs(txq);

The default value of 'tx_free_thresh' is 32. So, if I configure the number
of TX descriptors to 4096, the driver will start to free mbufs only once it
has more than 4063 mbufs inside its TX queue, no matter how frequently the
send() function is called.

> Note 2: Vhostuser ports do not store mbufs in tx queues due to copying to 
> virtio descriptors
> 
>  
> For concreteness let us use an example of a typical, rather large OVS 
> deployment in an NFVI cloud:
> 
>   * Two cores with 4 PMDs per NUMA socket using HT.
>   * Two physical ports using RSS over 4 rx queues to enable load-sharing over 
> the 4 local PMDs and 9 tx queues (8 PMDs plus non PMD)
>   * 100 vhostuser ports with a varying number of rx and tx queue pairs (128 
> in total).
> 
>  
> In the above example deployments this translates into
> 
>  1. 4 * 2K = 8K mbufs per physical port (16K in total)
>  2. 8 * 32 = 256 mbufs total
>  3. (128 + 2*9) * 32 = 4672 mbufs in total
>  4. 9 * 32 = 288 mbufs per physical port (Adding some safety margin, a total 
> of 2K mbufs)
> 
> ---
> Total : 23K mbufs
>  
> This is way lower than the size of the earlier shared mempool (256K mbufs), 
> which explains why we have never observed out of mbuf drops in our NFVI 
> deployments. The vswitchd crash that triggered the change to per-port 
> mempools only happened because they tried to configure 64 rx and tx queues 
> per physical port for multiple ports. I can’t see any reason for configuring 
> more rx and tx queues than polling PMDs, though.
>  
> The actual consumption of mbufs scales primarily with the number of physical 
> ports (a, c and d) and only to a much lower degree with the number of vhost 
> ports/queues (c).
>  
> Except for the phy rx queues, all other cases buffer a statistical mix of 
> mbufs received on all ports. There seems little point in assigning per-port 
> mempools for these.
>  
> I think we should revert to a shared mempool (per MTU size) with a simple 
> dimensioning formula that only depends on the number of physical ports and 
> the number of PMDs, both of which are zero day configuration parameters that 
> are set by OVS users.
>  
> For example:
> #mbuf = SUM over physical ports [n_rxq * rxq_size + (#PMDs + 1) * txq_size] + 16K
>  
> The fixed 16K would cover for b) and c) for up to 512 vhostuser tx queues, 
> which should be ample.
> In the above example this result in 2 * [ 4 * 2K + 9 * 2K ] + 8K  = 60K mbufs.
>  
> BR, Jan
>  
>> -Original Message-
>> From: Venkatesan Pradeep
>> Sent: Friday, 26 January, 2018 18:34
>> To: Jan Scheurich ; Stokes, Ian 
>> ; ovs-discuss@openvswitch.org
>> Cc: Kevin Traynor ; Flavio Leitner ; 
>> Ilya Maximets (i.maxim...@samsung.com)
>> ; Loftus, Ciara ; Kavanagh, 
>> Mark B ; Ben Pfaff
>> (b...@ovn.org) ; acon...@redhat.com
>> Subject: RE: Mempool issue for OVS 2.9
>> 
>> Response marked [Pradeep]
>> 
>> Thanks,
>> 
>> Pradeep
>> 
>> -Original Message-
>> From: Jan Scheurich
>> Sent: Friday, January 26, 2018 10:26 PM
>> To: Stokes, Ian mailto:ian.sto...@intel.com>>; 
>> ovs-discuss@openvswitch.org 
>> Cc: Kevin Traynor mailto:ktray...@redhat.com>>; Flavio 
>> Leitner mailto:f...@redhat.com>>; Ilya Maximets 
>> (i.maxim...@samsung.com 

Re: [ovs-discuss] Mempool issue for OVS 2.9

2018-01-29 Thread Ilya Maximets
On 28.01.2018 21:34, Venkatesan Pradeep wrote:
> Some more observations
> 
> Regards,
> 
> Pradeep
> 
>> -Original Message-   
>> From: ovs-discuss-boun...@openvswitch.org [mailto:ovs-discuss-
>> boun...@openvswitch.org] On Behalf Of Venkatesan Pradeep
>> Sent: Friday, January 26, 2018 11:04 PM
>> To: Jan Scheurich ; Stokes, Ian
>> ; ovs-discuss@openvswitch.org
>> Cc: Ilya Maximets (i.maxim...@samsung.com) ;
>> Flavio Leitner 
>> Subject: Re: [ovs-discuss] Mempool issue for OVS 2.9
>>
>> Response marked [Pradeep]
>>
>> Thanks,
>>
>> Pradeep
>>
>> -Original Message-
>> From: Jan Scheurich
>> Sent: Friday, January 26, 2018 10:26 PM
>> To: Stokes, Ian ; ovs-discuss@openvswitch.org
>> Cc: Kevin Traynor ; Flavio Leitner ;
>> Ilya Maximets (i.maxim...@samsung.com) ;
>> Loftus, Ciara ; Kavanagh, Mark B
>> ; Ben Pfaff (b...@ovn.org) ;
>> acon...@redhat.com; Venkatesan Pradeep
>> 
>> Subject: RE: Mempool issue for OVS 2.9
>>
>>> -Original Message-
>>> From: Stokes, Ian [mailto:ian.sto...@intel.com]
>>> Sent: Friday, 26 January, 2018 13:01
>>> To: ovs-discuss@openvswitch.org
>>> Cc: Kevin Traynor ; Flavio Leitner
>>> ; Ilya Maximets (i.maxim...@samsung.com)
>>> ; Loftus, Ciara ;
>>> Kavanagh, Mark B ; Jan Scheurich
>>> ; Ben Pfaff (b...@ovn.org) ;
>>> acon...@redhat.com; Venkatesan Pradeep
>>> 
>>> Subject: Mempool issue for OVS 2.9
>>>
>>> Hi All,
>>>
>>> Recently an issue was raised regarding the move from a single shared
>>> mempool model that was in place up to OVS 2.8, to a mempool per port
>> model introduced in 2.9.
>>>
>>> https://mail.openvswitch.org/pipermail/ovs-discuss/2018-January/046021
>>> .html
>>>
>>> The per port mempool model was introduced in September 5th with commit
>>> d555d9bded to allow fine grain control on a per port case of memory usage.
>>>
>>> In the 'common/shared mempool' model, ports sharing a similar MTU
>>> would all share the same buffer mempool (e.g. the most common example of
>> this being that all ports are by default created with a 1500B MTU, and as 
>> such
>> share the same mbuf mempool).
>>>
>>> This approach had some drawbacks however. For example, with the shared
>>> memory pool model a user could exhaust the shared memory pool (for
>>> instance by requesting a large number of RXQs for a port), this would
>>> cause the vSwitch to crash as any remaining ports would not have the
>> required memory to function. This bug was discovered and reported to the
>> community in late 2016 https://mail.openvswitch.org/pipermail/ovs-
>> discuss/2016-September/042560.html.
>>>
>>> The per port mempool patch aimed to avoid such issues by allocating a
>> separate buffer mempool to each port.
>>>
>>> An issue has been flagged on ovs-discuss, whereby memory dimensions
>>> provided for a given number of ports on OvS 2.6-2.8 may be
>>> insufficient to support the same number of ports in OvS 2.9, on
>>> account of the per-port mempool model without re-dimensioning extra
>>> memory. The effect of this is use case dependent (number of ports, RXQs,
>> MTU settings, number of PMDs etc.) The previous 'common- pool' model was
>> rudimentary in estimating the mempool size and was marked as something
>> that was to be improved upon. The memory allocation calculation for per port
>> model was modified to take the possible configuration factors mentioned into
>> account.
>>>
>>> It's unfortunate that this came to light so close to the release code
>>> freeze - but better late than never as it is a valid problem to be resolved.
>>>
>>> I wanted to highlight some options to the community as I don't think
>>> the next steps should be taken in isolation due to the impact this feature 
>>> has.
>>>
>>> There are a number of possibilities for the 2.9 release.
>>>
>>> (i) Revert the mempool per port patches and return to the shared
>>> mempool model. There are a number of features and refactoring in place on
>> top of the change so this will not be a simple revert. I'm investigating what
>> exactly is involved with this currently.
>>
>> The shared mempool concept has been working fairly well in our NFVI systems
>> for a long time. One can argue that it is too simplistic (as it does not at 
>> all take
>

Re: [ovs-discuss] Mempool issue for OVS 2.9

2018-01-29 Thread Jan Scheurich
Hi,

I'd like to take one step back and look at how many mbufs we actually need.

Today mbufs are consumed in the following places:
a)  Rx queues of *physical* dpdk ports: dev->requested_n_rxq * 
dev->requested_rxq_size
   Note 1: These mbufs are hogged up at all times.
   Note 2: There is little point in configuring more rx queues per phy port 
than there are PMDs to poll them.
   Note 3: The rx queues of vhostuser ports exist as virtqueues in the guest 
and do not hog mbufs.
b)  One batch per PMD during processing: #PMD * NETDEV_MAX_BURST
c)  One batch per tx queue with time-based tx batching: 
dev->requested_n_txq * NETDEV_MAX_BURST
d)  Tx queues of *physical* ports: dev->requested_n_txq * expected peak tx 
queue fill level
Note 1:  The maximum of 2K mbufs per tx queue can only be reached if the OVS 
transmit rate exceeds the line rate for a long time. This can only happen for 
large packets and when the traffic originates from VMs on the compute node. 
This would be a case of under-dimensioning and packets would be dropped in any 
case. Excluding that scenario, a typical peak tx queue fill level would be when 
all PMDs transmit a full batch at the same time: #PMDs * NETDEV_MAX_BURST.
Note 2: Vhostuser ports do not store mbufs in tx queues due to copying to 
virtio descriptors

For concreteness let us use an example of a typical, rather large OVS 
deployment in an NFVI cloud:
*   Two cores with 4 PMDs per NUMA socket using HT.
*   Two physical ports using RSS over 4 rx queues to enable load-sharing 
over the 4 local PMDs and 9 tx queues (8 PMDs plus non PMD)
*   100 vhostuser ports with a varying number of rx and tx queue pairs (128 
in total).

In the above example deployments this translates into
a)  4 * 2K = 8K mbufs per physical port (16K in total)
b)  8 * 32 = 256 mbufs total
c)  (128 + 2*9) * 32 = 4672 mbufs in total
d)  9 * 32 = 288 mbufs per physical port (Adding some safety margin, a 
total of 2K mbufs)
---
Total : 23K mbufs
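
A small sketch that reproduces the a)-d) tally above for this example
deployment; all figures are taken from the breakdown in this mail, and the 2K
value for d) already includes the safety margin:

#include <stdio.h>

#define NETDEV_MAX_BURST 32u

int
main(void)
{
    unsigned n_phy = 2, n_rxq = 4, rxq_size = 2048, n_txq_phy = 9;
    unsigned n_pmd = 8, n_txq_total = 128 + n_phy * n_txq_phy;

    unsigned a = n_phy * n_rxq * rxq_size;        /* phy rx rings           */
    unsigned b = n_pmd * NETDEV_MAX_BURST;        /* in-flight PMD batches  */
    unsigned c = n_txq_total * NETDEV_MAX_BURST;  /* time-based tx batching */
    unsigned d = 2048;                            /* phy tx rings + margin  */

    printf("a=%u b=%u c=%u d=%u total=%u mbufs\n", a, b, c, d, a + b + c + d);
    return 0;
}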

This is way lower than the size of the earlier shared mempool (256K mbufs), 
which explains why we have never observed out of mbuf drops in our NFVI 
deployments. The vswitchd crash that triggered the change to per-port mempools 
only happened because they tried to configure 64 rx and tx queues per physical 
port for multiple ports. I can't see any reason for configuring more rx and tx 
queues than polling PMDs, though.

The actual consumption of mbufs scales primarily with the number of physical 
ports (a, c and d) and only to a much lower degree with the number of vhost 
ports/queues (c).

Except for the phy rx queues, all other cases buffer a statistical mix of mbufs 
received on all ports. There seems little point in assigning per-port mempools 
for these.

I think we should revert to a shared mempool (per MTU size) with a simple 
dimensioning formula that only depends on the number of physical ports and the 
number of PMDs, both of which are zero day configuration parameters that are 
set by OVS users.

For example:
 #mbuf = SUM over physical ports [n_rxq * rxq_size + (#PMDs + 1) * txq_size] + 16K

The fixed 16K would cover b) and c) for up to 512 vhostuser tx queues, which 
should be ample.
In the above example this results in 2 * [4 * 2K + 9 * 2K] + 16K = 68K mbufs.
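
A sketch of that dimensioning as code (illustrative only; the struct and helper 
names are mine, the formula is the one above):

    #include <stdio.h>

    struct phy_port { unsigned n_rxq, rxq_size, n_txq, txq_size; };

    static unsigned
    shared_pool_size(const struct phy_port *ports, unsigned n_ports)
    {
        unsigned n_mbufs = 512 * 32;    /* fixed 16K headroom for b) and c) */

        for (unsigned i = 0; i < n_ports; i++) {
            n_mbufs += ports[i].n_rxq * ports[i].rxq_size
                       + ports[i].n_txq * ports[i].txq_size;  /* n_txq = #PMDs + 1 */
        }
        return n_mbufs;
    }

    int main(void)
    {
        /* The example above: 2 phy ports, 4 rxqs and 9 txqs of 2K descriptors. */
        struct phy_port ports[2] = {
            { 4, 2048, 9, 2048 },
            { 4, 2048, 9, 2048 },
        };
        printf("shared pool: %u mbufs\n", shared_pool_size(ports, 2));  /* 68K */
        return 0;
    }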

BR, Jan

> -Original Message-
> From: Venkatesan Pradeep
> Sent: Friday, 26 January, 2018 18:34
> To: Jan Scheurich ; Stokes, Ian 
> ; ovs-discuss@openvswitch.org
> Cc: Kevin Traynor ; Flavio Leitner ; 
> Ilya Maximets (i.maxim...@samsung.com)
> ; Loftus, Ciara ; Kavanagh, 
> Mark B ; Ben Pfaff
> (b...@ovn.org) ; acon...@redhat.com
> Subject: RE: Mempool issue for OVS 2.9
>
> Response marked [Pradeep]
>
> Thanks,
>
> Pradeep
>
> -Original Message-
> From: Jan Scheurich
> Sent: Friday, January 26, 2018 10:26 PM
> To: Stokes, Ian <ian.sto...@intel.com>; ovs-discuss@openvswitch.org
> Cc: Kevin Traynor <ktray...@redhat.com>; Flavio Leitner <f...@redhat.com>; 
> Ilya Maximets (i.maxim...@samsung.com) <i.maxim...@samsung.com>; Loftus, 
> Ciara <ciara.lof...@intel.com>; Kavanagh, Mark B <mark.b.kavan...@intel.com>; 
> Ben Pfaff (b...@ovn.org) <b...@ovn.org>; acon...@redhat.com; Venkatesan 
> Pradeep <venkatesan.prad...@ericsson.com>
> Subject: RE: Mempool issue for OVS 2.9
>
> > -Original Message-
> > From: Stokes, Ian [mailto:ian.sto...@intel.com]
> > Sent: Friday, 26 January, 2018 13:01
> > To: ovs-discuss@openvswitch.org
> > Cc: Kevin Traynor <ktray...@redhat.com>; Flavio Leitner <f...@redhat.com>; 
> > Ilya Maximets (i.maxim...@samsung.com) <i.maxim...@samsung.com>; Loftus, 
> > Ciara <ciara.lof...@intel.com>; Kavanagh, Mark B 

Re: [ovs-discuss] Mempool issue for OVS 2.9

2018-01-29 Thread Stokes, Ian
> -Original Message-
> From: Kevin Traynor [mailto:ktray...@redhat.com]
> Sent: Monday, January 29, 2018 8:03 AM
> To: Stokes, Ian ; Ilya Maximets
> ; ovs-discuss@openvswitch.org
> Cc: Flavio Leitner ; Loftus, Ciara
> ; Kavanagh, Mark B ;
> Jan Scheurich (jan.scheur...@ericsson.com) ;
> Ben Pfaff (b...@ovn.org) ; acon...@redhat.com; Venkatesan
> Pradeep 
> Subject: Re: Mempool issue for OVS 2.9
> 
> On 01/26/2018 05:27 PM, Stokes, Ian wrote:
> >> -Original Message-
> >> From: Kevin Traynor [mailto:ktray...@redhat.com]
> >> Sent: Friday, January 26, 2018 3:48 PM
> >> To: Ilya Maximets ; Stokes, Ian
> >> ; ovs-discuss@openvswitch.org
> >> Cc: Flavio Leitner ; Loftus, Ciara
> >> ; Kavanagh, Mark B
> >> ; Jan Scheurich
> >> (jan.scheur...@ericsson.com) ; Ben Pfaff
> >> (b...@ovn.org) ; acon...@redhat.com; Venkatesan Pradeep
> >> 
> >> Subject: Re: Mempool issue for OVS 2.9
> >>
> >> On 01/26/2018 03:16 PM, Ilya Maximets wrote:
> >>> On 26.01.2018 15:00, Stokes, Ian wrote:
>  Hi All,
> 
>  Recently an issue was raised regarding the move from a single
>  shared
> >> mempool model that was in place up to OVS 2.8, to a mempool per port
> >> model introduced in 2.9.
> 
>  https://mail.openvswitch.org/pipermail/ovs-discuss/2018-January/046
>  02
>  1.html
> 
>  The per port mempool model was introduced in September 5th with
>  commit
> >> d555d9bded to allow fine grain control on a per port case of memory
> usage.
> 
>  In the 'common/shared mempool' model, ports sharing a similar MTU
>  would
> >> all share the same buffer mempool (e.g. the most common example of
> >> this being that all ports are by default created with a 1500B MTU,
> >> and as such share the same mbuf mempool).
> 
>  This approach had some drawbacks however. For example, with the
>  shared
> >> memory pool model a user could exhaust the shared memory pool (for
> >> instance by requesting a large number of RXQs for a port), this would
> >> cause the vSwitch to crash as any remaining ports would not have the
> >> required memory to function. This bug was discovered and reported to
> >> the community in late 2016
> >> https://mail.openvswitch.org/pipermail/ovs-
> >> discuss/2016-September/042560.html.
> 
>  The per port mempool patch aimed to avoid such issues by allocating
>  a
> >> separate buffer mempool to each port.
> 
>  An issue has been flagged on ovs-discuss, whereby memory dimensions
> >> provided for a given number of ports on OvS 2.6-2.8 may be
> >> insufficient to support the same number of ports in OvS 2.9, on
> >> account of the per-port mempool model without re-dimensioning extra
> >> memory. The effect of this is use case dependent (number of ports,
> >> RXQs, MTU settings, number of PMDs
> >> etc.) The previous 'common-pool' model was rudimentary in estimating
> >> the mempool size and was marked as something that was to be improved
> >> upon. The memory allocation calculation for per port model was
> >> modified to take the possible configuration factors mentioned into
> account.
> 
>  It's unfortunate that this came to light so close to the release
>  code
> >> freeze - but better late than never as it is a valid problem to be
> >> resolved.
> 
>  I wanted to highlight some options to the community as I don't
>  think
> >> the next steps should be taken in isolation due to the impact this
> >> feature has.
> 
>  There are a number of possibilities for the 2.9 release.
> 
>  (i) Revert the mempool per port patches and return to the shared
> >> mempool model. There are a number of features and refactoring in
> >> place on top of the change so this will not be a simple revert. I'm
> >> investigating what exactly is involved with this currently.
> >>>
> >>> This looks like a bad thing to do. Shared memory pools have their own
> >>> issues and hide the real memory usage by each port. Also, reverting
> >>> seems almost impossible; we'll have to re-implement it from scratch.
> >
> > I would agree, reverting isn't as straightforward an option due to the
> > number of commits introduced in relation to the per-port mempool feature
> > over time (listed below for completeness).
> >
> > netdev-dpdk: Create separate memory pool for each port: d555d9b
> > netdev-dpdk: fix management of pre-existing mempools:b6b26021d
> > netdev-dpdk: Fix mempool names to reflect socket id: f06546a
> > netdev-dpdk: skip init for existing mempools: 837c176
> > netdev-dpdk: manage failure in mempool name creation: 65056fd
> > netdev-dpdk: Reword mp_size as n_mbufs: ad9b5b9
> > netdev-dpdk: Rename dpdk_mp_put as dpdk_mp_free: a08a115
> > netdev-dpdk: Fix mp_name leak on snprintf failure: ec6edc8
> > netdev-dpdk: Fix dpdk_mp leak in case of EEXIST: 173ef76
> > netdev-dpdk: Factor out struct dpdk_mp: 24e78f9
> > netdev-dpdk: Remove unused MAX_NB_MBUF: bc57ed9
> > netdev-dpdk: Fix mempool creation with large MTU: af5b0

Re: [ovs-discuss] Mempool issue for OVS 2.9

2018-01-29 Thread Kevin Traynor
On 01/26/2018 05:27 PM, Stokes, Ian wrote:
>> -Original Message-
>> From: Kevin Traynor [mailto:ktray...@redhat.com]
>> Sent: Friday, January 26, 2018 3:48 PM
>> To: Ilya Maximets ; Stokes, Ian
>> ; ovs-discuss@openvswitch.org
>> Cc: Flavio Leitner ; Loftus, Ciara
>> ; Kavanagh, Mark B ;
>> Jan Scheurich (jan.scheur...@ericsson.com) ;
>> Ben Pfaff (b...@ovn.org) ; acon...@redhat.com; Venkatesan
>> Pradeep 
>> Subject: Re: Mempool issue for OVS 2.9
>>
>> On 01/26/2018 03:16 PM, Ilya Maximets wrote:
>>> On 26.01.2018 15:00, Stokes, Ian wrote:
 Hi All,

 Recently an issue was raised regarding the move from a single shared
>> mempool model that was in place up to OVS 2.8, to a mempool per port model
>> introduced in 2.9.

 https://mail.openvswitch.org/pipermail/ovs-discuss/2018-January/04602
 1.html

 The per port mempool model was introduced in September 5th with commit
>> d555d9bded to allow fine grain control on a per port case of memory usage.

 In the 'common/shared mempool' model, ports sharing a similar MTU would
>> all share the same buffer mempool (e.g. the most common example of this
>> being that all ports are by default created with a 1500B MTU, and as such
>> share the same mbuf mempool).

 This approach had some drawbacks however. For example, with the shared
>> memory pool model a user could exhaust the shared memory pool (for
>> instance by requesting a large number of RXQs for a port), this would
>> cause the vSwitch to crash as any remaining ports would not have the
>> required memory to function. This bug was discovered and reported to the
>> community in late 2016 https://mail.openvswitch.org/pipermail/ovs-
>> discuss/2016-September/042560.html.

 The per port mempool patch aimed to avoid such issues by allocating a
>> separate buffer mempool to each port.

 An issue has been flagged on ovs-discuss, whereby memory dimensions
>> provided for a given number of ports on OvS 2.6-2.8 may be insufficient to
>> support the same number of ports in OvS 2.9, on account of the per-port
>> mempool model without re-dimensioning extra memory. The effect of this is
>> use case dependent (number of ports, RXQs, MTU settings, number of PMDs
>> etc.) The previous 'common-pool' model was rudimentary in estimating the
>> mempool size and was marked as something that was to be improved upon. The
>> memory allocation calculation for per port model was modified to take the
>> possible configuration factors mentioned into account.

 It's unfortunate that this came to light so close to the release code
>> freeze - but better late than never as it is a valid problem to be
>> resolved.

 I wanted to highlight some options to the community as I don't think
>> the next steps should be taken in isolation due to the impact this feature
>> has.

 There are a number of possibilities for the 2.9 release.

 (i) Revert the mempool per port patches and return to the shared
>> mempool model. There are a number of features and refactoring in place on
>> top of the change so this will not be a simple revert. I'm investigating
>> what exactly is involved with this currently.
>>>
>>> This looks like a bad thing to do. Shared memory pools have their own
>>> issues and hide the real memory usage by each port. Also, reverting
>>> seems almost impossible; we'll have to re-implement it from scratch.
> 
> I would agree, reverting isn't as straightforward an option due to the 
> number of commits introduced in relation to the per-port mempool feature 
> over time (listed below for completeness).
> 
> netdev-dpdk: Create separate memory pool for each port: d555d9b
> netdev-dpdk: fix management of pre-existing mempools:b6b26021d
> netdev-dpdk: Fix mempool names to reflect socket id: f06546a
> netdev-dpdk: skip init for existing mempools: 837c176
> netdev-dpdk: manage failure in mempool name creation: 65056fd 
> netdev-dpdk: Reword mp_size as n_mbufs: ad9b5b9
> netdev-dpdk: Rename dpdk_mp_put as dpdk_mp_free: a08a115
> netdev-dpdk: Fix mp_name leak on snprintf failure: ec6edc8
> netdev-dpdk: Fix dpdk_mp leak in case of EEXIST: 173ef76
> netdev-dpdk: Factor out struct dpdk_mp: 24e78f9
> netdev-dpdk: Remove unused MAX_NB_MBUF: bc57ed9
> netdev-dpdk: Fix mempool creation with large MTU: af5b0da
> netdev-dpdk: Add debug appctl to get mempool information: be48173
> 
> Although a lot of these are fixes/formatting, we would have to introduce a 
> new series and re-introduce the shared model by removing the per-port 
> specifics. This would also require extensive testing.
> 
> Functionality such as the mempool debug tool would also be impacted 
> (I'm not sure how valuable it would be for a single shared mempool).
> 
> I can look at putting an RFC for this together so as to keep the option on 
> the table if it's preferred within the 2.9 time window.
> 
> (ii) Leave the per port mempool implementation as is, flag to users
>> that memory

Re: [ovs-discuss] Mempool issue for OVS 2.9

2018-01-29 Thread Kevin Traynor
On 01/26/2018 05:21 PM, Ilya Maximets wrote:
> On 26.01.2018 18:47, Kevin Traynor wrote:
>> On 01/26/2018 03:16 PM, Ilya Maximets wrote:
>>> On 26.01.2018 15:00, Stokes, Ian wrote:
 Hi All,

 Recently an issue was raised regarding the move from a single shared 
 mempool model that was in place up to OVS 2.8, to a mempool per port model 
 introduced in 2.9.

 https://mail.openvswitch.org/pipermail/ovs-discuss/2018-January/046021.html

 The per port mempool model was introduced in September 5th with commit 
 d555d9bded to allow fine grain control on a per port case of memory usage.

 In the 'common/shared mempool' model, ports sharing a similar MTU would 
 all share the same buffer mempool (e.g. the most common example of this 
 being that all ports are by default created with a 1500B MTU, and as such 
 share the same mbuf mempool).

 This approach had some drawbacks however. For example, with the shared 
 memory pool model a user could exhaust the shared memory pool (for 
 instance by requesting a large number of RXQs for a port), this would 
 cause the vSwitch to crash as any remaining ports would not have the 
 required memory to function. This bug was discovered and reported to the 
 community in late 2016 
 https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html.

 The per port mempool patch aimed to avoid such issues by allocating a 
 separate buffer mempool to each port.

 An issue has been flagged on ovs-discuss, whereby memory dimensions 
 provided for a given number of ports on OvS 2.6-2.8 may be insufficient to 
 support the same number of ports in OvS 2.9, on account of the per-port 
 mempool model without re-dimensioning extra memory. The effect of this is 
 use case dependent (number of ports, RXQs, MTU settings, number of PMDs 
 etc.) The previous 'common-pool' model was rudimentary in estimating the 
 mempool size and was marked as something that was to be improved upon. The 
 memory allocation calculation for per port model was modified to take the 
 possible configuration factors mentioned into account.

 It's unfortunate that this came to light so close to the release code 
 freeze - but better late than never as it is a valid problem to be 
 resolved.

 I wanted to highlight some options to the community as I don't think the 
 next steps should be taken in isolation due to the impact this feature has.

 There are a number of possibilities for the 2.9 release.

 (i) Revert the mempool per port patches and return to the shared mempool 
 model. There are a number of features and refactoring in place on top of 
 the change so this will not be a simple revert. I'm investigating what 
 exactly is involved with this currently.
>>>
>>> This looks like a bad thing to do. Shared memory pools have their own 
>>> issues and hide the real memory usage by each port. Also, reverting seems 
>>> almost impossible; we'll have to re-implement it from scratch.
> (ii) Leave the per port mempool implementation as is, flag to users
>> that memory requirements have increased. Extra memory may have to be
>> provided on a per use case basis.
>>>
>>> That's a good point. I prefer this option if possible.
>>
>> With 4 pmds and 4 rxqs per port a user would only be able to add 7 ports
>> per socket now. 9 ports if 1 Rxq. Of course a user may have some
>> headroom in the memory they have allocated but that can't be assumed. It
>> will surely mean some users' setups will not work after upgrade - then
>> they will have to debug why.
>>
>> If we just reduce the MIN_NB_MBUF to 8192, it would make it 10 ports/4
>> rxqs, or 13 ports/1 rxqs without additional memory. It still gives
>> 24K/18K buffers per port.
>>
>> What do you think?
> 
> You're talking mostly about physical ports. But what about virtual?
> They will have only ~10K mbufs.
> 
> In my usual installation I have bonding with 2 HW NICs with 4096
> descriptors in each. --> 8192 mbufs will be consumed only by HW
> TX queues. With some fluctuations and possible sends to different
> TX queues on HW and possible sends to another HW NICs I'll have
> almost 100% probability of mempool exhausting even on that simple
> installation. But there are more complicated installations too
> (4 bonded NICs, for example).
>

Yes, thanks for pointing it out. The calculation for the number of
buffers needed is bad, so we'll just have to leave the additional
buffers to try and cover the other cases it misses.
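
For anyone following the numbers, a rough reconstruction of the arithmetic on 
both sides (the per-port sizing terms are inferred from the 24K/18K figures 
quoted above rather than taken from the OVS source, so treat them as 
approximate):

    #include <stdio.h>

    int main(void)
    {
        const unsigned q_size = 2048, burst = 32, min_nb_mbuf = 8192;

        /* Reduced per-port pools for a physical port with 4 tx queues. */
        unsigned p4 = 4 * q_size + 4 * q_size + 4 * burst + min_nb_mbuf;
        unsigned p1 = 1 * q_size + 4 * q_size + 1 * burst + min_nb_mbuf;
        printf("phy port, 4 rxqs: %u mbufs (~24K)\n", p4);
        printf("phy port, 1 rxq:  %u mbufs (~18K)\n", p1);

        /* The counter-example: a vhost port left with ~10K mbufs, while two
         * bonded HW NICs with 4096 tx descriptors each can park up to
         * 2 * 4096 of those mbufs in their tx rings alone. */
        printf("mbufs parked in HW tx rings: %u\n", 2 * 4096);
        return 0;
    }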
>>
>>>
 (iii) Reduce the amount of memory allocated per mempool per port. An RFC 
 to this effect was submitted by Kevin but on follow up the feeling is that 
 it does not resolve the issue adequately.
>>>
>>
>> Right, we discussed it in the other thread. It fixes some of the missed
>> factors for worst case, but Pradeep pointed out there 

Re: [ovs-discuss] Mempool issue for OVS 2.9

2018-01-29 Thread Jan Scheurich
> -Original Message-
> From: Ilya Maximets [mailto:i.maxim...@samsung.com]
> Sent: Monday, 29 January, 2018 09:35
> To: Jan Scheurich ; Venkatesan Pradeep 
> ; Stokes, Ian
> ; d...@openvswitch.org
> Cc: Kevin Traynor ; Flavio Leitner ; 
> Loftus, Ciara ; Kavanagh, Mark B
> ; Ben Pfaff (b...@ovn.org) ; 
> acon...@redhat.com; disc...@openvswitch.org
> Subject: Re: Mempool issue for OVS 2.9
> 
> On 29.01.2018 11:19, Jan Scheurich wrote:
> > Hi,
> >
> > I'd like to take one step back and look at how many mbufs we actually 
> > need.
> >
> > Today mbufs are consumed in the following places:
> >
> >  1. Rx queues of **physical** dpdk ports: dev->requested_n_rxq * 
> > dev->requested_rxq_size
> > Note 1: These mbufs are hogged up at all times.
> > Note 2: There is little point in configuring more rx queues per phy 
> > port than there are PMDs to poll them.
> > Note 3: The rx queues of vhostuser ports exist as virtqueues in the 
> > guest and do not hog mbufs.
> >  2. One batch per PMD during processing: #PMD * NETDEV_MAX_BURST
> >  3. One batch per tx queue with time-based tx batching: 
> > dev->requested_n_txq * NETDEV_MAX_BURST
> >  4. Tx queues of **physical** ports: dev->requested_n_txq * expected peak 
> > tx queue fill level
> > Note 1:  The maximum of 2K mbufs per tx queue can only be reached if 
> > the OVS transmit rate exceeds the line rate for a long time.
> This can only happen for large packets and when the traffic originates from 
> VMs on the compute node. This would be a case of under-
> dimensioning and packets would be dropped in any case. Excluding that 
> scenario, a typical peak tx queue fill level would be when all
> PMDs transmit a full batch at the same time: #PMDs * NETDEV_MAX_BURST.
> 
> The above assumption is wrong. Just look at the ixgbe driver:
> drivers/net/ixgbe/ixgbe_rxtx.c: tx_xmit_pkts():
> 
>/*
> * Begin scanning the H/W ring for done descriptors when the
> * number of available descriptors drops below tx_free_thresh.  For
> * each done descriptor, free the associated buffer.
> */
>if (txq->nb_tx_free < txq->tx_free_thresh)
>┊   ixgbe_tx_free_bufs(txq);
> 
> The default value for 'tx_free_thresh' is 32. So, if I configure the number
> of TX descriptors to 4096, the driver will start to free mbufs only when it
> has more than 4063 mbufs inside its TX queue, no matter how frequently the
> send() function is called.

OK, but that doesn't change my general argument. The mbufs hogged in the tx 
side of the phy port driver are coming from all ports (least likely the port 
itself). Considering them in dimensioning the port's private mempool is 
conceptually wrong. In my simplified dimensioning formula below I have already 
assumed full occupancy of the tx queue for phy ports. The second key 
observation is that vhostuser ports do not hog mbufs at all. And vhost zero 
copy doesn't change that.

BTW, is there any reason why phy drivers should free tx mbufs only when the tx 
ring is close to becoming full? I understand the need to free them in batches 
for performance reasons, but is there no cheap way to do this earlier?
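
For reference, the worst-case hold-back implied by the quoted ixgbe behaviour 
can be written as a tiny helper (a sketch; only tx_free_thresh and its default 
of 32 come from the snippet above):

    #include <stdio.h>

    /* Upper bound on mbufs a phy tx queue can hold back before the driver
     * starts freeing them: freeing only kicks in once fewer than
     * tx_free_thresh descriptors remain free. */
    static unsigned
    txq_mbuf_hold_bound(unsigned nb_tx_desc, unsigned tx_free_thresh)
    {
        return nb_tx_desc - tx_free_thresh;
    }

    int main(void)
    {
        /* 4096 descriptors with the default threshold of 32 -> 4064 mbufs,
         * i.e. the "more than 4063" figure quoted above. */
        printf("%u\n", txq_mbuf_hold_bound(4096, 32));
        return 0;
    }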

> 
> > Note 2: Vhostuser ports do not store mbufs in tx queues due to copying 
> > to virtio descriptors
> >
> >
> > For concreteness let us use an example of a typical, rather large OVS 
> > deployment in an NFVI cloud:
> >
> >   * Two cores with 4 PMDs per NUMA socket using HT.
> >   * Two physical ports using RSS over 4 rx queues to enable load-sharing 
> > over the 4 local PMDs and 9 tx queues (8 PMDs plus non PMD)
> >   * 100 vhostuser ports with a varying number of rx and tx queue pairs (128 
> > in total).
> >
> >
> > In the above example deployments this translates into
> >
> >  1. 4 * 2K = 8K mbufs per physical port (16K in total)
> >  2. 8 * 32 = 256 mbufs total
> >  3. (128 + 2*9) * 32 = 4672 mbufs in total
> >  4. 9 * 32 = 288 mbufs per physical port (Adding some safety margin, a 
> > total of 2K mbufs)
> >
> > ---
> > Total : 23K mbufs
> >
> > This is way lower than the size of the earlier shared mempool (256K mbufs), 
> > which explains why we have never observed out of mbuf
> drops in our NFVI deployments. The vswitchd crash that triggered the change 
> to per-port mempools only happened because they tried to
> configure 64 rx and tx queues per physical port for multiple ports. I can’t 
> see any reason for configuring more rx and tx queues than
> polling PMDs, though.
> >
> > The actual consumption of mbufs scales primarily with the number of 
> > physical ports (a, c and d) and only to a much lower degree with
> the number of vhost ports/queues (c).
> >
> > Except for the phy rx queues, all other cases buffer a statistical mix of 
> > mbufs received on all ports. There seems little point in assigning
> per-port mempools for these.
> >
> > I think we should revert to a shared mempool (per MTU size) with a simple 
> > dimensioning formula that only dep

Re: [ovs-discuss] Mempool issue for OVS 2.9

2018-04-10 Thread Stokes, Ian
> >> -Original Message-
> >> From: Ilya Maximets [mailto:i.maxim...@samsung.com]
> >> Sent: Monday, 29 January, 2018 09:35
> >> To: Jan Scheurich ; Venkatesan Pradeep
> >> ; Stokes, Ian
> >> ; d...@openvswitch.org
> >> Cc: Kevin Traynor ; Flavio Leitner
> >> ; Loftus, Ciara ; Kavanagh,
> >> Mark B ; Ben Pfaff (b...@ovn.org)
> >> ; acon...@redhat.com; disc...@openvswitch.org
> >> Subject: Re: Mempool issue for OVS 2.9
> >>
> >> On 29.01.2018 11:19, Jan Scheurich wrote:
> >>> Hi,
> >>>
> >>> I'd like to take one step back and look at how many mbufs we
> actually need.
> >>>
> >>> Today mbufs are consumed in the following places:
> >>>
> >>>  1. Rx queues of **physical** dpdk ports: dev->requested_n_rxq * dev-
> >requested_rxq_size
> >>> Note 1: These mbufs are hogged up at all times.
> >>> Note 2: There is little point in configuring more rx queues per
> phy port than there are PMDs to poll them.
> >>> Note 3: The rx queues of vhostuser ports exist as virtqueues in
> the guest and do not hog mbufs.
> >>>  2. One batch per PMD during processing: #PMD * NETDEV_MAX_BURST  3.
> >>> One batch per tx queue with time-based tx batching:
> >>> dev->requested_n_txq * NETDEV_MAX_BURST  4. Tx queues of **physical**
> ports: dev->requested_n_txq * expected peak tx queue fill level
> >>> Note 1:  The maximum of 2K mbufs per tx queue can only be reached
> if the OVS transmit rate exceeds the line rate for a long time.
> >> This can only happen for large packets and when the traffic
> >> originates from VMs on the compute node. This would be a case of
> >> under- dimensioning and packets would be dropped in any case. Excluding
> that scenario, a typical peak tx queue fill level would be when all PMDs
> transmit a full batch at the same time: #PMDs * NETDEV_MAX_BURST.
> >>
> >> Above assumption is wrong. Just look at ixgbe driver:
> >> drivers/net/ixgbe/ixgbe_rxtx.c: tx_xmit_pkts():
> >>
> >>/*
> >> * Begin scanning the H/W ring for done descriptors when the
> >> * number of available descriptors drops below tx_free_thresh.
> For
> >> * each done descriptor, free the associated buffer.
> >> */
> >>if (txq->nb_tx_free < txq->tx_free_thresh)
> >>┊   ixgbe_tx_free_bufs(txq);
> >>
> >> The default value for 'tx_free_thresh' is 32. So, if I'll configure
> >> number of TX descriptors to 4096, driver will start to free mbufs
> >> only when it will have more than 4063 mbufs inside its TX queue. No
> >> matter how frequent calls to send() function.
> >
> > OK, but that doesn't change my general argument. The mbufs hogged in the
> tx side of the phy port driver are coming from all ports (least likely the
> port itself). Considering them in dimensioning the port's private mempool
> is conceptually wrong. In my simplified dimensioning formula below I have
> already assumed full occupancy of the tx queue for phy ports. The second
> key observation is that vhostuser ports do not hog mbufs at all. And vhost
> zero copy doesn't change that.
> 
> The formula below may be good for a static environment. I want to change the
> number of PMD threads dynamically in my deployments, and this works in the
> current per-port model and with an oversized shared pool. If we try to reduce
> the memory consumption of the shared pool, we'll have to reconfigure all the
> devices each time we change the number of PMD threads. This would be
> really bad.
> So, the size of the memory pool should not depend on dynamic characteristics
> of the datapath or other ports, to avoid unexpected interruptions of traffic
> flows in case of random changes in configuration. Of course, it could depend
> on characteristics of the port itself in the per-port model. In the shared
> mempool model the size should only depend on static datapath configuration.

Hi all,

Now seems a good time to kick-start this conversation again, as there are a few 
patches floating around for mempools on master and 2.9.
I'm happy to work on a solution for this but before starting I'd like to agree 
on the requirements so we're all comfortable with the solution.

I see two use cases above, static and dynamic. Each has its own requirements 
(I'm keeping OVS 2.10 in mind here, as it's an issue we need to resolve).

Static environment
1. For a given deployment, the 2.10 mempool design should use the same amount 
of memory as, or less than, the shared mempool design of 2.9.
2. Memory pool size can depend on static datapath configurations, but the 
previous provisioning used in OVS 2.9 is acceptable also.

I think the shared mempool model suits the static environment; it's a rough way 
of provisioning memory, but it works for the majority involved in the discussion 
to date.
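
To make the "size depends only on static configuration" idea concrete, a minimal 
sketch of a shared mempool keyed on socket and MTU (the names, the fixed pool 
size and the frame overhead here are illustrative assumptions, not proposed 
code):

    #include <rte_mbuf.h>
    #include <rte_mempool.h>
    #include <stdio.h>

    #define SHARED_POOL_MBUFS 262144        /* e.g. the old 256K common pool */

    struct rte_mempool *
    shared_mempool_get(int socket_id, int mtu)
    {
        char name[RTE_MEMPOOL_NAMESIZE];

        /* The name encodes only static configuration (socket and MTU), so a
         * later port with the same socket/MTU reuses the existing pool and
         * PMD or port changes never force a resize. */
        snprintf(name, sizeof name, "ovs_mp_%d_%d", socket_id, mtu);

        struct rte_mempool *mp = rte_mempool_lookup(name);
        if (!mp) {
            unsigned frame_len = mtu + 18;      /* rough L2 overhead */
            mp = rte_pktmbuf_pool_create(name, SHARED_POOL_MBUFS,
                                         RTE_MEMPOOL_CACHE_MAX_SIZE, 0,
                                         frame_len + RTE_PKTMBUF_HEADROOM,
                                         socket_id);
        }
        return mp;
    }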

Dynamic environment
1. Mempool size should not depend on dynamic characteristics (number of PMDs, 
number of ports etc.), as changing these leads to frequent traffic interruptions.
2. Due to the dynamic environment, it's preferable to have clear visibility of 
memory usage per port (Sharing me

Re: [ovs-discuss] Mempool issue for OVS 2.9

2018-04-11 Thread Kevin Traynor
On 04/10/2018 11:12 AM, Stokes, Ian wrote:
 -Original Message-
 From: Ilya Maximets [mailto:i.maxim...@samsung.com]
 Sent: Monday, 29 January, 2018 09:35
 To: Jan Scheurich ; Venkatesan Pradeep
 ; Stokes, Ian
 ; d...@openvswitch.org
 Cc: Kevin Traynor ; Flavio Leitner
 ; Loftus, Ciara ; Kavanagh,
 Mark B ; Ben Pfaff (b...@ovn.org)
 ; acon...@redhat.com; disc...@openvswitch.org
 Subject: Re: Mempool issue for OVS 2.9

 On 29.01.2018 11:19, Jan Scheurich wrote:
> Hi,
>
> I'd like to take one step back and look at how many mbufs we
>> actually need.
>
> Today mbufs are consumed in the following places:
>
>  1. Rx queues of **physical** dpdk ports: dev->requested_n_rxq * dev-
>>> requested_rxq_size
> Note 1: These mbufs are hogged up at all times.
> Note 2: There is little point in configuring more rx queues per
>> phy port than there are PMDs to poll them.
> Note 3: The rx queues of vhostuser ports exist as virtqueues in
>> the guest and do not hog mbufs.
>  2. One batch per PMD during processing: #PMD * NETDEV_MAX_BURST  3.
> One batch per tx queue with time-based tx batching:
> dev->requested_n_txq * NETDEV_MAX_BURST  4. Tx queues of **physical**
>> ports: dev->requested_n_txq * expected peak tx queue fill level
> Note 1:  The maximum of 2K mbufs per tx queue can only be reached
>> if the OVS transmit rate exceeds the line rate for a long time.
 This can only happen for large packets and when the traffic
 originates from VMs on the compute node. This would be a case of
 under- dimensioning and packets would be dropped in any case. Excluding
>> that scenario, a typical peak tx queue fill level would be when all PMDs
>> transmit a full batch at the same time: #PMDs * NETDEV_MAX_BURST.

 Above assumption is wrong. Just look at ixgbe driver:
 drivers/net/ixgbe/ixgbe_rxtx.c: tx_xmit_pkts():

/*
 * Begin scanning the H/W ring for done descriptors when the
 * number of available descriptors drops below tx_free_thresh.
>> For
 * each done descriptor, free the associated buffer.
 */
if (txq->nb_tx_free < txq->tx_free_thresh)
┊   ixgbe_tx_free_bufs(txq);

 The default value for 'tx_free_thresh' is 32. So, if I'll configure
 number of TX descriptors to 4096, driver will start to free mbufs
 only when it will have more than 4063 mbufs inside its TX queue. No
 matter how frequent calls to send() function.
>>>
>>> OK, but that doesn't change my general argument. The mbufs hogged in the
>> tx side of the phy port driver are coming from all ports (least likely the
>> port itself). Considering them in dimensioning the port's private mempool
>> is conceptually wrong. In my simplified dimensioning formula below I have
>> already assumed full occupancy of the tx queue for phy ports. The second
>> key observation is that vhostuser ports do not hog mbufs at all. And vhost
>> zero copy doesn't change that.
>>
>> Formula below maybe good for static environment. I want to change number
>> of PMD threads dynamically in my deployments and this working in current
>> per-port model and with oversized shared pool. If we'll try to reduce
>> memory consumption of the shared pool we'll have to reconfigure all the
>> devices each time we change the number of PMD threads. This would be
>> really bad.
>> So, size of the memory pool should not depend on dynamic characteristics
>> of the datapath or other ports to avoid unexpected interrupts in traffic
>> flows in case of random changes in configuration. Of course, it could
>> depend on characteristics of the port itself in case of per-port model. In
>> case of shared mempool model the size should only depend on static
>> datapath configuration.
> 
> Hi all,
> 
> Now seems a good time to kick start this conversation again as there's a few 
> patches floating around for mempools on master and 2.9.
> I'm happy to work on a solution for this but before starting I'd like to 
> agree on the requirements so we're all comfortable with the solution.
> 

Thanks for kicking it off, Ian. FWIW, the freeing fix code can work with
both schemes below. I already have that between the patches for
different branches. It should be straightforward to change to cover both
in the same code. I can help with that if needed.

> I see two use cases above, static and dynamic. Each have their own 
> requirements (I'm keeping OVS 2.10 in mind here as it's an issue we need to 
> resolve).
> 
> Static environment
> 1. For a given deployment, the 2.10 the mempool design should use the same or 
> less memory as the shared mempool design of 2.9.
> 2. Memory pool size can depend on static datapath configurations, but the 
> previous provisioning used in OVS 2.9 is acceptable also.
> 
> I think the shared mempool model suits the static environment, it's a rough 
> wa

Re: [ovs-discuss] Mempool issue for OVS 2.9

2018-04-12 Thread Ilya Maximets
On 11.04.2018 20:55, Kevin Traynor wrote:
> On 04/10/2018 11:12 AM, Stokes, Ian wrote:
> -Original Message-
> From: Ilya Maximets [mailto:i.maxim...@samsung.com]
> Sent: Monday, 29 January, 2018 09:35
> To: Jan Scheurich ; Venkatesan Pradeep
> ; Stokes, Ian
> ; d...@openvswitch.org
> Cc: Kevin Traynor ; Flavio Leitner
> ; Loftus, Ciara ; Kavanagh,
> Mark B ; Ben Pfaff (b...@ovn.org)
> ; acon...@redhat.com; disc...@openvswitch.org
> Subject: Re: Mempool issue for OVS 2.9
>
> On 29.01.2018 11:19, Jan Scheurich wrote:
>> Hi,
>>
>> I'd like to take one step back and look at how many mbufs we
>>> actually need.
>>
>> Today mbufs are consumed in the following places:
>>
>>  1. Rx queues of **physical** dpdk ports: dev->requested_n_rxq * dev-
 requested_rxq_size
>> Note 1: These mbufs are hogged up at all times.
>> Note 2: There is little point in configuring more rx queues per
>>> phy port than there are PMDs to poll them.
>> Note 3: The rx queues of vhostuser ports exist as virtqueues in
>>> the guest and do not hog mbufs.
>>  2. One batch per PMD during processing: #PMD * NETDEV_MAX_BURST  3.
>> One batch per tx queue with time-based tx batching:
>> dev->requested_n_txq * NETDEV_MAX_BURST  4. Tx queues of **physical**
>>> ports: dev->requested_n_txq * expected peak tx queue fill level
>> Note 1:  The maximum of 2K mbufs per tx queue can only be reached
>>> if the OVS transmit rate exceeds the line rate for a long time.
> This can only happen for large packets and when the traffic
> originates from VMs on the compute node. This would be a case of
> under- dimensioning and packets would be dropped in any case. Excluding
>>> that scenario, a typical peak tx queue fill level would be when all PMDs
>>> transmit a full batch at the same time: #PMDs * NETDEV_MAX_BURST.
>
> Above assumption is wrong. Just look at ixgbe driver:
> drivers/net/ixgbe/ixgbe_rxtx.c: tx_xmit_pkts():
>
>/*
> * Begin scanning the H/W ring for done descriptors when the
> * number of available descriptors drops below tx_free_thresh.
>>> For
> * each done descriptor, free the associated buffer.
> */
>if (txq->nb_tx_free < txq->tx_free_thresh)
>┊   ixgbe_tx_free_bufs(txq);
>
> The default value for 'tx_free_thresh' is 32. So, if I'll configure
> number of TX descriptors to 4096, driver will start to free mbufs
> only when it will have more than 4063 mbufs inside its TX queue. No
> matter how frequent calls to send() function.

 OK, but that doesn't change my general argument. The mbufs hogged in the
>>> tx side of the phy port driver are coming from all ports (least likely the
>>> port itself). Considering them in dimensioning the port's private mempool
>>> is conceptually wrong. In my simplified dimensioning formula below I have
>>> already assumed full occupancy of the tx queue for phy ports. The second
>>> key observation is that vhostuser ports do not hog mbufs at all. And vhost
>>> zero copy doesn't change that.
>>>
>>> Formula below maybe good for static environment. I want to change number
>>> of PMD threads dynamically in my deployments and this working in current
>>> per-port model and with oversized shared pool. If we'll try to reduce
>>> memory consumption of the shared pool we'll have to reconfigure all the
>>> devices each time we change the number of PMD threads. This would be
>>> really bad.
>>> So, size of the memory pool should not depend on dynamic characteristics
>>> of the datapath or other ports to avoid unexpected interrupts in traffic
>>> flows in case of random changes in configuration. Of course, it could
>>> depend on characteristics of the port itself in case of per-port model. In
>>> case of shared mempool model the size should only depend on static
>>> datapath configuration.
>>
>> Hi all,
>>
>> Now seems a good time to kick start this conversation again as there's a few 
>> patches floating around for mempools on master and 2.9.
>> I'm happy to work on a solution for this but before starting I'd like to 
>> agree on the requirements so we're all comfortable with the solution.
>>
> 
> Thanks for kicking it off Ian. FWIW, the freeing fix code can work with
> both schemes below. I already have that between the patches for
> different branches. It should be straightforward to change to cover both
> in same code. I can help with that if needed.

Agreed, there is not much difference between the mempool models for the freeing fix.

> 
>> I see two use cases above, static and dynamic. Each have their own 
>> requirements (I'm keeping OVS 2.10 in mind here as it's an issue we need to 
>> resolve).
>>
>> Static environment
>> 1. For a given deployment, the 2.10 the mempool design should use the same 
>> or less memory as the shared mempool design of 2.9.
>> 

Re: [ovs-discuss] Mempool issue for OVS 2.9

2018-04-15 Thread Venkatesan Pradeep
Hi,

> -Original Message-
> From: Ilya Maximets [mailto:i.maxim...@samsung.com]
> Sent: Thursday, April 12, 2018 8:52 PM
> To: Kevin Traynor ; Stokes, Ian ;
> Jan Scheurich ; Venkatesan Pradeep
> ; d...@openvswitch.org
> Cc: Flavio Leitner ; Loftus, Ciara ;
> Kavanagh, Mark B ; Ben Pfaff (b...@ovn.org)
> ; acon...@redhat.com; disc...@openvswitch.org
> Subject: Re: Mempool issue for OVS 2.9
> 
> On 11.04.2018 20:55, Kevin Traynor wrote:
> > On 04/10/2018 11:12 AM, Stokes, Ian wrote:
> > -Original Message-
> > From: Ilya Maximets [mailto:i.maxim...@samsung.com]
> > Sent: Monday, 29 January, 2018 09:35
> > To: Jan Scheurich ; Venkatesan Pradeep
> > ; Stokes, Ian
> > ; d...@openvswitch.org
> > Cc: Kevin Traynor ; Flavio Leitner
> > ; Loftus, Ciara ;
> > Kavanagh, Mark B ; Ben Pfaff
> > (b...@ovn.org) ; acon...@redhat.com;
> > disc...@openvswitch.org
> > Subject: Re: Mempool issue for OVS 2.9
> >
> > On 29.01.2018 11:19, Jan Scheurich wrote:
> >> Hi,
> >>
> >> I'd like to take one step back and look at how many mbufs we
> >>> actually need.
> >>
> >> Today mbufs are consumed in the following places:
> >>
> >>  1. Rx queues of **physical** dpdk ports: dev->requested_n_rxq *
> >> dev-
>  requested_rxq_size
> >> Note 1: These mbufs are hogged up at all times.
> >> Note 2: There is little point in configuring more rx queues
> >> per
> >>> phy port than there are PMDs to poll them.
> >> Note 3: The rx queues of vhostuser ports exist as virtqueues
> >> in
> >>> the guest and do not hog mbufs.
> >>  2. One batch per PMD during processing: #PMD *
> NETDEV_MAX_BURST  3.
> >> One batch per tx queue with time-based tx batching:
> >> dev->requested_n_txq * NETDEV_MAX_BURST  4. Tx queues of **physical**
> >>> ports: dev->requested_n_txq * expected peak tx queue fill level
> >> Note 1:  The maximum of 2K mbufs per tx queue can only be
> >> reached
> >>> if the OVS transmit rate exceeds the line rate for a long time.
> > This can only happen for large packets and when the traffic
> > originates from VMs on the compute node. This would be a case of
> > under- dimensioning and packets would be dropped in any case.
> > Excluding
> >>> that scenario, a typical peak tx queue fill level would be when all
> >>> PMDs transmit a full batch at the same time: #PMDs *
> NETDEV_MAX_BURST.
> >
> > Above assumption is wrong. Just look at ixgbe driver:
> > drivers/net/ixgbe/ixgbe_rxtx.c: tx_xmit_pkts():
> >
> >/*
> > * Begin scanning the H/W ring for done descriptors when the
> > * number of available descriptors drops below tx_free_thresh.
> >>> For
> > * each done descriptor, free the associated buffer.
> > */
> >if (txq->nb_tx_free < txq->tx_free_thresh)
> >┊   ixgbe_tx_free_bufs(txq);
> >
> > The default value for 'tx_free_thresh' is 32. So, if I'll
> > configure number of TX descriptors to 4096, driver will start to
> > free mbufs only when it will have more than 4063 mbufs inside its
> > TX queue. No matter how frequent calls to send() function.
> 
>  OK, but that doesn't change my general argument. The mbufs hogged
>  in the
> >>> tx side of the phy port driver are coming from all ports (least
> >>> likely the port itself). Considering them in dimensioning the port's
> >>> private mempool is conceptually wrong. In my simplified dimensioning
> >>> formula below I have already assumed full occupancy of the tx queue
> >>> for phy ports. The second key observation is that vhostuser ports do
> >>> not hog mbufs at all. And vhost zero copy doesn't change that.
> >>>
> >>> Formula below maybe good for static environment. I want to change
> >>> number of PMD threads dynamically in my deployments and this working
> >>> in current per-port model and with oversized shared pool. If we'll
> >>> try to reduce memory consumption of the shared pool we'll have to
> >>> reconfigure all the devices each time we change the number of PMD
> >>> threads. This would be really bad.
> >>> So, size of the memory pool should not depend on dynamic
> >>> characteristics of the datapath or other ports to avoid unexpected
> >>> interrupts in traffic flows in case of random changes in
> >>> configuration. Of course, it could depend on characteristics of the
> >>> port itself in case of per-port model. In case of shared mempool
> >>> model the size should only depend on static datapath configuration.
> >>
> >> Hi all,
> >>
> >> Now seems a good time to kick start this conversation again as there's a 
> >> few
> patches floating around for mempools on master and 2.9.
> >> I'm happy to work on a solution for this but before starting I'd like to 
> >> agree
> on the requirements so we're all comfortable with the solution.
> >>
> >
> > Thanks