Re: [vpp-dev] Increasing NAT worker handoff frame queue size NAT_FQ_NELTS to avoid congestion drops?

2021-02-25 Thread Marcos - Mgiga
Hi Elias,

Thank you! Actually, I installed from binaries. Could you please provide me with a 
link to the 20.05 source repository?

Regarding the line "Note that this can help with one specific kind of packet drops 
in VPP NAT called "congestion drops"", would you mind giving me instructions on how 
to troubleshoot VPP properly in order to find out what's going on in my scenario?

Best Regards

MARCOS

-Original Message-
From: vpp-dev@lists.fd.io  On Behalf Of Elias Rudberg
Sent: Thursday, 25 February 2021 05:41
To: mar...@mgiga.com.br; ksek...@cisco.com
Cc: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] Increasing NAT worker handoff frame queue size 
NAT_FQ_NELTS to avoid congestion drops?

Hi Marcos,

If you are building VPP 20.05 from source then the easiest way is to simply 
change the value at "#define NAT_FQ_NELTS 64"
in src/plugins/nat/nat.h from 64 to something larger; we have been using 512, 
which seems to work fine in our case.

Note that this can help with one specific kind of packet drops in VPP NAT 
called "congestion drops"; if you have packet loss for other reasons, then a 
NAT_FQ_NELTS change will probably not help.

Best regards,
Elias


On Wed, 2021-02-24 at 13:45 -0300, Marcos - Mgiga wrote:
> Hi Elias,
> 
> I have been following this discussion and finally gave VPP a try, 
> implementing it as a CGN gateway. Unfortunately some issues came up, 
> like packet loss, and I believe your patch could be helpful.
> 
> Would you mind giving me guidance on how to deploy it? I'm using VPP 
> 20.05, as you did.
> 
> Best Regards
> 
> -Original Message-
> From: vpp-dev@lists.fd.io  On Behalf Of Elias Rudberg 
> Sent: Tuesday, 26 January 2021 11:10
> To: ksek...@cisco.com
> Cc: vpp-dev@lists.fd.io
> Subject: Re: [vpp-dev] Increasing NAT worker handoff frame queue size 
> NAT_FQ_NELTS to avoid congestion drops?
> 
> Hi Klement,
> 
> > > I see no reason why this shouldn’t be configurable.
> > > [...]
> > > Would you like to submit a patch?
> 
> I had a patch in December that was lying around for too long, so there were 
> merge conflicts; now I have made a new one. Third time's the 
> charm, I hope. Here it is:
> 
> https://gerrit.fd.io/r/c/vpp/+/30933
> 
> It makes the frame queue size configurable and also adds API support 
> and a test verifying the API support. Please have a look!
> 
> / Elias






Re: [vpp-dev] Increasing NAT worker handoff frame queue size NAT_FQ_NELTS to avoid congestion drops?

2021-02-25 Thread Elias Rudberg
Hi Marcos,

If you are building VPP 20.05 from source then the easiest way is to
simply change the value at "#define NAT_FQ_NELTS 64"
in src/plugins/nat/nat.h from 64 to something larger; we have been
using 512, which seems to work fine in our case.
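
For reference, the whole change is a one-line edit in that header (a sketch
against the 20.05 tree; 512 is just the value we happen to use, not a
recommendation for every setup):

  /* src/plugins/nat/nat.h */
  #define NAT_FQ_NELTS 64    /* original value */

becomes, for example:

  /* src/plugins/nat/nat.h */
  #define NAT_FQ_NELTS 512   /* larger handoff frame queue */

After changing it, rebuild VPP and restart so the new queue size takes effect.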

Note that this can help with one specific kind of packet drops in VPP
NAT called "congestion drops"; if you have packet loss for other
reasons, then a NAT_FQ_NELTS change will probably not help.

Best regards,
Elias


On Wed, 2021-02-24 at 13:45 -0300, Marcos - Mgiga wrote:
> Hi Elias, 
> 
> I have been following this discussion and finally gave VPP a try,
> implementing it as a CGN gateway. Unfortunately some issues came up,
> like packet loss, and I believe your patch could be helpful.
> 
> Would you mind giving me guidance on how to deploy it? I'm using VPP
> 20.05, as you did.
> 
> Best Regards
> 
> -Original Message-
> From: vpp-dev@lists.fd.io  On Behalf Of Elias
> Rudberg
> Sent: Tuesday, 26 January 2021 11:10
> To: ksek...@cisco.com
> Cc: vpp-dev@lists.fd.io
> Subject: Re: [vpp-dev] Increasing NAT worker handoff frame queue size
> NAT_FQ_NELTS to avoid congestion drops?
> 
> Hi Klement,
> 
> > > I see no reason why this shouldn’t be configurable.
> > > [...]
> > > Would you like to submit a patch?
> 
> I had a patch in December that was lying around for too long, so there
> were merge conflicts; now I have made a new one. Third time's the
> charm, I hope. Here it is:
> 
> https://gerrit.fd.io/r/c/vpp/+/30933
> 
> It makes the frame queue size configurable and also adds API support
> and a test verifying the API support. Please have a look!
> 
> / Elias





Re: [vpp-dev] Increasing NAT worker handoff frame queue size NAT_FQ_NELTS to avoid congestion drops?

2021-02-24 Thread Marcos - Mgiga
Hi Elias, 

I have been following this discussion and finally gave VPP a try, implementing 
it as a CGN gateway. Unfortunately some issues came up, like packet loss, and 
I believe your patch could be helpful.

Would you mind giving me guidance on how to deploy it? I'm using VPP 20.05, as you did.

Best Regards

-Original Message-
From: vpp-dev@lists.fd.io  On Behalf Of Elias Rudberg
Sent: Tuesday, 26 January 2021 11:10
To: ksek...@cisco.com
Cc: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] Increasing NAT worker handoff frame queue size 
NAT_FQ_NELTS to avoid congestion drops?

Hi Klement,

> > I see no reason why this shouldn’t be configurable.
> > [...]
> > Would you like to submit a patch?

I had a patch in December that was lying around for too long, so there were merge 
conflicts; now I have made a new one. Third time's the charm, I hope. Here 
it is:

https://gerrit.fd.io/r/c/vpp/+/30933

It makes the frame queue size configurable and also adds API support and a test 
verifying the API support. Please have a look!

/ Elias






Re: [vpp-dev] Increasing NAT worker handoff frame queue size NAT_FQ_NELTS to avoid congestion drops?

2021-01-26 Thread Elias Rudberg
Hi Klement,

> > I see no reason why this shouldn’t be configurable.
> > [...]
> > Would you like to submit a patch?

I had a patch in December that was lying around for too long, so there were
merge conflicts; now I have made a new one. Third time's the charm,
I hope. Here it is:

https://gerrit.fd.io/r/c/vpp/+/30933

It makes the frame queue size configurable and also adds API support
and a test verifying the API support. Please have a look!

/ Elias





Re: [vpp-dev] Increasing NAT worker handoff frame queue size NAT_FQ_NELTS to avoid congestion drops?

2020-12-21 Thread Elias Rudberg
Hi Klement,

> > > I see no reason why this shouldn’t be configurable.
> > > [...]
> > > Would you like to submit a patch?
> 
> Here is a patch making it configurable: 
> [...]

New patch, including API support and a test case: 
https://gerrit.fd.io/r/c/vpp/+/30482

Please check that one instead; I think it's better.

Best regards,
Elias





Re: [vpp-dev] Increasing NAT worker handoff frame queue size NAT_FQ_NELTS to avoid congestion drops?

2020-12-15 Thread Elias Rudberg
Hi Klement,

> > I see no reason why this shouldn’t be configurable.
> > [...]
> > Would you like to submit a patch?

Here is a patch making it configurable: 
https://gerrit.fd.io/r/c/vpp/+/30433

Best regards,
Elias





Re: [vpp-dev] Increasing NAT worker handoff frame queue size NAT_FQ_NELTS to avoid congestion drops?

2020-11-17 Thread Elias Rudberg
Hi Klement,

> I see no reason why this shouldn’t be configurable.
> [...]
> Would you like to submit a patch?

Sure, I'll give that a try, adding it as a config option of the same
kind as other NAT options.

Best regards,
Elias





Re: [vpp-dev] Increasing NAT worker handoff frame queue size NAT_FQ_NELTS to avoid congestion drops?

2020-11-16 Thread Klement Sekera via lists.fd.io
Hi Elias,

thanks for getting back with some real numbers. I only tested with two workers 
and a very simple case, and in my case increasing the queue size didn’t help one 
bit. But again, in my case there was a 100% handoff rate (every single packet was 
going through handoff), which is most probably the reason why one solution 
seemed like the holy grail and the other useless.

To answer your question regarding why the queue length is 64 - I guess nobody knows, 
as the author of that code has been gone for a while. I see no reason why this 
shouldn’t be configurable. When I tried just increasing the value, I quickly ran 
into an out-of-buffers situation with default configs.

Would you like to submit a patch?

Thanks,
Klement

> On 16 Nov 2020, at 11:33, Elias Rudberg  wrote:
> 
> Hi Klement,
> 
> Thanks! I have now tested your patch (28980); it seems to work and it
> does give some improvement. However, according to my tests, increasing
> NAT_FQ_NELTS seems to have a bigger effect: it improves performance a
> lot. When using the original NAT_FQ_NELTS value of 64, your patch
> gives some improvement, but I still get the best performance when
> increasing NAT_FQ_NELTS.
> 
> For example, one of the tests behaves like this:
> 
> Without patch, NAT_FQ_NELTS=64  --> 129 Gbit/s and ~600k cong. drops
> With patch, NAT_FQ_NELTS=64  --> 136 Gbit/s and ~400k cong. drops
> Without patch, NAT_FQ_NELTS=1024  --> 151 Gbit/s and 0 cong. drops
> With patch, NAT_FQ_NELTS=1024  --> 151 Gbit/s and 0 cong. drops
> 
> So it still looks like increasing NAT_FQ_NELTS would be good, which
> brings me back to the same questions as before:
> 
> Were there specific reasons for setting NAT_FQ_NELTS to 64?
> 
> Are there some potential drawbacks or dangers of changing it to a
> larger value?
> 
> I suppose everyone will agree that when there is a queue with a
> maximum length, the choice of that maximum length can be important. Is
> there some particular reason to believe that 64 would be enough? In
> our case we are using 8 NAT threads. Suppose thread 8 is held up
> briefly due to something taking a little longer than usual, while
> threads 1-7 each hand off 10 frames to thread 8; that situation would
> require a queue size of at least 70, unless I have misunderstood how the
> handoff mechanism works. To me, allowing a longer queue seems like a
> good thing because it also allows us to handle more difficult cases
> where threads are not always equally fast; there can be spikes in
> traffic that affect some threads more than others, and things like
> that. But maybe there are strong reasons for keeping the queue short,
> reasons I don't know about - that's why I'm asking.
> 
> Best regards,
> Elias
> 
> 
> On Fri, 2020-11-13 at 15:14 +, Klement Sekera -X (ksekera -
> PANTHEON TECH SRO at Cisco) wrote:
>> Hi Elias,
>> 
>> I’ve already debugged this and came to the conclusion that it’s the
>> infra which is the weak link. I was seeing congestion drops at mild
>> load, but not at full load. The issue is that with handoff, the
>> workload is uneven. For simplicity’s sake, just consider thread 1
>> handing off all the traffic to thread 2. What happens is that for
>> thread 1 the job is much easier: it just does some ip4 parsing and
>> then hands the packet to thread 2, which actually does the heavy lifting
>> of hash inserts/lookups/translation etc. A 64-element queue can hold 64
>> frames; one extreme is 64 1-packet frames, totalling 64 packets, the
>> other extreme is 64 255-packet frames, totalling ~16k packets. What
>> happens is this: thread 1 is mostly idle, just picking a few
>> packets from the NIC, and every one of these small frames creates an
>> entry in the handoff queue. Now thread 2 picks one element from the
>> handoff queue and deals with it before picking another one. If the queue
>> has only 3-packet or 10-packet elements, then thread 2 can never really
>> get into what VPP excels at - bulk processing.
>> 
>> Q: Why doesn’t it pick as many packets as possible from the handoff
>> queue? 
>> A: It’s not implemented.
>> 
>> I already wrote a patch for it, which made all the congestion drops
>> I saw (in the above synthetic test case) disappear. The patch, 
>> https://gerrit.fd.io/r/c/vpp/+/28980, is sitting in gerrit.
>> 
>> Would you like to give it a try and see if it helps your issue? We
>> shouldn’t need big queues under mild loads anyway …
>> 
>> Regards,
>> Klement
>> 





Re: [vpp-dev] Increasing NAT worker handoff frame queue size NAT_FQ_NELTS to avoid congestion drops?

2020-11-16 Thread Elias Rudberg
Hi Klement,

Thanks! I have now tested your patch (28980); it seems to work and it
does give some improvement. However, according to my tests, increasing
NAT_FQ_NELTS seems to have a bigger effect: it improves performance a
lot. When using the original NAT_FQ_NELTS value of 64, your patch
gives some improvement, but I still get the best performance when
increasing NAT_FQ_NELTS.

For example, one of the tests behaves like this:

Without patch, NAT_FQ_NELTS=64  --> 129 Gbit/s and ~600k cong. drops
With patch, NAT_FQ_NELTS=64  --> 136 Gbit/s and ~400k cong. drops
Without patch, NAT_FQ_NELTS=1024  --> 151 Gbit/s and 0 cong. drops
With patch, NAT_FQ_NELTS=1024  --> 151 Gbit/s and 0 cong. drops

So it still looks like increasing NAT_FQ_NELTS would be good, which
brings me back to the same questions as before:

Were there specific reasons for setting NAT_FQ_NELTS to 64?

Are there some potential drawbacks or dangers of changing it to a
larger value?

I suppose everyone will agree that when there is a queue with a
maximum length, the choice of that maximum length can be important. Is
there some particular reason to believe that 64 would be enough? In
our case we are using 8 NAT threads. Suppose thread 8 is held up
briefly due to something taking a little longer than usual, while
threads 1-7 each hand off 10 frames to thread 8; that situation would
require a queue size of at least 70, unless I have misunderstood how the
handoff mechanism works. To me, allowing a longer queue seems like a
good thing because it also allows us to handle more difficult cases
where threads are not always equally fast; there can be spikes in
traffic that affect some threads more than others, and things like
that. But maybe there are strong reasons for keeping the queue short,
reasons I don't know about - that's why I'm asking.
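
To make that arithmetic concrete, here is a small standalone toy program
(plain C, just a model of a bounded queue with a stalled consumer - not VPP
code, and the burst of 7 x 10 frames is only the hypothetical example above):

  #include <stdio.h>

  int main (void)
  {
    int queue_sizes[] = { 64, 128, 1024 };
    int n_producers = 7;  /* threads 1-7 handing off to thread 8 */
    int frames_each = 10; /* frames handed off while thread 8 is stalled */

    for (unsigned i = 0; i < sizeof (queue_sizes) / sizeof (queue_sizes[0]); i++)
      {
        int capacity = queue_sizes[i];
        int queued = 0, dropped = 0;

        /* the consumer (thread 8) is stalled, so nothing is dequeued */
        for (int p = 0; p < n_producers; p++)
          for (int f = 0; f < frames_each; f++)
            {
              if (queued < capacity)
                queued++;
              else
                dropped++; /* this is what shows up as a congestion drop */
            }

        printf ("queue size %4d: %2d frames queued, %d dropped\n",
                capacity, queued, dropped);
      }
    return 0;
  }

With 70 frames offered to a 64-slot queue, 6 of them are refused; any size of
70 or more absorbs this particular burst, which is the point of the question.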

Best regards,
Elias


On Fri, 2020-11-13 at 15:14 +, Klement Sekera -X (ksekera -
PANTHEON TECH SRO at Cisco) wrote:
> Hi Elias,
> 
> I’ve already debugged this and came to the conclusion that it’s the
> infra which is the weak link. I was seeing congestion drops at mild
> load, but not at full load. The issue is that with handoff, the
> workload is uneven. For simplicity’s sake, just consider thread 1
> handing off all the traffic to thread 2. What happens is that for
> thread 1 the job is much easier: it just does some ip4 parsing and
> then hands the packet to thread 2, which actually does the heavy lifting
> of hash inserts/lookups/translation etc. A 64-element queue can hold 64
> frames; one extreme is 64 1-packet frames, totalling 64 packets, the
> other extreme is 64 255-packet frames, totalling ~16k packets. What
> happens is this: thread 1 is mostly idle, just picking a few
> packets from the NIC, and every one of these small frames creates an
> entry in the handoff queue. Now thread 2 picks one element from the
> handoff queue and deals with it before picking another one. If the queue
> has only 3-packet or 10-packet elements, then thread 2 can never really
> get into what VPP excels at - bulk processing.
> 
> Q: Why doesn’t it pick as many packets as possible from the handoff
> queue? 
> A: It’s not implemented.
> 
> I already wrote a patch for it, which made all the congestion drops
> I saw (in the above synthetic test case) disappear. The patch, 
> https://gerrit.fd.io/r/c/vpp/+/28980, is sitting in gerrit.
> 
> Would you like to give it a try and see if it helps your issue? We
> shouldn’t need big queues under mild loads anyway …
> 
> Regards,
> Klement
> 




Re: [vpp-dev] Increasing NAT worker handoff frame queue size NAT_FQ_NELTS to avoid congestion drops?

2020-11-13 Thread Vratko Polak -X (vrpolak - PANTHEON TECHNOLOGIES at Cisco) via lists.fd.io
>> Would you consider changing to a larger value in the official VPP code?

Maybe make it configurable?
I mean, after 28980 is merged and if you still find tweaking the value helpful.

Vratko.

-Original Message-
From: vpp-dev@lists.fd.io  On Behalf Of Klement Sekera via 
lists.fd.io
Sent: Friday, 2020-November-13 16:15
To: Elias Rudberg 
Cc: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] Increasing NAT worker handoff frame queue size 
NAT_FQ_NELTS to avoid congestion drops?

Hi Elias,

I’ve already debugged this and came to the conclusion that it’s the infra which 
is the weak link. I was seeing congestion drops at mild load, but not at full 
load. The issue is that with handoff, the workload is uneven. For simplicity’s 
sake, just consider thread 1 handing off all the traffic to thread 2. What 
happens is that for thread 1 the job is much easier: it just does some ip4 
parsing and then hands the packet to thread 2, which actually does the heavy 
lifting of hash inserts/lookups/translation etc. A 64-element queue can hold 64 
frames; one extreme is 64 1-packet frames, totalling 64 packets, the other 
extreme is 64 255-packet frames, totalling ~16k packets. What happens is this: 
thread 1 is mostly idle, just picking a few packets from the NIC, and every one 
of these small frames creates an entry in the handoff queue. Now thread 2 picks 
one element from the handoff queue and deals with it before picking another 
one. If the queue has only 3-packet or 10-packet elements, then thread 2 can 
never really get into what VPP excels at - bulk processing.

Q: Why doesn’t it pick as many packets as possible from the handoff queue? 
A: It’s not implemented.

I already wrote a patch for it, which made all the congestion drops I saw (in 
the above synthetic test case) disappear. The patch, 
https://gerrit.fd.io/r/c/vpp/+/28980, is sitting in gerrit.

Would you like to give it a try and see if it helps your issue? We shouldn’t 
need big queues under mild loads anyway …

Regards,
Klement

> On 13 Nov 2020, at 16:03, Elias Rudberg  wrote:
> 
> Hello VPP experts,
> 
> We are using VPP for NAT44 and we get some "congestion drops", in a
> situation where we think VPP is far from overloaded in general. Then
> we started to investigate if it would help to use a larger handoff
> frame queue size. In theory at least, allowing a longer queue could
> help avoid drops in case of short spikes of traffic, or if it
> happens that some worker thread is temporarily busy for whatever
> reason.
> 
> The NAT worker handoff frame queue size is hard-coded in the
> NAT_FQ_NELTS macro in src/plugins/nat/nat.h where the current value is
> 64. The idea is that putting a larger value there could help.
> 
> We have run some tests where we changed the NAT_FQ_NELTS value from 64
> to a range of other values, each time rebuilding VPP and running an
> identical test, a test case that is to some extent trying to mimic our
> real traffic, although of course it is simplified. The test runs many
> iperf3 tests simultaneously using TCP, combined with some UDP traffic
> chosen to trigger VPP to create more new sessions (to make the NAT
> "slowpath" happen more).
> 
> The following NAT_FQ_NELTS values were tested:
> 16
> 32
> 64  <-- current value
> 128
> 256
> 512
> 1024
> 2048  <-- best performance in our tests
> 4096
> 8192
> 16384
> 32768
> 65536
> 131072
> 
> In those tests, performance was very bad for the smallest NAT_FQ_NELTS
> values of 16 and 32, while values larger than 64 gave improved
> performance. The best results in terms of throughput were seen for
> NAT_FQ_NELTS=2048. For even larger values than that, we got reduced
> performance compared to the 2048 case.
> 
> The tests were done for VPP 20.05 running on a Ubuntu 18.04 server
> with a 12-core Intel Xeon CPU and two Mellanox mlx5 network cards. The
> number of NAT threads was 8 in some of the tests and 4 in some of the
> tests.
> 
> According to these tests, the effect of changing NAT_FQ_NELTS can be
> quite large. For example, for one test case chosen such that
> congestion drops were a significant problem, the throughput increased
> from about 43 to 90 Gbit/second with the amount of congestion drops
> per second reduced to about one third. In another kind of test,
> throughput increased by about 20% with congestion drops reduced to
> zero. Of course such results depend a lot on how the tests are
> constructed. But anyway, it seems clear that the choice of
> NAT_FQ_NELTS value can be important and that increasing it would be
> good, at least for the kind of usage we have tested now.
> 
> Based on the above, we are considering changing NAT_FQ_NELTS from 64
> to a larger value and starting to try that in our production environment
> (so far we have only tried it in a test environment).
> [...]

Re: [vpp-dev] Increasing NAT worker handoff frame queue size NAT_FQ_NELTS to avoid congestion drops?

2020-11-13 Thread Klement Sekera via lists.fd.io
Hi Elias,

I’ve already debugged this and came to the conclusion that it’s the infra which 
is the weak link. I was seeing congestion drops at mild load, but not at full 
load. The issue is that with handoff, the workload is uneven. For simplicity’s 
sake, just consider thread 1 handing off all the traffic to thread 2. What 
happens is that for thread 1 the job is much easier: it just does some ip4 
parsing and then hands the packet to thread 2, which actually does the heavy 
lifting of hash inserts/lookups/translation etc. A 64-element queue can hold 64 
frames; one extreme is 64 1-packet frames, totalling 64 packets, the other 
extreme is 64 255-packet frames, totalling ~16k packets. What happens is this: 
thread 1 is mostly idle, just picking a few packets from the NIC, and every one 
of these small frames creates an entry in the handoff queue. Now thread 2 picks 
one element from the handoff queue and deals with it before picking another 
one. If the queue has only 3-packet or 10-packet elements, then thread 2 can 
never really get into what VPP excels at - bulk processing.

Q: Why doesn’t it pick as many packets as possible from the handoff queue? 
A: It’s not implemented.
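
To put the same thing in code, here is a toy model (plain C with made-up
numbers - 64 small frames of 4 packets each and a 256-packet vector size -
nothing to do with the actual handoff implementation) comparing the two
consumption strategies:

  #include <stdio.h>

  #define N_FRAMES       64   /* a queue full of small frames */
  #define PKTS_PER_FRAME  4   /* e.g. the "3-packet or 10-packet elements" */
  #define VECTOR_SIZE   256   /* the burst size VPP is optimized for */

  int main (void)
  {
    int frames[N_FRAMES];
    for (int i = 0; i < N_FRAMES; i++)
      frames[i] = PKTS_PER_FRAME;

    /* current behaviour: each dequeued frame is processed as its own burst */
    int bursts = 0, total = 0;
    for (int i = 0; i < N_FRAMES; i++)
      {
        total += frames[i];
        bursts++; /* a burst of only PKTS_PER_FRAME packets */
      }
    printf ("one frame per burst: %d bursts, avg %d packets/burst\n",
            bursts, total / bursts);

    /* alternative: keep draining frames until a full vector is gathered */
    int pending = 0;
    bursts = 0;
    for (int i = 0; i < N_FRAMES; i++)
      {
        pending += frames[i];
        if (pending >= VECTOR_SIZE)
          {
            bursts++;
            pending -= VECTOR_SIZE;
          }
      }
    if (pending)
      bursts++; /* flush whatever is left */
    printf ("drain until full   : %d bursts, avg %d packets/burst\n",
            bursts, total / bursts);

    return 0;
  }

The total work is the same in both runs, but only the second variant gives the
downstream thread the full-vector bursts that the rest of VPP is built around.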

I already wrote a patch for it, which made all the congestion drops I saw (in 
the above synthetic test case) disappear. The patch, 
https://gerrit.fd.io/r/c/vpp/+/28980, is sitting in gerrit.

Would you like to give it a try and see if it helps your issue? We shouldn’t 
need big queues under mild loads anyway …

Regards,
Klement

> On 13 Nov 2020, at 16:03, Elias Rudberg  wrote:
> 
> Hello VPP experts,
> 
> We are using VPP for NAT44 and we get some "congestion drops", in a
> situation where we think VPP is far from overloaded in general. Then
> we started to investigate if it would help to use a larger handoff
> frame queue size. In theory at least, allowing a longer queue could
> help avoid drops in case of short spikes of traffic, or if it
> happens that some worker thread is temporarily busy for whatever
> reason.
> 
> The NAT worker handoff frame queue size is hard-coded in the
> NAT_FQ_NELTS macro in src/plugins/nat/nat.h where the current value is
> 64. The idea is that putting a larger value there could help.
> 
> We have run some tests where we changed the NAT_FQ_NELTS value from 64
> to a range of other values, each time rebuilding VPP and running an
> identical test, a test case that is to some extent trying to mimic our
> real traffic, although of course it is simplified. The test runs many
> iperf3 tests simultaneously using TCP, combined with some UDP traffic
> chosen to trigger VPP to create more new sessions (to make the NAT
> "slowpath" happen more).
> 
> The following NAT_FQ_NELTS values were tested:
> 16
> 32
> 64  <-- current value
> 128
> 256
> 512
> 1024
> 2048  <-- best performance in our tests
> 4096
> 8192
> 16384
> 32768
> 65536
> 131072
> 
> In those tests, performance was very bad for the smallest NAT_FQ_NELTS
> values of 16 and 32, while values larger than 64 gave improved
> performance. The best results in terms of throughput were seen for
> NAT_FQ_NELTS=2048. For even larger values than that, we got reduced
> performance compared to the 2048 case.
> 
> The tests were done for VPP 20.05 running on a Ubuntu 18.04 server
> with a 12-core Intel Xeon CPU and two Mellanox mlx5 network cards. The
> number of NAT threads was 8 in some of the tests and 4 in some of the
> tests.
> 
> According to these tests, the effect of changing NAT_FQ_NELTS can be
> quite large. For example, for one test case chosen such that
> congestion drops were a significant problem, the throughput increased
> from about 43 to 90 Gbit/second with the amount of congestion drops
> per second reduced to about one third. In another kind of test,
> throughput increased by about 20% with congestion drops reduced to
> zero. Of course such results depend a lot on how the tests are
> constructed. But anyway, it seems clear that the choice of
> NAT_FQ_NELTS value can be important and that increasing it would be
> good, at least for the kind of usage we have tested now.
> 
> Based on the above, we are considering changing NAT_FQ_NELTS from 64
> to a larger value and starting to try that in our production environment
> (so far we have only tried it in a test environment).
> 
> Were there specific reasons for setting NAT_FQ_NELTS to 64?
> 
> Are there some potential drawbacks or dangers of changing it to a
> larger value?
> 
> Would you consider changing to a larger value in the official VPP
> code?
> 
> Best regards,
> Elias
> 
> 
> 
> 

