So you mean that this situation (congestion drops) is more likely to occur 
when the system is mostly idle than when it is processing a large amount of 
traffic?

Best Regards

Marcos

-----Original Message-----
From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Klement Sekera via 
lists.fd.io
Sent: Friday, 13 November 2020 12:15
To: Elias Rudberg <elias.rudb...@bahnhof.net>
Cc: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] Increasing NAT worker handoff frame queue size 
NAT_FQ_NELTS to avoid congestion drops?

Hi Elias,

I’ve already debugged this and came to the conclusion that the infra is the 
weak link. I was seeing congestion drops at mild load, but not at full load. 
The issue is that with handoff, the workload is uneven. For simplicity’s sake, 
just consider thread 1 handing off all the traffic to thread 2. Thread 1 has 
the much easier job: it just does some ip4 parsing and then hands the packet 
to thread 2, which actually does the heavy lifting of hash 
inserts/lookups/translation etc. A 64-element queue can hold 64 frames: one 
extreme is 64 one-packet frames, totalling 64 packets, the other extreme is 64 
255-packet frames, totalling ~16k packets. What happens is this: thread 1 is 
mostly idle, just picking a few packets from the NIC, and every one of these 
small frames creates an entry in the handoff queue. Thread 2 then picks one 
element from the handoff queue and deals with it before picking another one. 
If the queue only has 3-packet or 10-packet elements, thread 2 can never 
really get into what VPP excels at - bulk processing.
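
To make the arithmetic concrete, here is a tiny standalone sketch (the 
constants are assumptions taken from this thread, not the actual vlib 
frame-queue code) of the queue’s capacity in packets at the two extremes:

  /* Illustration of the capacity arithmetic above; hypothetical
   * constants only, not the real frame-queue implementation. */
  #include <stdio.h>

  #define FQ_NELTS        64   /* the handoff queue holds 64 elements (frames) */
  #define MAX_FRAME_SIZE 255   /* a frame carries between 1 and 255 packets    */

  int
  main (void)
  {
    printf ("capacity with 1-packet frames: %d packets\n",
            FQ_NELTS * 1);                  /* 64 packets       */
    printf ("capacity with full frames:     %d packets\n",
            FQ_NELTS * MAX_FRAME_SIZE);     /* 16320, i.e. ~16k */
    return 0;
  }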

Q: Why doesn’t it pick as many packets as possible from the handoff queue? 
A: It’s not implemented.

I already wrote a patch for it, which made all the congestion drops I saw (in 
the above synthetic test case) disappear. The mentioned patch is sitting in 
gerrit: https://gerrit.fd.io/r/c/vpp/+/28980
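
To illustrate the idea only (a rough sketch with made-up names and types, NOT 
the code from the gerrit change above), the difference is between taking one 
queue element per dispatch and gathering packets from several queued elements 
into one batch before processing:

  /* Conceptual sketch only; fq_dequeue and process_batch are
   * hypothetical helpers assumed to exist for this illustration. */

  enum { FQ_NELTS = 64, MAX_FRAME = 255,
         BATCH_MAX = FQ_NELTS * MAX_FRAME };

  typedef struct { int n_packets; int pkt_index[MAX_FRAME]; } frame_t;

  extern frame_t *fq_dequeue (void *fq);         /* returns 0 when empty  */
  extern void process_batch (int *pkts, int n);  /* the NAT heavy lifting */

  /* today (simplified): one element per dispatch, so small frames mean
   * small vectors and no real bulk processing */
  static void
  dispatch_one (void *fq)
  {
    frame_t *f = fq_dequeue (fq);
    if (f)
      process_batch (f->pkt_index, f->n_packets);
  }

  /* patch idea (simplified): keep dequeuing while there is room and the
   * queue is non-empty, then process everything in one go */
  static void
  dispatch_many (void *fq)
  {
    static int batch[BATCH_MAX];
    int n = 0;
    frame_t *f;
    while (n + MAX_FRAME <= BATCH_MAX && (f = fq_dequeue (fq)))
      for (int i = 0; i < f->n_packets; i++)
        batch[n++] = f->pkt_index[i];
    if (n > 0)
      process_batch (batch, n);
  }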

Would you like to give it a try and see if it helps your issue? We shouldn’t 
need big queues under mild loads anyway …

Regards,
Klement

> On 13 Nov 2020, at 16:03, Elias Rudberg <elias.rudb...@bahnhof.net> wrote:
> 
> Hello VPP experts,
> 
> We are using VPP for NAT44 and we get some "congestion drops" in a 
> situation where we think VPP is far from overloaded in general. So we 
> started to investigate whether it would help to use a larger handoff 
> frame queue size. In theory at least, allowing a longer queue could 
> help avoid drops in case of short spikes of traffic, or if some worker 
> thread happens to be temporarily busy for whatever reason.
> 
> The NAT worker handoff frame queue size is hard-coded in the 
> NAT_FQ_NELTS macro in src/plugins/nat/nat.h where the current value is 
> 64. The idea is that putting a larger value there could help.
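> 
> For illustration, the hard-coded value looks roughly like this (just a 
> sketch; the surrounding code in nat.h is not quoted here, and the second 
> define is only an example of the kind of change we rebuilt with):
> 
>   /* src/plugins/nat/nat.h */
>   #define NAT_FQ_NELTS 64        /* NAT worker handoff frame queue size */
> 
>   /* example of a rebuilt test value: */
>   /* #define NAT_FQ_NELTS 2048 */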
> 
> We have run some tests where we changed the NAT_FQ_NELTS value from 64 
> to a range of other values, each time rebuilding VPP and running an 
> identical test. The test case tries, to some extent, to mimic our real 
> traffic, although it is of course simplified. The test runs many 
> simultaneous iperf3 TCP tests, combined with some UDP traffic chosen to 
> trigger VPP to create more new sessions (to make the NAT "slowpath" 
> happen more often).
> 
> The following NAT_FQ_NELTS values were tested:
> 16
> 32
> 64  <-- current value
> 128
> 256
> 512
> 1024
> 2048  <-- best performance in our tests
> 4096
> 8192
> 16384
> 32768
> 65536
> 131072
> 
> In those tests, performance was very bad for the smallest NAT_FQ_NELTS 
> values of 16 and 32, while values larger than 64 gave improved 
> performance. The best results in terms of throughput were seen for 
> NAT_FQ_NELTS=2048. For even larger values than that, we got reduced 
> performance compared to the 2048 case.
> 
> The tests were done for VPP 20.05 running on an Ubuntu 18.04 server 
> with a 12-core Intel Xeon CPU and two Mellanox mlx5 network cards. The 
> number of NAT threads was 8 in some of the tests and 4 in others.
> 
> According to these tests, the effect of changing NAT_FQ_NELTS can be 
> quite large. For example, in one test case chosen such that congestion 
> drops were a significant problem, throughput increased from about 43 
> to 90 Gbit/s and the number of congestion drops per second was reduced 
> to about one third. In another kind of test, throughput increased by 
> about 20% with congestion drops reduced to zero. Of course such 
> results depend a lot on how the tests are constructed, but it seems 
> clear that the choice of NAT_FQ_NELTS value can be important and that 
> increasing it would be good, at least for the kind of usage we have 
> tested now.
> 
> Based on the above, we are considering changing NAT_FQ_NELTS from 64 
> to a larger value and starting to try that in our production 
> environment (so far we have only tried it in a test environment).
> 
> Were there specific reasons for setting NAT_FQ_NELTS to 64?
> 
> Are there some potential drawbacks or dangers of changing it to a 
> larger value?
> 
> Would you consider changing to a larger value in the official VPP 
> code?
> 
> Best regards,
> Elias
> 
