If you can handle the traffic with a single thread, then all multi-worker issues 
go away. But because of the infra limitation, congestion drops show up easily 
with as few as two workers.

Regards,
Klement

> On 13 Nov 2020, at 18:41, Marcos - Mgiga <mar...@mgiga.com.br> wrote:
> 
> Thanks. So you see reducing the number of VPP threads as an option to work 
> around this issue, since that would probably increase the vector rate per thread?
> 
> Best Regards
> 
> -----Original Message-----
> From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On behalf of Klement Sekera via 
> lists.fd.io
> Sent: Friday, 13 November 2020 14:26
> To: Marcos - Mgiga <mar...@mgiga.com.br>
> Cc: Elias Rudberg <elias.rudb...@bahnhof.net>; vpp-dev <vpp-dev@lists.fd.io>
> Subject: Re: RES: RES: [vpp-dev] Increasing NAT worker handoff frame queue 
> size NAT_FQ_NELTS to avoid congestion drops?
> 
> I used the usual
> 
> 1. start traffic
> 2. clear run
> 3. wait n seconds (e.g. n == 10)
> 4. show run
> 
> Klement
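
For reference, the same steps expressed as host-shell commands (a sketch assuming 
the CLI is reached through vppctl; "clear run"/"show run" are the abbreviated forms 
of "clear runtime"/"show runtime", and the per-node vector rate appears in the 
Vectors/Call column of the output):

    vppctl clear runtime     # step 2: reset per-node counters while traffic is flowing
    sleep 10                 # step 3: let the counters accumulate for ~10 seconds
    vppctl show runtime      # step 4: per-node Calls, Vectors, Clocks, Vectors/Call
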
> 
>> On 13 Nov 2020, at 18:21, Marcos - Mgiga <mar...@mgiga.com.br> wrote:
>> 
>> Understood. And what path did you take in order to analyse and monitor 
>> vector rates? Is there some specific command or log?
>> 
>> Thanks
>> 
>> Marcos
>> 
>> -----Original Message-----
>> De: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On behalf of ksekera via 
>> [] Sent: Friday, 13 November 2020 14:02
>> To: Marcos - Mgiga <mar...@mgiga.com.br>
>> Cc: Elias Rudberg <elias.rudb...@bahnhof.net>; vpp-dev@lists.fd.io
>> Subject: Re: RES: [vpp-dev] Increasing NAT worker handoff frame queue size 
>> NAT_FQ_NELTS to avoid congestion drops?
>> 
>> Not completely idle, more like medium load. The vector rates at which I saw 
>> congestion drops were roughly 40 for the thread doing no work (just handoffs - I 
>> hardcoded it this way for test purposes), and roughly 100 for the thread picking 
>> up the packets and doing the NAT work.
>> 
>> What got me investigating the infra was the fact that once I was hitting 
>> vector rates around 255, I did see packet drops, but no congestion drops.
>> 
>> HTH,
>> Klement
>> 
>>> On 13 Nov 2020, at 17:51, Marcos - Mgiga <mar...@mgiga.com.br> wrote:
>>> 
>>> So you mean that this situation (congestion drops) is more likely to occur 
>>> when the system is generally idle than when it is processing a large 
>>> amount of traffic?
>>> 
>>> Best Regards
>>> 
>>> Marcos
>>> 
>>> -----Original Message-----
>>> From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On behalf of Klement 
>>> Sekera via lists.fd.io
>>> Sent: Friday, 13 November 2020 12:15
>>> To: Elias Rudberg <elias.rudb...@bahnhof.net>
>>> Cc: vpp-dev@lists.fd.io
>>> Subject: Re: [vpp-dev] Increasing NAT worker handoff frame queue size 
>>> NAT_FQ_NELTS to avoid congestion drops?
>>> 
>>> Hi Elias,
>>> 
>>> I’ve already debugged this and came to the conclusion that it’s the infra 
>>> which is the weak link. I was seeing congestion drops at mild load, but not 
>>> at full load. The issue is that with handoff, the workload is uneven. For 
>>> simplicity’s sake, just consider thread 1 handing off all the traffic to 
>>> thread 2. Thread 1 has the much easier job: it just does some ip4 parsing 
>>> and then hands the packet to thread 2, which actually does the heavy 
>>> lifting of hash inserts/lookups/translation etc. A 64-element queue can 
>>> hold 64 frames; one extreme is 64 one-packet frames, totalling 64 packets, 
>>> the other extreme is 64 255-packet frames, totalling ~16k packets. What 
>>> happens is this: thread 1 is mostly idle, just picking a few packets from 
>>> the NIC, and every one of these small frames creates an entry in the 
>>> handoff queue. Thread 2 then picks one element from the handoff queue and 
>>> deals with it before picking another one. If the queue holds only 3-packet 
>>> or 10-packet elements, thread 2 can never really get into what VPP excels 
>>> at - bulk processing.
>>> 
>>> Q: Why doesn’t it pick as many packets as possible from the handoff queue? 
>>> A: It’s not implemented.
>>> 
>>> I already wrote a patch for it, which made all of the congestion drops I 
>>> saw (in the above synthetic test case) disappear. The patch is sitting in 
>>> gerrit: https://gerrit.fd.io/r/c/vpp/+/28980
>>> 
>>> Would you like to give it a try and see if it helps your issue? We 
>>> shouldn’t need big queues under mild loads anyway …
>>> 
>>> Regards,
>>> Klement
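
To make the arithmetic above concrete, here is a small stand-alone C toy model. It 
is not VPP code; the queue depth, frame sizes and batch target are taken from (or 
inspired by) the description above. It shows why a queue full of tiny frames keeps 
the consuming worker away from bulk processing, and what draining several frames 
per dispatch (the idea behind the patch, whose actual details live in the gerrit 
change) would change:

    /*
     * Toy model of the handoff queue behaviour described above -- not VPP code.
     * All names and numbers are illustrative only.
     */
    #include <stdio.h>

    #define QUEUE_NELTS    64    /* NAT_FQ_NELTS: the queue holds frames, not packets */
    #define MAX_FRAME_SIZE 255   /* one frame can carry up to ~255 packets            */

    int
    main (void)
    {
      /* Capacity extremes: queue depth limits frames, so the packet capacity
       * depends entirely on how full each handed-off frame is. */
      printf ("min: %d frames x 1 packet   = %d packets\n", QUEUE_NELTS, QUEUE_NELTS);
      printf ("max: %d frames x %d packets = %d packets (~16k)\n",
              QUEUE_NELTS, MAX_FRAME_SIZE, QUEUE_NELTS * MAX_FRAME_SIZE);

      /* Mild load: thread 1 hands off many tiny frames (say 4 packets each). */
      int frame_sizes[QUEUE_NELTS];
      for (int i = 0; i < QUEUE_NELTS; i++)
        frame_sizes[i] = 4;

      /* Current behaviour as described: thread 2 takes one frame per dispatch,
       * so its effective vector size is stuck at the tiny frame size. */
      printf ("one frame per dispatch: %d packets/dispatch\n", frame_sizes[0]);

      /* Sketch of the patch idea: drain queued frames until roughly a full
       * vector of packets has been collected, enabling bulk processing. */
      int batch_target = 256, i = 0;
      while (i < QUEUE_NELTS)
        {
          int collected = 0;
          while (i < QUEUE_NELTS && collected < batch_target)
            collected += frame_sizes[i++];
          printf ("batched dispatch: %d packets\n", collected);
        }
      return 0;
    }
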
>>> 
>>>> On 13 Nov 2020, at 16:03, Elias Rudberg <elias.rudb...@bahnhof.net> wrote:
>>>> 
>>>> Hello VPP experts,
>>>> 
>>>> We are using VPP for NAT44 and we get some "congestion drops" in a 
>>>> situation where we think VPP is far from overloaded in general. So we 
>>>> started to investigate whether it would help to use a larger handoff 
>>>> frame queue size. In theory at least, allowing a longer queue could 
>>>> help avoid drops in case of short spikes of traffic, or if some worker 
>>>> thread happens to be temporarily busy for whatever reason.
>>>> 
>>>> The NAT worker handoff frame queue size is hard-coded in the 
>>>> NAT_FQ_NELTS macro in src/plugins/nat/nat.h where the current value 
>>>> is 64. The idea is that putting a larger value there could help.
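
For illustration, the change being tested boils down to editing that one definition 
and rebuilding (a sketch only; the exact surrounding code in nat.h may differ 
between VPP versions):

    /* src/plugins/nat/nat.h (sketch) */
    #define NAT_FQ_NELTS 64      /* current hard-coded handoff queue depth */
    /* e.g. replaced for the experiments with one of the values listed below: */
    /* #define NAT_FQ_NELTS 2048 */
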
>>>> 
>>>> We have run some tests where we changed the NAT_FQ_NELTS value from
>>>> 64 to a range of other values, each time rebuilding VPP and running an 
>>>> identical test, a test case that tries, in a simplified way, to mimic 
>>>> our real traffic. The test runs many iperf3 tests simultaneously using 
>>>> TCP, combined with some UDP traffic chosen to trigger VPP to create 
>>>> more new sessions (to make the NAT "slowpath" happen more).
>>>> 
>>>> The following NAT_FQ_NELTS values were tested:
>>>> 16
>>>> 32
>>>> 64  <-- current value
>>>> 128
>>>> 256
>>>> 512
>>>> 1024
>>>> 2048  <-- best performance in our tests
>>>> 4096
>>>> 8192
>>>> 16384
>>>> 32768
>>>> 65536
>>>> 131072
>>>> 
>>>> In those tests, performance was very bad for the smallest 
>>>> NAT_FQ_NELTS values of 16 and 32, while values larger than 64 gave 
>>>> improved performance. The best results in terms of throughput were 
>>>> seen for NAT_FQ_NELTS=2048. For even larger values than that, we got 
>>>> reduced performance compared to the 2048 case.
>>>> 
>>>> The tests were done for VPP 20.05 running on an Ubuntu 18.04 server 
>>>> with a 12-core Intel Xeon CPU and two Mellanox mlx5 network cards.
>>>> The number of NAT threads was 8 in some of the tests and 4 in others.
>>>> 
>>>> According to these tests, the effect of changing NAT_FQ_NELTS can be 
>>>> quite large. For example, for one test case chosen such that 
>>>> congestion drops were a significant problem, the throughput 
>>>> increased from about 43 to 90 Gbit/second, with the number of 
>>>> congestion drops per second reduced to about one third. In another 
>>>> kind of test, throughput increased by about 20% with congestion 
>>>> drops reduced to zero. Of course such results depend a lot on how 
>>>> the tests are constructed. But anyway, it seems clear that the 
>>>> choice of NAT_FQ_NELTS value can be important and that increasing it 
>>>> would be good, at least for the kind of usage we have tested now.
>>>> 
>>>> Based on the above, we are considering changing NAT_FQ_NELTS from 64 
>>>> to a larger value and start trying that in our production 
>>>> environment (so far we have only tried it in a test environment).
>>>> 
>>>> Were there specific reasons for setting NAT_FQ_NELTS to 64?
>>>> 
>>>> Are there some potential drawbacks or dangers of changing it to a 
>>>> larger value?
>>>> 
>>>> Would you consider changing to a larger value in the official VPP 
>>>> code?
>>>> 
>>>> Best regards,
>>>> Elias
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 
