Dear Rubina,

I have adjusted my trex config to match yours - increased the 2m value
to 4m - but that still didn't change much.
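
Specifically, that is the dp_flows value in the memory section, so my
config now has:

  memory:
       dp_flows: 4000000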

The only thing in your config that could be adapted is to have "deny"
instead of "drop" in the ACL configuration. Also, right now one of your
ACLs does not have any rules - it is best to avoid that configuration.

*However*, I have spotted another difference between our setups, which
turns out to be important.

In your "show acl-plugin sessions" output, the session creation/deletion
is concentrated on a single worker, whereas in my case it is handled
fairly evenly by two of them.

This is how the interfaces are distributed across the workers during
the test run (short timeouts; the total session count hovers just
under 300K):

vpp# show int rx-placement
Thread 1 (vpp_wk_0):
  node dpdk-input:
    TenGigabitEthernet81/0/0 queue 0 (polling)
Thread 2 (vpp_wk_1):
  node dpdk-input:
    TenGigabitEthernet81/0/1 queue 0 (polling)
vpp#


If I make this change:

vpp# set interface rx-placement TenGigabitEthernet81/0/1 worker 0
vpp# show int rx-placement
Thread 1 (vpp_wk_0):
  node dpdk-input:
    TenGigabitEthernet81/0/0 queue 0 (polling)
    TenGigabitEthernet81/0/1 queue 0 (polling)
vpp#

then the session count climbs to 1m relatively quickly, and once it
reaches 1m, we stop creating new sessions and stop forwarding traffic
for new sessions. The connections still do get cleaned up periodically,
but the cleanup rate is too slow. In my setup I run both the t-rex and
VPP on the same machine, on different cores - so I would expect that if
you have two separate machines, this overload effect is even more
pronounced.

When I stop the traffic, the session count slowly goes down to 0, and
if I change the rx-placement of the interfaces back to what it was
before, I can again run the test successfully for a long time.
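
(In my case, changing back is just the inverse of the command above:

vpp# set interface rx-placement TenGigabitEthernet81/0/1 worker 1

i.e. putting TenGigabitEthernet81/0/1 back on vpp_wk_1.)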

So, it seems like the cleaner node does not cope when there is higher load.


The way I intended the cleanup mechanism to work is: the workers clean
up at most a fixed number of connections
(ACL_FA_DEFAULT_MAX_DELETED_SESSIONS_PER_INTERVAL, see the diff below)
during the (variable) interrupt interval, and if we need to clean up
more, the interrupt interval is halved. Basically a TCP-style feedback
loop, which should self-balance at around the current connection
cleanup rate.
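
In pseudo-C, the intended behavior is roughly the following - just a
sketch, the helper names are invented, the real logic lives in the
acl-plugin session cleaner node:

  /* Illustrative sketch only: wait_interval() and
   * delete_idle_sessions() are hypothetical helpers. */
  u64 max_per_interval = ACL_FA_DEFAULT_MAX_DELETED_SESSIONS_PER_INTERVAL;
  f64 interval = 0.5;   /* seconds */

  while (1)
    {
      /* sleep until the (variable) interrupt interval expires */
      wait_interval (interval);

      /* workers delete at most this many idle sessions per interval */
      u64 deleted = delete_idle_sessions (max_per_interval);

      if (deleted >= max_per_interval)
        /* there is a backlog: wake up twice as often */
        interval = interval / 2;
      /* (once the backlog drains, the interval is allowed to grow
       * back towards the default - omitted here) */
    }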

To investigate further, I tried this change:

diff --git a/src/plugins/acl/acl.h b/src/plugins/acl/acl.h
index 07ed868..98291c5 100644
--- a/src/plugins/acl/acl.h
+++ b/src/plugins/acl/acl.h
@@ -242,7 +242,7 @@ typedef struct {
    * of connections, it halves the sleep time.
    */

-#define ACL_FA_DEFAULT_MAX_DELETED_SESSIONS_PER_INTERVAL 100
+#define ACL_FA_DEFAULT_MAX_DELETED_SESSIONS_PER_INTERVAL 10000
   u64 fa_max_deleted_sessions_per_interval;

   /*

And the previously "stuck" tests worked fine - the sessions were being
cleaned up, and the session count was again hovering at about 300K,
even with both interfaces serviced by the same worker.

So, it seems like either my interrupt-sending code is misbehaving, or
the worker threads under load don't get enough interrupts - assuming it
is the same issue we are both seeing.

So, could you please do as follows:


0) run the multicore test (short timeouts), which will fail, then stop
the traffic and wait for the session count to go to 0

1) rebalance the interfaces onto different workers using "set interface
rx-placement" as above (this merely lowers the per-worker load, hence
the multicore test) - see the example command after step 3

2) retry the multicore test. It may or may not run successfully; it
would be interesting to know the result.

3) apply the diff above, rebuild vpp and repeat steps 0-2.
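
For step 1, assuming your interfaces currently share worker 0, the
commands would be something like the following (substitute your actual
interface name):

vpp# set interface rx-placement <your-second-interface> worker 1
vpp# show int rx-placement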

This will help us confirm that I have indeed reproduced the same issue
as the one seen in your setup.

Thanks a lot!

--a




On 3/13/18, Rubina Bianchi <r_bian...@outlook.com> wrote:
> Dear Andrew
>
> My Trex config is uploaded; I also tested the scenario with your Trex
> config.
> The stability of vpp in your run is strange. When I run this scenario, vpp
> crashes in my DUT machine after about 200 second of running Trex.
> In this period I see #del sessions is 0 until session pool becomes full,
> after that session deletion starts. But its rate is lower than the one I see
> when I run vpp on single core.
>
> Could you please check my configs once again for any misconfiguration?
> Is vpp or dpdk compatible or incompatible with any specified device?
>
> Thanks,
> Sincerely
>
> Sent from Outlook<http://aka.ms/weboutlook>
> ________________________________
> From: Andrew 👽 Yourtchenko <ayour...@gmail.com>
> Sent: Monday, March 12, 2018 1:50 PM
> To: Rubina Bianchi
> Cc: vpp-dev@lists.fd.io
> Subject: Re: [vpp-dev] Freezing Session Deletion Operation
>
> Dear Rubina,
>
> I've tried the test locally using the data that you sent, here is the
> output from my trex after 10 minutes running:
>
> -Per port stats table
>       ports |               0 |               1
> -----------------------------------------------------------------------------------------
>    opackets |       312605970 |       312191927
>      obytes |    100919855857 |    174147108346
>    ipackets |       311329098 |       277120788
>      ibytes |    173666531289 |     76492053900
>     ierrors |               0 |               0
>     oerrors |               0 |               0
>       Tx Bw |       1.17 Gbps |       2.01 Gbps
>
> -Global stats enabled
>  Cpu Utilization : 21.2  %  30.0 Gb/core
>  Platform_factor : 1.0
>  Total-Tx        :       3.18 Gbps
>  Total-Rx        :       2.89 Gbps
>  Total-PPS       :     901.93 Kpps
>  Total-CPS       :      13.52 Kcps
>
>  Expected-PPS    :     901.92 Kpps
>  Expected-CPS    :      13.53 Kcps
>  Expected-BPS    :       3.18 Gbps
>
>  Active-flows    :     8883  Clients :      255   Socket-util : 0.0553 %
>  Open-flows      :  9425526  Servers :    65535   Socket :     8883
> Socket/Clients :  34.8
>  drop-rate       :       0.00  bps
>  current time    : 702.8 sec
>  test duration   : 2897.2 sec
>
> So, in my setup I could not see the behavior you describe...
>
> But we have at least one more thing that may be different between our
> setups - is the trex config.
>
> Here is what mine looks like:
>
> - version: 2
>   interfaces: ['03:00.0', '03:00.1']
>   port_limit: 2
>   memory:
>        dp_flows: 2000000
>   port_info:
>        - ip: 1.1.1.1
>          default_gw: 1.1.1.2
>        - ip: 1.1.1.2
>          default_gw: 1.1.1.1
>
>
> Could you send me your trex config to see if that might be the
> difference between our setups, so I could try it locally ?
>
> Thanks!
>
> --a
>
> On 3/12/18, Rubina Bianchi <r_bian...@outlook.com> wrote:
>> Hi Dear Andrew
>>
>> I repeated once again my scenarios with short timeouts and upload all
>> configs and outputs for your consideration.
>> I am clear about that session cleaner process doesn't work properly and
>> my
>> Trex throughput  stuck at 0.
>> Please repeat this scenario to verify this (Unfortunately vpp is just
>> stable
>> for 200 second and after that vpp will be down).
>>
>> Thanks,
>> Sincerely
>>
>> Sent from Outlook<http://aka.ms/weboutlook>
>> ________________________________
>> From: Andrew 👽 Yourtchenko <ayour...@gmail.com>
>> Sent: Sunday, March 11, 2018 3:48 PM
>> To: Rubina Bianchi
>> Cc: vpp-dev@lists.fd.io
>> Subject: Re: [vpp-dev] Freezing Session Deletion Operation
>>
>> Hi Rubina,
>>
>> I am assuming you are observing this both in single core and multicore
>> scenario ?
>>
>> Based on the outputs, this is what I think might be going on:
>>
>> I am seeing the total# of sessions is 1000000, and no TCP transient
>> sessions - thus the packets that require a session are dropped.
>>
>> What is a bit peculiar, is that the session delete# per-worker are
>> non-zero, yet the delete counters are zero. To me this indicates
>> there was a fair bit of transient sessions, which also then got
>> recycled by the TCP sessions properly established, before the idle
>> timeout has expired.
>>
>> And at the moment of taking the show command output the connection
>> cleaner activity has not yet kicked in - I do not see either any
>> session deleted by idle timeout nor its timer restarted. Which makes
>> me think that the time interval in which you are testing must be
>> relatively short...
>>
>> So, assuming the time between the start of the traffic and the time
>> you have 1m sessions is quite short, this is simply using up all of
>> the connection pool, a classic inherent resource management issue with
>> any stateful scenario.
>>
>> You can verify that the sessions delete and start building again if
>> you issue "clear acl-plugin sessions".
>>
>> Also, changing the session timeouts to more aggressive values (say, 10
>> seconds), should kick off the aggressive connection cleaning, thus
>> should unlock this condition. Of course, shorter idle time means
>> potentially useful connections removed.  (the commands are "set
>> acl-plugin session timeout <udp|tcp> idle <X>").
>>
>> *if* neither of the above adequately describes what you are seeing,
>> the cleaner node may for whatever reason cease to kick in every half
>> a second.
>>
>> To see the dynamics of conn cleaner node, you can use the debug command
>> "set acl-plugin session event-trace 1" before the start of the test.
>> This will produce the event trace, which you can view by "show
>> event-logger all" - this should give a reasonable idea about what the
>> cleaner node is up to.
>>
>> Please let me know.
>>
>> --a
>>
>>
>>
>>
>>
>> On 3/11/18, Rubina Bianchi <r_bian...@outlook.com> wrote:
>>> Hi,
>>>
>>> I am testing vpp_18.01.1-124~g302ef0b5 (commit:
>>> 696e0da1cde7e2bcda5a781f9ad286672de8dcbf) and
>>> vpp_18.04-rc0~322-g30787378
>>> (commit: 30787378f49714319e75437b347b7027a949700d) using Trex with sfr
>>> scenario in one core and multicore state.
>>> After a while I saw session deletion rate decreases and vpp throughput
>>> becomes 0 bps.
>>> All configuration files and outputs are attached.
>>>
>>> Thanks,
>>> Sincerely
>>>
>>> Sent from Outlook<http://aka.ms/weboutlook>
>>>
>>
>
