Paolo Valerio <pvale...@redhat.com> writes:

> Paolo Valerio <pvale...@redhat.com> writes:
>
>> Ilya Maximets <i.maxim...@ovn.org> writes:
>>
>>> On 6/20/22 23:57, Paolo Valerio wrote:
>>>> Ilya Maximets <i.maxim...@ovn.org> writes:
>>>> 
>>>>> On 6/7/22 11:39, Robin Jarry wrote:
>>>>>> Paolo Valerio, Jun 05, 2022 at 19:37:
>>>>>>> Just a note that may be useful.
>>>>>>> After some tests, I noticed that when establishing e.g. two TCP
>>>>>>> connections and leaving the first one idle after the 3whs, once the
>>>>>>> second connection expires (after moving to TIME_WAIT as a result of
>>>>>>> termination), it doesn't get evicted until some event gets scheduled
>>>>>>> for the first.
>>>>>>>
>>>>>>> ovs-appctl dpctl/dump-conntrack -s
>>>>>>> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9090,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9090),zone=1,timeout=84576,protoinfo=(state=ESTABLISHED)
>>>>>>> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9091,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9091),zone=1,timeout=0,protoinfo=(state=TIME_WAIT)
>>>>>>>
>>>>>>> This may be somewhat related to your results, as during the test the
>>>>>>> number of connections may reach the limit, apparently reducing
>>>>>>> performance.
>>>>>>
>>>>>> Indeed, there was an issue in my test procedure. Due to the way T-Rex
>>>>>> generates connections, it is easy to fill the conntrack table after
>>>>>> a few iterations, making the test results inconsistent.
>>>>>>
>>>>>> Also, the flows which I had configured were not correct. There was an
>>>>>> extraneous action=NORMAL flow at the end. When the conntrack table is
>>>>>> full and a new packet cannot be tracked, it is marked as +trk+inv and
>>>>>> not dropped. This behaviour is specific to the userspace datapath. The
>>>>>> linux kernel datapath seems to drop the packet when it cannot be added
>>>>>> to connection tracking.
>>>>>>
>>>>>> Gaëtan's series (v4) seems less resilient to the conntrack table being
>>>>>> full, especially when there is more than one PMD core.
>>>>>>
>>>>>> I have changed the t-rex script to allow running arbitrary commands in
>>>>>> between traffic iterations. This is leveraged to flush the conntrack
>>>>>> table and run each iteration in the same conditions.
>>>>>>
>>>>>> https://github.com/cisco-system-traffic-generator/trex-core/blob/v2.98/scripts/cps_ndr.py
>>>>>>
>>>>>> To avoid filling the conntrack table, the max size was increased to 50M.
>>>>>> The DUT configuration can be summarized as the following:
>>>>>>
>>>>>> ovs-vsctl set open_vswitch . other_config:dpdk-init=true
>>>>>> ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x15554"
>>>>>> ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
>>>>>> ovs-vsctl add-port br0 pf0 -- set Interface pf0 type=dpdk \
>>>>>>     options:dpdk-devargs=0000:3b:00.0 options:n_rxq=4 options:n_rxq_desc=4096
>>>>>> ovs-vsctl add-port br0 pf1 -- set Interface pf1 type=dpdk \
>>>>>>     options:dpdk-devargs=0000:3b:00.1 options:n_rxq=4 options:n_rxq_desc=4096
>>>>>> ovs-appctl dpctl/ct-set-maxconns 50000000
>>>>>> ovs-ofctl add-flow br0 "table=0,priority=10,ip,ct_state=-trk,actions=ct(table=0)"
>>>>>> ovs-ofctl add-flow br0 "table=0,priority=10,ip,ct_state=+trk+new,actions=ct(commit),NORMAL"
>>>>>> ovs-ofctl add-flow br0 "table=0,priority=10,ip,ct_state=+trk+est,actions=NORMAL"
>>>>>> ovs-ofctl add-flow br0 "table=0,priority=0,actions=drop"
>>>>>>
>>>>>> Short Lived Connections
>>>>>> -----------------------
>>>>>>
>>>>>> ./cps_ndr.py --sample-time 10 --max-iterations 8 --error-threshold 0.01 \
>>>>>>     --reset-command "ssh $dut ovs-appctl dpctl/flush-conntrack" \
>>>>>>     --udp-percent 1 --num-messages 1 --message-size 20 --server-wait 0 \
>>>>>>     --min-cps 10k --max-cps 600k
>>>>>>
>>>>>> ============== =============== ============== ============== ===========
>>>>>> Series         Num. Flows      CPS            PPS            BPS
>>>>>> ============== =============== ============== ============== ===========
>>>>>> Baseline       40.1K           79.3K/s        556Kp/s        347Mb/s
>>>>>> Gaetan v1      60.5K           121K/s         837Kp/s        522Mb/s
>>>>>> Gaetan v4      61.4K           122K/s         850Kp/s        530Mb/s
>>>>>> Paolo          377K            756K/s         5.3Mp/s        3.3Gb/s
>>>>>> ============== =============== ============== ============== ===========
>>>>>>
>>>>>> Even after fixing the test procedure, Paolo's series still performs
>>>>>> a lot better with short lived connections.
>>>>>>
>>>>>> Long Lived Connections
>>>>>> ----------------------
>>>>>>
>>>>>> ./cps_ndr.py --sample-time 30 --max-iterations 8 --error-threshold 0.01 \
>>>>>>     --reset-command "ssh $dut ovs-appctl dpctl/flush-conntrack" \
>>>>>>     --udp-percent 1 --num-messages 500 --message-size 20 --server-wait 50 \
>>>>>>     --min-cps 100 --max-cps 10k
>>>>>>
>>>>>> ============== =============== ============== ============== ===========
>>>>>> Series         Num. Flows      CPS            PPS            BPS
>>>>>> ============== =============== ============== ============== ===========
>>>>>> Baseline       17.4K           504/s          633Kp/s        422Mb/s
>>>>>> Gaetan v1      80.4K           3.1K/s         4.6Mp/s        3.0Gb/s
>>>>>> Gaetan v4      139K            5.4K/s         8.2Mp/s        5.4Gb/s
>>>>>> Paolo          132K            5.2K/s         7.7Mp/s        5.2Gb/s
>>>>>> ============== =============== ============== ============== ===========
>>>>>>
>>>>>> Thanks to Paolo for his help on this second round of tests.
>>>>>
>>>>> Hi, Robin and Paolo.
>>>>>
>>>> 
>>>> Hello Ilya,
>>>> 
>>>> thanks for the feedback about this.
>>>> 
>>>>> Could you try the following change on top of Gaetan's v1 and v4:
>>>>>
>>>>> diff --git a/lib/conntrack.c b/lib/conntrack.c
>>>>> index aa4792f69..2c2203a4b 100644
>>>>> --- a/lib/conntrack.c
>>>>> +++ b/lib/conntrack.c
>>>>> @@ -633,14 +633,17 @@ conn_key_lookup(struct conntrack *ct, const struct conn_key *key,
>>>>>      bool found = false;
>>>>>  
>>>>>      CMAP_FOR_EACH_WITH_HASH (conn, cm_node, hash, &ct->conns) {
>>>>> -        if (!conn_key_cmp(&conn->key, key) && !conn_expired(conn, now)) {
>>>>> +        if (conn_expired(conn, now)) {
>>>>> +            continue;
>>>>> +        }
>>>>> +        if (!conn_key_cmp(&conn->key, key)) {
>>>>>              found = true;
>>>>>              if (reply) {
>>>>>                  *reply = false;
>>>>>              }
>>>>>              break;
>>>>>          }
>>>>> -        if (!conn_key_cmp(&conn->rev_key, key) && !conn_expired(conn, now)) {
>>>>> +        if (!conn_key_cmp(&conn->rev_key, key)) {
>>>>>              found = true;
>>>>>              if (reply) {
>>>>>                  *reply = true;
>>>>> ---
>>>>>
>>>>> ?
>>>>>
>>>>> Let me explain.  As far as I can tell without actually running
>>>>> performance tests myself, the main issue with both variants of
>>>>> Gaetan's series is a slow cleanup of expired connections.
>>>>>
>>>>> v4, IMO, contains a serious bug, because the expiration list is
>>>>> treated at the sweep phase as if it was sorted by the expiration
>>>>> time.  But that is not really true, because connections can not be
>>>>> re-ordered.  They seem to stay in the exact same order as they were
>>>>> created, but expiration times are getting updated when packets are
>>>>> received.  However, the sweep process stops at the moment the first
>>>>> non-expired connection is seen, and we're not cleaning up later
>>>>> connections, as reported by Paolo.  Taking into account the fact
>>>>> that the default timeout policy for established TCP connections
>>>>> appears to be 24 hours (or did I misread this?), closed connections
>>>>> can stay uncleaned for that long.  That seems like way too much.
>>>>> The point is, the whole queue should be inspected on every sweep
>>>>> iteration in order to re-sort it and actually move connections
>>>>> between different expiration lists, to avoid silent established
>>>>> connections blocking closed ones.  This will probably be slow (?).
>>>>> The MPSC queue doesn't seem to be very suitable for such a use case.
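
For readers skimming the thread, here is a tiny standalone C sketch of the
early-stop pattern described above.  The names and list layout are invented
purely for illustration; this is not code from either series:

    /* Toy model: the expiration list is kept in insertion order, but
     * 'expiration' is refreshed every time a packet hits the connection,
     * so the list is not actually sorted by expiration time. */
    #include <stddef.h>

    struct toy_conn {
        struct toy_conn *next;      /* Singly linked expiration list. */
        long long expiration;       /* Refreshed on every packet. */
    };

    /* Returns the number of evicted entries.  Because the walk stops at
     * the first non-expired entry, an idle-but-alive connection near the
     * head of the list shields every expired connection queued behind
     * it, which is the behavior reported above. */
    static size_t
    toy_sweep_stop_at_first_live(struct toy_conn **head, long long now)
    {
        size_t evicted = 0;

        while (*head && (*head)->expiration <= now) {
            *head = (*head)->next;  /* Unlink the expired entry. */
            evicted++;
        }
        return evicted;
    }
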
>>>> 
>>>> I agree with the cause of the issue, and about the MPSC suitability as
>>>> well unless they are used in different ways.
>>>> 
>>>> Speaking of connections blocking other connections, there's a potential
>>>> problem in the current code and it is related to zones and timeout
>>>> policy.
>>>> 
>>>> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9090,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9090),zone=1,timeout=0,protoinfo=(state=TIME_WAIT)
>>>> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9091,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9091),timeout=180,protoinfo=(state=TIME_WAIT)
>>>> 
>>>> In this case I set the timeout for tcp_end to 200 seconds for zone 0,
>>>> and to 20 seconds for zone 1.
>>>> The problem is less severe compared to the ct-scale series, but it
>>>> could (more or less significantly, depending on the way timeouts are
>>>> configured) increase the latency of a connection in the same list.
>>>
>>> Yep.  I noticed that issue too.  It is a bug in the current implementation.
>>> Expiration lists should not be global, but per-zone, AFAICT.
>>>
>>>> 
>>>>>
>>>>> v1.  Here the situation is better, since threads are kind of
>>>>> allowed to reschedule on their own, but via the RCU thread.  It
>>>>> looks like re-scheduled connections can still be messed up between
>>>>> each other, because the expiration can be updated multiple times,
>>>>> while the re-insert happens only once.  And also these
>>>>> remove+push_back
>>>> 
>>>> which also makes me wonder if there's a chance of losing a timeout
>>>> transition.
>>>> 
>>>>> operations for different threads will be additionally re-ordered by
>>>>> the RCU thread.  So, the order is not really preserved here either.
>>>>> In order to fix that, we need to replace the push_back operation
>>>>> with search+insert, i.e. find the place where the connection should
>>>>> be inserted.  And even that will not fully protect from concurrent
>>>>> updates of the expiration.  OTOH, a not fully sorted list may not be
>>>>> a huge problem, because connections are still moved between
>>>>> different expiration lists, so an ESTABLISHED connection can not
>>>>> block cleanup of a closed one.
>>>>> The RCU thread, though, can be a serious bottleneck in this concept.
>>>>> PMD threads enter the quiescent state only once per 1024 iterations
>>>>> or even more, so we can have 64K or more updates per thread per
>>>>> grace period.  So, expired or closed connections can stay around for
>>>>> a prolonged period of time.
>>>> 
>>>> That's a good point.
>>>> 
>>>>>
>>>>> So, in both v1 and v4 of Gaetan's series there are, probably, a lot
>>>>> of elements in the hash map, and the lookup is likely slow because
>>>>> of this.  Also, conn_key_cmp() is an extremely heavy operation, and
>>>>> we're performing it twice for every connection with the same hash.
>>>>> Paolo's series cheats here by quickly skipping or even cleaning up
>>>>> expired connections during the lookup.  Gaetan's series is not able
>>>>> to clean them up, but can still quickly skip all expired ones
>>>>> without executing two costly comparisons.  Hence the code snippet
>>>>> above.  Previously, the conn_expired check required the global lock,
>>>>> but it doesn't with any of the series applied.
>>>>>
>>>>>
>>>>> In conclusion, I don't think that we can efficiently use the MPSC
>>>>> queue directly.  At least, we need the ability to efficiently move
>>>>> connections between lists to avoid CLOSED connections being blocked
>>>>> by ESTABLISHED ones.  One option is to always look through the whole
>>>>> list and re-insert every connection.  That might not be a huge
>>>>> problem, though.  It was a problem when we had to hold the global
>>>>> mutex for that.  If we don't need to hold the global mutex, we may
>>>>> allow that.
>>>> 
>>>> Some thoughts/questions about it:
>>>> 
>>>> does this require going all the way through the list at once?
>>>> It's probably not a problem, but in case of heavy connection
>>>> creation/destruction, isn't there a chance we keep going through the
>>>> list for a long time?
>>>
>>> I'm not sure if that is a problem or not, but I don't really like that
>>> implementation either way.  Such extensive and awkward re-queueing will
>>> not look great.
>>>
>>>> 
>>>> For sure it is the most relevant case, but I wonder if this is only
>>>> applicable to the ESTABLISHED list.
>>>> I'm thinking of e.g. UDP_FIRST and UDP_MULTIPLE with custom timeouts.
>>>
>>> I don't think it's specific to the ESTABLISHED list.  It's just more
>>> pronounced on this one, because of the dramatic difference in timeout
>>> values.  In general, entries in a list should all have the same
>>> timeout, otherwise we may fall into some nasty corner cases.
>>>
>>
>> I know it may not be ideal, but using more rcu_lists (during
>> connection creation the conns get assigned round-robin to one of the
>> lists to keep them balanced), not associated with any timeout, and
>> scanning them all the way through at every reschedule time, without
>> reordering, checking only the expiration (atomically updated by the
>> PMDs), should solve the problems (even the one related to the zones).
>> This approach would add an extra cost to the sweeper, because it has
>> to scan all the connections every time, but push_{front,back} and
>> event handling also come with a non-negligible cost.
>>
>> More rcu_lists may be needed because this would allow the sweeper to
>> yield the CPU (once it has exceeded its budget) and restart from the
>> next list in line on the next round, if needed.  I guess we could also
>> reduce the creation/destruction problem (which leads the sweeper to a
>> long walk through the lists) by pushing every new connection to the
>> front.
>>
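
To make the idea above a bit more concrete, here is a minimal standalone
sketch under the same assumptions (invented names; the real code would use
rculist and proper RCU protection, which this sketch ignores):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define TOY_N_LISTS 32

    struct toy_rr_conn {
        struct toy_rr_conn *next;
        _Atomic long long expiration;  /* Updated atomically by the PMDs. */
    };

    static struct toy_rr_conn *toy_lists[TOY_N_LISTS];
    static unsigned int toy_next_list;  /* Where the next round resumes. */

    /* Walk the lists in order, evicting expired conns only, with no
     * reordering.  Returns 'true' if the budget ran out, in which case
     * the next round restarts from the next list in line. */
    static bool
    toy_sweep_round(long long now, size_t budget)
    {
        for (int i = 0; i < TOY_N_LISTS; i++) {
            struct toy_rr_conn **prev = &toy_lists[toy_next_list];

            toy_next_list = (toy_next_list + 1) % TOY_N_LISTS;

            while (*prev) {
                struct toy_rr_conn *conn = *prev;

                if (!budget--) {
                    return true;               /* Yield the CPU. */
                }
                if (atomic_load(&conn->expiration) <= now) {
                    *prev = conn->next;        /* Evict expired conn. */
                } else {
                    prev = &conn->next;
                }
            }
        }
        return false;
    }
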
>
> Ilya,
>
> Just to confirm that v5 [1] has been tested multiple times with the same
> scripts written and used by Robin, and it performs on average slightly
> better than v4 in both scenarios, so the numbers reported by Robin are
> applicable to v5 as well.
>
> The issues that affected v4 are no longer present and it passes all the
> conntrack-related unit tests.
>
> [1] https://patchwork.ozlabs.org/project/openvswitch/list/?series=307380

Just want to confirm that some of the ordering / performance issues
discussed aren't there any longer.  Were there any noticeable CPU %
changes, especially w.r.t. PMD handler thread usage?  I would expect
usage % to go up in both cases (since we won't be serialized around the
big ct lock any longer in some iterating cases).  But I didn't see
anything about it (maybe I missed it).

Additionally, the latest timeout policy change was quite a bit easier to
reason about and to review.  I think we can take some of the code
cleanup work from the buckets approach, but I guess the timeouts policy
approach might overall be easier to maintain for now, especially this
close to the branching point.

>>>> 
>>>> Otherwise,
>>>> 
>>>>> If the list itself is too long, we may just have normal lists and
>>>>> use an MPSC queue to enqueue "update" events for the ct_sweep
>>>>> thread, the same way as what is done for the RCU thread in v1.
>>>>> Could that be a possible solution?
>>>>>
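
Just to picture the shape of that, a very rough standalone sketch (invented
names; a real implementation would presumably reuse the lock-free mpsc-queue
from the series rather than the mutex used here to keep the sketch short):

    #include <pthread.h>
    #include <stdlib.h>

    struct toy_update_ev {
        struct toy_update_ev *next;
        void *conn;                  /* Connection whose timeout changed. */
        long long new_expiration;
    };

    static struct toy_update_ev *toy_ev_head;
    static pthread_mutex_t toy_ev_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Producer side, called from the PMD threads on a state change. */
    static void
    toy_update_enqueue(void *conn, long long new_expiration)
    {
        struct toy_update_ev *ev = malloc(sizeof *ev);

        if (!ev) {
            return;                  /* Sketch only: drop the event. */
        }
        ev->conn = conn;
        ev->new_expiration = new_expiration;
        pthread_mutex_lock(&toy_ev_lock);
        ev->next = toy_ev_head;
        toy_ev_head = ev;
        pthread_mutex_unlock(&toy_ev_lock);
    }

    /* Consumer side, called from ct_sweep: grab the whole pending batch
     * (in LIFO order here) so the sweeper can move each conn onto the
     * list matching its new timeout, then free the events. */
    static struct toy_update_ev *
    toy_update_drain(void)
    {
        struct toy_update_ev *batch;

        pthread_mutex_lock(&toy_ev_lock);
        batch = toy_ev_head;
        toy_ev_head = NULL;
        pthread_mutex_unlock(&toy_ev_lock);
        return batch;
    }
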
>>>> 
>>>> I guess this involves some reordering (any update could) inside the
>>>> same list.
>>>> 
>>>> In such a case, we have to handle both the events (and the
>>>> reordering, if we care about it) and the evictions at two separate
>>>> moments.
>>>
>>> I'm not sure if the strict order is required.  Relatively ordered
>>> list where all re-scheduled entries are always after non-rescheduled
>>> ones might be sufficient, if we're not looking for extreme accuracy.
>>> This should work fine, I guess, as long as we re-schedule everything
>>> that was updated within, let's say, a second, and as long as all
>>> connections in the same list have exactly the same timeout value.
>>>
>>>> 
>>>>> rculist has a chance to be a viable option; it would be great to
>>>>> look at performance numbers with the code snippet above.
>>>>>
>>>>> Paolo's series seems to not have any obvious issues, at least I
>>>>> didn't spot any on a quick glance.  At least, as long as the buckets
>>>>> stay balanced.  In some cases we may end up locking the same bucket
>>>>> from different threads frequently, though.
>>
>> That's right.  I suppose that this would affect the ability to sustain
>> a high rate of connection creation/destruction, as the locks are
>> acquired in that case.  The short-lived connection test result would
>> sink in that case, but that should lead to numbers comparable to the
>> other series, or slightly worse (slightly worse is what I see
>> recompiling with one bucket only).  A significant imbalance (like a
>> significant number of connections in very few buckets) might be
>> detrimental for the sweeper, though (which would run for a long time
>> in that case).
>>
>> Long-lived or a bulk transfer with a single connection should not be
>> affected.
>>
>>>>> Do you have the original and reply traffic being handled by the
>>>>> same or different PMD threads in your performance tests?  Also, a
>>>>> lot of different connections is only one side of the performance.
>>>>> Having a single traffic-heavy connection where the original and
>>>>> reply directions are handled by 2 different PMD threads would be an
>>>>> important test as well.  I think it is even closer to the real-world
>>>>> cases.
>>>>>
>>>> 
>>>> Will be back on this and the v1/v4 + your snippet (which seems to be
>>>> potentially a measurable improvement) once some numbers are available.
>>>>
>>
>> Back on this and the single flow.
>>
>> We tested it and the numbers are not comparable with the previous
>> ones, as we tested on different hardware (and a slightly different
>> setup), but the improvement we measured is around 5% on v4 and around
>> 4% on v1 for short-lived connections.
>>
>> There is no measurable improvement for long-lived connections, which
>> makes sense as that test deals with a much lower CPS.
>>
>> Regarding the single connection, for the time being at least, the test
>> was performed with a TCP connection whose directions are handled by
>> different PMDs (same goes for the previous tests), and the performance
>> looks comparable.  Gaetan's v4 performs slightly better:
>>
>>
>> | series      | MSS 1400       | MSS 100        |
>> |-------------+----------------+----------------|
>> | ct-scale-v4 | 12.7 Gb/sec    | 1.60 Gb/sec    |
>> | ct-buckets  | 12.6 Gb/sec    | 1.54 Gb/sec    |
>> | baseline    | 11.9 Gb/sec    | 1.14 Gb/sec    |
>>
>> These numbers could be further confirmed using a higher-performance
>> generator (if needed), as the one used was not very powerful, but the
>> expectation is that they would hold.
>>
>>>>> Thoughts?
>>>>>
>>>>> Best regards, Ilya Maximets.
>>>> 
>

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
