Aaron Conole <acon...@redhat.com> writes:

> Paolo Valerio <pvale...@redhat.com> writes:
>
>> Paolo Valerio <pvale...@redhat.com> writes:
>>
>>> Ilya Maximets <i.maxim...@ovn.org> writes:
>>>
>>>> On 6/20/22 23:57, Paolo Valerio wrote:
>>>>> Ilya Maximets <i.maxim...@ovn.org> writes:
>>>>> 
>>>>>> On 6/7/22 11:39, Robin Jarry wrote:
>>>>>>> Paolo Valerio, Jun 05, 2022 at 19:37:
>>>>>>>> Just a note that may be useful.
>>>>>>>> After some tests, I noticed that, after establishing e.g. two TCP
>>>>>>>> connections and leaving the first one idle after the 3-way handshake,
>>>>>>>> once the second connection expires (after moving to TIME_WAIT as a
>>>>>>>> result of termination), it doesn't get evicted until some event gets
>>>>>>>> scheduled for the first.
>>>>>>>>
>>>>>>>> ovs-appctl dpctl/dump-conntrack -s
>>>>>>>> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9090,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9090),zone=1,timeout=84576,protoinfo=(state=ESTABLISHED)
>>>>>>>> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9091,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9091),zone=1,timeout=0,protoinfo=(state=TIME_WAIT)
>>>>>>>>
>>>>>>>> This may be somewhat related to your results: during the test, the
>>>>>>>> number of connections may reach the limit, apparently reducing the
>>>>>>>> performance.
>>>>>>>
>>>>>>> Indeed, there was an issue in my test procedure. Due to the way T-Rex
>>>>>>> generates connections, it is easy to fill the conntrack table after
>>>>>>> a few iterations, making the test results inconsistent.
>>>>>>>
>>>>>>> Also, the flows which I had configured were not correct. There was an
>>>>>>> extraneous action=NORMAL flow at the end. When the conntrack table is
>>>>>>> full and a new packet cannot be tracked, it is marked as +trk+inv and
>>>>>>> not dropped. This behaviour is specific to the userspace datapath. The
>>>>>>> linux kernel datapath seems to drop the packet when it cannot be added
>>>>>>> to connection tracking.
>>>>>>>
>>>>>>> Gaëtan's series (v4) seems less resilient to the conntrack table being
>>>>>>> full, especially when there is more than one PMD core.
>>>>>>>
>>>>>>> I have changed the t-rex script to allow running arbitrary commands in
>>>>>>> between traffic iterations. This is leveraged to flush the conntrack
>>>>>>> table and run each iteration in the same conditions.
>>>>>>>
>>>>>>> https://github.com/cisco-system-traffic-generator/trex-core/blob/v2.98/scripts/cps_ndr.py
>>>>>>>
>>>>>>> To avoid filling the conntrack table, the max size was increased to 50M.
>>>>>>> The DUT configuration can be summarized as follows:
>>>>>>>
>>>>>>> ovs-vsctl set open_vswitch . other_config:dpdk-init=true
>>>>>>> ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x15554"
>>>>>>> ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
>>>>>>> ovs-vsctl add-port br0 pf0 -- set Interface pf0 type=dpdk \
>>>>>>>     options:dpdk-devargs=0000:3b:00.0 options:n_rxq=4 options:n_rxq_desc=4096
>>>>>>> ovs-vsctl add-port br0 pf1 -- set Interface pf1 type=dpdk \
>>>>>>>     options:dpdk-devargs=0000:3b:00.1 options:n_rxq=4 options:n_rxq_desc=4096
>>>>>>> ovs-appctl dpctl/ct-set-maxconns 50000000
>>>>>>> ovs-ofctl add-flow br0 "table=0,priority=10,ip,ct_state=-trk,actions=ct(table=0)"
>>>>>>> ovs-ofctl add-flow br0 "table=0,priority=10,ip,ct_state=+trk+new,actions=ct(commit),NORMAL"
>>>>>>> ovs-ofctl add-flow br0 "table=0,priority=10,ip,ct_state=+trk+est,actions=NORMAL"
>>>>>>> ovs-ofctl add-flow br0 "table=0,priority=0,actions=drop"
>>>>>>>
>>>>>>> Short Lived Connections
>>>>>>> -----------------------
>>>>>>>
>>>>>>> ./cps_ndr.py --sample-time 10 --max-iterations 8 --error-threshold 0.01 \
>>>>>>>     --reset-command "ssh $dut ovs-appctl dpctl/flush-conntrack" \
>>>>>>>     --udp-percent 1 --num-messages 1 --message-size 20 --server-wait 0 \
>>>>>>>     --min-cps 10k --max-cps 600k
>>>>>>>
>>>>>>> ============== =============== ============== ============== ===========
>>>>>>> Series         Num. Flows      CPS            PPS            BPS
>>>>>>> ============== =============== ============== ============== ===========
>>>>>>> Baseline       40.1K           79.3K/s        556Kp/s        347Mb/s
>>>>>>> Gaetan v1      60.5K           121K/s         837Kp/s        522Mb/s
>>>>>>> Gaetan v4      61.4K           122K/s         850Kp/s        530Mb/s
>>>>>>> Paolo          377K            756K/s         5.3Mp/s        3.3Gb/s
>>>>>>> ============== =============== ============== ============== ===========
>>>>>>>
>>>>>>> Even after fixing the test procedure, Paolo's series still performs
>>>>>>> a lot better with short lived connections.
>>>>>>>
>>>>>>> Long Lived Connections
>>>>>>> ----------------------
>>>>>>>
>>>>>>> ./cps_ndr.py --sample-time 30 --max-iterations 8 --error-threshold 0.01 \
>>>>>>>     --reset-command "ssh $dut ovs-appctl dpctl/flush-conntrack" \
>>>>>>>     --udp-percent 1 --num-messages 500 --message-size 20 --server-wait 50 \
>>>>>>>     --min-cps 100 --max-cps 10k
>>>>>>>
>>>>>>> ============== =============== ============== ============== ===========
>>>>>>> Series         Num. Flows      CPS            PPS            BPS
>>>>>>> ============== =============== ============== ============== ===========
>>>>>>> Baseline       17.4K           504/s          633Kp/s        422Mb/s
>>>>>>> Gaetan v1      80.4K           3.1K/s         4.6Mp/s        3.0Gb/s
>>>>>>> Gaetan v4      139K            5.4K/s         8.2Mp/s        5.4Gb/s
>>>>>>> Paolo          132K            5.2K/s         7.7Mp/s        5.2Gb/s
>>>>>>> ============== =============== ============== ============== ===========
>>>>>>>
>>>>>>> Thanks to Paolo for his help on this second round of tests.
>>>>>>
>>>>>> Hi, Robin and Paolo.
>>>>>>
>>>>> 
>>>>> Hello Ilya,
>>>>> 
>>>>> thanks for the feedback about this.
>>>>> 
>>>>>> Could you try the following change on top of Gaetan's v1 and v4:
>>>>>>
>>>>>> diff --git a/lib/conntrack.c b/lib/conntrack.c
>>>>>> index aa4792f69..2c2203a4b 100644
>>>>>> --- a/lib/conntrack.c
>>>>>> +++ b/lib/conntrack.c
>>>>>> @@ -633,14 +633,17 @@ conn_key_lookup(struct conntrack *ct, const struct conn_key *key,
>>>>>>      bool found = false;
>>>>>>  
>>>>>>      CMAP_FOR_EACH_WITH_HASH (conn, cm_node, hash, &ct->conns) {
>>>>>> -        if (!conn_key_cmp(&conn->key, key) && !conn_expired(conn, now)) {
>>>>>> +        if (conn_expired(conn, now)) {
>>>>>> +            continue;
>>>>>> +        }
>>>>>> +        if (!conn_key_cmp(&conn->key, key)) {
>>>>>>              found = true;
>>>>>>              if (reply) {
>>>>>>                  *reply = false;
>>>>>>              }
>>>>>>              break;
>>>>>>          }
>>>>>> -        if (!conn_key_cmp(&conn->rev_key, key) && !conn_expired(conn, now)) {
>>>>>> +        if (!conn_key_cmp(&conn->rev_key, key)) {
>>>>>>              found = true;
>>>>>>              if (reply) {
>>>>>>                  *reply = true;
>>>>>> ---
>>>>>>
>>>>>> ?
>>>>>>
>>>>>> Let me explain.  As far as I can tell without actually running
>>>>>> performance tests myself, the main issue with both variants of
>>>>>> Gaetan's series is a slow cleanup of expired connections.
>>>>>>
>>>>>> v4, IMO, contains a serious bug, because the expiration list is
>>>>>> treated at the sweep phase as if it were sorted by expiration time.
>>>>>> But that is not really true, because connections cannot be
>>>>>> re-ordered.  They stay in the exact same order as they were created,
>>>>>> but expiration times get updated when packets are received.  However,
>>>>>> the sweep process stops the moment the first non-expired connection
>>>>>> is seen, so we're not cleaning up later connections, as reported by
>>>>>> Paolo.  Taking into account that the default timeout policy for
>>>>>> established TCP connections appears to be 24 hours (or did I misread
>>>>>> this?), closed connections can stay uncleaned for that long.  That
>>>>>> seems like way too much.  The point is, the whole queue should be
>>>>>> inspected on every sweep iteration in order to re-sort it and
>>>>>> actually move connections between different expiration lists, to
>>>>>> avoid silent established connections blocking closed ones.  This
>>>>>> will probably be slow (?).  An MPSC queue doesn't seem to be very
>>>>>> suitable for such a use case.
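
As a side note for anyone skimming the thread, the eviction issue
described above boils down to something like the toy example below.  It
is a purely illustrative, standalone C sketch with made-up names
(fake_conn, sweep_with_early_break, ...), not the actual ct-scale code:

/* Creation-ordered expiration list swept with an early break: the second
 * entry is already expired, but the sweep never reaches it because the
 * first (idle, long-timeout) entry is still valid. */
#include <stdbool.h>
#include <stdio.h>

struct fake_conn {
    const char *name;
    long long expiration;       /* Absolute expiration time, in seconds. */
    struct fake_conn *next;     /* Creation order, oldest first. */
};

static bool
fake_conn_expired(const struct fake_conn *conn, long long now)
{
    return now >= conn->expiration;
}

/* Sweep as if the list were sorted by expiration: stop at the first
 * non-expired entry.  This is the problematic pattern. */
static int
sweep_with_early_break(struct fake_conn *head, long long now)
{
    int evicted = 0;

    for (struct fake_conn *conn = head; conn; conn = conn->next) {
        if (!fake_conn_expired(conn, now)) {
            break;              /* Everything after this stays untouched. */
        }
        printf("evicting %s\n", conn->name);
        evicted++;
    }
    return evicted;
}

int
main(void)
{
    long long now = 1000;

    /* First connection: established, idle, ~24h timeout. */
    struct fake_conn established = {
        .name = "established (idle)", .expiration = now + 24 * 3600,
    };
    /* Second connection: closed, TIME_WAIT timeout already hit. */
    struct fake_conn time_wait = {
        .name = "time_wait (expired)", .expiration = now - 1,
    };
    established.next = &time_wait;

    int evicted = sweep_with_early_break(&established, now);
    printf("evicted %d connection(s); the TIME_WAIT entry is stuck until "
           "the established one times out or sees traffic\n", evicted);
    return 0;
}

Running it never evicts the expired TIME_WAIT entry, which matches what I
observed with the two-connections test quoted at the top of the thread.
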
>>>>> 
>>>>> I agree on the cause of the issue, and about the MPSC suitability as
>>>>> well, unless they are used in different ways.
>>>>> 
>>>>> Speaking of connections blocking other connections, there's a potential
>>>>> problem in the current code and it is related to zones and timeout
>>>>> policy.
>>>>> 
>>>>> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9090,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9090),zone=1,timeout=0,protoinfo=(state=TIME_WAIT)
>>>>> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9091,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9091),timeout=180,protoinfo=(state=TIME_WAIT)
>>>>> 
>>>>> In this case I set the tcp_end timeout to 200 seconds for zone 0, and
>>>>> to 20 seconds for zone 1.
>>>>> The problem is less severe compared to the ct-scale series, but it
>>>>> could (more or less significantly, depending on the way timeouts are
>>>>> configured) increase the eviction latency of a connection in the same
>>>>> list.
>>>>
>>>> Yep.  I noticed that issue too.  It is a bug in the current implementation.
>>>> Expiration lists should not be global, but per-zone, AFAICT.
>>>>
>>>>> 
>>>>>>
>>>>>> v1.  Here the situation is better since threads are kind of allowed
>>>>>> to reschedule on their own, but via the RCU thread.  It looks like
>>>>>> re-scheduled connections can be messed up between each other still
>>>>>> though, because expiration can be updated multiple times, while
>>>>>> re-insert is happening once.  And also these remove+push_back
>>>>> 
>>>>> which also makes me wonder if there's a chance of losing a timeout
>>>>> transition.
>>>>> 
>>>>>> operations for different threads will be additionally re-ordered by
>>>>>> the RCU thread.  So, the order is not really preserved here either.
>>>>>> In order to fix that we need to replace the push_back operation with
>>>>>> search+insert, i.e. find the place where the connection should be
>>>>>> inserted.  And even that will not fully protect from concurrent
>>>>>> updates of the expiration.  OTOH, a not fully sorted list may not be
>>>>>> a huge problem, because connections are still moved between different
>>>>>> expiration lists, so an ESTABLISHED connection cannot block clean up
>>>>>> of a closed one.
>>>>>> The RCU thread though can be a serious bottleneck in this concept.
>>>>>> PMD threads are entering quiescent state only once per 1024
>>>>>> iterations or even more, so we can have 64K or more updates per
>>>>>> thread per grace period.  So, expired or closed connections can stay
>>>>>> for a prolonged period of time.
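
Just to be sure I read the search+insert idea correctly, it is
essentially the sketch below.  Made-up names and a plain singly-linked
list instead of the real rculist primitives, and none of the concurrency
handling the real code would need:

#include <stdio.h>

struct sketch_conn {
    long long expiration;
    struct sketch_conn *next;
};

/* Instead of push_back on reschedule, walk the list to find the first
 * entry expiring later and insert before it, so the list stays (roughly)
 * sorted by expiration. */
static void
sorted_insert(struct sketch_conn **head, struct sketch_conn *conn)
{
    struct sketch_conn **p = head;

    while (*p && (*p)->expiration <= conn->expiration) {
        p = &(*p)->next;
    }
    conn->next = *p;
    *p = conn;
}

int
main(void)
{
    struct sketch_conn a = { .expiration = 30 };
    struct sketch_conn b = { .expiration = 10 };
    struct sketch_conn c = { .expiration = 20 };
    struct sketch_conn *head = NULL;

    sorted_insert(&head, &a);
    sorted_insert(&head, &b);
    sorted_insert(&head, &c);   /* Lands between b and a. */

    for (struct sketch_conn *conn = head; conn; conn = conn->next) {
        printf("expiration=%lld\n", conn->expiration);
    }
    return 0;
}

As noted above, even this does not fully protect against a concurrent
update of the expiration while the entry is being re-inserted.
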
>>>>> 
>>>>> That's a good point.
>>>>> 
>>>>>>
>>>>>> So, in both v1 and v4 of Gaetan's series there are, probably, a lot
>>>>>> of elements in the hash map, and lookup is likely slow because of
>>>>>> this.  Also, conn_key_cmp() is an extremely heavy operation, and
>>>>>> we're performing it twice for every connection with the same hash.
>>>>>> Paolo's series cheats here by quickly skipping or even cleaning up
>>>>>> expired connections during the lookup.  Gaetan's series is not able
>>>>>> to clean them, but can still quickly skip all expired ones without
>>>>>> executing the two costly comparisons.  Hence the code snippet above.
>>>>>> Previously, the conn_expired check required the global lock, but it
>>>>>> doesn't with either of the series applied.
>>>>>>
>>>>>>
>>>>>> In conclusion, I don't think that we can efficiently use an MPSC
>>>>>> queue directly.  At least, we need the ability to efficiently move
>>>>>> connections between lists to avoid CLOSED connections being blocked
>>>>>> by ESTABLISHED ones.  One option is to always look through the whole
>>>>>> list and re-insert every connection.  It might not be a huge problem
>>>>>> though.  It was a problem when we had to hold the global mutex for
>>>>>> that.  If we don't need to hold the global mutex, we may allow that.
>>>>> 
>>>>> Some thoughts/questions about it:
>>>>> 
>>>>> does this require going all the way through the list at once?
>>>>> Probably it's not a problem, but in case of many connection
>>>>> destructions/creations, isn't there a chance we keep going through
>>>>> the list for a long time?
>>>>
>>>> I'm not sure if that is a problem or not, but I don't really like that
>>>> implementation either way.  Such extensive and awkward re-queueing will
>>>> not look great.
>>>>
>>>>> 
>>>>> For sure it is the most relevant case, but I wonder if this is only
>>>>> applicable to the ESTABLISHED list.
>>>>> I'm thinking of e.g. UDP_FIRST and UDP_MULTIPLE with custom timeouts.
>>>>
>>>> I don't think it's specific to ESTABLISHED list.  It's just more
>>>> pronounced on this one, because of dramatic difference in timeout
>>>> values.  In general, entries in a list should all have the same
>>>> timeout, otherwise we may fall into some nasty corner cases.
>>>>
>>>
>>> I know it might not be ideal, but using more rcu_lists (with connections
>>> assigned to the lists in round robin at creation time to keep them
>>> balanced), not associated with any timeout, and scanning all of them,
>>> without reordering, at every reschedule, checking for expiration only
>>> (atomically updated by the PMDs), should solve the problems (even the
>>> one related to the zones).  This approach would add an extra cost to
>>> the sweeper, because it has to scan all the connections every time, but
>>> push_{front,back} and event handling also come with a non negligible
>>> cost.
>>>
>>> More rcu_lists may be needed because this would allow the sweeper to
>>> yield the CPU (once the budget is exceeded) and restart from the next
>>> list in line on the next round, if needed.  I guess we could also
>>> reduce the creation/destruction problem (which leads the sweeper to a
>>> long walk through the lists) by pushing every new connection to the
>>> front.
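
To make the above a bit more concrete, the loop I have in mind is
roughly the following.  Purely illustrative C with made-up names
(sk_conn, sk_sweep_round, ...), single-threaded insertion, and none of
the RCU machinery the real code would need:

#include <stdatomic.h>
#include <stddef.h>

#define N_SWEEP_LISTS 8

struct sk_conn {
    atomic_llong expiration;      /* Bumped by the PMDs on every packet. */
    struct sk_conn *next;
};

struct sk_list {
    struct sk_conn *head;         /* Would be an rculist in the real code. */
};

static struct sk_list sweep_lists[N_SWEEP_LISTS];
static atomic_uint next_list;     /* Round-robin assignment at creation. */
static size_t sweep_pos;          /* Where the sweeper resumes next round. */

static void
sk_conn_schedule(struct sk_conn *conn)
{
    unsigned int i = atomic_fetch_add(&next_list, 1) % N_SWEEP_LISTS;

    /* Single-threaded insertion assumed here; the real code would need
     * proper synchronization for the list manipulation. */
    conn->next = sweep_lists[i].head;
    sweep_lists[i].head = conn;
}

/* Scan up to 'budget' connections, evicting expired ones without any
 * reordering, then yield and resume from the next list next round. */
static void
sk_sweep_round(long long now, size_t budget)
{
    for (size_t n = 0; n < N_SWEEP_LISTS && budget; n++) {
        struct sk_conn **p = &sweep_lists[sweep_pos].head;

        while (*p && budget) {
            budget--;
            if (now >= atomic_load(&(*p)->expiration)) {
                *p = (*p)->next;    /* Evict; real code would defer the free. */
            } else {
                p = &(*p)->next;    /* Not expired: just skip it. */
            }
        }
        sweep_pos = (sweep_pos + 1) % N_SWEEP_LISTS;
    }
}

int
main(void)
{
    static struct sk_conn conns[3];
    long long now = 100;

    atomic_init(&conns[0].expiration, now + 3600);  /* Still valid. */
    atomic_init(&conns[1].expiration, now - 1);     /* Expired. */
    atomic_init(&conns[2].expiration, now - 5);     /* Expired. */
    for (size_t i = 0; i < 3; i++) {
        sk_conn_schedule(&conns[i]);
    }

    sk_sweep_round(now, 100);   /* Evicts the two expired connections. */
    return 0;
}

The point is that the per-round budget bounds the sweeper's work, the
lists are never reordered, and an idle ESTABLISHED entry can no longer
shield an expired one, since the scan does not stop at the first
non-expired entry.
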
>>>
>>
>> Ilya,
>>
>> Just to confirm that v5 [1] has been tested multiple times with the same
>> scripts written and used by Robin, and it performs on average slightly
>> better than v4 in both scenarios, so the numbers reported by Robin are
>> applicable to v5 as well.
>>
>> The issues that affected v4 are no longer present and it passes all the
>> conntrack related unit tests.
>>
>> [1] https://patchwork.ozlabs.org/project/openvswitch/list/?series=307380
>
> Just want to confirm that some of the ordering / performance issues
> discussed aren't there any longer.  Were there any noticeable CPU %
> changes, especially w.r.t. pmd handler thread usage?  I would expect
> usage % to go up in both cases (since we won't be serialized around the
> big ct lock any longer in some iterating cases).  But I didn't see
> anything about it (maybe I missed it).

Thank you Aaron.

You didn't miss it; they were not included, unfortunately.  I'd confirm
your expectation, as we previously spent a significant amount of time on
the ct_lock, especially considering it was needed for every connection
update (other than insertion and sweeping).  This is what came up from
the previous analysis.

>
> Additionally, the latest timeout policy change was quite a bit easier to
> reason about and to review.  I think we can take some of the code
> cleanup work from the buckets approach, but I guess the timeouts policy
> approach might overall be easier to maintain for now, especially this
> close to the branching point.
>

I agree with this latest change [1], exactly for the reasons you mention.

[1] https://patchwork.ozlabs.org/project/openvswitch/list/?series=307380

>>>>> 
>>>>> Otherwise,
>>>>> 
>>>>>> If the list itself is too long, we may just have normal lists and
>>>>>> use an MPSC queue to enqueue "update" events for the ct_sweep thread,
>>>>>> the same way as is done for the RCU thread in v1.  Could that be a
>>>>>> possible solution?
>>>>>>
>>>>> 
>>>>> I guess this involves some reordering (any update could) inside the
>>>>> same list.
>>>>>
>>>>> In such a case we have to handle both the events (and the reordering,
>>>>> if we care) and the evictions in two separate phases.
>>>>
>>>> I'm not sure if the strict order is required.  A relatively ordered
>>>> list where all re-scheduled entries are always after non-rescheduled
>>>> ones might be sufficient, if we're not looking for extreme accuracy.
>>>> This should work fine, I guess, as long as we re-schedule everything
>>>> that was updated within, let's say, a second, and as long as all
>>>> connections in the same list have exactly the same timeout value.
>>>>
>>>>> 
>>>>>> rculist has a chance to be a viable option, would be great to look
>>>>>> at performance numbers with the code snippet above.
>>>>>>
>>>>>> Paolo's series seems to not have any obvious issues, at least I
>>>>>> didn't spot any on a quick glance.  At least, until buckets are
>>>>>> balanced.  In some cases we may end up locking the same bucket from
>>>>>> different threads frequently though.
>>>
>>> that's right. I suppose this would affect the ability to sustain a high
>>> rate of connection creation/destruction, as the locks are acquired in
>>> that case.  The short-lived connection test results would sink, but
>>> that should lead to numbers comparable to the other series, or slightly
>>> worse (slightly worse is what I see recompiling with one bucket only).
>>> A significant imbalance (like a significant number of connections in
>>> very few buckets) might be detrimental for the sweeper, though (which
>>> would run for a long time in that case).
>>>
>>> Long-lived or a bulk transfer with a single connection should not be
>>> affected.
>>>
>>>>>> Do you have the original and reply traffic being handled by the same
>>>>>> or different PMD threads in your performance tests?  Also, a lot of
>>>>>> different connections is only one side of the performance.  Having a
>>>>>> single traffic-heavy connection where original and reply directions
>>>>>> are handled by 2 different PMD threads would be an important test as
>>>>>> well.  I think it is even closer to the real world cases.
>>>>>>
>>>>> 
>>>>> Will be back on this and the v1/v4 + your snippet (which seems to be
>>>>> potentially a measurable improvement) once some numbers are available.
>>>>>
>>>
>>> Back on this and the single flow.
>>>
>>> We tested it, and the numbers are not comparable with the previous
>>> ones, as we tested it on different hardware (and a slightly different
>>> setup), but the improvement we measured is around 5% on v4 and around
>>> 4% on v1 for short-lived connections.
>>>
>>> No measurable improvement for long-lived connections, which makes sense
>>> as that test deals with a much lower cps.
>>>
>>> Regarding the single connection, for the time being at least the test
>>> was performed with a TCP connection handled by different PMDs (same
>>> goes for the previous tests), and the performance looks comparable.
>>> Gaetan's v4 performs slightly better:
>>>
>>>
>>> | series      | MSS 1400       | MSS 100        |
>>> |-------------+----------------+----------------|
>>> | ct-scale-v4 | 12.7 Gb/sec    | 1.60 Gb/sec    |
>>> | ct-buckets  | 12.6 Gb/sec    | 1.54 Gb/sec    |
>>> | baseline    | 11.9 Gb/sec    | 1.14 Gb/sec    |
>>>
>>> These numbers could be further confirmed using a more capable generator
>>> (if needed), as the one used was not particularly powerful, but the
>>> expectation is that they would hold.
>>>
>>>>>> Thoughts?
>>>>>>
>>>>>> Best regards, Ilya Maximets.
>>>>> 
>>

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
