Aaron Conole <acon...@redhat.com> writes: > Paolo Valerio <pvale...@redhat.com> writes: > >> Paolo Valerio <pvale...@redhat.com> writes: >> >>> Ilya Maximets <i.maxim...@ovn.org> writes: >>> >>>> On 6/20/22 23:57, Paolo Valerio wrote: >>>>> Ilya Maximets <i.maxim...@ovn.org> writes: >>>>> >>>>>> On 6/7/22 11:39, Robin Jarry wrote: >>>>>>> Paolo Valerio, Jun 05, 2022 at 19:37: >>>>>>>> Just a note that may be useful. >>>>>>>> After some tests, I noticed that when establishing e.g. two TCP connections >>>>>>>> and leaving the first one idle after 3whs, once the second connection >>>>>>>> expires (after moving to TIME_WAIT as a result of termination), it >>>>>>>> doesn't get evicted until any event gets scheduled for the >>>>>>>> first. >>>>>>>> >>>>>>>> ovs-appctl dpctl/dump-conntrack -s >>>>>>>> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9090,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9090),zone=1,timeout=84576,protoinfo=(state=ESTABLISHED) >>>>>>>> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9091,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9091),zone=1,timeout=0,protoinfo=(state=TIME_WAIT) >>>>>>>> >>>>>>>> This may be somewhat related to your results, as during the >>>>>>>> test the number of connections may reach the limit, apparently >>>>>>>> reducing performance. >>>>>>> >>>>>>> Indeed, there was an issue in my test procedure. Due to the way T-Rex >>>>>>> generates connections, it is easy to fill the conntrack table after >>>>>>> a few iterations, making the test results inconsistent. >>>>>>> >>>>>>> Also, the flows which I had configured were not correct. There was an >>>>>>> extraneous action=NORMAL flow at the end. When the conntrack table is >>>>>>> full and a new packet cannot be tracked, it is marked as +trk+inv and >>>>>>> not dropped. This behaviour is specific to the userspace datapath. The >>>>>>> Linux kernel datapath seems to drop the packet when it cannot be added >>>>>>> to connection tracking. >>>>>>> >>>>>>> Gaëtan's series (v4) seems less resilient to the conntrack table being >>>>>>> full, especially when there is more than one PMD core. >>>>>>> >>>>>>> I have changed the t-rex script to allow running arbitrary commands in >>>>>>> between traffic iterations. This is leveraged to flush the conntrack >>>>>>> table and run each iteration in the same conditions. >>>>>>> >>>>>>> https://github.com/cisco-system-traffic-generator/trex-core/blob/v2.98/scripts/cps_ndr.py >>>>>>> >>>>>>> To avoid filling the conntrack table, the max size was increased to 50M. >>>>>>> The DUT configuration can be summarized as follows: >>>>>>> >>>>>>> ovs-vsctl set open_vswitch . other_config:dpdk-init=true >>>>>>> ovs-vsctl set open_vswitch .
other_config:pmd-cpu-mask="0x15554" >>>>>>> ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev >>>>>>> ovs-vsctl add-port br0 pf0 -- set Interface pf0 type=dpdk \ >>>>>>> options:dpdk-devargs=0000:3b:00.0 options:n_rxq=4 >>>>>>> options:n_rxq_desc=4096 >>>>>>> ovs-vsctl add-port br0 pf1 -- set Interface pf1 type=dpdk \ >>>>>>> options:dpdk-devargs=0000:3b:00.1 options:n_rxq=4 >>>>>>> options:n_rxq_desc=4096 >>>>>>> ovs-appctl dpctl/ct-set-maxconns 50000000 >>>>>>> ovs-ofctl add-flow br0 >>>>>>> "table=0,priority=10,ip,ct_state=-trk,actions=ct(table=0)" >>>>>>> ovs-ofctl add-flow br0 >>>>>>> "table=0,priority=10,ip,ct_state=+trk+new,actions=ct(commit),NORMAL" >>>>>>> ovs-ofctl add-flow br0 >>>>>>> "table=0,priority=10,ip,ct_state=+trk+est,actions=NORMAL" >>>>>>> ovs-ofctl add-flow br0 "table=0,priority=0,actions=drop" >>>>>>> >>>>>>> Short Lived Connections >>>>>>> ----------------------- >>>>>>> >>>>>>> ./cps_ndr.py --sample-time 10 --max-iterations 8 --error-threshold 0.01 >>>>>>> \ >>>>>>> --reset-command "ssh $dut ovs-appctl dpctl/flush-conntrack" \ >>>>>>> --udp-percent 1 --num-messages 1 --message-size 20 --server-wait 0 \ >>>>>>> --min-cps 10k --max-cps 600k >>>>>>> >>>>>>> ============== =============== ============== ============== =========== >>>>>>> Series Num. Flows CPS PPS BPS >>>>>>> ============== =============== ============== ============== =========== >>>>>>> Baseline 40.1K 79.3K/s 556Kp/s 347Mb/s >>>>>>> Gaetan v1 60.5K 121K/s 837Kp/s 522Mb/s >>>>>>> Gaetan v4 61.4K 122K/s 850Kp/s 530Mb/s >>>>>>> Paolo 377K 756K/s 5.3Mp/s 3.3Gb/s >>>>>>> ============== =============== ============== ============== =========== >>>>>>> >>>>>>> Even after fixing the test procedure, Paolo's series still performs >>>>>>> a lot better with short lived connections. >>>>>>> >>>>>>> Long Lived Connections >>>>>>> ---------------------- >>>>>>> >>>>>>> ./cps_ndr.py --sample-time 30 --max-iterations 8 --error-threshold 0.01 >>>>>>> \ >>>>>>> --reset-command "ssh $dut ovs-appctl dpctl/flush-conntrack" \ >>>>>>> --udp-percent 1 --num-messages 500 --message-size 20 --server-wait >>>>>>> 50 \ >>>>>>> --min-cps 100 --max-cps 10k >>>>>>> >>>>>>> ============== =============== ============== ============== =========== >>>>>>> Series Num. Flows CPS PPS BPS >>>>>>> ============== =============== ============== ============== =========== >>>>>>> Baseline 17.4K 504/s 633Kp/s 422Mb/s >>>>>>> Gaetan v1 80.4K 3.1K/s 4.6Mp/s 3.0Gb/s >>>>>>> Gaetan v4 139K 5.4K/s 8.2Mp/s 5.4Gb/s >>>>>>> Paolo 132K 5.2K/s 7.7Mp/s 5.2Gb/s >>>>>>> ============== =============== ============== ============== =========== >>>>>>> >>>>>>> Thanks to Paolo for his help on this second round of tests. >>>>>> >>>>>> Hi, Robin and Paolo. >>>>>> >>>>> >>>>> Hello Ilya, >>>>> >>>>> thanks for the feedback about this. 
>>>>> >>>>>> Could you try the following change on top of Gaetan's v1 and v4: >>>>>> >>>>>> diff --git a/lib/conntrack.c b/lib/conntrack.c >>>>>> index aa4792f69..2c2203a4b 100644 >>>>>> --- a/lib/conntrack.c >>>>>> +++ b/lib/conntrack.c >>>>>> @@ -633,14 +633,17 @@ conn_key_lookup(struct conntrack *ct, const struct >>>>>> conn_key *key, >>>>>> bool found = false; >>>>>> >>>>>> CMAP_FOR_EACH_WITH_HASH (conn, cm_node, hash, &ct->conns) { >>>>>> - if (!conn_key_cmp(&conn->key, key) && !conn_expired(conn, now)) >>>>>> { >>>>>> + if (conn_expired(conn, now)) { >>>>>> + continue; >>>>>> + } >>>>>> + if (!conn_key_cmp(&conn->key, key)) { >>>>>> found = true; >>>>>> if (reply) { >>>>>> *reply = false; >>>>>> } >>>>>> break; >>>>>> } >>>>>> - if (!conn_key_cmp(&conn->rev_key, key) && !conn_expired(conn, >>>>>> now)) { >>>>>> + if (!conn_key_cmp(&conn->rev_key, key)) { >>>>>> found = true; >>>>>> if (reply) { >>>>>> *reply = true; >>>>>> --- >>>>>> >>>>>> ? >>>>>> >>>>>> Let me explain. As far as I can tell without actually running >>>>>> performance >>>>>> tests myself, the main issue with both variants of Gaetan's series is a >>>>>> slow >>>>>> cleanup of expired connections. >>>>>> >>>>>> v4, IMO, contains a serious bug, because the expiration list is treated >>>>>> at the >>>>>> sweep phase as if it was sorted by the expiration time. But that is not >>>>>> really >>>>>> true, because connections can not be re-ordered. They seem to stay in >>>>>> the exact >>>>>> same order as they were created, but expiration times are getting >>>>>> updated when >>>>>> packets are received. However, the sweep process stops at the moment >>>>>> the first >>>>>> non-expired connection is seen and we're not cleaning up later >>>>>> connections >>>>>> as reported by Paolo. Taking into account the fact that default timeout >>>>>> policy >>>>>> for established TCP connections appears to be 24 hours (or did I misread >>>>>> this?), >>>>>> closed connections can stay not cleaned up for that time. That seems >>>>>> like >>>>>> way too much. The point is, the whole queue should be inspected on >>>>>> every sweep >>>>>> iteration in order to re-sort it and actually move connections between >>>>>> different >>>>>> expiration lists to avoid silent established connections blocking closed >>>>>> ones. >>>>>> This will probably be slow (?). MPSC queue doesn't seem to be very >>>>>> suitable for >>>>>> such usecase. >>>>> >>>>> I agree with the cause of the issue, and about the MPSC suitability as >>>>> well unless they are used in different ways. >>>>> >>>>> Speaking of connections blocking other connections, there's a potential >>>>> problem in the current code and it is related to zones and timeout >>>>> policy. >>>>> >>>>> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9090,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9090),zone=1,timeout=0,protoinfo=(state=TIME_WAIT) >>>>> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9091,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9091),timeout=180,protoinfo=(state=TIME_WAIT) >>>>> >>>>> In this case I set the timeout for tcp_end to 200 for zone 0, and 20 >>>>> seconds to zone 1. >>>>> The problem is less severe compared to the ct-scale series, but could >>>>> (more or less significantly, depends on the way timeouts are configured) >>>>> increase the latency of a connection in the same list. >>>> >>>> Yep. I noticed that issue too. It is a bug in the current implementation. >>>> Expiration lists should not be global, but per-zone, AFAICT. 
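A quick aside to make the "per-zone expiration lists" point above a bit more concrete for anyone following the thread. This is only a minimal, hypothetical sketch -- made-up types and names, not the actual OVS conntrack structures -- of keeping one set of timeout lists per (zone, timeout class), so that a short tcp_end timeout configured for one zone cannot be held back by a longer policy in another zone:

/* Illustrative only: hypothetical types, not the real conntrack ones. */
#include <stdint.h>
#include <stddef.h>

#define N_EXP_CLASSES 8          /* e.g. TCP_EST, TCP_END, UDP_FIRST, ... */

struct exp_node {
    struct exp_node *next;
    long long expiration;        /* absolute expiry time, in ms */
};

struct zone_exp_lists {
    uint16_t zone;
    struct exp_node *head[N_EXP_CLASSES];
    struct exp_node *tail[N_EXP_CLASSES];
    struct zone_exp_lists *next; /* chaining for a per-zone lookup table */
};

/* Sweep one (zone, class) list: since every entry in it shares the same
 * zone and timeout value, the first non-expired entry really does end
 * the expired prefix, and we can stop there. */
struct exp_node *
sweep_zone_class(struct zone_exp_lists *z, int cls, long long now)
{
    struct exp_node *evicted = NULL;

    while (z->head[cls] && z->head[cls]->expiration <= now) {
        struct exp_node *node = z->head[cls];

        z->head[cls] = node->next;
        node->next = evicted;
        evicted = node;
    }
    if (!z->head[cls]) {
        z->tail[cls] = NULL;
    }
    return evicted;              /* caller unlinks these from the cmap */
}

With every list homogeneous in both zone and timeout class, the "first non-expired entry ends the sweep" assumption holds again, which is exactly the property the global lists lose once per-zone timeout policies differ.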
>>>> >>>>> >>>>>> >>>>>> v1. Here the situation is better since threads are kind of allowed to >>>>>> reschedule >>>>>> on their own, but via RCU thread. It looks like re-scheduled >>>>>> connections can be >>>>>> messed up between each other still though, because expiration can be >>>>>> updated >>>>>> multiple times, while re-insert is happening once. And also these >>>>>> remove+push_back >>>>> >>>>> which also makes me wonder if there's a chance of losing a timeout >>>>> transition. >>>>> >>>>>> operations for different threads will be additionally re-ordered by the >>>>>> RCU >>>>>> thread. So, the order is not really preserved here as well. In order >>>>>> to fix >>>>>> that we need to replace push_back operation with the search+insert, i.e. >>>>>> find the >>>>>> place where the connection should be inserted. And even that will not >>>>>> fully >>>>>> protect from the concurrent update of the expiration. OTOH, not fully >>>>>> sorted >>>>>> list may be not a huge problem, because connections are still moved >>>>>> between >>>>>> different expiration lists, so ESTABLISHED connection can not block >>>>>> clean up of >>>>>> the closed one. >>>>>> The RCU thread though can be a serious bottleneck in this concept. PMD >>>>>> threads >>>>>> are entering quiescent state only once per 1024 iterations or even more, >>>>>> so >>>>>> we can have 64K or more updates per thread per grace period. So, >>>>>> expired or >>>>>> closed connections can stay for a prolonged period of time. >>>>> >>>>> That's a good point. >>>>> >>>>>> >>>>>> So, in both v1 and v4 of Gaetan's series there are, probably, a lot of >>>>>> elements >>>>>> in a hash map and lookup is likely slow because of this. Also, >>>>>> conn_key_cmp() >>>>>> is an extremely heavy operation, and we're performing it twice for every >>>>>> connection >>>>>> with the same hash. Paolo's series cheats here by quickly skipping or >>>>>> even >>>>>> cleaning up expired connection during the lookup. Gaetan's series' are >>>>>> not able >>>>>> to clean them, but can still quickly skip all expired ones without >>>>>> executing >>>>>> two costly comparisons. Hence the code snippet above. Previously, >>>>>> conn_expired >>>>>> check required the global lock, but it doesn't with any of the series >>>>>> applied. >>>>>> >>>>>> >>>>>> In conclusion, I don't think that we can efficiently use MPSC queue >>>>>> directly. >>>>>> At least, we need an ability to efficiently move connections between >>>>>> lists to >>>>>> avoid CLOSED connections being blocked by ESTABLISHED ones. One option >>>>>> is >>>>>> to always look through the whole list and re-insert every connection. >>>>>> It might >>>>>> be not a huge problem though. It was a problem when we had to hold the >>>>>> global >>>>>> mutex for that. If we don't need to hold the global mutex, we may allow >>>>>> that. >>>>> >>>>> Some thoughts/questions about it: >>>>> >>>>> does this require to go all the way through the list at once? >>>>> Probably it's not a problem, but in case of many connections >>>>> destruction/creation, isn't there a chance we keep going through the >>>>> list for a long time? >>>> >>>> I'm not sure if that is a problem or not, but I don't really like that >>>> implementation either way. Such extensive and awkward re-queueing will >>>> not look great. >>>> >>>>> >>>>> For sure it is the most relevant, but I wonder if this is only >>>>> applicable to the ESTABLISHED list. >>>>> I'm thinking to e.g. UDP_FIRST and UDP_MULTIPLE with custom timeouts. 
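One more aside, on the search+insert idea above, since it may help to visualize it. This is a toy, single-threaded sketch with made-up types (not the OVS rculist API), deliberately ignoring the concurrency that makes it hard in practice: instead of a plain push_back on every expiration update, walk the list to the first entry that expires later and link the connection in there, so the list stays sorted by expiration.

struct exp_node {
    struct exp_node *next;
    long long expiration;        /* absolute expiry time, in ms */
};

/* Keep the singly linked list sorted by expiration: skip entries that
 * expire no later than 'conn', then link 'conn' in at that position.
 * O(n) per update, which is the extra cost mentioned above; a plain
 * push_back is O(1) but lets updated entries break the ordering that
 * the sweep relies on when it stops at the first non-expired entry. */
void
exp_list_insert_sorted(struct exp_node **head, struct exp_node *conn)
{
    struct exp_node **p = head;

    while (*p && (*p)->expiration <= conn->expiration) {
        p = &(*p)->next;
    }
    conn->next = *p;
    *p = conn;
}

As noted above, a concurrent update of the expiration while such a walk is in progress would still leave the list only approximately sorted, so on its own this doesn't fully solve the ordering problem; its main benefit would be that closed connections no longer sit behind idle established ones.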
>>>> >>>> I don't think it's specific to the ESTABLISHED list. It's just more >>>> pronounced on this one, because of the dramatic difference in timeout >>>> values. In general, entries in a list should all have the same >>>> timeout, otherwise we may fall into some nasty corner cases. >>>> >>> >>> I know it may not be ideal, but using more rcu_lists (at >>> connection creation the conns get assigned in round robin to one of the lists >>> to keep them balanced), not associated with any timeout, and scanning >>> them, without reordering, all the way at every reschedule time, checking >>> for expiration only (atomically updated by the PMDs), should solve the >>> problems (even the one related to the zones). This approach would add an >>> extra cost to the sweeper, because it has to scan all the connections every >>> time, but push_{front,back} and event handling also come with a >>> non-negligible cost. >>> >>> More rcu_lists may be needed because this would allow the sweeper to >>> yield the CPU (once the budget is exceeded) and restart from the next list >>> in line the next round, if needed. I guess we could also reduce the >>> creation/destruction problem (which leads the sweeper to a long walk >>> through the lists) by pushing every new connection to the front. >>> >> >> Ilya, >> >> Just to confirm that v5 [1] has been tested multiple times with the same >> scripts written and used by Robin, and it performs on average slightly >> better than v4 in both scenarios, so the numbers reported by Robin are >> applicable to v5 as well. >> >> The issues that affected v4 are no longer present and it passes all the >> conntrack-related unit tests. >> >> [1] https://patchwork.ozlabs.org/project/openvswitch/list/?series=307380 > > Just want to confirm that some of the order / performance issues discussed > aren't there any longer. Were there any noticeable CPU % > changes, especially w.r.t. pmd handler thread usage? I would expect > usage % to go up in both cases (since we won't be serialized around the > big ct lock any longer in some iterating cases). But I didn't see > anything about it (maybe I missed it).
Thank you, Aaron. You didn't miss it; they were not included, unfortunately. I'd confirm your expectation, as we previously spent a significant amount of time on the ct_lock, especially considering it was needed for every connection update (besides insertion and sweeping). This is what came up from the previous analysis. > > Additionally, the latest timeout policy change was quite a bit easier to > reason about and to review. I think we can take some of the code > cleanup work from the buckets approach, but I guess the timeouts policy > approach might overall be easier to maintain for now, especially this > close to the branching point. > I agree about this latest change [1], for exactly the reason you said. [1] https://patchwork.ozlabs.org/project/openvswitch/list/?series=307380 >>>>> >>>>> Otherwise, >>>>> >>>>>> If the list itself is too long, we may just have normal lists and use >>>>>> MPSC >>>>>> queue to enqueue "update" events for the ct_sweep thread the same way as >>>>>> what >>>>>> is done for the RCU thread in v1. Could that be a possible solution? >>>>>> >>>>> >>>>> I guess this involves some reordering (any update could) inside the same >>>>> list. >>>>> >>>>> In such a case we have to handle both the events (and the reordering, if we >>>>> care) and the evictions at two separate moments. >>>> >>>> I'm not sure if the strict order is required. A relatively ordered >>>> list where all re-scheduled entries are always after non-rescheduled >>>> ones might be sufficient, if we're not looking for extreme accuracy. >>>> This should work fine, I guess, as long as we re-schedule everything >>>> that was updated within, let's say, a second, and as long as all >>>> connections in the same list have exactly the same timeout value. >>>> >>>>> >>>>>> rculist has a chance to be a viable option; it would be great to look at >>>>>> performance >>>>>> numbers with the code snippet above. >>>>>> >>>>>> Paolo's series seems to not have any obvious issues, at least I didn't >>>>>> spot >>>>>> any at a quick glance. At least, until buckets are balanced. In some >>>>>> cases >>>>>> we may end up locking the same bucket from different threads frequently, >>>>>> though. >>> >>> That's right. I suppose that this would affect the ability to sustain a >>> high rate of connection creation/destruction, as the locks are acquired in >>> that case. The short-lived connection test result would sink in that >>> case, but that should lead to numbers comparable to the other series, or >>> slightly worse (slightly worse is what I see recompiling with one bucket >>> only). A significant imbalance (like a large number of connections >>> in very few buckets) might be detrimental for the sweeper, though (which >>> would run for a long time in that case). >>> >>> Long-lived connections or a bulk transfer with a single connection should not be >>> affected. >>> >>>>>> Do you have the original and reply traffic being handled by the same or >>>>>> different PMD threads in your performance tests? Also, a lot of >>>>>> different >>>>>> connections is only one side of the performance. Having a single >>>>>> traffic-heavy >>>>>> connection where original and reply directions are handled by 2 >>>>>> different PMD >>>>>> threads would be an important test as well. I think it is even closer >>>>>> to >>>>>> real-world cases. >>>>>> >>>>> >>>>> Will be back on this and the v1/v4 + your snippet (which seems to be >>>>> potentially a measurable improvement) once some numbers are available. >>>>> >>> >>> Back on this and the single flow.
>>> >>> We tested it and the numbers are not comparable with the previous ones, as we >>> tested it on different hw (and a slightly different setup), but the >>> improvement we measured is around 5% on v4 and around 4% on v1 for short >>> lived connections. >>> >>> No measurable improvement for long lived connections, which makes sense >>> as it deals with much lower cps. >>> >>> Regarding the single connection, for the time being at least, the test >>> was performed with a TCP connection with different pmds (same goes for >>> the previous tests) handling the connection, and the performance looks >>> comparable. Gaetan's v4 performs slightly better: >>> >>> >>> | series | MSS 1400 | MSS 100 | >>> |-------------+----------------+----------------| >>> | ct-scale-v4 | 12.7 Gb/sec | 1.60 Gb/sec | >>> | ct-buckets | 12.6 Gb/sec | 1.54 Gb/sec | >>> | baseline | 11.9 Gb/sec | 1.14 Gb/sec | >>> >>> These numbers could be further confirmed using a more capable >>> generator (if needed), as the one used was not very powerful, but the >>> expectation is that the numbers will be confirmed. >>>>>> Thoughts? >>>>>> >>>>>> Best regards, Ilya Maximets. >>>>> _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev