Paolo Valerio <pvale...@redhat.com> writes:

> Paolo Valerio <pvale...@redhat.com> writes:
>
>> Ilya Maximets <i.maxim...@ovn.org> writes:
>>
>>> On 6/20/22 23:57, Paolo Valerio wrote:
>>>> Ilya Maximets <i.maxim...@ovn.org> writes:
>>>>
>>>>> On 6/7/22 11:39, Robin Jarry wrote:
>>>>>> Paolo Valerio, Jun 05, 2022 at 19:37:
>>>>>>> Just a note that may be useful.
>>>>>>> After some tests, I noticed that establishing e.g. two TCP connections, and leaving the first one idle after the 3whs, once the second connection expires (after moving to TIME_WAIT as a result of termination), it doesn't get evicted until an event gets scheduled for the first.
>>>>>>>
>>>>>>> ovs-appctl dpctl/dump-conntrack -s
>>>>>>> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9090,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9090),zone=1,timeout=84576,protoinfo=(state=ESTABLISHED)
>>>>>>> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9091,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9091),zone=1,timeout=0,protoinfo=(state=TIME_WAIT)
>>>>>>>
>>>>>>> This may be somewhat related to your results, as during the test the number of connections may reach the limit, apparently reducing performance.
>>>>>>
>>>>>> Indeed, there was an issue in my test procedure. Due to the way T-Rex generates connections, it is easy to fill the conntrack table after a few iterations, making the test results inconsistent.
>>>>>>
>>>>>> Also, the flows which I had configured were not correct. There was an extraneous action=NORMAL flow at the end. When the conntrack table is full and a new packet cannot be tracked, it is marked as +trk+inv and not dropped. This behaviour is specific to the userspace datapath. The Linux kernel datapath seems to drop the packet when it cannot be added to connection tracking.
>>>>>>
>>>>>> Gaëtan's series (v4) seems less resilient to the conntrack table being full, especially when there is more than one PMD core.
>>>>>>
>>>>>> I have changed the t-rex script to allow running arbitrary commands in between traffic iterations. This is leveraged to flush the conntrack table and run each iteration in the same conditions.
>>>>>>
>>>>>> https://github.com/cisco-system-traffic-generator/trex-core/blob/v2.98/scripts/cps_ndr.py
>>>>>>
>>>>>> To avoid filling the conntrack table, the max size was increased to 50M. The DUT configuration can be summarized as follows:
>>>>>>
>>>>>> ovs-vsctl set open_vswitch . other_config:dpdk-init=true
>>>>>> ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x15554"
>>>>>> ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
>>>>>> ovs-vsctl add-port br0 pf0 -- set Interface pf0 type=dpdk \
>>>>>>     options:dpdk-devargs=0000:3b:00.0 options:n_rxq=4 options:n_rxq_desc=4096
>>>>>> ovs-vsctl add-port br0 pf1 -- set Interface pf1 type=dpdk \
>>>>>>     options:dpdk-devargs=0000:3b:00.1 options:n_rxq=4 options:n_rxq_desc=4096
>>>>>> ovs-appctl dpctl/ct-set-maxconns 50000000
>>>>>> ovs-ofctl add-flow br0 "table=0,priority=10,ip,ct_state=-trk,actions=ct(table=0)"
>>>>>> ovs-ofctl add-flow br0 "table=0,priority=10,ip,ct_state=+trk+new,actions=ct(commit),NORMAL"
>>>>>> ovs-ofctl add-flow br0 "table=0,priority=10,ip,ct_state=+trk+est,actions=NORMAL"
>>>>>> ovs-ofctl add-flow br0 "table=0,priority=0,actions=drop"
>>>>>>
>>>>>> Short Lived Connections
>>>>>> -----------------------
>>>>>>
>>>>>> ./cps_ndr.py --sample-time 10 --max-iterations 8 --error-threshold 0.01 \
>>>>>>     --reset-command "ssh $dut ovs-appctl dpctl/flush-conntrack" \
>>>>>>     --udp-percent 1 --num-messages 1 --message-size 20 --server-wait 0 \
>>>>>>     --min-cps 10k --max-cps 600k
>>>>>>
>>>>>> ============== =============== ============== ============== ===========
>>>>>> Series         Num. Flows      CPS            PPS            BPS
>>>>>> ============== =============== ============== ============== ===========
>>>>>> Baseline       40.1K           79.3K/s        556Kp/s        347Mb/s
>>>>>> Gaetan v1      60.5K           121K/s         837Kp/s        522Mb/s
>>>>>> Gaetan v4      61.4K           122K/s         850Kp/s        530Mb/s
>>>>>> Paolo          377K            756K/s         5.3Mp/s        3.3Gb/s
>>>>>> ============== =============== ============== ============== ===========
>>>>>>
>>>>>> Even after fixing the test procedure, Paolo's series still performs a lot better with short-lived connections.
>>>>>>
>>>>>> Long Lived Connections
>>>>>> ----------------------
>>>>>>
>>>>>> ./cps_ndr.py --sample-time 30 --max-iterations 8 --error-threshold 0.01 \
>>>>>>     --reset-command "ssh $dut ovs-appctl dpctl/flush-conntrack" \
>>>>>>     --udp-percent 1 --num-messages 500 --message-size 20 --server-wait 50 \
>>>>>>     --min-cps 100 --max-cps 10k
>>>>>>
>>>>>> ============== =============== ============== ============== ===========
>>>>>> Series         Num. Flows      CPS            PPS            BPS
>>>>>> ============== =============== ============== ============== ===========
>>>>>> Baseline       17.4K           504/s          633Kp/s        422Mb/s
>>>>>> Gaetan v1      80.4K           3.1K/s         4.6Mp/s        3.0Gb/s
>>>>>> Gaetan v4      139K            5.4K/s         8.2Mp/s        5.4Gb/s
>>>>>> Paolo          132K            5.2K/s         7.7Mp/s        5.2Gb/s
>>>>>> ============== =============== ============== ============== ===========
>>>>>>
>>>>>> Thanks to Paolo for his help on this second round of tests.
>>>>>
>>>>> Hi, Robin and Paolo.
>>>>>
>>>>
>>>> Hello Ilya,
>>>>
>>>> thanks for the feedback about this.
>>>>
>>>>> Could you try the following change on top of Gaetan's v1 and v4:
>>>>>
>>>>> diff --git a/lib/conntrack.c b/lib/conntrack.c
>>>>> index aa4792f69..2c2203a4b 100644
>>>>> --- a/lib/conntrack.c
>>>>> +++ b/lib/conntrack.c
>>>>> @@ -633,14 +633,17 @@ conn_key_lookup(struct conntrack *ct, const struct conn_key *key,
>>>>>      bool found = false;
>>>>>
>>>>>      CMAP_FOR_EACH_WITH_HASH (conn, cm_node, hash, &ct->conns) {
>>>>> -        if (!conn_key_cmp(&conn->key, key) && !conn_expired(conn, now)) {
>>>>> +        if (conn_expired(conn, now)) {
>>>>> +            continue;
>>>>> +        }
>>>>> +        if (!conn_key_cmp(&conn->key, key)) {
>>>>>              found = true;
>>>>>              if (reply) {
>>>>>                  *reply = false;
>>>>>              }
>>>>>              break;
>>>>>          }
>>>>> -        if (!conn_key_cmp(&conn->rev_key, key) && !conn_expired(conn, now)) {
>>>>> +        if (!conn_key_cmp(&conn->rev_key, key)) {
>>>>>              found = true;
>>>>>              if (reply) {
>>>>>                  *reply = true;
>>>>> ---
>>>>>
>>>>> ?
>>>>>
>>>>> Let me explain. As far as I can tell without actually running performance tests myself, the main issue with both variants of Gaetan's series is a slow cleanup of expired connections.
>>>>>
>>>>> v4, IMO, contains a serious bug, because the expiration list is treated at the sweep phase as if it was sorted by the expiration time. But that is not really true, because connections can not be re-ordered. They seem to stay in the exact same order as they were created, but expiration times are getting updated when packets are received. However, the sweep process stops at the moment the first non-expired connection is seen, so we're not cleaning up later connections, as reported by Paolo. Taking into account the fact that the default timeout policy for established TCP connections appears to be 24 hours (or did I misread this?), closed connections can stay uncleaned for that long. That seems like way too much. The point is, the whole queue should be inspected on every sweep iteration in order to re-sort it and actually move connections between different expiration lists, to avoid idle established connections blocking closed ones. This will probably be slow (?). An MPSC queue doesn't seem to be very suitable for such a use case.
>>>>
>>>> I agree with the cause of the issue, and about the MPSC suitability as well, unless the queues are used in different ways.
>>>>
>>>> Speaking of connections blocking other connections, there's a potential problem in the current code related to zones and timeout policy.
>>>>
>>>> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9090,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9090),zone=1,timeout=0,protoinfo=(state=TIME_WAIT)
>>>> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9091,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9091),timeout=180,protoinfo=(state=TIME_WAIT)
>>>>
>>>> In this case I set the tcp_end timeout to 200 seconds for zone 0 and 20 seconds for zone 1. The problem is less severe compared to the ct-scale series, but it could (more or less significantly, depending on how timeouts are configured) increase the eviction latency of a connection in the same list.
>>>
>>> Yep. I noticed that issue too. It is a bug in the current implementation. Expiration lists should not be global, but per-zone, AFAICT.
>>>
>>>>
>>>>> v1. Here the situation is better, since threads are kind of allowed to reschedule on their own, but via the RCU thread. It looks like re-scheduled connections can still get mixed up with each other, because the expiration can be updated multiple times while the re-insert happens only once. And also these remove+push_back
>>>>
>>>> which also makes me wonder if there's a chance of losing a timeout transition.
>>>>
>>>>> operations for different threads will be additionally re-ordered by the RCU thread. So, the order is not really preserved here either. In order to fix that we would need to replace the push_back operation with search+insert, i.e. find the place where the connection should be inserted. And even that will not fully protect from a concurrent update of the expiration. OTOH, a not fully sorted list may not be a huge problem, because connections are still moved between different expiration lists, so an ESTABLISHED connection can not block cleanup of a closed one.
>>>>> The RCU thread, though, can be a serious bottleneck in this concept. PMD threads enter the quiescent state only once per 1024 iterations, or even more rarely, so we can have 64K or more updates per thread per grace period. So, expired or closed connections can stay around for a prolonged period of time.
>>>>
>>>> That's a good point.
>>>>
>>>>>
>>>>> So, in both v1 and v4 of Gaetan's series there are probably a lot of elements in the hash map, and lookup is likely slow because of this. Also, conn_key_cmp() is an extremely heavy operation, and we're performing it twice for every connection with the same hash. Paolo's series cheats here by quickly skipping or even cleaning up expired connections during the lookup. Gaetan's series is not able to clean them up, but can still quickly skip all expired ones without executing two costly comparisons. Hence the code snippet above. Previously, the conn_expired check required the global lock, but it doesn't with any of the series applied.
>>>>>
>>>>> In conclusion, I don't think that we can efficiently use an MPSC queue directly. At least, we need the ability to efficiently move connections between lists to avoid CLOSED connections being blocked by ESTABLISHED ones. One option is to always look through the whole list and re-insert every connection. That might not be a huge problem, though. It was a problem when we had to hold the global mutex for that; if we don't need to hold the global mutex, we may allow it.
>>>>
>>>> Some thoughts/questions about it:
>>>>
>>>> does this require going all the way through the list at once? Probably it's not a problem, but in case of heavy connection creation/destruction, isn't there a chance we keep going through the list for a long time?
>>>
>>> I'm not sure if that is a problem or not, but I don't really like that implementation either way. Such extensive and awkward re-queueing will not look great.
>>>
>>>>
>>>> For sure it is the most relevant case, but I wonder if this is only applicable to the ESTABLISHED list. I'm thinking of e.g. UDP_FIRST and UDP_MULTIPLE with custom timeouts.
>>>
>>> I don't think it's specific to the ESTABLISHED list. It's just more pronounced on that one, because of the dramatic difference in timeout values. In general, entries in a list should all have the same timeout, otherwise we may fall into some nasty corner cases.
>>>
>>
>> I know it may not be ideal, but using more rcu_lists, not associated with any timeout (at connection creation the conns get assigned round-robin to one of the lists to keep them balanced), and scanning them all the way through at every reschedule time, without reordering, checking only for expiration (atomically updated by the PMDs), should solve the problems (even the one related to the zones). This approach would add an extra cost to the sweeper, because it has to scan all the connections every time, but push_{front,back} and event handling also come with a non-negligible cost.
>>
>> More rcu_lists may be needed because this would allow the sweeper to yield the CPU (once the budget is exceeded) and restart from the next list in line on the next round, if needed. I guess we could also reduce the creation/destruction problem (which leads the sweeper to a long walk through the lists) by pushing every new connection to the front.
>>
> Ilya,
>
> Just to confirm that v5 [1] has been tested multiple times with the same scripts written and used by Robin, and it performs on average slightly better than v4 in both scenarios, so the numbers reported by Robin are applicable to v5 as well.
>
> The issues that affected v4 are no longer present and it passes all the conntrack-related unit tests.
>
> [1] https://patchwork.ozlabs.org/project/openvswitch/list/?series=307380
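As an aside, here is a rough, self-contained sketch of the multiple-rculist idea Paolo describes a bit earlier in the thread (round-robin assignment of connections at creation time, expiration timestamps updated atomically by the PMD threads, and a budget-limited sweeper that walks the lists without ever reordering them). All structure and function names below are hypothetical illustrations, not code from any of the series:

/*
 * Illustrative sketch only (not OVS code): connections are assigned
 * round-robin to one of N lists at creation time, PMD threads only update
 * an atomic expiration timestamp, and the sweeper walks the lists without
 * reordering them, evicting expired entries and yielding once a per-round
 * budget is exceeded.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define N_SWEEP_LISTS 32

struct sketch_conn {
    struct sketch_conn *next;
    _Atomic int64_t expiration;   /* Updated by PMD threads on each packet. */
};

struct sweep_list {
    struct sketch_conn *head;     /* Insertion order only, never re-sorted. */
};

static struct sweep_list lists[N_SWEEP_LISTS];
static atomic_uint next_list;     /* Round-robin cursor for new connections. */
static unsigned sweep_cursor;     /* Next list to scan (sweeper thread only). */

/* Called at connection creation: balance connections across the lists. */
static void
sketch_conn_insert(struct sketch_conn *conn)
{
    unsigned i = atomic_fetch_add(&next_list, 1) % N_SWEEP_LISTS;

    /* Push front, as suggested above; the list is never re-sorted later.
     * A real implementation needs an RCU-safe list here. */
    conn->next = lists[i].head;
    lists[i].head = conn;
}

/* One sweeper round: scan whole lists checking only the expiration,
 * evict expired connections, and stop once 'budget' entries have been
 * visited, resuming from the next list in line on the following round. */
static void
sketch_sweep_round(int64_t now, size_t budget)
{
    size_t seen = 0;

    for (unsigned n = 0; n < N_SWEEP_LISTS && seen < budget; n++) {
        struct sweep_list *list = &lists[sweep_cursor];
        struct sketch_conn **prev = &list->head;

        while (*prev) {
            struct sketch_conn *conn = *prev;

            seen++;
            if (atomic_load(&conn->expiration) <= now) {
                *prev = conn->next;   /* Evict: unlink and free (a real
                                       * implementation would defer the
                                       * free until an RCU grace period). */
                free(conn);
            } else {
                prev = &conn->next;   /* Keep; no reordering needed. */
            }
        }
        sweep_cursor = (sweep_cursor + 1) % N_SWEEP_LISTS;
    }
}

A real implementation would additionally need RCU-safe list traversal (e.g. the existing rculist) so that eviction can run concurrently with lookups from the PMD threads.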
Just want to confirm that some of the ordering / performance issues discussed aren't there any longer. Were there any noticeable CPU % changes, especially w.r.t. PMD and handler thread usage? I would expect usage % to go up in both cases (since we won't be serialized around the big ct lock any longer in some iterating cases), but I didn't see anything about it (maybe I missed it).

Additionally, the latest timeout-policy change was quite a bit easier to reason about and to review. I think we can take some of the code cleanup work from the buckets approach, but I guess the timeout-policy approach might overall be easier to maintain for now, especially this close to the branching point.

>>>>
>>>> Otherwise,
>>>>
>>>>> If the list itself is too long, we may just have normal lists and use an MPSC queue to enqueue "update" events for the ct_sweep thread, the same way as is done for the RCU thread in v1. Could that be a possible solution?
>>>>>
>>>>
>>>> I guess this involves some reordering (any update could) inside the same list.
>>>>
>>>> In that case we have to handle both the events (and the reordering, if we care) and the evictions at two separate moments.
>>>
>>> I'm not sure if the strict order is required. A relatively ordered list where all re-scheduled entries are always after non-rescheduled ones might be sufficient, if we're not looking for extreme accuracy. This should work fine, I guess, as long as we re-schedule everything that was updated within, let's say, a second, and as long as all connections in the same list have exactly the same timeout value.
>>>
>>>>
>>>>> rculist has a chance to be a viable option; it would be great to look at performance numbers with the code snippet above.
>>>>>
>>>>> Paolo's series seems to not have any obvious issues, at least I didn't spot any on a quick glance. At least, as long as buckets are balanced. In some cases we may end up locking the same bucket from different threads frequently though.
>>
>> That's right. I suppose that this would affect the ability to sustain a high rate of connection creation/destruction, as the locks are acquired in that case. The short-lived connection test result would sink in that case, but that should lead to numbers comparable to the other series, or slightly worse (slightly worse is what I see recompiling with one bucket only). A significant imbalance (like a significant number of connections in very few buckets) might be detrimental for the sweeper, though (which would run for a long time in that case).
>>
>> Long-lived connections or a bulk transfer with a single connection should not be affected.
>>
>>>>> Do you have the original and reply traffic being handled by the same or different PMD threads in your performance tests? Also, a lot of different connections is only one side of the performance. Having a single traffic-heavy connection where original and reply directions are handled by two different PMD threads would be an important test as well. I think it is even closer to real-world cases.
>>>>>
>>>>
>>>> Will be back on this and the v1/v4 + your snippet (which seems to be potentially a measurable improvement) once some numbers are available.
>>>>
>>
>> Back on this and the single flow.
>>
>> We tested it and the numbers are not comparable with the previous ones, as we tested on different hardware (and a slightly different setup), but the improvement we measured is around 5% on v4 and around 4% on v1 for short-lived connections.
>>
>> No measurable improvement for long-lived connections, which makes sense as that case deals with a much lower cps.
>>
>> Regarding the single connection, for the time being at least, the test was performed with a TCP connection handled by different PMDs (same goes for the previous tests), and the performance looks comparable. Gaetan's v4 performs slightly better:
>>
>> | series      | MSS 1400    | MSS 100     |
>> |-------------+-------------+-------------|
>> | ct-scale-v4 | 12.7 Gb/sec | 1.60 Gb/sec |
>> | ct-buckets  | 12.6 Gb/sec | 1.54 Gb/sec |
>> | baseline    | 11.9 Gb/sec | 1.14 Gb/sec |
>>
>> These numbers could be further confirmed using a more capable generator (if needed), as the one used was not very powerful, but the expectation is that they would hold.
>>
>>>>> Thoughts?
>>>>>
>>>>> Best regards, Ilya Maximets.
>>>>
>

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev