Ilya Maximets <i.maxim...@ovn.org> writes:

> On 5/5/22 05:52, Michael Santana wrote:
>> On Mon, Apr 25, 2022 at 2:32 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>>
>>> On 4/19/22 05:19, Michael Santana wrote:
>>>> On Tue, Apr 5, 2022 at 6:32 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>>>>
>>>>> On 4/4/22 14:36, Michael Santana wrote:
>>>>>> The handler and CPU mapping in upcalls are incorrect, and this is
>>>>>> especially noticeable on systems with CPU isolation enabled.
>>>>>>
>>>>>> Say we have a 12-core system where only every even-numbered CPU is
>>>>>> enabled: C0, C2, C4, C6, C8, C10.
>>>>>>
>>>>>> This means we will create an array of size 6 that will be sent to
>>>>>> the kernel, populated with sockets [S0, S1, S2, S3, S4, S5].
>>>>>>
>>>>>> The problem is that when the kernel does an upcall it checks the
>>>>>> socket array via the index of the CPU, effectively adding additional
>>>>>> load on some CPUs while leaving no work on other CPUs.
>>>>>>
>>>>>> e.g.
>>>>>>
>>>>>> C0 indexes to S0
>>>>>> C2 indexes to S2 (should be S1)
>>>>>> C4 indexes to S4 (should be S2)
>>>>>>
>>>>>> Modulo of 6 (the size of the socket array) is applied, so we wrap
>>>>>> back to S0:
>>>>>> C6 indexes to S0 (should be S3)
>>>>>> C8 indexes to S2 (should be S4)
>>>>>> C10 indexes to S4 (should be S5)
>>>>>>
>>>>>> Effectively, sockets S0, S2, S4 get overloaded while sockets S1, S3,
>>>>>> S5 get no work assigned to them.
>>>>>>
>>>>>> This leads the kernel to throw the following message:
>>>>>> "openvswitch: cpu_id mismatch with handler threads"
>>>>>>
>>>>>> To fix this we send the kernel a corrected array of sockets sized to
>>>>>> the total number of CPUs in the system. In the above example we
>>>>>> would create a corrected array as follows:
>>>>>> [S0, S1, S1, S2, S2, S3, S3, S4, S4, S5, S5, S0]
>>>>>>
>>>>>> This guarantees that regardless of which CPU a packet comes in on,
>>>>>> the kernel will map it to the correct socket.
>>>>>
>>>>> Hey, Michael. Thanks for working on this issue!
>>>>>
>>>>> I agree that the array is too short and ovs-vswitchd has to supply
>>>>> a full array with values for each physical core.
>>>>> This could even be a separate patch that just fills in the array
>>>>> for each physical core in a round-robin manner.
>>>>>
>>>>> However, I'm not sure it is possible to predict what a good
>>>>> distribution of sockets among cores should look like.
>>>>>
>>>>> Taking your current implementation as an example, I'm assuming
>>>>> that the suggested distribution (socket duplication for neighboring
>>>>> CPU cores) can be good for a system with 2 NUMA nodes and cores
>>>>> enumerated as odd - NUMA0, even - NUMA1. But the enumeration
>>>>> scheme of CPU cores is not set in stone; most multi-NUMA systems
>>>>> have a BIOS configuration knob to change the enumeration method.
>>>>> One of the most popular ones is: lower cores - NUMA0, higher cores
>>>>> - NUMA1. In this case, since interrupts from a NIC are likely
>>>>> arriving on cores from the same NUMA node, sockets S1-S3 will
>>>>> carry all the traffic, while sockets S4-S5 will remain unused.
>>>>>
>>>>> Another point is that we don't really know how many IRQ vectors
>>>>> are registered for NICs and what the IRQ affinity is for them.
>>>>> IOW, we don't know on which cores actual interrupts will be
>>>>> handled and on which sockets they will end up. IRQ affinity
>>>>> and the number of queues are also dynamic and can change over
>>>>> time.
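
(For illustration only, a minimal C sketch of the corrected-array
construction described in the commit message above; this is not the
actual patch. The names fill_upcall_pids(), handler_socks[] and
cpu_enabled[] are invented for the example, and 0 is assumed to be an
invalid socket id usable as a sentinel.)

    #include <stdbool.h>
    #include <stdint.h>

    /* Fill one entry per physical CPU.  Each enabled CPU gets its own
     * handler socket; a disabled CPU borrows the socket of the next
     * enabled CPU, wrapping around, so every index resolves to a valid
     * socket, e.g. [S0, S1, S1, S2, S2, S3, S3, S4, S4, S5, S5, S0]
     * for the 12-CPU example above.  Assumes at least one enabled CPU. */
    static void
    fill_upcall_pids(uint32_t *pids, int n_cpus,
                     const uint32_t *handler_socks, const bool *cpu_enabled)
    {
        int handler = 0;

        /* First pass: give each enabled CPU its own handler socket. */
        for (int cpu = 0; cpu < n_cpus; cpu++) {
            pids[cpu] = cpu_enabled[cpu] ? handler_socks[handler++] : 0;
        }
        /* Second pass: map each disabled CPU to the socket of the next
         * assigned CPU, wrapping around the end of the array. */
        for (int cpu = 0; cpu < n_cpus; cpu++) {
            for (int j = 1; !pids[cpu] && j < n_cpus; j++) {
                pids[cpu] = pids[(cpu + j) % n_cpus];
            }
        }
    }
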
>>>>>
>>>>> So, AFAICT, regardless of the positioning of the sockets in the
>>>>> array, there will be very good and very bad cases for it with
>>>>> more or less the same probability.
>>>>>
>>>>> It might be that the only option to guarantee more or less even
>>>>> load distribution is to always create N handlers, where N
>>>>> is the number of actual physical cores, and let the scheduler
>>>>> and in-kernel IRQ affinity deal with load balancing.
>>>>> Yes, we're adding extra overhead in case of limited CPU
>>>>> affinity for ovs-vswitchd, but OTOH if we have a high volume
>>>>> of traffic received on a big number of CPU cores while OVS is
>>>>> limited to a single core, that sounds like a misconfiguration.
>>>>> Some performance testing is required, for sure.
>>>>>
>>>>> Thoughts?
>>>>
>>>>
>>>> Thanks for the feedback. There are a couple of points I would like
>>>> to make regarding your suggestions, and a couple of clarifications.
>>>> (aaron++ thanks for helping me draft this)
>>>>
>>>> In this patch, when we pass the array of port ids to the kernel we
>>>> fill the entire array with valid port ids; as such, we create an
>>>> array that is the size of the number of CPUs on the system.
>>>>
>>>> The question now becomes how we handle isolated cores. There
>>>> shouldn't be any packet processing occurring on any isolated core,
>>>> since the cores are supposedly isolated. The way the current code
>>>> handles isolated cores is stated in the original commit message.
>>>> Moving forward we have two choices to deal with the isolated cores:
>>>> 1) fill the array with handler portids to allow upcalls to pass, or
>>>> 2) fill the array with '0' on the isolated cores to drop them. We
>>>> chose option 1 so that upcalls will continue to pass.
>>>>
>>>> We decided to go with 1 _just in case_; however, I do understand if
>>>> option 2 is preferable. If you think option 2 is better then it's
>>>> not a problem and we can move forward with this.
>>>
>>> Dropping packets is not an option, so we need to fill all the values
>>> in the array with correct handler ids.
>>>
>>>>
>>>>
>>>> The alternatives you have suggested might not be good solutions
>>>> either, for the following reasons:
>>>>
>>>> Handlers are not fixed to a single core. They have affinity for all
>>>> active cores and can float across all NUMA nodes anyway. They are
>>>> not pinned to a _single_ core regardless of the isolation set up, e.g.
>>>> [root@wsfd-netdev60 ~]# taskset -p 396826
>>>> pid 396826's current affinity mask: 555555
>>>
>>> I haven't worked with the kernel scheduler for quite a bit now, but
>>> IIRC, it tries to wake up threads on cores that share the same
>>> LLC with the one where the initial kernel handling was performed, and
>>> it will also generally balance the load by moving preempted threads
>>> to available cores. So, the actual cores on which threads wake
>>> up are not our concern. We just need to make sure that different
>>> cores are not trying to wake up the same thread simultaneously.
>> Why do we care if two cores (I assume one inactive core and one active
>> core) wake up the same thread?
>
> Because "inactive" doesn't mean that we can't receive traffic on them;
> see below.
>
>>
>> I assume the thread will eventually deal with the upcall of both
>> cores. Yes, performance will be worse than if you had two dedicated
>> threads for these cores. But remember that in this hypothetical
>> scenario one of the cores is supposed to be inactive.
>> There isn't supposed to be any traffic on this inactive core. We
>> chose the schema in this patch because it makes sure each active core
>> gets a distinct handler thread. This schema also has a nice side
>> effect: as a 'best effort', it deals with packets on inactive cores,
>> where there shouldn't be any traffic, just in case there is. We
>> don't know how many packets will go through inactive cores; I would
>> imagine none.
>>
>> But you are asking for something that's beyond 'best effort' to deal
>> with upcalls on inactive cores. You are asking for 'best performance'
>> to deal with upcalls on inactive cores. I don't think 'best
>> performance' is necessary for inactive cores, because again there
>> shouldn't be traffic on these inactive cores in the first place.
>
> I think there is a misunderstanding. "Inactive" cores can and will
> receive traffic. We're talking about cores not being available
> to the ovs-vswitchd userspace process, but they can still be available
> to the kernel and NIC IRQs. For sure, there are cases where cores
> are fully isolated and the kernel doesn't allow any work to be done on
> these cores unless explicitly asked by a user. But that's not a
> common case. From my experience, modern setups tend to dynamically
> limit CPU resources for userspace processes with tuned or cgroups,
> while the kernel is free to use cores as it wants.
>
> So, let's look at your example (but with 6 cores):
>
> c0 c1 c2 c3 c4 c5
>
> Let's say ovs-vswitchd has CPU affinity limited to cores 0, 2 and 4.
> With the current patch the socket distribution will look like:
>
> c0 c1 c2 c3 c4 c5
> s0 s1 s1 s2 s2 s0
>
> Now, if the IRQ affinity for 2 Rx queues of some network card is set
> to cores 1 and 2, the thread servicing socket s1 will have to handle
> traffic received on both Rx queues. It's also likely that handling
> interrupts on a different core from where the handler thread runs
> has some performance benefits, as upcalls should not really depend on
> cache locality of the packet data (cache locality of userspace
> structures is more important).
>
> And that is the case where 2 cores are trying to wake up the same
> userspace thread while other threads are not loaded.
>
> And here goes my point: regardless of the way we fill up the
> array, there will always be such a worst-case scenario that we can
> hit easily.
>
>> Creating inactive handler threads will create more overhead
>
> That is true, but as I said, that is something we need to test and
> measure.
>
>> and we
>> will need to modify the code that determines the number of threads to
>> always be the total number of installed cores in your round-robin
>> solution.
>
> The round-robin solution doesn't require the number of threads to be
> equal to the number of cores. If they are equal, there is no
> "round-robin" component, as every thread/socket will be chosen only
> once, i.e. a 1:1 mapping.
>
> The point was that the round-robin solution is not worse than any
> other distribution of sockets between cores. To be clear, round-robin:
>
> c0 c1 c2 c3 c4 c5
> s0 s1 s2 s0 s1 s2
>
> It's basically the same as what we have right now, but without warnings
> from the kernel. This will not solve the imbalance for the case in the
> commit message, but the solution proposed in this patch will not solve
> the imbalance for some other cases and will make the imbalance worse
> for systems with a different core numbering scheme.
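
(Again purely illustrative: a hypothetical C version of the round-robin
fill described above, with invented names. When n_handlers == n_cpus it
degenerates to the 1:1 mapping discussed below.)

    #include <stdint.h>

    /* CPU i simply gets handler socket i % n_handlers, e.g. for
     * 3 handlers on 6 cores: [s0, s1, s2, s0, s1, s2].
     * Assumes n_handlers > 0. */
    static void
    fill_upcall_pids_rr(uint32_t *pids, int n_cpus,
                        const uint32_t *handler_socks, int n_handlers)
    {
        for (int cpu = 0; cpu < n_cpus; cpu++) {
            pids[cpu] = handler_socks[cpu % n_handlers];
        }
    }
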
>
> On the other hand, the solution of always creating as many threads as
> the total number of cores is the only solution that may solve the
> imbalance in the general case, since there are no corner cases where
> the same thread is servicing packets from more than one core. But this
> solution will have a scheduling overhead in case many threads are
> packed on a small number of cores; again, this overhead needs to be
> tested.
>
> For the code changes, you already have a function to count the total
> number of cores; it doesn't look like we'll need a lot of code
> modifications to change the number of created threads.
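
(A sketch of that suggestion, assuming the core-counting helper Ilya
refers to behaves like OVS's count_cpu_cores() in lib/util.h; the
wrapper below is invented for illustration, not existing OVS code.)

    #include "util.h"

    /* One handler per physical core makes every entry in the per-CPU
     * socket array unique, so no two CPUs ever share a handler. */
    static int
    choose_n_handlers(void)
    {
        int n_cores = count_cpu_cores();

        /* Fall back to a single handler if the count is unavailable. */
        return n_cores > 0 ? n_cores : 1;
    }
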
This presupposes that the user is okay with OvS creating those extra
threads. Maybe that's okay - I guess the only thing the user cares
about is that we don't run on cores that aren't in our taskset.
Unfortunately, the per-cpu dispatch eliminated the n-handler-threads
configurable, on the assumption that handlers for the non-isolated
CPUs are the only threads we want active. This might be an incorrect
assumption on our part.

I think it makes sense to do round-robin, and then we can also have a
configuration for running with n-handler-threads == n_cores (in which
case, round-robin will produce a 1:1 mapping). Does this sound right?
I want to make sure to cover the concerns.

>> That code would also need to change so that the revalidators
>> stay proportional to the number of active cores.
>
> That's true, we don't want revalidators to compete for CPU resources.
>
>> I think it is more effort for
>> diminishing returns
>>
>> With all that said, if you still think that 'best performance' is the
>> way to go, we can absolutely go down that route. I can get started
>> with a V2 using your proposed round-robin solution as well as the
>> additional code change for the number of handlers+revalidators.
>>
>> Just let me know if you prefer your round-robin solution, or if you
>> would like to proceed with the schema proposed in this patch. Either
>> way, let me know your thoughts.
>>
>> Best,
>> Michael Santana
>>
>>
>>
>>>
>>> My point is that any clever mapping schema will have cases where
>>> multiple cores are trying to wake up the same thread. So, we could
>>> just as well fill the array in a round-robin way without thinking
>>> too much.
>>>
>>> The only schema where waking up the same thread by different cores
>>> cannot happen is a schema where all the values in the array are
>>> different, i.e. the number of threads is equal to the total number
>>> of CPU cores. But yes, that leads to the question about the
>>> efficiency of such a solution.
>>>
>>>>
>>>> More handlers than necessary for all cores hurts performance. We
>>>> probably don't want to create handlers for cores that are supposed
>>>> to be inactive. Since they have affinity for any active core,
>>>> they will just compete with other handlers for active cores.
>>>
>>> Inactive threads (threads created for cores where packets are
>>> never received) will always sleep. Sleeping threads add a slight
>>> kernel scheduling overhead, but it's negligible in our case, since
>>> the number of threads is fairly low.
>>>
>>> If you have 100 cores but only 4 of them are available to OVS,
>>> that might be a problem, but only if we have a high upcall rate
>>> in the kernel on more than 4 cores. However, in that case, our poor
>>> 4 cores will likely not be able to handle these upcalls anyway,
>>> and more cores will need to be made available to OVS.
>>>
>>> High concurrency between handler threads can also be mitigated in
>>> some cases with IRQ affinity for NICs, i.e. limiting incoming
>>> traffic to a lower number of CPU cores if a high volume of upcalls
>>> is expected.
>>> And there are no clear workarounds for cases where bad distribution
>>> is a result of a "clever" mapping, as this mapping is generally
>>> unknown to the end user.
>>>
>>> In any case, I think the concurrency overhead has to be tested.
>>> It's not obvious that it should be large.
>>>
>>>> What's
>>>> more is that with an increased number of handlers we also have an
>>>> increased number of revalidators.
>>>> This would add additional, unnecessary load and contention.
>>>
>>> Is it necessary for the number of revalidators to depend on the
>>> number of handlers? Can we still count them using the number of
>>> actually available CPU cores?
>>>
>>>>
>>>> Additionally, this is the slow path; we try to process 'quickly',
>>>> but the cross-NUMA slowdown probably isn't as big of a deal here
>>>> (because the packet needs to go to userspace and get processed via
>>>> the ofproto pipeline anyway, which is a much slower operation
>>>> overall).
>>>
>>> I don't think that cross-NUMA slowdown is a concern. The kernel
>>> scheduler will try to migrate threads if possible anyway. Dragging
>>> the same thread back and forth between two different NUMA nodes
>>> might be a larger problem. But I agree that it's likely not a big
>>> deal for a handler thread.
>>>
>>>>
>>>>>
>>>>> Best regards, Ilya Maximets.
>>>>>
>>>>
>>>
>>
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev