Ilya Maximets <i.maxim...@ovn.org> writes:

> On 5/5/22 05:52, Michael Santana wrote:
>> On Mon, Apr 25, 2022 at 2:32 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>>
>>> On 4/19/22 05:19, Michael Santana wrote:
>>>> On Tue, Apr 5, 2022 at 6:32 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>>>>
>>>>> On 4/4/22 14:36, Michael Santana wrote:
>>>>>> The handler and CPU mapping in upcalls are incorrect, and this is
>>>>>> especially noticeable on systems with CPU isolation enabled.
>>>>>>
>>>>>> Say we have a 12-core system where only every even-numbered CPU is enabled:
>>>>>> C0, C2, C4, C6, C8, C10
>>>>>>
>>>>>> This means we will create an array of size 6, populated with sockets
>>>>>> [S0, S1, S2, S3, S4, S5], that is sent to the kernel.
>>>>>>
>>>>>> The problem is that when the kernel does an upcall it indexes the socket
>>>>>> array by the CPU number, effectively adding additional load on some
>>>>>> sockets while leaving no work for others.
>>>>>>
>>>>>> e.g.
>>>>>>
>>>>>> C0  indexes to S0
>>>>>> C2  indexes to S2 (should be S1)
>>>>>> C4  indexes to S4 (should be S2)
>>>>>>
>>>>>> Modulo of 6 (size of socket array) is applied, so we wrap back to S0
>>>>>> C6  indexes to S0 (should be S3)
>>>>>> C8  indexes to S2 (should be S4)
>>>>>> C10 indexes to S4 (should be S5)
>>>>>>
>>>>>> Effectively, sockets S0, S2, S4 get overloaded while sockets S1, S3, S5
>>>>>> get no work assigned to them.
>>>>>>
>>>>>> This leads the kernel to log the following message:
>>>>>> "openvswitch: cpu_id mismatch with handler threads"
>>>>>>
>>>>>> To fix this we send the kernel a corrected array of sockets sized to the
>>>>>> total number of CPUs in the system.  In the above example we would create
>>>>>> a corrected array as follows:
>>>>>> [S0, S1, S1, S2, S2, S3, S3, S4, S4, S5, S5, S0]
>>>>>>
>>>>>> This guarantees that, regardless of which CPU a packet comes in on, the
>>>>>> kernel will map it to the correct socket.
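>>>>>>
>>>>>> For illustration, the corrected array could be built roughly as in the
>>>>>> sketch below (a sketch of the schema described above with placeholder
>>>>>> names, not the literal patch code):
>>>>>>
>>>>>>     /* Sketch: give every CPU in the system a handler socket.  CPUs
>>>>>>      * that ovs-vswitchd may run on get distinct sockets in order;
>>>>>>      * all other CPUs reuse the socket of the next such CPU, wrapping
>>>>>>      * around, which reproduces the example array above.  Assumes at
>>>>>>      * least one enabled CPU. */
>>>>>>     static void
>>>>>>     fill_portid_array(uint32_t *portids, size_t n_cpus,
>>>>>>                       const bool *cpu_enabled,
>>>>>>                       const uint32_t *handler_pids, size_t n_handlers)
>>>>>>     {
>>>>>>         size_t handler = 0;
>>>>>>
>>>>>>         /* Enabled CPUs get distinct handler sockets in order. */
>>>>>>         for (size_t cpu = 0; cpu < n_cpus; cpu++) {
>>>>>>             if (cpu_enabled[cpu]) {
>>>>>>                 portids[cpu] = handler_pids[handler++ % n_handlers];
>>>>>>             }
>>>>>>         }
>>>>>>         /* Disabled CPUs borrow the socket of the next enabled CPU. */
>>>>>>         for (size_t cpu = 0; cpu < n_cpus; cpu++) {
>>>>>>             if (!cpu_enabled[cpu]) {
>>>>>>                 size_t next = (cpu + 1) % n_cpus;
>>>>>>                 while (!cpu_enabled[next]) {
>>>>>>                     next = (next + 1) % n_cpus;
>>>>>>                 }
>>>>>>                 portids[cpu] = portids[next];
>>>>>>             }
>>>>>>         }
>>>>>>     }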
>>>>>
>>>>> Hey, Michael.  Thanks for working on this issue!
>>>>>
>>>>> I agree that the array is too short and ovs-vswitchd has to supply
>>>>> a full array with values for each physical core.
>>>>> This could even be a separate patch that just fills in the array
>>>>> for each physical core in a round-robin manner.
>>>>>
>>>>> However, I'm not sure it is possible to predict what the good
>>>>> distribution of sockets among cores should look like.
>>>>>
>>>>> Taking your current implementation as an example, I'm assuming
>>>>> that the suggested distribution (socket duplication for neighboring
>>>>> CPU cores) can be good for a system with 2 NUMA nodes and cores
>>>>> enumerated as odd - NUMA0, even - NUMA1.  But the enumeration
>>>>> scheme of CPU cores is not set in stone; most multi-NUMA systems
>>>>> have a BIOS configuration knob to change the enumeration method.
>>>>> One of the most popular ones is: lower cores - NUMA0, higher cores
>>>>> - NUMA1.  In this case, since interrupts from a NIC are likely
>>>>> arriving on the cores from the same NUMA, sockets S1-S3 will
>>>>> carry all the traffic, while sockets S4-S5 will remain unused.
>>>>>
>>>>> Another point is that we don't really know how many irq vectors
>>>>> are registered for NICs and what is the irq affinity for them.
>>>>> IOW, we don't know on which cores actual interrupts will be
>>>>> handled and on which sockets they will end up.  IRQ affinity
>>>>> and number of queues are also dynamic and can be changed over
>>>>> time.
>>>>>
>>>>> So, AFAICT, regardless of the positioning of the sockets in the
>>>>> array, there will be very good and very bad cases for it with
>>>>> more or less the same probability.
>>>>>
>>>>> It might be that the only option to guarantee more or less even
>>>>> load distribution is to always create N handlers, where the N
>>>>> is the number of actual physical cores, and let the scheduler
>>>>> and in-kernel IRQ affinity deal with load balancing.
>>>>> Yes, we're adding extra overhead in case of limited CPU
>>>>> affinity for ovs-vswitchd, but OTOH if we have a high volume
>>>>> of traffic received on a big number of CPU cores while OVS is
>>>>> limited to a single core, that sounds like a misconfiguration.
>>>>> Some performance testing is required, for sure.
>>>>>
>>>>> Thoughts?
>>>>
>>>>
>>>> Thanks for the feedback. There are a couple of points I would like to
>>>> make regarding your suggestions, along with a couple of clarifications.
>>>> (aaron++ thanks for helping me draft this)
>>>>
>>>> In this patch, when we pass the array of port ids to the kernel we
>>>> fill the entire array with valid port ids, as such we create an array
>>>> that is the size of the number of CPUs on the system.
>>>>
>>>> The question now becomes how do we handle isolated cores. There
>>>> shouldn't be any packet processing occurring on any isolated core
>>>> since the cores are supposedly isolated. The way the current code
>>>> handles isolated cores is stated in the original commit message.
>>>> Moving forward we have two choices to deal with the isolated cores:
>>>> 1. fill the array with handler portids so that upcalls still pass, or
>>>> 2. fill the array with '0' for the isolated cores so that their upcalls
>>>> are dropped.  We chose option 1 so that upcalls will continue to pass.
>>>>
>>>> We decided to go with option 1 _just in case_; however, I do understand
>>>> if option 2 is preferable. If you think option 2 is better, then it's
>>>> not a problem and we can move forward with that.
>>>
>>> Dropping packets is not an option, so we need to fill all the values
>>> in the array with correct handler ids.
>>>
>>>>
>>>>
>>>> The alternatives you have suggested might not be good solutions either
>>>> for the following reasons
>>>>
>>>> Handlers are not fixed to a single core. They have affinity for all
>>>> active cores and can float across all NUMA nodes anyway. They are
>>>> not pinned to a _single_ core regardless of the isolation setup, e.g.:
>>>> [root@wsfd-netdev60 ~]# taskset -p 396826
>>>> pid 396826's current affinity mask: 555555
>>>
>>> I haven't worked with the kernel scheduler for quite a while now, but
>>> IIRC, it tries to wake up threads on cores that share the same
>>> LLC with the one where the initial kernel handling was performed, and
>>> it will also generally balance the load by moving preempted threads
>>> to available cores.  So, the actual cores on which threads are waking
>>> up is not our concern.  We just need to make sure that different
>>> cores are not trying to wake up the same thread simultaneously.
>> Why do we care if two cores (I assume one inactive core and one active
>> core) wake up the same thread?
>
> Because "inactive" doesn't mean that we can't receive traffic on them,
> see below.
>
>> 
>> I assume the thread will eventually deal with the upcall of both
>> cores. Yes, performance will be worse than if you had two dedicated
>> threads for these cores. But remember that in this hypothetical
>> scenario one of the cores is supposed to be inactive. There isn't
>> supposed to be any traffic going on this inactive core. We chose the
>> schema in this patch because it makes sure each active core gets a
>> distinct handler thread. This schema also has a nice side effect: as a
>> 'best effort', it deals with packets on inactive cores, where there
>> shouldn't be any traffic, just in case there is some. We
>> don't know how many packets will go through inactive cores. I would
>> imagine none.
>> 
>> But you are asking for something that's beyond 'best effort' to deal
>> with upcalls on inactive cores. You are asking for 'best performance'
>> to deal with upcalls on inactive cores. I don't think 'best
>> performance' is necessary for inactive cores, because again there
>> shouldn't be traffic on these inactive cores in the first place.
>
> I think there is a misunderstanding.  "Inactive" cores can and will
> receive traffic.  We're talking about cores not being available
> to the ovs-vswitchd userspace process, but they can still be available
> to the kernel and NIC irqs.  For sure, there are cases where cores
> are fully isolated and the kernel doesn't allow any work to be done on
> these cores unless explicitly asked by a user.  But that's not a
> common case.  From my experience, modern setups tend to dynamically
> limit CPU resources for userspace processes with tuned or cgroups,
> while the kernel is free to use cores as it wants.
>
> So, let's look at your example (but with 6 cores):
>
>   c0 c1 c2 c3 c4 c5
>
> Let's say ovs-vswitchd has CPU affinity limited to cores 0, 2 and 4.
> With the current patch socket distribution will look like:
>
>   c0 c1 c2 c3 c4 c5
>   s0 s1 s1 s2 s2 s0
>
> Now, if the IRQ affinity for 2 Rx queues of some network card is set
> to cores 1 and 2, the thread servicing socket s1 will have to handle
> traffic received on both Rx queues.  It's also likely that handling
> interrupts on a core other than the one where the handler thread runs
> has some performance benefits, as upcalls should not really depend on
> cache locality of the packet data (cache locality of userspace structures
> is more important).
>
> And that is the case where 2 cores are trying to wake up the same
> userspace thread while other threads are not loaded.
>
> And here lies my point: regardless of the way we fill up the
> array, there will always be such a worst-case scenario that we can
> easily hit.
>
>> Creating inactive handler threads will create more overhead
>
> That is true, but as I said, that is something we need to test and
> measure.
>
>> and we
>> will need to modify the code that determines the number of threads to
>> always be the total number of installed cores in your round-robin
>> solution.
>
> The round-robin solution doesn't require the number of threads to be equal
> to the number of cores.  If they are equal, there is no "round-robin"
> component, as every thread/socket will be chosen only once, i.e. a 1:1 mapping.
>
> The point was that the round-robin solution is not worse than any other
> distribution of sockets between cores.  To be clear, round-robin:
>
>   c0 c1 c2 c3 c4 c5
>   s0 s1 s2 s0 s1 s2
>
> It's basically the same as what we have right now, but without warnings
> from the kernel.  This will not solve imbalance for the case in the
> commit message, but the solution proposed in this patch will not solve
> imbalance for some other cases and will make the imbalance worse for
> systems with a different core numbering scheme.
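>
> For illustration, a round-robin fill is just the following (a sketch with
> placeholder names, assuming n_cpus system CPUs and n_handlers sockets):
>
>     /* Sketch: round-robin assignment of handler sockets to all CPUs. */
>     for (size_t cpu = 0; cpu < n_cpus; cpu++) {
>         portids[cpu] = handler_pids[cpu % n_handlers];
>     }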
>
>
> On the other hand, the solution to always create threads by the total
> number of cores is the only solution that may solve the imbalance in
> a general case, since there are no corner cases where the same thread
> is servicing packets from more than one core.  But this solution will
> have a scheduling overhead in case many threads are packed onto a small
> number of cores; again, this overhead needs to be tested.
>
> For the code changes, you already have a function to count the total
> number of cores; it doesn't look like we'll need a lot of code
> modifications to change the number of created threads.
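>
> As a rough sketch of that change (count_total_cores() here is only a
> placeholder for whatever existing helper counts the physical cores):
>
>     /* Sketch: one handler thread, and one per-CPU Netlink socket, per
>      * physical core present on the system. */
>     size_t n_handlers = count_total_cores();
>
>     for (size_t cpu = 0; cpu < n_handlers; cpu++) {
>         portids[cpu] = handler_pids[cpu];  /* 1:1 mapping, no wrap-around. */
>     }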

This presupposes that the user is okay with OVS creating those extra
threads.  Maybe that's okay - I guess the only thing the user cares
about is that we don't run on cores outside the taskset.  Unfortunately
the per-cpu dispatch eliminated the n-handler-threads configurable,
making the assumption that the CPUs available to ovs-vswitchd are the
only ones on which we want handler threads active.  This might be an
incorrect assumption on our part.

I think it makes sense to do round-robin, and then we can also have a
configuration for running with n-handler-threads == n_cores (in which
case round-robin will produce a 1:1 mapping).  Does this sound right?  I
want to make sure to cover the concerns.

>> That code would also need to change so that the revalidators
>> stay proportional to the number of active cores.
>
> That's true, we don't want revalidators to compete for CPU resources.
>
>> I think it is more effort for
>> diminishing returns
>> 
>> 
>> 
>> With all that said, if you still think that 'best performance' is the
>> way to go we can absolutely go down that route. I can get started with
>> a V2 using your proposed round-robin solution as well as the
>> additional code change for the number of handlers+revalidators.
>> 
>> Just let me know if you prefer your round-robin solution. Or if you
>> would like to proceed with the schema proposed in this patch. Either
>> way, let me know your thoughts
>> 
>> Best,
>> Michael Santana
>> 
>> 
>> 
>>>
>>> My point is that any clever mapping schema will have cases where
>>> multiple cores are trying to wake up the same thread.  So, we could
>>> as well fill the array in a round-robin way without thinking too much.
>>>
>>> The only schema where waking up the same thread from different cores
>>> cannot happen is one where all the values in the array are
>>> different, i.e. the number of threads is equal to the total number of
>>> CPU cores.  But yes, that leads to the question about the efficiency
>>> of such a solution.
>>>
>>>>
>>>> Creating more handlers than necessary (one for every core) hurts
>>>> performance. We probably don't want to create handlers for cores that
>>>> are supposed to be inactive. Since those handlers have affinity for any
>>>> active core, they will just compete with the other handlers for active cores.
>>>
>>> Inactive threads (threads created for cores where packets are
>>> never received) will always sleep.  Sleeping threads add a slight
>>> kernel scheduling overhead, but it's negligible in our case, since
>>> the number of threads is fairly low.
>>>
>>> If you have 100 cores, but only 4 of them are available to OVS,
>>> that might be a problem, but only if we have a high upcall rate
>>> in kernel on more than 4 cores.  However, in that case, our poor
>>> 4 cores will likely not be able to handle these upcalls anyway
>>> and more cores will need to be made available to OVS.
>>>
>>> High concurrency between handler threads can also be mitigated in some
>>> cases with irq affinity for NICs, i.e. limiting incoming traffic to a
>>> lower number of CPU cores if a high volume of upcalls is expected.
>>> And there are no clear workarounds for cases where bad distribution is
>>> a result of a "clever" mapping, as this mapping is generally unknown
>>> to the end user.
>>>
>>> In any case, I think, concurrency overhead has to be tested.
>>> It's not obvious that it should be large.
>>>
>>>> What's
>>>> more, with an increased number of handlers we also have an
>>>> increased number of revalidators. This would add additional, unnecessary
>>>> load and contention.
>>>
>>> Is it necessary for the number of revalidators to depend on the number
>>> of handlers?  Can we still count them using the number of actually
>>> available CPU cores?
>>>
>>>>
>>>> Additionally, this is the slow path; we try to process 'quickly', but
>>>> cross-NUMA slowdown probably isn't as big of a deal here (because the
>>>> packet needs to go to userspace and get processed via the ofproto
>>>> pipeline anyway, which is a much slower operation overall).
>>>
>>> I don't think that cross-NUMA slowdown is a concern.  The kernel scheduler
>>> will try to migrate threads if possible anyway.  Dragging the same
>>> thread back and forth between two different NUMA nodes might
>>> be a larger problem.  But I agree, that it's likely not a big deal
>>> for a handler thread.
>>>
>>>>
>>>>>
>>>>> Best regards, Ilya Maximets.
>>>>>
>>>>
>>>
>> 
