On 5/5/22 05:52, Michael Santana wrote:
> On Mon, Apr 25, 2022 at 2:32 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>
>> On 4/19/22 05:19, Michael Santana wrote:
>>> On Tue, Apr 5, 2022 at 6:32 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>>>
>>>> On 4/4/22 14:36, Michael Santana wrote:
>>>>> The handler and CPU mapping in upcalls are incorrect, and this is
>>>>> especially noticeable on systems with CPU isolation enabled.
>>>>>
>>>>> Say we have a 12-core system where only every even-numbered CPU is enabled:
>>>>> C0, C2, C4, C6, C8, C10
>>>>>
>>>>> This means we will create an array of size 6 that will be sent to the
>>>>> kernel, populated with sockets [S0, S1, S2, S3, S4, S5]
>>>>>
>>>>> The problem is that when the kernel does an upcall it indexes the
>>>>> socket array by CPU id, effectively adding extra load on some handler
>>>>> sockets while leaving no work for others.
>>>>>
>>>>> e.g.
>>>>>
>>>>> C0  indexes to S0
>>>>> C2  indexes to S2 (should be S1)
>>>>> C4  indexes to S4 (should be S2)
>>>>>
>>>>> Modulo of 6 (size of socket array) is applied, so we wrap back to S0
>>>>> C6  indexes to S0 (should be S3)
>>>>> C8  indexes to S2 (should be S4)
>>>>> C10 indexes to S4 (should be S5)
>>>>>
>>>>> Effectively sockets S0, S2, S4 get overloaded while sockets S1, S3, S5
>>>>> get no work assigned to them
>>>>>
>>>>> This leads the kernel to emit the following message:
>>>>> "openvswitch: cpu_id mismatch with handler threads"
>>>>>
>>>>> To fix this we send the kernel a corrected array of sockets sized to
>>>>> the total number of CPUs in the system. In the above example we would
>>>>> create the corrected array as follows:
>>>>> [S0, S1, S1, S2, S2, S3, S3, S4, S4, S5, S5, S0]
>>>>>
>>>>> This guarantees that regardless of which CPU a packet comes in on, the
>>>>> kernel will map it to the correct socket.
>>>>
>>>> Hey, Michael.  Thanks for working on this issue!
>>>>
>>>> I agree that the array is too short and ovs-vswitchd has to supply
>>>> a full array with values for each physical core.
>>>> This can be even a separate patch with just filling in the array
>>>> for each physical core in a round-robin manner.
>>>>
>>>> However, I'm not sure it is possible to predict what the good
>>>> distribution of sockets among cores should look like.
>>>>
>>>> Taking your current implementation as an example, I'm assuming
>>>> that the suggested distribution (socket duplication for neighboring
>>>> CPU cores) can be good for a system with 2 NUMA nodes and cores
>>>> enumerated as odd - NUMA0, even - NUMA1.  But the enumeration
>>>> scheme of CPU cores is not set in stone; most multi-NUMA systems
>>>> have a BIOS configuration knob to change the enumeration method.
>>>> One of the most popular ones is: lower cores - NUMA0, higher cores
>>>> - NUMA1.  In this case, since interrupts from a NIC are likely
>>>> arriving on the cores from the same NUMA, sockets S1-S3 will
>>>> carry all the traffic, while sockets S4-S5 will remain unused.
>>>>
>>>> Another point is that we don't really know how many irq vectors
>>>> are registered for NICs and what is the irq affinity for them.
>>>> IOW, we don't know on which cores actual interrupts will be
>>>> handled and on which sockets they will end up.  IRQ affinity
>>>> and number of queues are also dynamic and can be changed over
>>>> time.
>>>>
>>>> So, AFAICT, regardless of the positioning of the sockets in the
>>>> array, there will be very good and very bad cases for it with
>>>> more or less same probability.
>>>>
>>>> It might be that the only option to guarantee more or less even
>>>> load distribution is to always create N handlers, where N
>>>> is the number of actual physical cores, and let the scheduler
>>>> and in-kernel IRQ affinity deal with load balancing.
>>>> Yes, we're adding an extra overhead in case of limited CPU
>>>> affinity for ovs-vswitchd, but OTOH if we have a high volume
>>>> of traffic received on a large number of CPU cores, while OVS is
>>>> limited to a single core, that sounds like a misconfiguration.
>>>> Some performance testing is required, for sure.
>>>>
>>>> Thoughts?
>>>
>>>
>>> Thanks for the feedback. There are a couple of points I would like to
>>> make regarding your suggestions, and a couple of clarifications as well
>>> (aaron++ thanks for helping me draft this)
>>>
>>> In this patch, when we pass the array of port ids to the kernel we
>>> fill the entire array with valid port ids, as such we create an array
>>> that is the size of the number of CPUs on the system.
>>>
>>> The question now becomes how to handle isolated cores. There
>>> shouldn't be any packet processing occurring on any isolated core
>>> since the cores are supposedly isolated. The way the current code
>>> handles isolated cores is stated in the original commit message.
>>> Moving forward we have two choices to deal with the isolated cores: 1.
>>> fill the array with handler portids to allow upcalls to pass, or 2. fill
>>> the array with '0' on the isolated cores to drop them.  We chose
>>> option 1 so that upcalls will continue to pass.
>>>
>>> We decided to go with 1 _just in case_; however, I do understand if
>>> option 2 is preferable. If you think option 2 is better then it's not
>>> a problem and we can move forward with it.
>>
>> Dropping packets is not an option, so we need to fill all the values
>> in the array with correct handler ids.
>>
>>>
>>>
>>> The alternatives you have suggested might not be good solutions either
>>> for the following reasons
>>>
>>> Handlers are not fixed to a single core. They have affinity for all
>>> active cores. They can float across all NUMA nodes anyway. They are
>>> not pinned to a _single_ core regardless of the isolation setup, e.g.
>>> [root@wsfd-netdev60 ~]# taskset -p 396826
>>> pid 396826's current affinity mask: 555555
>>
>> I haven't worked with the kernel scheduler for quite a bit now, but
>> IIRC, it tries to wake up threads on cores that share the same
>> LLC with the one where the initial kernel handling was performed, and
>> it will also generally balance the load by moving preempted threads
>> to available cores.  So, the actual cores on which threads are waking
>> up is not our concern.  We just need to make sure that different
>> cores are not trying to wake up the same thread simultaneously.
> Why do we care if two cores (I assume one inactive core and one active
> core) wake up the same thread?

Because "inactive" doesn't mean that we can't receive traffic on them,
see below.

> 
> I assume the thread will eventually deal with the upcall of both
> cores. Yes, performance will be worse than if you had two dedicated
> threads for these cores. But remember that in this hypothetical
> scenario one of the cores is supposed to be inactive. There isn't
> supposed to be any traffic going on this inactive core. We chose the
> schema in this patch because it makes sure each active core gets a
> distinct handler thread. This schema also has the nice side effect that,
> as a 'best effort', it deals with packets on inactive cores, where there
> shouldn't be any traffic, just in case there is. We
> don't know how many packets will go through inactive cores. I would
> imagine none.
> 
> But you are asking for something that's beyond 'best effort' to deal
> with upcalls on inactive cores. You are asking for 'best performance'
> to deal with upcalls on inactive cores. I don't think 'best
> performance' is necessary for inactive cores, because again there
> shouldn't be traffic on these inactive cores in the first place.

I think there is a misunderstanding.  "Inactive" cores can and will
receive traffic.  We're talking about cores not being available
to the ovs-vswitchd userspace process, but they can still be available
to the kernel and NIC IRQs.  For sure, there are cases where cores
are fully isolated and the kernel doesn't allow any work to be done on
these cores unless explicitly asked by a user.  But that's not a
common case.  From my experience, modern setups tend to dynamically
limit CPU resources for userspace processes with tuned or cgroups,
while the kernel is free to use cores as it wants.

So, let's look at your example (but with 6 cores):

  c0 c1 c2 c3 c4 c5

Let's say ovs-vswitchd has its CPU affinity limited to cores 0, 2 and 4.
With the current patch, the socket distribution will look like this:

  c0 c1 c2 c3 c4 c5
  s0 s1 s1 s2 s2 s0

Now, if the IRQ affinity for 2 Rx queues of some network card is set
to cores 1 and 2, the thread servicing socket s1 will have to handle
traffic received on both Rx queues.  It's also likely that handling
interrupts on a different core from the one where the handler thread runs
has some performance benefits, as upcalls should not really depend on
cache locality of the packet data (cache locality of userspace structures
is more important).

And that is the case where 2 cores are trying to wake up the same
userspace thread while other threads are not loaded.

And here is my point: regardless of the way we fill up the
array, there will always be such a worst-case scenario that we can
hit easily.
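
Just to make sure we're reading the patch the same way, here is a rough
sketch of the mapping it builds, as I understand it; fill_portid_map(),
cpu_available[] and handler_portid[] are made-up names for illustration,
not the actual ofproto-dpif-upcall code:

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  /* Illustrative only: available cores get distinct handler sockets in
   * order; isolated cores borrow the socket of the next available core,
   * wrapping around to the first handler at the end of the array. */
  static void
  fill_portid_map(const bool cpu_available[], size_t n_cpus,
                  const uint32_t handler_portid[], size_t n_handlers,
                  uint32_t portid_map[])
  {
      size_t h = 0;
      uint32_t next = handler_portid[0];

      /* Pass 1: available cores get distinct handler sockets, in order. */
      for (size_t c = 0; c < n_cpus; c++) {
          portid_map[c] = cpu_available[c]
                          ? handler_portid[h++ % n_handlers] : 0;
      }

      /* Pass 2: isolated cores take the socket of the next available
       * core, wrapping around to the first handler at the end. */
      for (size_t c = n_cpus; c-- > 0; ) {
          if (cpu_available[c]) {
              next = portid_map[c];
          } else {
              portid_map[c] = next;
          }
      }
  }

For the 6-core example above this gives s0 s1 s1 s2 s2 s0, and for the
12-core example in the commit message it gives
[S0, S1, S1, S2, S2, S3, S3, S4, S4, S5, S5, S0].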

> Creating inactive handler threads will create more overhead

That is true, but as I said, that is something we need to test and
measure.

> and we
> will need to modify the code that determines the number of threads to
> always be the total number of installed cores in your round-robin
> solution.

The round-robin solution doesn't require the number of threads to be equal
to the number of cores.  If they are equal, there is no "round-robin"
component, as every thread/socket will be chosen only once, i.e. a 1:1 mapping.

The point was that the round-robin solution is not worse than any other
distribution of sockets between cores.  To be clear, round-robin:

  c0 c1 c2 c3 c4 c5
  s0 s1 s2 s0 s1 s2

It's basically the same as what we have right now, but without warnings
from the kernel.  This will not solve the imbalance for the case in the
commit message, but the solution proposed in this patch will not solve
the imbalance for some other cases and will make the imbalance worse for
systems with a different core numbering scheme.
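
In code, the round-robin fill is just a modulo over the handler sockets
(same made-up names and headers as in the sketch earlier in this mail):

  /* Illustrative only: every core takes the next handler socket modulo
   * the number of handlers, so all sockets are referenced and every
   * core gets a valid socket. */
  static void
  fill_portid_map_rr(const uint32_t handler_portid[], size_t n_handlers,
                     uint32_t portid_map[], size_t n_cpus)
  {
      for (size_t c = 0; c < n_cpus; c++) {
          portid_map[c] = handler_portid[c % n_handlers];
      }
  }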


On the other hand, the solution to always create as many threads as the
total number of cores is the only one that may solve the imbalance in
the general case, since there are no corner cases where the same thread
is servicing packets from more than one core.  This solution will
have a scheduling overhead if many threads are packed onto a small
number of cores, but, again, this overhead needs to be tested.

For the code changes, you already have a function to count the total
number of cores, so it doesn't look like we'll need a lot of code
modifications to change the number of created threads.
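
I.e., something along these lines, assuming the core-counting helper from
your patch (I'm calling it count_total_cores() here; treat the exact name
as an assumption) and the same made-up portid_map[]/handler_portid[]
names as before:

  /* Illustrative only: one handler per physical core, so the per-CPU
   * socket array becomes a plain 1:1 mapping with no sharing at all. */
  size_t n_handlers = count_total_cores();

  for (size_t c = 0; c < n_handlers; c++) {
      portid_map[c] = handler_portid[c];
  }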

> That code would also need to change so that the revalidators
> stay proportional to the number of active cores.

That's true, we don't want revalidators to compete for CPU resources.

> I think it is more effort for
> diminishing returns
> 
> 
> 
> With all that said, if you still think that 'best performance' is the
> way to go we can absolutely go down that route. I can get started with
> a V2 using your proposed round-robin solution as well as the
> additional code change for the number of handlers+revalidators.
> 
> Just let me know if you prefer your round-robin solution. Or if you
> would like to proceed with the schema proposed in this patch. Either
> way, let me know your thoughts
> 
> Best,
> Michael Santana
> 
> 
> 
>>
>> My point is that any clever mapping schema will have cases where
>> multiple cores are trying to wake up the same thread.  So, we could
>> just as well fill the array in a round-robin way without thinking too much.
>>
>> The only schema where waking up the same thread by different cores
>> cannot happen is a schema where all the values in the array are
>> different, i.e. the number of threads is equal to the total number of
>> CPU cores.  But yes, that leads to the question about the efficiency
>> of such a solution.
>>
>>>
>>> More handlers than necessary (one for every core) hurts performance. We
>>> probably don't want to create handlers for cores that are supposed to
>>> be inactive. Since they have affinity for any active core,
>>> they will just compete with other handlers for the active cores.
>>
>> Inactive threads (threads created for cores where packets are
>> never received) will always sleep.  Sleeping threads will add a slight
>> kernel scheduling overhead, but it's negligible in our case, since
>> the number of threads is fairly low.
>>
>> If you have 100 cores, but only 4 of them are available to OVS,
>> that might be a problem, but only if we have a high upcall rate
>> in kernel on more than 4 cores.  However, in that case, our poor
>> 4 cores will likely not be able to handle these upcalls anyway
>> and more cores will need to be made available to OVS.
>>
>> High concurrency between handler threads can also be mitigated in some
>> cases with IRQ affinity for NICs, i.e. limiting incoming traffic to a lower
>> number of CPU cores if a high volume of upcalls is expected.
>> And there are no clear workarounds for cases where bad distribution is
>> a result of a "clever" mapping, as this mapping is generally unknown
>> to the end user.
>>
>> In any case, I think, concurrency overhead has to be tested.
>> It's not obvious that it should be large.
>>
>>> What's
>>> more is that with an increased number of handlers we also have an
>>> increased number of revalidators. This would add additional unnecessary
>>> load and contention
>>
>> Is it necessary for number of revalidators to depend on the number
>> of handlers?  Can we still count them using the number of actually
>> available CPU cores?
>>
>>>
>>> Additionally, this is the slowpath, we try to process 'quickly' but
>>> cross-NUMA slowdown probably isn't as big of a deal here (because the
>>> packet needs to go to userspace and get processed via the ofproto
>>> pipeline anyway, which is a much slower operation overall).
>>
>> I don't think that cross-NUMA slowdown is a concern.  Kernel scheduler
>> will try to migrate threads if possible anyway.  Dragging the same
>> thread back and forth between two different NUMA nodes might
>> be a larger problem.  But I agree, that it's likely not a big deal
>> for a handler thread.
>>
>>>
>>>>
>>>> Best regards, Ilya Maximets.
>>>>
>>>
>>
> 
