On 12/10/19 4:00 PM, David Ahern wrote:
> [ adding Jason as author of the patch that added the epoll exclusive flag ]
> 
> On 12/10/19 12:37 PM, Matteo Croce wrote:
>> On Tue, Dec 10, 2019 at 8:13 PM David Ahern <dsah...@gmail.com> wrote:
>>>
>>> Hi Matteo:
>>>
>>> On a hypervisor running a 4.14.91 kernel and OVS 2.11 I am seeing a
>>> thundering herd wake up problem. Every packet punted to userspace wakes
>>> up every one of the handler threads. On a box with 96 cpus, there are 71
>>> handler threads which means 71 process wakeups for every packet punted.
>>>
>>> This is really easy to see, just watch sched:sched_wakeup tracepoints.
>>> With a few extra probes:
>>>
>>> perf probe sock_def_readable sk=%di
>>> perf probe ep_poll_callback wait=%di mode=%si sync=%dx key=%cx
>>> perf probe __wake_up_common wq_head=%di mode=%si nr_exclusive=%dx wake_flags=%cx key=%8
>>>
>>> you can see there is a single netlink socket and its wait queue contains
>>> an entry for every handler thread.
>>>
>>> This does not happen with the 2.7.3 version. Looking through the
>>> commits, it appears that the change in behavior comes from this commit:
>>>
>>> commit 69c51582ff786a68fc325c1c50624715482bc460
>>> Author: Matteo Croce <mcr...@redhat.com>
>>> Date:   Tue Sep 25 10:51:05 2018 +0200
>>>
>>>     dpif-netlink: don't allocate per thread netlink sockets
>>>
>>>
>>> Is this a known problem?
>>>
>>> David
>>>
>>
>> Hi David,
>>
>> before my patch, vswitchd created NxM sockets, where N is the number
>> of ports and M the number of active cores,
>> because every thread opened a netlink socket per port.
>>
>> With my patch, a pool is created with N sockets, one per port, and all
>> the threads poll the same list
>> with the EPOLLEXCLUSIVE flag.
>> As the name suggests, EPOLLEXCLUSIVE lets the kernel wake up only one
>> of the waiting threads.
>>
>> I'm not aware of this problem, but it goes against the intended
>> behaviour of EPOLLEXCLUSIVE.
>> The flag has existed since Linux 4.5; can you check that it's passed
>> correctly to epoll_ctl()?
>>
> 
> This is the commit that added the EXCLUSIVE flag:
> 
> commit df0108c5da561c66c333bb46bfe3c1fc65905898
> Author: Jason Baron <jba...@akamai.com>
> Date:   Wed Jan 20 14:59:24 2016 -0800
> 
>     epoll: add EPOLLEXCLUSIVE flag
> 
> 
> The commit message acknowledges that multiple threads can still be awakened:
> 
> "The implementation walks the list of exclusive waiters, and queues an
> event to each epfd, until it finds the first waiter that has threads
> blocked on it via epoll_wait().  The idea is to search for threads which
> are idle and ready to process the wakeup events.  Thus, we queue an
> event to at least 1 epfd, but may still potentially queue an event to
> all epfds that are attached to the shared fd source."
> 
> To me that means all idle handler threads are going to be awakened on
> each upcall message even though only 1 is needed to handle the message.
> 
> Jason: What was the rationale behind the exclusive flag that still wakes
> up more than 1 waiter? In the case of OVS and vswitchd I am seeing all N
> handler threads awakened on every single event which is a horrible
> scaling property.
> 

Hi David,

The idea is that we try to queue new work to 'idle' threads in an
attempt to distribute a workload. Thus, once we find an 'idle' thread we
stop waking up other threads. While we are searching the wakeup list for
idle threads, we do queue an epoll event to the non-idle threads, but
this doesn't mean they are woken up. It just means that when they go to
epoll_wait() to harvest events from the kernel, if the event is still
available it will be reported. If the condition for the event is no
longer true (because another thread consumed it), then the event
wouldn't be visible. So it's a way of load balancing a workload while
also reducing the number of wakeups. It's 'exclusive' in the sense that
it will stop after it finds the first idle thread.
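
To illustrate the harvesting behaviour described above, here is a
minimal sketch in C. It is not taken from the kernel or from OVS. It
assumes Linux 4.5 or later, uses an eventfd as a stand-in for the
netlink socket, and the helper names attach_exclusive() and
harvest_after_consume() are made up for this example. Nobody is blocked
in epoll_wait() when the event arrives, so the event is only queued, and
the second epoll instance looks for it after it has been consumed:

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1u << 28)   /* fallback for older libc headers */
#endif

/* Hypothetical helper: create an epoll instance and attach fd to it
 * with EPOLLIN | EPOLLEXCLUSIVE. Returns the epoll fd, or -1 on error. */
static int attach_exclusive(int fd)
{
    struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCLUSIVE };
    int epfd = epoll_create1(0);

    if (epfd < 0)
        return -1;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
        close(epfd);
        return -1;
    }
    return epfd;
}

/* Hypothetical helper: post one event, harvest it through the first
 * epoll instance, consume it, then poll the second instance.
 * Stores both epoll_wait() results; returns 0 on success, -1 on error. */
int harvest_after_consume(int *first, int *second)
{
    struct epoll_event out;
    uint64_t val = 1;
    int rc = -1;
    int efd = eventfd(0, EFD_NONBLOCK);
    int ep_a = efd < 0 ? -1 : attach_exclusive(efd);
    int ep_b = ep_a < 0 ? -1 : attach_exclusive(efd);

    /* No thread is blocked in epoll_wait(), so the write only queues. */
    if (ep_b >= 0 && write(efd, &val, sizeof(val)) == sizeof(val)) {
        /* Timeout 0: report what is ready without blocking. */
        *first = epoll_wait(ep_a, &out, 1, 0);
        /* Consume the event, as the first handler would. */
        if (read(efd, &val, sizeof(val)) == sizeof(val)) {
            /* The condition is no longer true for the second waiter. */
            *second = epoll_wait(ep_b, &out, 1, 0);
            rc = 0;
        }
    }

    if (ep_b >= 0)
        close(ep_b);
    if (ep_a >= 0)
        close(ep_a);
    if (efd >= 0)
        close(efd);
    return rc;
}
```

With level-triggered registration as above, I would expect *first to be
1 and *second to be 0: the second instance re-checks the source at
harvest time and finds nothing left to report.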

We certainly can employ other wakeup strategies - there was interest
(and patches) for a strict 'round robin' but that has not been merged
upstream.

I would like to better understand the current use case. It sounds like
each thread has an epoll file descriptor, and each epoll file descriptor
is attached to the same netlink socket. But when that netlink socket
gets a packet, it causes all the threads to wake up? Are you sure there
is just 1 netlink socket that all epoll file descriptors are attached
to?
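
One way to check how many waiters actually see a single event is a
small standalone test along these lines. It is only a sketch under
assumptions: forked processes stand in for the handler threads, an
eventfd stands in for the netlink socket, and count_waiters_woken() and
NUM_WAITERS are names invented here. The 100 ms delay is a crude way of
letting the waiters block first, so the count can vary with timing:

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1u << 28)   /* fallback for older libc headers */
#endif

#define NUM_WAITERS 4               /* hypothetical number of handlers */

/* Returns how many waiters got the single event, or -1 on failure. */
int count_waiters_woken(void)
{
    pid_t pids[NUM_WAITERS];
    uint64_t one = 1;
    int started = 0, woken = 0, wrote;
    int efd = eventfd(0, EFD_NONBLOCK);

    if (efd < 0)
        return -1;

    for (int i = 0; i < NUM_WAITERS; i++) {
        /* One epoll instance per waiter, all attached to the same fd. */
        struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCLUSIVE };
        int epfd = epoll_create1(0);
        pid_t pid;

        if (epfd < 0)
            break;
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev) < 0) {
            close(epfd);
            break;
        }
        pid = fork();
        if (pid == 0) {
            /* Waiter: block for up to 500 ms, consume the event if seen. */
            struct epoll_event out;
            uint64_t val;
            int n = epoll_wait(epfd, &out, 1, 500);

            if (n == 1 && read(efd, &val, sizeof(val)) < 0) {
                /* Another waiter consumed it first; nothing to do. */
            }
            _exit(n == 1 ? 1 : 0);
        }
        close(epfd);                /* the child keeps its own reference */
        if (pid < 0)
            break;
        pids[started++] = pid;
    }

    usleep(100 * 1000);             /* let the waiters reach epoll_wait() */
    wrote = write(efd, &one, sizeof(one)) == sizeof(one);

    /* Each waiter exits with status 1 if epoll_wait() reported the event. */
    for (int i = 0; i < started; i++) {
        int status;

        if (waitpid(pids[i], &status, 0) == pids[i] && WIFEXITED(status))
            woken += WEXITSTATUS(status);
    }
    close(efd);
    return (started == NUM_WAITERS && wrote) ? woken : -1;
}
```

If every waiter is already blocked when the event is posted, a result
of 1 would match the behaviour I described; a result equal to
NUM_WAITERS would look like the thundering herd you are reporting.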

Thanks,

-Jason


