On 12/13/19 1:52 PM, Aaron Conole wrote:
> Jason Baron via dev <ovs-dev@openvswitch.org> writes:
> 
>> On 12/10/19 5:20 PM, David Ahern wrote:
>>> On 12/10/19 3:09 PM, Jason Baron wrote:
>>>> Hi David,
>>>>
>>>> The idea is that we try and queue new work to 'idle' threads in an
>>>> attempt to distribute a workload. Thus, once we find an 'idle' thread we
>>>> stop waking up other threads. While we are searching the wakeup list for
>>>> idle threads, we do queue an epoll event to the non-idle threads; this
>>>> doesn't mean they are woken up. It just means that when they go to
>>>> epoll_wait() to harvest events from the kernel, if the event is still
>>>> available it will be reported. If the condition for the event is no
>>>> longer true (because another thread consumed it), then the event
>>>> wouldn't be visible. So it's a way of load balancing a workload while
>>>> also reducing the number of wakeups. It's 'exclusive' in the sense that
>>>> it will stop after it finds the first idle thread.
>>>>
>>>> We certainly can employ other wakeup strategies - there was interest
>>>> (and patches) for a strict 'round robin' but that has not been merged
>>>> upstream.
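
For reference, here is a minimal user-space sketch of the exclusive-wakeup
setup described above: each handler thread owns its own epoll instance and
registers the same shared fd with EPOLLEXCLUSIVE. This is purely
illustrative, not the OVS code; shared_fd stands in for whatever socket the
threads contend on.

/*
 * Illustrative only (not the OVS code): each handler thread creates its
 * own epoll instance and registers the same shared fd with
 * EPOLLEXCLUSIVE.  Per the description above, a wakeup then stops at the
 * first idle thread found; the other epoll instances only have the event
 * queued for their next epoll_wait().
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/epoll.h>

static int handler_thread_setup(int shared_fd)
{
    int epfd = epoll_create1(0);
    if (epfd < 0) {
        perror("epoll_create1");
        exit(EXIT_FAILURE);
    }

    struct epoll_event ev = {
        .events = EPOLLIN | EPOLLEXCLUSIVE,   /* exclusive wakeup */
        .data.fd = shared_fd,
    };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, shared_fd, &ev) < 0) {
        perror("epoll_ctl");
        exit(EXIT_FAILURE);
    }

    /* The thread would then block in epoll_wait(epfd, ...). */
    return epfd;
}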
>>>>
>>>> I would like to better understand the current use case. It sounds like
>>>> each thread has an epoll file descriptor. And each epoll file descriptor
>>>> is attached to the same netlink socket. But when that netlink socket gets a
>>>> packet it causes all the threads to wake up? Are you sure there is just 1
>>>> netlink socket that all epoll file descriptors are attached to?
>>>>
>>>
>>> Thanks for the response.
>>>
>>> This is the code in question:
>>>
>>> https://github.com/openvswitch/ovs/blob/branch-2.11/lib/dpif-netlink.c#L492
>>>
>>> Yes, prior to finding the above code reference I had traced it to a
>>> single socket with all handler threads (71 threads on this 96 cpu box)
>>> on the wait queue.
>>>
>>> The ovs kernel module is punting a packet to userspace. It generates a
>>> netlink message and invokes netlink_unicast. This is the stack trace:
>>>
>>>         ffffffffad09cc02 ttwu_do_wakeup+0x92 ([kernel.kallsyms])
>>>         ffffffffad09d945 try_to_wake_up+0x1d5 ([kernel.kallsyms])
>>>         ffffffffad257275 pollwake+0x75 ([kernel.kallsyms])
>>>         ffffffffad0b58a4 __wake_up_common+0x74 ([kernel.kallsyms])
>>>         ffffffffad0b59cc __wake_up_common_lock+0x7c ([kernel.kallsyms])
>>>         ffffffffad289ecc ep_poll_wakeup_proc+0x1c ([kernel.kallsyms])
>>>         ffffffffad28a4bc ep_call_nested.constprop.18+0xbc ([kernel.kallsyms])
>>>         ffffffffad28b0f2 ep_poll_callback+0x172 ([kernel.kallsyms])
>>>         ffffffffad0b58a4 __wake_up_common+0x74 ([kernel.kallsyms])
>>>         ffffffffad0b59cc __wake_up_common_lock+0x7c ([kernel.kallsyms])
>>>         ffffffffad794af9 sock_def_readable+0x39 ([kernel.kallsyms])
>>>         ffffffffad7e846e __netlink_sendskb+0x3e ([kernel.kallsyms])
>>>         ffffffffad7eb11a netlink_unicast+0x20a ([kernel.kallsyms])
>>>         ffffffffc07abd44 queue_userspace_packet+0x2d4 ([kernel.kallsyms])
>>>         ffffffffc07ac330 ovs_dp_upcall+0x50 ([kernel.kallsyms])
>>>
>>>
>>> A probe on sock_def_readable shows it is a single socket whose wait
>>> queue is processed. Eventually ttwu_do_wakeup is invoked 71 times (once
>>> for each thread). In some cases I see the awakened threads immediately
>>> running on the target_cpu, while the queue walk continues to wake up all
>>> threads. Only the first one is going to handle the packet so the rest of
>>> the wakeups are just noise.
>>>
>>> On this system in just a 1-second interval I see this sequence play out
>>> 400+ times.
>>>
>>
>>
>> That stack trace is interesting. I *think* it means you are doing
>> select() or poll() on the epoll file descriptors? I say that because I
>> see 'pollwake()' in the stack. Notice that pollwake() is only in
>> fs/select.c.
> 
> During ofproto upcall processing, if there aren't any upcalls to
> receive, a dpif_recv_wait() is called, which will call into
> dpif_netlink_recv_wait() which will set up the next poll_block() call.
> 
> This eventually uses the poll() interface (through the time_poll() call).
> 
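An epoll fd is itself pollable, so a thread blocking in poll() on its epoll
fd sits on that epoll fd's wait queue via fs/select.c, which is where the
pollwake() frames in the trace above come from. A rough sketch of that
pattern, illustrative only and not the actual poll_block()/time_poll() code:

/*
 * Illustrative only: block in poll() on the epoll fd itself, then harvest
 * events with epoll_wait().  When the watched socket becomes readable,
 * the wakeup reaches the poll() waiter through ep_poll_callback() ->
 * pollwake(), matching the stack trace above.
 */
#include <poll.h>
#include <sys/epoll.h>

static void wait_then_harvest(int epfd, int timeout_ms)
{
    struct pollfd pfd = {
        .fd = epfd,          /* the epoll fd, not the netlink socket */
        .events = POLLIN,
    };

    if (poll(&pfd, 1, timeout_ms) > 0 && (pfd.revents & POLLIN)) {
        struct epoll_event events[64];
        int n = epoll_wait(epfd, events, 64, 0);  /* non-blocking harvest */
        (void) n;  /* events[0..n) would be processed here */
    }
}
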
> Maybe we can confirm the thundering herd by just making a quick hack to
> shunt out the poll block (included).  Just a quick 'n dirty hack to test
> with.
> 

Thanks for the patch.

There is nothing special to be done on my end to see the problem and figure
out solutions. Just run ovs 2.11, have traffic flows that cause upcalls,
and then watch the scheduling tracepoints.

e.g.,
sudo perf record -c1 -e sched:sched_switch,sched:sched_wakeup -a -m 8192 \
    -- sleep 1

sudo perf sched timehist -w

That output shows, on a given CPU, the wakeup of every one of the handler
threads. Record with -g to get stack traces and see that it is due to
the upcall. If you are still not convinced, add a probe

perf probe sock_def_readable sk=%di

and add -e probe:* to the record line. You will see 1 hit on
sock_def_readable and all of the handler wakeups from there.
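
Putting those pieces together, the record line ends up looking something
like this (naming the probe explicitly rather than using the probe:*
wildcard):

sudo perf record -g -c1 -e sched:sched_switch,sched:sched_wakeup \
    -e probe:sock_def_readable -a -m 8192 -- sleep 1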

Note, this is a scalability problem - the number of wakeups is
proportional to the number of handler threads, which is proportional to
the number of CPUs (by default). Combine that with hundreds to thousands
of upcalls per second (depends on workload; I measured 400+ per second
on a production hypervisor).
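
For a rough sense of scale with the numbers above: 400+ upcalls per second,
each waking all 71 handler threads, is roughly 400 * 71 ≈ 28,400 thread
wakeups per second, of which only about 400 do useful work.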