Re: [ovs-dev] thundering herd wakeup of handler threads

Aaron Conole Mon, 16 Dec 2019 13:43:41 -0800

David Ahern <dsah...@gmail.com> writes:

> On 12/13/19 1:52 PM, Aaron Conole wrote:
>> Jason Baron via dev <ovs-dev@openvswitch.org> writes:
>> 
>>> On 12/10/19 5:20 PM, David Ahern wrote:
>>>> On 12/10/19 3:09 PM, Jason Baron wrote:
>>>>> Hi David,
>>>>>
>>>>> The idea is that we try and queue new work to 'idle' threads in an
>>>>> attempt to distribute a workload. Thus, once we find an 'idle' thread we
>>>>> stop waking up other threads. While we are searching the wakeup list for
>>>>> idle threads, we do queue an epoll event to the non-idle threads, this
>>>>> doesn't mean they are woken up. It just means that when they go to
>>>>> epoll_wait() to harvest events from the kernel, if the event is still
>>>>> available it will be reported. If the condition for the event is no
>>>>> longer true (because another thread consumed it), they the event
>>>>> wouldn't be visible. So its a way of load balancing a workload while
>>>>> also reducing the number of wakeups. Its 'exclusive' in the sense that
>>>>> it will stop after it finds the first idle thread.
>>>>>
>>>>> We certainly can employ other wakeup strategies - there was interest
>>>>> (and patches) for a strict 'round robin' but that has not been merged
>>>>> upstream.
>>>>>
>>>>> I would like to better understand the current usecase. It sounds like
>>>>> each thread as an epoll file descriptor. And each epoll file descriptor
>>>>> is attached the name netlink socket. But when that netlink socket gets a
>>>>> packet it causes all the threads to wakeup? Are you sure there is just 1
>>>>> netlink socket that all epoll file desciptors are are attached to?
>>>>>
>>>>
>>>> Thanks for the response.
>>>>
>>>> This is the code in question:
>>>>
>>>> https://github.com/openvswitch/ovs/blob/branch-2.11/lib/dpif-netlink.c#L492
>>>>
>>>> Yes, prior to finding the above code reference I had traced it to a
>>>> single socket with all handler threads (71 threads on this 96 cpu box)
>>>> on the wait queue.
>>>>
>>>> The ovs kernel module is punting a packet to userspace. It generates a
>>>> netlink message and invokes netlink_unicast. This the stack trace:
>>>>
>>>>         ffffffffad09cc02 ttwu_do_wakeup+0x92 ([kernel.kallsyms])
>>>>         ffffffffad09d945 try_to_wake_up+0x1d5 ([kernel.kallsyms])
>>>>         ffffffffad257275 pollwake+0x75 ([kernel.kallsyms])
>>>>         ffffffffad0b58a4 __wake_up_common+0x74 ([kernel.kallsyms])
>>>>         ffffffffad0b59cc __wake_up_common_lock+0x7c ([kernel.kallsyms])
>>>>         ffffffffad289ecc ep_poll_wakeup_proc+0x1c ([kernel.kallsyms])
>>>>         ffffffffad28a4bc ep_call_nested.constprop.18+0xbc
>>>> ([kernel.kallsyms])
>>>>         ffffffffad28b0f2 ep_poll_callback+0x172 ([kernel.kallsyms])
>>>>         ffffffffad0b58a4 __wake_up_common+0x74 ([kernel.kallsyms])
>>>>         ffffffffad0b59cc __wake_up_common_lock+0x7c ([kernel.kallsyms])
>>>>         ffffffffad794af9 sock_def_readable+0x39 ([kernel.kallsyms])
>>>>         ffffffffad7e846e __netlink_sendskb+0x3e ([kernel.kallsyms])
>>>>         ffffffffad7eb11a netlink_unicast+0x20a ([kernel.kallsyms])
>>>>         ffffffffc07abd44 queue_userspace_packet+0x2d4 ([kernel.kallsyms])
>>>>         ffffffffc07ac330 ovs_dp_upcall+0x50 ([kernel.kallsyms])
>>>>
>>>>
>>>> A probe on sock_def_readable shows it is a single socket that the wait
>>>> queue is processed. Eventually ttwu_do_wakeup is invoked 71 times (once
>>>> for each thread). In some cases I see the awakened threads immediately
>>>> running on the target_cpu, while the queue walk continues to wake up all
>>>> threads. Only the first one is going to handle the packet so the rest of
>>>> the wakeups are just noise.
>>>>
>>>> On this system in just a 1-second interval I see this sequence play out
>>>> 400+ times.
>>>>
>>>
>>>
>>> That stack trace is interesting. I *think* it means you are doing
>>> select() or poll() on the epoll file descriptors? I say that because I
>>> see 'pollwake()' in the stack. Notice that pollwake() is only in
>>> fs/select.c.
>> 
>> During ofproto upcall processing, if there aren't any upcalls to
>> receive, a dpif_recv_wait() is called, which will call into
>> dpif_netlink_recv_wait() which will set up the next poll_block() call.
>> 
>> This eventually uses the poll() interface (through time_poll() call) .
>> 
>> Maybe we can confirm the thundering herd by just making a quick hack to
>> shunt out the poll block (included).  Just a quick 'n dirty hack to test
>> with.
>> 
>
> Thanks for the patch.
>
> There is nothing special to be done my end to see the problem and figure
> out solutions. Just run ovs 2.11, have traffic flows that cause upcalls,
> and then watch the scheduling tracepoints.
>
> e.g.,
> sudo perf record -c1 -e sched:sched_switch,sched:sched_wakeup -a -m 8192
> -- sleep 1
>
> sudo perf sched timehist -w
>
> That output shows on a given CPU the wakeup of every one of the handler
> threads. record with -g to get stack traces and see that it is due to
> the upcall. If you still are not convinced add a probe
>
> perf probe sock_def_readable sk=%di
>
> and add -e probe:* to the record line. you will see 1 hit on
> sock_def_readable and all of the handler wakeups from there.
>
> Note, this is a scalability problem - the number of wakeups are
> proportional to the number of handler threads which is proportional to
> the number of CPUs (by default). Combine that with 100's to 1000's of
> wakeups per second (depends on workload; I measured 400+ per sec on a
> production hypervisor).


Can you try the following and see if your scalability issue is
addressed?  I think it could be better integrated, but this is a
different quick 'n dirty.

diff --git a/lib/timeval.c b/lib/timeval.c
index 193c7bab1..6dcbcb46a 100644
--- a/lib/timeval.c
+++ b/lib/timeval.c
@@ -22,6 +22,7 @@
 #include <signal.h>
 #include <stdlib.h>
 #include <string.h>
+#include <sys/epoll.h>
 #include <sys/time.h>
 #include <sys/resource.h>
 #include <unistd.h>
@@ -39,6 +40,12 @@
 #include "util.h"
 #include "openvswitch/vlog.h"
 
+#ifdef __linux__
+#define LINUX 1
+#else
+#define LINUX 0
+#endif
+
 VLOG_DEFINE_THIS_MODULE(timeval);
 
 #if !defined(HAVE_CLOCK_GETTIME)
@@ -323,7 +330,53 @@ time_poll(struct pollfd *pollfds, int n_pollfds, HANDLE 
*handles OVS_UNUSED,
         }
 
 #ifndef _WIN32
-        retval = poll(pollfds, n_pollfds, time_left);
+        if (LINUX) {
+            struct epoll_event *events =
+                xzalloc(n_pollfds * sizeof(struct epoll_event));
+            size_t i;
+            int efd = epoll_create1(0);
+
+            for (i = 0; i < n_pollfds; ++i) {
+                struct epoll_event evt = {};
+
+#define CHECK_EVENT(e, p)                       \
+                if (pollfds[i].events & p) {     \
+                    evt.events |= e;    \
+                }
+
+                CHECK_EVENT(EPOLLIN, POLLIN);
+                CHECK_EVENT(EPOLLOUT, POLLOUT);
+                CHECK_EVENT(EPOLLERR, POLLERR);
+                CHECK_EVENT(EPOLLHUP, POLLHUP);
+                CHECK_EVENT(EPOLLPRI, POLLPRI);
+#undef CHECK_EVENT
+                pollfds[i].revents = 0;
+
+                epoll_ctl(efd, EPOLL_CTL_ADD, pollfds[i].fd,
+                          &evt);
+            }
+
+            retval = epoll_wait(efd, events, n_pollfds, time_left);
+
+            for (i = 0; i < n_pollfds; ++i) {
+#define CHECK_EVENT(e, p)                       \
+                if (events[i].events & e) {     \
+                    pollfds[i].revents |= p;    \
+                }
+
+                CHECK_EVENT(EPOLLIN, POLLIN);
+                CHECK_EVENT(EPOLLOUT, POLLOUT);
+                CHECK_EVENT(EPOLLERR, POLLERR);
+                CHECK_EVENT(EPOLLHUP, POLLHUP);
+                CHECK_EVENT(EPOLLPRI, POLLPRI);
+#undef CHECK_EVENT
+                epoll_ctl(efd, EPOLL_CTL_DEL, pollfds[i].fd, NULL);
+            }
+            free(events);
+            close(efd);
+        } else {
+            retval = poll(pollfds, n_pollfds, time_left);
+        }
         if (retval < 0) {
             retval = -errno;
         }

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] thundering herd wakeup of handler threads

Reply via email to