David Ahern <dsah...@gmail.com> writes:

> On 12/13/19 1:52 PM, Aaron Conole wrote:
>> Jason Baron via dev <ovs-dev@openvswitch.org> writes:
>>
>>> On 12/10/19 5:20 PM, David Ahern wrote:
>>>> On 12/10/19 3:09 PM, Jason Baron wrote:
>>>>> Hi David,
>>>>>
>>>>> The idea is that we try and queue new work to 'idle' threads in an attempt to distribute a workload. Thus, once we find an 'idle' thread we stop waking up other threads. While we are searching the wakeup list for idle threads, we do queue an epoll event to the non-idle threads; this doesn't mean they are woken up. It just means that when they go to epoll_wait() to harvest events from the kernel, if the event is still available it will be reported. If the condition for the event is no longer true (because another thread consumed it), then the event won't be visible. So it's a way of load balancing a workload while also reducing the number of wakeups. It's 'exclusive' in the sense that it will stop after it finds the first idle thread.
>>>>>
>>>>> We certainly can employ other wakeup strategies - there was interest (and patches) for a strict 'round robin', but that has not been merged upstream.
>>>>>
>>>>> I would like to better understand the current use case. It sounds like each thread has an epoll file descriptor, and each epoll file descriptor is attached to the same netlink socket. But when that netlink socket gets a packet it causes all the threads to wake up? Are you sure there is just 1 netlink socket that all epoll file descriptors are attached to?
>>>>>
>>>>
>>>> Thanks for the response.
>>>>
>>>> This is the code in question:
>>>>
>>>> https://github.com/openvswitch/ovs/blob/branch-2.11/lib/dpif-netlink.c#L492
>>>>
>>>> Yes, prior to finding the above code reference I had traced it to a single socket with all handler threads (71 threads on this 96-CPU box) on the wait queue.
>>>>
>>>> The ovs kernel module is punting a packet to userspace. It generates a netlink message and invokes netlink_unicast. This is the stack trace:
>>>>
>>>>     ffffffffad09cc02 ttwu_do_wakeup+0x92 ([kernel.kallsyms])
>>>>     ffffffffad09d945 try_to_wake_up+0x1d5 ([kernel.kallsyms])
>>>>     ffffffffad257275 pollwake+0x75 ([kernel.kallsyms])
>>>>     ffffffffad0b58a4 __wake_up_common+0x74 ([kernel.kallsyms])
>>>>     ffffffffad0b59cc __wake_up_common_lock+0x7c ([kernel.kallsyms])
>>>>     ffffffffad289ecc ep_poll_wakeup_proc+0x1c ([kernel.kallsyms])
>>>>     ffffffffad28a4bc ep_call_nested.constprop.18+0xbc ([kernel.kallsyms])
>>>>     ffffffffad28b0f2 ep_poll_callback+0x172 ([kernel.kallsyms])
>>>>     ffffffffad0b58a4 __wake_up_common+0x74 ([kernel.kallsyms])
>>>>     ffffffffad0b59cc __wake_up_common_lock+0x7c ([kernel.kallsyms])
>>>>     ffffffffad794af9 sock_def_readable+0x39 ([kernel.kallsyms])
>>>>     ffffffffad7e846e __netlink_sendskb+0x3e ([kernel.kallsyms])
>>>>     ffffffffad7eb11a netlink_unicast+0x20a ([kernel.kallsyms])
>>>>     ffffffffc07abd44 queue_userspace_packet+0x2d4 ([kernel.kallsyms])
>>>>     ffffffffc07ac330 ovs_dp_upcall+0x50 ([kernel.kallsyms])
>>>>
>>>> A probe on sock_def_readable shows it is a single socket whose wait queue is processed. Eventually ttwu_do_wakeup is invoked 71 times (once for each thread). In some cases I see the awakened threads immediately running on the target_cpu, while the queue walk continues to wake up all threads. Only the first one is going to handle the packet, so the rest of the wakeups are just noise.
>>>>
>>>> On this system, in just a 1-second interval, I see this sequence play out 400+ times.
>>>>
>>>
>>>
>>> That stack trace is interesting. I *think* it means you are doing select() or poll() on the epoll file descriptors? I say that because I see 'pollwake()' in the stack. Notice that pollwake() is only in fs/select.c.
>>
>> During ofproto upcall processing, if there aren't any upcalls to receive, dpif_recv_wait() is called, which will call into dpif_netlink_recv_wait(), which will set up the next poll_block() call.
>>
>> This eventually uses the poll() interface (through the time_poll() call).
>>
>> Maybe we can confirm the thundering herd by just making a quick hack to shunt out the poll block (included). Just a quick 'n dirty hack to test with.
>>
>
> Thanks for the patch.
>
> There is nothing special to be done on my end to see the problem and figure out solutions. Just run OVS 2.11, have traffic flows that cause upcalls, and then watch the scheduling tracepoints.
>
> e.g.,
> sudo perf record -c1 -e sched:sched_switch,sched:sched_wakeup -a -m 8192 -- sleep 1
>
> sudo perf sched timehist -w
>
> That output shows, on a given CPU, the wakeup of every one of the handler threads. Record with -g to get stack traces and see that it is due to the upcall. If you are still not convinced, add a probe:
>
> perf probe sock_def_readable sk=%di
>
> and add -e probe:* to the record line. You will see 1 hit on sock_def_readable and all of the handler wakeups from there.
>
> Note, this is a scalability problem - the number of wakeups is proportional to the number of handler threads, which is proportional to the number of CPUs (by default). Combine that with 100's to 1000's of wakeups per second (depends on workload; I measured 400+ per second on a production hypervisor).
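The exclusive wakeup described above only kicks in when the handler threads are actually sitting in epoll_wait(), though. Since the threads here end up blocking in poll() on the epoll fd (that's the pollwake() in your trace), every thread on the socket's wait queue gets woken. For reference, this is roughly the per-thread pattern the exclusive wakeup is meant for; just a sketch to illustrate the idea, not OVS code, and 'sock_fd' is a made-up name for the shared netlink socket:

/* Sketch only, not OVS code.  Each handler thread owns a private epoll
 * instance and registers the shared socket with EPOLLEXCLUSIVE (Linux >= 4.5),
 * so a readable event no longer has to wake every thread: the wakeup walk
 * stops once it finds a thread that is actually idle in epoll_wait(), and
 * threads it passed while they were busy just get the event queued for their
 * next epoll_wait() call. */
#include <errno.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

static void
handler_loop(int sock_fd)
{
    int epfd = epoll_create1(0);            /* one epoll instance per thread */
    struct epoll_event ev = { 0 };
    char buf[2048];

    ev.events = EPOLLIN | EPOLLEXCLUSIVE;
    ev.data.fd = sock_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, sock_fd, &ev);

    for (;;) {
        struct epoll_event got;

        /* Block in epoll_wait() itself, not in poll() on 'epfd'. */
        if (epoll_wait(epfd, &got, 1, -1) <= 0) {
            continue;
        }

        /* Another thread may already have consumed the packet that made the
         * socket readable; the non-blocking read then returns EAGAIN, which
         * is the "event no longer true" case described above. */
        ssize_t len = recv(got.data.fd, buf, sizeof buf, MSG_DONTWAIT);
        if (len < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            continue;
        }

        /* ... handle the upcall in 'buf' ... */
    }
}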
Can you try the following and see if your scalability issue is addressed? I think it could be better integrated, but this is another quick 'n dirty hack.

diff --git a/lib/timeval.c b/lib/timeval.c
index 193c7bab1..6dcbcb46a 100644
--- a/lib/timeval.c
+++ b/lib/timeval.c
@@ -22,6 +22,7 @@
 #include <signal.h>
 #include <stdlib.h>
 #include <string.h>
+#include <sys/epoll.h>
 #include <sys/time.h>
 #include <sys/resource.h>
 #include <unistd.h>
@@ -39,6 +40,12 @@
 #include "util.h"
 #include "openvswitch/vlog.h"
 
+#ifdef __linux__
+#define LINUX 1
+#else
+#define LINUX 0
+#endif
+
 VLOG_DEFINE_THIS_MODULE(timeval);
 
 #if !defined(HAVE_CLOCK_GETTIME)
@@ -323,7 +330,53 @@ time_poll(struct pollfd *pollfds, int n_pollfds, HANDLE *handles OVS_UNUSED,
         }
 
 #ifndef _WIN32
-        retval = poll(pollfds, n_pollfds, time_left);
+        if (LINUX) {
+            struct epoll_event *events =
+                xzalloc(n_pollfds * sizeof(struct epoll_event));
+            size_t i;
+            int efd = epoll_create1(0);
+
+            for (i = 0; i < n_pollfds; ++i) {
+                struct epoll_event evt = {};
+
+#define CHECK_EVENT(e, p) \
+                if (pollfds[i].events & p) { \
+                    evt.events |= e; \
+                }
+
+                CHECK_EVENT(EPOLLIN, POLLIN);
+                CHECK_EVENT(EPOLLOUT, POLLOUT);
+                CHECK_EVENT(EPOLLERR, POLLERR);
+                CHECK_EVENT(EPOLLHUP, POLLHUP);
+                CHECK_EVENT(EPOLLPRI, POLLPRI);
+#undef CHECK_EVENT
+                pollfds[i].revents = 0;
+
+                epoll_ctl(efd, EPOLL_CTL_ADD, pollfds[i].fd,
+                          &evt);
+            }
+
+            retval = epoll_wait(efd, events, n_pollfds, time_left);
+
+            for (i = 0; i < n_pollfds; ++i) {
+#define CHECK_EVENT(e, p) \
+                if (events[i].events & e) { \
+                    pollfds[i].revents |= p; \
+                }
+
+                CHECK_EVENT(EPOLLIN, POLLIN);
+                CHECK_EVENT(EPOLLOUT, POLLOUT);
+                CHECK_EVENT(EPOLLERR, POLLERR);
+                CHECK_EVENT(EPOLLHUP, POLLHUP);
+                CHECK_EVENT(EPOLLPRI, POLLPRI);
+#undef CHECK_EVENT
+                epoll_ctl(efd, EPOLL_CTL_DEL, pollfds[i].fd, NULL);
+            }
+            free(events);
+            close(efd);
+        } else {
+            retval = poll(pollfds, n_pollfds, time_left);
+        }
         if (retval < 0) {
             retval = -errno;
         }
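One caveat with a straight poll()-to-epoll translation like the above: epoll_wait() only fills in the first 'retval' entries of the events array, in ready order rather than pollfd order, so walking events[i] by pollfd index can attribute revents to the wrong descriptor once more than one fd is being polled. Stashing the pollfd index in epoll_data at registration time avoids that. Roughly (again illustrative only, not a drop-in replacement for the patch):

/* Illustrative sketch of a poll()-compatible wrapper over epoll that carries
 * the pollfd index in epoll_data, so ready events are mapped back to the
 * right pollfd slot.  Error handling is minimal and the function name is made
 * up; this is not part of the patch above. */
#include <poll.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <unistd.h>

static int
poll_via_epoll(struct pollfd *pollfds, int n_pollfds, int timeout_ms)
{
    struct epoll_event *events = calloc(n_pollfds, sizeof *events);
    int efd = epoll_create1(0);
    int n, i;

    for (i = 0; i < n_pollfds; i++) {
        struct epoll_event evt = { 0 };

        /* On Linux the EPOLLIN/OUT/ERR/HUP/PRI bits share values with their
         * POLL* counterparts, so the requested mask can be passed through;
         * translate bit by bit, as the patch does, to avoid relying on that. */
        evt.events = pollfds[i].events;
        evt.data.u32 = i;                   /* remember which pollfd this is */
        pollfds[i].revents = 0;
        epoll_ctl(efd, EPOLL_CTL_ADD, pollfds[i].fd, &evt);
    }

    n = epoll_wait(efd, events, n_pollfds, timeout_ms);

    /* Only the first 'n' entries are filled in, and they arrive in ready
     * order rather than pollfd order: use the stashed index to find the
     * right slot instead of assuming events[i] belongs to pollfds[i]. */
    for (i = 0; i < n; i++) {
        pollfds[events[i].data.u32].revents |= events[i].events;
    }

    close(efd);                             /* drops all registrations too */
    free(events);
    return n;
}

Duplicate fds in the pollfd array would also need care, since EPOLL_CTL_ADD fails with EEXIST for an fd that is already registered in the same epoll instance.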