Re: [ovs-dev] thundering herd wakeup of handler threads
David Ahern writes:

> [...]
>
> This seems to have gone quiet for a few months now. In case my previous
> comments weren't clear, the change in question
> (69c51582ff786a68fc325c1c50624715482bc460) converted a control-plane
> scaling problem into a datapath scaling problem. It manifests as an
> increase in squeezed counts and dropped packets (both at the enqueue
> point and at hand-off to the vhost threads) because of the increased
> time spent on upcalls.
>
> I have reverted that commit and several on top of it for our purposes,
> but this really is something OVS devs should fix.

There is a patch set in the works that I think should help:

https://mail.openvswitch.org/pipermail/ovs-dev/2020-February/368139.html
https://mail.openvswitch.org/pipermail/ovs-dev/2020-February/368138.html

I will take a closer look at it.

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] thundering herd wakeup of handler threads
On 1/8/20 9:18 AM, Aaron Conole wrote:
> David Ahern writes:
>> On 12/16/19 2:42 PM, Aaron Conole wrote:
>>> Can you try the following and see if your scalability issue is
>>> addressed? I think it could be better integrated, but this is a
>>> different quick 'n dirty.
>>
>> your patch reduces the number of threads awakened, but it is still
>> really high - 43 out of 71 handler threads in one test.
>
> Thanks for getting back to me (sorry for the late response, catching
> up). I'll take a closer look. I haven't thought about adding
> EPOLLEXCLUSIVE to the poll block change, so maybe that will address it
> completely, or maybe there's a different approach.

This seems to have gone quiet for a few months now. In case my previous comments weren't clear, the change in question (69c51582ff786a68fc325c1c50624715482bc460) converted a control-plane scaling problem into a datapath scaling problem. It manifests as an increase in squeezed counts and dropped packets (both at the enqueue point and at hand-off to the vhost threads) because of the increased time spent on upcalls.

I have reverted that commit and several on top of it for our purposes, but this really is something OVS devs should fix.
Re: [ovs-dev] thundering herd wakeup of handler threads
David Ahern writes:

> On 12/16/19 2:42 PM, Aaron Conole wrote:
>> Can you try the following and see if your scalability issue is
>> addressed? I think it could be better integrated, but this is a
>> different quick 'n dirty.
>
> your patch reduces the number of threads awakened, but it is still
> really high - 43 out of 71 handler threads in one test.

Thanks for getting back to me (sorry for the late response, catching up). I'll take a closer look. I haven't thought about adding EPOLLEXCLUSIVE to the poll block change, so maybe that will address it completely, or maybe there's a different approach.
Re: [ovs-dev] thundering herd wakeup of handler threads
On 12/16/19 2:42 PM, Aaron Conole wrote:
> Can you try the following and see if your scalability issue is
> addressed? I think it could be better integrated, but this is a
> different quick 'n dirty.

your patch reduces the number of threads awakened, but it is still really high - 43 out of 71 handler threads in one test.
Re: [ovs-dev] thundering herd wakeup of handler threads
On 12/16/19 2:42 PM, Aaron Conole wrote:
> Can you try the following and see if your scalability issue is
> addressed? I think it could be better integrated, but this is a
> different quick 'n dirty.

I'll try to get to it before the end of the week.
Re: [ovs-dev] thundering herd wakeup of handler threads
On 12/13/19 1:52 PM, Aaron Conole wrote:
> [...]
>
> During ofproto upcall processing, if there aren't any upcalls to
> receive, a dpif_recv_wait() is called, which will call into
> dpif_netlink_recv_wait(), which will set up the next poll_block() call.
>
> This eventually uses the poll() interface (through the time_poll() call).
>
> Maybe we can confirm the thundering herd by just making a quick hack to
> shunt out the poll block (included). Just a quick 'n dirty hack to test
> with.

Thanks for the patch.

There is nothing special to be done on my end to see the problem and figure out solutions. Just run OVS 2.11, have traffic flows that cause upcalls, and then watch the scheduling tracepoints, e.g.:

sudo perf record -c1 -e sched:sched_switch,sched:sched_wakeup -a -m 8192 -- sleep 1
sudo perf sched timehist -w

That output shows, on a given CPU, the wakeup of every one of the handler threads. Record with -g to get stack traces and see that it is due to the upcall. If you still are not convinced, add a probe:

perf probe sock_def_readable sk=%di

and add -e probe:* to the record line. You will see 1 hit on sock_def_readable and all of the handler wakeups from there.
Re: [ovs-dev] thundering herd wakeup of handler threads
Jason Baron via dev writes:

> [...]
>
> That stack trace is interesting. I *think* it means you are doing
> select() or poll() on the epoll file descriptors? I say that because I
> see 'pollwake()' in the stack. Notice that pollwake() is only in
> fs/select.c.
>
> Thanks,
> -jason

During ofproto upcall processing, if there aren't any upcalls to receive, a dpif_recv_wait() is called, which will call into dpif_netlink_recv_wait(), which will set up the next poll_block() call.

This eventually uses the poll() interface (through the time_poll() call).

Maybe we can confirm the thundering herd by just making a quick hack to shunt out the poll block (included). Just a quick 'n dirty hack to test with.

NOTE: this change will trigger warnings about unused parameters, but it will prevent us from registering on the poll_loop for the thread. Keep in mind there will be high CPU utilization, since poll_block will never sleep (and this is a hack... so there's that).

diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
index d1f9b81db..804373c3a 100644
--- a/lib/dpif-netlink.c
+++ b/lib/dpif-netlink.c
@@ -2709,7 +2709,7 @@ dpif_netlink_recv_wait__(struct dpif_netlink *dpif, uint32_t handler_id)
     if (dpif->handlers && handler_id < dpif->n_handlers) {
         struct dpif_handler *handler = &dpif->handlers[handler_id];
-        poll_fd_wait(handler->epoll_fd, POLLIN);
+        poll_immediate_wake(); /* now we never sleep or poll the handler */
     }
 #endif
 }
Re: [ovs-dev] thundering herd wakeup of handler threads
On 12/10/19 5:20 PM, David Ahern wrote:
> [...]
>
> A probe on sock_def_readable shows it is a single socket that the wait
> queue is processed. Eventually ttwu_do_wakeup is invoked 71 times (once
> for each thread). In some cases I see the awakened threads immediately
> running on the target_cpu, while the queue walk continues to wake up all
> threads. Only the first one is going to handle the packet so the rest of
> the wakeups are just noise.
>
> On this system in just a 1-second interval I see this sequence play out
> 400+ times.

That stack trace is interesting. I *think* it means you are doing select() or poll() on the epoll file descriptors? I say that because I see 'pollwake()' in the stack. Notice that pollwake() is only in fs/select.c.

Thanks,
-jason
Re: [ovs-dev] thundering herd wakeup of handler threads
On 12/10/19 4:00 PM, David Ahern wrote:
> [ adding Jason as author of the patch that added the epoll exclusive flag ]
>
> On 12/10/19 12:37 PM, Matteo Croce wrote:
>> [...]
>>
>> before my patch, vswitchd created NxM sockets, being N the ports and M
>> the active cores, because every thread opens a netlink socket per port.
>>
>> With my patch, a pool is created with N sockets, one per port, and all
>> the threads poll the same list with the EPOLLEXCLUSIVE flag.
>> As the name suggests, EPOLLEXCLUSIVE lets the kernel wake up only one
>> of the waiting threads.
>>
>> I'm not aware of this problem, but it goes against the intended
>> behaviour of EPOLLEXCLUSIVE. Such flag exists since Linux 4.5, can you
>> check that it's passed correctly to epoll()?
>
> This is the commit that added the EXCLUSIVE flag:
>
> commit df0108c5da561c66c333bb46bfe3c1fc65905898
> Author: Jason Baron
> Date: Wed Jan 20 14:59:24 2016 -0800
>
>     epoll: add EPOLLEXCLUSIVE flag
>
> The commit message acknowledges that multiple threads can still be
> awakened:
>
> "The implementation walks the list of exclusive waiters, and queues an
> event to each epfd, until it finds the first waiter that has threads
> blocked on it via epoll_wait(). The idea is to search for threads which
> are idle and ready to process the wakeup events. Thus, we queue an
> event to at least 1 epfd, but may still potentially queue an event to
> all epfds that are attached to the shared fd source."
>
> To me that means all idle handler threads are going to be awakened on
> each upcall message even though only 1 is needed to handle the message.
>
> Jason: What was the rationale behind the exclusive flag that still wakes
> up more than 1 waiter? In the case of OVS and vswitchd I am seeing all N
> handler threads awakened on every single event, which is a horrible
> scaling property.

Hi David,

The idea is that we try to queue new work to 'idle' threads in an attempt to distribute a workload. Thus, once we find an 'idle' thread we stop waking up other threads. While we are searching the wakeup list for idle threads, we do queue an epoll event to the non-idle threads; this doesn't mean they are woken up. It just means that when they go to epoll_wait() to harvest events from the kernel, if the event is still available it will be reported. If the condition for the event is no longer true (because another thread consumed it), then the event wouldn't be visible. So it's a way of load balancing a workload while also reducing the number of wakeups. It's 'exclusive' in the sense that it will stop after it finds the first idle thread.

We certainly can employ other wakeup strategies - there was interest (and patches) for a strict 'round robin', but that has not been merged upstream.

I would like to better understand the current use case. It sounds like each thread has an epoll file descriptor, and each epoll file descriptor is attached to the same netlink socket. But when that netlink socket gets a packet it causes all the threads to wake up? Are you sure there is just 1 netlink socket that all epoll file descriptors are attached to?

Thanks,
-Jason
Re: [ovs-dev] thundering herd wakeup of handler threads
On 12/10/19 3:09 PM, Jason Baron wrote:
> [...]
>
> I would like to better understand the current usecase. It sounds like
> each thread has an epoll file descriptor. And each epoll file descriptor
> is attached to the same netlink socket. But when that netlink socket
> gets a packet it causes all the threads to wakeup? Are you sure there is
> just 1 netlink socket that all epoll file descriptors are attached to?

Thanks for the response.

This is the code in question:

https://github.com/openvswitch/ovs/blob/branch-2.11/lib/dpif-netlink.c#L492

Yes, prior to finding the above code reference I had traced it to a single socket with all handler threads (71 threads on this 96-cpu box) on the wait queue.

The ovs kernel module is punting a packet to userspace. It generates a netlink message and invokes netlink_unicast. This is the stack trace:

    ad09cc02 ttwu_do_wakeup+0x92 ([kernel.kallsyms])
    ad09d945 try_to_wake_up+0x1d5 ([kernel.kallsyms])
    ad257275 pollwake+0x75 ([kernel.kallsyms])
    ad0b58a4 __wake_up_common+0x74 ([kernel.kallsyms])
    ad0b59cc __wake_up_common_lock+0x7c ([kernel.kallsyms])
    ad289ecc ep_poll_wakeup_proc+0x1c ([kernel.kallsyms])
    ad28a4bc ep_call_nested.constprop.18+0xbc ([kernel.kallsyms])
    ad28b0f2 ep_poll_callback+0x172 ([kernel.kallsyms])
    ad0b58a4 __wake_up_common+0x74 ([kernel.kallsyms])
    ad0b59cc __wake_up_common_lock+0x7c ([kernel.kallsyms])
    ad794af9 sock_def_readable+0x39 ([kernel.kallsyms])
    ad7e846e __netlink_sendskb+0x3e ([kernel.kallsyms])
    ad7eb11a netlink_unicast+0x20a ([kernel.kallsyms])
    c07abd44 queue_userspace_packet+0x2d4 ([kernel.kallsyms])
    c07ac330 ovs_dp_upcall+0x50 ([kernel.kallsyms])

A probe on sock_def_readable shows it is a single socket whose wait queue is processed. Eventually ttwu_do_wakeup is invoked 71 times (once for each thread). In some cases I see the awakened threads immediately running on the target_cpu while the queue walk continues to wake up all threads. Only the first one is going to handle the packet, so the rest of the wakeups are just noise.

On this system, in just a 1-second interval I see this sequence play out 400+ times.
Re: [ovs-dev] thundering herd wakeup of handler threads
On 12/10/19 2:20 PM, Matteo Croce wrote:
> Before this patch (which unfortunately is needed to avoid -EMFILE
> errors with many ports), how many sockets are awakened when an ARP is
> received?

On systems using 2.7.3 I see only a single handler thread awakened on upcalls.

Yes, I saw the commit message on the change. At the moment, the socket-descriptor scaling problem has morphed into a thundering-herd scheduling problem, with the added overhead of waking up all of the threads in the packet path.
Re: [ovs-dev] thundering herd wakeup of handler threads
On Tue, Dec 10, 2019 at 10:00 PM David Ahern wrote: > > [ adding Jason as author of the patch that added the epoll exclusive flag ] > > On 12/10/19 12:37 PM, Matteo Croce wrote: > > On Tue, Dec 10, 2019 at 8:13 PM David Ahern wrote: > >> > >> Hi Matteo: > >> > >> On a hypervisor running a 4.14.91 kernel and OVS 2.11 I am seeing a > >> thundering herd wake up problem. Every packet punted to userspace wakes > >> up every one of the handler threads. On a box with 96 cpus, there are 71 > >> handler threads which means 71 process wakeups for every packet punted. > >> > >> This is really easy to see, just watch sched:sched_wakeup tracepoints. > >> With a few extra probes: > >> > >> perf probe sock_def_readable sk=%di > >> perf probe ep_poll_callback wait=%di mode=%si sync=%dx key=%cx > >> perf probe __wake_up_common wq_head=%di mode=%si nr_exclusive=%dx > >> wake_flags=%cx key=%8 > >> > >> you can see there is a single netlink socket and its wait queue contains > >> an entry for every handler thread. > >> > >> This does not happen with the 2.7.3 version. Roaming commits it appears > >> that the change in behavior comes from this commit: > >> > >> commit 69c51582ff786a68fc325c1c50624715482bc460 > >> Author: Matteo Croce > >> Date: Tue Sep 25 10:51:05 2018 +0200 > >> > >> dpif-netlink: don't allocate per thread netlink sockets > >> > >> > >> Is this a known problem? > >> > >> David > >> > > > > Hi David, > > > > before my patch, vswitchd created NxM sockets, being N the ports and M > > the active cores, > > because every thread opens a netlink socket per port. > > > > With my patch, a pool is created with N socket, one per port, and all > > the threads polls the same list > > with the EPOLLEXCLUSIVE flag. > > As the name suggests, EPOLLEXCLUSIVE lets the kernel wakeup only one > > of the waiting threads. > > > > I'm not aware of this problem, but it goes against the intended > > behaviour of EPOLLEXCLUSIVE. 
> > The flag has existed since Linux 4.5; can you check that it's passed > > correctly to epoll()? > > > > This is the commit that added the EXCLUSIVE flag: > > commit df0108c5da561c66c333bb46bfe3c1fc65905898 > Author: Jason Baron > Date: Wed Jan 20 14:59:24 2016 -0800 > > epoll: add EPOLLEXCLUSIVE flag > > > The commit message acknowledges that multiple threads can still be awakened: > > "The implementation walks the list of exclusive waiters, and queues an > event to each epfd, until it finds the first waiter that has threads > blocked on it via epoll_wait(). The idea is to search for threads which > are idle and ready to process the wakeup events. Thus, we queue an > event to at least 1 epfd, but may still potentially queue an event to > all epfds that are attached to the shared fd source." > > To me that means all idle handler threads are going to be awakened on > each upcall message even though only 1 is needed to handle the message. > > Jason: What was the rationale behind the exclusive flag that still wakes > up more than 1 waiter? In the case of OVS and vswitchd I am seeing all N > handler threads awakened on every single event, which is a horrible > scaling property. > Actually, I didn't look at that commit message, but I read the epoll_ctl manpage, which says: "When a wakeup event occurs and multiple epoll file descriptors are attached to the same target file using EPOLLEXCLUSIVE, one or more of the epoll file descriptors will receive an event with epoll_wait(2). The default in this scenario (when EPOLLEXCLUSIVE is not set) is for all epoll file descriptors to receive an event. EPOLLEXCLUSIVE is thus useful for avoiding thundering herd problems in certain scenarios." I'd expect "one or more" to probably be greater than 1, but still much lower than all. Before this patch (which unfortunately is needed to avoid -EMFILE errors with many ports), how many sockets are awakened when an ARP is received? 
Regards, -- Matteo Croce per aspera ad upstream
Re: [ovs-dev] thundering herd wakeup of handler threads
[ adding Jason as author of the patch that added the epoll exclusive flag ] On 12/10/19 12:37 PM, Matteo Croce wrote: > On Tue, Dec 10, 2019 at 8:13 PM David Ahern wrote: >> >> Hi Matteo: >> >> On a hypervisor running a 4.14.91 kernel and OVS 2.11 I am seeing a >> thundering herd wake up problem. Every packet punted to userspace wakes >> up every one of the handler threads. On a box with 96 CPUs, there are 71 >> handler threads, which means 71 process wakeups for every packet punted. >> >> This is really easy to see, just watch sched:sched_wakeup tracepoints. >> With a few extra probes: >> >> perf probe sock_def_readable sk=%di >> perf probe ep_poll_callback wait=%di mode=%si sync=%dx key=%cx >> perf probe __wake_up_common wq_head=%di mode=%si nr_exclusive=%dx >> wake_flags=%cx key=%8 >> >> you can see there is a single netlink socket and its wait queue contains >> an entry for every handler thread. >> >> This does not happen with the 2.7.3 version. Roaming commits, it appears >> that the change in behavior comes from this commit: >> >> commit 69c51582ff786a68fc325c1c50624715482bc460 >> Author: Matteo Croce >> Date: Tue Sep 25 10:51:05 2018 +0200 >> >> dpif-netlink: don't allocate per thread netlink sockets >> >> >> Is this a known problem? >> >> David >> > > Hi David, > > Before my patch, vswitchd created NxM sockets, where N is the number of > ports and M the number of active cores, > because every thread opens a netlink socket per port. > > With my patch, a pool is created with N sockets, one per port, and all > the threads poll the same list > with the EPOLLEXCLUSIVE flag. > As the name suggests, EPOLLEXCLUSIVE lets the kernel wake up only one > of the waiting threads. > > I'm not aware of this problem, but it goes against the intended > behaviour of EPOLLEXCLUSIVE. > The flag has existed since Linux 4.5; can you check that it's passed > correctly to epoll()? 
> This is the commit that added the EXCLUSIVE flag: commit df0108c5da561c66c333bb46bfe3c1fc65905898 Author: Jason Baron Date: Wed Jan 20 14:59:24 2016 -0800 epoll: add EPOLLEXCLUSIVE flag The commit message acknowledges that multiple threads can still be awakened: "The implementation walks the list of exclusive waiters, and queues an event to each epfd, until it finds the first waiter that has threads blocked on it via epoll_wait(). The idea is to search for threads which are idle and ready to process the wakeup events. Thus, we queue an event to at least 1 epfd, but may still potentially queue an event to all epfds that are attached to the shared fd source." To me that means all idle handler threads are going to be awakened on each upcall message even though only 1 is needed to handle the message. Jason: What was the rationale behind the exclusive flag that still wakes up more than 1 waiter? In the case of OVS and vswitchd I am seeing all N handler threads awakened on every single event, which is a horrible scaling property.
Re: [ovs-dev] thundering herd wakeup of handler threads
On Tue, Dec 10, 2019 at 8:41 PM David Ahern wrote: > > On 12/10/19 12:37 PM, Matteo Croce wrote: > > On Tue, Dec 10, 2019 at 8:13 PM David Ahern wrote: > >> > >> Hi Matteo: > >> > >> On a hypervisor running a 4.14.91 kernel and OVS 2.11 I am seeing a > >> thundering herd wake up problem. Every packet punted to userspace wakes > >> up every one of the handler threads. On a box with 96 CPUs, there are 71 > >> handler threads, which means 71 process wakeups for every packet punted. > >> > >> This is really easy to see, just watch sched:sched_wakeup tracepoints. > >> With a few extra probes: > >> > >> perf probe sock_def_readable sk=%di > >> perf probe ep_poll_callback wait=%di mode=%si sync=%dx key=%cx > >> perf probe __wake_up_common wq_head=%di mode=%si nr_exclusive=%dx > >> wake_flags=%cx key=%8 > >> > >> you can see there is a single netlink socket and its wait queue contains > >> an entry for every handler thread. > >> > >> This does not happen with the 2.7.3 version. Roaming commits, it appears > >> that the change in behavior comes from this commit: > >> > >> commit 69c51582ff786a68fc325c1c50624715482bc460 > >> Author: Matteo Croce > >> Date: Tue Sep 25 10:51:05 2018 +0200 > >> > >> dpif-netlink: don't allocate per thread netlink sockets > >> > >> > >> Is this a known problem? > >> > >> David > >> > > > > Hi David, > > > > Before my patch, vswitchd created NxM sockets, where N is the number of > > ports and M the number of active cores, > > because every thread opens a netlink socket per port. > > > > With my patch, a pool is created with N sockets, one per port, and all > > the threads poll the same list > > with the EPOLLEXCLUSIVE flag. > > As the name suggests, EPOLLEXCLUSIVE lets the kernel wake up only one > > of the waiting threads. > > > > I'm not aware of this problem, but it goes against the intended > > behaviour of EPOLLEXCLUSIVE. > > The flag has existed since Linux 4.5; can you check that it's passed > > correctly to epoll()? 
> > > > I get the theory, but the reality is that all threads are awakened. > Also, it is not limited to the 4.14 kernel; I see the same behavior with > 5.4. > So all threads are awakened, even if there is only a single upcall packet to read? This is not good; epoll() should wake just one thread at a time, as the manpage says. I have to check this; do you have a minimal setup? How do you generate upcalls? Is it BUM traffic, or via action=userspace? Bye, -- Matteo Croce per aspera ad upstream
Re: [ovs-dev] thundering herd wakeup of handler threads
On 12/10/19 12:37 PM, Matteo Croce wrote: > On Tue, Dec 10, 2019 at 8:13 PM David Ahern wrote: >> >> Hi Matteo: >> >> On a hypervisor running a 4.14.91 kernel and OVS 2.11 I am seeing a >> thundering herd wake up problem. Every packet punted to userspace wakes >> up every one of the handler threads. On a box with 96 CPUs, there are 71 >> handler threads, which means 71 process wakeups for every packet punted. >> >> This is really easy to see, just watch sched:sched_wakeup tracepoints. >> With a few extra probes: >> >> perf probe sock_def_readable sk=%di >> perf probe ep_poll_callback wait=%di mode=%si sync=%dx key=%cx >> perf probe __wake_up_common wq_head=%di mode=%si nr_exclusive=%dx >> wake_flags=%cx key=%8 >> >> you can see there is a single netlink socket and its wait queue contains >> an entry for every handler thread. >> >> This does not happen with the 2.7.3 version. Roaming commits, it appears >> that the change in behavior comes from this commit: >> >> commit 69c51582ff786a68fc325c1c50624715482bc460 >> Author: Matteo Croce >> Date: Tue Sep 25 10:51:05 2018 +0200 >> >> dpif-netlink: don't allocate per thread netlink sockets >> >> >> Is this a known problem? >> >> David >> > > Hi David, > > Before my patch, vswitchd created NxM sockets, where N is the number of > ports and M the number of active cores, > because every thread opens a netlink socket per port. > > With my patch, a pool is created with N sockets, one per port, and all > the threads poll the same list > with the EPOLLEXCLUSIVE flag. > As the name suggests, EPOLLEXCLUSIVE lets the kernel wake up only one > of the waiting threads. > > I'm not aware of this problem, but it goes against the intended > behaviour of EPOLLEXCLUSIVE. > The flag has existed since Linux 4.5; can you check that it's passed > correctly to epoll()? > I get the theory, but the reality is that all threads are awakened. Also, it is not limited to the 4.14 kernel; I see the same behavior with 5.4. 
Re: [ovs-dev] thundering herd wakeup of handler threads
On Tue, Dec 10, 2019 at 8:13 PM David Ahern wrote: > > Hi Matteo: > > On a hypervisor running a 4.14.91 kernel and OVS 2.11 I am seeing a > thundering herd wake up problem. Every packet punted to userspace wakes > up every one of the handler threads. On a box with 96 CPUs, there are 71 > handler threads, which means 71 process wakeups for every packet punted. > > This is really easy to see, just watch sched:sched_wakeup tracepoints. > With a few extra probes: > > perf probe sock_def_readable sk=%di > perf probe ep_poll_callback wait=%di mode=%si sync=%dx key=%cx > perf probe __wake_up_common wq_head=%di mode=%si nr_exclusive=%dx > wake_flags=%cx key=%8 > > you can see there is a single netlink socket and its wait queue contains > an entry for every handler thread. > > This does not happen with the 2.7.3 version. Roaming commits, it appears > that the change in behavior comes from this commit: > > commit 69c51582ff786a68fc325c1c50624715482bc460 > Author: Matteo Croce > Date: Tue Sep 25 10:51:05 2018 +0200 > > dpif-netlink: don't allocate per thread netlink sockets > > > Is this a known problem? > > David > Hi David, Before my patch, vswitchd created NxM sockets, where N is the number of ports and M the number of active cores, because every thread opens a netlink socket per port. With my patch, a pool is created with N sockets, one per port, and all the threads poll the same list with the EPOLLEXCLUSIVE flag. As the name suggests, EPOLLEXCLUSIVE lets the kernel wake up only one of the waiting threads. I'm not aware of this problem, but it goes against the intended behaviour of EPOLLEXCLUSIVE. The flag has existed since Linux 4.5; can you check that it's passed correctly to epoll()? Bye, -- Matteo Croce per aspera ad upstream
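[The NxM vs N socket accounting in Matteo's explanation can be sketched with back-of-the-envelope numbers. The 71 handler threads come from David's report; the port count below is a hypothetical figure, used only to show why NxM descriptors can hit -EMFILE while N does not.]

```python
# Socket counts before and after commit 69c51582ff78, per the
# description above: N ports, M handler threads.
ports = 5000          # hypothetical large deployment
handler_threads = 71  # from the reported 96-CPU box

old_sockets = ports * handler_threads  # one netlink socket per (port, thread)
new_sockets = ports                    # one socket per port, shared via EPOLLEXCLUSIVE

print(old_sockets)  # 355000, easily past a typical RLIMIT_NOFILE -> -EMFILE
print(new_sockets)  # 5000
```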