Re: [ovs-dev] thundering herd wakeup of handler threads

2020-03-12 Thread Aaron Conole
David Ahern  writes:

> On 1/8/20 9:18 AM, Aaron Conole wrote:
>> David Ahern  writes:
>> 
>>> On 12/16/19 2:42 PM, Aaron Conole wrote:
 Can you try the following and see if your scalability issue is
 addressed?  I think it could be better integrated, but this is a
 different quick 'n dirty hack.
>>>
>>> your patch reduces the number of threads awakened, but it is still
>>> really high - 43 out of 71 handler threads in one test.
>> 
>> Thanks for getting back to me (sorry for the late response, catching
>> up).  I'll take a closer look.  I haven't thought about adding
>> EPOLLEXCLUSIVE to the poll block change, so maybe that will address it
>> completely, or maybe there's a different approach.
>> 
>
> This seems to have gone quiet for a few months now. In case my previous
> comments weren't clear, the change in question
> (69c51582ff786a68fc325c1c50624715482bc460) converted a control plane
> scaling problem into a datapath scaling problem. It manifests as an
> increase in squeezed counts and dropped packets (both at the enqueue
> point and at hand off to the vhost threads) because of the increased
> time spent on upcalls.
>
> I have reverted that commit and several on top of it for our purposes,
> but this really is something OVS devs should fix.

There is a patch set in the works that I think should help:

  https://mail.openvswitch.org/pipermail/ovs-dev/2020-February/368139.html
  https://mail.openvswitch.org/pipermail/ovs-dev/2020-February/368138.html

I will take a closer look at it.



Re: [ovs-dev] thundering herd wakeup of handler threads

2020-03-11 Thread David Ahern
On 1/8/20 9:18 AM, Aaron Conole wrote:
> David Ahern  writes:
> 
>> On 12/16/19 2:42 PM, Aaron Conole wrote:
>>> Can you try the following and see if your scalability issue is
>>> addressed?  I think it could be better integrated, but this is a
>>> different quick 'n dirty hack.
>>
>> your patch reduces the number of threads awakened, but it is still
>> really high - 43 out of 71 handler threads in one test.
> 
> Thanks for getting back to me (sorry for the late response, catching
> up).  I'll take a closer look.  I haven't thought about adding
> EPOLLEXCLUSIVE to the poll block change, so maybe that will address it
> completely, or maybe there's a different approach.
> 

This seems to have gone quiet for a few months now. In case my previous
comments weren't clear, the change in question
(69c51582ff786a68fc325c1c50624715482bc460) converted a control plane
scaling problem into a datapath scaling problem. It manifests as an
increase in squeezed counts and dropped packets (both at the enqueue
point and at hand off to the vhost threads) because of the increased
time spent on upcalls.

I have reverted that commit and several on top of it for our purposes,
but this really is something OVS devs should fix.


Re: [ovs-dev] thundering herd wakeup of handler threads

2020-01-08 Thread Aaron Conole
David Ahern  writes:

> On 12/16/19 2:42 PM, Aaron Conole wrote:
>> Can you try the following and see if your scalability issue is
>> addressed?  I think it could be better integrated, but this is a
>> different quick 'n dirty hack.
>
> your patch reduces the number of threads awakened, but it is still
> really high - 43 out of 71 handler threads in one test.

Thanks for getting back to me (sorry for the late response, catching
up).  I'll take a closer look.  I haven't thought about adding
EPOLLEXCLUSIVE to the poll block change, so maybe that will address it
completely, or maybe there's a different approach.



Re: [ovs-dev] thundering herd wakeup of handler threads

2019-12-20 Thread David Ahern
On 12/16/19 2:42 PM, Aaron Conole wrote:
> Can you try the following and see if your scalability issue is
> addressed?  I think it could be better integrated, but this is a
> different quick 'n dirty hack.

your patch reduces the number of threads awakened, but it is still
really high - 43 out of 71 handler threads in one test.


Re: [ovs-dev] thundering herd wakeup of handler threads

2019-12-16 Thread David Ahern
On 12/16/19 2:42 PM, Aaron Conole wrote:
> Can you try the following and see if your scalability issue is
> addressed?  I think it could be better integrated, but this is a
> different quick 'n dirty hack.

I'll get to it before the end of the week.


Re: [ovs-dev] thundering herd wakeup of handler threads

2019-12-16 Thread Aaron Conole
David Ahern  writes:

> On 12/13/19 1:52 PM, Aaron Conole wrote:
>> Jason Baron via dev  writes:
>> 
>>> On 12/10/19 5:20 PM, David Ahern wrote:
 On 12/10/19 3:09 PM, Jason Baron wrote:
> Hi David,
>
> The idea is that we try and queue new work to 'idle' threads in an
> attempt to distribute a workload. Thus, once we find an 'idle' thread we
> stop waking up other threads. While we are searching the wakeup list for
> idle threads, we do queue an epoll event to the non-idle threads, this
> doesn't mean they are woken up. It just means that when they go to
> epoll_wait() to harvest events from the kernel, if the event is still
> available it will be reported. If the condition for the event is no
> longer true (because another thread consumed it), then the event
> wouldn't be visible. So it's a way of load balancing a workload while
> also reducing the number of wakeups. It's 'exclusive' in the sense that
> it will stop after it finds the first idle thread.
>
> We certainly can employ other wakeup strategies - there was interest
> (and patches) for a strict 'round robin' but that has not been merged
> upstream.
>
> I would like to better understand the current use case. It sounds like
> each thread has an epoll file descriptor. And each epoll file descriptor
> is attached to the same netlink socket. But when that netlink socket gets a
> packet it causes all the threads to wake up? Are you sure there is just 1
> netlink socket that all epoll file descriptors are attached to?
>

 Thanks for the response.

 This is the code in question:

 https://github.com/openvswitch/ovs/blob/branch-2.11/lib/dpif-netlink.c#L492

 Yes, prior to finding the above code reference I had traced it to a
 single socket with all handler threads (71 threads on this 96 cpu box)
 on the wait queue.

 The ovs kernel module is punting a packet to userspace. It generates a
 netlink message and invokes netlink_unicast. This is the stack trace:

 ad09cc02 ttwu_do_wakeup+0x92 ([kernel.kallsyms])
 ad09d945 try_to_wake_up+0x1d5 ([kernel.kallsyms])
 ad257275 pollwake+0x75 ([kernel.kallsyms])
 ad0b58a4 __wake_up_common+0x74 ([kernel.kallsyms])
 ad0b59cc __wake_up_common_lock+0x7c ([kernel.kallsyms])
 ad289ecc ep_poll_wakeup_proc+0x1c ([kernel.kallsyms])
 ad28a4bc ep_call_nested.constprop.18+0xbc ([kernel.kallsyms])
 ad28b0f2 ep_poll_callback+0x172 ([kernel.kallsyms])
 ad0b58a4 __wake_up_common+0x74 ([kernel.kallsyms])
 ad0b59cc __wake_up_common_lock+0x7c ([kernel.kallsyms])
 ad794af9 sock_def_readable+0x39 ([kernel.kallsyms])
 ad7e846e __netlink_sendskb+0x3e ([kernel.kallsyms])
 ad7eb11a netlink_unicast+0x20a ([kernel.kallsyms])
 c07abd44 queue_userspace_packet+0x2d4 ([kernel.kallsyms])
 c07ac330 ovs_dp_upcall+0x50 ([kernel.kallsyms])


 A probe on sock_def_readable shows it is a single socket whose wait
 queue is processed. Eventually ttwu_do_wakeup is invoked 71 times (once
 for each thread). In some cases I see the awakened threads immediately
 running on the target_cpu, while the queue walk continues to wake up all
 threads. Only the first one is going to handle the packet so the rest of
 the wakeups are just noise.

 On this system in just a 1-second interval I see this sequence play out
 400+ times.

>>>
>>>
>>> That stack trace is interesting. I *think* it means you are doing
>>> select() or poll() on the epoll file descriptors? I say that because I
>>> see 'pollwake()' in the stack. Notice that pollwake() is only in
>>> fs/select.c.
>> 
>> During ofproto upcall processing, if there aren't any upcalls to
>> receive, a dpif_recv_wait() is called, which will call into
>> dpif_netlink_recv_wait() which will set up the next poll_block() call.
>> 
>> This eventually uses the poll() interface (through the time_poll() call).
>> 
>> Maybe we can confirm the thundering herd by just making a quick hack to
>> shunt out the poll block (included).  Just a quick 'n dirty hack to test
>> with.
>> 
>
> Thanks for the patch.
>
> There is nothing special to be done on my end to see the problem and figure
> out solutions. Just run ovs 2.11, have traffic flows that cause upcalls,
> and then watch the scheduling tracepoints.
>
> e.g.,
> sudo perf record -c1 -e sched:sched_switch,sched:sched_wakeup -a -m 8192
> -- sleep 1
>
> sudo perf sched timehist -w
>
> That output shows on a given CPU the wakeup of every one of the handler
> threads. record with -g to get stack traces and see that it is due to
> the upcall. If you still are not convinced add a probe
>
> perf probe sock_def_readable sk=%di

Re: [ovs-dev] thundering herd wakeup of handler threads

2019-12-13 Thread David Ahern
On 12/13/19 1:52 PM, Aaron Conole wrote:
> Jason Baron via dev  writes:
> 
>> On 12/10/19 5:20 PM, David Ahern wrote:
>>> On 12/10/19 3:09 PM, Jason Baron wrote:
 Hi David,

 The idea is that we try and queue new work to 'idle' threads in an
 attempt to distribute a workload. Thus, once we find an 'idle' thread we
 stop waking up other threads. While we are searching the wakeup list for
 idle threads, we do queue an epoll event to the non-idle threads, this
 doesn't mean they are woken up. It just means that when they go to
 epoll_wait() to harvest events from the kernel, if the event is still
 available it will be reported. If the condition for the event is no
 longer true (because another thread consumed it), then the event
 wouldn't be visible. So it's a way of load balancing a workload while
 also reducing the number of wakeups. It's 'exclusive' in the sense that
 it will stop after it finds the first idle thread.

 We certainly can employ other wakeup strategies - there was interest
 (and patches) for a strict 'round robin' but that has not been merged
 upstream.

 I would like to better understand the current use case. It sounds like
 each thread has an epoll file descriptor. And each epoll file descriptor
 is attached to the same netlink socket. But when that netlink socket gets a
 packet it causes all the threads to wake up? Are you sure there is just 1
 netlink socket that all epoll file descriptors are attached to?

>>>
>>> Thanks for the response.
>>>
>>> This is the code in question:
>>>
>>> https://github.com/openvswitch/ovs/blob/branch-2.11/lib/dpif-netlink.c#L492
>>>
>>> Yes, prior to finding the above code reference I had traced it to a
>>> single socket with all handler threads (71 threads on this 96 cpu box)
>>> on the wait queue.
>>>
>>> The ovs kernel module is punting a packet to userspace. It generates a
>>> netlink message and invokes netlink_unicast. This is the stack trace:
>>>
>>> ad09cc02 ttwu_do_wakeup+0x92 ([kernel.kallsyms])
>>> ad09d945 try_to_wake_up+0x1d5 ([kernel.kallsyms])
>>> ad257275 pollwake+0x75 ([kernel.kallsyms])
>>> ad0b58a4 __wake_up_common+0x74 ([kernel.kallsyms])
>>> ad0b59cc __wake_up_common_lock+0x7c ([kernel.kallsyms])
>>> ad289ecc ep_poll_wakeup_proc+0x1c ([kernel.kallsyms])
>>> ad28a4bc ep_call_nested.constprop.18+0xbc ([kernel.kallsyms])
>>> ad28b0f2 ep_poll_callback+0x172 ([kernel.kallsyms])
>>> ad0b58a4 __wake_up_common+0x74 ([kernel.kallsyms])
>>> ad0b59cc __wake_up_common_lock+0x7c ([kernel.kallsyms])
>>> ad794af9 sock_def_readable+0x39 ([kernel.kallsyms])
>>> ad7e846e __netlink_sendskb+0x3e ([kernel.kallsyms])
>>> ad7eb11a netlink_unicast+0x20a ([kernel.kallsyms])
>>> c07abd44 queue_userspace_packet+0x2d4 ([kernel.kallsyms])
>>> c07ac330 ovs_dp_upcall+0x50 ([kernel.kallsyms])
>>>
>>>
>>> A probe on sock_def_readable shows it is a single socket whose wait
>>> queue is processed. Eventually ttwu_do_wakeup is invoked 71 times (once
>>> for each thread). In some cases I see the awakened threads immediately
>>> running on the target_cpu, while the queue walk continues to wake up all
>>> threads. Only the first one is going to handle the packet so the rest of
>>> the wakeups are just noise.
>>>
>>> On this system in just a 1-second interval I see this sequence play out
>>> 400+ times.
>>>
>>
>>
>> That stack trace is interesting. I *think* it means you are doing
>> select() or poll() on the epoll file descriptors? I say that because I
>> see 'pollwake()' in the stack. Notice that pollwake() is only in
>> fs/select.c.
> 
> During ofproto upcall processing, if there aren't any upcalls to
> receive, a dpif_recv_wait() is called, which will call into
> dpif_netlink_recv_wait() which will set up the next poll_block() call.
> 
> This eventually uses the poll() interface (through the time_poll() call).
> 
> Maybe we can confirm the thundering herd by just making a quick hack to
> shunt out the poll block (included).  Just a quick 'n dirty hack to test
> with.
> 

Thanks for the patch.

There is nothing special to be done on my end to see the problem and figure
out solutions. Just run ovs 2.11, have traffic flows that cause upcalls,
and then watch the scheduling tracepoints.

e.g.,
sudo perf record -c1 -e sched:sched_switch,sched:sched_wakeup -a -m 8192
-- sleep 1

sudo perf sched timehist -w

That output shows on a given CPU the wakeup of every one of the handler
threads. record with -g to get stack traces and see that it is due to
the upcall. If you still are not convinced add a probe

perf probe sock_def_readable sk=%di

and add -e probe:* to the record line. You will see 1 hit on
sock_def_readable and all of the handler wakeups from there.

Re: [ovs-dev] thundering herd wakeup of handler threads

2019-12-13 Thread Aaron Conole
Jason Baron via dev  writes:

> On 12/10/19 5:20 PM, David Ahern wrote:
>> On 12/10/19 3:09 PM, Jason Baron wrote:
>>> Hi David,
>>>
>>> The idea is that we try and queue new work to 'idle' threads in an
>>> attempt to distribute a workload. Thus, once we find an 'idle' thread we
>>> stop waking up other threads. While we are searching the wakeup list for
>>> idle threads, we do queue an epoll event to the non-idle threads, this
>>> doesn't mean they are woken up. It just means that when they go to
>>> epoll_wait() to harvest events from the kernel, if the event is still
>>> available it will be reported. If the condition for the event is no
>>> longer true (because another thread consumed it), then the event
>>> wouldn't be visible. So it's a way of load balancing a workload while
>>> also reducing the number of wakeups. It's 'exclusive' in the sense that
>>> it will stop after it finds the first idle thread.
>>>
>>> We certainly can employ other wakeup strategies - there was interest
>>> (and patches) for a strict 'round robin' but that has not been merged
>>> upstream.
>>>
>>> I would like to better understand the current use case. It sounds like
>>> each thread has an epoll file descriptor. And each epoll file descriptor
>>> is attached to the same netlink socket. But when that netlink socket gets a
>>> packet it causes all the threads to wake up? Are you sure there is just 1
>>> netlink socket that all epoll file descriptors are attached to?
>>>
>> 
>> Thanks for the response.
>> 
>> This is the code in question:
>> 
>> https://github.com/openvswitch/ovs/blob/branch-2.11/lib/dpif-netlink.c#L492
>> 
>> Yes, prior to finding the above code reference I had traced it to a
>> single socket with all handler threads (71 threads on this 96 cpu box)
>> on the wait queue.
>> 
>> The ovs kernel module is punting a packet to userspace. It generates a
>> netlink message and invokes netlink_unicast. This is the stack trace:
>> 
>> ad09cc02 ttwu_do_wakeup+0x92 ([kernel.kallsyms])
>> ad09d945 try_to_wake_up+0x1d5 ([kernel.kallsyms])
>> ad257275 pollwake+0x75 ([kernel.kallsyms])
>> ad0b58a4 __wake_up_common+0x74 ([kernel.kallsyms])
>> ad0b59cc __wake_up_common_lock+0x7c ([kernel.kallsyms])
>> ad289ecc ep_poll_wakeup_proc+0x1c ([kernel.kallsyms])
>> ad28a4bc ep_call_nested.constprop.18+0xbc ([kernel.kallsyms])
>> ad28b0f2 ep_poll_callback+0x172 ([kernel.kallsyms])
>> ad0b58a4 __wake_up_common+0x74 ([kernel.kallsyms])
>> ad0b59cc __wake_up_common_lock+0x7c ([kernel.kallsyms])
>> ad794af9 sock_def_readable+0x39 ([kernel.kallsyms])
>> ad7e846e __netlink_sendskb+0x3e ([kernel.kallsyms])
>> ad7eb11a netlink_unicast+0x20a ([kernel.kallsyms])
>> c07abd44 queue_userspace_packet+0x2d4 ([kernel.kallsyms])
>> c07ac330 ovs_dp_upcall+0x50 ([kernel.kallsyms])
>> 
>> 
>> A probe on sock_def_readable shows it is a single socket whose wait
>> queue is processed. Eventually ttwu_do_wakeup is invoked 71 times (once
>> for each thread). In some cases I see the awakened threads immediately
>> running on the target_cpu, while the queue walk continues to wake up all
>> threads. Only the first one is going to handle the packet so the rest of
>> the wakeups are just noise.
>> 
>> On this system in just a 1-second interval I see this sequence play out
>> 400+ times.
>> 
>
>
> That stack trace is interesting. I *think* it means you are doing
> select() or poll() on the epoll file descriptors? I say that because I
> see 'pollwake()' in the stack. Notice that pollwake() is only in
> fs/select.c.

During ofproto upcall processing, if there aren't any upcalls to
receive, a dpif_recv_wait() is called, which will call into
dpif_netlink_recv_wait() which will set up the next poll_block() call.

This eventually uses the poll() interface (through the time_poll() call).
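
A minimal sketch of that wait pattern, for illustration only (this is not
OVS code; the function name and shape are assumptions). It also captures one
plausible reading of the pollwake() frames in the stack trace above: the
EPOLLEXCLUSIVE flag is set on the sockets inside the epoll set, while the
handler thread itself sleeps in poll() on the epoll fd.

/* Hedged sketch, not OVS code: the handler's netlink sockets sit inside an
 * epoll set, but the thread blocks in poll() on the epoll fd, mirroring
 * poll_fd_wait(handler->epoll_fd, POLLIN) followed by poll_block(). */
#include <poll.h>

static int wait_like_a_handler(int epoll_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = epoll_fd, .events = POLLIN };

    /* Readiness of the epoll set wakes the poll() sleeper; EPOLLEXCLUSIVE
     * applies to members of the set, not to poll()ers of the epoll fd. */
    return poll(&pfd, 1, timeout_ms);
}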

Maybe we can confirm the thundering herd by just making a quick hack to
shunt out the poll block (included).  Just a quick 'n dirty hack to test
with.

> Thanks,
>
> -jason

NOTE: this change will trigger warnings about unused parameters.  But it
  will prevent us from registering on the poll_loop for the thread.
  Keep in mind there will be high cpu utilization since the
  poll_block will never sleep (and this is a hack... so there's
  that)

diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
index d1f9b81db..804373c3a 100644
--- a/lib/dpif-netlink.c
+++ b/lib/dpif-netlink.c
@@ -2709,7 +2709,7 @@ dpif_netlink_recv_wait__(struct dpif_netlink *dpif, uint32_t handler_id)
     if (dpif->handlers && handler_id < dpif->n_handlers) {
         struct dpif_handler *handler = &dpif->handlers[handler_id];
 
-        poll_fd_wait(handler->epoll_fd, POLLIN);
+        poll_immediate_wake(); /* now we never sleep or poll the handler */
     }
 #endif
 }

Re: [ovs-dev] thundering herd wakeup of handler threads

2019-12-10 Thread Jason Baron via dev



On 12/10/19 5:20 PM, David Ahern wrote:
> On 12/10/19 3:09 PM, Jason Baron wrote:
>> Hi David,
>>
>> The idea is that we try and queue new work to 'idle' threads in an
>> attempt to distribute a workload. Thus, once we find an 'idle' thread we
>> stop waking up other threads. While we are searching the wakeup list for
>> idle threads, we do queue an epoll event to the non-idle threads, this
>> doesn't mean they are woken up. It just means that when they go to
>> epoll_wait() to harvest events from the kernel, if the event is still
>> available it will be reported. If the condition for the event is no
>> longer true (because another thread consumed it), then the event
>> wouldn't be visible. So it's a way of load balancing a workload while
>> also reducing the number of wakeups. It's 'exclusive' in the sense that
>> it will stop after it finds the first idle thread.
>>
>> We certainly can employ other wakeup strategies - there was interest
>> (and patches) for a strict 'round robin' but that has not been merged
>> upstream.
>>
>> I would like to better understand the current use case. It sounds like
>> each thread has an epoll file descriptor. And each epoll file descriptor
>> is attached to the same netlink socket. But when that netlink socket gets a
>> packet it causes all the threads to wake up? Are you sure there is just 1
>> netlink socket that all epoll file descriptors are attached to?
>>
> 
> Thanks for the response.
> 
> This is the code in question:
> 
> https://github.com/openvswitch/ovs/blob/branch-2.11/lib/dpif-netlink.c#L492
> 
> Yes, prior to finding the above code reference I had traced it to a
> single socket with all handler threads (71 threads on this 96 cpu box)
> on the wait queue.
> 
> The ovs kernel module is punting a packet to userspace. It generates a
> netlink message and invokes netlink_unicast. This is the stack trace:
> 
> ad09cc02 ttwu_do_wakeup+0x92 ([kernel.kallsyms])
> ad09d945 try_to_wake_up+0x1d5 ([kernel.kallsyms])
> ad257275 pollwake+0x75 ([kernel.kallsyms])
> ad0b58a4 __wake_up_common+0x74 ([kernel.kallsyms])
> ad0b59cc __wake_up_common_lock+0x7c ([kernel.kallsyms])
> ad289ecc ep_poll_wakeup_proc+0x1c ([kernel.kallsyms])
> ad28a4bc ep_call_nested.constprop.18+0xbc ([kernel.kallsyms])
> ad28b0f2 ep_poll_callback+0x172 ([kernel.kallsyms])
> ad0b58a4 __wake_up_common+0x74 ([kernel.kallsyms])
> ad0b59cc __wake_up_common_lock+0x7c ([kernel.kallsyms])
> ad794af9 sock_def_readable+0x39 ([kernel.kallsyms])
> ad7e846e __netlink_sendskb+0x3e ([kernel.kallsyms])
> ad7eb11a netlink_unicast+0x20a ([kernel.kallsyms])
> c07abd44 queue_userspace_packet+0x2d4 ([kernel.kallsyms])
> c07ac330 ovs_dp_upcall+0x50 ([kernel.kallsyms])
> 
> 
> A probe on sock_def_readable shows it is a single socket whose wait
> queue is processed. Eventually ttwu_do_wakeup is invoked 71 times (once
> for each thread). In some cases I see the awakened threads immediately
> running on the target_cpu, while the queue walk continues to wake up all
> threads. Only the first one is going to handle the packet so the rest of
> the wakeups are just noise.
> 
> On this system in just a 1-second interval I see this sequence play out
> 400+ times.
> 


That stack trace is interesting. I *think* it means you are doing
select() or poll() on the epoll file descriptors? I say that because I
see 'pollwake()' in the stack. Notice that pollwake() is only in
fs/select.c.

Thanks,

-jason


Re: [ovs-dev] thundering herd wakeup of handler threads

2019-12-10 Thread Jason Baron via dev



On 12/10/19 4:00 PM, David Ahern wrote:
> [ adding Jason as author of the patch that added the epoll exclusive flag ]
> 
> On 12/10/19 12:37 PM, Matteo Croce wrote:
>> On Tue, Dec 10, 2019 at 8:13 PM David Ahern  wrote:
>>>
>>> Hi Matteo:
>>>
>>> On a hypervisor running a 4.14.91 kernel and OVS 2.11 I am seeing a
>>> thundering herd wake up problem. Every packet punted to userspace wakes
>>> up every one of the handler threads. On a box with 96 cpus, there are 71
>>> handler threads which means 71 process wakeups for every packet punted.
>>>
>>> This is really easy to see, just watch sched:sched_wakeup tracepoints.
>>> With a few extra probes:
>>>
>>> perf probe sock_def_readable sk=%di
>>> perf probe ep_poll_callback wait=%di mode=%si sync=%dx key=%cx
>>> perf probe __wake_up_common wq_head=%di mode=%si nr_exclusive=%dx
>>> wake_flags=%cx key=%r8
>>>
>>> you can see there is a single netlink socket and its wait queue contains
>>> an entry for every handler thread.
>>>
>>> This does not happen with the 2.7.3 version. Roaming through the commits, it appears
>>> that the change in behavior comes from this commit:
>>>
>>> commit 69c51582ff786a68fc325c1c50624715482bc460
>>> Author: Matteo Croce 
>>> Date:   Tue Sep 25 10:51:05 2018 +0200
>>>
>>> dpif-netlink: don't allocate per thread netlink sockets
>>>
>>>
>>> Is this a known problem?
>>>
>>> David
>>>
>>
>> Hi David,
>>
>> before my patch, vswitchd created NxM sockets, N being the number of ports
>> and M the number of active cores,
>> because every thread opens a netlink socket per port.
>>
>> With my patch, a pool is created with N sockets, one per port, and all
>> the threads poll the same list
>> with the EPOLLEXCLUSIVE flag.
>> As the name suggests, EPOLLEXCLUSIVE lets the kernel wake up only one
>> of the waiting threads.
>>
>> I'm not aware of this problem, but it goes against the intended
>> behaviour of EPOLLEXCLUSIVE.
>> The flag has existed since Linux 4.5; can you check that it's passed
>> correctly to epoll()?
>>
> 
> This is the commit that added the EXCLUSIVE flag:
> 
> commit df0108c5da561c66c333bb46bfe3c1fc65905898
> Author: Jason Baron 
> Date:   Wed Jan 20 14:59:24 2016 -0800
> 
> epoll: add EPOLLEXCLUSIVE flag
> 
> 
> The commit message acknowledges that multiple threads can still be awakened:
> 
> "The implementation walks the list of exclusive waiters, and queues an
> event to each epfd, until it finds the first waiter that has threads
> blocked on it via epoll_wait().  The idea is to search for threads which
> are idle and ready to process the wakeup events.  Thus, we queue an
> event to at least 1 epfd, but may still potentially queue an event to
> all epfds that are attached to the shared fd source."
> 
> To me that means all idle handler threads are going to be awakened on
> each upcall message even though only 1 is needed to handle the message.
> 
> Jason: What was the rationale behind the exclusive flag that still wakes
> up more than 1 waiter? In the case of OVS and vswitchd I am seeing all N
> handler threads awakened on every single event which is a horrible
> scaling property.
> 

Hi David,

The idea is that we try and queue new work to 'idle' threads in an
attempt to distribute a workload. Thus, once we find an 'idle' thread we
stop waking up other threads. While we are searching the wakeup list for
idle threads, we do queue an epoll event to the non-idle threads, this
doesn't mean they are woken up. It just means that when they go to
epoll_wait() to harvest events from the kernel, if the event is still
available it will be reported. If the condition for the event is no
longer true (because another thread consumed it), then the event
wouldn't be visible. So it's a way of load balancing a workload while
also reducing the number of wakeups. It's 'exclusive' in the sense that
it will stop after it finds the first idle thread.
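
A small standalone demo may help picture this (hypothetical code, not from
this thread): several threads, each with its own epoll instance, all attach
one shared eventfd with EPOLLEXCLUSIVE, and a single write shows how many of
them wake. It assumes Linux >= 4.5 and a glibc that defines EPOLLEXCLUSIVE.

/* Hypothetical demo: N threads, each with a private epoll instance, all
 * attached to one shared eventfd with EPOLLEXCLUSIVE.  One write() to the
 * eventfd shows how many blocked threads actually wake up.
 * Build: gcc -O2 -pthread epollx-demo.c -o epollx-demo */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

#define NTHREADS 8

static int shared_fd;            /* stands in for the single netlink socket */
static atomic_int wakeups;

static void *waiter(void *arg)
{
    (void) arg;
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCLUSIVE };
    struct epoll_event got;

    if (epfd < 0 || epoll_ctl(epfd, EPOLL_CTL_ADD, shared_fd, &ev) < 0) {
        perror("epoll");
        exit(EXIT_FAILURE);
    }
    /* Block like an idle handler; give up after 2s if never woken. */
    if (epoll_wait(epfd, &got, 1, 2000) > 0) {
        atomic_fetch_add(&wakeups, 1);
    }
    close(epfd);
    return NULL;
}

int main(void)
{
    pthread_t tids[NTHREADS];
    uint64_t one = 1;

    shared_fd = eventfd(0, 0);
    for (int i = 0; i < NTHREADS; i++) {
        pthread_create(&tids[i], NULL, waiter, NULL);
    }
    sleep(1);                                     /* let every thread block */
    if (write(shared_fd, &one, sizeof one) < 0) { /* one "upcall" */
        perror("write");
    }
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(tids[i], NULL);
    }
    printf("threads woken by one event: %d of %d\n",
           atomic_load(&wakeups), NTHREADS);
    return 0;
}

Since every waiter here is idle in epoll_wait(), the count is typically 1;
the many wakeups reported for the OVS handlers are the behavior being
questioned in this thread.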

We certainly can employ other wakeup strategies - there was interest
(and patches) for a strict 'round robin' but that has not been merged
upstream.

I would like to better understand the current use case. It sounds like
each thread has an epoll file descriptor. And each epoll file descriptor
is attached to the same netlink socket. But when that netlink socket gets a
packet it causes all the threads to wake up? Are you sure there is just 1
netlink socket that all epoll file descriptors are attached to?

Thanks,

-Jason





Re: [ovs-dev] thundering herd wakeup of handler threads

2019-12-10 Thread David Ahern
On 12/10/19 3:09 PM, Jason Baron wrote:
> Hi David,
> 
> The idea is that we try and queue new work to 'idle' threads in an
> attempt to distribute a workload. Thus, once we find an 'idle' thread we
> stop waking up other threads. While we are searching the wakeup list for
> idle threads, we do queue an epoll event to the non-idle threads, this
> doesn't mean they are woken up. It just means that when they go to
> epoll_wait() to harvest events from the kernel, if the event is still
> available it will be reported. If the condition for the event is no
> longer true (because another thread consumed it), then the event
> wouldn't be visible. So it's a way of load balancing a workload while
> also reducing the number of wakeups. It's 'exclusive' in the sense that
> it will stop after it finds the first idle thread.
> 
> We certainly can employ other wakeup strategies - there was interest
> (and patches) for a strict 'round robin' but that has not been merged
> upstream.
> 
> I would like to better understand the current use case. It sounds like
> each thread has an epoll file descriptor. And each epoll file descriptor
> is attached to the same netlink socket. But when that netlink socket gets a
> packet it causes all the threads to wake up? Are you sure there is just 1
> netlink socket that all epoll file descriptors are attached to?
> 

Thanks for the response.

This is the code in question:

https://github.com/openvswitch/ovs/blob/branch-2.11/lib/dpif-netlink.c#L492

Yes, prior to finding the above code reference I had traced it to a
single socket with all handler threads (71 threads on this 96 cpu box)
on the wait queue.

The ovs kernel module is punting a packet to userspace. It generates a
netlink message and invokes netlink_unicast. This is the stack trace:

ad09cc02 ttwu_do_wakeup+0x92 ([kernel.kallsyms])
ad09d945 try_to_wake_up+0x1d5 ([kernel.kallsyms])
ad257275 pollwake+0x75 ([kernel.kallsyms])
ad0b58a4 __wake_up_common+0x74 ([kernel.kallsyms])
ad0b59cc __wake_up_common_lock+0x7c ([kernel.kallsyms])
ad289ecc ep_poll_wakeup_proc+0x1c ([kernel.kallsyms])
ad28a4bc ep_call_nested.constprop.18+0xbc ([kernel.kallsyms])
ad28b0f2 ep_poll_callback+0x172 ([kernel.kallsyms])
ad0b58a4 __wake_up_common+0x74 ([kernel.kallsyms])
ad0b59cc __wake_up_common_lock+0x7c ([kernel.kallsyms])
ad794af9 sock_def_readable+0x39 ([kernel.kallsyms])
ad7e846e __netlink_sendskb+0x3e ([kernel.kallsyms])
ad7eb11a netlink_unicast+0x20a ([kernel.kallsyms])
c07abd44 queue_userspace_packet+0x2d4 ([kernel.kallsyms])
c07ac330 ovs_dp_upcall+0x50 ([kernel.kallsyms])


A probe on sock_def_readable shows it is a single socket whose wait
queue is processed. Eventually ttwu_do_wakeup is invoked 71 times (once
for each thread). In some cases I see the awakened threads immediately
running on the target_cpu, while the queue walk continues to wake up all
threads. Only the first one is going to handle the packet so the rest of
the wakeups are just noise.

On this system in just a 1-second interval I see this sequence play out
400+ times.
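
A hedged sketch, not the actual OVS code path, of why those extra wakeups
are pure overhead: a handler that lost the race just sees EAGAIN on the
shared socket and goes back to sleep.

/* Sketch only: what a spuriously woken handler sees when it tries to read
 * the shared socket after another thread has already consumed the upcall. */
#include <errno.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <sys/types.h>

static bool try_read_upcall(int nl_sock, void *buf, size_t len)
{
    for (;;) {
        ssize_t n = recv(nl_sock, buf, len, MSG_DONTWAIT);
        if (n >= 0) {
            return true;       /* won the race: go process the upcall */
        }
        if (errno == EINTR) {
            continue;          /* interrupted; retry */
        }
        /* EAGAIN: nothing left to read -- the wakeup did no useful work. */
        return false;
    }
}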


Re: [ovs-dev] thundering herd wakeup of handler threads

2019-12-10 Thread David Ahern
On 12/10/19 2:20 PM, Matteo Croce wrote:
> 
> Before this patch (which unfortunately is needed to avoid -EMFILE
> errors with many ports), how many sockets are awakened when an ARP is
> received?
> 

on systems using 2.7.3 I see only a single handler thread awakened on
upcalls.

Yes, I saw the commit message on the change. At the moment, the socket
descriptor scaling problem has morphed into a thundering herd scheduling
problem with the added overhead of waking up all of the threads in the
packet path.


Re: [ovs-dev] thundering herd wakeup of handler threads

2019-12-10 Thread Matteo Croce
On Tue, Dec 10, 2019 at 10:00 PM David Ahern  wrote:
>
> [ adding Jason as author of the patch that added the epoll exclusive flag ]
>
> On 12/10/19 12:37 PM, Matteo Croce wrote:
> > On Tue, Dec 10, 2019 at 8:13 PM David Ahern  wrote:
> >>
> >> Hi Matteo:
> >>
> >> On a hypervisor running a 4.14.91 kernel and OVS 2.11 I am seeing a
> >> thundering herd wake up problem. Every packet punted to userspace wakes
> >> up every one of the handler threads. On a box with 96 cpus, there are 71
> >> handler threads which means 71 process wakeups for every packet punted.
> >>
> >> This is really easy to see, just watch sched:sched_wakeup tracepoints.
> >> With a few extra probes:
> >>
> >> perf probe sock_def_readable sk=%di
> >> perf probe ep_poll_callback wait=%di mode=%si sync=%dx key=%cx
> >> perf probe __wake_up_common wq_head=%di mode=%si nr_exclusive=%dx
> >> wake_flags=%cx key=%r8
> >>
> >> you can see there is a single netlink socket and its wait queue contains
> >> an entry for every handler thread.
> >>
> >> This does not happen with the 2.7.3 version. Roaming through the commits, it appears
> >> that the change in behavior comes from this commit:
> >>
> >> commit 69c51582ff786a68fc325c1c50624715482bc460
> >> Author: Matteo Croce 
> >> Date:   Tue Sep 25 10:51:05 2018 +0200
> >>
> >> dpif-netlink: don't allocate per thread netlink sockets
> >>
> >>
> >> Is this a known problem?
> >>
> >> David
> >>
> >
> > Hi David,
> >
> > before my patch, vswitchd created NxM sockets, N being the number of ports
> > and M the number of active cores,
> > because every thread opens a netlink socket per port.
> >
> > With my patch, a pool is created with N sockets, one per port, and all
> > the threads poll the same list
> > with the EPOLLEXCLUSIVE flag.
> > As the name suggests, EPOLLEXCLUSIVE lets the kernel wake up only one
> > of the waiting threads.
> >
> > I'm not aware of this problem, but it goes against the intended
> > behaviour of EPOLLEXCLUSIVE.
> > The flag has existed since Linux 4.5; can you check that it's passed
> > correctly to epoll()?
> >
>
> This is the commit that added the EXCLUSIVE flag:
>
> commit df0108c5da561c66c333bb46bfe3c1fc65905898
> Author: Jason Baron 
> Date:   Wed Jan 20 14:59:24 2016 -0800
>
> epoll: add EPOLLEXCLUSIVE flag
>
>
> The commit message acknowledges that multiple threads can still be awakened:
>
> "The implementation walks the list of exclusive waiters, and queues an
> event to each epfd, until it finds the first waiter that has threads
> blocked on it via epoll_wait().  The idea is to search for threads which
> are idle and ready to process the wakeup events.  Thus, we queue an
> event to at least 1 epfd, but may still potentially queue an event to
> all epfds that are attached to the shared fd source."
>
> To me that means all idle handler threads are going to be awakened on
> each upcall message even though only 1 is needed to handle the message.
>
> Jason: What was the rationale behind the exclusive flag that still wakes
> up more than 1 waiter? In the case of OVS and vswitchd I am seeing all N
> handler threads awakened on every single event which is a horrible
> scaling property.
>

Actually, I didn't look at that commit message, but I read the
epoll_ctl manpage which says:

"When a wakeup event occurs and multiple epoll file descriptors
are attached to the same target file using EPOLLEXCLUSIVE, one
or more of the epoll file descriptors will receive an event
with epoll_wait(2).  The default in this scenario (when
EPOLLEXCLUSIVE is not set) is for all epoll file descriptors
to receive an event.  EPOLLEXCLUSIVE is thus useful for avoiding
thundering herd problems in certain scenarios."

I'd expect "one or more" to be probably greater than 1, but still much
lower than all.

Before this patch (which unfortunately is needed to avoid -EMFILE
errors with many ports), how many sockets are awakened when an ARP is
received?

Regards,

-- 
Matteo Croce
per aspera ad upstream



Re: [ovs-dev] thundering herd wakeup of handler threads

2019-12-10 Thread David Ahern
[ adding Jason as author of the patch that added the epoll exclusive flag ]

On 12/10/19 12:37 PM, Matteo Croce wrote:
> On Tue, Dec 10, 2019 at 8:13 PM David Ahern  wrote:
>>
>> Hi Matteo:
>>
>> On a hypervisor running a 4.14.91 kernel and OVS 2.11 I am seeing a
>> thundering herd wake up problem. Every packet punted to userspace wakes
>> up every one of the handler threads. On a box with 96 cpus, there are 71
>> handler threads which means 71 process wakeups for every packet punted.
>>
>> This is really easy to see, just watch sched:sched_wakeup tracepoints.
>> With a few extra probes:
>>
>> perf probe sock_def_readable sk=%di
>> perf probe ep_poll_callback wait=%di mode=%si sync=%dx key=%cx
>> perf probe __wake_up_common wq_head=%di mode=%si nr_exclusive=%dx
>> wake_flags=%cx key=%r8
>>
>> you can see there is a single netlink socket and its wait queue contains
>> an entry for every handler thread.
>>
>> This does not happen with the 2.7.3 version. Roaming through the commits, it appears
>> that the change in behavior comes from this commit:
>>
>> commit 69c51582ff786a68fc325c1c50624715482bc460
>> Author: Matteo Croce 
>> Date:   Tue Sep 25 10:51:05 2018 +0200
>>
>> dpif-netlink: don't allocate per thread netlink sockets
>>
>>
>> Is this a known problem?
>>
>> David
>>
> 
> Hi David,
> 
> before my patch, vswitchd created NxM sockets, N being the number of ports
> and M the number of active cores,
> because every thread opens a netlink socket per port.
>
> With my patch, a pool is created with N sockets, one per port, and all
> the threads poll the same list
> with the EPOLLEXCLUSIVE flag.
> As the name suggests, EPOLLEXCLUSIVE lets the kernel wake up only one
> of the waiting threads.
>
> I'm not aware of this problem, but it goes against the intended
> behaviour of EPOLLEXCLUSIVE.
> The flag has existed since Linux 4.5; can you check that it's passed
> correctly to epoll()?
> 

This is the commit that added the EXCLUSIVE flag:

commit df0108c5da561c66c333bb46bfe3c1fc65905898
Author: Jason Baron 
Date:   Wed Jan 20 14:59:24 2016 -0800

epoll: add EPOLLEXCLUSIVE flag


The commit message acknowledges that multiple threads can still be awakened:

"The implementation walks the list of exclusive waiters, and queues an
event to each epfd, until it finds the first waiter that has threads
blocked on it via epoll_wait().  The idea is to search for threads which
are idle and ready to process the wakeup events.  Thus, we queue an
event to at least 1 epfd, but may still potentially queue an event to
all epfds that are attached to the shared fd source."

To me that means all idle handler threads are going to be awakened on
each upcall message even though only 1 is needed to handle the message.

Jason: What was the rationale behind the exclusive flag that still wakes
up more than 1 waiter? In the case of OVS and vswitchd I am seeing all N
handler threads awakened on every single event which is a horrible
scaling property.


Re: [ovs-dev] thundering herd wakeup of handler threads

2019-12-10 Thread Matteo Croce
On Tue, Dec 10, 2019 at 8:41 PM David Ahern  wrote:
>
> On 12/10/19 12:37 PM, Matteo Croce wrote:
> > On Tue, Dec 10, 2019 at 8:13 PM David Ahern  wrote:
> >>
> >> Hi Matteo:
> >>
> >> On a hypervisor running a 4.14.91 kernel and OVS 2.11 I am seeing a
> >> thundering herd wake up problem. Every packet punted to userspace wakes
> >> up every one of the handler threads. On a box with 96 cpus, there are 71
> >> handler threads which means 71 process wakeups for every packet punted.
> >>
> >> This is really easy to see, just watch sched:sched_wakeup tracepoints.
> >> With a few extra probes:
> >>
> >> perf probe sock_def_readable sk=%di
> >> perf probe ep_poll_callback wait=%di mode=%si sync=%dx key=%cx
> >> perf probe __wake_up_common wq_head=%di mode=%si nr_exclusive=%dx
> >> wake_flags=%cx key=%r8
> >>
> >> you can see there is a single netlink socket and its wait queue contains
> >> an entry for every handler thread.
> >>
> >> This does not happen with the 2.7.3 version. Roaming through the commits, it appears
> >> that the change in behavior comes from this commit:
> >>
> >> commit 69c51582ff786a68fc325c1c50624715482bc460
> >> Author: Matteo Croce 
> >> Date:   Tue Sep 25 10:51:05 2018 +0200
> >>
> >> dpif-netlink: don't allocate per thread netlink sockets
> >>
> >>
> >> Is this a known problem?
> >>
> >> David
> >>
> >
> > Hi David,
> >
> > before my patch, vswitchd created NxM sockets, N being the number of ports
> > and M the number of active cores,
> > because every thread opens a netlink socket per port.
> >
> > With my patch, a pool is created with N sockets, one per port, and all
> > the threads poll the same list
> > with the EPOLLEXCLUSIVE flag.
> > As the name suggests, EPOLLEXCLUSIVE lets the kernel wake up only one
> > of the waiting threads.
> >
> > I'm not aware of this problem, but it goes against the intended
> > behaviour of EPOLLEXCLUSIVE.
> > The flag has existed since Linux 4.5; can you check that it's passed
> > correctly to epoll()?
> >
>
> I get the theory, but the reality is that all threads are awakened.
> Also, it is not limited to the 4.14 kernel; I see the same behavior with
> 5.4.
>

So all threads are awakened, even if there is only one upcall packet to read?
This is not good; epoll() should wake just one thread at a time, as the
manpage says.

I have to check this; do you have a minimal setup? How do you generate
upcalls: is it BUM traffic, or via action=userspace?

Bye,
-- 
Matteo Croce
per aspera ad upstream



Re: [ovs-dev] thundering herd wakeup of handler threads

2019-12-10 Thread David Ahern
On 12/10/19 12:37 PM, Matteo Croce wrote:
> On Tue, Dec 10, 2019 at 8:13 PM David Ahern  wrote:
>>
>> Hi Matteo:
>>
>> On a hypervisor running a 4.14.91 kernel and OVS 2.11 I am seeing a
>> thundering herd wake up problem. Every packet punted to userspace wakes
>> up every one of the handler threads. On a box with 96 cpus, there are 71
>> handler threads which means 71 process wakeups for every packet punted.
>>
>> This is really easy to see, just watch sched:sched_wakeup tracepoints.
>> With a few extra probes:
>>
>> perf probe sock_def_readable sk=%di
>> perf probe ep_poll_callback wait=%di mode=%si sync=%dx key=%cx
>> perf probe __wake_up_common wq_head=%di mode=%si nr_exclusive=%dx
>> wake_flags=%cx key=%r8
>>
>> you can see there is a single netlink socket and its wait queue contains
>> an entry for every handler thread.
>>
>> This does not happen with the 2.7.3 version. Roaming through the commits, it appears
>> that the change in behavior comes from this commit:
>>
>> commit 69c51582ff786a68fc325c1c50624715482bc460
>> Author: Matteo Croce 
>> Date:   Tue Sep 25 10:51:05 2018 +0200
>>
>> dpif-netlink: don't allocate per thread netlink sockets
>>
>>
>> Is this a known problem?
>>
>> David
>>
> 
> Hi David,
> 
> before my patch, vswitchd created NxM sockets, N being the number of ports
> and M the number of active cores,
> because every thread opens a netlink socket per port.
>
> With my patch, a pool is created with N sockets, one per port, and all
> the threads poll the same list
> with the EPOLLEXCLUSIVE flag.
> As the name suggests, EPOLLEXCLUSIVE lets the kernel wake up only one
> of the waiting threads.
>
> I'm not aware of this problem, but it goes against the intended
> behaviour of EPOLLEXCLUSIVE.
> The flag has existed since Linux 4.5; can you check that it's passed
> correctly to epoll()?
> 

I get the theory, but the reality is that all threads are awakened.
Also, it is not limited to the 4.14 kernel; I see the same behavior with
5.4.


Re: [ovs-dev] thundering herd wakeup of handler threads

2019-12-10 Thread Matteo Croce
On Tue, Dec 10, 2019 at 8:13 PM David Ahern  wrote:
>
> Hi Matteo:
>
> On a hypervisor running a 4.14.91 kernel and OVS 2.11 I am seeing a
> thundering herd wake up problem. Every packet punted to userspace wakes
> up every one of the handler threads. On a box with 96 cpus, there are 71
> handler threads which means 71 process wakeups for every packet punted.
>
> This is really easy to see, just watch sched:sched_wakeup tracepoints.
> With a few extra probes:
>
> perf probe sock_def_readable sk=%di
> perf probe ep_poll_callback wait=%di mode=%si sync=%dx key=%cx
> perf probe __wake_up_common wq_head=%di mode=%si nr_exclusive=%dx
> wake_flags=%cx key=%r8
>
> you can see there is a single netlink socket and its wait queue contains
> an entry for every handler thread.
>
> This does not happen with the 2.7.3 version. Roaming through the commits, it appears
> that the change in behavior comes from this commit:
>
> commit 69c51582ff786a68fc325c1c50624715482bc460
> Author: Matteo Croce 
> Date:   Tue Sep 25 10:51:05 2018 +0200
>
> dpif-netlink: don't allocate per thread netlink sockets
>
>
> Is this a known problem?
>
> David
>

Hi David,

before my patch, vswitchd created NxM sockets, N being the number of ports
and M the number of active cores,
because every thread opens a netlink socket per port.

With my patch, a pool is created with N sockets, one per port, and all
the threads poll the same list
with the EPOLLEXCLUSIVE flag.
As the name suggests, EPOLLEXCLUSIVE lets the kernel wake up only one
of the waiting threads.
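
For reference, a hedged sketch of the registration pattern being described
(the helper and names here are hypothetical, not the actual dpif-netlink
code): each handler thread owns one epoll instance and attaches the shared
per-port sockets to it with EPOLLIN | EPOLLEXCLUSIVE, so the process needs N
sockets rather than NxM.

/* Hypothetical helper, for illustration only: attach a shared pool of
 * per-port netlink sockets to one handler thread's epoll instance. */
#include <stdint.h>
#include <sys/epoll.h>

static int attach_port_sockets(int epoll_fd, const int *port_socks, int n_ports)
{
    for (int i = 0; i < n_ports; i++) {
        struct epoll_event ev = {
            .events = EPOLLIN | EPOLLEXCLUSIVE,
            .data.u32 = (uint32_t) i,       /* remember which port fired */
        };
        if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, port_socks[i], &ev) < 0) {
            return -1;
        }
    }
    return 0;
}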

I'm not aware of this problem, but it goes against the intended
behaviour of EPOLLEXCLUSIVE.
The flag has existed since Linux 4.5; can you check that it's passed
correctly to epoll()?

Bye,

-- 
Matteo Croce
per aspera ad upstream
