Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-27 Thread Jason Baron
Hi,

v3 of this series implements this idea using a different
approach:

http://lkml.iu.edu/hypermail/linux/kernel/1502.3/00667.html

If that still meets your needs, it would be helpful to know in
order to move this forward.

Looking back at your posting, I was concerned about the
test case not lining up with the kernel code change.

Thanks,

-Jason

On 02/27/2015 03:56 PM, Hagen Paul Pfeifer wrote:
>
> Applause! Nice patch, sad that I submitted this patch ~3 years ago with
> exactly the same naming (EPOLLEXCLUSIVE) & nearly exact commit message and
> you rejected the patch ...
>
> Hagen
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-25 Thread Jason Baron
On 02/21/2015 07:24 PM, Eric Wong wrote:
> Jason Baron  wrote:
>> On 02/18/2015 12:51 PM, Ingo Molnar wrote:
>>> * Ingo Molnar  wrote:
>>>
>>>>> [...] However, I think the userspace API change is less
>>>>> clear since epoll_wait() doesn't currently have an
>>>>> 'input' events argument as epoll_ctl() does.
>>>> ... but the change would be a bit clearer and somewhat
>>>> more flexible: LIFO or FIFO queueing, right?
>>>>
>>>> But having the queueing model as part of the epoll
>>>> context is a legitimate approach as well.
>>> Btw., there's another optimization that the networking code 
>>> already does when processing incoming packets: waking up a 
>>> thread on the local CPU, where the wakeup is running.
>>>
>>> Doing the same on epoll would have real scalability 
>>> advantages where incoming events are IRQ driven and are 
>>> distributed amongst multiple CPUs.
>>>
>>> Where events are task driven the scheduler will already try 
>>> to pair up waker and wakee so it might not show up in 
>>> measurements that markedly.
>>>
>> Right, so this makes me think that we may want to potentially
>> support a variety of wakeup policies. Adding these to the
>> generic wake up code is just going to be too messy. So, perhaps
>> a better approach here would be to register a single
>> wait_queue_t with the event source queue that will always
>> be woken up, and then layer any epoll balancing/irq affinity
>> policies on top of that. So in essence we end up with sort of
>> two queue layers, but I think it provides much nicer isolation
>> between layers. Also, the bulk of the changes are going to be
>> isolated to the epoll code, and we avoid Andy's concern about
>> missing, or starving out wakeups.
>>
>> So here's a stab at how this API could look:
>>
>> 1. ep1 = epoll_create1(EPOLL_POLICY);
>>
>> So EPOLL_POLICY here could be the round robin policy described
>> here, or the irq affinity or other ideas. The idea is to create
>> an fd that is local to the process, such that other processes
>> can not subsequently attach to it and affect our policy.
> I'm not against defining more policies if needed.
> Maybe FIFO vs LIFO is a good case for this.
>
> For affinity, it could probably be done transparently based on
> epoll_wait retrievals + EPOLL_CTL_MOD operations.
>
>> 2. epoll_ctl(ep1, EPOLL_CTL_ADD, fd_source, NULL);
>>
>> This associates ep1 with the event source. ep1 can be
>> associated with or added to at most 1 wakeup source. This call
>> would largely just form the association, but not queue anything
>> to the fd_source wait queue.
> This would mean one extra FD for every fd_source, but that's
> only a handful of FDs (listen sockets), correct?

Yes, one extra epoll fd per shared wakeup source, so this should
result in very few additional fds.

>> 3. epoll_ctl(ep2, EPOLL_CTL_ADD, ep1, event);
>> epoll_ctl(ep3, EPOLL_CTL_ADD, ep1, event);
>> epoll_ctl(ep4, EPOLL_CTL_ADD, ep1, event);
>>  .
>>  .
>>  .
>>
>> Finally, we add the epoll sets to the event source (indirectly via
>> ep1). So the first add would actually queue the callback to the
>> fd_source, while the subsequent calls would simply queue things
>> to the 'nested' wakeup queue associated with ep1.
> I'm not sure I follow, wouldn't this increase the number of wakeups?

I agree, my text there is confusing. I've posted this idea as
v3 of this series, so hopefully that clarifies the approach.

Thanks,

-Jason


Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-21 Thread Eric Wong
Jason Baron  wrote:
> On 02/18/2015 12:51 PM, Ingo Molnar wrote:
> > * Ingo Molnar  wrote:
> >
> >>> [...] However, I think the userspace API change is less 
> >>> clear since epoll_wait() doesn't currently have an 
> >>> 'input' events argument as epoll_ctl() does.
> >> ... but the change would be a bit clearer and somewhat 
> >> more flexible: LIFO or FIFO queueing, right?
> >>
> >> But having the queueing model as part of the epoll 
> >> context is a legitimate approach as well.
> > Btw., there's another optimization that the networking code 
> > already does when processing incoming packets: waking up a 
> > thread on the local CPU, where the wakeup is running.
> >
> > Doing the same on epoll would have real scalability 
> > advantages where incoming events are IRQ driven and are 
> > distributed amongst multiple CPUs.
> >
> > Where events are task driven the scheduler will already try 
> > to pair up waker and wakee so it might not show up in 
> > measurements that markedly.
> >
> 
> Right, so this makes me think that we may want to potentially
> support a variety of wakeup policies. Adding these to the
> generic wake up code is just going to be too messy. So, perhaps
> a better approach here would be to register a single
> wait_queue_t with the event source queue that will always
> be woken up, and then layer any epoll balancing/irq affinity
> policies on top of that. So in essence we end up with sort of
> two queue layers, but I think it provides much nicer isolation
> between layers. Also, the bulk of the changes are going to be
> isolated to the epoll code, and we avoid Andy's concern about
> missing, or starving out wakeups.
> 
> So here's a stab at how this API could look:
> 
> 1. ep1 = epoll_create1(EPOLL_POLICY);
> 
> So EPOLL_POLICY here could be the round robin policy described
> here, or the irq affinity or other ideas. The idea is to create
> an fd that is local to the process, such that other processes
> can not subsequently attach to it and affect our policy.

I'm not against defining more policies if needed.
Maybe FIFO vs LIFO is a good case for this.

For affinity, it could probably be done transparently based on
epoll_wait retrievals + EPOLL_CTL_MOD operations.

> 2. epoll_ctl(ep1, EPOLL_CTL_ADD, fd_source, NULL);
> 
> This associates ep1 with the event source. ep1 can be
> associated with or added to at most 1 wakeup source. This call
> would largely just form the association, but not queue anything
> to the fd_source wait queue.

This would mean one extra FD for every fd_source, but that's
only a handful of FDs (listen sockets), correct?

> 3. epoll_ctl(ep2, EPOLL_CTL_ADD, ep1, event);
> epoll_ctl(ep3, EPOLL_CTL_ADD, ep1, event);
> epoll_ctl(ep4, EPOLL_CTL_ADD, ep1, event);
>  .
>  .
>  .
> 
> Finally, we add the epoll sets to the event source (indirectly via
> ep1). So the first add would actually queue the callback to the
> fd_source, while the subsequent calls would simply queue things
> to the 'nested' wakeup queue associated with ep1.

I'm not sure I follow, wouldn't this increase the number of wakeups?

> So any existing epoll/poll/select calls could be queued as well
> to fd_source and will operate independently from this mechanism,
> as the fd_source queue continues to be 'wake all'. Also, there
> should be no changes necessary to __wake_up_common(), other
> than potentially passing more back through the
> wait_queue_func_t, such as 'nr_exclusive'.


Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-18 Thread Jason Baron
On 02/18/2015 12:51 PM, Ingo Molnar wrote:
> * Ingo Molnar  wrote:
>
>>> [...] However, I think the userspace API change is less 
>>> clear since epoll_wait() doesn't currently have an 
>>> 'input' events argument as epoll_ctl() does.
>> ... but the change would be a bit clearer and somewhat 
>> more flexible: LIFO or FIFO queueing, right?
>>
>> But having the queueing model as part of the epoll 
>> context is a legitimate approach as well.
> Btw., there's another optimization that the networking code 
> already does when processing incoming packets: waking up a 
> thread on the local CPU, where the wakeup is running.
>
> Doing the same on epoll would have real scalability 
> advantages where incoming events are IRQ driven and are 
> distributed amongst multiple CPUs.
>
> Where events are task driven the scheduler will already try 
> to pair up waker and wakee so it might not show up in 
> measurements that markedly.
>

Right, so this makes me think that we may want to potentially
support a variety of wakeup policies. Adding these to the
generic wake up code is just going to be too messy. So, perhaps
a better approach here would be to register a single
wait_queue_t with the event source queue that will always
be woken up, and then layer any epoll balancing/irq affinity
policies on top of that. So in essence we end up with sort of
two queue layers, but I think it provides much nicer isolation
between layers. Also, the bulk of the changes are going to be
isolated to the epoll code, and we avoid Andy's concern about
missing, or starving out wakeups.

So here's a stab at how this API could look:

1. ep1 = epoll_create1(EPOLL_POLICY);

So EPOLL_POLICY here could be the round robin policy described
here, or the irq affinity or other ideas. The idea is to create
an fd that is local to the process, such that other processes
can not subsequently attach to it and affect our policy.

2. epoll_ctl(ep1, EPOLL_CTL_ADD, fd_source, NULL);

This associates ep1 with the event source. ep1 can be
associated with or added to at most 1 wakeup source. This call
would largely just form the association, but not queue anything
to the fd_source wait queue.

3. epoll_ctl(ep2, EPOLL_CTL_ADD, ep1, event);
epoll_ctl(ep3, EPOLL_CTL_ADD, ep1, event);
epoll_ctl(ep4, EPOLL_CTL_ADD, ep1, event);
 .
 .
 .

Finally, we add the epoll sets to the event source (indirectly via
ep1). So the first add would actually queue the callback to the
fd_source, while the subsequent calls would simply queue things
to the 'nested' wakeup queue associated with ep1.

So any existing epoll/poll/select calls could be queued as well
to fd_source and will operate independently from this mechanism,
as the fd_source queue continues to be 'wake all'. Also, there
should be no changes necessary to __wake_up_common(), other
than potentially passing more back through the
wait_queue_func_t, such as 'nr_exclusive'.
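Putting the three steps above together, the proposed flow might look roughly like this from userspace. This is a non-compilable sketch: EPOLL_POLICY and the nested-attach semantics are hypothetical parts of this proposal and were never merged.

```
/* Hypothetical sketch of the proposed two-level API -- EPOLL_POLICY
 * and the nesting semantics below are a proposal only and never
 * existed in a released kernel.
 *
 * 1. A process-local epoll fd carrying the wakeup policy. */
int ep1 = epoll_create1(EPOLL_POLICY);

/* 2. Associate ep1 with the shared event source (at most one);
 *    this only forms the association, nothing is queued yet. */
epoll_ctl(ep1, EPOLL_CTL_ADD, fd_source, NULL);

/* 3. Worker epoll sets subscribe indirectly through ep1. The first
 *    add queues the callback on fd_source's wait queue; later adds
 *    only queue on the 'nested' wakeup queue associated with ep1. */
epoll_ctl(ep2, EPOLL_CTL_ADD, ep1, &event);
epoll_ctl(ep3, EPOLL_CTL_ADD, ep1, &event);
```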

Thanks,

-Jason


Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-18 Thread Andy Lutomirski
On Feb 18, 2015 9:38 AM, "Jason Baron"  wrote:
>
> On 02/18/2015 11:33 AM, Ingo Molnar wrote:
> > * Jason Baron  wrote:
> >
> >>> This has two main advantages: firstly it solves the
> >>> O(N) (micro-)problem, but it also more evenly
> >>> distributes events both between task-lists and within
> >>> epoll groups as tasks as well.
>> It's solving 2 issues - spurious wakeups, and more even
> >> loading of threads. The event distribution is more even
> >> between 'epoll groups' with this patch, however, if
> >> multiple threads are blocking on a single 'epoll group',
>> this patch does not affect the event distribution
> >> there. [...]
> > Regarding your last point, are you sure about that?
> >
> > If we have say 16 epoll threads registered, and if the list
> > is static (no register/unregister activity), then the
> > wakeup pattern is in strict order of the list: threads
> > closer to the list head will be woken more frequently, in a
> > wake-once fashion. So if threads do just quick work and go
> > back to sleep quickly, then typically only the first 2-3
> > threads will get any runtime in practice - the wakeup
> > iteration never gets 'deep' into the list.
> >
> > With the round-robin shuffling of the list, the threads get
> > shuffled to the tail on wakeup, which distributes events
> > evenly: all 16 epoll threads will accumulate an even
> > distribution of runtime, statistically.
> >
> > Have I misunderstood this somehow?
> >
> >
>
> So in the case of multiple threads per epoll set, we currently
> add to the head of wakeup queue exclusively in 'epoll_wait()',
> and then subsequently remove from the queue once
> 'epoll_wait()' returns. So I don't think this patch addresses
> balancing on a per epoll set basis.
>
> I think we could address the case you describe by simply doing
> __add_wait_queue_tail_exclusive() instead of
> __add_wait_queue_exclusive() in epoll_wait(). However, I think
> the userspace API change is less clear since epoll_wait() doesn't
> currently have an 'input' events argument as epoll_ctl() does.

FWIW there's currently discussion about adding a new epoll API for
batch epoll_ctl.  It could be worth coordinating with that effort to
see if some variant could address both use cases.

I'm still nervous about changing the per-fd wakeup stuff to do
anything other than waking everything.  After all, epoll and poll can
be used concurrently.

What about a slightly different approach: could an epoll fd support
multiple contexts?  For example, an fd could be set (with epoll_ctl or
the new batch stuff) to wake any epoll waiter, one specific epoll
waiter, an epoll waiter preferably on the waking cpu, etc.

This would have the benefit of keeping the wakeup changes localized to
the epoll code.

--Andy

>
> Thanks,
>
> -Jason


Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-18 Thread Eric Wong
Ingo Molnar  wrote:
> 
> * Ingo Molnar  wrote:
> 
> > > [...] However, I think the userspace API change is less 
> > > clear since epoll_wait() doesn't currently have an 
> > > 'input' events argument as epoll_ctl() does.
> > 
> > ... but the change would be a bit clearer and somewhat 
> > more flexible: LIFO or FIFO queueing, right?
> > 
> > But having the queueing model as part of the epoll 
> > context is a legitimate approach as well.
> 
> Btw., there's another optimization that the networking code 
> already does when processing incoming packets: waking up a 
> thread on the local CPU, where the wakeup is running.
> 
> Doing the same on epoll would have real scalability 
> advantages where incoming events are IRQ driven and are 
> distributed amongst multiple CPUs.

Right.  One thing in the back of my mind has been to have CPU
affinity for epoll.  Either having everything in an epoll set
favor a certain CPU or even having affinity down to the epitem
level (so concurrent epoll_wait callers end up favoring the
same epitems).

I'm not convinced this series is worth doing without a
comparison against my previous suggestion to use a dedicated
thread which only makes blocking accept4 + EPOLL_CTL_ADD calls.

The majority of epoll events in a typical server should not be
for listen sockets, so I'd rather not bloat existing code paths
for them.  For web servers nowadays, maintaining long-lived
connections to avoid handshake overhead is even more beneficial
with increasing HTTPS and HTTP2 adoption; so listen socket
events should become less common.


Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-18 Thread Ingo Molnar

* Ingo Molnar  wrote:

> > [...] However, I think the userspace API change is less 
> > clear since epoll_wait() doesn't currently have an 
> > 'input' events argument as epoll_ctl() does.
> 
> ... but the change would be a bit clearer and somewhat 
> more flexible: LIFO or FIFO queueing, right?
> 
> But having the queueing model as part of the epoll 
> context is a legitimate approach as well.

Btw., there's another optimization that the networking code 
already does when processing incoming packets: waking up a 
thread on the local CPU, where the wakeup is running.

Doing the same on epoll would have real scalability 
advantages where incoming events are IRQ driven and are 
distributed amongst multiple CPUs.

Where events are task driven the scheduler will already try 
to pair up waker and wakee so it might not show up in 
measurements that markedly.

Thanks,

Ingo


Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-18 Thread Ingo Molnar

* Jason Baron  wrote:

> So in the case of multiple threads per epoll set, we 
> currently add to the head of wakeup queue exclusively in 
> 'epoll_wait()', and then subsequently remove from the 
> queue once 'epoll_wait()' returns. So I don't think this 
> patch addresses balancing on a per epoll set basis.

Okay, so I was confused about how the code works.

> I think we could address the case you describe by simply 
> doing __add_wait_queue_tail_exclusive() instead of 
> __add_wait_queue_exclusive() in epoll_wait(). [...]

Yes.

> [...] However, I think the userspace API change is less 
> clear since epoll_wait() doesn't currently have an 
> 'input' events argument as epoll_ctl() does.

... but the change would be a bit clearer and somewhat more 
flexible: LIFO or FIFO queueing, right?

But having the queueing model as part of the epoll context 
is a legitimate approach as well.

Thanks,

Ingo


Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-18 Thread Jason Baron
On 02/18/2015 11:33 AM, Ingo Molnar wrote:
> * Jason Baron  wrote:
>
>>> This has two main advantages: firstly it solves the 
>>> O(N) (micro-)problem, but it also more evenly 
>>> distributes events both between task-lists and within 
>>> epoll groups as tasks as well.
>> It's solving 2 issues - spurious wakeups, and more even 
>> loading of threads. The event distribution is more even 
>> between 'epoll groups' with this patch, however, if 
>> multiple threads are blocking on a single 'epoll group', 
>> this patch does not affect the event distribution 
>> there. [...]
> Regarding your last point, are you sure about that?
>
> If we have say 16 epoll threads registered, and if the list 
> is static (no register/unregister activity), then the 
> wakeup pattern is in strict order of the list: threads 
> closer to the list head will be woken more frequently, in a 
> wake-once fashion. So if threads do just quick work and go 
> back to sleep quickly, then typically only the first 2-3 
> threads will get any runtime in practice - the wakeup 
> iteration never gets 'deep' into the list.
>
> With the round-robin shuffling of the list, the threads get 
> shuffled to the tail on wakeup, which distributes events 
> evenly: all 16 epoll threads will accumulate an even 
> distribution of runtime, statistically.
>
> Have I misunderstood this somehow?
>
>

So in the case of multiple threads per epoll set, we currently
add to the head of wakeup queue exclusively in 'epoll_wait()',
and then subsequently remove from the queue once
'epoll_wait()' returns. So I don't think this patch addresses
balancing on a per epoll set basis.

I think we could address the case you describe by simply doing
__add_wait_queue_tail_exclusive() instead of
__add_wait_queue_exclusive() in epoll_wait(). However, I think
the userspace API change is less clear since epoll_wait() doesn't
currently have an 'input' events argument as epoll_ctl() does.

Thanks,

-Jason


Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-18 Thread Ingo Molnar

* Jason Baron  wrote:

> > This has two main advantages: firstly it solves the 
> > O(N) (micro-)problem, but it also more evenly 
> > distributes events both between task-lists and within 
> > epoll groups as tasks as well.
> 
> It's solving 2 issues - spurious wakeups, and more even 
> loading of threads. The event distribution is more even 
> between 'epoll groups' with this patch, however, if 
> multiple threads are blocking on a single 'epoll group', 
> this patch does not affect the event distribution 
> there. [...]

Regarding your last point, are you sure about that?

If we have say 16 epoll threads registered, and if the list 
is static (no register/unregister activity), then the 
wakeup pattern is in strict order of the list: threads 
closer to the list head will be woken more frequently, in a 
wake-once fashion. So if threads do just quick work and go 
back to sleep quickly, then typically only the first 2-3 
threads will get any runtime in practice - the wakeup 
iteration never gets 'deep' into the list.

With the round-robin shuffling of the list, the threads get 
shuffled to the tail on wakeup, which distributes events 
evenly: all 16 epoll threads will accumulate an even 
distribution of runtime, statistically.

Have I misunderstood this somehow?

Thanks,

Ingo


Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-18 Thread Jason Baron
On 02/18/2015 03:07 AM, Ingo Molnar wrote:
> * Jason Baron  wrote:
>
>> Epoll file descriptors that are added to a shared wakeup 
>> source are always added in a non-exclusive manner. That 
>> means that when we have multiple epoll fds attached to a 
>> shared wakeup source they are all woken up. This can lead 
>> to excessive cpu usage and uneven load distribution.
>>
>> This patch introduces two new 'events' flags that are 
>> intended to be used with EPOLL_CTL_ADD operations. 
>> EPOLLEXCLUSIVE, adds the epoll fd to the event source in 
>> an exclusive manner such that the minimum number of 
>> threads are woken. EPOLLROUNDROBIN, which depends on 
>> EPOLLEXCLUSIVE also being set, can also be added to the 
>> 'events' flag, such that we round robin through the set 
>> of waiting threads.
>>
>> An implementation note is that in the epoll wakeup 
>> routine, 'ep_poll_callback()', if EPOLLROUNDROBIN is set, 
>> we return 1, for a successful wakeup, only when there are 
>> current waiters. The idea is to use this additional 
>> heuristic in order to minimize wakeup latencies. 
>>
>> Signed-off-by: Jason Baron 
>> ---
>>  fs/eventpoll.c | 25 ++++++++++++++++++++-----
>>  include/uapi/linux/eventpoll.h |  6 ++++++
>>  2 files changed, 26 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
>> index d77f944..382c832 100644
>> --- a/fs/eventpoll.c
>> +++ b/fs/eventpoll.c
>> @@ -92,7 +92,8 @@
>>   */
>>  
>>  /* Epoll private bits inside the event mask */
>> -#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET)
>> +#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | \
>> + EPOLLEXCLUSIVE | EPOLLROUNDROBIN)
>>  
>>  /* Maximum number of nesting allowed inside epoll sets */
>>  #define EP_MAX_NESTS 4
>> @@ -1002,6 +1003,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
>>  unsigned long flags;
>>  struct epitem *epi = ep_item_from_wait(wait);
>>  struct eventpoll *ep = epi->ep;
>> +int ewake = 0;
>>  
>>  if ((unsigned long)key & POLLFREE) {
>>  ep_pwq_from_wait(wait)->whead = NULL;
>> @@ -1066,8 +1068,10 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
>>   * Wake up ( if active ) both the eventpoll wait list and the ->poll()
>>   * wait list.
>>   */
>> -if (waitqueue_active(&ep->wq))
>> +if (waitqueue_active(&ep->wq)) {
>> +ewake = 1;
>>  wake_up_locked(&ep->wq);
>> +}
>>  if (waitqueue_active(&ep->poll_wait))
>>  pwake++;
>>  
>> @@ -1078,6 +1082,8 @@ out_unlock:
>>  if (pwake)
>>  ep_poll_safewake(&ep->poll_wait);
>>  
>> +if (epi->event.events & EPOLLROUNDROBIN)
>> +return ewake;
>>  return 1;
>>  }
>>  
>> @@ -1095,7 +1101,12 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
>>  init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
>>  pwq->whead = whead;
>>  pwq->base = epi;
>> -add_wait_queue(whead, &pwq->wait);
>> +if (epi->event.events & EPOLLROUNDROBIN)
>> +add_wait_queue_rr(whead, &pwq->wait);
>> +else if (epi->event.events & EPOLLEXCLUSIVE)
>> +add_wait_queue_exclusive(whead, &pwq->wait);
>> +else
>> +add_wait_queue(whead, &pwq->wait);
>>  list_add_tail(&pwq->llink, &epi->pwqlist);
>>  epi->nwait++;
>>  } else {
>> @@ -1820,8 +1831,7 @@ SYSCALL_DEFINE1(epoll_create, int, size)
>>  SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
>>  struct epoll_event __user *, event)
>>  {
>> -int error;
>> -int full_check = 0;
>> +int error, full_check = 0, wait_flags = 0;
>>  struct fd f, tf;
>>  struct eventpoll *ep;
>>  struct epitem *epi;
>> @@ -1861,6 +1871,11 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
>>  if (f.file == tf.file || !is_file_epoll(f.file))
>>  goto error_tgt_fput;
>>  
>> +wait_flags = epds.events & (EPOLLEXCLUSIVE | EPOLLROUNDROBIN);
>> +if (wait_flags && ((op == EPOLL_CTL_MOD) || ((op == EPOLL_CTL_ADD) &&
>> +((wait_flags == EPOLLROUNDROBIN) || (is_file_epoll(tf.file))
>> +goto error_tgt_fput;
>> +
>>  /*
>>   * At this point it is safe to assume that the "private_data" contains
>>   * our own data structure.
>> diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
>> index bc81fb2..10260a1 100644
>> --- a/include/uapi/linux/eventpoll.h
>> +++ b/include/uapi/linux/eventpoll.h
>> @@ -26,6 +26,12 @@
>>  #define EPOLL_CTL_DEL 2
>>  #define EPOLL_CTL_MOD 3
>>  
>> +/* Balance wakeups for a shared event source */
>> +#define EPOLLROUNDROBIN (1 << 27)
>> +
>> +/* Add exclusively */
>> +#define EPOLLEXCLUSIVE (1 << 28)
>> +
>>  /*
>>   * Request the handling of system wakeup events so as to prevent system
>>   * suspends from happening while those events are being processed.

Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-18 Thread Ingo Molnar

* Jason Baron  wrote:

> Epoll file descriptors that are added to a shared wakeup 
> source are always added in a non-exclusive manner. That 
> means that when we have multiple epoll fds attached to a 
> shared wakeup source they are all woken up. This can lead 
> to excessive cpu usage and uneven load distribution.
> 
> This patch introduces two new 'events' flags that are 
> intended to be used with EPOLL_CTL_ADD operations. 
> EPOLLEXCLUSIVE, adds the epoll fd to the event source in 
> an exclusive manner such that the minimum number of 
> threads are woken. EPOLLROUNDROBIN, which depends on 
> EPOLLEXCLUSIVE also being set, can also be added to the 
> 'events' flag, such that we round robin through the set 
> of waiting threads.
> 
> An implementation note is that in the epoll wakeup 
> routine, 'ep_poll_callback()', if EPOLLROUNDROBIN is set, 
> we return 1, for a successful wakeup, only when there are 
> current waiters. The idea is to use this additional 
> heuristic in order to minimize wakeup latencies. 
> 
> Signed-off-by: Jason Baron 
> ---
>  fs/eventpoll.c | 25 ++++++++++++++++++++-----
>  include/uapi/linux/eventpoll.h |  6 ++++++
>  2 files changed, 26 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index d77f944..382c832 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -92,7 +92,8 @@
>   */
>  
>  /* Epoll private bits inside the event mask */
> -#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET)
> +#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | \
> +  EPOLLEXCLUSIVE | EPOLLROUNDROBIN)
>  
>  /* Maximum number of nesting allowed inside epoll sets */
>  #define EP_MAX_NESTS 4
> @@ -1002,6 +1003,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
>   unsigned long flags;
>   struct epitem *epi = ep_item_from_wait(wait);
>   struct eventpoll *ep = epi->ep;
> + int ewake = 0;
>  
>   if ((unsigned long)key & POLLFREE) {
>   ep_pwq_from_wait(wait)->whead = NULL;
> @@ -1066,8 +1068,10 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
>* Wake up ( if active ) both the eventpoll wait list and the ->poll()
>* wait list.
>*/
> - if (waitqueue_active(&ep->wq))
> + if (waitqueue_active(&ep->wq)) {
> + ewake = 1;
>   wake_up_locked(&ep->wq);
> + }
>   if (waitqueue_active(&ep->poll_wait))
>   pwake++;
>  
> @@ -1078,6 +1082,8 @@ out_unlock:
>   if (pwake)
>   ep_poll_safewake(&ep->poll_wait);
>  
> + if (epi->event.events & EPOLLROUNDROBIN)
> + return ewake;
>   return 1;
>  }
>  
> @@ -1095,7 +1101,12 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
>   init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
>   pwq->whead = whead;
>   pwq->base = epi;
> - add_wait_queue(whead, &pwq->wait);
> + if (epi->event.events & EPOLLROUNDROBIN)
> + add_wait_queue_rr(whead, &pwq->wait);
> + else if (epi->event.events & EPOLLEXCLUSIVE)
> + add_wait_queue_exclusive(whead, &pwq->wait);
> + else
> + add_wait_queue(whead, &pwq->wait);
>   list_add_tail(&pwq->llink, &epi->pwqlist);
>   epi->nwait++;
>   } else {
> @@ -1820,8 +1831,7 @@ SYSCALL_DEFINE1(epoll_create, int, size)
>  SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
>   struct epoll_event __user *, event)
>  {
> - int error;
> - int full_check = 0;
> + int error, full_check = 0, wait_flags = 0;
>   struct fd f, tf;
>   struct eventpoll *ep;
>   struct epitem *epi;
> @@ -1861,6 +1871,11 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
>   if (f.file == tf.file || !is_file_epoll(f.file))
>   goto error_tgt_fput;
>  
> + wait_flags = epds.events & (EPOLLEXCLUSIVE | EPOLLROUNDROBIN);
> + if (wait_flags && ((op == EPOLL_CTL_MOD) || ((op == EPOLL_CTL_ADD) &&
> + ((wait_flags == EPOLLROUNDROBIN) || (is_file_epoll(tf.file))
> + goto error_tgt_fput;
> +
>   /*
>* At this point it is safe to assume that the "private_data" contains
>* our own data structure.
> diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
> index bc81fb2..10260a1 100644
> --- a/include/uapi/linux/eventpoll.h
> +++ b/include/uapi/linux/eventpoll.h
> @@ -26,6 +26,12 @@
>  #define EPOLL_CTL_DEL 2
>  #define EPOLL_CTL_MOD 3
>  
> +/* Balance wakeups for a shared event source */
> +#define EPOLLROUNDROBIN (1 << 27)
> +
> +/* Add exclusively */
> +#define EPOLLEXCLUSIVE (1 << 28)
> +
>  /*
>   * Request the handling of system wakeup events so as to prevent system 
> suspends
>   * from happening while those events are being processed.
