Re: [PATCH v2] eventpoll: fix missing wakeup for ovflist in ep_poll_callback

2020-04-29 Thread Roman Penyaev

On 2020-04-29 06:12, Jason Baron wrote:

On 4/28/20 2:10 PM, Roman Penyaev wrote:

On 2020-04-27 22:38, Jason Baron wrote:

On 4/25/20 4:59 PM, Khazhismel Kumykov wrote:
On Sat, Apr 25, 2020 at 9:17 AM Jason Baron  
wrote:




On 4/24/20 3:00 PM, Khazhismel Kumykov wrote:
In the event that we add to ovflist, before 339ddb53d373 we would be
woken up by ep_scan_ready_list, and did no wakeup in ep_poll_callback.
With that wakeup removed, if we add to ovflist here, we may never wake
up. Rather than adding back the ep_scan_ready_list wakeup - which was
resulting in unnecessary wakeups, trigger a wake-up in ep_poll_callback.


I'm just curious which 'wakeup' we are talking about here? There is
wake_up(&ep->wq) for the 'ep' and then there is the nested one via
ep_poll_safewake(ep, epi). It seems to me that it's only about the latter
one being missing, not both? Is your workload using nested epoll?

If so, it might make sense to just do the latter, since the point of
the original patch was to minimize unnecessary wakeups.


The missing wake-ups were when we added to ovflist instead of rdllist.
Both are "the ready list" together - so I'd think we'd want the same
wakeups regardless of which specific list we added to.
ep_poll_callback isn't nested specific?



So I was thinking that ep_poll() would see these events on the
ovflist without an explicit wakeup, b/c the overflow list being active
implies that the ep_poll() path would add them to the rdllist in
ep_scan_ready_list(). Thus, it will see the events either in the
current ep_poll() context or via a subsequent syscall to epoll_wait().


However, there are other paths that can call ep_scan_ready_list(), thus
I agree with you that both wakeups here are necessary.

I do think there are still (smaller) potential races in
ep_scan_ready_list(), where we have:

    write_lock_irq(&ep->lock);
    list_splice_init(&ep->rdllist, &txlist);
    WRITE_ONCE(ep->ovflist, NULL);
    write_unlock_irq(&ep->lock);

And in the ep_poll path we have:

static inline int ep_events_available(struct eventpoll *ep)
{
    return !list_empty_careful(&ep->rdllist) ||
           READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
}


Seems to me that first bit needs to be the following, since
ep_events_available() is now checked in a lockless way:


    write_lock_irq(&ep->lock);
    WRITE_ONCE(ep->ovflist, NULL);
    smp_wmb();
    list_splice_init(&ep->rdllist, &txlist);
    write_unlock_irq(&ep->lock);



Hi Jason,

For the first chunk you refer to, the order seems irrelevant.
Either you see something not empty, take the lock, and then check
the lists under the lock, or you see that all lists are empty.



Hi Roman,

It does matter. Let's say we have:

epfd1->epfd2->socket. And thread a is doing an
epoll_wait() on epfd1, and thread b is doing
epoll_wait on epfd2. then:

1) data comes in on socket

ep_poll_callback() wakes up threads a and b

2) thread a runs

ep_poll()
 ep_scan_ready_list()
  ep_send_events_proc()
   ep_item_poll()
 ep_scan_ready_list()
   list_splice_init(>rdllist, );

3) now thread b is running

ep_poll()
 ep_events_available()
   returns false
 schedule_hrtimeout_range()

Thus, thread b has missed a wakeup and will never
get it.


Similarly, for the second chunk I referenced.


Hi Jason,

Yes, that makes sense.



And also this bit:

    WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
    /*
     * Quickly re-inject items left on "txlist".
     */
    list_splice(&txlist, &ep->rdllist);

Should, I think, better be reversed as well to:

    list_splice(&txlist, &ep->rdllist);
    smp_wmb();
    WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);


But this one is much more interesting.  I understand what you
are trying to achieve: we can't leave both lists empty even for a
short period of time; if there is something left, the caller of
ep_events_available() should be able to see that one of the lists
is not empty, otherwise it can be too late.

But the problem you've spotted is much worse. Some remains
can be left in the txlist (this can happen if the caller of epoll_wait
wants to harvest only 1 event, but there are more in the ->rdllist).
And we again get a lost wakeup.

The problem is reproduced by the test below.  The idea is simple:
we have 10 threads and 10 event fds. Each thread can harvest
only 1 event. 1 producer fires all 10 events at once and waits
until all 10 events have been observed by the 10 threads.

The fix is basically a revert of 339ddb53d373 with 1 major
exception: we do wakeups from ep_scan_ready_list(), but only
if the txlist is not empty && ep_scan_ready_list() is called
from the routine which sends events, not the one which reads them
(thus we protect ourselves from repeated wakeups).

I will send the code a bit later.


This was discussed as part of the original discussion around
339ddb53d373: https://lkml.org/lkml/2019/10/7/905


True! This is the exact scenario which is covered by the
test from my previous email. And the test fails. I forgot
about this discussion.


The context was more a performance difference rather than
a 

Re: [PATCH v2] eventpoll: fix missing wakeup for ovflist in ep_poll_callback

2020-04-28 Thread Roman Penyaev

On 2020-04-27 22:38, Jason Baron wrote:

And also this bit:

    WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);

    /*
     * Quickly re-inject items left on "txlist".
     */
    list_splice(&txlist, &ep->rdllist);

Should, I think, better be reversed as well to:

    list_splice(&txlist, &ep->rdllist);
    smp_wmb();
    WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);


But this one is much more interesting.  I understand what you
are trying to achieve: we can't leave both lists empty even for a
short period of time; if there is something left, the caller of
ep_events_available() should be able to see that one of the lists
is not empty, otherwise it can be too late.

But the problem you've spotted is much worse. Some remains
can be left in the txlist (this can happen if the caller of epoll_wait
wants to harvest only 1 event, but there are more in the ->rdllist).
And we again get a lost wakeup.

The problem is reproduced by the test below.  The idea is simple:
we have 10 threads and 10 event fds. Each thread can harvest
only 1 event. 1 producer fires all 10 events at once and waits
until all 10 events have been observed by the 10 threads.

The fix is basically a revert of 339ddb53d373 with 1 major
exception: we do wakeups from ep_scan_ready_list(), but only
if the txlist is not empty && ep_scan_ready_list() is called
from the routine which sends events, not the one which reads them
(thus we protect ourselves from repeated wakeups).

I will send the code a bit later.

--
Roman

---- test ----

enum {
	EPOLL60_EVENTS_NR = 10,
};

struct epoll60_ctx {
	volatile int stopped;
	int ready;
	int waiters;
	int epfd;
	int evfd[EPOLL60_EVENTS_NR];
};

static inline int count_waiters(struct epoll60_ctx *ctx)
{
	return __atomic_load_n(&ctx->waiters, __ATOMIC_ACQUIRE);
}

static void *epoll60_wait_thread(void *ctx_)
{
	struct epoll60_ctx *ctx = ctx_;
	struct epoll_event e;
	uint64_t v;
	int ret;

	while (!ctx->stopped) {
		/* Mark we are ready */
		__atomic_fetch_add(&ctx->ready, 1, __ATOMIC_ACQUIRE);

		/* Start when all are ready */
		while (__atomic_load_n(&ctx->ready, __ATOMIC_ACQUIRE) &&
		       !ctx->stopped)
			;

		/* Account this waiter */
		__atomic_fetch_add(&ctx->waiters, 1, __ATOMIC_ACQUIRE);
again_wait:
		ret = epoll_wait(ctx->epfd, &e, 1, 1000);
		if (ret != 1) {
			/* Should be stopped,