On 5/5/20 4:40 AM, Roman Penyaev wrote:
> The original problem was described here:
>    https://lkml.org/lkml/2020/4/27/1121
> 
> There is a possible race when ep_scan_ready_list() leaves ->rdllist
> and ->ovflist empty for a short period of time although some events
> are pending. It is quite likely that ep_events_available() observes
> the empty lists and goes to sleep. Since 339ddb53d373 ("fs/epoll:
> remove unnecessary wakeups of nested epoll") we are conservative in
> wakeups (there is only one place for a wakeup, and that is
> ep_poll_callback()), thus ep_events_available() must always observe
> the correct state of the two lists. The easiest and correct way is to
> do the final check under the lock. This does not impact performance,
> since the lock is taken anyway for adding a wait entry to the wait
> queue.
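> 
> For illustration, a rough sketch (paraphrased, not the exact upstream
> code) of the window in ep_scan_ready_list() that a lockless
> ep_events_available() can hit:
> 
>     /* writer side, ep_scan_ready_list(), steps run under ep->lock */
>     list_splice_init(&ep->rdllist, &txlist); /* ->rdllist now empty  */
>     WRITE_ONCE(ep->ovflist, NULL);           /* new events land here */
>     /* ... the ready list is scanned with the lock dropped ... */
>     WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
>     list_splice(&txlist, &ep->rdllist);      /* events put back      */
> 
>     /* lockless reader, ep_events_available() */
>     return !list_empty_careful(&ep->rdllist) ||
>            READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
> 
> A reader that samples both fields between the last two writer steps
> sees an empty ->rdllist and ->ovflist == EP_UNACTIVE_PTR, i.e. "no
> events pending", although the events still sit on the local txlist.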
> 
> In this patch the barrierless __set_current_state() is used. This is
> safe since waitqueue_active() is called under the same lock on the
> wakeup side.
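> 
> A minimal sketch of that pairing, assuming the upstream wakeup path
> in ep_poll_callback(), which takes the read side of ep->lock:
> 
>     /* sleeper side, ep_poll() */
>     write_lock_irq(&ep->lock);
>     __set_current_state(TASK_INTERRUPTIBLE);
>     if (!ep_events_available(ep))
>             __add_wait_queue_exclusive(&ep->wq, &wait);
>     write_unlock_irq(&ep->lock);
> 
>     /* wakeup side, ep_poll_callback() */
>     read_lock_irqsave(&ep->lock, flags);
>     /* ... link the new event into ->rdllist or ->ovflist ... */
>     if (waitqueue_active(&ep->wq))
>             wake_up(&ep->wq);
>     read_unlock_irqrestore(&ep->lock, flags);
> 
> Since the read and write sides of ep->lock exclude each other, the
> waker either sees the sleeper already queued on ->wq, or the sleeper
> sees the new event before committing to sleep, so the implicit memory
> barrier of set_current_state() is not required here.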
> 
> The short-circuit for fatal signals (i.e. the fatal_signal_pending()
> check) is moved to the line just before the actual event-harvesting
> routine. This is fully compliant with what is said in the comment of
> the patch where the fatal_signal_pending() check was originally
> added: c257a340ede0 ("fs, epoll: short circuit fetching events if
> thread has been killed").
> 
> Signed-off-by: Roman Penyaev <rpeny...@suse.de>
> Reported-by: Jason Baron <jba...@akamai.com>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Cc: Khazhismel Kumykov <kha...@google.com>
> Cc: Alexander Viro <v...@zeniv.linux.org.uk>
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: sta...@vger.kernel.org
> ---
>  fs/eventpoll.c | 48 ++++++++++++++++++++++++++++--------------------
>  1 file changed, 28 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index aba03ee749f8..8453e5403283 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -1879,34 +1879,33 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>                * event delivery.
>                */
>               init_wait(&wait);
> -             write_lock_irq(&ep->lock);
> -             __add_wait_queue_exclusive(&ep->wq, &wait);
> -             write_unlock_irq(&ep->lock);
>  
> +             write_lock_irq(&ep->lock);
>               /*
> -              * We don't want to sleep if the ep_poll_callback() sends us
> -              * a wakeup in between. That's why we set the task state
> -              * to TASK_INTERRUPTIBLE before doing the checks.
> +              * Barrierless variant, waitqueue_active() is called under
> +              * the same lock on wakeup ep_poll_callback() side, so it
> +              * is safe to avoid an explicit barrier.
>                */
> -             set_current_state(TASK_INTERRUPTIBLE);
> +             __set_current_state(TASK_INTERRUPTIBLE);
> +
>               /*
> -              * Always short-circuit for fatal signals to allow
> -              * threads to make a timely exit without the chance of
> -              * finding more events available and fetching
> -              * repeatedly.
> +              * Do the final check under the lock. ep_scan_ready_list()
> +              * plays with two lists (->rdllist and ->ovflist) and there
> +              * is always a race when both lists are empty for short
> +              * period of time although events are pending, so lock is
> +              * important.
>                */
> -             if (fatal_signal_pending(current)) {
> -                     res = -EINTR;
> -                     break;
> +             eavail = ep_events_available(ep);
> +             if (!eavail) {
> +                     if (signal_pending(current))
> +                             res = -EINTR;
> +                     else
> +                             __add_wait_queue_exclusive(&ep->wq, &wait);
>               }
> +             write_unlock_irq(&ep->lock);
>  
> -             eavail = ep_events_available(ep);
> -             if (eavail)
> -                     break;
> -             if (signal_pending(current)) {
> -                     res = -EINTR;
> +             if (eavail || res)
>                       break;
> -             }
>  
>               if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS)) {
>                       timed_out = 1;
> @@ -1927,6 +1926,15 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>       }
>  
>  send_events:
> +     if (fatal_signal_pending(current))
> +             /*
> +              * Always short-circuit for fatal signals to allow
> +              * threads to make a timely exit without the chance of
> +              * finding more events available and fetching
> +              * repeatedly.
> +              */
> +             res = -EINTR;
> +
>       /*
>        * Try to transfer events to user space. In case we get 0 events and
>        * there's still timeout left over, we go trying again in search of
> 

Hi Roman,

Looks good, feel free to add:
Reviewed-by: Jason Baron <jba...@akamai.com>

I think we should also add the fixes tag to assist stable backports:
Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll")

Thanks,

-Jason
