On Sun, May 3, 2026 at 3:24 PM Nam Cao <[email protected]> wrote:
>
> Mateusz Guzik <[email protected]> writes:
> > Strictly speaking more error prone than the seq approach, but should be
> > faster on weaker-ordered archs thanks to avoided fences.
> >
> > I'm definitely not going to protest the seqc route.
>
> Linus probably wouldn't be thrilled if I break epoll again, so let's
> stay with the simpler seqcount route.
>
> Nam
>
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index a3090b446af1..22c3f0186476 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -38,6 +38,7 @@
>  #include <linux/compat.h>
>  #include <linux/rculist.h>
>  #include <linux/capability.h>
> +#include <linux/seqlock.h>
>  #include <net/busy_poll.h>
>
>  /*
> @@ -190,6 +191,9 @@ struct eventpoll {
>         /* Lock which protects rdllist and ovflist */
>         spinlock_t lock;
>
> +       /* Protect switching between rdllist and ovflist */
> +       seqcount_spinlock_t seq;
> +
>         /* RB tree root used to store monitored fd structs */
>         struct rb_root_cached rbr;
>
> @@ -382,8 +386,17 @@ static inline struct epitem 
> *ep_item_from_wait(wait_queue_entry_t *p)
>   */
>  static inline int ep_events_available(struct eventpoll *ep)
>  {
> -       return !list_empty_careful(&ep->rdllist) ||
> -               READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
> +       bool events_available;
> +       unsigned int seq;
> +
> +       do {
> +               seq = read_seqcount_begin(&ep->seq);
> +
> +               events_available = !list_empty_careful(&ep->rdllist) ||
> +                                  READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
> +       } while (read_seqcount_retry(&ep->seq, seq));
> +
> +       return events_available;
>  }
>
>  #ifdef CONFIG_NET_RX_BUSY_POLL
> @@ -735,8 +748,12 @@ static void ep_start_scan(struct eventpoll *ep, struct 
> list_head *txlist)
>          */
>         lockdep_assert_irqs_enabled();
>         spin_lock_irq(&ep->lock);
> +       write_seqcount_begin(&ep->seq);
> +
>         list_splice_init(&ep->rdllist, txlist);
>         WRITE_ONCE(ep->ovflist, NULL);
> +
> +       write_seqcount_end(&ep->seq);
>         spin_unlock_irq(&ep->lock);
>  }
>
> @@ -768,6 +785,9 @@ static void ep_done_scan(struct eventpoll *ep,
>                         ep_pm_stay_awake(epi);
>                 }
>         }
> +
> +       write_seqcount_begin(&ep->seq);
> +
>         /*
>          * We need to set back ep->ovflist to EP_UNACTIVE_PTR, so that after
>          * releasing the lock, events will be queued in the normal way inside
> @@ -779,6 +799,9 @@ static void ep_done_scan(struct eventpoll *ep,
>          * Quickly re-inject items left on "txlist".
>          */
>         list_splice(txlist, &ep->rdllist);
> +
> +       write_seqcount_end(&ep->seq);
> +
>         __pm_relax(ep->ws);
>
>         if (!list_empty(&ep->rdllist)) {
> @@ -1155,6 +1178,7 @@ static int ep_alloc(struct eventpoll **pep)
>
>         mutex_init(&ep->mtx);
>         spin_lock_init(&ep->lock);
> +       seqcount_spinlock_init(&ep->seq, &ep->lock);
>         init_waitqueue_head(&ep->wq);
>         init_waitqueue_head(&ep->poll_wait);
>         INIT_LIST_HEAD(&ep->rdllist);

Apologies for late reply.

The diff reads ok to me, but I would consider not looping in case of
seq mismatch.

So does it solve the problem?

I think a somewhat of a blocker here would be to bench the thing -- I
would expect some slowdown compared to stock kernel, but should be
still be faster than the previously proposed patch.

Reply via email to