mutex in anon_pipe_write

Mateusz Guzik Thu, 21 May 2026 10:29:18 -0700

On Thu, May 21, 2026 at 5:38 PM Breno Leitao <[email protected]> wrote:
>
> Also, I am not reusing the leftover pages back to pipe->tmp_page, given
> I really don't expect to have much leftovers, but, I am happy to do so,
> in case you think this would be useful.
>


Since on the allocation side you do take from tmp_page, a big write
ends up depleting it, afterwards you free up whatever preallocated
pages.

If this is followed by a small write, that small write now has to
suffer allocation *with* the lock held, bringing the original problem.
Easiest way out is to take from preallocated pages before looking at
tmp_page.

I do suspect the best way forward in the long run is repopulate the
tmp_page array if you have pages to do so.

> commit 2394068de8ee0ab9e582ce9776990ed695397553
> Author: Breno Leitao <[email protected]>
> Date:   Fri May 15 01:35:13 2026 -0700
>
>     fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write
>
>     anon_pipe_write() takes pipe->mutex and then, from the per-iteration
>     anon_pipe_get_page() helper, used to call alloc_page(GFP_HIGHUSER |
>     __GFP_ACCOUNT) once per page while still holding it. That allocation
>     can sleep doing direct reclaim and/or runs memcg charging, which extends
>     the critical section and stalls a concurrent reader on the very same
>     mutex.
>
>     For writes that span more than one full page, bulk pre-allocate up to
>     PIPE_PREALLOC_MAX (8) pages with alloc_pages_bulk() before taking
>     pipe->mutex. anon_pipe_get_page() consumes from pipe->tmp_page[] first,
>     then from the prealloc array, and only falls back to alloc_page() under
>     the mutex for writes larger than PIPE_PREALLOC_MAX pages (rare) or for
>     single-page writes. Sub-PAGE_SIZE writes are unchanged: the merge path
>     handles them without needing a fresh page.
>
>     The prealloc array, its count, and the helpers operating on it live in
>     struct anon_pipe_prealloc:
>
>       - anon_pipe_get_page_prealloc() issues a single alloc_pages_bulk()
>         call before the mutex is taken.
>       - anon_pipe_free_pages() runs after mutex_unlock() and put_page()s
>         any leftover prealloc pages, keeping put_page() off the critical
>         section. Leftovers happen when the per-pipe tmp_page[] cache or the
>         merge path covered some of what we preallocated, or when the write
>         exits early.
>
>     This can improve the pipe throughput up to 48% and reduce the latency in
>     33%.
>
>     Suggested-by: Mateusz Guzik <[email protected]>

I have not suggested preallocation, so this credit is not warranted.

>     Signed-off-by: Breno Leitao <[email protected]>
>
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 9841648c9cf3e..a6441c44f1e3f 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -111,7 +111,32 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
>         pipe_lock(pipe2);
>  }
>
> -static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe)
> +#define PIPE_PREALLOC_MAX 8
> +
> +struct anon_pipe_prealloc {
> +       struct page *pages[PIPE_PREALLOC_MAX];
> +       unsigned int count;
> +};
> +
> +/*
> + * Bulk pre-allocate pages outside pipe->mutex. Callers must gate on
> + * total_len > PAGE_SIZE; single-page writes are handled by the in-lock
> + * alloc_page() fallback. Any pages not consumed by the write loop are
> + * freed by anon_pipe_free_pages() after the mutex is dropped.
> + */
> +static void anon_pipe_get_page_prealloc(struct anon_pipe_prealloc *prealloc,
> +                                       size_t total_len)
> +{
> +       unsigned int want = min_t(unsigned int,
> +                                 DIV_ROUND_UP(total_len, PAGE_SIZE),
> +                                 PIPE_PREALLOC_MAX);
> +
> +       prealloc->count = alloc_pages_bulk(GFP_HIGHUSER | __GFP_ACCOUNT,
> +                                          want, prealloc->pages);
> +}
> +
> +static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe,
> +                                      struct anon_pipe_prealloc *prealloc)
>  {
>         for (int i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
>                 if (pipe->tmp_page[i]) {
> @@ -121,6 +146,15 @@ static struct page *anon_pipe_get_page(struct 
> pipe_inode_info *pipe)
>                 }
>         }
>
> +       if (prealloc->count) {
> +               unsigned int idx = prealloc->count - 1;
> +               struct page *page = prealloc->pages[idx];
> +
> +               prealloc->pages[idx] = NULL;
> +               prealloc->count = idx;
> +               return page;
> +       }

I would drop the idx var.

*maybe* swap macro can help here

> +
>         return alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
>  }
>
> @@ -139,6 +173,17 @@ static void anon_pipe_put_page(struct pipe_inode_info 
> *pipe,
>         put_page(page);
>  }
>
> +/* Called after pipe->mutex is dropped, to keep put_page() off the critical 
> section. */
> +static void anon_pipe_free_pages(struct anon_pipe_prealloc *prealloc)
> +{
> +       while (prealloc->count) {
> +               unsigned int idx = prealloc->count - 1;
> +
> +               put_page(prealloc->pages[idx]);
> +               prealloc->count = idx;
> +       }
> +}
> +
>  static void anon_pipe_buf_release(struct pipe_inode_info *pipe,
>                                   struct pipe_buffer *buf)
>  {
> @@ -438,6 +483,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>         ssize_t chars;
>         bool was_empty = false;
>         bool wake_next_writer = false;
> +       struct anon_pipe_prealloc prealloc = {};
>

I forgot this bit is going to be a problem with gcc. I verified it
emits rep stosq in place, which is going to completely unnecessarily
slow things down especially on older uarchs. This is a known bug with
gcc doing terrible job optimizing this.

The problem will be avoided by merely initializing the count to 0.
which looks kind of ugly if done here, but see below.

>         /*
>          * Reject writing to watch queue pipes before the point where we lock
> @@ -455,6 +501,18 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter 
> *from)
>         if (unlikely(total_len == 0))
>                 return 0;
>
> +       /*
> +        * Bulk pre-allocate pages outside pipe->mutex for writes that span 
> more
> +        * than one full page. alloc_page() with GFP_HIGHUSER may sleep doing
> +        * reclaim and runs memcg charging, so doing it under the mutex
> +        * extends the critical section and stalls the reader. The merge path
> +        * handles sub-PAGE_SIZE writes without a fresh page; single-page and
> +        * >PIPE_PREALLOC_MAX-page writes fall back to alloc_page() under the
> +        * mutex for the remainder.
> +        */
> +       if (total_len > PAGE_SIZE)
> +               anon_pipe_get_page_prealloc(&prealloc, total_len);
> +

I don't think this comment belongs here, it should move above the
prealloc routine.

How about this: anon_pipe_get_page_prealloc gets called
unconditionally and expects an uninitialized prealloc struct. For
total_len > PAGE_SIZE you roll with the current code. Otherwise you
just set ->count to 0, which prevents the ->pages array from being
looked at. You can even pre-set to 0 on entry, just don't memset the
entire obj.

>         mutex_lock(&pipe->mutex);
>
>         if (!pipe->readers) {
> @@ -512,7 +570,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>                         struct page *page;
>                         int copied;
>
> -                       page = anon_pipe_get_page(pipe);
> +                       page = anon_pipe_get_page(pipe, &prealloc);
>                         if (unlikely(!page)) {
>                                 if (!ret)
>                                         ret = -ENOMEM;
> @@ -579,6 +637,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>         if (pipe_is_full(pipe))
>                 wake_next_writer = false;
>         mutex_unlock(&pipe->mutex);
> +       anon_pipe_free_pages(&prealloc);
>
>         /*
>          * If we do do a wakeup event, we do a 'sync' wakeup, because we

Re: [PATCH 1/2] fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write

Reply via email to