On 24/03/2026 09:31, Chris Down wrote:
readlinebuffer_delim is used internally by many major coreutils
programs, like uniq, comm, join, nl, and shuf. The previous
implementation read input one byte at a time via getc(). Although getc
is remapped to getc_unlocked (so we avoid stdio locking), the per-byte
function call overhead is still significant. For example, a 200 MB file
with 5M lines incurs 200 million getc_unlocked calls, and those clearly
dominate perf profiles, accounting for 91% of uniq runtime and 78% of
comm runtime.
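The heart of the old loop (simplified from the code removed in the diff
below) is:

  int c;
  do
    {
      c = getc (stream);   /* one function call per input byte */
      if (c == EOF)
        break;             /* (the real code also checks ferror and
                              synthesises a trailing delimiter) */
      *p++ = c;            /* (buffer-growth check elided) */
    }
  while (c != delimiter);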
I think we can do this more efficiently. First, we can use freadptr() to
peek at stdio's internal read buffer without copying; it resolves to a
direct struct member access on glibc (_IO_read_ptr/_IO_read_end), on BSD
(_p/_r), and on most other platforms.
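For reference, freadptr's contract (per gnulib's freadptr.h) is:

  /* If STREAM has buffered input available, return a pointer to it and
     set *SIZEP to the number of buffered bytes; return NULL if nothing
     is buffered or the buffer cannot be accessed on this platform.  */
  extern const char *freadptr (FILE *stream, size_t *sizep);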
Second, we can use memchr() to scan the available buffered data for the
delimiter. On modern glibc memchr() is SIMD optimised (even without
-march=native, since the implementation is selected at load time via
IFUNC), so the per-byte calls collapse into a far smaller number of
vectorised scans.
Third, once the delimiter position is known, we can use fread() (which
is remapped to fread_unlocked() on glibc) to consume exactly those bytes
from the stream into our linebuffer. Since the data is already in the
stdio buffer, this is effectively a memcpy.
On platforms where freadptr() is not supported, we can fall back to the
original per-byte getc loop. Performance on those platforms is therefore
essentially unchanged from before this patch.
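For illustration, here is a minimal standalone program using the same
approach (fixed-size buffer and hardcoded '\n' delimiter for brevity;
it needs a gnulib build environment for freadptr.h; the full, general
version is in the diff below):

#include <stdio.h>
#include <string.h>
#include "freadptr.h"  /* gnulib */

int
main (void)
{
  char line[4096];
  size_t pos = 0;

  while (pos < sizeof line)
    {
      size_t avail;
      const char *p = freadptr (stdin, &avail);
      if (!p)
        {
          /* Nothing buffered (or buffer inaccessible): read one byte
             via getc, which triggers a refill, then retry freadptr.  */
          int c = getc (stdin);
          if (c == EOF)
            break;
          line[pos++] = c;
          if (c == '\n')
            break;
        }
      else
        {
          const char *nl = memchr (p, '\n', avail);  /* vectorised scan */
          size_t n = nl ? (size_t) (nl - p + 1) : avail;
          if (n > sizeof line - pos)
            n = sizeof line - pos;
          /* These bytes are already in the stdio buffer, so this fread
             is effectively a memcpy into our buffer.  */
          size_t got = fread (line + pos, 1, n, stdin);
          pos += got;
          if (nl && got == n)
            break;
        }
    }

  fwrite (line, 1, pos, stdout);  /* echo the first input line */
  return 0;
}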
Using the following input data:
LC_ALL=C tr -dc "A-Za-z0-9 \t\n" < /dev/urandom | head -c 80000000 \
| fold -w 80 > random_lines
LC_ALL=C sort random_lines > sorted_lines
LC_ALL=C sort -u random_lines > uniq_sorted_lines
awk "{for(i=1;i<=5;i++) print}" sorted_lines \
| head -5000000 > repeated_sorted
awk 'NR<=800000 {printf "%07d %s\n", NR, $0}' \
sorted_lines > join_file1
awk 'NR<=800000 {printf "%07d %s\n", int(NR*1.5), $0}' \
sorted_lines > join_file2
awk 'NR<=800000 {printf "%07d %s\n", NR*2, $0}' \
sorted_lines > join_nooverlap1
awk 'NR<=800000 {printf "%07d %s\n", NR*2-1, $0}' \
sorted_lines > join_nooverlap2
And the following benchmarks:
hyperfine --warmup 5 --runs 30 \
"uniq_before repeated_sorted > /dev/null" \
"uniq_after repeated_sorted > /dev/null"
hyperfine --warmup 5 --runs 30 \
"uniq_before uniq_sorted_lines > /dev/null" \
"uniq_after uniq_sorted_lines > /dev/null"
hyperfine --warmup 5 --runs 30 \
"LC_ALL=C comm_before sorted_lines sorted_lines > /dev/null" \
"LC_ALL=C comm_after sorted_lines sorted_lines > /dev/null"
hyperfine --warmup 5 --runs 30 \
"LC_ALL=C comm_before sorted_lines uniq_sorted_lines > /dev/null" \
"LC_ALL=C comm_after sorted_lines uniq_sorted_lines > /dev/null"
hyperfine --warmup 5 --runs 30 \
"LC_ALL=C join_before join_file1 join_file2 > /dev/null" \
"LC_ALL=C join_after join_file1 join_file2 > /dev/null"
hyperfine --warmup 5 --runs 30 \
"LC_ALL=C join_before join_nooverlap1 join_nooverlap2 > /dev/null" \
"LC_ALL=C join_after join_nooverlap1 join_nooverlap2 > /dev/null"
hyperfine --warmup 5 --runs 30 \
"nl_before random_lines > /dev/null" \
"nl_after random_lines > /dev/null"
hyperfine --warmup 5 --runs 30 \
"shuf_before random_lines > /dev/null" \
"shuf_after random_lines > /dev/null"
The results on i9-14900HX x86_64 with -O2:
uniq (5M repeated sorted lines, high repetition):
Before: 328.2 ms After: 148.8 ms (-54.7%)
uniq (1.7M unique sorted lines, typical):
Before: 131.3 ms After: 71.1 ms (-45.8%)
comm (1.8M sorted lines, self-compare):
Before: 284.9 ms After: 143.0 ms (-49.8%)
comm (1.8M lines, different sorted files, typical):
Before: 329.9 ms After: 171.1 ms (-48.2%)
join (800K lines, ~2/3 key overlap):
Before: 342.6 ms After: 264.8 ms (-22.7%)
join (800K lines, no key overlap):
Before: 341.4 ms After: 264.7 ms (-22.5%)
nl (1.8M lines):
Before: 237.6 ms After: 174.9 ms (-26.4%)
shuf (1.8M lines):
Before: 207.1 ms After: 207.9 ms (no change)
And on M1 Pro aarch64 with -O2:
uniq (5M repeated sorted lines, high repetition):
Before: 536.8 ms After: 219.5 ms (-59.1%)
uniq (1.7M unique sorted lines, typical):
Before: 202.5 ms After: 85.8 ms (-57.6%)
comm (1.8M sorted lines, self-compare):
Before: 438.9 ms After: 179.1 ms (-59.2%)
comm (1.8M lines, different sorted files, typical):
Before: 418.0 ms After: 214.1 ms (-48.8%)
join (800K lines, ~2/3 key overlap):
Before: 759.5 ms After: 622.1 ms (-18.1%)
join (800K lines, no key overlap):
Before: 706.1 ms After: 577.6 ms (-18.2%)
nl (1.8M lines):
Before: 332.8 ms After: 215.4 ms (-35.3%)
shuf (1.8M lines):
Before: 282.9 ms After: 282.9 ms (no change)
You might wonder why there's no change in shuf. Profiling shows shuf
spends its time almost entirely in randperm_new() and randint_genmax(),
so I/O is not the bottleneck.
---
lib/linebuffer.c | 80 ++++++++++++++++++++++++++++++++++------------
modules/linebuffer | 1 +
2 files changed, 61 insertions(+), 20 deletions(-)
diff --git a/lib/linebuffer.c b/lib/linebuffer.c
index 36642a1db1..814bd0a84d 100644
--- a/lib/linebuffer.c
+++ b/lib/linebuffer.c
@@ -25,6 +25,7 @@
#include <string.h>
#include <sys/types.h>
#include "linebuffer.h"
+#include "freadptr.h"
#include "xalloc.h"
#if USE_UNLOCKED_IO
@@ -62,35 +63,74 @@ readlinebuffer_delim (struct linebuffer *linebuffer, FILE *stream,
if (feof (stream))
return NULL;
+ idx_t pos = 0;
char *buffer = linebuffer->buffer;
- char *p = linebuffer->buffer;
- char *end = buffer + linebuffer->size; /* Sentinel. */
- int c;
+ idx_t alloc = linebuffer->size;
- do
+ for (;;)
{
- c = getc (stream);
- if (c == EOF)
+ size_t avail;
+ const char *bufp = freadptr (stream, &avail);
+ if (bufp)
{
- if (p == buffer || ferror (stream))
- return NULL;
- if (p[-1] == delimiter)
- break;
- c = delimiter;
+ const char *found = memchr (bufp, delimiter, avail);
+ size_t nbytes = found ? (size_t) (found - bufp + 1) : avail;
+
+ if (alloc - pos < (idx_t) nbytes)
+ {
+ buffer = xpalloc (buffer, &alloc,
+ nbytes - (alloc - pos), -1, 1);
+ linebuffer->buffer = buffer;
+ linebuffer->size = alloc;
+ }
+
+ size_t nread = fread (buffer + pos, 1, nbytes, stream);
+ pos += nread;
+
+ if (nread < nbytes)
+ {
+ /* Short read despite freadptr indicating data was available.
+ Return immediately on error. Otherwise fall through to the
+ getc path, which will handle EOF and synthesise a trailing
+ delimiter if needed. */
+ if (ferror (stream))
+ return NULL;
+ continue;
+ }
+
+ if (found)
+ {
+ linebuffer->length = pos;
+ return linebuffer;
+ }
}
- if (p == end)
+ else
{
- idx_t oldsize = linebuffer->size;
- buffer = xpalloc (buffer, &linebuffer->size, 1, -1, 1);
- p = buffer + oldsize;
- linebuffer->buffer = buffer;
- end = buffer + linebuffer->size;
+ /* Buffer empty or not accessible. Consume one byte via getc to
+ trigger a refill, then loop back to use freadptr on the newly
+ populated buffer. */
+ int c = getc (stream);
+ if (c == EOF)
+ {
+ if (pos == 0 || ferror (stream))
+ return NULL;
+ if (buffer[pos - 1] == delimiter)
+ break;
+ c = delimiter;
+ }
+ if (pos >= alloc)
+ {
+ buffer = xpalloc (buffer, &alloc, 1, -1, 1);
+ linebuffer->buffer = buffer;
+ linebuffer->size = alloc;
+ }
+ buffer[pos++] = c;
+ if (c == delimiter)
+ break;
}
- *p++ = c;
}
- while (c != delimiter);
- linebuffer->length = p - buffer;
+ linebuffer->length = pos;
return linebuffer;
}
diff --git a/modules/linebuffer b/modules/linebuffer
index b21af67ce5..91cde9cd6f 100644
--- a/modules/linebuffer
+++ b/modules/linebuffer
@@ -6,6 +6,7 @@ lib/linebuffer.h
lib/linebuffer.c
Depends-on:
+freadptr
idx
xalloc
base-commit: 0cca364da074782231685f955da2f0ce67d8c155
Thanks a lot for working on this.
The idea is sound. I'm also leaning on memchr() and strstr() in an
update to cut(1) I'm working on, as the platform-specific optimizations
for those in glibc are a significant win.
One of your observations confused me though.
getc() and putchar() should also avoid function calls:
through macros and inlining they degenerate to direct manipulation of
the stdio buffers, only periodically calling __uflow() and __overflow()
respectively.
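With glibc you can see this in the headers; the unlocked getc body is
approximately:

  /* From glibc's bits/types/struct_FILE.h (exact spelling varies by
     version): read straight from the buffer; call __uflow() only when
     the buffer is exhausted.  */
  #define __getc_unlocked_body(_fp)                                 \
    (__glibc_unlikely ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end)  \
     ? __uflow (_fp) : *(unsigned char *) (_fp)->_IO_read_ptr++)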
Compiling uniq(1) on Fedora 43 here with default (-O2) options I see:
$ yes $(yes eeeaae | head -n9 | paste -s -d,) | head -n1M > as.in
$ ltrace -c uniq as.in
eeeaae,eeeaae,eeeaae,eeeaae,eeeaae,eeeaae,eeeaae,eeeaae,eeeaae
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
 97.95   57.971874          55   1048575 memcmp
  2.05    1.211896          75     16129 __uflow
...
So function calls happen only once per 4 KiB rather than per byte
(the input is 1048576 lines × 63 bytes ≈ 66 MB, and 66 MB / 4 KiB
≈ 16129, matching the __uflow count above).
BTW fwrite(...,1) is also similarly optimized,
but var=1; fwrite(...,var) is not, so I'm also doing the attached
in my cut(1) update to avoid the function call overhead.
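That is (taking the trailing argument to be the element count), the
distinction is between:

  fwrite (buf, 1, 1, stdout);    /* literal count: optimized */

and:

  size_t n = 1;
  fwrite (buf, 1, n, stdout);    /* count in a variable: a full call */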
cheers,
Padraig
From 9ceaf25134001625e567e258b021c5893db90094 Mon Sep 17 00:00:00 2001
From: Pádraig Brady <[email protected]>
Date: Mon, 23 Mar 2026 17:42:27 +0000
Subject: [PATCH] cut: avoid fwrite calls for smaller amounts of data
This is quite significant:
yes abcdfeg | head -n1MB > big-file
$ time src/cut-before -b1,3 big-file >/dev/null
real 0m0.050s
$ time src/cut-after -b1,3 big-file >/dev/null
real 0m0.029s
---
src/cut.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/src/cut.c b/src/cut.c
index e1cc642a5..7b28a9c50 100644
--- a/src/cut.c
+++ b/src/cut.c
@@ -405,6 +405,16 @@ skip_whitespace_run (mbbuf_t *mbuf, struct mbfield_parser *parser,
static void
write_bytes (char const *buf, size_t n_bytes)
{
+ /* Avoid a function call for smaller amounts,
+ using instead the macro to directly interact with the stdio buffer. */
+ if (n_bytes <= 4)
+ {
+ for (size_t i = 0; i < n_bytes; i++)
+ if (putchar (to_uchar (buf[i])) < 0)
+ write_error ();
+ return;
+ }
+
if (fwrite (buf, sizeof (char), n_bytes, stdout) != n_bytes)
write_error ();
}
--
2.53.0