On 22/08/2025 04:15, Collin Funk wrote:
Hi Pádraig,

Pádraig Brady <[email protected]> writes:

Re performance, it's good, but it would be great if
we could maintain the LC_ALL=C performance like the i18n patch does.
Some quick testing shows:

   $ yes `seq 100` | head -n 1M > file.in

   # Note /bin/fold has the the (Fedora) i18n patch applied
   $ for L in en_US.UTF-8 C; do
       for FOLD in src/fold src/fold-c /bin/fold; do
         printf "LC_ALL=$L $FOLD: "
         time LC_ALL=$L $FOLD < file.in | wc -l
       done
     done

   LC_ALL=en_US.UTF-8 src/fold: 4194304
   real 0m1.046s
   LC_ALL=en_US.UTF-8 src/fold-c: 4194304
   real 0m8.294s
   LC_ALL=en_US.UTF-8 /bin/fold: 4194304
   real 0m11.556s
   LC_ALL=C src/fold: 4194304
   real 0m0.979s
   LC_ALL=C src/fold-c: 4194304
   real 0m8.277s
   LC_ALL=C /bin/fold: 4194304
   real 0m0.976s

I.e. we beat the i18n patch implementation,
but we don't shortcut the LC_ALL=C case.

I thought of a different way to handle this. Thank you for the simple
test case. The performance of this patch is much better in the LC_ALL=C
case, but still slower.

These are compiled with -O3 and the latest Fedora 42 binaries for my
system packages. Here are the sizes:

     $ size src/fold
        text       data     bss     dec     hex filename
       45343       1020     480   46843    b6fb src/fold
     $ size src/fold-c
        text       data     bss     dec     hex filename
       46487       1036     512   48035    bba3 src/fold-c


And the results of your benchmark:

     LC_ALL=en_US.UTF-8 src/fold: 4194304
     real       0m0.937s
     LC_ALL=en_US.UTF-8 src/fold-c: 4194304
     real       0m1.787s
     LC_ALL=en_US.UTF-8 /bin/fold: 4194304
     real       0m7.055s
     LC_ALL=C src/fold: 4194304
     real       0m0.864s
     LC_ALL=C src/fold-c: 4194304
     real       0m1.818s
     LC_ALL=C /bin/fold: 4194304
     real       0m0.940s


Note that the original V1 patch has the benefit of supporting any
encoding that the system supports. This one uses mcel which doesn't
support legacy encodings like Shift JIS. I think the world has decided
upon UTF-8 though, so I doubt anyone will mind.

That is better. I agree the focus on UTF8 is prudent.
Given the tradeoffs here, this seems like the best approach.

This is 4x faster (than i18n patch) in the normal case,
and 2x slower in the LC_ALL=C case.

Fold being focused on text, and usually reasonable amounts of text,
this is a good tradeoff.

Lower level (set) operations like sort, uniq, join
would have more important perf constraints.

I'd tweak NEWS to spell out "count multi-byte characters"
rather than just "count characters".

Since we're changing --spaces handling too,
to would be good to incorporate spaces mentioned
in tests/wc/wc-nbsp.sh in fold-characters.sh

thanks!
Padraig

Reply via email to