On 22/08/2025 04:15, Collin Funk wrote:
Hi Pádraig,
Pádraig Brady <[email protected]> writes:
Re performance, it's good, but it would be great if
we could maintain the LC_ALL=C performance like the i18n patch does.
Some quick testing shows:
$ yes `seq 100` | head -n 1M > file.in
# Note /bin/fold has the the (Fedora) i18n patch applied
$ for L in en_US.UTF-8 C; do
for FOLD in src/fold src/fold-c /bin/fold; do
printf "LC_ALL=$L $FOLD: "
time LC_ALL=$L $FOLD < file.in | wc -l
done
done
LC_ALL=en_US.UTF-8 src/fold: 4194304
real 0m1.046s
LC_ALL=en_US.UTF-8 src/fold-c: 4194304
real 0m8.294s
LC_ALL=en_US.UTF-8 /bin/fold: 4194304
real 0m11.556s
LC_ALL=C src/fold: 4194304
real 0m0.979s
LC_ALL=C src/fold-c: 4194304
real 0m8.277s
LC_ALL=C /bin/fold: 4194304
real 0m0.976s
I.e. we beat the i18n patch implementation,
but we don't shortcut the LC_ALL=C case.
I thought of a different way to handle this. Thank you for the simple
test case. The performance of this patch is much better in the LC_ALL=C
case, but still slower.
These are compiled with -O3 and the latest Fedora 42 binaries for my
system packages. Here are the sizes:
$ size src/fold
text data bss dec hex filename
45343 1020 480 46843 b6fb src/fold
$ size src/fold-c
text data bss dec hex filename
46487 1036 512 48035 bba3 src/fold-c
And the results of your benchmark:
LC_ALL=en_US.UTF-8 src/fold: 4194304
real 0m0.937s
LC_ALL=en_US.UTF-8 src/fold-c: 4194304
real 0m1.787s
LC_ALL=en_US.UTF-8 /bin/fold: 4194304
real 0m7.055s
LC_ALL=C src/fold: 4194304
real 0m0.864s
LC_ALL=C src/fold-c: 4194304
real 0m1.818s
LC_ALL=C /bin/fold: 4194304
real 0m0.940s
Note that the original V1 patch has the benefit of supporting any
encoding that the system supports. This one uses mcel which doesn't
support legacy encodings like Shift JIS. I think the world has decided
upon UTF-8 though, so I doubt anyone will mind.
That is better. I agree the focus on UTF8 is prudent.
Given the tradeoffs here, this seems like the best approach.
This is 4x faster (than i18n patch) in the normal case,
and 2x slower in the LC_ALL=C case.
Fold being focused on text, and usually reasonable amounts of text,
this is a good tradeoff.
Lower level (set) operations like sort, uniq, join
would have more important perf constraints.
I'd tweak NEWS to spell out "count multi-byte characters"
rather than just "count characters".
Since we're changing --spaces handling too,
to would be good to incorporate spaces mentioned
in tests/wc/wc-nbsp.sh in fold-characters.sh
thanks!
Padraig