On 26/08/2025 08:24, Collin Funk wrote:
Collin Funk <[email protected]> writes:
Good point.
Using fread is also faster since it doesn't check for the delimiter upon
reading a character, and doesn't need to lock the stream each call
(since we use fread_unlocked).
Will push the attached in a bit.
Cool, that was quick!
You might include something like the following test.
Actually, I realized that it is pointless to malloc line_out now. Since
the input buffer is a fixed size and we do not do case conversion or
anything that may change the number of bytes needed to represent the
output.
Will push this v2 patch in a bit instead.
Actually, the change from getline to fread made the implementation
buggy. It wasn't immediately apparent to me when I wrote the commit.
Using getline ensured that we wouldn't have incomplete multibyte
sequences at the end of the buffer. Since, at least for Unicode, the
newline character cannot be in a multibyte sequence. The use of fread
makes that no longer true.
Here is the test case that I used to make sure that I was not imagining
this issue:
$ python3 -c 'print("a" * (262144 - 1) + "뉐" + "a" * 160)' > test.txt
$ ./src/fold-old --characters test.txt
[...]
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa뉐aaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
[...]
$ ./src/fold-new --characters test.txt
[...]
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa뉐aaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
[...]
The first run is wrong since the line with 뉐 has 78 characters when it
should have 80 like the rest. The second version is fixed. Note that 뉐
has a width of two characters so the line looks like it is 81 characters
long (at least using my fonts).
This occurs because the first byte of 뉐, 0xEB, is the last character of
the input IO_BUFSIZE-sized buffer. The following 0x89, and 0x90 bytes
are not available until the next fread. As a result, mcel_scan will
return with the fields g.ch == 0 and g.err != 0. Since I did not check
g.err, the NUL was counted as a character.
This patch fixes the issue by checking g.err and moving possible
incomplete sequences to the front of the input buffer before reading
again. If the number of bytes available is larger than the maximum
number of bytes that can create a multi-byte sequence we know that we
have an encoding error. In that case, we print the bytes as we receive
them.
Will pushed the attached fix with a test case later today.
Good catch.
Re using IO_BUFSIZE in the test, you can push as is.
I'll follow up with a patch to add IO_BUFSIZE to getlimits,
which tests can then reference easily.
cheers,
Padraig
Collin