On 2023-07-03 15:00, Bruno Haible wrote:
> Level 3: Behave correctly. Don't split a 2-Unicode-character sequence.
>          This is what code that uses mbrtoc32() does, when it has the
>          lines
>              if (bytes == (size_t) -3)
>                bytes = 0;
>          and uses !mbsinit (&state) in the loop termination condition.

With diffutils even level 3 would not suffice. Because diffutils truncates
at input byte boundaries, merely treating (size_t) -3 as zero is not
enough, even if one also checks mbsinit. Instead, one would have to
treat all the characters in the sequence ABBB... (where A is an ordinary
multibyte character and the Bs all return (size_t) -3) as a single unit,
because one cannot truncate in the middle of that sequence. Or wait a
minute - in theory I suppose it could even be an arbitrary sequence of
As and Bs, so long as the total "sizes" of the As equals the number of
bytes in the original byte sequence that stands for a series of characters.
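To make that concrete, here is a rough sketch of what "treat the sequence as a single unit" might look like. The function safe_truncation_point is a hypothetical helper of my own naming, not anything in diffutils or Gnulib. It assumes that a unit ends whenever mbsinit reports the initial state after a call, and that a one-byte resynchronization on invalid input is acceptable. It also rescans from the start of the buffer, which real code would want to avoid.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper (not part of diffutils or Gnulib).  Return the
   largest byte offset <= LIMIT at which BUF (of length LEN) can be cut
   without splitting a multibyte character, or a run of characters that
   mbrtoc32 produced from one byte sequence via (size_t) -3 returns.  */
size_t
safe_truncation_point (char const *buf, size_t len, size_t limit)
{
  mbstate_t state;
  memset (&state, 0, sizeof state);
  size_t safe = 0;   /* Last offset known to be a unit boundary.  */
  size_t i = 0;

  while (i < len || !mbsinit (&state))
    {
      char32_t ch;
      size_t bytes = mbrtoc32 (&ch, buf + i, len - i, &state);

      if (bytes == (size_t) -3)
        bytes = 0;   /* Another character from bytes already consumed.  */
      else if (len <= i)
        break;       /* At end of input only -3 results make sense.  */
      else if (bytes == (size_t) -1 || bytes == (size_t) -2)
        {
          /* Invalid or incomplete input: assume a one-byte unit and
             start over in the initial state.  */
          bytes = 1;
          memset (&state, 0, sizeof state);
        }
      else if (bytes == 0)
        bytes = 1;   /* A null character occupies one byte.  */

      i += bytes;

      /* Only an offset reached in the initial state ends a unit.  */
      if (mbsinit (&state))
        {
          if (limit < i)
            break;
          safe = i;
        }
    }

  return safe;
}
```

On a platform where mbrtoc32 never returns (size_t) -3, this degenerates to ordinary truncation at a character boundary, so the extra bookkeeping only matters on the hypothetical platforms discussed here.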

The diffutils truncation approach also has problems with coding systems
that have shift state, but that's OK: nobody uses these coding systems
with GNU apps as they're not practical. Similarly, any platform where
mbrtoc32 returns (size_t) -3 won't be practical with GNU apps, so it
should be OK for diffutils to not worry about this possibility either,
given that it would be a hassle to support it. We don't have time to
support every oddball coding system that POSIX allows.