On 2023-07-21 17:33, Bruno Haible wrote:
> It gets this info from mbrtoc32, which on most platforms gets this info
> from mbrtowc. This multibyte scanner knows when the bytes it has seen
> so far constitute
>   - a complete character, or
>   - an invalid character, or
>   - an incomplete character (i.e. if additional bytes may lead to a
>     complete character).
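
For reference, the three outcomes listed above correspond to the return
values of the standard C function mbrtowc. The following is only a minimal
sketch: the names scan_status and classify_prefix are made up for
illustration, the result depends on the current locale's encoding, and each
call starts from the initial conversion state.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical names, for illustration only.  */
enum scan_status { SCAN_COMPLETE, SCAN_INVALID, SCAN_INCOMPLETE };

/* Classify the first N bytes of S in the current locale's encoding,
   starting from the initial conversion state.  */
static enum scan_status
classify_prefix (const char *s, size_t n)
{
  assert (s != NULL);

  mbstate_t state;
  memset (&state, 0, sizeof state);   /* initial conversion state */

  wchar_t wc;
  size_t ret = mbrtowc (&wc, s, n, &state);

  if (ret == (size_t) -1)
    return SCAN_INVALID;      /* encoding error */
  if (ret == (size_t) -2)
    return SCAN_INCOMPLETE;   /* more bytes might complete a character */
  return SCAN_COMPLETE;       /* RET bytes form a character (0 for NUL) */
}
```

A caller would typically select the locale with setlocale (LC_ALL, "")
first; without that, the "C" locale's encoding is in effect.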
Ah, I had thought the idea was to treat all the bytes of a byte sequence
matching 10646-1[1] R.2 Table 1 as a single invalid "character" (i.e.,
not a real character) if the byte sequence is not valid UTF-8. That's
what Kuhn seems to be suggesting in [2].

But what you're describing is something different, which could be
implemented by calling mbrtoc32.
For example, as I understand it, the byte sequence F4 90 80 80, which I
had thought you were saying would be treated as a single byte sequence
[F4 90 80 80] because that's in R.2 Table 1, would instead be treated as
[F4 90] [80] [80], because [F4 90] is not an incomplete character
(additional bytes cannot lead to a complete character).
Is this right?
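
To make the question concrete, here is a rough, locale-independent sketch
of the complete/invalid/incomplete test for a UTF-8 prefix. It assumes the
byte ranges of current UTF-8 (RFC 3629, nothing above U+10FFFF), not the
behavior of any particular mbrtowc or mbrtoc32 implementation, and the
names u8_status and u8_prefix_status are hypothetical. It says nothing
about how the bytes of an invalid sequence should be grouped; it only
reports whether a prefix can still be extended to a valid character.

```c
#include <stddef.h>

/* Hypothetical names, for illustration only.  */
enum u8_status { U8_COMPLETE, U8_INVALID, U8_INCOMPLETE };

/* Classify the first character attempt in P[0..N-1], where N >= 1.  */
static enum u8_status
u8_prefix_status (const unsigned char *p, size_t n)
{
  unsigned char lo = 0x80, hi = 0xBF;  /* allowed range of the 2nd byte */
  size_t len;                          /* expected total length */
  unsigned char b = p[0];

  if (b < 0x80)
    len = 1;
  else if (b >= 0xC2 && b <= 0xDF)
    len = 2;
  else if (b >= 0xE0 && b <= 0xEF)
    {
      len = 3;
      if (b == 0xE0) lo = 0xA0;        /* exclude overlong forms */
      if (b == 0xED) hi = 0x9F;        /* exclude surrogates */
    }
  else if (b >= 0xF0 && b <= 0xF4)
    {
      len = 4;
      if (b == 0xF0) lo = 0x90;        /* exclude overlong forms */
      if (b == 0xF4) hi = 0x8F;        /* exclude values > U+10FFFF */
    }
  else
    return U8_INVALID;                 /* 80..C1 and F5..FF cannot start */

  for (size_t i = 1; i < n && i < len; i++)
    {
      if (p[i] < lo || p[i] > hi)
        return U8_INVALID;
      lo = 0x80;                       /* later bytes: plain continuation */
      hi = 0xBF;
    }
  return n >= len ? U8_COMPLETE : U8_INCOMPLETE;
}
```

Under these assumptions, F4 alone is classified as incomplete, while F4 90
is already invalid, since a second byte after F4 must lie in 80..8F.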
[1]: https://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html
[2]: https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt