2015-06-06 21:49:16 +0300, Valdis Vītoliņš:
> Note, that UTF-8 characters can be counted by counting bytes with bit
> patterns 0xxxxxxx or 11xxxxxx:
> https://en.wikipedia.org/wiki/UTF-8#Description
> So, general logic should be, that, if:
> a) locale setting is utf-8 (e.g. LANG=xx_XX.UTF-8), or
> b) first two bytes of file are 0xFE 0xFF
> https://en.wikipedia.org/wiki/Byte_order_mark
> then count bytes with bits 0xxxxxxx and 11xxxxxx.

Except that only valid characters should be counted. And there,
the definition of a valid character is not always clear.

At the very least, an invalid UTF-8 sequence can't count as a
character:

printf '\300' | wc -m

should return 0, as 11000000 alone is not a valid character. So we
can't use your algorithm without first verifying the validity of
the input.
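
To make the difference concrete, here is a rough Python sketch (not
what wc does internally; the function names are made up for this
example). naive_count is the byte-pattern rule from the quoted
message, and strict_count assumes that invalid bytes are simply
skipped, using Python's own UTF-8 decoder as the validator:

```python
def naive_count(data: bytes) -> int:
    # Count every byte matching 0xxxxxxx or 11xxxxxx, i.e. every byte
    # that is not a continuation byte (10xxxxxx). No validation at all.
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def strict_count(data: bytes) -> int:
    # Assumption: bytes that are not part of a well-formed UTF-8
    # sequence are dropped rather than counted.
    return len(data.decode("utf-8", errors="ignore"))


if __name__ == "__main__":
    for sample in (b"abc", "\u00e9".encode("utf-8"), b"\xc0"):
        print(sample, naive_count(sample), strict_count(sample))
```

The two functions agree on well-formed input and disagree on the
lone \300 byte, which the naive rule counts as one character.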

Then the UTF-8 encodings of the UTF-16 surrogates (U+D800 to
U+DFFF) should probably be excluded as well. For instance:

printf '\355\240\200' | wc -m

should return 0.
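
As an illustration only, the following Python sketch shows why those
three bytes are a problem: plain bit arithmetic on the
1110xxxx 10xxxxxx 10xxxxxx pattern yields a surrogate code point,
while a strict decoder (here Python's built-in one) refuses the
sequence. The helper names are invented for this example:

```python
def decode_three_byte(seq: bytes) -> int:
    # Raw bit extraction for a 3-byte sequence, with no validity checks.
    b0, b1, b2 = seq
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def is_surrogate(cp: int) -> bool:
    # Code points reserved for UTF-16 surrogates.
    return 0xD800 <= cp <= 0xDFFF


def strict_decodes(seq: bytes) -> bool:
    # True if Python's strict UTF-8 decoder accepts the bytes.
    try:
        seq.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    seq = b"\xed\xa0\x80"  # same bytes as printf '\355\240\200'
    cp = decode_three_byte(seq)
    print(hex(cp), is_surrogate(cp), strict_decodes(seq))
```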

And maybe code points above 0x10FFFF too, now that Unicode seems to
have given up on ever defining characters above that (probably
because of the UTF-16 limitation).

Now even in the ranges 0 to 0xD7FF and 0xE000 to 0x10FFFF, there are
still thousands of code points that are not assigned yet in the
latest Unicode version. I suppose we can imagine locale
definitions where each of the known characters is listed and
the rest rejected...
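
A rough sketch of what such a "known characters only" filter could
look like, in Python. This is only one possible reading of the idea:
it takes U+10FFFF as the upper bound (the highest Unicode code
point), and it treats the general categories Cn (unassigned) and Cs
(surrogate) as "rejected". The answer for a given code point depends
on the Unicode version shipped with the interpreter
(unicodedata.unidata_version), and is_known_character is a
hypothetical name:

```python
import unicodedata


def is_known_character(cp: int) -> bool:
    # Outside the Unicode code space: never a character.
    if cp < 0 or cp > 0x10FFFF:
        return False
    # Cs = surrogate, Cn = unassigned (this also covers noncharacters).
    return unicodedata.category(chr(cp)) not in ("Cs", "Cn")


if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    for cp in (0x41, 0x0378, 0xD800, 0x110000):
        print(hex(cp), is_known_character(cp))
```

Whether private-use or control code points should count is a separate
policy question that this sketch does not try to settle.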

