On 11/11/2019 12:30 PM, Jose Isaias Cabrera wrote:

Igor Tandetnik, on Monday, November 11, 2019 11:02 AM, wrote...
Most people have to figure out what Unicode they are using, count the bytes, divide by... and on,
and on.  Not me, I just take that UTF8, or UTF16 string, convert it to UTF32, and do a count.

And then what do you do with that count? What do you use it for?

Say that I am writing a report and I only want to print the first 20 characters of a string
A sequence of Unicode codepoints U+006F U+0302 U+0301 should be rendered as a single grapheme (ố)
- what a human would think of as a "character". This is an actual character in Vietnamese. Now, if
you have several such triplets in a row in your string, and you chop it at 20 codepoints, you'll
only print 7 graphemes / "characters": six complete triplets plus the first two codepoints of a
seventh. Moreover, you'll end up dropping the last combining accent of that seventh triplet,
producing a different grapheme (ô) and potentially altering the meaning of the text. (I don't know
how much of a danger this is in Vietnamese, but I know that combining viramas
https://www.compart.com/en/unicode/combining/9 are vital to Indic languages, and dropping one will
in fact often produce a valid but different word.)
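
The arithmetic above can be checked with a short Python sketch. This is only an illustration, not
something from the thread: the helper names clusters() and truncate_clusters() are made up here,
and grouping by unicodedata.combining() is a rough stand-in for real grapheme segmentation (it
attaches any codepoint with a nonzero canonical combining class to the preceding codepoint, which
covers this Vietnamese example but is not the full set of Unicode grapheme cluster rules).

```python
import unicodedata


def clusters(text):
    """Group each base codepoint with the combining marks that follow it.

    Approximation only: a codepoint whose canonical combining class is
    nonzero is appended to the previous cluster; anything else starts a
    new cluster. Real grapheme segmentation has more rules than this.
    """
    out = []
    for ch in text:
        if out and unicodedata.combining(ch) != 0:
            out[-1] += ch
        else:
            out.append(ch)
    return out


def truncate_clusters(text, limit):
    """Keep at most `limit` whole clusters, never splitting one."""
    return "".join(clusters(text)[:limit])


# o + combining circumflex + combining acute, repeated ten times.
triplet = "\u006f\u0302\u0301"
s = triplet * 10                  # 30 codepoints, 10 clusters

# Chopping at 20 codepoints (what a UTF-32 count would suggest).
by_codepoint = s[:20]
chopped = clusters(by_codepoint)  # 6 whole triplets + "o\u0302"

# Chopping at 7 clusters keeps the seventh triplet intact.
by_cluster = truncate_clusters(s, 7)

print(len(chopped))               # number of clusters that survive
print(chopped[-1] == "o\u0302")   # last one lost its acute accent
print(len(by_cluster))            # codepoints needed for 7 whole clusters
```

The point of the comparison is that a codepoint count and a "character" count only agree when no
combining sequences are involved, so a limit expressed in codepoints cannot safely be used as a
cut position.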
--
Igor Tandetnik

_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
