Igor Tandetnik, on Monday, November 11, 2019 02:56 PM, wrote...
>
> On 11/11/2019 12:30 PM, Jose Isaias Cabrera wrote:
> >
> > Igor Tandetnik, on Monday, November 11, 2019 11:02 AM, wrote...
> >>> Most people have to figure out what Unicode they are using, count the
> >>> bytes, divide by... and on, and on. Not me, I just take that UTF8, or
> >>> UTF16 string, convert it to UTF32, and do a count.
> >>
> >> And then what do you do with that count? What do you use it for?
> >
> > Say that I am writing a report and I only want to print the first 20
> > characters of a string
>
> A sequence of Unicode codepoints U+006F U+0302 U+0301 should be rendered
> as a single grapheme ( ố ) - what a human would think of as a "character".
> This is an actual character in Vietnamese. Now, if you have several such
> triplets in a row in your string, and you chop it at 20 codepoints, you'll
> only print 7 graphemes / "characters". Moreover, you'll end up dropping
> the last combining accent, producing a different grapheme (ô) and
> potentially altering the meaning of the text. (Don't know how much of a
> danger this is in Vietnamese, but I know that combining viramas
> https://www.compart.com/en/unicode/combining/9 are vital to Indic
> languages, and dropping one will in fact often produce a valid but
> different word).
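
To make the arithmetic in the quoted example concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the helper names clusters() and first_n_clusters() are made up for this post, and the grouping rule (attach every code point with a non-zero canonical combining class to the preceding base) is a rough stand-in for real grapheme segmentation as defined in UAX #29, not an implementation of it.

```python
import unicodedata


def clusters(text):
    """Group each base code point with the combining marks that follow it.

    Assumption: a crude approximation of grapheme clusters, using only the
    canonical combining class. It is NOT full UAX #29 segmentation.
    """
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch      # combining mark: attach to the previous cluster
        else:
            out.append(ch)     # base code point: start a new cluster
    return out


def first_n_clusters(text, n):
    """Return the first n approximate clusters, never splitting one."""
    return "".join(clusters(text)[:n])


# o + COMBINING CIRCUMFLEX ACCENT + COMBINING ACUTE ACCENT, as in the example
triplet = "\u006F\u0302\u0301"
s = triplet * 25               # 75 code points, 25 clusters

naive = s[:20]                 # chop at 20 code points
safe = first_n_clusters(s, 20) # chop at 20 approximate clusters

print(len(clusters(naive)))    # 7: six whole triplets plus a damaged seventh
print(unicodedata.normalize("NFC", clusters(naive)[-1]) == "\u00F4")  # True
print(len(clusters(safe)), len(safe))  # 20 60
```

Chopping at 20 code points leaves 7 clusters, and the last one has lost its acute accent, which matches the count above. Cutting on cluster boundaries keeps 20 intact "characters" at the cost of 60 code points. For a real report I would expect to need a proper segmentation library rather than this shortcut.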

Yes, dropping pieces of words is a problem in any language.
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users