On Fri, Apr 12, 2019 at 4:51 PM x <tam118...@hotmail.com> wrote:

> I’m still confused by utf strings. [...] I want to scan the string to
> count the number of occurrences of a certain character. [...]
> How do I do the same thing if the string param is a utf-8 or utf-16 string
> and the SearchChar is a Unicode character?
>
> I’m confused by the fact that Unicode characters are not a fixed number of
> bytes so if I do this e.g.
>
> wchar_t *c = (wchar_t*) sqlite3_value_text(0);

That's very wrong. sqlite3_value_text() always returns UTF-8; the
sqlite3_value_text16*() variants return UTF-16. Casting the result to
wchar_t* converts nothing, it only reinterprets the UTF-8 bytes.

As to how many bytes a UTF-8-encoded code point takes, it's well
documented on Wikipedia. Based on the leading bits of a byte, one can
tell unambiguously whether it is a continuation byte or the lead byte
of a sequence, and in the latter case how long the sequence is
(1 to 4 bytes).

Even UTF-16 needs "surrogate pairs" for code points beyond the
so-called "BMP" (Basic Multilingual Plane). And that's not even getting
into the fact that the same text may not have a unique code point
sequence (precomposed vs. combining characters), which is what Unicode
"normalization" deals with.

This is not an easy subject... You can play with the char() built-in
SQL function, wrapped in hex(), to see how different code point values
are encoded in UTF-8, e.g. SELECT hex(char(233));

--DD
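
P.S. To make the "leading bits" point concrete, here is a minimal sketch in plain C of counting one code point in a UTF-8 buffer. The names utf8_decode() and count_codepoint_utf8() are made up for this example and are not part of SQLite. It is deliberately simplified: it does not reject overlong encodings or encoded surrogates, and it counts code points, not user-perceived characters, so it ignores normalization entirely.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s, with n
 * bytes remaining (n must be >= 1). Stores the code point in *cp and
 * returns the number of bytes consumed. On malformed input it stores
 * U+FFFD and consumes a single byte. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    unsigned char b = s[0];
    size_t len, i;
    uint32_t c;

    if (b < 0x80) {                  /* 0xxxxxxx: 1-byte sequence (ASCII) */
        *cp = b;
        return 1;
    } else if ((b & 0xE0) == 0xC0) { /* 110xxxxx: lead of 2-byte sequence */
        len = 2; c = b & 0x1F;
    } else if ((b & 0xF0) == 0xE0) { /* 1110xxxx: lead of 3-byte sequence */
        len = 3; c = b & 0x0F;
    } else if ((b & 0xF8) == 0xF0) { /* 11110xxx: lead of 4-byte sequence */
        len = 4; c = b & 0x07;
    } else {                         /* stray continuation or invalid byte */
        *cp = 0xFFFD;
        return 1;
    }

    if (len > n) {                   /* sequence truncated by end of buffer */
        *cp = 0xFFFD;
        return 1;
    }
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) { /* continuation must be 10xxxxxx */
            *cp = 0xFFFD;
            return 1;
        }
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }
    *cp = c;
    return len;
}

/* Hypothetical helper: count how many times the code point `target`
 * occurs in the n-byte UTF-8 buffer s. Note that searching for U+FFFD
 * would also match malformed bytes. */
size_t count_codepoint_utf8(const unsigned char *s, size_t n, uint32_t target)
{
    size_t count = 0, i = 0;

    while (i < n) {
        uint32_t cp;
        i += utf8_decode(s + i, n - i, &cp);
        if (cp == target)
            count++;
    }
    return count;
}
```

Inside a user-defined SQL function you would presumably feed it the pointer from sqlite3_value_text(argv[0]) and the length from sqlite3_value_bytes(argv[0]), called in that order. For UTF-16 the loop has the same shape, except that you read 16-bit units and combine a high and a low surrogate into one code point.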