On Fri, Apr 12, 2019 at 4:51 PM x <tam118...@hotmail.com> wrote:

> I’m still confused by utf strings. [...] I want to scan the string to
> count the number of occurrences of a certain character. [...]
> How do I do the same thing if the string param is a utf-8 or utf-16 string
> and the SearchChar is a Unicode character?
>
> I’m confused by the fact that Unicode characters are not a fixed number of
> bytes so if I do this e.g.
>
> wchar_t *c = (wchar_t*) sqlite3_value_text(0);


That's very wrong. _text() always returns UTF-8, as a const unsigned char*,
never as wchar_t. The _text16*() variants return UTF-16.

As to how many bytes a UTF-8-encoded code point takes, it's well documented
on Wikipedia. Based on the leading bits of a byte, one can know unambiguously
whether it is a single-byte (ASCII) character, the lead byte of a 2-, 3-, or
4-byte sequence, or a continuation byte (10xxxxxx) inside such a sequence.
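
To make the leading-bits rule concrete, here is a minimal sketch in C that
counts occurrences of one code point in a NUL-terminated UTF-8 string. The
names utf8_decode and count_codepoint_utf8 are made up for illustration, and
the decoder is deliberately simplified: it does not reject overlong forms or
encoded surrogates, and it maps any malformed byte to U+FFFD.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence starting at s (NUL-terminated input).
 * Stores the code point in *cp and returns the bytes consumed (1..4).
 * Malformed input yields U+FFFD and consumes a single byte. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    size_t len, i;
    uint32_t c;

    if (s[0] < 0x80) {                    /* 0xxxxxxx: ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {   /* 110xxxxx: 2-byte lead */
        len = 2; c = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {   /* 1110xxxx: 3-byte lead */
        len = 3; c = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {   /* 11110xxx: 4-byte lead */
        len = 4; c = s[0] & 0x07;
    } else {                              /* stray continuation or invalid */
        *cp = 0xFFFD;
        return 1;
    }

    for (i = 1; i < len; i++) {
        /* Each following byte must look like 10xxxxxx. A NUL terminator
         * fails this check, so we never read past the end of the string. */
        if ((s[i] & 0xC0) != 0x80) {
            *cp = 0xFFFD;
            return 1;
        }
        c = (c << 6) | (s[i] & 0x3F);
    }
    *cp = c;
    return len;
}

/* Count how many times the code point `target` occurs in s. */
size_t count_codepoint_utf8(const unsigned char *s, uint32_t target)
{
    size_t n = 0;
    uint32_t cp;

    while (*s) {
        s += utf8_decode(s, &cp);
        if (cp == target)
            n++;
    }
    return n;
}
```

Inside an SQL function, the pointer returned by sqlite3_value_text() could be
passed in directly, since it is already a const unsigned char* to UTF-8 data.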

Even UTF-16 is not fixed-width: code points beyond the so-called "BMP"
(Basic Multilingual Plane, U+0000 to U+FFFF) are encoded as "surrogate
pairs" of two 16-bit units.
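
For the UTF-16 side, a comparable sketch: it walks an array of 16-bit units
in native byte order and combines a high surrogate (0xD800..0xDBFF) followed
by a low surrogate (0xDC00..0xDFFF) into one code point. The function name
and the explicit length parameter are assumptions for illustration; unpaired
surrogates are simply compared as-is.

```c
#include <stddef.h>
#include <stdint.h>

/* Count occurrences of the code point `target` in a UTF-16 buffer of
 * n_units 16-bit code units (native byte order). */
size_t count_codepoint_utf16(const uint16_t *s, size_t n_units,
                             uint32_t target)
{
    size_t i = 0, n = 0;

    while (i < n_units) {
        uint32_t cp = s[i++];

        /* High surrogate followed by a low surrogate: one code point
         * in the range U+10000..U+10FFFF. */
        if (cp >= 0xD800 && cp <= 0xDBFF && i < n_units &&
            s[i] >= 0xDC00 && s[i] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (s[i] - 0xDC00);
            i++;
        }
        if (cp == target)
            n++;
    }
    return n;
}
```

With SQLite, the buffer would come from sqlite3_value_text16() and the unit
count from sqlite3_value_bytes16() divided by 2.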

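And that's not even getting into the fact that the representation may not
be "unique": the same visible character can be written as one precomposed
code point or as a base letter plus combining marks, which is what Unicode
"normalization" deals with. This is not an easy subject...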
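
A small illustration of the non-uniqueness problem, using "é" as the
example: the precomposed form U+00E9 and the decomposed form U+0065 U+0301
render the same but are different byte strings, so a byte-wise or code point
search for one will not find the other unless both sides are normalized
first. The identifiers below are hypothetical.

```c
#include <string.h>

/* "é" as one precomposed code point, U+00E9 (UTF-8: C3 A9). */
const char e_acute_precomposed[] = "\xC3\xA9";

/* "é" as 'e' (U+0065) plus combining acute accent U+0301 (UTF-8: CC 81). */
const char e_acute_decomposed[] = "e\xCC\x81";

/* Returns 1 if the two strings are byte-for-byte identical, else 0.
 * This is NOT a Unicode-aware comparison; it only compares raw bytes. */
int same_bytes(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Here same_bytes(e_acute_precomposed, e_acute_decomposed) is 0 even though
both strings display as the same character.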

You can play with the char() built-in SQL function, combined with hex(), to
see how different code point values are encoded in UTF-8, for example
SELECT hex(char(233)); --DD
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
