Zbigniew Baniewski wrote:
> How one should handle this? SQLite has UTF-8 by default.

You seem to doubt that being all Unicode is a good thing :-) Read this:

  http://www.joelonsoftware.com/articles/Unicode.html

> What C-function (Linux) could be considered as most convenient? Perhaps
> there's a doc with explanation (in the context of SQLite-usage)?

SQLite does not include conversion between arbitrary non-Unicode
encodings and Unicode. (It does include conversion between the 8 bit
and 16 bit Unicode encodings, i.e. UTF-8 and UTF-16.) If you just want
"bytes in, same bytes out", then use blobs in SQLite. If you think your
bytes are actually strings, then reread the link above :-)

To do the conversion within your code you should use iconv:

  http://en.wikipedia.org/wiki/Iconv

If you want to manipulate the text once it is in Unicode (upper/lower
casing, sorting and so on), then you need to know about locales. The
exact same sequence of characters sorts, upper cases and lower cases
differently depending on where you are. For example, Turkic languages
have more than one letter i (dotted and dotless), German has ß, which
behaves like ss, and accented letters sort differently in different
European countries. Fortunately there is a library you can ask to do
the right locale-specific thing:

  http://en.wikipedia.org/wiki/International_Components_for_Unicode

For case conversion and case-insensitive comparison, a default SQLite
compilation only deals with the 26 letters of the Roman (ASCII)
alphabet. If you enable ICU with SQLite then you get the good stuff:

  http://www.sqlite.org/cvstrac/fileview?f=sqlite/ext/icu/README.txt (*)

Your Linux distribution almost certainly has the iconv binary and
libraries already installed. ICU should be installed already, or be
easily installable via your package manager.

(*) Viewing that page is a good example of how messy this gets. The
actual README.txt is encoded in UTF-8. However, the cvstrac web server
tells the browser that it is encoded as ANSI_X3.4-1968 (a fancy name
for ASCII). If you scroll to just before section 1.2 you can see the
Turkish lower case dotless i being mangled.

I like to test using the front page of http://Wikipedia.org, as it
contains the names of a wide variety of languages written in those
languages, and hence uses a wide sampling of Unicode characters.

In summary, never confuse bytes with strings (which C sadly treats as
the same thing). Either always use bytes (and SQLite blobs) for
everything, or use strings (and SQLite strings) for everything. If you
take the latter approach and have to deal with external input/output,
then you must know which encodings are being used, and it is best to
convert to Unicode as early as possible on input and as late as
possible on output.

Roger
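For the iconv suggestion above, here is a minimal sketch of what the conversion could look like in C on Linux. The helper name to_utf8 and the 4x output buffer estimate are my own assumptions for illustration, not anything that SQLite or iconv prescribes. Only iconv_open(), iconv() and iconv_close() are the real POSIX calls.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: convert a NUL-terminated string in the named
 * encoding to a newly allocated, NUL-terminated UTF-8 string.
 * Returns NULL on any failure (unknown encoding, invalid or truncated
 * input, out of memory, output buffer too small). The caller must
 * free() the result.
 */
char *to_utf8(const char *from_encoding, const char *input)
{
    assert(from_encoding != NULL && input != NULL);

    /* Note the argument order: target encoding first, source second. */
    iconv_t cd = iconv_open("UTF-8", from_encoding);
    if (cd == (iconv_t)-1)
        return NULL;

    size_t in_left = strlen(input);

    /* Assumption: 4 output bytes per input byte is enough for common
     * legacy encodings. A more robust version would grow the buffer
     * and retry when iconv() fails with E2BIG. */
    size_t out_size = in_left * 4 + 1;
    char *output = malloc(out_size);
    if (output == NULL) {
        iconv_close(cd);
        return NULL;
    }

    /* iconv() takes char ** for historical reasons; it does not
     * modify the input bytes themselves. */
    char *in_ptr = (char *)input;
    char *out_ptr = output;
    size_t out_left = out_size - 1;   /* keep room for the NUL */

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1) {
        free(output);
        return NULL;
    }

    *out_ptr = '\0';
    return output;
}
```

Calling to_utf8("ISO-8859-1", "caf\xe9") should give back the five bytes 63 61 66 C3 A9. The point of doing this at the input edge is that what you then hand to SQLite as text (for example via sqlite3_bind_text) really is UTF-8. If you cannot name the source encoding, that is the case where storing the raw bytes as a blob is the honest choice.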

