-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 10/28/2010 08:58 AM, Drake Wilson wrote:
> Quoth "J. Bobby Lopez" <j...@jbldata.com>, on 2010-10-28 11:48:12 -0400:
>> Another thing that crossed my mind is that maybe I haven't set up the
>> database properly to accept UTF8 or UTF16 data, but I figured this was a
>> default in SQLite3.
> 
> You have to pick one when you create the database, usually UTF-8.  If
> you want UTF-16 use « PRAGMA encoding = 'UTF-16' » (or 'UTF-16le' or
> 'UTF-16be') when you create the database.

Just to be clear, all the SQLite string APIs accept/produce UTF8.  There
are also some that accept/produce UTF16 and have a 16 suffix on the
function name (sqlite3_open16, sqlite3_column_text16 and so on).  The
underlying encoding of the database has no effect on what happens at the
API level - you will always get the same answers.
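
A minimal sketch of that point, using Python's standard sqlite3 module
and throwaway in-memory databases.  The helper name roundtrip, the table
t and the sample string are made up for illustration.  It stores the
same text in a UTF-8 database and a UTF-16 database and reads it back
through the same API calls:

```python
import sqlite3

def roundtrip(encoding, text):
    """Store text in a fresh in-memory database created with the given
    encoding, then read it back through the ordinary API."""
    conn = sqlite3.connect(":memory:")
    # The pragma only takes effect before the database has any content.
    conn.execute(f"PRAGMA encoding = '{encoding}'")
    conn.execute("CREATE TABLE t (s TEXT)")
    conn.execute("INSERT INTO t VALUES (?)", (text,))
    # Ask SQLite which encoding it actually chose for storage.
    stored_as = conn.execute("PRAGMA encoding").fetchone()[0]
    value, length = conn.execute("SELECT s, length(s) FROM t").fetchone()
    conn.close()
    return stored_as, value, length

sample = "naïve 日本語"
utf8 = roundtrip("UTF-8", sample)
utf16 = roundtrip("UTF-16", sample)
print(utf8)
print(utf16)
```

Only the first element of each tuple (the storage encoding reported by
the pragma) should differ.  The string and its length come back the same
either way.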

You can however specify the database encoding as an optimisation.  For
example if you are predominantly using codepoints from 0x800 up to
0xFFFF then UTF8 requires more bytes to encode the string than UTF16 (3
per codepoint versus 2).  Choosing a UTF16 encoding in this example
could potentially save you 33% of the text storage in the file.
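
The arithmetic can be checked with a few lines of Python.  This is only
a sketch of the encoded text size: it ignores SQLite's record and page
overhead, and the helper name encoded_sizes and the sample strings are
made up for illustration.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

cjk = "日本語のテキスト"    # 8 codepoints, all between 0x800 and 0xFFFF
ascii_text = "plain text"  # 10 codepoints, all below 0x80

cjk_utf8, cjk_utf16 = encoded_sizes(cjk)
ascii_utf8, ascii_utf16 = encoded_sizes(ascii_text)

# Fraction of text bytes saved by UTF-16 for the CJK sample.
saving = 1 - cjk_utf16 / cjk_utf8
print(cjk_utf8, cjk_utf16, round(saving, 2))
print(ascii_utf8, ascii_utf16)
```

The same comparison goes the other way for mostly-ASCII text, where
UTF16 takes twice the bytes of UTF8.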

Another optimisation may be that you have a user defined function or a
collation that is significantly more efficient on UTF16 than UTF8.
Counting the number of codepoints is one example.  When you register the
udf/collation with SQLite you can specify which encodings it can work
with.  SQLite will always make the conversions before calling the
udf/collation.  For example if you register the udf/collation to only
handle UTF16 then SQLite will automatically convert any bytes it is
storing behind the scenes in UTF8 into UTF16 before calling.  If you use
the udf/collations a lot then it would be more efficient to store the
database in UTF16 format so you don't have these conversions going on
behind the scenes.
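
To illustrate why counting codepoints can be cheaper on UTF16, here is a
rough sketch in Python working on raw bytes.  Python's sqlite3 module
does not expose the preferred-encoding argument (in the C API it is the
eTextRep parameter of sqlite3_create_function and
sqlite3_create_collation), so this only shows the counting itself.  The
function names are made up, and the UTF16 version assumes there are no
surrogate pairs in the data.

```python
def count_codepoints_utf8(data: bytes) -> int:
    """Scan every byte and skip continuation bytes (bit pattern 10xxxxxx)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)

def count_codepoints_utf16_bmp(data: bytes) -> int:
    """ASSUMPTION: no surrogate pairs, so every codepoint is one 2-byte unit
    and the count follows from the length without reading the bytes."""
    return len(data) // 2

text = "naïve 日本語"
n_utf8 = count_codepoints_utf8(text.encode("utf-8"))
n_utf16 = count_codepoints_utf16_bmp(text.encode("utf-16-le"))
print(n_utf8, n_utf16)
```

The UTF8 version has to look at every byte, while the UTF16 version
under that assumption is a single division.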

TL;DR: The encoding of the database is irrelevant for what you see as a
SQLite API user.  You will always get the same answers no matter which
combination of APIs and database encoding is used.  It may be
beneficial to explicitly set the encoding as a space or CPU
optimisation, but this is *very* unlikely to be the space/CPU issue with
your application.

Yes, I know about surrogate pairs and no I won't mention how they could
complicate matters.

Roger
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkzJ3kkACgkQmOOfHg372QQnLgCfRYT8tDSi4HjJgPEVyAet3O4I
LI4An0Z7ovkEfb2xPK+clpXF/2hjCa/K
=fTye
-----END PGP SIGNATURE-----
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
