On 10/05/2009 3:29 PM, kalyan....@aol.in wrote:
> Hi,
> 
> I am a newbie to sqlite.
> I want to create a sqlite database in Linux OS.
> 
> The data that I wish to put in the table contains both ASCII and UTF-16 
> encoded strings.
> For eg. File and directory names are UTF-16 where the URL, date time 
> information about the files are in ASCII.
> 
> How do I create a table that holds both ASCII and UTF-16 strings.
> 
> The documentation says that TEXT datatype can hold both the encodings,
> if i do a sqlite3_open -> TEXT is only ASCII, UTF-8
> if i do a sqlite3_open16 -> TEXT is only UTF-16

AFAICT ... and I am by no means omniscient :-) ... the story is 
something like this:

There are THREE possibilities for how the data is declared to be stored. 
They are: UTF-8, UTF-16LE, UTF-16BE. You must choose ONE of the three.

The choice is made when a database is created and applies to the 
whole database; in SQL it is set with PRAGMA encoding, which must be 
issued before the first table is created. There is no way of making 
a choice per table.
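
Here's a rough sketch in C of what I mean; "demo.db", the table and 
its columns are just made-up examples, and error handling is trimmed 
for brevity:

    #include <sqlite3.h>

    int main(void)
    {
        sqlite3 *db;
        if (sqlite3_open("demo.db", &db) != SQLITE_OK) return 1;

        /* Must run before the first table is created; on a database
           that already has content, the pragma is silently ignored. */
        sqlite3_exec(db, "PRAGMA encoding = 'UTF-16le';", 0, 0, 0);
        sqlite3_exec(db, "CREATE TABLE files(name TEXT, url TEXT);",
                     0, 0, 0);

        sqlite3_close(db);
        return 0;
    }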

When you subsequently feed data into SQLite, you have two [1] choices:
(a) you say that you are supplying bytes that are encoded in UTF-8 
(the sqlite3_bind_text family), or
(b) you say that you are supplying wide chars that are encoded in the 
UTF-16xE that matches the endianness of the machine that you are 
running on (the sqlite3_bind_text16 family).

If the choice differs from that recorded in the database, SQLite will 
attempt to convert your data for you.
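
To make that concrete: sticking with the made-up files(name, url) 
table from the sketch above, you can bind one column as UTF-8 and the 
other as native-order UTF-16 in the same INSERT, and SQLite will 
transcode whichever one does not match the database encoding. Again a 
rough, untested sketch with error handling omitted:

    #include <sqlite3.h>

    /* db is an already-open handle on the example database */
    void insert_row(sqlite3 *db)
    {
        sqlite3_stmt *stmt;
        const char *url = "http://example.com/a.txt"; /* ASCII/UTF-8 */
        /* two Devanagari chars, native byte order, NUL-terminated */
        const unsigned short name[] = { 0x0905, 0x0928, 0 };

        sqlite3_prepare_v2(db,
            "INSERT INTO files(name, url) VALUES(?, ?);", -1, &stmt, 0);
        sqlite3_bind_text16(stmt, 1, name, -1, SQLITE_STATIC); /* (b) */
        sqlite3_bind_text(stmt, 2, url, -1, SQLITE_STATIC);    /* (a) */
        sqlite3_step(stmt);
        sqlite3_finalize(stmt);
    }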

If the declaration that you implicitly make is untrue, e.g. your 
data is actually encoded in ISO 8859-xx but you say that it is 
encoded in UTF-8, you may not get any error message: SQLite3 does 
very little validity checking.

Which choice you make for database storage would depend on two things:

(1) the time required for any transcoding: (1.1) by SQLite 
automatically, as described above, and/or (1.2) by you, to convert 
raw input into the encoding that you will use to feed your data to 
SQLite

(2) the resultant size of the database:
[following assumes all of your data is in the range U+0000 to U+FFFF]

If stored in UTF-16xE, each character will occupy 2 bytes.

If stored in UTF-8, the number of bytes per char varies:

U+0000 to U+007F (ASCII) -> 1 byte per char
U+0080 to U+07FF (covers e.g. accented-and-otherwise-adorned Latin 
letters, Greek, Cyrillic, Hebrew, Arabic) -> 2 bytes per char
U+0800 to U+FFFF (covers e.g. Devanagari and other Indian languages, 
Chinese, Japanese, Korean) -> 3 bytes per char

So it's impossible to say whether UTF-8 would be better/same/worse 
without examining your actual data. For example, a 10-character 
Devanagari file name costs 30 bytes in UTF-8 but only 20 in UTF-16, 
while a 30-character ASCII URL costs 30 bytes in UTF-8 but 60 in 
UTF-16. You could write a script that loads a typical data mix into 
a database, run it twice with the database encoding specified as 
UTF-8 and then as UTF-16xE, and compare the resultant sizes (and the 
time difference).
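
Something along these lines, say. The file names, row count and 
sample string below are placeholders; you would substitute a 
representative slice of your real data, and you could wrap the two 
build() calls in timing code to get the speed comparison as well:

    #include <stdio.h>
    #include <sys/stat.h>
    #include <sqlite3.h>

    static void build(const char *path, const char *enc)
    {
        sqlite3 *db;
        char sql[64];
        remove(path);                    /* start from an empty file */
        sqlite3_open(path, &db);
        snprintf(sql, sizeof sql, "PRAGMA encoding = '%s';", enc);
        sqlite3_exec(db, sql, 0, 0, 0);
        sqlite3_exec(db, "CREATE TABLE t(s TEXT);", 0, 0, 0);
        sqlite3_exec(db, "BEGIN;", 0, 0, 0);
        for (int i = 0; i < 10000; i++)  /* your typical data mix    */
            sqlite3_exec(db, "INSERT INTO t VALUES('sample text');",
                         0, 0, 0);
        sqlite3_exec(db, "COMMIT;", 0, 0, 0);
        sqlite3_close(db);
    }

    static long fsize(const char *path)
    {
        struct stat st;
        return stat(path, &st) == 0 ? (long)st.st_size : -1L;
    }

    int main(void)
    {
        build("utf8.db",  "UTF-8");
        build("utf16.db", "UTF-16le");
        printf("UTF-8 : %ld bytes\n", fsize("utf8.db"));
        printf("UTF-16: %ld bytes\n", fsize("utf16.db"));
        return 0;
    }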

[1]: If you are using a wrapper like e.g. the Python sqlite3 module, 
your choices may be different. Possibilities include: you supply a 
unicode object, or you supply a str object (8-bit characters) plus an 
encoding (which may default to UTF-8), and the wrapper will do 
whatever is necessary.

HTH,
John