On 10/05/2009 3:29 PM, kalyan....@aol.in wrote:
> Hi,
>
> I am a newbie to sqlite.
> I want to create a sqlite database in Linux OS.
>
> The data that I wish to put in the table contains both ASCII and
> UTF-16 encoded strings.
> For e.g. file and directory names are UTF-16, whereas the URL and
> date/time information about the files are in ASCII.
>
> How do I create a table that holds both ASCII and UTF-16 strings?
>
> The documentation says that the TEXT datatype can hold both encodings:
> if I do a sqlite3_open -> TEXT is only ASCII, UTF-8
> if I do a sqlite3_open16 -> TEXT is only UTF-16
AFAICT ... and I am by no means omniscient :-) ... the story is something
like this:

There are THREE possibilities for how the data is declared to be stored:
UTF-8, UTF-16LE, UTF-16BE. You must choose ONE of the three. The choice
is made when a database is created and applies to the whole database;
there is no way of making the choice per table.

When you subsequently feed data into SQLite, you have two [1] choices:

(a) you say that you are supplying bytes that are encoded in UTF-8
(b) you say that you are supplying wide chars encoded in the UTF-16xE
    that matches the endianness of the machine you are running on.

If your choice differs from the encoding recorded in the database, SQLite
will attempt to convert your data for you. If the declaration that you
make implicitly is untrue (e.g. your data is encoded in ISO 8859-xx but
you say that it is UTF-8), you may not get any error message; SQLite3
does little checking.

Which choice you make for database storage depends on two things:

(1) the time required for any transcoding:
    (1.1) by SQLite automatically, as described above, and/or
    (1.2) by you, to convert raw input into the encoding in which you
          feed your data to SQLite

(2) the resultant size of the database (the following assumes all of
    your data is in the range U+0000 to U+FFFF):

    If stored in UTF-16xE, each character occupies 2 bytes.
    If stored in UTF-8, the bytes per char vary:
        U+0000 to U+007F (ASCII) -> 1 byte per char
        U+0080 to U+07FF (covers e.g. accented and otherwise adorned
            Latin letters, Greek, Cyrillic, Hebrew, Arabic)
            -> 2 bytes per char
        U+0800 to U+FFFF (covers e.g. Devanagari and other Indic
            scripts, Chinese, Japanese, Korean) -> 3 bytes per char

So it's impossible to say whether UTF-8 would be better, the same, or
worse without examining your actual data. You could write a script that
loads a typical data mix into a database, run it with the database
encoding specified first as UTF-8 and then as UTF-16xE, and compare the
resultant sizes (and any difference in running time). There are some C
sketches in the P.S. below.

[1]: If you are using a wrapper, e.g. the Python sqlite3 module, your
choices may be different. Possibilities include: you supply a unicode
object, or you supply a str object (8-bit characters) plus an encoding
(which may default to UTF-8), and the wrapper does whatever is necessary.

HTH,
John
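
P.S. If it helps, here is roughly what the creation-time choice looks
like with the C API, where it is made via PRAGMA encoding. Untested
sketch; the file name, table, and columns are made up for illustration.
The key point is that the pragma must run before anything is written to
the new database; once the first table exists, the encoding recorded in
the file header wins and the pragma no longer has any effect.

/* Create a new database whose TEXT values are stored as UTF-16LE. */
#include <stdio.h>
#include <sqlite3.h>

int main(void)
{
    sqlite3 *db;
    char *err = 0;

    if (sqlite3_open("demo.db", &db) != SQLITE_OK) {
        fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        return 1;
    }
    /* Must precede the first CREATE TABLE / INSERT. */
    sqlite3_exec(db, "PRAGMA encoding = 'UTF-16le';", 0, 0, &err);
    sqlite3_exec(db,
        "CREATE TABLE files (name TEXT, url TEXT, mtime TEXT);",
        0, 0, &err);
    if (err) {
        fprintf(stderr, "SQL error: %s\n", err);
        sqlite3_free(err);
    }
    sqlite3_close(db);
    return 0;
}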
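
The two feeding choices (a) and (b) then correspond to
sqlite3_bind_text() versus sqlite3_bind_text16(). Another untested
sketch, assuming an open db handle and the made-up files table from
above; SQLite transcodes whichever form you supply into the database's
stored encoding:

/* Insert one row, supplying the name as UTF-16 (native byte order)
   and the URL as UTF-8 (of which plain ASCII is a subset). */
#include <sqlite3.h>

int insert_row(sqlite3 *db, const void *name_utf16, const char *url_utf8)
{
    sqlite3_stmt *stmt;
    int rc = sqlite3_prepare_v2(db,
        "INSERT INTO files (name, url) VALUES (?1, ?2);", -1, &stmt, 0);
    if (rc != SQLITE_OK) return rc;

    /* (b): wide chars in the machine's native UTF-16 byte order.
       -1 = NUL-terminated; SQLITE_TRANSIENT = SQLite makes a copy. */
    sqlite3_bind_text16(stmt, 1, name_utf16, -1, SQLITE_TRANSIENT);

    /* (a): bytes declared to be UTF-8. */
    sqlite3_bind_text(stmt, 2, url_utf8, -1, SQLITE_TRANSIENT);

    rc = sqlite3_step(stmt);
    sqlite3_finalize(stmt);
    return rc == SQLITE_DONE ? SQLITE_OK : rc;
}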
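
And the size experiment can be as small as this: build the same data set
twice, once per encoding, and stat the two files. The sample strings are
placeholders (error handling elided, POSIX stat assumed); feed it a
representative slice of your real data for a meaningful comparison.

/* Compare database sizes for UTF-8 vs UTF-16LE storage. */
#include <stdio.h>
#include <sys/stat.h>
#include <sqlite3.h>

static long build_db(const char *path, const char *enc_pragma)
{
    sqlite3 *db;
    sqlite3_stmt *stmt;
    struct stat st;
    int i;

    remove(path);
    sqlite3_open(path, &db);
    sqlite3_exec(db, enc_pragma, 0, 0, 0);   /* before CREATE TABLE */
    sqlite3_exec(db, "CREATE TABLE t (name TEXT, url TEXT);", 0, 0, 0);
    sqlite3_exec(db, "BEGIN;", 0, 0, 0);
    sqlite3_prepare_v2(db, "INSERT INTO t VALUES (?1, ?2);", -1, &stmt, 0);
    for (i = 0; i < 10000; i++) {
        /* U+092D U+093E U+0930 U+0924 (Devanagari "Bharat") as UTF-8:
           3 bytes per char, so 2 bytes per char when stored as UTF-16 */
        sqlite3_bind_text(stmt, 1,
            "\xE0\xA4\xAD\xE0\xA4\xBE\xE0\xA4\xB0\xE0\xA4\xA4",
            -1, SQLITE_STATIC);
        sqlite3_bind_text(stmt, 2, "http://example.com/file", -1,
                          SQLITE_STATIC);
        sqlite3_step(stmt);
        sqlite3_reset(stmt);
    }
    sqlite3_finalize(stmt);
    sqlite3_exec(db, "COMMIT;", 0, 0, 0);
    sqlite3_close(db);
    return stat(path, &st) == 0 ? (long)st.st_size : -1L;
}

int main(void)
{
    printf("UTF-8:    %ld bytes\n",
           build_db("u8.db", "PRAGMA encoding = 'UTF-8';"));
    printf("UTF-16le: %ld bytes\n",
           build_db("u16.db", "PRAGMA encoding = 'UTF-16le';"));
    return 0;
}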