On 9/20/07, P Kishor <[EMAIL PROTECTED]> wrote:
> On 9/20/07, Trevor Talbot <[EMAIL PROTECTED]> wrote:
> > On 9/20/07, P Kishor <[EMAIL PROTECTED]> wrote:
> > > Lucknow:~/Data/ecoservices punkish$ less foo.csv
> > > "the first record"
> > > "\351 \347 \361 \356"
> > > "more from 3rd row"
> > > "row four"
> > > "these \223volunteered\224 activities"
> > > "<\341 \370 \343 \374 \356 & others>"
> > > foo.csv (END)
> > > -----------------------------
> >
> > Note that this is *not* UTF-8. If you're still using this as test
> > data, you need to get rid of it and use UTF-8 encoded data instead.

> this is where I lost you... when you say "this" is not UTF8, what is
> "this"?

The data in the file shown by less, and since sqlite3 exported that
data exactly as it was stored, the data in the db as well.

> All I want is that I want (1) the user to be able to type ç in
> the web form, and (2) I want to be able to save ç in the db. (3) Then
> when I look at that data, either on the command line, but definitely
> back on the web, I want it to appear as ç. (4) If I export it, I
> should still be able to see it as ç and not something else.
>
> Seems like I was able to do 1, 2, and 3 with my test case, but not 4
> (I got \347 instead of ç).
>
> Also, in my production case, 1, 2, and 3 are not very reliable. Are you
> saying my data above are not UTF8? If so, I would like to know how you
> can tell that, so I can recognize it in the future myself. Also, I
> would like to know how I can do what you are suggesting I should do,
> that is, how can I ensure that I "use UTF8 encoded data"?

Okay, first a quick primer on character sets and encodings.

A byte can hold one of 256 different values (0-255), and most
processing tends to happen on bytes, so it makes sense that individual
characters should be stored as individual bytes.

First we have US-ASCII, the character encoding standard that defines
128 characters, including the basic English alphabet, numbers, and
some punctuation (www.asciitable.com). However, this obviously doesn't
cover all the symbols in common use, or characters from other
languages, so more definitions are needed.
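To make that boundary concrete, here's a small Python sketch (Python
isn't part of this thread; I'm just using it as a convenient tool for
poking at bytes):

```python
# US-ASCII defines exactly 128 characters, values 0-127.
assert "A".encode("ascii") == b"\x41"   # 'A' is value 65, well inside ASCII

# A character like ç has no ASCII value at all, so encoding it fails.
try:
    "ç".encode("ascii")
except UnicodeEncodeError:
    print("ç is not an ASCII character")
```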
Since ASCII takes up only half of a byte's 256 values, that leaves 128
values for other purposes. Many other character sets keep the bottom
half as ASCII, and assign different characters to the top 128 values.
The ISO-8859 family of standards works this way.

ISO-8859-1 is also known as Latin-1, and is most common for languages
that use characters similar to English, Spanish, etc. It adds a few
more symbols (copyright, paragraph, etc) and some common characters
with diacritical marks (like é ç ñ î). The data you posted above was
entered into your database using this encoding (or Windows-1252, which
is identical except for adding some characters in places 8859-1 does
not use).

ISO-8859-2 is also known as Latin-2, and covers another set of
European languages (such as Romanian). It contains a different set of
symbols and characters with diacritical marks needed for these
languages, characters that don't fit in 8859-1.

It keeps going, of course (Wikipedia has info:
http://en.wikipedia.org/wiki/Category:ISO_8859). There are many other
encodings that work this way, and collectively they're known as
single-byte encodings: they all represent a character as a single
byte, but the actual meaning of that byte depends on the character set
in use.

This situation is ripe for confusion, since interpreting a sequence of
bytes as being in a different encoding than the one it was stored in
will lead to strange results. This is exactly what you saw in your
Cocoa editor, since it defaulted to using the classic MacRoman
encoding, which uses those same byte values to store uppercase
characters instead.

It gets worse: there are multi-byte encodings too. You typically see
these in the East Asian languages, since they don't use the same
alphabetic writing system, and instead have thousands of characters to
encode. A byte only supports a mere 256 values, so more than one byte
is needed to represent a single character.
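Here's that confusion in miniature, sketched in Python (my
illustration, not something from the thread): the exact bytes from
your CSV dump, decoded as Latin-1 versus MacRoman:

```python
# The four octal escapes less showed (\351 \347 \361 \356) are these bytes:
data = b"\xe9\xe7\xf1\xee"

# Decoded as ISO-8859-1 (Latin-1), the encoding the data was stored in:
print(data.decode("latin-1"))     # éçñî

# The very same bytes decoded as MacRoman come out as different
# characters, which is what the Cocoa editor was doing by default.
print(data.decode("mac_roman"))
```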
By now you can see how this can spiral into an unmaintainable mess:
you have to worry about this encoding and that encoding, and you can
store the encoding with the text, but what do you do if someone
requests data in another encoding, and what if they are using a
specific encoding but that text only contains ASCII characters and
therefore everyone should see it anyway, and how do you tell the
difference and *brain asplode*

Enter Unicode, which has the goal of putting all the world's commonly
used language characters and symbols into one single character set. By
using Unicode, you don't have to worry about which character set your
data is in, and you can move on to other more interesting issues.

Of course, it's a very large character set, supporting just over 1
million characters. Obviously these don't all fit in one byte, so
there are also several standard encodings. UTF-8 is one of those, and
the most common one for Internet use.

UTF-8 was designed to be ASCII-friendly: the first 128 byte values
(0-127) are identical to ASCII. All Unicode characters beyond those
first 128 are represented using 2 or more bytes in a row, with each
byte having a value in the range 128-255.

The tool you used to look at the CSV file you wrote out, less, is
designed for ASCII use. When it encounters a byte it can't print as an
ASCII character, it displays the value using an octal escape sequence
instead (\351 etc). (Many other unixy tools accept such escape
sequences as input, so even if you can't print or type such characters
you can still work with them using your keyboard.)

With the above two items in mind, that's how I knew it wasn't UTF-8
right away: you showed 4 characters that were not in the ASCII range.
In UTF-8, they would require at least 2 bytes each, for a total of 8+
bytes, but the output from less showed only 4 bytes. Therefore, it's
some single-byte encoding, and not UTF-8.
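You can check that byte math yourself; a quick Python sketch (again my
illustration, not from the thread):

```python
# One character, two encodings, different byte counts:
assert "ç".encode("latin-1") == b"\xe7"      # 1 byte: octal \347, what less showed
assert "ç".encode("utf-8") == b"\xc3\xa7"    # 2 bytes, both in the 128-255 range

# The flip side makes a handy validity test: single-byte Latin-1 data
# for these characters is simply not a legal UTF-8 byte sequence.
try:
    b"\xe9\xe7\xf1\xee".decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")
```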
(Familiarity with the issues and some staring at code charts on
Wikipedia let me reach the conclusions about ISO-8859-1/Win-1252 and
MacRoman, but you don't need to go that far just to check for UTF-8.)

Okay, so now you want to make sure you always use UTF-8. I'll defer to
others for the web situation, as I simply don't have enough experience
there. They're probably right that at this point you need to look at
the Perl CGI side of things.

As for working with the database directly, probably the easiest thing
you can do is change the encoding Terminal.app uses. I'm on Panther
and don't have a Tiger machine to test with, but as far as I'm aware
it works the same way. Under the Terminal -> Window Settings menu,
Display section, you can set the encoding to UTF-8. This will cause it
to translate byte sequences on both input and display.

With that change made, try starting sqlite3 and inserting those
characters above directly on the command line. Do the CSV output
again, then you can repeat the viewing with less and your Cocoa editor
to see if you get different results.

There's another easy check you can do though, with the file utility:

  file foo.csv

It should say something like "foo.csv: UTF-8 Unicode text" if those
characters were output as UTF-8, and "foo.csv: ISO-8859 text" for the
file you originally made.

Once you have some sample data stored in sqlite that way, you'll be
able to test the storage and display parts of your web app separately,
which should help narrow down problems you encounter there. There are
also some third-party GUI database management tools; perhaps someone
here can recommend a good one for OS X that supports UTF-8? I've yet
to try any myself.
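If it helps to test the storage layer in isolation, here's a sketch
using Python's sqlite3 module (my illustration; the table and file
names are made up). SQLite stores text as UTF-8 by default, and
Python hands text in and out as proper strings, so a clean round trip
looks like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")        # throwaway in-memory database
conn.execute("CREATE TABLE notes (txt TEXT)")
conn.execute("INSERT INTO notes VALUES (?)", ("é ç ñ î",))

(stored,) = conn.execute("SELECT txt FROM notes").fetchone()
assert stored == "é ç ñ î"                # round-trips intact

# Export it the way a CSV dump would, writing UTF-8 explicitly:
with open("foo.csv", "w", encoding="utf-8") as f:
    f.write('"%s"\n' % stored)
# `file foo.csv` should now report it as UTF-8 Unicode text.
```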