Re: [sqlite] Unicode Again... Still Stuck... A Challenge... Store and retrieve the word résumé without using a unicode string literal

Nuno Lucas Tue, 31 Jul 2007 05:07:32 -0700

First, let me start by saying I don't have much experience with
Python, but this isn't a Python problem.


On 7/31/07, wcmadness <[EMAIL PROTECTED]> wrote:
[...]
> I'm on a Windows machine.  It turns out that the default code page on
> Windows is cp437.  So, in my Python code, if I type:

Wrong. The default code page on Windows depends on the localized
version of Windows that is installed. Your default code page may be
CP437 (the OEM code page for US English) on your machine, but it can
be a completely different one on another machine.
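
As a small illustration of why the code page matters, the sketch below
encodes the same word under three encodings. It assumes Python 3 (the
thread itself uses Python 2), and the variable names are made up for
the example.

```python
import locale

# The word from the thread, written with escapes so the example does not
# depend on the encoding of this source file.
word = "r\u00e9sum\u00e9"

# The same text produces different bytes under different encodings.
encoded = {name: word.encode(name) for name in ("cp437", "cp1252", "utf-8")}
for name, data in encoded.items():
    print(name, data)

# What this particular machine prefers; the answer varies per installation.
print(locale.getpreferredencoding(False))
```

Only the cp437 line gives the 'r\x82sum\x82' form quoted in the
original message; cp1252 gives 'r\xe9sum\xe9'.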

> s = 'résumé' (with the French e s), it is stored as: 'r\x82sum\x82' because
> hex 82 (decimal 130) is the code for French e in code page 437 (used by
> Windows)...
>
> OK.  So, now that I know my data comes to me from the HTML form (or in a flat
> file) in code page 437 on a Windows machine, I can do the following when I
> send the data to the database:

Wrong. It's whoever generates the HTML page who decides in which
encoding the data is interpreted and passed back. The default according
to most current standards is UTF-8, but most Windows users expect it
to be ISO-8859-1 (Latin 1, which is identical to the first 256
characters of Unicode) and don't use the appropriate "<META>" tag to
declare that. This explains most problems with international
characters in IE (which tries to "guess" the encoding if none is given
and usually ends up with ISO-8859-1) and Firefox (which doesn't do the
"guessing" part and sticks with the standard).
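
This would also explain the 'r\xe9sum\xe9' puzzle in the quoted message
further down. The sketch below assumes (it is only an assumption) that
the browser sent the form data as Latin 1 / Windows-1252 bytes. It uses
Python 3 bytes and str in place of the Python 2 str and unicode types
from the thread.

```python
# Assumed form submission: the word encoded as Latin 1 / Windows-1252.
submitted = b"r\xe9sum\xe9"

# Decoding with the wrong code page does not fail, it silently maps 0xE9
# to a different character (a Greek capital theta in CP437).
wrong = submitted.decode("cp437")

# Decoding with the encoding the page actually used gives the real word.
right = submitted.decode("cp1252")

# Encoding the wrongly decoded text back to CP437 just returns the
# original bytes, which is why the result shows \xe9 and not \x82.
round_trip = wrong.encode("cp437")

print(repr(wrong), repr(right), round_trip)
```

The decode and encode calls cancel each other out, so the data never
was in code page 437 to begin with.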

In reality, Windows doesn't use ISO-8859-1, but a variation of it
(usually Windows-1252). The main difference is that Latin 1 treats the
32 codes from 128 to 159 as control characters, while Windows uses this
range for extra printable characters (for example, the Euro sign was
added there, because its Unicode code point, U+20AC, doesn't fit in
8 bits).
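
The difference can be checked with a few lines of Python 3 (a sketch
only; the variable names are invented for the example):

```python
euro = "\u20ac"                      # the Euro sign, code point U+20AC

# Windows-1252 places the Euro sign at byte 0x80.
cp1252_bytes = euro.encode("cp1252")

# Latin 1 reads the same byte as a control character, not as the Euro sign.
latin1_view = cp1252_bytes.decode("latin-1")

# Latin 1 has no Euro sign at all, so encoding it fails.
try:
    euro.encode("latin-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False

# Every Latin 1 byte maps to the Unicode code point with the same number.
all_bytes = bytes(range(256))
latin1_matches_unicode = all_bytes.decode("latin-1") == "".join(map(chr, range(256)))

print(cp1252_bytes, repr(latin1_view), latin1_has_euro, latin1_matches_unicode)
```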

You need to learn how to use the "<META>" pragmas in HTML pages.
Without them, it's the HTTP server that decides which encoding the pages
are in (if using HTTP version 1.1 or higher), which is probably wrong in
most cases.
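
A minimal sketch of a CGI-style response that declares its encoding in
both the HTTP header and a "<META>" tag, and then actually encodes the
page in that same charset. The function name render_page and the page
layout are hypothetical; only the txtName field name comes from the
quoted message. It assumes Python 3.

```python
import html


def render_page(field_value, charset="utf-8"):
    """Return headers plus an HTML form as bytes, all in one declared charset."""
    page = (
        "<html><head>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=%s">\n'
        "</head><body>\n"
        '<form method="post"><input name="txtName" value="%s"></form>\n'
        "</body></html>\n"
    ) % (charset, html.escape(field_value, quote=True))
    header = "Content-Type: text/html; charset=%s\r\n\r\n" % charset
    # The declaration and the real encoding of the bytes must agree.
    return header.encode("ascii") + page.encode(charset)


response = render_page("r\u00e9sum\u00e9")
print(response.decode("utf-8"))
```

With the charset declared, browsers normally submit the form data in
that same encoding, so the script knows what to decode with.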

That is the second option if all your pages are in a specific code
page: configure the server to return a specific code page by default.
I don't like that option, though, as the server has no way of knowing
whether the content it's serving is in fact in the encoding it says
it is.

You also seem to be reading data from mails. Those are covered by other
standards, which you need to read, but Outlook is famous for not
following them, so it takes a lot of hacks to get it right.
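
For a well-behaved message, the declared charset can be read from the
headers. This is a sketch with Python 3's standard email package; the
sample message is invented, and real mail (especially from clients that
ignore the standards) will need more defensive handling.

```python
import email

# Invented sample message: Latin 1 text sent as quoted-printable.
raw_message = (
    "From: someone@example.invalid\r\n"
    "Content-Type: text/plain; charset=iso-8859-1\r\n"
    "Content-Transfer-Encoding: quoted-printable\r\n"
    "\r\n"
    "r=E9sum=E9\r\n"
)

msg = email.message_from_string(raw_message)

# The charset the sender declared, or None if the header omits it.
declared = msg.get_content_charset()

# decode=True undoes the transfer encoding and returns raw bytes.
body_bytes = msg.get_payload(decode=True)

# Turn the bytes into text, falling back to ASCII when nothing is declared.
body_text = body_bytes.decode(declared or "ascii", errors="replace")

print(declared, repr(body_text))
```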

These are just notes for you. I'm not even an expert on this.
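
To tie the notes back to the original question, here is a sketch of the
whole round trip: decode the incoming bytes once with the encoding the
page declared, hand text to SQLite, and encode only on output. It
assumes Python 3 and its standard sqlite3 module, not the Python 2 and
PySqlite setup from the thread; the function name and the single-column
table are hypothetical.

```python
import sqlite3


def store_and_fetch(raw_bytes, source_encoding):
    """Decode form bytes, store the text in SQLite, and read it back."""
    text = raw_bytes.decode(source_encoding)     # bytes -> text, done once
    con = sqlite3.connect(":memory:")
    try:
        con.execute("create table test (word text)")
        # Parameters are passed as a tuple, one value per placeholder.
        con.execute("insert into test values (?)", (text,))
        row = con.execute("select word from test").fetchone()
    finally:
        con.close()
    return row[0]


# Assumed input: the form page was served as Windows-1252.
form_bytes = b"r\xe9sum\xe9"
fetched = store_and_fetch(form_bytes, "cp1252")

# Encode only when writing out, using whatever the output side expects.
print(fetched.encode("utf-8"))
```

No eval() or double encoding is needed once the input is decoded with
the encoding it really arrived in.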


Regards,
~Nuno Lucas




> f = cgi.FieldStorage()
> cur.execute("insert into test values (?,?)",
> (f['txtName'].value.decode('cp437')))
>
> The decode method after the incoming form data will force a translation from
> code page 437 to unicode (from 1 byte per character according to extended
> ascii set code page 437 to 2 bytes per character -- unicode).  That's all
> fine.
>
> Now, when I get the data with:
>
> cur.execute("select * from test")
> mylist = cur.fetchall()
>
> I would expect that I would need to encode the unicode data coming from
> Sqlite to get back to my original code page 437 (of course, I could also
> just use the data as unicode).  So, I would expect to do this:
>
> (say that row one, column one has the value of résumé)
>
> In that case, the following should return me exactly to the original
> 'r\x82sum\x82'
>
> mylist[0][0].encode('cp437')
>
> But it doesn't!!! (Wacky)!
>
> Rather, it gives me this: 'r\xe9sum\xe9'
>
> Interestingly, that's almost the same as what I get with a unicode literal.
> In other words, if I write this Python code:
>
> x = u'résumé'
>
> and then type x in the shell to see what it is, I get this:
>
> u'r\xe9sum\xe9'
>
> The only difference is that the latter is unicode and the former
> ('r\xe9sum\xe9') is not.
>
> So, to get back where I started, I do the fetchall and then this wacky
> thing:
>
> eval("u'" + mylist[0][0].encode('cp437') + "'").encode('cp437')
>
> In other words, I say: OK, you're almost there.  Now, convert to unicode by
> evaluating the string as a unicode literal and then encode the unicode back
> to the code page 437.
>
> What a kludge.  It seems like an awful lot of work to get back to the
> original data that was stored to the database.  And why?  Does anyone know
> what's going on here???
>
> Thanks.
>
>
> wcmadness wrote:
> >
> > Surely there is an answer to this question...
> >
> > I'm using Python and PySqlite.  I'm trying to store the word résumé to a
> > text field.  I'm really doing this as a test to see how to handle
> > diacritical letters, such as umlaut characters (from German) or accented
> > characters (from French).  I can produce French é on my keyboard with
> > Alt-130...
> >
> > If I were coding a string literal, I would send through the data as
> > unicode, as in: u'résumé'.  But, I'm not that lucky.  The data is coming
> > from an HTML form or from a flat file.  It will take on the default codec
> > used on my machine (latin-1).  If I just send it through as is, it has
> > problems either when I fetchall or when I try to print what I've fetched.
> > So, for example:
> >
> > Insert Into tblTest (word) values ('résumé')
> >
> > will cause problems.
> >
> > I know that Sqlite stores text data as utf-8.  I know that in Python (on
> > my machine, at least) strings are stored as latin-1.  So, for example, in
> > Python code:
> >
> > v = 'résumé'
> >
> > v would be of type str, using latin-1 encoding.
> >
> > So, I have tried sending through my data as follows:
> >
> > cur.execute("Insert Into tblTest (word) values (?)",
> > ("résumé".decode("latin-1").encode("utf-8"),))
> >
> > That stores the data just fine, but when I fetchall, I still have
> > problems.  Say, I select * from tblTest and then do:
> >
> > l = cur.fetchall()
> >
> > Doing print l[0][1]  (to print the word résumé) will give a nasty message
> > about ascii codec can't convert character \x082 (or some variation of that
> > message).
> >
> > I've tried doing:
> >
> > print l[0][1].decode('utf-8').encode('latin-1')
> >
> > But to no avail.
> >
> > The simple question is this:
> >
> > How do I store the word résumé to a Sqlite DB without using a unicode
> > literal (e.g. u'résumé'), such that printing the results retrieved from
> > fetchall will not crash?
> >
> > Surely someone is doing this... Say you get data from an HTML page that
> > contains diacritical characters.  You need to store it to Sqlite and
> > retrieve it back out for display.  What do you do???
> >
> > I'm stuck!
> >
> > Doug
> >
>
> --
> View this message in context: 
> http://www.nabble.com/Unicode-Again...-Still-Stuck...-A-Challenge...-Store-and-retrieve-the-word-r%C3%A9sum%C3%A9-without-using-a-unicode-string-literal-tf4190926.html#a11918870
> Sent from the SQLite mailing list archive at Nabble.com.
>
>
> -----------------------------------------------------------------------------
> To unsubscribe, send email to [EMAIL PROTECTED]
> -----------------------------------------------------------------------------
>
>
