Re: [sqlite] Unicode Again... Still Stuck... A Challenge... Store and retrieve the word résumé without using a unicode string literal

wcmadness Mon, 30 Jul 2007 23:48:24 -0700

Well, I have a solution to my own problem, and I wanted to post it for two
reasons: First, it might help someone; second, I'm wondering if someone can
explain it to me...

Here's the scoop...

I'm on a Windows machine.  It turns out that the default console (OEM) code
page on my Windows install is cp437.  So, in my Python code, if I type:

s = 'résumé' (with the French é's), it is stored as: 'r\x82sum\x82' because
hex 82 (decimal 130) is the code for French é in code page 437 (used by
the Windows console)...
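
To make the byte values concrete, here is a minimal sketch in Python 3 (an assumption on my part; the code in this post is Python 2, where str plays the role that bytes plays here).  It only shows that the same word yields different bytes under cp437, Latin-1 and UTF-8:

```python
# Python 3 sketch: one piece of text, three different byte sequences.
# The word is written with escapes so the source file encoding does not matter.
word = "r\u00e9sum\u00e9"  # "résumé"

as_cp437 = word.encode("cp437")
as_latin1 = word.encode("latin-1")
as_utf8 = word.encode("utf-8")

print(as_cp437)   # b'r\x82sum\x82'
print(as_latin1)  # b'r\xe9sum\xe9'
print(as_utf8)    # b'r\xc3\xa9sum\xc3\xa9'
```

So 0x82 and 0xe9 are both "é", just in different code pages.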

OK.  So, now that I know my data comes to me from the HTML form (or in a flat
file) in code page 437 on a Windows machine, I can do the following when I
send the data to the database:

f = cgi.FieldStorage()
# one placeholder per bound value; the parameters must be a tuple (note the comma)
cur.execute("insert into test values (?)",
(f['txtName'].value.decode('cp437'),))

The decode method after the incoming form data will force a translation from
code page 437 to unicode (from 1 byte per character according to extended
ascii set code page 437 to a unicode object, which holds code points rather
than code page bytes).  That's all fine.
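
A minimal sketch of that step, assuming Python 3, the standard sqlite3 module, an in-memory database and a hypothetical one-column table (none of which come from my real setup).  It decodes cp437 bytes to text, binds the text as a parameter, and then asks SQLite for the hex of what it stored:

```python
import sqlite3

# Assumed input: raw cp437 bytes, as they might arrive from a form or flat file.
incoming = b"r\x82sum\x82"
text = incoming.decode("cp437")  # now a text (unicode) value

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table test (word text)")  # hypothetical table layout
cur.execute("insert into test values (?)", (text,))

# hex() exposes the stored bytes; a default database uses UTF-8 for text.
cur.execute("select hex(word) from test")
stored_hex = cur.fetchone()[0]
print(stored_hex)  # 72C3A973756DC3A9
```

The C3A9 pairs are the UTF-8 form of é, so the code page of the input is gone once the value is in the database.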

Now, when I get the data with:

cur.execute("select * from test")
mylist = cur.fetchall()

I would expect that I would need to encode the unicode data coming from
Sqlite to get back to my original code page 437 (of course, I could also
just use the data as unicode).  So, I would expect to do this:

(say that row one, column one has the value of résumé)

In that case, the following should return me exactly to the original
'r\x82sum\x82'

mylist[0][0].encode('cp437')

But it doesn't!!! (Wacky)!

Rather, it gives me this: 'r\xe9sum\xe9'
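
For comparison, here is a minimal round trip sketched in Python 3 against an in-memory database (the table layout is hypothetical).  When the bytes going in really are cp437, encoding the fetched value with cp437 does return the original bytes:

```python
import sqlite3

original = b"r\x82sum\x82"  # assumed: genuine cp437 input

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table test (word text)")  # hypothetical table layout
cur.execute("insert into test values (?)", (original.decode("cp437"),))

cur.execute("select * from test")
mylist = cur.fetchall()

fetched = mylist[0][0]                 # text, as returned by sqlite3
round_tripped = fetched.encode("cp437")
print(round_tripped == original)       # True
```

That suggests the oddity lies in what the incoming bytes actually were, not in Sqlite itself, though this sketch cannot prove that for my CGI setup.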

Interestingly, that's almost the same as what I get with a unicode literal. 
In other words, if I write this Python code:

x = u'résumé'

and then type x in the shell to see what it is, I get this:

u'r\xe9sum\xe9'

The only difference is that the latter is unicode and the former
('r\xe9sum\xe9') is not.
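
One possible reading, offered as an assumption rather than a diagnosis: the form data may have arrived as Latin-1 (or Windows-1252) bytes, not cp437.  The Python 3 sketch below reproduces the symptom under that assumption.  Byte 0xe9 is é in Latin-1 but Greek capital theta in cp437, so decoding with cp437 stores the wrong letters, and encoding with cp437 later just hands the same 0xe9 bytes back:

```python
# Assumed: the browser sent Latin-1 bytes, where 0xE9 means "é".
suspect = b"r\xe9sum\xe9"

# Decoding with the wrong code page: 0xE9 in cp437 is U+0398 (Greek theta).
misread = suspect.decode("cp437")

# Encoding back with cp437 returns the very same bytes that came in.
back = misread.encode("cp437")

print(misread == "r\u0398sum\u0398")  # True
print(back)                           # b'r\xe9sum\xe9'
```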

So, to get back where I started, I do the fetchall and then this wacky
thing:

eval("u'" + mylist[0][0].encode('cp437') + "'").encode('cp437')

In other words, I say: OK, you're almost there.  Now, convert to unicode by
evaluating the string as a unicode literal and then encode the unicode back
to the code page 437.
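
The same conversion can be sketched without eval, which matters because evaluating form data as code is risky.  This assumes, as the behaviour above suggests, that the eval step amounts to reading the bytes as Latin-1.  In Python 3 terms:

```python
# The "almost there" value: é encoded as the single byte 0xE9 (Latin-1).
almost = b"r\xe9sum\xe9"

# Decode as Latin-1 to get real text, then encode to the target code page.
repaired = almost.decode("latin-1").encode("cp437")

print(repaired)  # b'r\x82sum\x82'
```

If that assumption holds, decoding the incoming form data as Latin-1 in the first place would avoid the extra step entirely.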

What a kludge.  It seems like an awful lot of work to get back to the
original data that was stored to the database.  And why?  Does anyone know
what's going on here???

Thanks.

wcmadness wrote:
> 
> Surely there is an answer to this question...
> 
> I'm using Python and PySqlite.  I'm trying to store the word résumé to a
> text field.  I'm really doing this as a test to see how to handle
> diacritical letters, such as umlaut characters (from German) or accented
> characters (from French).  I can produce French é on my keyboard with
> Alt-130...
> 
> If I were coding a string literal, I would send through the data as
> unicode, as in: u'résumé'.  But, I'm not that lucky.  The data is coming
> from an HTML form or from a flat file.  It will take on the default codec
> used on my machine (latin-1).  If I just send it through as is, it has
> problems either when I fetchall or when I try to print what I've fetched. 
> So, for example:
> 
> Insert Into tblTest (word) values ('résumé')
> 
> will cause problems.
> 
> I know that Sqlite stores text data as utf-8.  I know that in Python (on
> my machine, at least) strings are stored as latin-1.  So, for example, in
> Python code:
> 
> v = 'résumé'
> 
> v would be of type str, using latin-1 encoding.
> 
> So, I have tried sending through my data as follows:
> 
> cur.execute("Insert Into tblTest (word) values (?)",
> ("résumé".decode("latin-1").encode("utf-8"),))
> 
> That stores the data just fine, but when I fetchall, I still have
> problems.  Say, I select * from tblTest and then do:
> 
> l = cur.fetchall()
> 
> Doing print l[0][1]  (to print the word résumé) will give a nasty message
> about ascii codec can't convert character \x082 (or some variation of that
> message).
> 
> I've tried doing:
> 
> print l[0][1].decode('utf-8').encode('latin-1')
> 
> But to no avail.
> 
> The simple question is this:
> 
> How do I store the word résumé to a Sqlite DB without using a unicode
> literal (e.g. u'résumé'), such that printing the results retrieved from
> fetchall will not crash?
> 
> Surely someone is doing this... Say you get data from an HTML page that
> contains diacritical characters.  You need to store it to Sqlite and
> retrieve it back out for display.  What do you do???
> 
> I'm stuck!
> 
> Doug
> 
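
On the printing error quoted above: a small Python 3 sketch (hypothetical variable names) of why an ascii codec complains, and one hedged way to keep output from crashing.  Strict ASCII encoding of the word fails, while errors="replace" substitutes a question mark for each character the target encoding lacks:

```python
word = "r\u00e9sum\u00e9"  # "résumé" as text

# Strict encoding to ASCII fails because é has no ASCII representation.
try:
    word.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# A lossy fallback: unencodable characters become "?" instead of raising.
safe = word.encode("ascii", errors="replace")

print(strict_ok)  # False
print(safe)       # b'r?sum?'
```

Encoding to the console's real code page (cp437 on my machine) keeps the accents; the replace fallback is only a last resort.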

-- 
View this message in context: 
http://www.nabble.com/Unicode-Again...-Still-Stuck...-A-Challenge...-Store-and-retrieve-the-word-r%C3%A9sum%C3%A9-without-using-a-unicode-string-literal-tf4190926.html#a11918870
Sent from the SQLite mailing list archive at Nabble.com.
