"Gilles Ganault" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
On Thu, 27 Nov 2008 01:00:28 +0000, MRAB <[EMAIL PROTECTED]>
wrote:
No problem here:

>>> import urllib
>>> data = urllib.urlopen("http://www.amazon.co.jp/").read()
>>> decoded_data = data.decode("shift-jis")
>>>

This is correct. You should read in the whole page and convert it to Unicode immediately.
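
A minimal sketch of that decode-first approach, written for Python 3, where bytes and text are distinct types (the code in this thread is Python 2). The helper name decode_page, the fallback order, and the title pattern are assumptions for illustration only; the thread does not show the real firsttry/secondtry patterns. Note that a failed decode raises UnicodeDecodeError, not UnicodeEncodeError.

```python
import re

def decode_page(raw, encodings=("shift_jis", "iso8859-1")):
    """Return raw bytes decoded with the first encoding that succeeds.

    Hypothetical helper: the encoding order is an assumption.
    iso8859-1 maps every byte, so it acts as a last-resort fallback.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:  # decoding failures raise *Decode*Error
            continue
    # Only reached if no listed encoding accepted the bytes.
    return raw.decode(encodings[-1], errors="replace")

# Hypothetical pattern, standing in for firsttry/secondtry.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.S)

# Stand-in for urlopen(...).read(): bytes as they arrive from the server.
raw_page = "<html><title> 日本語の本 </title></html>".encode("shift_jis")

page = decode_page(raw_page)   # decode once, at the boundary
m = TITLE_RE.search(page)      # all later processing works on text
title = m.group(1).strip() if m else ""
```

With this layout the regular expressions never see raw bytes, so there is a single place where a decoding problem can surface.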


Thanks, but it seems like some pages contain ShiftJIS mixed with some
other code page, and Python complains when trying to display this. I
ended up not displaying the string, and just sending it directly to
the database:

========
title = None
m = firsttry.search(the_page)
if m:
    try:
        title = m.group(1).decode('shift-jis').strip()

You should not search the raw data and decode it later...decode the data when first brought into the program and do all processing in Unicode.

    except UnicodeEncodeError:
        title = m.group(1).decode('iso8859-1').strip()
    except:
        title = ""
else:
    m = secondtry.search(the_page)
    if m:
        try:
            title = m.group(1).decode('shift-jis').strip()
        except UnicodeEncodeError:
            title = m.group(1).decode('iso8859-1').strip()
        except:
            title = ""
    else:
        print "Nothing found for ISBN %s" % isbn

if title:
    #UnicodeEncodeError: 'charmap' codec can't encode characters in position 49-55: character maps to <undefined>
    #print "Found : %s" % title
    print "Found stuff"

Note here that you are getting an "encode" error. When trying to print the data, Python will try to encode the Unicode data using the terminal's default encoding, which I suspect is not Shift-JIS.
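
One way to sidestep that failure while debugging, sketched for Python 3: encode for the terminal yourself with an error handler, so characters the terminal cannot represent are substituted instead of raising. The helper name printable is an assumption, and so is any specific code page; the 'charmap' in the traceback suggests a single-byte code page, but the thread does not say which one.

```python
import sys

def printable(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by '?'.

    Hypothetical helper: encoding defaults to the terminal's encoding,
    falling back to ASCII when stdout reports none.
    """
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # errors="replace" substitutes '?' instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors="replace").decode(encoding)

title = "日本語の本"
print("Found : %s" % printable(title))
```

Only the displayed copy is altered; the original Unicode string is still the value to hand to the database.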

-Mark


sql = 'INSERT INTO books (title) VALUES (?)'
cursor.execute(sql,(title,))
========

Thank you
--
http://mail.python.org/mailman/listinfo/python-list


