Re: unicode confusing
On May 25, 6:07 pm, Paul Boddie p...@boddie.org.uk wrote:
> On 25 May, 17:39, someone petshm...@googlemail.com wrote:
>
> > Hi,
> >
> > reading content of webpage (encoded in utf-8) with urllib2, I can't
> > get parsed data into DB
> >
> > Exception:
> >
> >   File "/usr/lib/python2.5/site-packages/pyPgSQL/PgSQL.py", line 3111, in execute
> >     raise OperationalError, msg
> > libpq.OperationalError: ERROR: invalid UTF-8 byte sequence detected near byte 0xe4
> >
> > I've already checked several python unicode tutorials, but I have no
> > idea how to solve my problem.
>
> With pyPgSQL, there are a few tricks that you have to take into
> account:
>
> 1. With PostgreSQL, it would appear advantageous to create databases
>    using the -E unicode option.

Hi,

DB is in UTF8

> 2. When connecting, use the client_encoding and unicode_results
>    arguments for the connect function call:
>
>    connection = PgSQL.connect(client_encoding="utf-8", unicode_results=1)

If I do unicode_results=1, then there are exceptions in other places,
e.g. urllib.urlencode(values) can't encode values

> 3. After connecting, it appears necessary to set the client encoding
>    explicitly:
>
>    connection.cursor().execute("set client_encoding to unicode")

I've tried this as well, but still have exceptions

> I'd appreciate any suggestions which improve on the above, but what
> this should allow you to do is to present Unicode objects to the
> database and to receive such objects from queries. Whether you can
> relax this and pass UTF-8-encoded strings instead of Unicode objects
> is not something I can guarantee, but it's usually recommended that
> you manipulate Unicode objects in your program where possible, and
> here you should be able to let pyPgSQL deal with the encodings
> preferred by the database.

Thanks for your suggestions! Sadly, I can't solve my problem...

Pet

> Paul

--
http://mail.python.org/mailman/listinfo/python-list
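[Editorial note] On the urllib.urlencode failure mentioned above: with unicode_results=1 the values come back from the database as Unicode objects, and Python 2's urlencode presumably fails when it tries to turn the non-ASCII ones into byte strings. One common workaround is to encode each value to UTF-8 before building the query string. The sketch below uses Python 3 syntax (where the function lives in urllib.parse), and the values dict is a made-up example, not data from the thread.

```python
from urllib.parse import urlencode

# Hypothetical values, as they might come back from the database as text.
values = {"q": "K\u00e4se", "page": "1"}

# Encode the text to UTF-8 bytes explicitly, then percent-encode the bytes.
encoded = {key: value.encode("utf-8") for key, value in values.items()}
query = urlencode(encoded)
```

Under Python 2.5 the same idea would be to call .encode("utf-8") on each Unicode value before passing the dict to urllib.urlencode.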
Re: unicode confusing
On May 26, 9:29 am, Pet petshm...@googlemail.com wrote:
> Thanks for your suggestions! Sadly, I can't solve my problem...

After some time, I tried converting the result with
unicode(result, 'ISO-8859-15') and that was it :)

I had thought it was already utf-8, because of the charset defined in
the meta tag of the webpage I'm fetching.

Pet
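[Editorial note] A small sketch of why this fix worked, assuming the page bytes really were ISO-8859-15 despite the meta tag: byte 0xe4 is "ä" in that encoding, but on its own it is not a valid UTF-8 sequence, which matches the PostgreSQL error. The code is in Python 3 syntax; under Python 2.5 the decode step corresponds to unicode(raw, 'ISO-8859-15'). The sample bytes and the helper name to_text are made up for illustration.

```python
# Hypothetical sample: "Käse" as a server would send it in ISO-8859-15.
raw = b"K\xe4se"


def to_text(data, declared="utf-8", fallback="iso-8859-15"):
    """Decode bytes with the declared charset, falling back if that fails."""
    try:
        return data.decode(declared)
    except UnicodeDecodeError:
        # The declared charset was wrong for these bytes.
        return data.decode(fallback)


text = to_text(raw)
utf8_bytes = text.encode("utf-8")  # the byte form a UTF-8 database expects
```

The fallback encoding is a guess that happens to fit this page; it is not a general way to detect encodings.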
Re: unicode confusing
On 26 May, 10:09, Pet petshm...@googlemail.com wrote:
> After some time, I tried converting the result with
> unicode(result, 'ISO-8859-15') and that was it :)

I haven't really investigated having unicode_results set to false (or
the default) with a database containing UTF-8 (or any non-ASCII
encoded) text, since it's always desirable to manipulate Unicode
internally in one's programs: I don't want plain strings containing
various encoded sequences of bytes when I'm dealing with characters.
That said, if one were consuming XML/HTML and then putting it in raw
form into a database (including the tags), I could understand that
Unicode objects might then seem like a distraction.

> I had thought it was already utf-8, because of the charset defined in
> the meta tag of the webpage I'm fetching.

There are lots of caveats about Web page encodings - which metadata
actually indicates the encoding - but I still regard the best approach
to involve converting text to Unicode as soon as possible, then
presenting Unicode objects to the database. This way, you can separate
the decisions about which encodings the Web pages are using and which
encoding the database is using.

Paul
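[Editorial note] The "convert to Unicode as soon as possible" advice can be sketched as two separate boundaries: one that decodes incoming bytes using whatever the page declares, and one that encodes for the database. This is only an illustration in Python 3 syntax. The function names and sample bytes are hypothetical, and it trusts the Content-Type header, which, as noted above, is not always the metadata that actually indicates the encoding. With a driver that accepts Unicode objects directly, the second step would be left to the driver.

```python
from email.message import Message


def charset_from_header(content_type, default="utf-8"):
    """Read the charset parameter of a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def decode_page(raw, content_type):
    """Input boundary: bytes from the web become text exactly once."""
    return raw.decode(charset_from_header(content_type))


def encode_for_db(text, db_encoding="utf-8"):
    """Output boundary: text becomes bytes in the database's encoding."""
    return text.encode(db_encoding)


# Hypothetical page body in ISO-8859-15, with a matching header.
page = decode_page(b"Gr\xfc\xdfe", "text/html; charset=ISO-8859-15")
stored = encode_for_db(page)
```

Because the two functions never see each other's encoding, changing the database encoding or meeting a page in a different charset only touches one side.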