On Oct 06, Vitaly Pashkov <ad...@fluda.net> wrote:

> You can get my copy here: http://fluda.net/personal/movies.list.gz

It's identical to the one I have.

> imdb=> select id, title from title order by random() limit 5;
>    id    |                  title
> ---------+-----------------------------------------
>  1147146 | (#3.11)
>  1010589 | (2007-05-21)

The entries above are not symptoms of a problem: many episodes of
TV series are identified only by their #SEASON.EPISODE number
or by their airing date (when the episode has no title).

> Got it!
  [...]
> Auf der grünen Wiese

Not exactly the nasty title I expected - whatever it means. :-)
There's nothing wrong with it, and there are many more umlauts
in that list - this is probably the first occurrence that creates
problems when the data are flushed to the database (the data are
temporarily stored in a Python dictionary, and so they are not
flushed to the db in order).

> /usr/lib/python2.5/site-packages/IMDbPY-4.2-py2.5.egg/EGG-INFO/scripts/imdbpy2sql.py:628:
>  UnicodeWarning: Unicode equal comparison failed to convert both
> arguments to Unicode - interpreting them as being unequal
>   if ts[-1].lower() in _articles:

Try this simple script, to see if it creates any problem:
================================================================
#!/usr/bin/env python

import imdb

utf8_title = 'Auf der gr\xc3\xbcnen Wiese'

print utf8_title in imdb.utils._articles

================================================================

If it replies "False" without raising warnings or exceptions,
I think the problem is not in Python itself but in the psycopg2
module or in the configuration of PostgreSQL.

> BUT! I also tried to run it without my modification at line 1427 and with
> modified title_soundex. It stopped at the other line:
  [...]
> "/usr/lib/python2.5/site-packages/IMDbPY-4.2-py2.5.egg/EGG-INFO/scripts/imdbpy2sql.py",
> line 1030, in _runCommand
>     CURS.executemany(self.sqlstr, self.converter(dataList))
> psycopg2.DataError: byte sequence invalid for encoding "UTF8": 0xc333
> HINT:  This error can also happen if the byte sequence does not match the
> encoding expected by the server, which is controlled by "client_encoding".

It's more or less expected: the data is processed without glitches,
but everything explodes when CURS.executemany tries to dump it into
the database.
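
About the byte pair in that error message: 0xc3 is the lead byte of a
two-byte UTF-8 sequence and must be followed by a continuation byte in
the range 0x80-0xbf, while 0x33 is the plain ASCII digit '3'. The small
sketch below only illustrates that rule; the helper name is_valid_utf8
is mine and not part of IMDbPY, and the truncated string is just one
assumed way such a broken sequence could be produced.

```python
def is_valid_utf8(raw):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

good = b'\xc3\xbc'            # UTF-8 encoding of u-umlaut
bad = b'\xc3\x33'             # lead byte followed by ASCII '3' (the rejected pair)
truncated = b'gr\xc3\xbcnen'[:3]  # assumption: a string cut in the middle of a character

print(is_valid_utf8(good))
print(is_valid_utf8(bad))
print(is_valid_utf8(truncated))
```

So the server is right to refuse those bytes; the open question is
where in the chain the valid sequence gets damaged.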

To summarize how imdbpy2sql.py works: it reads the plain text data
files (which are mostly in iso-8859-1 encoding), converts them to
utf-8 for internal usage (for a series of more or less good reasons)
and uses a cursor provided by the db access module (psycopg2, in this
case) to store them (again, passing the strings as utf-8).
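
The conversion step can be sketched as follows. This is not the actual
code of imdbpy2sql.py, only a minimal illustration of the idea; the
function name latin1_to_utf8 is hypothetical, and the sample title is
written as raw iso-8859-1 bytes (where u-umlaut is the single byte 0xfc).

```python
def latin1_to_utf8(raw_line):
    """Re-encode a byte string from iso-8859-1 to utf-8."""
    return raw_line.decode('iso-8859-1').encode('utf-8')

# The title as it would appear in an iso-8859-1 data file.
latin1_title = b'Auf der gr\xfcnen Wiese'

# The same title as utf-8 bytes, i.e. the form handed to the db cursor.
utf8_title = latin1_to_utf8(latin1_title)

print(repr(utf8_title))
```

The single byte 0xfc becomes the two bytes 0xc3 0xbc, which is the
sequence used in the test scripts in this message.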

Your change forces imdbpy2sql.py to use _unicode_ representation of
titles; the UnicodeWarning you get is because it compares a unicode
string to a list of utf-8 encoded strings (imdb.utils._articles).
As a temporary solution you can convert _articles to a list of unicodes,
but I can't consider this a real fix.  Put this line somewhere at the
top of the script - hoping it will not break something else ;-) :
  _articles = [x.decode('utf8') for x in _articles]
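
To show what that line changes, here is a minimal sketch. The list
articles_utf8 is a hypothetical stand-in for imdb.utils._articles (I'm
not reproducing the real contents), and the b'' and u'' literals assume
a Python version that accepts them. On Python 2 the first membership
test is the one that triggers the UnicodeWarning; on Python 3 it is
silently False.

```python
# Hypothetical stand-in for imdb.utils._articles: utf-8 encoded byte strings.
articles_utf8 = [b'der', b'die', b'das', b'l\xc3\xa0']

# A unicode word, as the script handles after your change.
word = u'l\xe0'

# Text compared against encoded bytes: no match is found.
found_in_bytes = word in articles_utf8

# Decode the list once, then compare text with text.
articles_unicode = [x.decode('utf8') for x in articles_utf8]
found_in_unicode = word in articles_unicode

print(found_in_bytes)
print(found_in_unicode)
```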

Why psycopg2 or your PostgreSQL don't play nicely with utf-8 strings
is beyond my imagination. :-)

Another small test (this _could_ rule out psycopg2, although the
problem may still lie in how it's initialized by SQLObject/SQLAlchemy):

===========================================================
#!/usr/bin/env python

import psycopg2

utf8_title = 'Auf der gr\xc3\xbcnen Wiese'

connection = psycopg2.connect(database='imdb', user='UNAME', password='PWD')
curs = connection.cursor()

curs.execute("INSERT INTO title (title, kind_id) VALUES (%s, 1);",
                (utf8_title,))

connection.commit()

===========================================================

In my installation, it works.
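
If it fails on your side, it may be worth looking at the client_encoding
setting that the HINT in your traceback mentions. The sketch below is
only a suggestion, assuming a psycopg2 version that provides
connection.encoding and connection.set_client_encoding(); the function
name report_client_encoding is hypothetical and the connection
parameters are the same placeholders used above.

```python
def report_client_encoding(connect_kwargs):
    """Return (client encoding before, client encoding after, server encoding)."""
    import psycopg2  # imported here so the sketch loads even without the driver

    connection = psycopg2.connect(**connect_kwargs)
    try:
        # Encoding psycopg2 negotiated when the connection was opened.
        before = connection.encoding

        # Ask the server to expect utf-8 from this client.
        connection.set_client_encoding('UTF8')

        curs = connection.cursor()
        curs.execute("SHOW client_encoding;")
        after = curs.fetchone()[0]
        curs.execute("SHOW server_encoding;")
        server = curs.fetchone()[0]
        return before, after, server
    finally:
        connection.close()

# Example call (needs a reachable database, so it is left commented out):
# print(report_client_encoding(dict(database='imdb', user='UNAME', password='PWD')))
```

If the first value is not a utf-8 variant, that would at least explain
why the server rejects the strings the script sends.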

As you can see, debugging these strange interactions between
charsets/encodings and various modules, libraries and database
engines is a real pain. :-(


Thanks for your time!

-- 
Davide Alberani <davide.alber...@gmail.com> [GPG KeyID: 0x465BFD47]
http://erlug.linux.it/~da/

_______________________________________________
Imdbpy-help mailing list
Imdbpy-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-help
