Re: [Imdbpy-help] imdbpy2sql.py, PostgreSQL, UTF8, invalid byte sequence

Vitaly Pashkov Tue, 06 Oct 2009 02:52:57 -0700

On Mon, 5 Oct 2009 14:29:04 +0200, Davide Alberani
<davide.alber...@gmail.com> wrote:
> Two things you can try, now:
> 1. put your movies.list.gz somewhere: with the one I've downloaded
>    yesterday I'm unable to reproduce the problem; I'd like to try yours.
I've downloaded movies.list.gz twice from different mirrors. They are
identical. You can get my copy here: http://fluda.net/personal/movies.list.gz


> 2. with your change to line 1427 in place, modify also the title_soundex
>    function adding at the top these two lines:
>      print _(title)
>      sys.stdout.flush()
>    in your output you should see the "wrong" title right before
>    the UnicodeWarning warning.
> 
>> and then it continues to importing data. This error occurs only once.
> 
> It starts to sound a lot like garbage in the data...
And it looks like it, too :)
imdb=> select id, title from title order by random() limit 5;
   id    |                  title
---------+-----------------------------------------
 1147146 | (#3.11)
 1010589 | (2007-05-21)
  187525 | Frida Kahlo
  151019 | Eagles in the Chicken Coop
 1119049 | Role of the University in American Life

>> Any ideas how I can intercept this UnicodeWarning so I can see at what
>> line of movies.list it happens? I tried adding the following line to
>> readMovieList():
>> print "Title: " + title + ", counter: " + str(count)
>> but the output is soooo big... tons of lines, over 9000.
> 
> Try modifying the title_soundex function as said above, and
> then run your command appending this:
>   2>&1 | tee ~/OUTPUT.txt
> 
> After that, you can easily search for UnicodeWarning in the
> ~/OUTPUT.txt file.
Got it!
...
SCANNING movies: Bullets for Bandits (1942) (movieID: 80001)
SCANNING movies: Celebration (2010) (movieID: 90001)
 * FLUSHING MoviesCache...
Auf der grünen Wiese
/usr/lib/python2.5/site-packages/IMDbPY-4.2-py2.5.egg/EGG-INFO/scripts/imdbpy2sql.py:628: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if ts[-1].lower() in _articles:
...
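
For what it's worth, another way to find the offending title might be to
promote UnicodeWarning to an exception, so that it can be caught right where
it is raised. This is only a sketch, not something imdbpy2sql.py does itself:
compare_title and find_bad_titles are hypothetical names, and the warning is
raised by hand here because newer Python versions no longer emit it for this
kind of comparison.

```python
import warnings


def compare_title(word, articles):
    """Hypothetical stand-in for the article check in imdbpy2sql.py."""
    if isinstance(word, bytes):
        # Simulate the warning Python 2 emits when a byte string cannot be
        # converted to Unicode for an equality comparison.
        warnings.warn("Unicode equal comparison failed", UnicodeWarning)
    return word.lower() in articles


def find_bad_titles(titles, articles):
    """Return the titles whose comparison triggered a UnicodeWarning."""
    bad = []
    with warnings.catch_warnings():
        # Inside this block every UnicodeWarning becomes an exception.
        warnings.simplefilter("error", UnicodeWarning)
        for title in titles:
            try:
                compare_title(title, articles)
            except UnicodeWarning:
                bad.append(title)
    return bad
```

With something like this, only the problem titles need to be printed, not every line of the list.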

BUT! I also tried running it without my modification at line 1427 but with
the modified title_soundex. This time it stopped at a different line:
...
Benzina
Before Dawn
Alone, Unarmed, and Unafraid
Traceback (most recent call last):
... [same as in previous emails]
  File "/usr/lib/python2.5/site-packages/IMDbPY-4.2-py2.5.egg/EGG-INFO/scripts/imdbpy2sql.py", line 1030, in _runCommand
    CURS.executemany(self.sqlstr, self.converter(dataList))
psycopg2.DataError: byte sequence invalid for encoding "UTF8": 0xc333
HINT:  This error can also happen if the byte sequence does not match the
encoding expected by the server, which is controlled by "client_encoding".
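
To narrow down which string carries the bad bytes, one could check each
encoded value before it is handed to the database. A minimal sketch under
assumptions: first_invalid_utf8 is a hypothetical helper, not part of IMDbPY
or psycopg2, and it only reports where decoding fails, not why.

```python
def first_invalid_utf8(data):
    """Return the offset of the first byte that is not valid UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the byte the decoder rejected.
        return exc.start
    return None
```

For example, first_invalid_utf8(b"ab\xc33") returns 2, because 0xc3 must be
followed by a continuation byte and 0x33 is not one.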

>> BTW, after that change (decode('utf-8')) the DB content is looking good,
>> but as it contains almost 1.5M records I can't be 100% sure.
> 
> As said, doing so I'm not too sure that the database will store the
> titles with the right encoding (every database seems to have a
> different opinion about how to handle input in unicode and/or utf8
> or other encodings...)
The encoding looks good, but I think this modification breaks line
parsing, producing garbage lines with dates in the titles and so on. Anyway,
that was a dirty hack that may (or may not) help debugging. ;)
