On Mon, 5 Oct 2009 14:29:04 +0200, Davide Alberani <davide.alber...@gmail.com> wrote:

> Two things you can try, now:
> 1. put your movies.list.gz somewhere: with the one I've downloaded
>    yesterday I'm unable to reproduce the problem; I'd like to try yours.

I downloaded movies.list.gz twice, from different mirrors. The two copies are identical.
You can get my copy here: http://fluda.net/personal/movies.list.gz

> 2. with your change to line 1427 in place, modify also the title_soundex
>    function adding at the top these two lines:
>      print _(title)
>      sys.stdout.flush()
>    in your output you should see the "wrong" title right before
>    the UnicodeWarning warning.
>
>> and then it continues to importing data. This error occurs only once.
>
> It starts to sound a lot like garbage in the data...

And it looks like it, too :)

imdb=> select id, title from title order by random() limit 5;
   id    |                  title
---------+-----------------------------------------
 1147146 | (#3.11)
 1010589 | (2007-05-21)
  187525 | Frida Kahlo
  151019 | Eagles in the Chicken Coop
 1119049 | Role of the University in American Life

>> Any ideas how i can intercept this UnicodeWarning so i can see at what
>> line of movie.list it happen? I tried to add next line to readMovieList():
>>   print "Title: " + title + ", counter: " + str(count)
>> but the output is soooo big... tonns of lines, over9000.
>
> Try modifying the title_soundex function as said above, and
> then run your command appending this:
>   2>&1 | tee ~/OUTPUT.txt
>
> After that, you can easily search for UnicodeWarning in the
> ~/OUTPUT.txt file.

Got it!

...
SCANNING movies: Bullets for Bandits (1942) (movieID: 80001)
SCANNING movies: Celebration (2010) (movieID: 90001)
* FLUSHING MoviesCache...
Auf der grĂ¼nen Wiese
/usr/lib/python2.5/site-packages/IMDbPY-4.2-py2.5.egg/EGG-INFO/scripts/imdbpy2sql.py:628: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if ts[-1].lower() in _articles:
...

BUT! I also tried running it without my modification at line 1427 but with the modified title_soundex. This time it stopped at a different title:

...
Benzina
Before Dawn
Alone, Unarmed, and Unafraid
Traceback (most recent call last):
...
[same as in previous emails]
  File "/usr/lib/python2.5/site-packages/IMDbPY-4.2-py2.5.egg/EGG-INFO/scripts/imdbpy2sql.py", line 1030, in _runCommand
    CURS.executemany(self.sqlstr, self.converter(dataList))
psycopg2.DataError: byte sequence invalid for encoding "UTF8": 0xc333
HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".

>> BTW, after that change (decode('utf-8')) DB content is looking good, but
>> as it contains almost 1.5M records i can't be sure for 100%.
>
> As said, doing so I'm not too sure that the database will store the
> titles with the right encoding (every database seems to have a
> different opinion about how to handle input in unicode and/or utf8
> or other encodings...)

The encoding looks good, but I think this modification breaks line parsing, producing junk rows with dates in the titles and so on. Anyway, that was a dirty hack that may (or may not) help with debugging. ;)
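
P.S. Instead of digging through the whole import log, it might be quicker to scan the list file directly for lines that are not valid UTF-8. Below is a minimal sketch of that idea. It assumes Python 3 (not the Python 2.5 used above) and that the problem really is at the byte level. The helper names find_invalid_utf8_lines and scan_gzip are made up for illustration and are not part of IMDbPY, and the sample lines in the demo are just titles from this thread, not the real movies.list layout.

```python
import gzip
import io


def find_invalid_utf8_lines(stream, limit=10):
    """Yield (line_number, raw_bytes) for lines that fail UTF-8 decoding.

    `stream` must be a binary file-like object; at most `limit` hits
    are reported so the output stays readable.
    """
    found = 0
    for line_number, raw in enumerate(stream, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            yield line_number, raw.rstrip(b"\r\n")
            found += 1
            if found >= limit:
                break


def scan_gzip(path, limit=10):
    """Hypothetical convenience wrapper: scan a gzip-compressed list file."""
    with gzip.open(path, "rb") as stream:
        return list(find_invalid_utf8_lines(stream, limit))


# Small in-memory demo with made-up sample lines: the second one is
# deliberately encoded as Latin-1, so its u-umlaut is not valid UTF-8.
sample = io.BytesIO(
    b"Benzina\n"
    + "Auf der gr\u00fcnen Wiese\n".encode("latin-1")
    + b"Before Dawn\n"
)
for number, raw in find_invalid_utf8_lines(sample):
    print(number, repr(raw))
```

For the real file, something like scan_gzip("movies.list.gz") should report the first few offending line numbers. If the list turns out to be Latin-1 rather than UTF-8, every accented title will probably be reported, which would itself be a useful hint about where the garbage comes from.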