Does anyone know a quick and dirty fix, or at least a way to skip strings with
such invalid byte sequences until there is an official fix, so I can finish the
import?
Can we detect invalid bytes? Maybe we can somehow replace or get rid of the
*0xc320* sequence, which is the one that appears most often. Or skip these rows.
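
For illustration, a minimal Python sketch of the detect-and-replace idea, along
the lines of the PHP helper quoted below. It assumes Python 3 and that the raw
value is available as bytes. The name fix_encoding and the Latin-1 fallback are
assumptions for this example and are not part of IMDbPY:

```python
def fix_encoding(raw, fallback="latin-1"):
    """Return text for `raw` bytes: strict UTF-8 if valid, else the fallback.

    Hypothetical helper, not an IMDbPY function.
    """
    try:
        # Strict decoding raises on sequences such as 0xc3 0x20.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail,
        # although the resulting characters may not be the intended ones.
        return raw.decode(fallback)
```

The returned string can always be encoded as valid UTF-8 again, so it should at
least avoid the "invalid byte sequence" error, even if some characters come out
wrong.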

I analyzed the error a bit more. These errors occur mostly for Japanese actors
(actors.list), where strange characters appear in the filmography:

Hayakawa, Yuzo

Burai hij*8)*
*
*

I tried to delete these rows manually, but there are too many of them :/
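
Instead of deleting by hand, the bad rows could perhaps be filtered out
automatically. A rough sketch, assuming Python 3 and that the rows are
available as byte strings before they are sent to the database.
skip_invalid_rows is a made-up name, not something imdbpy2sql.py provides:

```python
def skip_invalid_rows(rows):
    """Yield decoded text only for rows that are valid UTF-8.

    Hypothetical filter: `rows` is any iterable of byte strings.
    """
    for raw in rows:
        try:
            yield raw.decode("utf-8")
        except UnicodeDecodeError:
            # Drop the row instead of letting the database reject it.
            continue


rows = [b"Hayakawa, Yuzo", b"Burai hij\xc3 ", b"caf\xc3\xa9"]
kept = list(skip_invalid_rows(rows))  # the second row is dropped
```

This silently loses the affected rows, so it is only a stopgap until the data
or the import script is fixed.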
Thank you.


On Wed, Apr 13, 2011 at 9:45 AM, darklow <dark...@gmail.com> wrote:

> Since I am not familiar with Python, maybe you could suggest a quick fix
> so that the script doesn't hang?
> Maybe this helps: in PHP we get exactly the same encoding error when
> importing wrongly decoded data. When we have no control over the data and
> can't always call utf8_encode, since it could encode a string twice, I
> bypass the error with this function, which at least prevents the
> PostgreSQL error:
>
> function  fix_encoding($in_str) {
>         $cur_encoding = mb_detect_encoding($in_str) ;
>         if($cur_encoding == "UTF-8" && mb_check_encoding($in_str,"UTF-8")){
>             return $in_str;
>         }else{
>             return utf8_encode($in_str);
>         }
> }
>
> Maybe you can help to adapt this function to Python if similar functions
> are available so we can use it as a quick fix?
> Thanks a lot.
>
> On Mon, Apr 11, 2011 at 10:46 PM, Davide Alberani <
> davide.alber...@gmail.com> wrote:
>
>> On Mon, Apr 11, 2011 at 18:35, darklow <dark...@gmail.com> wrote:
>> >
>> >   File "./imdbpy2sql.py", line 1194, in _toDB
>> >     CURS.executemany(self.sqlstr, self.converter(l))
>> > psycopg2.DataError: invalid byte sequence for encoding "UTF8": 0xc320
>> > HINT:  This error can also happen if the byte sequence does not match the
>> > encoding expected by the server, which is controlled by "client_encoding".
>> >
>> > Any suggestions? I found a similar topic, but there were also no solutions.
>>
>> Yes, I've had other reports about this bug.
>> Seems to be related to some garbage in the actors.list.gz file.
>> I hope to have time to investigate the problem within a week or two.
>>
>> Thanks for the bug report!
>>
>> --
>> Davide Alberani <davide.alber...@gmail.com>  [PGP KeyID: 0x465BFD47]
>> http://www.mimante.net/
>>
>
>
_______________________________________________
Imdbpy-help mailing list
Imdbpy-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-help
