Thank you. I will check our text-extraction process and see how to improve it.

Patrick


________________________________________
From: Michael McCandless [luc...@mikemccandless.com]
Sent: Wednesday, September 26, 2012 5:45 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing in Solr: invalid UTF-8

Python's unicode() function takes an optional (keyword) "errors"
argument telling it what to do when an invalid UTF-8 byte sequence is
seen.

The default (errors='strict') is to throw the exceptions you're
seeing.  But you can also pass errors='replace' or errors='ignore'.

See http://docs.python.org/howto/unicode.html for details ...
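For example, a minimal sketch (Python 2, where unicode() is a builtin;
"raw" is an assumed name for the bytes your extraction step produced):

    # replace each invalid UTF-8 sequence with U+FFFD (the replacement character)
    text = unicode(raw, 'utf-8', errors='replace')

    # or silently drop the invalid bytes instead
    text = unicode(raw, 'utf-8', errors='ignore')

On Python 3 the equivalent is raw.decode('utf-8', errors='replace').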

However, I agree with Robert: you should dig into why the process you
used to extract the full text from your binary documents is producing
invalid UTF-8 ... something is wrong with that process.

Mike McCandless

http://blog.mikemccandless.com

On Tue, Sep 25, 2012 at 10:44 PM, Robert Muir <rcm...@gmail.com> wrote:
> On Tue, Sep 25, 2012 at 2:02 PM, Patrick Oliver Glauner
> <patrick.oliver.glau...@cern.ch> wrote:
>> Hi
>> Thanks. But I see that 0xd835 is missing in this list (see my exceptions).
>>
>> What's the best way to get rid of all of them in Python? I am new to Unicode
>> in Python, but I am sure this use case is quite common.
>>
>
> I don't really know Python either, so I could be wrong here, but are
> you just taking these binary .PDF and .DOC files, treating them as
> UTF-8 text, and sending them to Solr?
>
> If so, I don't think that will work very well. Maybe instead try
> parsing these binary files with something like Tika to get at the
> actual content and send that? (It seems some people have developed
> Python integration for this, e.g.
> http://redmine.djity.net/projects/pythontika/wiki)
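To illustrate that approach, here is a minimal sketch using the tika
package (an assumption; the pythontika project linked above is another
option), which drives Apache Tika to pull the plain text out of a
binary file before it is sent to Solr:

    from tika import parser

    # hypothetical input path; Tika detects the format (.pdf, .doc, ...) itself
    parsed = parser.from_file('report.pdf')

    # 'content' holds the extracted plain text as a proper unicode string
    text = parsed['content']

Note that Solr also ships with the ExtractingRequestHandler (Solr
Cell), which runs Tika on the server side, so the raw binary can be
posted to Solr directly instead.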
