Python's unicode function takes an optional (keyword) "errors"
argument, telling it what to do when an invalid UTF-8 byte sequence is
encountered.

The default (errors='strict') is to raise the exceptions you're
seeing.  But you can also pass errors='replace' (substitute U+FFFD for
each bad sequence) or errors='ignore' (silently drop the bad bytes).
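
Here is a rough sketch of the three modes side by side. The sample
bytes and the decode_with helper are made up for illustration, and I'm
using bytes.decode rather than unicode() so that the same snippet runs
on both Python 2.7 and Python 3; as far as I know it accepts the same
"errors" values.

```python
# Hypothetical sample: valid UTF-8 for "caf\u00e9", then a stray 0xff byte,
# which can never appear in well-formed UTF-8.
raw = b"caf\xc3\xa9 \xff end"


def decode_with(errors):
    """Decode the sample bytes as UTF-8 using the given error handler."""
    return raw.decode("utf-8", errors)


# 'strict' (the default) raises on the first invalid sequence.
try:
    decode_with("strict")
except UnicodeDecodeError as exc:
    print("strict raised:", exc)

# 'replace' substitutes U+FFFD for the invalid byte.
replaced = decode_with("replace")

# 'ignore' drops the invalid byte entirely.
ignored = decode_with("ignore")
```

Note that both 'replace' and 'ignore' lose information, so they only
hide the symptom.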

See for details ...

However, I agree with Robert: you should dig into why whatever process
you used to extract the full text from your binary documents is
producing invalid UTF-8 ... something is wrong with that process.

Mike McCandless

On Tue, Sep 25, 2012 at 10:44 PM, Robert Muir <> wrote:
> On Tue, Sep 25, 2012 at 2:02 PM, Patrick Oliver Glauner
> <> wrote:
>> Hi
>> Thanks. But I see that 0xd835 is missing in this list (see my exceptions).
>> What's the best way to get rid of all of them in Python? I am new to unicode 
>> in Python but I am sure that this use case is quite frequent.
> I don't really know python either: so I could be wrong here but are
> you just taking these binary .PDF and .DOC files and treating them as
> UTF-8 text and sending them to Solr?
> If so, I don't think that will work very well. Maybe instead try
> parsing these binary files with something like Tika to get at the
> actual content and send that? (it seems some people have developed
> python integration for this, e.g.
