Python's unicode function takes an optional (keyword) "errors" argument, telling it what to do when an invalid UTF-8 byte sequence is seen.
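
Here is a minimal sketch of how the three standard error modes ('strict', 'replace', 'ignore') behave. The byte string is made up for illustration, and I use bytes.decode rather than unicode() so the snippet also runs on Python 3; as far as I know, unicode(raw, 'utf-8', errors='replace') on Python 2 accepts the same errors values.

```python
# Hypothetical input: valid UTF-8 text with one stray byte (0xff) in the middle.
raw = b"caf\xc3\xa9 \xff menu"

# errors='strict' (the default) raises UnicodeDecodeError on the bad byte.
try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='replace' substitutes U+FFFD (the replacement character)
# for the undecodable byte.
replaced = raw.decode("utf-8", errors="replace")

# errors='ignore' silently drops the undecodable byte.
ignored = raw.decode("utf-8", errors="ignore")
```

Note that 'replace' and 'ignore' only hide the symptom: the bad bytes are lost either way, so they are best treated as a stopgap.
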
The default (errors='strict') is to throw the exceptions you're seeing. But you can also pass errors='replace' or errors='ignore'. See http://docs.python.org/howto/unicode.html for details ...

However, I agree with Robert: you should dig into why whatever process you used to extract the full text from your binary documents is producing invalid UTF-8 ... something is wrong with that process.

Mike McCandless

http://blog.mikemccandless.com

On Tue, Sep 25, 2012 at 10:44 PM, Robert Muir <rcm...@gmail.com> wrote:
> On Tue, Sep 25, 2012 at 2:02 PM, Patrick Oliver Glauner
> <patrick.oliver.glau...@cern.ch> wrote:
>> Hi
>> Thanks. But I see that 0xd835 is missing in this list (see my exceptions).
>>
>> What's the best way to get rid of all of them in Python? I am new to unicode
>> in Python but I am sure that this use case is quite frequent.
>>
>
> I don't really know python either: so I could be wrong here but are
> you just taking these binary .PDF and .DOC files and treating them as
> UTF-8 text and sending them to Solr?
>
> If so, I don't think that will work very well. Maybe instead try
> parsing these binary files with something like Tika to get at the
> actual content and send that? (it seems some people have developed
> python integration for this, e.g.
> http://redmine.djity.net/projects/pythontika/wiki)