On Tue, Sep 25, 2012 at 2:02 PM, Patrick Oliver Glauner <patrick.oliver.glau...@cern.ch> wrote:
> Hi
> Thanks. But I see that 0xd835 is missing in this list (see my exceptions).
>
> What's the best way to get rid of all of them in Python? I am new to unicode
> in Python but I am sure that this use case is quite frequent.
>
I don't really know Python either, so I could be wrong here, but are you just taking these binary .PDF and .DOC files, treating them as UTF-8 text, and sending them to Solr? If so, I don't think that will work very well. Instead, maybe try parsing these binary files with something like Tika to get at the actual content, and send that to Solr. (It seems some people have developed Python integration for this, e.g. http://redmine.djity.net/projects/pythontika/wiki)
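
That said, if stray characters still show up after proper text extraction, something along these lines might work as a filter. This is only a rough sketch, assuming Python 3 and assuming the exceptions come from characters that XML 1.0 does not allow. The function name strip_invalid_xml_chars is made up for illustration and is not part of Solr or Tika. Note that 0xD835 falls in the surrogate range (U+D800 to U+DFFF), which this pattern covers:

```python
import re

# Code points that XML 1.0 does not allow:
#   - C0 control characters other than tab, line feed, and carriage return
#   - lone surrogates U+D800..U+DFFF (this range includes 0xD835)
#   - the noncharacters U+FFFE and U+FFFF
_INVALID_XML_CHARS = re.compile(
    r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def strip_invalid_xml_chars(text):
    """Return text with every code point disallowed by XML 1.0 removed.

    Hypothetical helper for illustration; assumes `text` is a Python 3 str.
    """
    return _INVALID_XML_CHARS.sub("", text)
```

You would call it on each field value just before building the document you send to Solr, e.g. strip_invalid_xml_chars(extracted_text). In Python 3 a properly decoded character outside the Basic Multilingual Plane is a single code point, so it is left alone; only unpaired surrogates are dropped.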