On Tue, Sep 25, 2012 at 2:02 PM, Patrick Oliver Glauner
<patrick.oliver.glau...@cern.ch> wrote:
> Hi
> Thanks. But I see that 0xd835 is missing in this list (see my exceptions).
>
> What's the best way to get rid of all of them in Python? I am new to unicode 
> in Python but I am sure that this use case is quite frequent.
>

I don't really know Python either, so I could be wrong here, but are
you just taking these binary .PDF and .DOC files, treating them as
UTF-8 text, and sending them to Solr?

If so, I don't think that will work very well: the raw bytes of a PDF
or DOC file are not text, so decoding them as UTF-8 will mostly give
you garbage. Maybe instead try parsing these binary files with
something like Tika to get at the actual content and send that? (It
seems some people have developed Python integration for this, e.g.
http://redmine.djity.net/projects/pythontika/wiki)
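
As for stripping the characters themselves: 0xd835 is a UTF-16 high
surrogate (the range is 0xD800-0xDBFF), not a control character, which
may be why it is missing from that list. XML 1.0 does not allow
surrogate code points at all. Here is a minimal sketch of one way to
drop them, assuming Python 3 and text that is already a str; the name
strip_invalid_xml_chars is made up for illustration, and none of this
comes from Solr itself:

```python
import re

# Code points outside the XML 1.0 "Char" production: most C0 control
# characters, the surrogate range U+D800-U+DFFF (0xd835 falls in here),
# and the noncharacters U+FFFE and U+FFFF.
# Tab (0x09), LF (0x0a) and CR (0x0d) are valid and are left alone.
_INVALID_XML_CHARS = re.compile(
    r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def strip_invalid_xml_chars(text):
    """Return text with every code point that XML 1.0 forbids removed.

    Hypothetical helper: assumes `text` is a Python 3 str, where a
    surrogate can only appear as a lone (unpaired) code point.
    """
    return _INVALID_XML_CHARS.sub("", text)


if __name__ == "__main__":
    dirty = "abc\ud835def\x00ghi"
    print(repr(strip_invalid_xml_chars(dirty)))
```

This only hides the symptom, though. If the surrogates come from
decoding binary files as text, extracting the real content first is
still the better fix.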
