RE: Indexing in Solr: invalid UTF-8

2012-10-09 Thread Patrick Oliver Glauner
[char])
+except:
+pass
+return utext

with:

+CFG_SOLR_INVALID_CHAR_REPLACEMENTS = {
+u'\u' : u,
+u'\uD835' : u
+}

This works well in our production environment.

Cheers,
Patrick

From: Patrick Oliver Glauner
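The snippet above is cut off by the archive, so the original function body and the first key of the table are not recoverable here. As a rough illustration of the same idea (a table of offending characters and their replacements), here is a minimal sketch written for Python 3 rather than the Python 2.4 used in the thread. The names EXAMPLE_REPLACEMENTS and apply_replacements are hypothetical, and so is the u'\ufffe' entry; only u'\uD835' comes from the message, and mapping it to an empty string is an assumption.

```python
# Hypothetical replacement table (not the original one from the message):
# each key is an offending code point, each value is what replaces it.
# An empty string simply drops the character.
EXAMPLE_REPLACEMENTS = {
    "\ud835": "",  # lone high surrogate mentioned in the thread
    "\ufffe": "",  # illustrative noncharacter, an assumption
}


def apply_replacements(text, replacements=EXAMPLE_REPLACEMENTS):
    """Return text with every key of `replacements` swapped for its value."""
    # str.translate expects a mapping keyed by code point (int).
    table = {ord(char): value for char, value in replacements.items()}
    return text.translate(table)


cleaned = apply_replacements("formula \ud835 here")
```

A fixed table like this only covers the characters someone has already tripped over, so it may need extending whenever a new invalid code point shows up in the extracted text.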

RE: Indexing in Solr: invalid UTF-8

2012-09-28 Thread Patrick Oliver Glauner
, Sep 25, 2012 at 10:44 PM, Robert Muir rcm...@gmail.com wrote:

On Tue, Sep 25, 2012 at 2:02 PM, Patrick Oliver Glauner patrick.oliver.glau...@cern.ch wrote:

Hi

Thanks. But I see that 0xd835 is missing in this list (see my exceptions). What's the best way to get rid of all of them in Python? I am

Indexing in Solr: invalid UTF-8

2012-09-25 Thread Patrick Oliver Glauner
Hello

We use Solr 3.1 and Jetty to index previously extracted fulltexts from PDFs, DOC etc. Our indexing script is written in Python 2.4 using solrpy:

[...]
text = remove_control_characters(text) # except \r, \t, \n
utext = unicode(text, 'utf-8')
SOLR_CONNECTION.add(id=recid, fulltext=utext)

RE: Indexing in Solr: invalid UTF-8

2012-09-25 Thread Patrick Oliver Glauner
[markus.jel...@openindex.io]
Sent: Tuesday, September 25, 2012 7:24 PM
To: solr-user@lucene.apache.org; Patrick Oliver Glauner
Subject: RE: Indexing in Solr: invalid UTF-8

Hi - you need to get rid of all non-character code points.
http://unicode.org/cldr/utility/list-unicodeset.jsp
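The advice to drop all non-character code points could be sketched roughly as below. This is an illustration in Python 3, not the code used in the thread (which targeted Python 2.4), and the function names are hypothetical. It assumes that "invalid" means the 66 Unicode noncharacters (U+FDD0..U+FDEF and every code point ending in FFFE or FFFF) plus lone surrogates (U+D800..U+DFFF), the range that 0xd835 falls into. In Python 3 a valid supplementary character is a single code point, so only unpaired surrogates are removed; on a narrow Python 2 build the same check would also hit valid surrogate pairs.

```python
def is_invalid_code_point(cp):
    """True for lone surrogates and Unicode noncharacters."""
    if 0xD800 <= cp <= 0xDFFF:       # surrogate range, includes 0xD835
        return True
    if 0xFDD0 <= cp <= 0xFDEF:       # contiguous noncharacter block
        return True
    return (cp & 0xFFFF) >= 0xFFFE   # U+xFFFE and U+xFFFF in every plane


def strip_invalid_code_points(text):
    """Return text without surrogates and noncharacters."""
    return "".join(ch for ch in text if not is_invalid_code_point(ord(ch)))


utext = strip_invalid_code_points("full\ud835text\uffff")
```

Filtering by range rather than by a fixed list means newly encountered offenders do not need to be added one at a time. Control characters are not touched here, since the thread handles those separately with remove_control_characters.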