Hi,

Thanks. But I see that 0xd835 is missing from this list (see my exceptions).
What's the best way to get rid of all of them in Python? I am new to Unicode in Python, but I am sure that this use case is quite frequent.

Patrick
________________________________________
From: Markus Jelsma [markus.jel...@openindex.io]
Sent: Tuesday, September 25, 2012 7:24 PM
To: solr-user@lucene.apache.org; Patrick Oliver Glauner
Subject: RE: Indexing in Solr: invalid UTF-8

Hi - you need to get rid of all non-character code points.

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]

-----Original message-----
> From:Patrick Oliver Glauner <patrick.oliver.glau...@cern.ch>
> Sent: Tue 25-Sep-2012 18:47
> To: solr-user@lucene.apache.org
> Subject: Indexing in Solr: invalid UTF-8
>
> Hello
>
> We use Solr 3.1 and Jetty to index previously extracted fulltexts from PDFs,
> DOC etc. Our indexing script is written in Python 2.4 using solrpy:
>
> [...]
> text = remove_control_characters(text) # except \r, \t, \n
> utext = unicode(text, 'utf-8')
> SOLR_CONNECTION.add(id=recid, fulltext=utext)
> [...]
>
> But for some fulltexts we still get exceptions like:
>
> * [was class java.io.CharConversionException] Invalid UTF-8 character
> 0xd835(a surrogate character) at char #1144, byte #127)
> * [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff
> at char #1427640, byte #1564649)
> * ....
>
> Why do these exceptions still occur? How can I avoid these exceptions? I
> hoped that utext = unicode(text, 'utf-8') was enough.
>
> Thanks
> Patrick
>
>
> FYI, the fulltext field definition is:
>
> <field name="fulltext" type="invenioText" indexed="true" stored="false"
>        multiValued="true"/>
>
> where
>
> <fieldType name="invenioText" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1"
>             generateNumberParts="1"
>             catenateWords="0"
>             catenateNumbers="0"
>             catenateAll="1"
>             preserveOriginal="1"
>             splitOnNumerics ="1"
>             splitOnCaseChange ="1"
>             stemEnglishPossessive="1"
>     />
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>     <filter class="solr.EnglishPorterFilterFactory"/>
>     <filter class="solr.LengthFilterFactory" min="2" max="99"/>
>
>   </analyzer>
>   [...]
> </fieldType>
>
>
> --
> Patrick GLAUNER [patrick.oliver.glau...@cern.ch]
>
> CERN
> Information Technology Department
> CH-1211 Geneva 23
>
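
On the question at the top of this thread (removing all of these code points in Python): the linked list covers noncharacters only, while 0xd835 is a surrogate (U+D800..U+DFFF), which is a separate range, so both need to be filtered. Below is a minimal sketch of one way to do that. It is written for Python 3 rather than the Python 2.4 used in the thread, and the function names is_indexable and strip_invalid_code_points are hypothetical, not part of solrpy or Solr.

```python
def is_indexable(ch):
    """Return True unless ch is a surrogate or a Unicode noncharacter."""
    cp = ord(ch)
    # Surrogates U+D800..U+DFFF (0xd835 from the first exception is one).
    if 0xD800 <= cp <= 0xDFFF:
        return False
    # Noncharacters U+FDD0..U+FDEF.
    if 0xFDD0 <= cp <= 0xFDEF:
        return False
    # Last two code points of every plane: U+FFFE, U+FFFF, U+1FFFE, ...
    if (cp & 0xFFFE) == 0xFFFE:
        return False
    return True


def strip_invalid_code_points(text):
    """Drop surrogates and noncharacters; keep everything else unchanged."""
    return ''.join(ch for ch in text if is_indexable(ch))


# Hypothetical sample resembling the two reported exceptions.
sample = 'a\ud835b\uffffc'
cleaned = strip_invalid_code_points(sample)
```

On Python 2 the same check would apply to unicode objects instead of str. On a narrow Python 2 build, valid characters above U+FFFF are stored as surrogate pairs, so dropping every surrogate would also drop those characters; that case would need pair-aware handling.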