Hello,

We use Solr 3.1 and Jetty to index fulltexts previously extracted from PDF, DOC and similar files. Our indexing script is written in Python 2.4 using solrpy:
    [...]
    text = remove_control_characters(text)  # except \r, \t, \n
    utext = unicode(text, 'utf-8')
    SOLR_CONNECTION.add(id=recid, fulltext=utext)
    [...]

But for some fulltexts we still get exceptions like:

* [was class java.io.CharConversionException] Invalid UTF-8 character 0xd835(a surrogate character) at char #1144, byte #127)
* [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1427640, byte #1564649)
* ....

Why do these exceptions still occur? How can I avoid them? I had hoped that

    utext = unicode(text, 'utf-8')

would be enough.

Thanks
Patrick

FYI, the fulltext field definition is:

    <field name="fulltext" type="invenioText" indexed="true" stored="false" multiValued="true"/>

where

    <fieldType name="invenioText" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="0" catenateNumbers="0" catenateAll="1"
                preserveOriginal="1" splitOnNumerics="1"
                splitOnCaseChange="1" stemEnglishPossessive="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"/>
        <filter class="solr.LengthFilterFactory" min="2" max="99"/>
      </analyzer>
      [...]
    </fieldType>

-- 
Patrick GLAUNER [patrick.oliver.glau...@cern.ch]
CERN Information Technology Department
CH-1211 Geneva 23
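
Both code points in the errors quoted above (0xd835, a surrogate, and 0xffff) fall outside the `Char` production of XML 1.0, which allows only #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF. Assuming the update request reaches Solr as XML and the parser rejects anything outside that range (an assumption, not something confirmed here), a filter along the following lines might help narrow the problem down. The function names are hypothetical, and the sketch is written for Python 3; on a narrow Python 2 build, characters above U+FFFF are stored as surrogate pairs, so a per-character filter like this would also drop those.

```python
def is_xml10_char(code_point):
    """Return True if code_point matches the XML 1.0 'Char' production."""
    return (
        code_point in (0x09, 0x0A, 0x0D)          # tab, newline, carriage return
        or 0x20 <= code_point <= 0xD7FF           # excludes other C0 controls
        or 0xE000 <= code_point <= 0xFFFD         # excludes surrogates, U+FFFE, U+FFFF
        or 0x10000 <= code_point <= 0x10FFFF      # supplementary planes
    )


def strip_invalid_xml_chars(text):
    """Drop every character that XML 1.0 does not allow (hypothetical helper)."""
    return "".join(ch for ch in text if is_xml10_char(ord(ch)))


# A lone surrogate and U+FFFF, the two code points from the error messages.
sample = "a\ud835b\uffffc"
cleaned = strip_invalid_xml_chars(sample)
print(repr(cleaned))
```

Applied just before the `add()` call, such a filter would silently discard the offending characters; whether dropping them or replacing them with U+FFFD is preferable depends on how the fulltext is used.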