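Hi, you need to strip all Unicode noncharacter code points (such as U+FFFF) from the text before sending it to Solr. Unpaired surrogates such as 0xd835 are not valid in UTF-8 either, so they have to go as well. The full list of noncharacters is here: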
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]
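
In case it helps, here is a minimal sketch of such a filter. The function names are made up for illustration and are not part of solrpy or Solr. It is written for Python 3, where a string is a sequence of full code points. On Python 2.4 with a narrow build, characters outside the BMP are stored as surrogate pairs, so the surrogate check below would also drop valid pairs and would need adjusting.

```python
def is_unwanted_code_point(cp):
    """Return True for code points that should not be sent to the indexer."""
    # Surrogates (U+D800..U+DFFF) are not valid on their own in UTF-8.
    if 0xD800 <= cp <= 0xDFFF:
        return True
    # Noncharacter block U+FDD0..U+FDEF.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of every plane: U+FFFE, U+FFFF, U+1FFFE, ...
    if (cp & 0xFFFE) == 0xFFFE:
        return True
    return False


def strip_unwanted_code_points(text):
    """Drop surrogates and noncharacters from an already decoded string."""
    return "".join(ch for ch in text if not is_unwanted_code_point(ord(ch)))
```

You would call this on the decoded text, right before handing it to SOLR_CONNECTION.add(...). Control characters are a separate matter and are still handled by your existing remove_control_characters() step.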
-----Original message-----
> From:Patrick Oliver Glauner <patrick.oliver.glau...@cern.ch>
> Sent: Tue 25-Sep-2012 18:47
> To: solr-user@lucene.apache.org
> Subject: Indexing in Solr: invalid UTF-8
>
> Hello
>
> We use Solr 3.1 and Jetty to index previously extracted fulltexts from PDFs,
> DOC etc. Our indexing script is written in Python 2.4 using solrpy:
>
> [...]
> text = remove_control_characters(text) # except \r, \t, \n
> utext = unicode(text, 'utf-8')
> SOLR_CONNECTION.add(id=recid, fulltext=utext)
> [...]
>
> But for some fulltexts we still get exceptions like:
>
> * [was class java.io.CharConversionException] Invalid UTF-8 character
> 0xd835(a surrogate character) at char #1144, byte #127)
> * [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff
> at char #1427640, byte #1564649)
> * ....
>
> Why do these exceptions still occur? How can I avoid them? I
> hoped that utext = unicode(text, 'utf-8') was enough.
>
> Thanks
> Patrick
>
>
> FYI, the fulltext field definition is:
>
> <field name="fulltext" type="invenioText" indexed="true" stored="false"
> multiValued="true"/>
>
> where
>
> <fieldType name="invenioText" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1"
>             generateNumberParts="1"
>             catenateWords="0"
>             catenateNumbers="0"
>             catenateAll="1"
>             preserveOriginal="1"
>             splitOnNumerics="1"
>             splitOnCaseChange="1"
>             stemEnglishPossessive="1"
>     />
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>     <filter class="solr.EnglishPorterFilterFactory"/>
>     <filter class="solr.LengthFilterFactory" min="2" max="99"/>
>
>   </analyzer>
>   [...]
> </fieldType>
>
>
> --
> Patrick GLAUNER [patrick.oliver.glau...@cern.ch]
>
> CERN
> Information Technology Department
> CH-1211 Geneva 23
>