RE: Indexing in Solr: invalid UTF-8

Markus Jelsma Tue, 25 Sep 2012 10:21:28 -0700
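Hi - you need to strip all noncharacter code points (such as the U+FFFF in your second exception) from the text before sending it to Solr. The full set is listed here: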
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]
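
A minimal sketch of such a filter, in case it helps. It is written for Python 3, not the Python 2.4 used in the original script, and the function names are made up for illustration. It drops the noncharacters (U+FDD0..U+FDEF and the last two code points of every plane) and also any surrogate code points (U+D800..U+DFFF), since the first exception complains about one. On a narrow Python 2 build, where valid characters outside the BMP are stored as surrogate pairs, the surrogate check would need to be adapted.

```python
def is_unwanted(cp):
    """Return True for surrogate and noncharacter code points."""
    # Surrogates: a lone surrogate cannot be encoded as valid UTF-8.
    if 0xD800 <= cp <= 0xDFFF:
        return True
    # Noncharacter block U+FDD0..U+FDEF.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # Last two code points of every plane: U+FFFE, U+FFFF, U+1FFFE, ...
    return (cp & 0xFFFE) == 0xFFFE


def strip_invalid_code_points(utext):
    """Return utext without surrogates and noncharacters (hypothetical helper)."""
    return "".join(ch for ch in utext if not is_unwanted(ord(ch)))


if __name__ == "__main__":
    sample = "a\uffffb"
    print(repr(strip_invalid_code_points(sample)))
```

In the script quoted below, a call like this would presumably go between the unicode(text, 'utf-8') step and SOLR_CONNECTION.add(...).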
 
 
-----Original message-----
> From:Patrick Oliver Glauner <patrick.oliver.glau...@cern.ch>
> Sent: Tue 25-Sep-2012 18:47
> To: solr-user@lucene.apache.org
> Subject: Indexing in Solr: invalid UTF-8
> 
> Hello
> 
> We use Solr 3.1 and Jetty to index previously extracted fulltexts from PDFs, 
> DOC etc. Our indexing script is written in Python 2.4 using solrpy:
> 
> [...]
> text = remove_control_characters(text) # except \r, \t, \n
> utext = unicode(text, 'utf-8')
> SOLR_CONNECTION.add(id=recid, fulltext=utext)
> [...]
> 
> But for some fulltexts we still get exceptions like:
> 
> * [was class java.io.CharConversionException] Invalid UTF-8 character 
> 0xd835(a surrogate character) at char #1144, byte #127)
> * [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff 
> at char #1427640, byte #1564649)
> * ....
> 
> Why do these exceptions still occur? How can I avoid them? I 
> hoped that utext = unicode(text, 'utf-8') was enough.
> 
> Thanks
> Patrick
> 
> 
> FYI, the fulltext field definition is:
> 
> <field name="fulltext" type="invenioText" indexed="true" stored="false" 
> multiValued="true"/>
> 
> where
> 
>     <fieldType name="invenioText" class="solr.TextField" 
> positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>           <filter class="solr.WordDelimiterFilterFactory"
>                 generateWordParts="1"
>                 generateNumberParts="1"
>                 catenateWords="0"
>                 catenateNumbers="0"
>                 catenateAll="1"
>                 preserveOriginal="1"
>                 splitOnNumerics="1"
>                 splitOnCaseChange="1"
>                 stemEnglishPossessive="1"
>                 />
>           <filter class="solr.LowerCaseFilterFactory"/>
>           <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>           <filter class="solr.EnglishPorterFilterFactory"/>
>           <filter class="solr.LengthFilterFactory" min="2" max="99"/>
> 
>       </analyzer>
>       [...]
>     </fieldType>
> 
> 
> --
> Patrick GLAUNER [patrick.oliver.glau...@cern.ch]
> 
> CERN
> Information Technology Department
> CH-1211 Geneva 23
>