Indexing in Solr: invalid UTF-8

Patrick Oliver Glauner Tue, 25 Sep 2012 09:43:42 -0700

Hello

We use Solr 3.1 with Jetty to index fulltexts previously extracted from PDF,
DOC, and other files. Our indexing script is written in Python 2.4 using solrpy:


[...]
text = remove_control_characters(text) # except \r, \t, \n
utext = unicode(text, 'utf-8')
SOLR_CONNECTION.add(id=recid, fulltext=utext)
[...]

But for some fulltexts we still get exceptions like:

* [was class java.io.CharConversionException] Invalid UTF-8 character 0xd835(a 
surrogate character) at char #1144, byte #127)
* [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at 
char #1427640, byte #1564649)
* ....

Why do these exceptions still occur, and how can I avoid them? I had hoped
that utext = unicode(text, 'utf-8') would be enough.
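
One possible explanation, offered as an assumption rather than a confirmed
diagnosis: both reported values (0xd835, a surrogate, and 0xffff) are code
points that XML 1.0 does not allow in character data, and decoding with
unicode(text, 'utf-8') does not remove them. The sketch below shows the kind of
extra filtering that might help. The helper name strip_xml_illegal is
hypothetical, and the code is written for Python 3 for illustration; it would
need adapting for Python 2.4 (u'' literals, and care with narrow Unicode
builds).

```python
import re

# Matches every code point that XML 1.0 does NOT allow in character data:
# control characters other than \t, \n, \r; surrogates U+D800-U+DFFF;
# and the noncharacters U+FFFE and U+FFFF.
_XML_ILLEGAL = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_xml_illegal(utext, replacement=""):
    """Return utext with XML-1.0-illegal code points replaced.

    Hypothetical helper for illustration; not part of solrpy or Solr.
    """
    return _XML_ILLEGAL.sub(replacement, utext)
```

In the script above, such a filter would be applied to utext after decoding and
before the call to SOLR_CONNECTION.add.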

Thanks
Patrick


FYI, the fulltext field definition is:

    <field name="fulltext" type="invenioText" indexed="true" stored="false"
           multiValued="true"/>

where

    <fieldType name="invenioText" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1"
                generateNumberParts="1"
                catenateWords="0"
                catenateNumbers="0"
                catenateAll="1"
                preserveOriginal="1"
                splitOnNumerics="1"
                splitOnCaseChange="1"
                stemEnglishPossessive="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"/>
        <filter class="solr.LengthFilterFactory" min="2" max="99"/>
      </analyzer>
      [...]
    </fieldType>


--
Patrick GLAUNER

CERN
Information Technology Department
CH-1211 Geneva 23
