Hello
We use Solr 3.1 with Jetty to index full texts that were previously extracted
from PDF, DOC and similar files. Our indexing script is written in Python 2.4
and uses solrpy:
[...]
text = remove_control_characters(text) # except \r, \t, \n
utext = unicode(text, 'utf-8')
SOLR_CONNECTION.add(id=recid, fulltext=utext)
[...]
But for some fulltexts we still get exceptions like:
* [was class java.io.CharConversionException] Invalid UTF-8 character 0xd835(a
surrogate character) at char #1144, byte #127)
* [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at
char #1427640, byte #1564649)
* ....
Why do these exceptions still occur, and how can I avoid them? I had hoped
that utext = unicode(text, 'utf-8') would be enough.
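One thing we could try, sketched here under the assumption that the rejected
code points are exactly those outside the XML 1.0 Char production (#x9, #xA,
#xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF), is to filter them out
before calling add(). The helper name strip_invalid_xml_chars is hypothetical,
and the sketch is written for Python 3; on a narrow Python 2 build, characters
above U+FFFF are stored as surrogate pairs and would probably need separate
handling.

```python
import re

# Matches every code point that is NOT allowed by the XML 1.0 "Char" production.
# Assumption: these are the characters the server-side parser rejects.
_INVALID_XML_CHARS = re.compile(
    u'[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)


def strip_invalid_xml_chars(text):
    """Remove lone surrogates, U+FFFE/U+FFFF and disallowed control characters."""
    return _INVALID_XML_CHARS.sub(u'', text)
```

It would be applied to utext just before SOLR_CONNECTION.add(...). Valid
characters outside the BMP are kept; only lone surrogates, U+FFFE/U+FFFF and
control characters other than \t, \n, \r are dropped.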
Thanks
Patrick
FYI, the fulltext field definition is:
<field name="fulltext" type="invenioText" indexed="true" stored="false"
multiValued="true"/>
where
<fieldType name="invenioText" class="solr.TextField"
positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1"
generateNumberParts="1"
catenateWords="0"
catenateNumbers="0"
catenateAll="1"
preserveOriginal="1"
splitOnNumerics="1"
splitOnCaseChange="1"
stemEnglishPossessive="1"
/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"/>
<filter class="solr.LengthFilterFactory" min="2" max="99"/>
</analyzer>
[...]
</fieldType>
--
Patrick GLAUNER
CERN
Information Technology Department
CH-1211 Geneva 23