RE: Indexing in Solr: invalid UTF-8

Patrick Oliver Glauner Tue, 25 Sep 2012 11:02:44 -0700

Hi
Thanks. But I see that 0xd835 is missing in this list (see my exceptions).


What's the best way to get rid of all of them in Python? I am new to Unicode in
Python, but I am sure that this use case is quite frequent.

Patrick

________________________________________
From: Markus Jelsma [markus.jel...@openindex.io]
Sent: Tuesday, September 25, 2012 7:24 PM
To: solr-user@lucene.apache.org; Patrick Oliver Glauner
Subject: RE: Indexing in Solr: invalid UTF-8

Hi - you need to get rid of all non-character code points.
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]
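
A minimal sketch of that filtering in Python, offered as an illustration rather than a tested fix for this setup. It drops the noncharacter code points (U+FDD0..U+FDEF plus the last two code points of every plane, such as U+FFFF) and, as an addition not covered by the list above, lone surrogates (U+D800..U+DFFF) such as the reported 0xd835. The names is_indexable and strip_invalid_code_points are hypothetical. The code assumes each element of the string is one full code point, as in Python 3; on a narrow Python 2 build, characters above U+FFFF are stored as surrogate pairs and would be dropped as well.

```python
def is_indexable(ch):
    """Return False for code points that should not be sent to the indexer."""
    cp = ord(ch)
    # Surrogates are only meaningful as UTF-16 pairs, never on their own.
    if 0xD800 <= cp <= 0xDFFF:
        return False
    # The contiguous block of 32 noncharacters.
    if 0xFDD0 <= cp <= 0xFDEF:
        return False
    # U+nFFFE and U+nFFFF in every plane (U+FFFE, U+FFFF, U+1FFFE, ...).
    if (cp & 0xFFFE) == 0xFFFE:
        return False
    return True


def strip_invalid_code_points(utext):
    """Return a copy of a unicode string without surrogates and noncharacters."""
    return u''.join([ch for ch in utext if is_indexable(ch)])
```

In a script like the one quoted below, such a helper would presumably be applied to utext after the unicode(text, 'utf-8') call and before the value is passed to SOLR_CONNECTION.add.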


-----Original message-----
> From:Patrick Oliver Glauner <patrick.oliver.glau...@cern.ch>
> Sent: Tue 25-Sep-2012 18:47
> To: solr-user@lucene.apache.org
> Subject: Indexing in Solr: invalid UTF-8
>
> Hello
>
> We use Solr 3.1 and Jetty to index previously extracted fulltexts from PDFs, 
> DOC etc. Our indexing script is written in Python 2.4 using solrpy:
>
> [...]
> text = remove_control_characters(text) # except \r, \t, \n
> utext = unicode(text, 'utf-8')
> SOLR_CONNECTION.add(id=recid, fulltext=utext)
> [...]
>
> But for some fulltexts we still get exceptions like:
>
> * [was class java.io.CharConversionException] Invalid UTF-8 character 
> 0xd835(a surrogate character) at char #1144, byte #127)
> * [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff 
> at char #1427640, byte #1564649)
> * ....
>
> Why do these exceptions still occur? How can I avoid them? I
> hoped that utext = unicode(text, 'utf-8') would be enough.
>
> Thanks
> Patrick
>
>
> FYI, the fulltext field definition is:
>
> <field name="fulltext" type="invenioText" indexed="true" stored="false" 
> multiValued="true"/>
>
> where
>
>     <fieldType name="invenioText" class="solr.TextField" 
> positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>           <filter class="solr.WordDelimiterFilterFactory"
>                 generateWordParts="1"
>                 generateNumberParts="1"
>                 catenateWords="0"
>                 catenateNumbers="0"
>                 catenateAll="1"
>                 preserveOriginal="1"
>                 splitOnNumerics ="1"
>                 splitOnCaseChange ="1"
>                 stemEnglishPossessive="1"
>                 />
>           <filter class="solr.LowerCaseFilterFactory"/>
>           <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>           <filter class="solr.EnglishPorterFilterFactory"/>
>           <filter class="solr.LengthFilterFactory" min="2" max="99"/>
>
>       </analyzer>
>       [...]
>     </fieldType>
>
>
> --
> Patrick GLAUNER [patrick.oliver.glau...@cern.ch]
>
> CERN
> Information Technology Department
> CH-1211 Geneva 23
>
