Re: [CODE4LIB] indexing word documents using solr [diacritics]

Karl Holten Thu, 12 Feb 2015 14:30:07 -0800

Ah, the wonderful world of character encoding...

To quote the Solr wiki:
There are no known bugs with Solr's character handling, but there have been 
some reported issues with the way different application servers (and different 
versions of the same application server) treat incoming and outgoing multibyte 
characters. In particular, people have reported better success with Tomcat than 
with Jetty... 
(https://wiki.apache.org/solr/FAQ#Why_don.27t_International_Characters_Work.3F )


I'd probably start by enabling UTF-8 in Tomcat/Jetty and see if that resolves 
the issue. 

If not, I'd check the original files to see what its character encoding is, and 
then check each application that handles the documents to make sure it's using 
that encoding. It might be that the original isn't in UTF-8, or if it is, that 
somewhere along the way the parser, the perl interface, or some other unknown 
culprit is attempting to change it.

Regards,
Karl Holten
Systems Integration Specialist
SWITCH Inc
414-382-6711

-----Original Message-----
From: Code for Libraries [mailto:[email protected]] On Behalf Of Eric 
Lease Morgan
Sent: Thursday, February 12, 2015 2:38 PM
To: [email protected]
Subject: Re: [CODE4LIB] indexing word documents using solr [diacritics]

How do I retain diacritics in a Solr index, and how to I search for words 
containing them?

I have extracted the plain text out of set of Word documents. I have then used 
a Perl interface (WebService::Solr) to add the plain text to a Solr index using 
a field type called text_general:

    <fieldType name="text_general" class="solr.TextField" 
positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" 
words="stopwords.txt" />
        <filter class="solr.LowerCaseFilterFactory" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" 
words="stopwords.txt" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" 
ignoreCase="true" expand="true" />
        <filter class="solr.LowerCaseFilterFactory" />
      </analyzer>
    </fieldType>

It seems as if I am unable to search for words like ejecución because the 
diacritic gets in the way. What am I doing wrong?

— 
Eric

Re: [CODE4LIB] indexing word documents using solr [diacritics]

Reply via email to