Hello

I indexed an HTML document that uses decimal HTML entity encodings: the 
character é (e with an acute accent) is encoded as &#233;. The exact content 
of the document is:

<html><body>&#231;a va m&#233;m&#233; ?</body></html>

A search for 'mémé' returns no documents. If I paste the line above into the 
Solr admin's analysis.jsp, it doesn't match mémé there either. There is only a 
match if I replace &#233; with é.
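
To make the expected result concrete, here is a minimal sketch outside Solr, 
using only Python's standard library. It is not Solr's implementation: 
html.unescape and str.split merely stand in, as an assumption, for what I 
would expect the HTML-stripping char filter and the whitespace tokenizer to 
produce together.

```python
import html

# Body text of the indexed document, copied from above (decimal entities).
raw = "&#231;a va m&#233;m&#233; ?"

# Decode the numeric character references into plain characters.
decoded = html.unescape(raw)

# Split on whitespace, as a rough stand-in for a whitespace tokenizer.
tokens = decoded.split()

print(decoded)
print(tokens)
```

With the entities decoded, 'mémé' appears as a token of its own, which is what 
I would expect the search to match.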

This is how I configured the fieldType:

<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

I tried avoiding the problem by using the MappingCharFilterFactory:

<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

I put the file mapping.txt in the conf directory. It contains just this:

"&#233;" => "é"

This doesn't work either. How can I get this to work?
(I am using Solr 1.4.0.)

Thank you
Andréas Kündig
