Re: HTML decoder is splitting tokens

Koji Sekiguchi Wed, 26 Aug 2009 08:07:37 -0700

Hi Anders,

Sorry, I don't know whether this is a bug or a feature, but
I'd like to show an alternative approach, if you're interested.


In Solr trunk, HTMLStripWhitespaceTokenizerFactory is
marked as deprecated. Using HTMLStripCharFilterFactory together
with an arbitrary TokenizerFactory is encouraged instead.
I'd also recommend using MappingCharFilterFactory
to convert character references to real characters.
That is, you would have:

<fieldType name="textHtml" class="solr.TextField" >
 <analyzer>
   <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
   <charFilter class="solr.HTMLStripCharFilterFactory"/>
   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
 </analyzer>
</fieldType>

where mapping.txt contains:

"&uuml;" => "ü"
"&auml;" => "ä"
"&iuml;" => "ï"
"&euml;" => "ë"
"&ouml;" => "ö"
   :             :

Then run analysis.jsp and see the result.
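
To illustrate why this ordering should keep "Günther" in one piece, here is a
rough Python sketch of the chain above. It is not Solr code: the MAPPING
dictionary and the map_chars, strip_html and analyze functions are hypothetical
stand-ins for mapping.txt, the two char filters and the whitespace tokenizer,
and the tag stripping is far cruder than what HTMLStripCharFilterFactory does.

```python
import re

# Hypothetical stand-in for mapping.txt: character reference -> real character.
MAPPING = {
    "&uuml;": "ü",
    "&auml;": "ä",
    "&iuml;": "ï",
    "&euml;": "ë",
    "&ouml;": "ö",
}


def map_chars(text):
    """Replace each mapped character reference, as the mapping file requests."""
    for source, target in MAPPING.items():
        text = text.replace(source, target)
    return text


def strip_html(text):
    """Crude tag removal (an assumption for this sketch, not Solr's algorithm)."""
    return re.sub(r"<[^>]+>", " ", text)


def analyze(text):
    """Run the char filters first, then split on whitespace."""
    return strip_html(map_chars(text)).split()


print(analyze("G&uuml;nther"))                      # ['Günther']
print(analyze("<b>G&uuml;nther</b> M&ouml;ller"))   # ['Günther', 'Möller']
```

Because the character references are replaced before any tokenizing happens,
the tokenizer only ever sees "Günther" and has nothing to split on.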

Thank you,

Koji


Anders Melchiorsen wrote:

Hi.

When indexing the string "G&uuml;nther" with
HTMLStripWhitespaceTokenizerFactory (in analysis.jsp), I get two tokens,
"Gü" and "nther".

Is this a bug, or am I doing something wrong?

(Using a Solr nightly from 2009-05-29)


Anders.
