Lance Norskog wrote:

ISOLatin1AccentFilterFactory works quite well for us. It solves our basic
euro-text keyboard searching problem, where "protege" should find protégé.
("protégé" with two accents.)

-----Original Message-----
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: Tuesday, June 24, 2008 4:05 PM
To: solr-user@lucene.apache.org
Subject: Re: UnicodeNormalizationFilterFactory


: I've seen mention of these filters:
:
:  <filter class="schema.UnicodeNormalizationFilterFactory"/>
:  <filter class="schema.DiacriticsFilterFactory"/>

Are you asking because you saw these in Robert Haschart's reply to your
previous question?  I think those are custom Filters that he has in his
project ... not open source (but i may be wrong)

they are certainly not something that comes out of the box w/ Solr.


-Hoss
The ISOLatin1AccentFilter works well in the case described above by Lance Norskog, i.e., for words in which each accented character is a single precomposed Unicode character, as in protégé. However, in the data we work with, an accented character is often represented as a plain unaccented character followed by the Unicode combining character for the accent mark, roughly like prote'ge', and such strings emerge from the ISOLatin1AccentFilter unchanged.
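The difference between the two representations is easy to see with java.text.Normalizer (available from Java 6); a minimal sketch:

```java
import java.text.Normalizer;

public class NormalizationDemo {
    public static void main(String[] args) {
        // Composed form: a single code point U+00E9 for each é
        String composed = "prot\u00e9g\u00e9";
        // Decomposed form: plain 'e' followed by U+0301 COMBINING ACUTE ACCENT
        String decomposed = "prote\u0301ge\u0301";

        // The two strings render identically but are not equal code-point-for-code-point
        System.out.println(composed.equals(decomposed)); // false
        System.out.println(composed.length());           // 7
        System.out.println(decomposed.length());         // 9

        // NFC normalization maps the decomposed form onto the composed one
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(nfc.equals(composed));        // true
    }
}
```

Since ISOLatin1AccentFilter matches only the precomposed code points, the nine-character decomposed string passes through it untouched.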

After some research I found the UnicodeNormalizationFilter mentioned above, which did not work on my development system (because it relies on features only available in Java 6), and which, when combined with the DiacriticsFilter also mentioned above, would remove diacritics from characters but would also discard any Chinese or Russian characters, or anything else outside the 0x00-0x7F range. Which is bad.

I first modified the filter to normalize the characters to the composed normalized form (changing prote'ge' to protégé) and then pass the result through the ISOLatin1AccentFilter. However, for accented characters that have no composed normalized form (such as the n and s in Zarin̦š), the accents are not removed.
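The gap can be demonstrated with java.text.Normalizer: some base-plus-combining-mark pairs have a precomposed code point for NFC to fold them into, and some do not. A small sketch:

```java
import java.text.Normalizer;

public class NfcGap {
    public static void main(String[] args) {
        // e + U+0301 COMBINING ACUTE ACCENT has a precomposed form (U+00E9),
        // so NFC collapses the pair to a single code point...
        System.out.println(Normalizer.normalize("e\u0301", Normalizer.Form.NFC).length()); // 1

        // ...but n + U+0326 COMBINING COMMA BELOW has no precomposed form in
        // Unicode, so NFC leaves the pair intact, and a filter that only knows
        // precomposed accented characters never gets a chance to fold it
        System.out.println(Normalizer.normalize("n\u0326", Normalizer.Form.NFC).length()); // 2
    }
}
```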

So I took the approach of decomposing the accented characters and then removing only the valid diacritics and zero-width combining characters from the result, and the resulting filter works quite well. And since it was developed as part of the Blacklight project at the University of Virginia, it is open source under the Apache License.
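The decompose-then-strip approach can be sketched in a few lines with java.text.Normalizer and a regex over the combining diacritical marks block; this is an illustrative reimplementation of the idea, not the filter's actual code:

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class StripDiacritics {
    // Matches combining diacritical marks (U+0300-U+036F) exposed by decomposition
    private static final Pattern DIACRITICS =
        Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    public static String fold(String input) {
        // Decompose: é becomes e + U+0301, š becomes s + U+030C, etc.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Strip only the combining marks; Chinese, Russian, and other
        // non-Latin characters pass through untouched
        return DIACRITICS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(fold("prot\u00e9g\u00e9"));  // protege
        System.out.println(fold("Zarin\u0326\u0161"));  // Zarins
    }
}
```

Because only the marks are removed, the base letters survive even when no precomposed accent-free mapping exists.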

If anyone is interested in evaluating or using the UnicodeNormalizationFilter in conjunction with their Solr installation, get the UnicodeNormalizeFilter.jar from:

http://blacklight.rubyforge.org/svn/trunk/solr/lib/

and place it in a lib directory next to the conf directory in your Solr home directory.
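The filter can then be referenced from an analyzer chain in schema.xml; a hypothetical field type (the field type name and tokenizer choice here are illustrative, not prescribed by the jar):

```xml
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```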

Robert Haschart
