Lance Norskog wrote:

ISOLatin1AccentFilterFactory works quite well for us. It solves our basic
euro-text keyboard searching problem, where "protege" should find protégé.
("protégé" with two accents.)

-----Original Message-----
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: Tuesday, June 24, 2008 4:05 PM
To: solr-user@lucene.apache.org
Subject: Re: UnicodeNormalizationFilterFactory


: I've seen mention of these filters:
:
:  <filter class="schema.UnicodeNormalizationFilterFactory"/>
:  <filter class="schema.DiacriticsFilterFactory"/>

Are you asking because you saw these in Robert Haschart's reply to your
previous question?  I think those are custom Filters that he has in his
project ... not open source (but i may be wrong)

they are certainly not something that comes out of the box w/ Solr.


-Hoss
The ISOLatin1AccentFilter works well in the case described above by Lance Norskog, i.e., for words in which each accented character is a single precomposed Unicode character, as in protégé. However, in the data we work with, an accented character is often represented as a plain unaccented character followed by the Unicode combining character for the accent mark, roughly like prote'ge', and such strings emerge from the ISOLatin1AccentFilter unchanged.
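The difference between the two representations is easy to see with java.text.Normalizer (available from Java 6); a minimal sketch:

```java
import java.text.Normalizer;

public class NormalizationDemo {
    public static void main(String[] args) {
        // Composed form: a single code point U+00E9 for each é
        String composed = "prot\u00e9g\u00e9";
        // Decomposed form: plain 'e' followed by U+0301 COMBINING ACUTE ACCENT
        String decomposed = "prote\u0301ge\u0301";

        // The two strings render identically but are not equal code-point-for-code-point
        System.out.println(composed.equals(decomposed)); // false
        System.out.println(composed.length());           // 7
        System.out.println(decomposed.length());         // 9

        // NFC normalization maps the decomposed form onto the composed one
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(nfc.equals(composed));        // true
    }
}
```

Since ISOLatin1AccentFilter matches only the precomposed code points, the nine-character decomposed string passes through it untouched.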

After some research I found the UnicodeNormalizationFilter mentioned above, which did not work on my development system (because it relies on features only available in Java 6), and which, when combined with the DiacriticsFilter also mentioned above, would remove diacritics from characters but would also discard any Chinese or Russian characters, or anything else outside the 0x00-0x7F range. Which is bad.

I first modified the filter to normalize the characters to the composed normalized form (changing prote'ge' to protégé) and then pass the result through the ISOLatin1AccentFilter. However, for accented characters that have no composed normalized form (such as the n and s in Zarin̦š), the accents are not removed.
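The gap can be demonstrated with java.text.Normalizer: some base-plus-combining-mark pairs have a precomposed code point for NFC to fold them into, and some do not. A small sketch:

```java
import java.text.Normalizer;

public class NfcGap {
    public static void main(String[] args) {
        // e + U+0301 COMBINING ACUTE ACCENT has a precomposed form (U+00E9),
        // so NFC collapses the pair to a single code point...
        System.out.println(Normalizer.normalize("e\u0301", Normalizer.Form.NFC).length()); // 1

        // ...but n + U+0326 COMBINING COMMA BELOW has no precomposed form in
        // Unicode, so NFC leaves the pair intact, and a filter that only knows
        // precomposed accented characters never gets a chance to fold it
        System.out.println(Normalizer.normalize("n\u0326", Normalizer.Form.NFC).length()); // 2
    }
}
```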

So I took the approach of decomposing the accented characters and then removing only the valid diacritics and zero-width combining characters from the result, and the resulting filter works quite well. And since it was developed as part of the Blacklight project at the University of Virginia, it is open source under the Apache License.
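The decompose-then-strip approach can be sketched in a few lines with java.text.Normalizer and a regex over the combining diacritical marks block; this is an illustrative reimplementation of the idea, not the filter's actual code:

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class StripDiacritics {
    // Matches combining diacritical marks (U+0300-U+036F) exposed by decomposition
    private static final Pattern DIACRITICS =
        Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    public static String fold(String input) {
        // Decompose: é becomes e + U+0301, š becomes s + U+030C, etc.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Strip only the combining marks; Chinese, Russian, and other
        // non-Latin characters pass through untouched
        return DIACRITICS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(fold("prot\u00e9g\u00e9"));  // protege
        System.out.println(fold("Zarin\u0326\u0161"));  // Zarins
    }
}
```

Because only the marks are removed, the base letters survive even when no precomposed accent-free mapping exists.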

If anyone is interested in evaluating or using the UnicodeNormalizationFilter in conjunction with their Solr installation, get the UnicodeNormalizeFilter.jar from:

http://blacklight.rubyforge.org/svn/trunk/solr/lib/

and place it in a lib directory next to the conf directory in your Solr home directory.
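The filter can then be referenced from an analyzer chain in schema.xml; a hypothetical field type (the field type name and tokenizer choice here are illustrative, not prescribed by the jar):

```xml
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```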

Robert Haschart
