Hi Robert,

Could you create a JIRA issue and attach your code to it?  That makes it easier
for people to evaluate (rather than just a binary distribution).

This sounds general enough to me that it would be a useful addition to Lucene 
itself.  Solr's factory could just be sugar on top then.

Thanks,
Steve

On 06/26/2008 at 4:41 PM, Robert Haschart wrote:
> Lance Norskog wrote:
> 
> > ISOLatin1AccentFilterFactory works quite well for us. It solves our
> > basic euro-text keyboard searching problem, where "protege" should find
> > protégé. ("protégé" with two accents.)
> > 
> > -----Original Message-----
> > From: Chris Hostetter [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, June 24, 2008 4:05 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: UnicodeNormalizationFilterFactory
> > 
> > 
> > > I've seen mention of these filters:
> > > 
> > >  <filter class="schema.UnicodeNormalizationFilterFactory"/>
> > >  <filter class="schema.DiacriticsFilterFactory"/>
> > 
> > Are you asking because you saw these in Robert Haschart's reply to your
> > previous question?  I think those are custom Filters that he has in his
> > project ... not open source (but i may be wrong)
> > 
> > they are certainly not something that comes out of the box w/ Solr.
> > 
> > 
> > -Hoss
> > 
> > 
> The ISOLatin1AccentFilter works well in the case described above by
> Lance Norskog, i.e. for words where the accented character is a single
> Unicode code point combining the letter and the accent mark, as in
> protégé. However, in the data we work with, accented characters are
> often represented by a plain unaccented character followed by the
> Unicode combining character for the accent mark, roughly like this:
> prote'ge'. Such words emerge from the ISOLatin1AccentFilter unchanged.
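The composed-form normalization discussed in this thread can be sketched with java.text.Normalizer (the Java 6 API the filter presumably relies on). This is an illustration of NFC normalization in general, not the actual filter code:

```java
import java.text.Normalizer;

public class ComposeDemo {
    public static void main(String[] args) {
        // Decomposed form: plain letters followed by U+0301 (COMBINING ACUTE ACCENT)
        String decomposed = "prote\u0301ge\u0301";
        // NFC composes each base letter + combining mark into a single code point
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(composed);          // protégé
        System.out.println(composed.length()); // 7 (the decomposed form was 9)
    }
}
```

After NFC, the text contains the single-code-point accented characters that a filter like ISOLatin1AccentFilter expects.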
> 
> After some research I found the UnicodeNormalizationFilter mentioned
> above, which did not work on my development system (because it relies
> on features only available in Java 6). When combined with the
> DiacriticsFilter also mentioned above, it would remove diacritics from
> characters, but it would also discard any Chinese or Russian
> characters, or anything else outside the 0x0--0x7f range.
> Which is bad.
> 
> I first modified the filter to normalize the characters to the
> composed normalized form (changing prote'ge' to protégé) and then pass
> the results through the ISOLatin1AccentFilter. However, for accented
> characters that have no composed normalized form (such as the n and s
> in Zarin̦š) the accents are not removed.
> 
> So I took the approach of decomposing the accented characters and then
> removing only the valid diacritics and zero-width combining characters
> from the result, and the resulting filter works quite well. And since
> it was developed as part of the Blacklight project at the University
> of Virginia, it is open source under the Apache License.
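The decompose-then-strip approach described above can be sketched with java.text.Normalizer and a combining-mark regex. This is an illustration of the general technique, not the actual UnicodeNormalizationFilter source:

```java
import java.text.Normalizer;

public class StripDiacritics {
    // Decompose to NFD so every accent becomes a separate combining mark,
    // then drop only the combining marks (Unicode category Mn). Letters
    // outside the ASCII range that carry no marks are left untouched.
    static String strip(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{Mn}+", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("protégé")); // protege
        System.out.println(strip("Москва"));  // Москва (unchanged)
        System.out.println(strip("中文"));    // 中文 (unchanged)
    }
}
```

Because only category-Mn code points are removed, Chinese, Russian, and other non-Latin text passes through intact, avoiding the data loss seen with the DiacriticsFilter.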
> 
> If anyone is interested in evaluating or using the
> UnicodeNormalizationFilter in conjunction with their Solr
> installation, get the UnicodeNormalizeFilter.jar from:
> 
> http://blacklight.rubyforge.org/svn/trunk/solr/lib/
> 
> and place it in a lib directory next to the conf directory in your
> Solr home directory.
> 
> Robert Haschart
