I didn't notice any code in Nutch that replaces accents
when I wrote mine a few months ago. I think there are
language filters that do things like stemming, but I'm not sure
about removing accents. Whether to remove accents is a judgment
call. It makes a lot of sense for English-language search engines,
where the occasional non-English word shows up. I'm not sure a
site in a language with lots of accents would want this by
default. And how should you handle the case where someone really
does search for the accented word? If you just strip accents and
return all unaccented matches, you might get a lot of false hits.
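
To make the false-hit worry concrete, here is a rough Java sketch
(not the code I posted). It assumes java.text.Normalizer is
available, and the class FoldedMatchDemo and its methods are names
made up for illustration. It folds accents off both the query and
the indexed terms and returns every term that collides:

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; the names are illustrative only.
final class FoldedMatchDemo {
    private FoldedMatchDemo() {}

    // Decompose to NFD, then drop combining marks (U+0300..U+036F).
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Return every indexed term whose folded form equals the folded query.
    static List<String> foldedMatches(String query, List<String> terms) {
        String key = fold(query);
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (fold(term).equals(key)) {
                hits.add(term);
            }
        }
        return hits;
    }
}
```

With the French terms pêche (peach), péché (sin) and peche in the
index, a query for péché comes back with all three, even though
only one of them is the word the user typed.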

You could package the code as an Analyzer if you like. It might
be nice to have a util class, and then an analyzer that uses it.
By the way, I'm guessing the code I wrote only makes sense for
Western character sets. It will probably mangle text in other
languages.
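
A util class along those lines could be as small as the sketch
below. This is not AccentReplacer.java itself: AccentStripper and
strip are hypothetical names, and it leans on java.text.Normalizer
from the JDK rather than a hand-built replacement table. It
decomposes each character and drops combining marks in the U+0300
to U+036F range, so it only helps for Latin-based text, and letters
with no canonical decomposition (ø, ß) pass through unchanged.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration; not the class from this thread.
final class AccentStripper {
    private AccentStripper() {}

    // Returns the input with combining diacritical marks removed,
    // or null if the input is null.
    static String strip(String input) {
        if (input == null) {
            return null;
        }
        // NFD splits e.g. U+00E9 into 'e' followed by U+0301.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Remove the combining marks left behind by the decomposition.
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }
}
```

An analyzer would then call strip on each token's text. That
wiring depends on the Lucene version in use, so it is left out
here.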

Howie

Thank you, but shouldn't this be a part of the "analyzer"?
Lucene has analyzers that do this by default, why not Nutch?
Thanks,
Frank.

On 2/20/06, Howie Wang <[EMAIL PROTECTED]> wrote:
> I threw this code together a while ago and it seems to work for me.
> The performance could probably be improved, but
> if anyone wants, they're free to check it in. It goes under
> src/java/org/apache/nutch/util/AccentReplacer.java.
>
> Howie
>
>


