I didn't notice any code that replaces accents when I wrote my stuff a few months ago. I think there are language filters that do things like stemming, but I'm not sure any of them remove accents. Removing accents is kind of a judgment call. It makes a lot of sense for an English-language search engine whose index may contain the occasional non-English word. I'm not sure a site in a language with lots of accents would want this by default. Maybe, maybe not? And how do you handle the case where someone actually searches for the accented word? If you just remove accents and return all the unaccented matches, you might get a lot of false hits.
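To make the false-hit concern concrete, here is a minimal sketch of accent folding. It is not the AccentReplacer code discussed in this thread: the class name AccentFolder and the method fold are hypothetical, and it assumes java.text.Normalizer (Java 6 or later) is available.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; not the AccentReplacer from this thread.
final class AccentFolder {
    private AccentFolder() {}

    // Decompose to NFD so that an accented letter becomes base letter + combining mark,
    // then drop the combining diacritical marks (U+0300 to U+036F).
    static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // Both the accented and the plain spelling fold to the same string,
        // so a folded index can no longer tell them apart.
        System.out.println(fold("r\u00e9sum\u00e9")); // prints "resume"
        System.out.println(fold("resume"));           // prints "resume"
    }
}
```

Because both spellings collapse to one term, a query for the accented word also matches every unaccented occurrence. Letters that have no canonical decomposition, such as "ø" or "ß", pass through this sketch unchanged.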
You could package the code as an Analyzer if you like. It might be nice to have a util class, and then an analyzer that uses it. By the way, I'm guessing that the code I wrote only makes sense for Western charsets. It will probably mess up other languages a lot.

Howie
Thank you, but shouldn't this be a part of the "analyzer"? Lucene has analyzers that do this by default, why not Nutch?

Thanks,
Frank

On 2/20/06, Howie Wang <[EMAIL PROTECTED]> wrote:
> I threw this code together a while ago and it seems to work for me.
> The performance could probably be improved, but
> if anyone wants, they're free to check it in. It goes under
> src/java/org/apache/nutch/util/AccentReplacer.java.
>
> Howie
