On Tue, 10 Dec 2002, stephane vaucher wrote: > I wish to implement a TokenFilter that will remove accentuated > characters so for example 'é' will become 'e'. As I would rather not > reinvent the wheel, I've tried to find something on the web and on the > mailing lists. I saw a mention of a contrib that could do this (see > http://www.mail-archive.com/lucene-user%40jakarta.apache.org/msg02146.html), > but I don't see anything applicable.
It may depend on what kind of encoding you're working with. (E.g., HTML documents represent such characters in a different way than that of Postscript documents.) Probably the easiest way to handle this, if you want to avoid such questions, would be to convert all your input documents (and query text) to Java (Unicode) strings, and then do a search-and-replace with the appropriate character-pair arguments. (After this is done, you would then do whatever Lucene processing (indexing, query parsing, etc.) was appropriate. I am not aware of any code that does this, but it should be straightforward. Good luck-- Joshua O'Madadhain [EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall It's that moment of dawning comprehension that I live for--Bill Watterson My opinions are too rational and insightful to be those of any organization. -- To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>