Re: Accentuated characters

Joshua O'Madadhain Tue, 10 Dec 2002 12:52:16 -0800

On Tue, 10 Dec 2002, stephane vaucher wrote:

> I wish to implement a TokenFilter that will remove accentuated
> characters so for example 'é' will become 'e'. As I would rather not
> reinvent the wheel, I've tried to find something on the web and on the
> mailing lists. I saw a mention of a contrib that could do this (see
> http://www.mail-archive.com/lucene-user%40jakarta.apache.org/msg02146.html),
> but I don't see anything applicable.


It may depend on what kind of encoding you're working with.  (E.g., HTML
documents represent such characters in a different way than that of
Postscript documents.)  Probably the easiest way to handle this, if you
want to avoid such questions, would be to convert all your input documents
(and query text) to Java (Unicode) strings, and then do a
search-and-replace with the appropriate character-pair arguments.  (After
this is done, you would then do whatever Lucene processing (indexing,
query parsing, etc.) was appropriate.  I am not aware of any code that
does this, but it should be straightforward.
 
Good luck--

Joshua O'Madadhain

 [EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden
  Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
 It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.




--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>

Re: Accentuated characters

Reply via email to