Right now Lucene has an accent filter (ISOLatin1AccentFilter) that removes accents from ISO-8859-1 text. What about a UTF8AccentFilter? Is there a plan to add such a filter? It would be very useful, as ISOLatin1AccentFilter can't remove some of the more complex accents found in languages encoded in UTF-8 (I would paste examples, but I'm not sure they would display correctly). I think I saw a post on this mailing list long ago about something like that, but it was never officially released.

See:

2001, first post about UTF-8 accents: http://www.gossamer-threads.com/lists/lucene/java-user/648?search_string=accent;#648
2004, a good solution, but still incomplete: http://www.gossamer-threads.com/lists/lucene/java-user/10792?search_string=accent;#10792
2006, best attempt yet, but sadly undelivered: http://www.gossamer-threads.com/lists/lucene/java-user/32142?search_string=accent;#32142

I think Lucene would benefit from a complete UTF-8 accent remover. Right now the best solution I have is to process everything in PHP before indexing and again at query time, and it's a little slow.
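
To make the request concrete, here is a rough sketch of the kind of folding I have in mind, done on the Java side instead of in PHP. It assumes java.text.Normalizer (Java 6 and later) is available; the class and method names (AccentStripper, stripAccents) are made up for illustration and are not part of Lucene. It decomposes each character into a base letter plus combining marks, then drops the marks.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; not an existing Lucene class.
public class AccentStripper {

    // Returns the input with combining accent marks removed.
    public static String stripAccents(String input) {
        // NFD splits each accented character into base letter + combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches the combining marks, which are then dropped.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Escaped code points are used so the example survives any mail encoding.
        System.out.println(stripAccents("Cr\u00e8me br\u00fbl\u00e9e"));
    }
}
```

The same function would have to be applied both to the text being indexed and to the query string. It is probably not a complete solution either: letters such as U+00F8 (o with stroke) have no canonical decomposition, so they pass through unchanged and would still need an explicit mapping table.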

Thanks,

--
Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212

