: It looks like a very promising approach for us. I'm going to implement
: a custom Tokeniser based on your suggestions and see how it goes. Thank
: you all for your comments!

you don't really need a custom tokenizer -- just a buffered TokenFilter that clones the original token if it contains accent chars, mutates the clone, and then emits it next with a positionIncrement of 0.

i'm kind of surprised ISOLatin1AccentFilter doesn't have an option to do this already -- it would certainly be a worthy patch to commit if someone wants to submit it back to lucene-java.

: > don't match the accents exactly they won't get any hits: e.g. if a word
: > contains two accented characters and the user only accents one of them in
: > their query, they won't match the accented or the unaccented version.

this could be accounted for by generating all of the permutations of unaccented characters when indexing. that alone wouldn't solve the problem of a source term containing only one accent and the user querying with only one accent, but on a different character. you could work around this by putting all of the permutations in at index time, and then querying on both the exact term and the no-accent term at query time.

-Hoss
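
For illustration, here is a minimal sketch of the permutation idea in plain Java, without any Lucene classes. The class and method names (`AccentVariants`, `variants`, `queryTerms`) are hypothetical, and accent stripping is done with `java.text.Normalizer` as a stand-in: it is an assumption, not what ISOLatin1AccentFilter does, and it will not handle mappings such as ligatures. In a real TokenFilter, each extra variant would be emitted at the same position as the original token, i.e. with a positionIncrement of 0.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
final class AccentVariants {

    // Strip combining marks from one char (e-acute becomes plain e).
    // Assumption: NFD decomposition is an acceptable stand-in for accent folding.
    static String stripAccent(char c) {
        String decomposed = Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    // Index time: every combination of "keep the accent" / "drop the accent"
    // over the accented characters of the term, original term first.
    static Set<String> variants(String term) {
        List<Integer> accented = new ArrayList<>();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (!stripAccent(c).equals(String.valueOf(c))) {
                accented.add(i);
            }
        }
        int n = accented.size();
        if (n > 16) {
            // 2^n variants; refuse to explode the index on pathological input.
            throw new IllegalArgumentException("too many accented characters: " + n);
        }
        Set<String> out = new LinkedHashSet<>();
        for (int mask = 0; mask < (1 << n); mask++) {
            StringBuilder sb = new StringBuilder(term);
            // Walk right to left so a length-changing replacement
            // does not shift the positions still to be processed.
            for (int b = n - 1; b >= 0; b--) {
                if ((mask & (1 << b)) != 0) {
                    int pos = accented.get(b);
                    sb.replace(pos, pos + 1, stripAccent(term.charAt(pos)));
                }
            }
            out.add(sb.toString());
        }
        return out;
    }

    // Query time: the exact term plus the fully unaccented term.
    static List<String> queryTerms(String term) {
        String stripped = Normalizer.normalize(term, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "");
        List<String> out = new ArrayList<>();
        out.add(term);
        if (!stripped.equals(term)) {
            out.add(stripped);
        }
        return out;
    }
}
```

With this, a source term "éte" is indexed as "éte" and "ete". A query for "eté" (accent on the other character) expands to "eté" and "ete", and the second of those matches. The cost is up to 2^n indexed variants for a term with n accented characters.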