Hello all,

I am using Lucene to index both English and French documents and have run into some problems with the analysis of the text. The project I am working on uses the searches for language analysis, so this may not be relevant to everyone. Here is a quick explanation.

In French you have six words (me, te, se, le/la, ne, de) where the final vowel is replaced with an apostrophe when the following word starts with a vowel. For example, "me aider" becomes "m'aider". Currently Lucene indexes m'aider, s'aider, and n'aider as different words, when in fact they should be analyzed as me aider, se aider, ne aider, etc. So I modified StandardFilter to send back each of these words as two tokens, which meant adding a one-token buffer. I toyed with modifying StandardTokenizer.jj instead, but I was worried about unintended changes in behavior.
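
To make the idea concrete, here is a minimal sketch of the splitting logic in plain Java. It is not the attached diff and does not use the Lucene API: the class name ElisionSplitter, the expand helper, and the choice to map "l'" to "le" (it could just as well be "la") are assumptions made for illustration only. It wraps a stream of token strings and uses a one-token buffer to hand back the second half of a split word on the following call.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration, not the StandardFilter patch itself.
final class ElisionSplitter implements Iterator<String> {
    // Elided prefix -> full word. "l" is mapped to "le" as an assumption;
    // without more context it cannot be told apart from "la".
    private static final Map<String, String> ELIDED = Map.of(
            "m", "me", "t", "te", "s", "se",
            "l", "le", "n", "ne", "d", "de");

    private final Iterator<String> input;
    private String buffered; // the one-token buffer

    ElisionSplitter(Iterator<String> input) {
        this.input = input;
    }

    @Override
    public boolean hasNext() {
        return buffered != null || input.hasNext();
    }

    @Override
    public String next() {
        // Emit the held-back second half of a previously split token first.
        if (buffered != null) {
            String token = buffered;
            buffered = null;
            return token;
        }
        String token = input.next();
        int apos = token.indexOf('\'');
        // Only split when there is text on both sides of the apostrophe.
        if (apos > 0 && apos < token.length() - 1) {
            String prefix = token.substring(0, apos).toLowerCase(Locale.ROOT);
            String expanded = ELIDED.get(prefix);
            if (expanded != null) {
                buffered = token.substring(apos + 1);
                return expanded;
            }
        }
        return token; // not an elision we know about: pass through unchanged
    }

    // Convenience helper: run the splitter over a whole list of tokens.
    static List<String> expand(List<String> tokens) {
        List<String> out = new ArrayList<>();
        Iterator<String> it = new ElisionSplitter(tokens.iterator());
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }
}
```

With this sketch, ElisionSplitter.expand(List.of("il", "m'aider", "aujourd'hui")) yields [il, me, aider, aujourd'hui]: "m'aider" is split because "m" is one of the six prefixes, while "aujourd'hui" passes through untouched because "aujourd" is not.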

This change should not affect English indexing. The only exception I can think of is that a word like m'lord would be indexed as "me lord". Still, it might be better to make a French package and add this to a French filter.

I hope this is useful to anyone working with French.
All the best.

Konrad

Attachment: StandardFilter.java.diff
Description: Binary data
