On Thu, 21 Nov 2002, Konrad Scherer wrote:

> In French you have 6 words (me, te, se, le/la, ne, de) where the e is
> replaced with an apostrophe when the following word starts with a vowel.
> For example me aider becomes m'aider. Currently Lucene indexes m'aider,
> s'aider, n'aider as different words when in fact they should be analyzed as
> me aider, se aider, ne aider, etc. So I modified Standard filter to send
> back these words as two words. I had to add a one Token buffer. I toyed
> with modifying StandardTokenizer.jj but I was worried about unintended
> changes in behavior.
>
> This change will not affect English indexing. The only change I can think
> of is that a word like m'lord would be indexed as "me lord". Still it might
> be better to make a French package and add this to a French Filter.

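For anyone following along, here is a rough sketch of the splitting described above, written as a standalone helper rather than as the actual Standard filter patch (which was not posted). The class name ElisionSplitter, the method split, and the choice to expand l' to "le" are my own assumptions for illustration; none of this is Lucene API.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene: expands French elided words
// such as "m'aider" into two tokens, "me" and "aider".
final class ElisionSplitter {
    // Elided letter -> full word. "l" is ambiguous (le/la); picking "le"
    // is an arbitrary assumption made for this sketch.
    private static final Map<String, String> ELISIONS = Map.of(
        "m", "me", "t", "te", "s", "se",
        "l", "le", "n", "ne", "d", "de");

    static List<String> split(String token) {
        // Only handle a single letter followed by an apostrophe and more text.
        if (token.length() > 2 && token.charAt(1) == '\'') {
            String prefix = token.substring(0, 1).toLowerCase(Locale.ROOT);
            String full = ELISIONS.get(prefix);
            if (full != null) {
                return List.of(full, token.substring(2));
            }
        }
        // Anything else is passed through unchanged.
        return List.of(token);
    }

    public static void main(String[] args) {
        System.out.println(split("m'aider")); // [me, aider]
        System.out.println(split("m'lord"));  // [me, lord]
        System.out.println(split("isn't"));   // [isn't]
    }
}
```

In this sketch a token is split only when the single letter before the apostrophe is one of the six listed prefixes, so isn't passes through unchanged, while m'lord is still split as noted above. A real filter would also need the one-token buffer mentioned in the quoted message so that it can hand back the second word on the next call.
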
There are a number of contractions in English that could be affected if
you're using the apostrophe as a marker, e.g.: isn't, wouldn't, I'd, he's,
hasn't. (Granted, these are often considered stop words.) So I think your
idea of putting this change into a French filter, rather than modifying
Standard filter, is a good one.

Joshua O'Madadhain
[EMAIL PROTECTED]
Per Obscurius....www.ics.uci.edu/~jmadden