Re: StandardFilter that works for French

Joshua O'Madadhain Thu, 21 Nov 2002 13:01:43 -0800

On Thu, 21 Nov 2002, Konrad Scherer wrote:

> In French you have 6 words (me, te, se, le/la, ne, de) where the final
> vowel is replaced with an apostrophe when the following word starts with
> a vowel. For example, me aider becomes m'aider. Currently Lucene indexes
> m'aider, s'aider, n'aider as different words when in fact they should be
> analyzed as me aider, se aider, ne aider, etc. So I modified StandardFilter
> to send back these words as two tokens. I had to add a one-Token buffer.
> I toyed with modifying StandardTokenizer.jj but I was worried about
> unintended changes in behavior.
>
> This change will not affect English indexing. The only change I can think
> of is that a word like m'lord would be indexed as "me lord". Still, it might
> be better to make a French package and add this to a French filter.


There are a number of English contractions that could be affected if
you're using the apostrophe as a marker, e.g.: isn't, wouldn't, I'd, he's,
hasn't.  (Granted, these are often treated as stop words.)  So I agree
that putting this change into a French filter, rather than modifying
StandardFilter, is the better approach.
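
To make the splitting rule concrete, here is a rough sketch of how I read
the idea. It is plain Java, not the actual patch and not Lucene's API: the
class name FrenchElisionSketch and the method expand are made up for
illustration, and mapping l' to "le" (rather than "la") is an arbitrary
assumption. Like the behaviour described above for m'lord, it does not check
whether the second part starts with a vowel.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Lucene.
public class FrenchElisionSketch {

    // Elided one-letter prefixes and the full word each stands for.
    // Assumption: l' is always expanded to "le", although it may also be "la".
    private static final Map<String, String> ELIDED = new HashMap<>();
    static {
        ELIDED.put("m", "me");
        ELIDED.put("t", "te");
        ELIDED.put("s", "se");
        ELIDED.put("l", "le");
        ELIDED.put("n", "ne");
        ELIDED.put("d", "de");
    }

    // Returns two tokens for an elided form such as m'aider,
    // or the original token unchanged for anything else.
    public static List<String> expand(String token) {
        int apos = token.indexOf('\'');
        // No apostrophe, or apostrophe at the very start or end: leave as is.
        if (apos <= 0 || apos == token.length() - 1) {
            return Collections.singletonList(token);
        }
        String full = ELIDED.get(token.substring(0, apos).toLowerCase());
        if (full == null) {
            // The part before the apostrophe is not one of the six prefixes.
            return Collections.singletonList(token);
        }
        return Arrays.asList(full, token.substring(apos + 1));
    }
}
```

With this particular rule, expand("m'aider") gives the two tokens "me" and
"aider", and expand("m'lord") gives "me" and "lord", while a token such as
"isn't" comes back unchanged because "isn" is not one of the six prefixes.
A filter built along these lines would still need the one-token buffer
mentioned above to hand back the second token on the next call.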

Joshua O'Madadhain

  [EMAIL PROTECTED] Per Obscurius....www.ics.uci.edu/~jmadden
   Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for.  -- Bill Watterson
 My opinions are too rational and insightful to be those of any organization.




