There are a number of contractions in English that could be affected if
you're using the apostrophe as a marker, e.g.: isn't, wouldn't, I'd, he's,
hasn't.  (Granted, these are often considered stop words.)  Thus, I think
that your idea of incorporating this change into a French filter, rather
than modifying Standard filter, is a good idea.
Sorry I forgot to mention that it only looks at words where the apostrophe occurs in the second letter and only for words that start with the six magic letters m,t,s,l,n,d . If filtering the very English specific 's and 'S possessives is good enough for the StandardFilter then why not French as well? In the comments of StandardTokenizer.jj we have "This should be a good tokenizer for most European-language documents". Most people will use this one, why not have it work as well as possible? The standard tokenizer is very english centric and the code I posted was for those who may not be aware of it. I work with a lot of bilingual documents (english and french) and my case, this filter improves the quality of the index.
More philosophically, there probably shouldn't even be a "standard" analyzer, just language specific ones.
All the best

Konrad


--
To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>



Reply via email to