On 8/28/2010 7:59 PM, Shawn Heisey wrote:
The only drop in term quality that I noticed was that possessive words (apostrophe-s) no longer have the original preserved. I haven't yet decided whether that's a problem.

I finally noticed another drop in term quality from the dual pass: words with punctuation in the middle (like wolf-biederman) are no longer preserved with that punctuation intact. I need a different filter that strips non-alphanumerics from the beginning and end of terms, run after the tokenizer and the ASCII folding filter but before the word delimiter filter. Does such a filter already exist, or do I need to use something regex-based? Are there any recommended regex patterns for this?
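
To make the behavior I'm after concrete, here is a minimal sketch in Python rather than in an actual analyzer configuration. The pattern and the helper name strip_edges are my own placeholders, not anything from an existing filter, and the character class assumes terms are plain ASCII by the time they reach this step, which may not hold for everything the ASCII folding filter passes through.

```python
import re

# Hypothetical pattern: a run of non-alphanumerics anchored at the start
# of the term, or a run anchored at the end. Interior punctuation is not
# matched because neither alternative can match in the middle of a term.
EDGE_PUNCT = re.compile(r"^[^A-Za-z0-9]+|[^A-Za-z0-9]+$")


def strip_edges(term: str) -> str:
    """Remove leading and trailing non-alphanumerics from a single term."""
    return EDGE_PUNCT.sub("", term)


if __name__ == "__main__":
    for term in ["(wolf-biederman)", "wolf-biederman,", "\"o'brien\"", "..."]:
        print(repr(term), "->", repr(strip_edges(term)))
```

A term made up entirely of punctuation comes out empty, so whatever runs after this step would need to drop empty terms.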

Thanks,
Shawn
