[lucy-dev] Extending the StandardTokenizer

Nick Wellnhofer Mon, 20 Feb 2012 04:53:00 -0800

Currently, the new StandardTokenizer implements the word break algorithm as defined in Unicode Annex #29. One detail of this algorithm is that it defines a set of "MidLetter" and "MidNum" characters which don't break a sequence of letters or numbers. It seems the main reason is to avoid breaking around characters like apostrophes or number separators.

While some people might prefer this behavior, I'd like to add a second mode of operation that splits on all characters that are not alphanumeric, with the exception of underscores. This would very much resemble a RegexTokenizer with a \w+ pattern.

The whole thing could be implemented by simply adding an option to StandardTokenizer so that "MidLetter" and "MidNum" characters are ignored.
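
To illustrate the difference between the two modes, here is a rough Python sketch. It is not Lucy code and not a full UAX #29 implementation: the function name `tokenize`, the `ignore_mid_chars` flag, and the small ASCII sets standing in for "MidLetter" and "MidNum" are assumptions made only for this example.

```python
import re

# Proposed mode: split on everything that is not alphanumeric or underscore.
PLAIN = re.compile(r"\w+")

# Rough approximation of the current behavior (assumption: ASCII only).
# A "mid" character is kept only when it sits between two letters
# (apostrophe, colon, period) or between two digits (apostrophe,
# comma, period, semicolon).
MID_AWARE = re.compile(
    r"\w+(?:(?:(?<=[^\W\d_])[':.](?=[^\W\d_])|(?<=\d)[',.;](?=\d))\w+)*"
)


def tokenize(text, ignore_mid_chars=False):
    """Return the tokens of `text` (hypothetical helper for illustration).

    With ignore_mid_chars=True, the mid characters always break a token,
    which matches a plain \\w+ pattern.
    """
    pattern = PLAIN if ignore_mid_chars else MID_AWARE
    return pattern.findall(text)


if __name__ == "__main__":
    sample = "Don't split 3.14 or foo_bar, ok?"
    print(tokenize(sample))
    print(tokenize(sample, ignore_mid_chars=True))
```

With the option off, "Don't" and "3.14" stay whole; with it on, they are split at the apostrophe and the period, while "foo_bar" is kept as one token in both modes.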


Nick
