Currently, the new StandardTokenizer implements the word break algorithm defined in Unicode Standard Annex #29 (UAX #29). One detail of this algorithm is that it defines a set of "MidLetter" and "MidNum" characters which don't break a sequence of letters or numbers. The main reason seems to be to avoid breaking around characters like apostrophes or number separators.

While some people might prefer this behavior, I'd like to add a second mode of operation that splits on every character that is not alphanumeric, with the exception of underscores. This would closely resemble a RegexTokenizer with a \w+ pattern.
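
To make the intended output concrete, here is a rough sketch of the \w+ behavior using plain java.util.regex rather than any Lucene class. The class name WordCharTokenizerSketch is made up for illustration, and the UNICODE_CHARACTER_CLASS flag is my assumption so that \w also covers non-ASCII letters and digits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Lucene.
public class WordCharTokenizerSketch {
    // \w+ matches runs of letters, digits and underscores.
    // UNICODE_CHARACTER_CLASS extends \w beyond ASCII (assumption for this sketch).
    private static final Pattern WORD =
            Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [don, t, pay, 1, 000, for, foo_bar]
        System.out.println(tokenize("don't pay 1,000 for foo_bar"));
    }
}
```

Note how the apostrophe and the thousands separator both split the token, while the underscore does not.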

The whole thing could be implemented by simply adding an option to StandardTokenizer so that the "MidLetter" and "MidNum" rules are ignored and those characters break tokens like any other punctuation.
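
As a toy model of what such an option might look like, here is a small standalone sketch. It is not the StandardTokenizer grammar and not a full UAX #29 implementation: the class name, the ignoreMidChars flag, and the two one-character sets standing in for the real property tables are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model; names and character sets are assumptions.
public class MidCharOptionSketch {
    // Illustrative stand-ins, NOT the real Unicode property tables.
    private static final String JOINS_LETTERS = "'";
    private static final String JOINS_DIGITS = ",";

    static List<String> tokenize(String text, boolean ignoreMidChars) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c) || c == '_') {
                current.append(c);              // ordinary word character
            } else if (!ignoreMidChars && joins(text, i)) {
                current.append(c);              // mid character kept inside the token
            } else if (current.length() > 0) {
                tokens.add(current.toString()); // anything else ends the token
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // A mid character joins only when flanked on both sides by letters (or digits).
    private static boolean joins(String text, int i) {
        if (i == 0 || i == text.length() - 1) {
            return false;
        }
        char c = text.charAt(i);
        char prev = text.charAt(i - 1);
        char next = text.charAt(i + 1);
        if (JOINS_LETTERS.indexOf(c) >= 0) {
            return Character.isLetter(prev) && Character.isLetter(next);
        }
        if (JOINS_DIGITS.indexOf(c) >= 0) {
            return Character.isDigit(prev) && Character.isDigit(next);
        }
        return false;
    }
}
```

With the flag off, tokenize("don't pay 1,000", false) keeps "don't" and "1,000" whole; with the flag on, the same input splits into don, t, pay, 1, 000.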

Nick
