On Wed, Nov 23, 2011 at 10:53:54PM +0100, Nick Wellnhofer wrote:
> On 23/11/11 03:50, Marvin Humphrey wrote:
>> How about making this tokenizer implement the word break rules described in
>> the Unicode standard annex on Text Segmentation? That's what the Lucene
>> StandardTokenizer does (as of 3.1).
>
> That would certainly be a nice choice for the default tokenizer. It
> would be easy to implement with ICU but utf8proc doesn't buy us much
> here.
Hmm, that's unfortunate. I think this would be a very nice feature to offer.

>> I don't think we need to worry much about making this tokenizer flexible. We
>> already offer a certain amount of flexibility via RegexTokenizer.
>
> Yes, making this tokenizer customizable probably isn't worth the effort.
> I'd be happy with a simple tokenizer that extracts \w+ tokens. I can
> offer to implement such a tokenizer if it's deemed useful.

A straight-up \w+ tokenizer wouldn't be optimal for English, at least. It
would break on apostrophes, resulting in a large number of solitary 's'
tokens thanks to possessives and contractions -- e.g. "maggie's farm" would
tokenize as ["maggie", "s", "farm"] instead of ["maggie's", "farm"].
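To illustrate, here's a quick Python sketch of the difference -- the second
pattern is just for demonstration, not a concrete proposal for the default
tokenizer:

    import re

    text = "maggie's farm"

    # Plain \w+ splits on the apostrophe, leaving a stray 's' token.
    print(re.findall(r"\w+", text))
    # -> ['maggie', 's', 'farm']

    # Allowing an apostrophe between word characters keeps possessives
    # and contractions intact.
    print(re.findall(r"\w+(?:'\w+)*", text))
    # -> ["maggie's", 'farm']

Something along the lines of the second pattern could presumably be handed to
RegexTokenizer already, though it still wouldn't catch curly apostrophes
(U+2019) or the other cases the UAX #29 word break rules handle.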
Marvin Humphrey