On 23/11/11 03:50, Marvin Humphrey wrote:
> How about making this tokenizer implement the word break rules described in
> the Unicode standard annex on Text Segmentation?  That's what the Lucene
> StandardTokenizer does (as of 3.1).

That would certainly be a nice choice for the default tokenizer. It would be easy to implement with ICU, but utf8proc doesn't buy us much here.
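
For illustration, here's a minimal sketch of locale-aware word segmentation using the JDK's java.text.BreakIterator (ICU's BreakIterator exposes essentially the same API and follows UAX #29 more closely). This is only to show the kind of behavior involved, not a proposal for the actual implementation:

    import java.text.BreakIterator;
    import java.util.ArrayList;
    import java.util.List;

    public class WordBreakDemo {
        // Extract word tokens using the platform's word-break rules.
        static List<String> words(String text) {
            BreakIterator it = BreakIterator.getWordInstance();
            it.setText(text);
            List<String> tokens = new ArrayList<>();
            int start = it.first();
            for (int end = it.next(); end != BreakIterator.DONE;
                 start = end, end = it.next()) {
                String candidate = text.substring(start, end);
                // Skip boundary spans that are just whitespace or punctuation.
                if (candidate.codePoints().anyMatch(Character::isLetterOrDigit)) {
                    tokens.add(candidate);
                }
            }
            return tokens;
        }

        public static void main(String[] args) {
            System.out.println(words("can't stop, won't stop -- naïve café 3.14"));
        }
    }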

> I don't think we need to worry much about making this tokenizer flexible.  We
> already offer a certain amount of flexibility via RegexTokenizer.

Yes, making this tokenizer customizable probably isn't worth the effort. I'd be happy with a simple tokenizer that extracts \w+ tokens, and I can offer to implement one if it's deemed useful.
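
For the record, the behavior I have in mind is roughly the following (a throwaway Java sketch using the JDK regex engine's UNICODE_CHARACTER_CLASS flag so that \w covers non-ASCII word characters; how it would actually be wired into Lucy is a separate question):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class SimpleWordTokenizer {
        // \w+ with UNICODE_CHARACTER_CLASS so "word character" follows
        // Unicode properties rather than being limited to [A-Za-z0-9_].
        private static final Pattern WORD =
            Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);

        static List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<>();
            Matcher m = WORD.matcher(text);
            while (m.find()) {
                tokens.add(m.group());
            }
            return tokens;
        }

        public static void main(String[] args) {
            System.out.println(tokenize("Überraschung! foo_bar, baz-42"));
            // -> [Überraschung, foo_bar, baz, 42]
        }
    }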

Nick
