On 23/11/11 03:50, Marvin Humphrey wrote:
> How about making this tokenizer implement the word break rules described in
> the Unicode standard annex on Text Segmentation?  That's what the Lucene
> StandardTokenizer does (as of 3.1).

That would certainly be a nice choice for the default tokenizer. It would be easy to implement with ICU, but utf8proc doesn't buy us much here.
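
For illustration, here's a minimal sketch of locale-aware word segmentation using the JDK's java.text.BreakIterator (ICU's BreakIterator exposes essentially the same API and follows UAX #29 more closely). This is only to show the kind of behavior involved, not a proposal for the actual implementation:

    import java.text.BreakIterator;
    import java.util.ArrayList;
    import java.util.List;

    public class WordBreakDemo {
        // Extract word tokens using the platform's word-break rules.
        static List<String> words(String text) {
            BreakIterator it = BreakIterator.getWordInstance();
            it.setText(text);
            List<String> tokens = new ArrayList<>();
            int start = it.first();
            for (int end = it.next(); end != BreakIterator.DONE;
                 start = end, end = it.next()) {
                String candidate = text.substring(start, end);
                // Skip boundary spans that are just whitespace or punctuation.
                if (candidate.codePoints().anyMatch(Character::isLetterOrDigit)) {
                    tokens.add(candidate);
                }
            }
            return tokens;
        }

        public static void main(String[] args) {
            System.out.println(words("can't stop, won't stop -- naïve café 3.14"));
        }
    }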

> I don't think we need to worry much about making this tokenizer flexible.  We
> already offer a certain amount of flexibility via RegexTokenizer.

Yes, making this tokenizer customizable probably isn't worth the effort. I'd be happy with a simple tokenizer that extracts \w+ tokens, and I can offer to implement one if it's deemed useful.
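
For the record, the behavior I have in mind is roughly the following (a throwaway Java sketch using the JDK regex engine's UNICODE_CHARACTER_CLASS flag so that \w covers non-ASCII word characters; how it would actually be wired into Lucy is a separate question):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class SimpleWordTokenizer {
        // \w+ with UNICODE_CHARACTER_CLASS so "word character" follows
        // Unicode properties rather than being limited to [A-Za-z0-9_].
        private static final Pattern WORD =
            Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);

        static List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<>();
            Matcher m = WORD.matcher(text);
            while (m.find()) {
                tokens.add(m.group());
            }
            return tokens;
        }

        public static void main(String[] args) {
            System.out.println(tokenize("Überraschung! foo_bar, baz-42"));
            // -> [Überraschung, foo_bar, baz, 42]
        }
    }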

Nick
