On Wed, Nov 23, 2011 at 10:53:54PM +0100, Nick Wellnhofer wrote:
> On 23/11/11 03:50, Marvin Humphrey wrote:
>> How about making this tokenizer implement the word break rules described in
>> the Unicode standard annex on Text Segmentation? That's what the Lucene
>> StandardTokenizer does (as of 3.1).
>
> That would certainly be a nice choice for the default tokenizer. It
> would be easy to implement with ICU but utf8proc doesn't buy us much
> here.
Hmm, that's unfortunate. I think this would be a very nice feature to offer.

>> I don't think we need to worry much about making this tokenizer flexible. We
>> already offer a certain amount of flexibility via RegexTokenizer.
>
> Yes, making this tokenizer customizable probably isn't worth the effort.
> I'd be happy with a simple tokenizer that extracts \w+ tokens. I can
> offer to implement such a tokenizer if it's deemed useful.

A straight-up \w+ tokenizer wouldn't be optimal for English, at least. It
would break on apostrophes, resulting in a large number of solitary 's'
tokens thanks to possessives and contractions -- e.g. "maggie's farm" would
tokenize as ["maggie", "s", "farm"] instead of ["maggie's", "farm"].
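To illustrate, here's a quick Python sketch of the difference -- the second
pattern is just for demonstration, not a concrete proposal for the default
tokenizer:

    import re

    text = "maggie's farm"

    # Plain \w+ splits on the apostrophe, leaving a stray 's' token.
    print(re.findall(r"\w+", text))
    # -> ['maggie', 's', 'farm']

    # Allowing an apostrophe between word characters keeps possessives
    # and contractions intact.
    print(re.findall(r"\w+(?:'\w+)*", text))
    # -> ["maggie's", 'farm']

Something along the lines of the second pattern could presumably be handed to
RegexTokenizer already, though it still wouldn't catch curly apostrophes
(U+2019) or the other cases the UAX #29 word break rules handle.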
Marvin Humphrey