Re: [lucy-dev] Extending the StandardTokenizer

Marvin Humphrey Thu, 23 Feb 2012 15:28:23 -0800

On Thu, Feb 23, 2012 at 01:56:33PM +0100, Nick Wellnhofer wrote:
> See the attached patch. I moved the word break property lookup to a new  
> method and override that method in a new "WordTokenizer" class.


As usual, your code is solid and clear, Nick.  It's always a pleasure to
read your patches.

Making StandardTokenizer extensible based on named properties is a cool idea.
It's tricky to get right API-design-wise, but it has the potential to serve a
lot of different use cases.

I'm -0 on adding a new non-extendable "WordTokenizer" class, though.
WordTokenizer expands Lucy's public API for no other reason than performance
in a not-very-common use case; that's not a good rationale for taking on the
maintenance burden of a new public class, and it sets a bad precedent.  Next
up will be WhiteSpaceTokenizer, and down the road we go... It will never end,
because users have so many different tokenization requirements.

If you don't have time to work on an extension mechanism for
StandardTokenizer, maybe I can help out.  I'm going to go study up on UAX #29
and how you implemented StandardTokenizer and see if I can come up with
any ideas.

In the meantime, if you want to commit WordTokenizer, I won't object.  FWIW, I
believe that the analogous Lucene class is called "LetterTokenizer", so you
might consider renaming it.

http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/analysis/LetterTokenizer.html

However, I hope we can manage to create an extension mechanism before the
release of Lucy 0.4.0 that allows the user to code up the equivalent of
WordTokenizer as a user-space subclass, and that you won't object to
removing WordTokenizer before it escapes into the wild in that case.
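
As a rough illustration of the subclassing approach, here is a minimal Python
sketch. It is not Lucy code and not a proposed API: the class names
(SimpleTokenizer, LetterOnlyTokenizer) and the is_word_char() hook are
hypothetical, and the splitting rule is a crude stand-in for the UAX #29 word
break rules. It only shows the shape of the idea: the base class owns the
tokenization loop, and a user-space subclass overrides a single property
lookup.

```python
import unicodedata


class SimpleTokenizer:
    """Hypothetical base class: owns the loop, delegates the property lookup."""

    def is_word_char(self, ch):
        # Default lookup (an assumption for this sketch): letters and
        # numbers count as word characters.
        return unicodedata.category(ch)[0] in ("L", "N")

    def tokenize(self, text):
        tokens, current = [], []
        for ch in text:
            if self.is_word_char(ch):
                current.append(ch)
            elif current:
                # A non-word character ends the token in progress.
                tokens.append("".join(current))
                current = []
        if current:
            tokens.append("".join(current))
        return tokens


class LetterOnlyTokenizer(SimpleTokenizer):
    """Hypothetical user-space subclass: only letters form tokens."""

    def is_word_char(self, ch):
        return unicodedata.category(ch).startswith("L")


print(SimpleTokenizer().tokenize("abc123 def"))
print(LetterOnlyTokenizer().tokenize("abc123 def"))
```

With a hook like this, a letters-only tokenizer needs no new public class in
the core library; the user overrides one method and inherits the rest.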

Marvin Humphrey
