On 24/02/2012 00:27, Marvin Humphrey wrote:
> On Thu, Feb 23, 2012 at 01:56:33PM +0100, Nick Wellnhofer wrote:
> I'm -0 on adding a new non-extendable "WordTokenizer" class, though.
> WordTokenizer expands Lucy's public API for no other reason than performance
> in a not-very-common use case; that's not a good rationale for taking on the
> maintenance burden of a new public class, and it sets a bad precedent.  Next
> up will be WhiteSpaceTokenizer, and down the road we go... It will never end,
> because users have so many different tokenization requirements.

A second benefit over RegexTokenizer is better Unicode support, although I'm not really interested in that personally. I also don't think maintaining another tokenizer class would be much of a problem. I'd rather measure the maintenance burden in lines of code than in the number of public classes. Consequently, I'd be OK with a WhiteSpaceTokenizer and a couple of other tokenizers if their implementations are as trivial as the WordTokenizer I proposed.
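
For comparison, here is roughly how the proposed WordTokenizer could be faked on top of the existing RegexTokenizer from Perl. The character classes below are only my assumption of what "letters and numbers" should cover, not what my patch actually implements:

    use Lucy::Analysis::RegexTokenizer;

    # Rough stand-in for the proposed WordTokenizer: grab runs of
    # Unicode letters and decimal digits with the existing class.
    # The exact character classes are an assumption for illustration.
    my $word_ish = Lucy::Analysis::RegexTokenizer->new(
        pattern => '[\p{L}\p{Nd}]+',
    );

That works, but at least in the Perl bindings every token goes through the host regex engine, and the precise meaning of the character classes depends on that engine, which is where the performance and Unicode points above come from.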

OTOH, I don't want to force a specialized tokenizer into the code base that no one besides me deems useful. I'd prefer to work on support for compiled extensions, so everyone can make small extensions to Lucy in C without seeking public consensus or maintaining their own patchsets.

> If you don't have time to work on an extension mechanism for
> StandardTokenizer, maybe I can help out.  I'm going to go study up on UAX #29
> and how you implemented StandardTokenizer and see if I can come up with
> any ideas.

> In the meantime, if you want to commit WordTokenizer, I won't object.  FWIW, I
> believe that the analogous Lucene class is called "LetterTokenizer", so you
> might consider renaming it.
>
>     http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/analysis/LetterTokenizer.html

I had a look at Lucene's tokenizers. LetterTokenizer seems to work with letters only, whereas my proposed WordTokenizer also handles numbers. I know the name is perhaps a bit too general, but I couldn't come up with anything better.
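
If it helps to see that distinction spelled out, the difference boils down to roughly the following. This is only an illustration of the two character classes, not how either class is actually implemented:

    use strict;
    use warnings;
    use utf8;

    my $text = "route 66 café";

    # LetterTokenizer-style split: runs of letters only, digits dropped.
    my @letters_only = $text =~ /\p{L}+/g;                 # ("route", "café")

    # WordTokenizer-style split: letters plus decimal digits.
    my @letters_and_digits = $text =~ /[\p{L}\p{Nd}]+/g;   # ("route", "66", "café")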

> However, I hope we can manage to create an extension mechanism before the
> release of Lucy 0.4.0 which allows the user to code up the equivalent of
> WordTokenizer as a user-space subclass, and that you won't object to
> removing WordTokenizer before it escapes into the wild in that case.

Never mind. I decided against committing my patch in its current form.

Nick
