On 24/02/2012 00:27, Marvin Humphrey wrote:
> On Thu, Feb 23, 2012 at 01:56:33PM +0100, Nick Wellnhofer wrote:
> I'm -0 on adding a new non-extendable "WordTokenizer" class, though.
> WordTokenizer expands Lucy's public API for no other reason than performance
> in a not-very-common use case; that's not a good rationale for taking on the
> maintenance burden of a new public class, and it sets a bad precedent.  Next
> up will be WhiteSpaceTokenizer, and down the road we go... It will never end,
> because users have so many different tokenization requirements.

A second benefit over RegexTokenizer is better Unicode support, although I'm not really interested in that personally. I also don't think maintaining another tokenizer class would be much of a problem. I'd rather measure the maintenance burden in lines of code than in the number of public classes. Consequently, I'd be OK with a WhiteSpaceTokenizer and a couple of other tokenizers if their implementations are as trivial as the WordTokenizer I proposed.
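
For comparison, here is roughly how the proposed WordTokenizer could be faked on top of the existing RegexTokenizer from Perl. The character classes below are only my assumption of what "letters and numbers" should cover, not what my patch actually implements:

    use Lucy::Analysis::RegexTokenizer;

    # Rough stand-in for the proposed WordTokenizer: grab runs of
    # Unicode letters and decimal digits with the existing class.
    # The exact character classes are an assumption for illustration.
    my $word_ish = Lucy::Analysis::RegexTokenizer->new(
        pattern => '[\p{L}\p{Nd}]+',
    );

That works, but at least in the Perl bindings every token goes through the host regex engine, and the precise meaning of the character classes depends on that engine, which is where the performance and Unicode points above come from.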

OTOH, I don't want to force a specialized tokenizer into the code base that no one besides me deems useful. I'd prefer to work on support for compiled extensions, so everyone can make small extensions to Lucy in C without seeking public consensus or maintaining their own patchsets.

> If you don't have time to work on an extension mechanism for
> StandardTokenizer, maybe I can help out.  I'm going to go study up on UAX #29
> and how you implemented StandardTokenizer and see if I can come up with
> any ideas.

> In the meantime, if you want to commit WordTokenizer, I won't object.  FWIW, I
> believe that the analogous Lucene class is called "LetterTokenizer", so you
> might consider renaming it.
>
>     http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/analysis/LetterTokenizer.html

I had a look at Lucene's tokenizers. LetterTokenizer seems to work with letters only, whereas my proposed WordTokenizer also handles numbers. I know the name is perhaps a bit too general, but I couldn't come up with anything better.
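
If it helps to see that distinction spelled out, the difference boils down to roughly the following. This is only an illustration of the two character classes, not how either class is actually implemented:

    use strict;
    use warnings;
    use utf8;

    my $text = "route 66 café";

    # LetterTokenizer-style split: runs of letters only, digits dropped.
    my @letters_only = $text =~ /\p{L}+/g;                 # ("route", "café")

    # WordTokenizer-style split: letters plus decimal digits.
    my @letters_and_digits = $text =~ /[\p{L}\p{Nd}]+/g;   # ("route", "66", "café")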

> However, I hope we can manage to create an extension mechanism before the
> release of Lucy 0.4.0 which allows the user to code up the equivalent of
> WordTokenizer as a user-space subclass, and that you won't object to
> removing WordTokenizer before it escapes into the wild in that case.

Never mind. I decided against committing my patch in its current form.

Nick
