Marvin Humphrey wrote on 2/23/12 5:27 PM:
>
> In the meantime, if you want to commit WordTokenizer, I won't object. FWIW, I
> believe that the analogous Lucene class is called "LetterTokenizer", so you
> might consider renaming it.
>
> http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/analysis/LetterTokenizer.html
>
> However, I hope we can manage to create an extension mechanism before the
> release of Lucy 0.4.0 which allows the user to code up the equivalent of
> WordTokenizer as a user-space subclass, and that you won't object to
> removing WordTokenizer before it escapes into the wild in that case.
>
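
(As a rough sketch of what that user-space equivalent could look like today,
assuming a Perl host and the existing RegexTokenizer; the '\p{L}+' pattern is
my own illustration rather than Nick's patch, and this route doesn't buy back
the performance that motivated a dedicated core class:)

    use Lucy;

    # Letter-only tokenization in user space: one token per maximal run
    # of Unicode letters, analogous to Lucene's LetterTokenizer.
    my $letter_tokenizer = Lucy::Analysis::RegexTokenizer->new(
        pattern => '\p{L}+',
    );
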
+1 to Marvin's hope. -0 to committing the code, and -1 to the name WordTokenizer.

Maybe this is bikeshedding, but that name just seems misleading. Doesn't the StandardTokenizer also tokenize words? The term 'word' is just too overloaded here. LetterTokenizer is slightly better, but I share Marvin's hope that we can find a way to get the performance love at the host-language subclass level without needing to support multiple variations on a theme in the core dist.

Let's figure out a way to avoid the naming conversation altogether and extend the StandardTokenizer to do what you need, Nick.

--
Peter Karman  . http://peknet.com/  . [email protected]
