Marvin Humphrey wrote on 2/23/12 5:27 PM:
>
> In the meantime, if you want to commit WordTokenizer, I won't object. FWIW, I
> believe that the analogous Lucene class is called "LetterTokenizer", so you
> might consider renaming it.
>
> http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/analysis/LetterTokenizer.html
>
> However, I hope we can manage to create an extension mechanism before the
> release of Lucy 0.4.0 which allows the user to code up the equivalent of
> WordTokenizer as a user-space subclass, and that you won't object to
> removing WordTokenizer before it escapes into the wild in that case.
>
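
(As a rough sketch of what that user-space equivalent could look like today,
assuming a Perl host and the existing RegexTokenizer; the '\p{L}+' pattern is
my own illustration rather than Nick's patch, and this route doesn't buy back
the performance that motivated a dedicated core class:)

    use Lucy;

    # Letter-only tokenization in user space: one token per maximal run
    # of Unicode letters, analogous to Lucene's LetterTokenizer.
    my $letter_tokenizer = Lucy::Analysis::RegexTokenizer->new(
        pattern => '\p{L}+',
    );
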
+1 to Marvin's hope. -0 to committing the code, and -1 to the name WordTokenizer.

Maybe this is bikeshedding, but that name just seems misleading. Doesn't the StandardTokenizer also tokenize words? The term 'word' is just too overloaded here. LetterTokenizer is slightly better, but I share Marvin's hope that we can find a way to get the performance love at the host-language subclass level without needing to support multiple variations on a theme in the core dist.

Let's figure out a way to avoid the naming conversation altogether and extend the StandardTokenizer to do what you need, Nick.

--
Peter Karman  . http://peknet.com/  . [email protected]
