On Thu, Feb 23, 2012 at 01:56:33PM +0100, Nick Wellnhofer wrote:
> See the attached patch. I moved the word break property lookup to a new
> method and override that method in a new "WordTokenizer" class.

As usual, your code is solid and clear, Nick. It's always a pleasure to read your patches.

Making StandardTokenizer extensible based on named properties is a cool idea. It's tricky to get right API-design-wise, but it has the potential to serve a lot of different use cases.

I'm -0 on adding a new non-extendable "WordTokenizer" class, though. WordTokenizer expands Lucy's public API for no other reason than performance in a not-very-common use case; that's not a good rationale for taking on the maintenance burden of a new public class, and it sets a bad precedent. Next up will be WhiteSpaceTokenizer, and down the road we go... It will never end, because users have so many different tokenization requirements.

If you don't have time to work on an extension mechanism for StandardTokenizer, maybe I can help out. I'm going to go study up on UAX #29 and how you implemented StandardTokenizer and see if I can come up with any ideas.

In the meantime, if you want to commit WordTokenizer, I won't object. FWIW, I believe that the analogous Lucene class is called "LetterTokenizer", so you might consider renaming it.

http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/analysis/LetterTokenizer.html

However, I hope that before the release of Lucy 0.4.0 we can manage to create an extension mechanism that allows the user to code up the equivalent of WordTokenizer as a user-space subclass, and that in that case you won't object to removing WordTokenizer before it escapes into the wild.

Marvin Humphrey
