On 8 Mar 2011, at 5:36 PM, Marvin Humphrey wrote:

> Greets,
>
> Right now, Lucy only has one tokenizer-style Analyzer subclass:
> Lucy::Analysis::Tokenizer, which is regex based.
>
> At some point, I expect we will have other tokenizer classes which don't use a
> regex engine, so I think it would be best to reserve the name "Tokenizer" for
> future use and rename the current Tokenizer to "RegexTokenizer".
>
> Another possibility would be "PerlRegexTokenizer", embedding the regex dialect
> that will be used to interpret the supplied pattern in the class name.
> However, the exact behavior of the regular expression engine is not consistent
> across different versions of Perl. In general, it's not going to be possible
> to translate a pattern between different regex engines. If we try to specify
> the regex dialect precisely so that the tokenization behavior is fully defined
> by the serialized analyzer within the schema file, the only remedy on mismatch
> will be to throw an exception and refuse to read the index.
>
> Therefore, I think we should just have a single class named "RegexTokenizer"
> which is defined as deferring to the host language's regex engine. Managing
> portability across different host languages or different versions of the host
> language will be left to the user.
>
> Marvin Humphrey

Sounds like a reasonable approach. Tokenizer for the interface and
RegexTokenizer for platform-specific regexes (which, in fairness, is kinda
what people would expect anyway).

Many things support Perl5 regexes to varying degrees, so you'd likely not have
too much trouble from a portability perspective. If you wanted to lock it in
across host languages, then you could always implement this in C using the
library of your choice due to the architecture, right?

Cheers,

ast
--
Andrew S. Townley <[email protected]>
http://atownley.org
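
For illustration only, here is a rough sketch of the split discussed above: a generic Tokenizer interface, plus a RegexTokenizer that simply defers to the host language's regex engine (Python's `re` module here). This is not Lucy's actual API. The `tokenize` method name, the default pattern, and the WhitespaceTokenizer class are all assumptions made up for the example.

```python
import re
from abc import ABC, abstractmethod
from typing import List


class Tokenizer(ABC):
    """Hypothetical interface: anything that splits text into tokens."""

    @abstractmethod
    def tokenize(self, text: str) -> List[str]:
        """Return the tokens found in `text`, in order."""


class RegexTokenizer(Tokenizer):
    """Defers to the host language's regex engine (Python's `re` here).

    The pattern is stored as a plain string, so how it is interpreted
    depends on whichever engine compiles it.
    """

    # Assumed default for this sketch: word characters, allowing
    # internal apostrophes ("it's").
    def __init__(self, pattern: str = r"\w+(?:'\w+)*"):
        self.pattern = pattern
        self._regex = re.compile(pattern)

    def tokenize(self, text: str) -> List[str]:
        # Each full match of the pattern becomes one token.
        return [m.group(0) for m in self._regex.finditer(text)]


class WhitespaceTokenizer(Tokenizer):
    """Hypothetical non-regex tokenizer, showing why the generic
    name is worth reserving for the interface."""

    def tokenize(self, text: str) -> List[str]:
        return text.split()
```

Because only the pattern string would be serialized, the same schema could tokenize differently under another host language's engine, which is the portability question left to the user in the proposal.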
