Greets,

Right now, Lucy has only one tokenizer-style Analyzer subclass: Lucy::Analysis::Tokenizer, which is regex-based.
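
To make "regex-based" concrete, here is a minimal sketch in Python. It is not Lucy's actual code: the class name, the `tokenize` method, and the `\w+` default pattern are all assumptions chosen for illustration. The point is only that the supplied pattern is handed to whatever regex engine the host language provides, and each match becomes one token.

```python
import re


class RegexTokenizer:
    """Hypothetical sketch of a tokenizer driven by the host's regex engine."""

    def __init__(self, pattern=r"\w+"):
        # Keep the pattern source as a plain string (the part that could be
        # serialized), and compile it with Python's own `re` module.
        # The default pattern here is an assumption, not Lucy's default.
        self.pattern = pattern
        self._regex = re.compile(pattern)

    def tokenize(self, text):
        # Every non-overlapping match of the pattern becomes one token.
        return [match.group(0) for match in self._regex.finditer(text)]


tokens = RegexTokenizer().tokenize("Eats, shoots and leaves.")
print(tokens)
```

Because the pattern is stored as a string and interpreted by the local engine, the same serialized pattern could tokenize differently under a different host language or engine version.
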
At some point, I expect we will have other tokenizer classes which don't use a regex engine, so I think it would be best to reserve the name "Tokenizer" for future use and rename the current Tokenizer to "RegexTokenizer".

Another possibility would be "PerlRegexTokenizer", which would embed in the class name the regex dialect used to interpret the supplied pattern. However, the exact behavior of the regular expression engine is not consistent across different versions of Perl, and in general it's not going to be possible to translate a pattern between different regex engines. If we try to specify the regex dialect precisely so that the tokenization behavior is fully defined by the serialized analyzer within the schema file, the only remedy on mismatch will be to throw an exception and refuse to read the index.

Therefore, I think we should just have a single class named "RegexTokenizer" which is defined as deferring to the host language's regex engine. Managing portability across different host languages or different versions of the host language will be left to the user.

Marvin Humphrey
