Re: [lucy-dev] RegexTokenizer

Andrew S. Townley Tue, 08 Mar 2011 09:51:08 -0800

On 8 Mar 2011, at 5:36 PM, Marvin Humphrey wrote:

> Greets,
> 
> Right now, Lucy only has one tokenizer-style Analyzer subclass:
> Lucy::Analysis::Tokenizer, which is regex based.  
> 
> At some point, I expect we will have other tokenizer classes which don't use a
> regex engine, so I think it would be best to reserve the name "Tokenizer" for
> future use and rename the current Tokenizer to "RegexTokenizer".
> 
> Another possibility would be "PerlRegexTokenizer", embedding the regex dialect
> that will be used to interpret the supplied pattern in the class name.
> However, the exact behavior of the regular expression engine is not consistent
> across different versions of Perl.  In general, it's not going to be possible
> to translate a pattern between different regex engines.  If we try to specify
> the regex dialect precisely so that the tokenization behavior is fully defined
> by the serialized analyzer within the schema file, the only remedy on mismatch
> will be to throw an exception and refuse to read the index.
> 
> Therefore, I think we should just have a single class named "RegexTokenizer"
> which is defined as deferring to the host language's regex engine.  Managing
> portability across different host languages or different versions of the host
> language will be left to the user.
> 
> Marvin Humphrey


Sounds like a reasonable approach.  Tokenizer for the interface and 
RegexTokenizer for platform-specific regexes (which, in fairness, is kinda what 
people would expect anyway).
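
To make that split concrete, here is a rough sketch in Python rather than Lucy's own Perl/C. It is not Lucy's API: the class layout, the `tokenize` method, and the default pattern are all assumptions made for illustration. The point is only that the base class names the interface, while the regex subclass hands the pattern to whatever engine the host language provides (Python's `re` here).

```python
import re
from abc import ABC, abstractmethod


class Tokenizer(ABC):
    """Hypothetical interface: turn a string into a list of token strings."""

    @abstractmethod
    def tokenize(self, text):
        ...


class RegexTokenizer(Tokenizer):
    """Hypothetical subclass that defers to the host regex engine (Python's re)."""

    def __init__(self, pattern=r"\w+"):
        # The pattern is kept as a plain string; how it is interpreted
        # depends entirely on the host language's regex dialect.
        self.pattern = pattern
        self._compiled = re.compile(pattern)

    def tokenize(self, text):
        # Each non-overlapping match of the pattern becomes one token.
        return self._compiled.findall(text)


# Illustrative pattern: word characters, optionally joined by apostrophes.
tokens = RegexTokenizer(r"\w+(?:'\w+)*").tokenize("Lucy's tokenizer isn't fixed.")
print(tokens)
```

The same pattern string handed to a different engine could split the text differently, which is the portability caveat Marvin raises above.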

Many things support Perl 5 regexes to varying degrees, so you'd likely not have 
too much trouble from a portability perspective.  If you wanted to lock the 
behavior in across host languages, the architecture would let you implement 
the tokenizer in C using the regex library of your choice, right?

Cheers,

ast
--
Andrew S. Townley <[email protected]>
http://atownley.org
