On Tue, Mar 8, 2011 at 9:36 AM, Marvin Humphrey <[email protected]> wrote:
> Therefore, I think we should just have a single class named "RegexTokenizer"
> which is defined as deferring to the host language's regex engine. Managing
> portability across different host languages or different versions of the host
> language will be left to the user.

Maybe I'm misunderstanding, but I'd suggest thinking really carefully before doing this. I think one of the strengths of Lucy's host-core split is that the core remains language agnostic. Once each index becomes specific to its host language, wouldn't you lose the ability to create the index in one language and access it from another?

While there is some advantage to having all the tokenizing be host native, I think there is greater value in being able to create the index with a good text processing language (Perl in my case) while being able to perform the searches from a compiled language (likely C).

I'd suggest instead that RegexTokenizer be host-independent and use something like PCRE. While this might make for a few odd corner cases, I think it will work better in multilingual projects. Make it easy to switch to a different tokenizer, but provide something built in that can be used standalone.

But maybe this is a philosophical rather than practical problem: do you view the (future) C API as distinct from Lucy Core? If one wanted to wrap the core up to act as a freestanding HTTP or 0mq server, what would the "host language" be?

> If we try to specify
> the regex dialect precisely so that the tokenization behavior is fully defined
> by the serialized analyzer within the schema file, the only remedy on mismatch
> will be to throw an exception and refuse to read the index.

I'm not getting this. Is there a failure mode other than not finding the token you search for? I can envision cases where you might consciously want different tokenizers working on the same index: stemming in one and not the other, or maybe even indexing bi-grams as a means of boosting ad hoc phrase queries.

Nathan Kurz
[email protected]
