On Tue, Nov 22, 2011 at 6:50 PM, Marvin Humphrey <[email protected]> wrote:
> I don't think we need to worry much about making this tokenizer flexible. We
> already offer a certain amount of flexibility via RegexTokenizer.
I agree with this. I think the number of people who need an extremely efficient tokenizer that is also extremely flexible is low. Keep RegexTokenizer as the flexible option, and write this alternative for greater performance. Rather than making it completely configurable, put the emphasis on making it clear, simple, and independent of the inner workings of Lucy. Maybe put it in LucyX (API dogfood), and let it serve as an example for anyone who wants to write their own.

My tokenizing needs are theoretical at this point, but the areas I care about involve tokenizing whitespace, capitalization, and markup. I'd like to discourage a quoted search for "Proper Name" from matching "is that proper?<br>\nName your price," and I think the easiest way to do that is to index some things that would normally be ignored. I also care about punctuation, such as the apostrophe in Marvin's "Maggie's Farm" example, as well as things like "hyphenated-compound", "C++", and "U.S.A.". A rough sketch of what I mean follows below.
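To make both ideas concrete, here's a sketch in Python. It's purely illustrative, not Lucy code: TOKEN_RE, the tokenize() helper, and the '§' sentinel are all inventions for this example. The point is just the shape of the problem: alternation order matters (the specific forms must come before the generic \w+ fallback, or "C++" tokenizes as just "C"), and boundary material gets indexed as a sentinel instead of being discarded, so a phrase query can't match across it.

    import re

    # Illustrative pattern only. Alternatives are tried left to right,
    # so the specific forms (dotted acronyms, "C++") must precede the
    # generic word fallback.
    TOKEN_RE = re.compile(r"""
          (?:[A-Za-z]\.){2,}                 # dotted acronyms: U.S.A.
        | [A-Za-z]\+\+                       # a letter plus "++": C++
        | \w+(?:[-']\w+)*                    # words with internal hyphens
                                             # or apostrophes: Maggie's,
                                             # hyphenated-compound
        | (?P<boundary> [.!?] | <[^>]+> )    # sentence end or markup
    """, re.VERBOSE)

    def tokenize(text):
        tokens = []
        for m in TOKEN_RE.finditer(text):
            # Index a sentinel where a boundary occurs, instead of
            # throwing it away, so "Proper Name" can't match across
            # "proper?<br>\nName".
            tokens.append('§' if m.group('boundary') else m.group())
        return tokens

    # tokenize("is that proper?<br>\nName your price")
    # -> ['is', 'that', 'proper', '§', '§', 'Name', 'your', 'price']
    # tokenize("Maggie's Farm, C++, and the U.S.A.")
    # -> ["Maggie's", 'Farm', 'C++', 'and', 'the', 'U.S.A.']

A real pattern would need more than this (Unicode word characters, for a start), but it shows roughly what I'm after.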
--nate