On Tue, Nov 22, 2011 at 6:50 PM, Marvin Humphrey <[email protected]> wrote:
> I don't think we need to worry much about making this tokenizer flexible. We
> already offer a certain amount of flexibility via RegexTokenizer.
I agree with this. I think the number of people who need an extremely efficient tokenizer that is also extremely flexible is low. Keep RegexTokenizer as the flexible option, and write this alternative for greater performance. Rather than making it completely configurable, put the emphasis on making it clear, simple, and independent of the inner workings of Lucy. Maybe put it in LucyX (API dogfood), and let it serve as an example for anyone who wants to write their own.

My tokenizing needs are theoretical at this point, but the areas I care about involve tokenizing whitespace, capitalization, and markup. I'd like to discourage a quoted search for "Proper Name" from matching "is that proper?<br>\nName your price," and I think the easiest way to do that is to index some things that would normally be ignored. I also care about punctuation, such as the apostrophe in Marvin's "Maggie's Farm" example, as well as things like "hyphenated-compound", "C++", and "U.S.A.". A rough sketch of what I mean follows below.
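To make both ideas concrete, here's a sketch in Python. It's purely illustrative, not Lucy code: TOKEN_RE, the tokenize() helper, and the '§' sentinel are all inventions for this example. The point is just the shape of the problem: alternation order matters (the specific forms must come before the generic \w+ fallback, or "C++" tokenizes as just "C"), and boundary material gets indexed as a sentinel instead of being discarded, so a phrase query can't match across it.

    import re

    # Illustrative pattern only. Alternatives are tried left to right,
    # so the specific forms (dotted acronyms, "C++") must precede the
    # generic word fallback.
    TOKEN_RE = re.compile(r"""
          (?:[A-Za-z]\.){2,}                 # dotted acronyms: U.S.A.
        | [A-Za-z]\+\+                       # a letter plus "++": C++
        | \w+(?:[-']\w+)*                    # words with internal hyphens
                                             # or apostrophes: Maggie's,
                                             # hyphenated-compound
        | (?P<boundary> [.!?] | <[^>]+> )    # sentence end or markup
    """, re.VERBOSE)

    def tokenize(text):
        tokens = []
        for m in TOKEN_RE.finditer(text):
            # Index a sentinel where a boundary occurs, instead of
            # throwing it away, so "Proper Name" can't match across
            # "proper?<br>\nName".
            tokens.append('§' if m.group('boundary') else m.group())
        return tokens

    # tokenize("is that proper?<br>\nName your price")
    # -> ['is', 'that', 'proper', '§', '§', 'Name', 'your', 'price']
    # tokenize("Maggie's Farm, C++, and the U.S.A.")
    # -> ["Maggie's", 'Farm', 'C++', 'and', 'the', 'U.S.A.']

A real pattern would need more than this (Unicode word characters, for a start), but it shows roughly what I'm after.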
--nate