On 20/02/2012 18:27, Marvin Humphrey wrote:
> On Mon, Feb 20, 2012 at 01:52:26PM +0100, Nick Wellnhofer wrote:
>> Currently, the new StandardTokenizer implements the word break algorithm
>> as defined in Unicode Annex #29. One detail of this algorithm is that it
>> defines a set of "MidLetter" and "MidNum" characters which don't break a
>> sequence of letters or numbers. It seems the main reason is not to break
>> around characters like apostrophes or number separators.
>>
>> While some people might prefer this behavior, I'd like to add a second
>> mode of operation that splits on all characters that are not
>> alphanumeric, with the exception of underscores. This would very much
>> resemble a RegexTokenizer with a \w+ pattern.

> The documentation for the Lucene StandardTokenizer contains this paragraph:
>
>      Many applications have specific tokenizer needs. If this tokenizer does
>      not suit your application, please consider copying this source code
>      directory to your project and maintaining your own grammar-based
>      tokenizer.

> There is a lot of accumulated wisdom in that passage that I think we ought to
> consider.

>> The whole thing could be implemented by simply adding an option to
>> StandardTokenizer so that "MidLetter" and "MidNum" characters are ignored.

> I'm concerned that this may be the first feature request of many to come
> for StandardTokenizer, and that attempting to support all such requests within
> core is not sustainable.

I understand your concern, but I think the extension I proposed is the most useful and obvious one. I'm biased, of course, and other people will have different needs.

> To address the immediate concern, is it an option to just use RegexTokenizer
> with a \w+ pattern?  RegexTokenizer's primary utility is that it solves many,
> many use cases while posing a minimal ongoing maintenance burden.

A plain \w+ pattern would work for me. I'm mainly interested in the performance benefits of StandardTokenizer.
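
For what it's worth, here's roughly how that would look in a schema. This is
only a sketch against the Perl bindings as I understand them, so the parameter
names (RegexTokenizer's "pattern", FullTextType's "analyzer") may need
double-checking:

    use Lucy::Analysis::RegexTokenizer;
    use Lucy::Plan::FullTextType;

    # Tokenize on runs of \w characters, i.e. split at everything that is
    # not alphanumeric or an underscore.
    my $word_tokenizer = Lucy::Analysis::RegexTokenizer->new(
        pattern => '\w+',
    );
    my $type = Lucy::Plan::FullTextType->new(
        analyzer => $word_tokenizer,
    );

    # Unlike StandardTokenizer, this turns "can't" into ("can", "t") and
    # "3.14" into ("3", "14"), because UAX#29 treats apostrophe and full
    # stop as non-breaking "mid" characters between letters or digits.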

Actually, you can formulate the complete UAX#29 word breaking rules as a Perl regex that is even quite readable. But performance would probably suffer even more, because you'd have to use Perl's \p{} construct to look up the word break properties.
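
To give an idea, a very stripped-down version of the letter/number rules
(ignoring Extend/Format characters, Katakana, ExtendNumLet and so on) might
look like this on a reasonably recent perl. The property names come from
perluniprops, but treat the whole thing as a sketch rather than a faithful
implementation of UAX#29:

    # Match a run of letters or digits, allowing single "mid" characters
    # (apostrophe, full stop, colon, ...) only when another letter or digit
    # follows -- roughly rules WB5-WB12.
    my $word = qr/
        (?: \p{WB=ALetter} | \p{WB=Numeric} )
        (?:
            \p{WB=ALetter}
          | \p{WB=Numeric}
          | (?: \p{WB=MidLetter} | \p{WB=MidNumLet} ) (?= \p{WB=ALetter} )
          | (?: \p{WB=MidNum}    | \p{WB=MidNumLet} ) (?= \p{WB=Numeric} )
        )*
    /x;

    my @tokens = ( "can't stop 3.14" =~ /$word/g );
    # ("can't", "stop", "3.14")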

> Thinking longer term, I believe the interests of all would be best served if
> the progression went something like this:
>
>     1. Start with StandardTokenizer as the default.
>     2. Change to RegexTokenizer (or others) to address logical requirements
>        that StandardTokenizer does not meet.
>     3. Compile your own Tokenizer when you need to max out performance.
>
> In other words, our focus should be on making it possible to "Extend
> StandardTokenizer" (and potentially other Analyzers) arbitrarily.

One solution I've been thinking about is to make StandardTokenizer work with arbitrary word break property tables. That is, use the rules described in UAX#29 but allow for customized mappings of the word break property, which should cover many use cases. This would basically mean porting the code in devel/bin/UnicodeTable.pm to C and providing a nice public interface. It's certainly feasible, but there are some challenges involved, serialization for example.
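
To make that a little more concrete, the public interface I have in mind might
look roughly like the following. None of these class or parameter names exist
today; they're invented purely for illustration:

    # Hypothetical: build a word break property table with per-codepoint
    # overrides and hand it to StandardTokenizer.
    my $table = Lucy::Analysis::WordBreakTable->new;         # invented class
    $table->set_property( ord("'"), 'ALetter' );  # never break at apostrophes
    $table->set_property( ord('.'), 'Other' );    # always break at full stops

    my $tokenizer = Lucy::Analysis::StandardTokenizer->new(
        word_break_table => $table,               # invented parameter
    );

Whether such a table could be serialized along with the schema is exactly the
kind of challenge I meant above.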

> If that goal seems too far away, then my next suggestion would be to create a
> LucyX class to house a StandardTokenizer embellished with arbitrary extensions
> -- working name: LucyX::Analysis::NonStandardTokenizer.

That would be OK with me. On another note, is it possible to package Lucy extensions that contain C code outside of the main source tree?

Nick
