Greets,

Lucene's TokenStream concept does not translate very well from Java to our target languages, primarily because calling next() once for each analyzer-token pairing creates a lot of method-call overhead. Also, the analysis process creates a *lot* of Token objects; in Plucene these were hash-based objects, relatively expensive to create and destroy.

The solution adopted by KinoSearch is the TokenBatch. Instead of passing tokens one by one through an analysis chain, they are grouped together and passed en masse from analyzer to analyzer. Analyzers do not supply a next() method; they supply an analyze() method, which both accepts and returns a TokenBatch object. Analysis chains are set up using a PolyAnalyzer object, which is an ordered array of Analyzers, each of which will be called upon to analyze() a TokenBatch in turn.
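For illustration, the PolyAnalyzer dispatch loop might look something like this in C. The names and signatures here are guesses for the sketch, not settled API:

```c
#include <stddef.h>

/* Hypothetical sketch of PolyAnalyzer dispatch.  lucy_TokenBatch is
 * opaque here; only the chaining logic is shown. */
typedef struct lucy_TokenBatch lucy_TokenBatch;

typedef struct lucy_Analyzer {
    /* Each Analyzer consumes a TokenBatch and returns a (possibly new) one. */
    lucy_TokenBatch* (*analyze)(struct lucy_Analyzer *self,
                                lucy_TokenBatch *batch);
} lucy_Analyzer;

typedef struct lucy_PolyAnalyzer {
    lucy_Analyzer **analyzers;      /* ordered array of child Analyzers */
    size_t          num_analyzers;
} lucy_PolyAnalyzer;

lucy_TokenBatch*
lucy_PolyAnalyzer_analyze(lucy_PolyAnalyzer *self, lucy_TokenBatch *batch)
{
    /* Pass the whole batch through each child Analyzer in turn. */
    for (size_t i = 0; i < self->num_analyzers; i++) {
        lucy_Analyzer *child = self->analyzers[i];
        batch = child->analyze(child, batch);
    }
    return batch;
}
```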

Implementing TokenBatch as a doubly-linked list of Token structs seems to work pretty well.

typedef struct lucy_Token {
    char              *text;         /* token text */
    lucy_i32_t         len;          /* length of text, in bytes */
    lucy_i32_t         start_offset; /* start offset into source text */
    lucy_i32_t         end_offset;   /* end offset into source text */
    lucy_i32_t         pos_inc;      /* position increment */
    struct lucy_Token *next;
    struct lucy_Token *prev;
} lucy_Token;

typedef struct lucy_TokenBatch {
    lucy_Token  *first;
    lucy_Token  *last;
    lucy_Token  *current;       /* cursor position */
    lucy_i32_t   size;
    lucy_bool_t  initialized;   /* whether iteration has begun */
} lucy_TokenBatch;

(We might define token->text as either an 8 or 16 bit type depending on how the target platform handles strings, but that's a topic for another day.)
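To make the ownership model concrete, here's one way construction and teardown might work against those structs. The function names are invented for the sketch; the point is that the batch is the sole owner of its Tokens, which is what the garbage collection discussion below turns on:

```c
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

/* Assumed typedefs; the real ones would live in a platform header. */
typedef int32_t lucy_i32_t;
typedef int     lucy_bool_t;

typedef struct lucy_Token {
    char              *text;
    lucy_i32_t         len;
    lucy_i32_t         start_offset;
    lucy_i32_t         end_offset;
    lucy_i32_t         pos_inc;
    struct lucy_Token *next;
    struct lucy_Token *prev;
} lucy_Token;

typedef struct lucy_TokenBatch {
    lucy_Token  *first;
    lucy_Token  *last;
    lucy_Token  *current;
    lucy_i32_t   size;
    lucy_bool_t  initialized;
} lucy_TokenBatch;

lucy_Token*
lucy_Token_new(const char *text, lucy_i32_t len,
               lucy_i32_t start_offset, lucy_i32_t end_offset)
{
    lucy_Token *self = calloc(1, sizeof(lucy_Token));
    self->text = malloc(len + 1);
    memcpy(self->text, text, len);
    self->text[len]    = '\0';
    self->len          = len;
    self->start_offset = start_offset;
    self->end_offset   = end_offset;
    self->pos_inc      = 1;
    return self;
}

/* Splice a Token onto the end of the doubly-linked list. */
void
lucy_TokenBatch_append(lucy_TokenBatch *batch, lucy_Token *token)
{
    token->prev = batch->last;
    token->next = NULL;
    if (batch->last != NULL) { batch->last->next = token; }
    else                     { batch->first      = token; }
    batch->last = token;
    batch->size++;
}

/* The batch owns its Tokens outright: they are freed here and nowhere
 * else, so each Token struct is destroyed exactly once. */
void
lucy_TokenBatch_destroy(lucy_TokenBatch *batch)
{
    lucy_Token *token = batch->first;
    while (token != NULL) {
        lucy_Token *next = token->next;
        free(token->text);
        free(token);
        token = next;
    }
    batch->first = batch->last = batch->current = NULL;
    batch->size  = 0;
}
```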

The tricky part here is how to expose TokenBatch and its constituent tokens as a native API, which is necessary for subclassing Analyzer.

Core Analyzer subclasses distributed with Lucy might peek into the guts of Token structs. However, it would be a bad idea to make *any* C struct's makeup part of Lucy's public API. The only supported way to get at a struct's members should be through methods.

The obvious way to affect individual Tokens would be to spin them off as native objects when iterating over the TokenBatch object. Illustrated in Java:

  while ((token = batch.next()) != null) {
    token.setText( lowerCase(token.getText()) );
  }

However, that introduces a garbage collection issue. Who's responsible for destroying the individual Token structs? We can't have them destroyed twice, once when the TokenBatch gets destroyed, and once when the spun-off token gets destroyed. I see two possibilities, neither of which is attractive. We might implement our own reference-counting scheme, which would be messy and complicated. Or we could start off by creating each token as a native object and defer to native GC, which is messy, complicated, and expensive to boot.

The least-worst solution I see is to avoid exposing the individual tokens via the native API at all, and instead allow access to them one at a time via methods on the TokenBatch object. Rather than returning a native Token object, next() would return a boolean indicating whether the TokenBatch is currently located at a valid Token position.

  while (batch.next()) {
    batch.setText( lowerCase(batch.getText()) );
  }
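Under the hood, the cursor methods might look something like this in C. The names are illustrative, not Lucy's actual API:

```c
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

/* Assumed typedefs and structs, as sketched earlier in this post. */
typedef int32_t lucy_i32_t;
typedef int     lucy_bool_t;

typedef struct lucy_Token {
    char              *text;
    lucy_i32_t         len;
    lucy_i32_t         start_offset;
    lucy_i32_t         end_offset;
    lucy_i32_t         pos_inc;
    struct lucy_Token *next;
    struct lucy_Token *prev;
} lucy_Token;

typedef struct lucy_TokenBatch {
    lucy_Token  *first;
    lucy_Token  *last;
    lucy_Token  *current;
    lucy_i32_t   size;
    lucy_bool_t  initialized;
} lucy_TokenBatch;

/* Advance the cursor.  Returns true while positioned on a valid Token,
 * false once the batch is exhausted. */
lucy_bool_t
lucy_TokenBatch_next(lucy_TokenBatch *batch)
{
    if (!batch->initialized) {
        batch->current     = batch->first;
        batch->initialized = 1;
    }
    else if (batch->current != NULL) {
        batch->current = batch->current->next;
    }
    return batch->current != NULL;
}

/* Accessors operate on whichever Token the cursor points at. */
const char*
lucy_TokenBatch_get_text(lucy_TokenBatch *batch)
{
    return batch->current->text;
}

void
lucy_TokenBatch_set_text(lucy_TokenBatch *batch,
                         const char *text, lucy_i32_t len)
{
    lucy_Token *token = batch->current;
    free(token->text);
    token->text = malloc(len + 1);
    memcpy(token->text, text, len);
    token->text[len] = '\0';
    token->len = len;
}
```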

This is perhaps a little less intuitive than iterating over a faux array of Tokens, but it's similar to how Lucene's TermDocs, TermEnum and Scorer classes all work.

It's also the fastest scheme I've come up with. Without making TokenBatch a native array of Token objects...

  for my $token (@tokens) {
    $token->set_text( $token->get_text );
  }

... there's no way to avoid the method-call overhead of next(). Any user-defined subclass of Analyzer implemented via the native API is therefore going to be a little slower than the core Analyzer classes, which are allowed to access struct members directly, but that's the way it goes. If we supply a good, flexible set of core Analyzer classes, hopefully it won't be necessary for most people to create custom Analyzers, and their indexing apps will stay fast enough.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/