Re: TokenStream and Token APIs

Michael McCandless Mon, 13 Oct 2008 02:07:11 -0700


This looks good!

One question on back compatibility: currently, TokenStream.next takes a Token arg in, and returns a Token back, such that the method is encouraged but not required to use the passed-in Token as its prototype.

You are adding a boolean nextToken() method, which then forces the reuse (which I think is good), but you need to ensure older TokenStream impls still work. I guess this amounts to a default implementation of boolean nextToken() in the base TokenStream class.
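A minimal sketch of what such a default could look like, assuming the legacy method is next(Token) as in the quoted message below. SketchToken, SketchTokenStream, LegacyWordStream and BackCompatDemo are hypothetical stand-ins for illustration, not Lucene classes, and the copy-back step is only one possible way to bridge the two contracts:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified stand-in for Token: it only holds a term buffer.
class SketchToken {
  private char[] termBuffer = new char[0];
  private int termLength;

  void setTerm(String term) {
    termBuffer = term.toCharArray();
    termLength = termBuffer.length;
  }

  void copyFrom(SketchToken other) {
    termBuffer = Arrays.copyOf(other.termBuffer, other.termLength);
    termLength = other.termLength;
  }

  String term() {
    return new String(termBuffer, 0, termLength);
  }
}

// Hypothetical base class carrying both the old and the new method.
abstract class SketchTokenStream {
  private SketchToken prototype;

  // Old-style API: may return the passed-in token or a different instance.
  public SketchToken next(SketchToken reusable) throws IOException {
    return null;
  }

  public SketchToken prototypeToken() throws IOException {
    if (prototype == null) {
      prototype = new SketchToken();
    }
    return prototype;
  }

  // New-style API: the default delegates to the old method, so a stream
  // that only overrides next(SketchToken) keeps working unchanged.
  public boolean nextToken() throws IOException {
    SketchToken proto = prototypeToken();
    SketchToken result = next(proto);
    if (result == null) {
      return false;
    }
    if (result != proto) {
      proto.copyFrom(result); // the old impl ignored the reusable token
    }
    return true;
  }
}

// An "older" stream: overrides only next() and never reuses the argument.
class LegacyWordStream extends SketchTokenStream {
  private final Iterator<String> words;

  LegacyWordStream(String... words) {
    this.words = Arrays.asList(words).iterator();
  }

  @Override
  public SketchToken next(SketchToken reusable) {
    if (!words.hasNext()) {
      return null;
    }
    SketchToken fresh = new SketchToken();
    fresh.setTerm(words.next());
    return fresh;
  }
}

class BackCompatDemo {
  // Consumes a stream through the new API only.
  static List<String> collectTerms(SketchTokenStream ts) {
    List<String> terms = new ArrayList<>();
    try {
      SketchToken token = ts.prototypeToken();
      while (ts.nextToken()) {
        terms.add(token.term());
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return terms;
  }
}
```

The price of this bridge is one extra copy per token for streams that do not reuse the passed-in instance; streams that already reuse it pay nothing.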


Mike

Michael Busch wrote:

Hi,
I've been thinking about making the TokenStream and Token APIs more flexible. E.g. for fields that don't store positions, the Token doesn't need to have a positionIncrement or a payload. With flexible indexing, on the other hand, people might want to add custom attributes to a Token that a consumer in the indexing chain could then use.
Of course it is possible to extend Token, because it is not final, and add additional attributes to it. But then consumers of the TokenStream must downcast every instance of the Token object when they call next(Token).
I was therefore thinking about a different TokenStream API:

 public abstract class TokenStream {
   public abstract boolean nextToken() throws IOException;

   public abstract Token prototypeToken() throws IOException;

   public void reset() throws IOException {}

   public void close() throws IOException {}
 }
Furthermore, Token itself would only keep the termBuffer logic and we could introduce different interfaces, like:
 public interface PayloadAttribute {
   /**
    * Returns this Token's payload.
    */
   public Payload getPayload();

   /**
    * Sets this Token's payload.
    */
   public void setPayload(Payload payload);
 }

 public interface PositionIncrementAttribute {
   /** Set the position increment.  This determines the position of
    *  this token relative to the previous Token in a
    * {@link TokenStream}, used in phrase searching.
    */
   public void setPositionIncrement(int positionIncrement);

   /** Returns the position increment of this Token.
    * @see #setPositionIncrement
    */
   public int getPositionIncrement();
 }
A consumer, e.g. the DocumentsWriter, does not create a Token instance itself anymore, but rather calls prototypeToken(). This method returns a Token subclass which implements all desired *Attribute interfaces.
If a consumer is e.g. only interested in the positionIncrement and Payload, it can consume the tokens like this:
 public class Consumer {
   public void consumeTokens(TokenStream ts) throws IOException {
     Token token = ts.prototypeToken();

     PayloadAttribute payloadSource = (PayloadAttribute) token;
     PositionIncrementAttribute positionSource =
                   (PositionIncrementAttribute) token;

     while (ts.nextToken()) {
       char[] term = token.termBuffer();
       int termLength = token.termLength();
       int positionIncrement = positionSource.getPositionIncrement();
       Payload payload = payloadSource.getPayload();

       // do something with the term, positionIncrement and payload
     }
   }
 }
Casting is now only done once, after the prototype token was created. Now if you want to add another consumer in the indexing chain and realize that you want to add another attribute to the Token, then you don't have to change this consumer. You only need to create another Token subclass that implements the new attribute in addition to the previous ones and can use it in the new consumer.
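To make that last point concrete, here is a small self-contained sketch under simplified assumptions. All names in it (TermToken, PositionIncrementAttr, TypeAttr, PositionToken, TypedPositionToken, AttrTokenStream, WordStream, PositionConsumer, TypeConsumer) are hypothetical and chosen for illustration only; TypeAttr stands in for "another attribute" added later. PositionConsumer is written against the position attribute alone and keeps working when the stream switches to a richer Token subclass:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal Token: term only.
class TermToken {
  private String term = "";
  void setTerm(String term) { this.term = term; }
  String term() { return term; }
}

interface PositionIncrementAttr {
  int getPositionIncrement();
  void setPositionIncrement(int positionIncrement);
}

// Hypothetical attribute introduced later for a new consumer.
interface TypeAttr {
  String getType();
  void setType(String type);
}

class PositionToken extends TermToken implements PositionIncrementAttr {
  private int positionIncrement = 1;
  public int getPositionIncrement() { return positionIncrement; }
  public void setPositionIncrement(int inc) { positionIncrement = inc; }
}

// New subclass: implements the new attribute in addition to the old one.
class TypedPositionToken extends PositionToken implements TypeAttr {
  private String type = "word";
  public String getType() { return type; }
  public void setType(String type) { this.type = type; }
}

abstract class AttrTokenStream {
  abstract boolean nextToken() throws IOException;
  abstract TermToken prototypeToken() throws IOException;
}

// Emits one token per word, refilling the single prototype instance.
class WordStream extends AttrTokenStream {
  private final String[] words;
  private final TypedPositionToken token = new TypedPositionToken();
  private int upto;

  WordStream(String... words) { this.words = words; }

  boolean nextToken() {
    if (upto == words.length) {
      return false;
    }
    token.setTerm(words[upto++]);
    token.setPositionIncrement(1);
    token.setType("word");
    return true;
  }

  TermToken prototypeToken() { return token; }
}

// Existing consumer: knows nothing about TypeAttr and needs no change.
class PositionConsumer {
  static List<String> consume(AttrTokenStream ts) {
    List<String> seen = new ArrayList<>();
    try {
      TermToken token = ts.prototypeToken();
      PositionIncrementAttr pos = (PositionIncrementAttr) token; // cast once
      int position = -1;
      while (ts.nextToken()) {
        position += pos.getPositionIncrement();
        seen.add(token.term() + "@" + position);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return seen;
  }
}

// New consumer: casts the same prototype to the new attribute.
class TypeConsumer {
  static List<String> consume(AttrTokenStream ts) {
    List<String> seen = new ArrayList<>();
    try {
      TermToken token = ts.prototypeToken();
      TypeAttr type = (TypeAttr) token; // cast once
      while (ts.nextToken()) {
        seen.add(token.term() + "/" + type.getType());
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return seen;
  }
}
```

One caveat this sketch makes visible: the cast fails at runtime with a ClassCastException if the prototype does not implement an attribute the consumer asks for, so a real API would probably need some way to negotiate the required attributes up front.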
I haven't tried to implement this yet and maybe there are things I haven't thought about (like caching TokenFilters). I'd like to get some feedback about these APIs first to see if this makes sense?
Btw: if we think this (or another) approach to change these APIs makes sense, then it would be good to change it for 3.0 when we can break backwards compatibility. And then we should also rethink the Fieldable/AbstractField/Field/FieldInfos APIs for 3.0 and flexible indexing!
-Michael

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


