Thanks Michael, LUCENE-1301 was exactly what I was looking for to complete my understanding. All is clear now.
On Sun, Oct 12, 2008 at 10:42 AM, Michael Busch <[EMAIL PROTECTED]> wrote:

> Hi Shai,
>
> I'm going to shuffle your email a bit to answer in a different order...
>
> Shai Erera wrote:
> > Perhaps what you write below is linked to another thread (on flexible
> > indexing maybe?) which I'm not aware of, so I'd appreciate it if you
> > could give me a reference.
>
> Mike recently refactored the DocumentsWriter into a bunch of new classes
> (see LUCENE-1301). In fact, we now have an indexer chain, in which you can
> plug in different modules that do something (well, we still have to add an
> API to make use of the chain...)
>
> For example, there are currently two TermsHashConsumers in the default
> chain: the FreqProxTermsWriter and the TermVectorTermsWriter. Both consume
> the tokens from a TokenStream and write the different data structures.
>
> We could for example write a SpanPostingTermsHashConsumer that can not only
> write the start position of a token but also the number of covered
> positions. We could introduce a new interface:
>
> public interface SpanAttribute extends PositionIncrementAttribute {
>   public int getLength();
>   public void setLength(int length);
> }
>
> Only the SpanPostingTermsHashConsumer would need to know the SpanAttribute.
>
> >> BTW, what I didn't understand from your description is how does the
> >> indexing part know which attributes my Token supports? For example, let's
> >> say I create a Token which implements only position increments, no payload
> >> and perhaps some other custom attribute. I generate a TokenStream returning
> >> this Token type.
> >> How will Lucene's indexing mechanism know my Token supports only position
> >> increments and especially the custom attribute? What will it do with that
> >> custom attribute?
>
> The advantage is that the different consumers actually don't need to know
> the exact type of the Token. Each consumer can check via instanceof whether
> the prototype Token actually implements the interface(s) that the consumer
> needs. If not, then the consumer can just not process the tokens for that
> particular field. Alternatively we could say that the user needs to make
> sure that the appropriate prototype Token is generated for the indexing
> chain that is configured, otherwise Lucene throws an Exception.
>
> I think the main advantage here is that we can implement consumers that
> only care about particular attributes. Btw, Doug actually had a very similar
> idea for the Token class that he mentioned almost 2 years ago:
> http://www.gossamer-threads.com/lists/lucene/java-dev/43486#43486
>
> > In 3.0 you plan to move to Java 1.5, right? Couldn't you use Java
> > generics then? Have the calling application pass in the Token
> > type it wants to use and then the consumer does not need to cast
> > anything ...
>
> That only works if we keep the current design in which the consumer has to
> create the Token. But what do you do if you have more than one consumer?
> (E.g. adding a new TermsHashConsumer into the chain?)
>
> -Michael
>
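To make the instanceof-based attribute detection Michael describes more concrete, here is a minimal sketch of a consumer that only cares about span information. It assumes the SpanAttribute interface above and the prototypeToken()/nextToken() API from the original proposal quoted below; the class name SpanPostingConsumer and the write logic are invented for illustration and are not actual Lucene code.

    import java.io.IOException;

    // Sketch only: checks once, via instanceof, whether the stream's
    // prototype Token carries the (proposed) SpanAttribute, and skips
    // span data for fields whose tokens don't implement it.
    public class SpanPostingConsumer {
      public void consumeTokens(TokenStream ts) throws IOException {
        Token token = ts.prototypeToken();

        SpanAttribute spanSource =
            (token instanceof SpanAttribute) ? (SpanAttribute) token : null;

        while (ts.nextToken()) {
          char[] term = token.termBuffer();
          int termLength = token.termLength();

          if (spanSource != null) {
            int positionIncrement = spanSource.getPositionIncrement();
            int length = spanSource.getLength();
            // ... write a posting with position and covered length ...
          } else {
            // No SpanAttribute: either ignore this field or fall back to
            // plain positions, as Michael suggests.
          }
        }
      }
    }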
>> Shai
>>
>> On Sun, Oct 12, 2008 at 1:33 AM, Michael Busch <[EMAIL PROTECTED]> wrote:
>>
>> Hi,
>>
>> I've been thinking about making the TokenStream and Token APIs more
>> flexible. E.g. for fields that don't store positions, the Token
>> doesn't need to have a positionIncrement or a payload. With flexible
>> indexing, on the other hand, people might want to add custom
>> attributes to a Token that a consumer in the indexing chain could
>> then use.
>>
>> Of course it is possible to extend Token, because it is not final,
>> and add additional attributes to it. But then consumers of the
>> TokenStream must downcast every instance of the Token object when
>> they call next(Token).
>>
>> I was therefore thinking about a different TokenStream API:
>>
>> public abstract class TokenStream {
>>   public abstract boolean nextToken() throws IOException;
>>
>>   public abstract Token prototypeToken() throws IOException;
>>
>>   public void reset() throws IOException {}
>>
>>   public void close() throws IOException {}
>> }
>>
>> Furthermore, Token itself would only keep the termBuffer logic, and we
>> could introduce different interfaces, like:
>>
>> public interface PayloadAttribute {
>>   /**
>>    * Returns this Token's payload.
>>    */
>>   public Payload getPayload();
>>
>>   /**
>>    * Sets this Token's payload.
>>    */
>>   public void setPayload(Payload payload);
>> }
>>
>> public interface PositionIncrementAttribute {
>>   /** Set the position increment. This determines the position of
>>    * this token relative to the previous Token in a
>>    * {@link TokenStream}, used in phrase searching.
>>    */
>>   public void setPositionIncrement(int positionIncrement);
>>
>>   /** Returns the position increment of this Token.
>>    * @see #setPositionIncrement
>>    */
>>   public int getPositionIncrement();
>> }
>>
>> A consumer, e.g. the DocumentsWriter, does not create a Token
>> instance itself anymore, but rather calls prototypeToken(). This
>> method returns a Token subclass which implements all desired
>> *Attribute interfaces.
>>
>> If a consumer is e.g. only interested in the positionIncrement and
>> Payload, it can consume the tokens like this:
>>
>> public class Consumer {
>>   public void consumeTokens(TokenStream ts) throws IOException {
>>     Token token = ts.prototypeToken();
>>
>>     PayloadAttribute payloadSource = (PayloadAttribute) token;
>>     PositionIncrementAttribute positionSource =
>>         (PositionIncrementAttribute) token;
>>
>>     while (ts.nextToken()) {
>>       char[] term = token.termBuffer();
>>       int termLength = token.termLength();
>>       int positionIncrement = positionSource.getPositionIncrement();
>>       Payload payload = payloadSource.getPayload();
>>
>>       // do something with the term, positionIncrement and payload
>>     }
>>   }
>> }
>>
>> Casting is now only done once, after the prototype token was created.
>> Now if you want to add another consumer to the indexing chain and
>> realize that you want to add another attribute to the Token, then
>> you don't have to change this consumer. You only need to create
>> another Token subclass that implements the new attribute in addition
>> to the previous ones and can use it in the new consumer.
>>
>> I haven't tried to implement this yet, and maybe there are things I
>> haven't thought about (like caching TokenFilters). I'd like to get
>> some feedback about these APIs first to see if this makes sense.
>>
>> Btw: if we think this (or another) approach to changing these APIs
>> makes sense, then it would be good to change it for 3.0 when we can
>> break backwards compatibility. And then we should also rethink the
>> Fieldable/AbstractField/Field/FieldInfos APIs for 3.0 and flexible
>> indexing!
>>
>> -Michael
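For completeness, the producing side of the proposal quoted above could look roughly like this. MyTokenStream and its nested MyToken are hypothetical names, and the stream body is left empty; the sketch only shows how a Token subclass would opt into just the attribute interfaces it actually supports.

    import java.io.IOException;

    // Sketch only, assuming the TokenStream/Token split and the attribute
    // interfaces proposed above; MyTokenStream/MyToken are invented names.
    public class MyTokenStream extends TokenStream {

      // A Token subclass that supports payloads and position increments,
      // but no other attributes. Consumers cast the prototype once to the
      // interfaces they care about and never see the concrete type.
      public static class MyToken extends Token
          implements PayloadAttribute, PositionIncrementAttribute {

        private Payload payload;
        private int positionIncrement = 1;

        public Payload getPayload() { return payload; }
        public void setPayload(Payload payload) { this.payload = payload; }

        public int getPositionIncrement() { return positionIncrement; }
        public void setPositionIncrement(int positionIncrement) {
          this.positionIncrement = positionIncrement;
        }
      }

      private final MyToken token = new MyToken();

      public Token prototypeToken() throws IOException {
        // The same instance is reused and refilled on every nextToken() call.
        return token;
      }

      public boolean nextToken() throws IOException {
        // Fill token's termBuffer, position increment and payload here;
        // return false once the underlying input is exhausted.
        return false; // empty stream in this sketch
      }
    }

A consumer like the one in Michael's example would then cast the prototype returned by prototypeToken() to PayloadAttribute and PositionIncrementAttribute once, and read all further tokens without any per-token casting.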