Thanks Michael, LUCENE-1301 was exactly what I was looking for to complete my understanding. All is clear now.
On Sun, Oct 12, 2008 at 10:42 AM, Michael Busch <[EMAIL PROTECTED]> wrote:

> Hi Shai,
>
> I'm going to shuffle your email a bit to answer in a different order...
>
> Shai Erera wrote:
> > Perhaps what you write below is linked to another thread (on flexible
> > indexing maybe?) which I'm not aware of, so I'd appreciate it if you
> > could give me a reference.
>
> Mike recently refactored the DocumentsWriter into a bunch of new classes
> (see LUCENE-1301). In fact, we now have an indexer chain, in which you can
> plug in different modules that do something (well, we still have to add an
> API to make use of the chain...)
>
> For example, there are currently two TermsHashConsumers in the default
> chain: the FreqProxTermsWriter and the TermVectorTermsWriter. Both consume
> the tokens from a TokenStream and write the different data structures.
>
> We could for example write a SpanPostingTermsHashConsumer that can not only
> write the start position of a token but also the number of covered
> positions. We could introduce a new interface:
>
> public interface SpanAttribute extends PositionIncrementAttribute {
>   public int getLength();
>   public void setLength(int length);
> }
>
> Only the SpanPostingTermsHashConsumer would need to know the SpanAttribute.
>
> >> BTW, what I didn't understand from your description is how does the
> >> indexing part know which attributes my Token supports? For example, let's
> >> say I create a Token which implements only position increments, no payload
> >> and perhaps some other custom attribute. I generate a TokenStream returning
> >> this Token type.
> >> How will Lucene's indexing mechanism know my Token supports only position
> >> increments and especially the custom attribute? What will it do with that
> >> custom attribute?
>
> The advantage is that the different consumers actually don't need to know
> the exact type of the Token. Each consumer can check via instanceof whether
> the prototype Token actually implements the interface(s) that the consumer
> needs. If not, then the consumer can just not process the tokens for that
> particular field. Alternatively we could say that the user needs to make
> sure that the appropriate prototype Token is generated for the indexing
> chain that is configured, otherwise Lucene throws an Exception.
>
> I think the main advantage here is that we can implement consumers that
> only care about particular attributes. Btw, Doug actually had a very similar
> idea for the Token class that he mentioned almost 2 years ago:
> http://www.gossamer-threads.com/lists/lucene/java-dev/43486#43486
>
> > In 3.0 you plan to move to Java 1.5, right? Couldn't you use Java
> > generics then? Have the calling application pass in the Token
> > type it wants to use and then the consumer does not need to cast
> > anything ...
>
> That only works if we keep the current design in which the consumer has to
> create the Token. But what do you do if you have more than one consumer?
> (E.g. adding a new TermsHashConsumer into the chain?)
>
> -Michael
>
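To make the instanceof-based attribute detection Michael describes more concrete, here is a minimal sketch of a consumer that only cares about span information. It assumes the SpanAttribute interface above and the prototypeToken()/nextToken() API from the original proposal quoted below; the class name SpanPostingConsumer and the write logic are invented for illustration and are not actual Lucene code.

    import java.io.IOException;

    // Sketch only: checks once, via instanceof, whether the stream's
    // prototype Token carries the (proposed) SpanAttribute, and skips
    // span data for fields whose tokens don't implement it.
    public class SpanPostingConsumer {
      public void consumeTokens(TokenStream ts) throws IOException {
        Token token = ts.prototypeToken();

        SpanAttribute spanSource =
            (token instanceof SpanAttribute) ? (SpanAttribute) token : null;

        while (ts.nextToken()) {
          char[] term = token.termBuffer();
          int termLength = token.termLength();

          if (spanSource != null) {
            int positionIncrement = spanSource.getPositionIncrement();
            int length = spanSource.getLength();
            // ... write a posting with position and covered length ...
          } else {
            // No SpanAttribute: either ignore this field or fall back to
            // plain positions, as Michael suggests.
          }
        }
      }
    }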
>> Shai
>>
>> On Sun, Oct 12, 2008 at 1:33 AM, Michael Busch <[EMAIL PROTECTED]> wrote:
>>
>> Hi,
>>
>> I've been thinking about making the TokenStream and Token APIs more
>> flexible. E.g. for fields that don't store positions, the Token
>> doesn't need to have a positionIncrement or a payload. With flexible
>> indexing, on the other hand, people might want to add custom
>> attributes to a Token that a consumer in the indexing chain could
>> then use.
>>
>> Of course it is possible to extend Token, because it is not final,
>> and add additional attributes to it. But then consumers of the
>> TokenStream must downcast every instance of the Token object when
>> they call next(Token).
>>
>> I was therefore thinking about a different TokenStream API:
>>
>> public abstract class TokenStream {
>>   public abstract boolean nextToken() throws IOException;
>>
>>   public abstract Token prototypeToken() throws IOException;
>>
>>   public void reset() throws IOException {}
>>
>>   public void close() throws IOException {}
>> }
>>
>> Furthermore, Token itself would only keep the termBuffer logic, and we
>> could introduce different interfaces, like:
>>
>> public interface PayloadAttribute {
>>   /**
>>    * Returns this Token's payload.
>>    */
>>   public Payload getPayload();
>>
>>   /**
>>    * Sets this Token's payload.
>>    */
>>   public void setPayload(Payload payload);
>> }
>>
>> public interface PositionIncrementAttribute {
>>   /** Set the position increment. This determines the position of
>>    * this token relative to the previous Token in a
>>    * {@link TokenStream}, used in phrase searching.
>>    */
>>   public void setPositionIncrement(int positionIncrement);
>>
>>   /** Returns the position increment of this Token.
>>    * @see #setPositionIncrement
>>    */
>>   public int getPositionIncrement();
>> }
>>
>> A consumer, e.g. the DocumentsWriter, does not create a Token
>> instance itself anymore, but rather calls prototypeToken(). This
>> method returns a Token subclass which implements all desired
>> *Attribute interfaces.
>>
>> If a consumer is e.g. only interested in the positionIncrement and
>> Payload, it can consume the tokens like this:
>>
>> public class Consumer {
>>   public void consumeTokens(TokenStream ts) throws IOException {
>>     Token token = ts.prototypeToken();
>>
>>     PayloadAttribute payloadSource = (PayloadAttribute) token;
>>     PositionIncrementAttribute positionSource =
>>         (PositionIncrementAttribute) token;
>>
>>     while (ts.nextToken()) {
>>       char[] term = token.termBuffer();
>>       int termLength = token.termLength();
>>       int positionIncrement = positionSource.getPositionIncrement();
>>       Payload payload = payloadSource.getPayload();
>>
>>       // do something with the term, positionIncrement and payload
>>     }
>>   }
>> }
>>
>> Casting is now only done once, after the prototype token was created.
>> Now if you want to add another consumer to the indexing chain and
>> realize that you want to add another attribute to the Token, then
>> you don't have to change this consumer. You only need to create
>> another Token subclass that implements the new attribute in addition
>> to the previous ones and can use it in the new consumer.
>>
>> I haven't tried to implement this yet, and maybe there are things I
>> haven't thought about (like caching TokenFilters). I'd like to get
>> some feedback about these APIs first to see if this makes sense.
>>
>> Btw: if we think this (or another) approach to changing these APIs
>> makes sense, then it would be good to change it for 3.0 when we can
>> break backwards compatibility. And then we should also rethink the
>> Fieldable/AbstractField/Field/FieldInfos APIs for 3.0 and flexible
>> indexing!
>>
>> -Michael
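For completeness, the producing side of the proposal quoted above could look roughly like this. MyTokenStream and its nested MyToken are hypothetical names, and the stream body is left empty; the sketch only shows how a Token subclass would opt into just the attribute interfaces it actually supports.

    import java.io.IOException;

    // Sketch only, assuming the TokenStream/Token split and the attribute
    // interfaces proposed above; MyTokenStream/MyToken are invented names.
    public class MyTokenStream extends TokenStream {

      // A Token subclass that supports payloads and position increments,
      // but no other attributes. Consumers cast the prototype once to the
      // interfaces they care about and never see the concrete type.
      public static class MyToken extends Token
          implements PayloadAttribute, PositionIncrementAttribute {

        private Payload payload;
        private int positionIncrement = 1;

        public Payload getPayload() { return payload; }
        public void setPayload(Payload payload) { this.payload = payload; }

        public int getPositionIncrement() { return positionIncrement; }
        public void setPositionIncrement(int positionIncrement) {
          this.positionIncrement = positionIncrement;
        }
      }

      private final MyToken token = new MyToken();

      public Token prototypeToken() throws IOException {
        // The same instance is reused and refilled on every nextToken() call.
        return token;
      }

      public boolean nextToken() throws IOException {
        // Fill token's termBuffer, position increment and payload here;
        // return false once the underlying input is exhausted.
        return false; // empty stream in this sketch
      }
    }

A consumer like the one in Michael's example would then cast the prototype returned by prototypeToken() to PayloadAttribute and PositionIncrementAttribute once, and read all further tokens without any per-token casting.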