This sounds like a good change! Then we'd un-deprecate Token? We could in fact then fix all core tokenizers to use Tokens again.
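For concreteness, here is a minimal sketch of the idea as I read the proposal quoted below. It is not Lucene code: the types are simplified stand-ins (a String term and a single offset field), and the names SketchAttributeSource, addAttributeImpl and distinctInstances are hypothetical, made up for this illustration. It only shows the mechanism: an instance is registered under every interface it implements that extends Attribute, so a Token-like class leaves a single object to clone.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-ins only: none of these types are Lucene's real classes.
final class AttributeInterfaceSketch {

    /** Base interface that every attribute interface extends. */
    interface Attribute {}

    interface TermAttribute extends Attribute {
        String term();
        void setTerm(String term);
    }

    interface OffsetAttribute extends Attribute {
        int startOffset();
        void setStartOffset(int startOffset);
    }

    /** Default one-interface-per-object implementations. */
    static final class TermAttributeImpl implements TermAttribute {
        private String term = "";
        public String term() { return term; }
        public void setTerm(String term) { this.term = term; }
    }

    static final class OffsetAttributeImpl implements OffsetAttribute {
        private int startOffset;
        public int startOffset() { return startOffset; }
        public void setStartOffset(int startOffset) { this.startOffset = startOffset; }
    }

    /** A Token-like class: one object backing several attribute interfaces. */
    static final class Token implements TermAttribute, OffsetAttribute {
        private String term = "";
        private int startOffset;
        public String term() { return term; }
        public void setTerm(String term) { this.term = term; }
        public int startOffset() { return startOffset; }
        public void setStartOffset(int startOffset) { this.startOffset = startOffset; }
    }

    static final class SketchAttributeSource {
        private final Map<Class<? extends Attribute>, Attribute> attributes = new LinkedHashMap<>();

        /** Registers the instance under every Attribute sub-interface it implements. */
        void addAttributeImpl(Attribute instance) {
            for (Class<?> c = instance.getClass(); c != null; c = c.getSuperclass()) {
                for (Class<?> iface : c.getInterfaces()) {
                    register(iface, instance);
                }
            }
        }

        // Walks an interface and its parent interfaces, found via reflection.
        private void register(Class<?> iface, Attribute instance) {
            if (iface != Attribute.class && Attribute.class.isAssignableFrom(iface)) {
                attributes.putIfAbsent(iface.asSubclass(Attribute.class), instance);
            }
            for (Class<?> parent : iface.getInterfaces()) {
                register(parent, instance);
            }
        }

        /** Returns the registered instance for the interface, or null if absent. */
        <T extends Attribute> T getAttribute(Class<T> type) {
            return type.cast(attributes.get(type));
        }

        /** Number of distinct objects a captureState()-style copy would have to clone. */
        int distinctInstances() {
            Set<Attribute> seen = Collections.newSetFromMap(new IdentityHashMap<Attribute, Boolean>());
            seen.addAll(attributes.values());
            return seen.size();
        }
    }

    public static void main(String[] args) {
        SketchAttributeSource separate = new SketchAttributeSource();
        separate.addAttributeImpl(new TermAttributeImpl());
        separate.addAttributeImpl(new OffsetAttributeImpl());

        SketchAttributeSource single = new SketchAttributeSource();
        single.addAttributeImpl(new Token());
        single.getAttribute(TermAttribute.class).setTerm("lucene");

        System.out.println("separate impls to clone: " + separate.distinctInstances());
        System.out.println("single Token to clone:   " + single.distinctInstances());
    }
}
```

Consumers keep asking for getAttribute(TermAttribute.class) and never need to know whether a dedicated impl or a shared Token-like object is behind it, which is why the interfaces can stay this small.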
I think given how simple these interfaces would be, it's an OK situation to use interfaces? (I.e., we disregard the normal back-compat curse with interfaces.)

Mike

On Tue, Apr 28, 2009 at 4:22 AM, Michael Busch <busch...@gmail.com> wrote:
> Hi Eks Dev,
>
> I actually started experimenting with changing the new API slightly to
> overcome one drawback: with the variables now distributed over various
> Attribute classes (vs. being in a single class Token previously), cloning a
> "Token" (i.e. calling captureState()) is more expensive. This slows down the
> CachingTokenFilter and Tee/Sink-TokenStreams.
>
> So I was thinking about introducing interfaces for each of the Attributes.
> E.g. OffsetAttribute would then be an interface with all current methods,
> and OffsetAttributeImpl would be its implementation. The user would still
> use the API in exactly the same way as now, that is, by e.g. calling
> addAttribute(OffsetAttribute.class), and the code takes care of
> instantiating the right class. However, there would then also be an API to
> pass in an actual instance, and this API would use reflection to find all
> interfaces that the instance implements. All of those interfaces that
> extend the Attribute interface would be added to the AttributeSource map,
> with the instance as the value.
>
> Then the Token class would implement all six attribute interfaces. An expert
> user could decide to pass in a Token instance instead of calling
> addAttribute(TermAttribute.class), addAttribute(PayloadAttribute.class), ...
> Then the attribute source would only contain a single instance that needs to
> be cloned in captureState(), making cloning much faster. And a (probably
> also expert) user could even implement their own class that implements exactly
> the necessary interfaces (maybe only 3 of the 6 provided), and make cloning
> faster than it is even with the old Token-based API.
>
> And of course, in your case you could also just create a different
> implementation of such an interface, right? I think what's nice about this
> change is that it doesn't make it more complicated to use the TokenStream
> API, and the indexing pipeline still uses it the same way too, yet it's more
> extensible for expert users, and it's possible to achieve the same or even
> better cloning performance.
>
> I will open a new Jira issue for this soon. But I'd be happy to hear
> feedback about the proposed changes, and especially if you think these
> changes would help you for your use case.
>
> -Michael
>
> On 4/27/09 1:49 PM, eks dev wrote:
>
> Should I create a patch with something like this?
>
> With "Expert" javadoc and an explanation of what this is good for, it
> should be a nice addition to Attribute cases.
> Practically, it would enable specialization of "hard linked" Attributes like
> TermAttribute.
>
> The only preconditions are:
>
> - "Specialized Attribute" must extend one of the "hard linked" ones, and
>   provide class of it
> - Must implement default constructor
> - should extend by not introducing state (big majority of cases) (not to
>   break captureState())
>
> The last one could be relaxed I guess, but I am not yet 100% familiar with
> this code.
>
> Use cases for this are along the lines of my example: smaller, easier user
> code and performance (token filters mainly)
>
>
>
> ----- Original Message ----
>
>
> From: Uwe Schindler <u...@thetaphi.de>
> To: java-dev@lucene.apache.org
> Sent: Sunday, 26 April, 2009 23:03:06
> Subject: RE: new TokenStream api Question
>
> There is one problem: if you extend TermAttribute, the class is different
> (which is the key in the attributes list). So when you initialize the
> TokenStream and do a
>
> YourClass termAtt = (YourClass) addAttribute(YourClass.class)
>
> ...you create a new attribute. So one possibility would be to also specify
> the instance and save the attribute by class (as key), but with your
> instance.
> If you are the first one that creates the attribute (if it is a
> token stream and not a filter it is OK: you will be the first, adding the
> attribute in the ctor), everything is OK. Register the attribute yourself
> (maybe we should add a specialized addAttribute that can specify an instance
> as default?):
>
> YourClass termAtt = new YourClass();
> attributes.put(TermAttribute.class, termAtt);
>
> In this case, for the indexer it is a standard TermAttribute, but you can do
> more with it.
>
> Replacing TermAttribute by your own class is not possible, as the indexer will
> get a ClassCastException when using the instance retrieved with
> getAttribute(TermAttribute.class).
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
>
> -----Original Message-----
> From: eks dev [mailto:eks...@yahoo.co.uk]
> Sent: Sunday, April 26, 2009 10:39 PM
> To: java-dev@lucene.apache.org
> Subject: new TokenStream api Question
>
>
> I am just looking into new TermAttribute usage and wonder what would be
> the best way to implement PrefixFilter that would filter out some Terms
> that have some prefix,
>
> something like this, where '-' represents my prefix:
>
> public final boolean incrementToken() throws IOException {
>   // the first word we found
>   while (input.incrementToken()) {
>     int len = termAtt.termLength();
>
>     if (len > 0 && termAtt.termBuffer()[0] != '-') // only length > 0 and non LFs
>       return true;
>     // note: else we ignore it
>   }
>   // reached EOS
>   return false;
> }
>
>
> The question would be:
>
> can I extend TermAttribute and add boolean startsWith(char c);
>
> The point is speed and my code gets smaller.
> TermAttribute has one method called in termLength() and termBuffer() I do
> not understand (back compatibility, I guess)
> public int termLength() {
>   initTermBuffer(); // I'd like to avoid it...
>   return termLength;
> }
>
>
> I'd like to get rid of initTermBuffer(), the first option is to *extend*
> TermAttribute code (but fields are private, so no help there) or can I
> implement my own MyTermAttribute (will Indexer know how to deal with it?)
>
> Must I extend TermAttribute or can I add my own?
>
> thanks,
> eks
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org