Re: new TokenStream api Question

Michael McCandless Tue, 28 Apr 2009 04:32:24 -0700

This sounds like a good change!

Then we'd un-deprecate Token?  We could in fact then fix all core
tokenizers to use Tokens again.


I think given how simple these interfaces would be, it's an OK
situation to use interfaces? (I.e., we disregard the normal back-compat
curse with interfaces.)

Mike

On Tue, Apr 28, 2009 at 4:22 AM, Michael Busch <busch...@gmail.com> wrote:
> Hi Eks Dev,
>
> I actually started experimenting with changing the new API slightly to
> overcome one drawback: with the variables now distributed over various
> Attribute classes (vs. being in a single class Token previously), cloning a
> "Token" (i.e. calling captureState()) is more expensive. This slows down the
> CachingTokenFilter and Tee/Sink-TokenStreams.
>
> So I was thinking about introducing interfaces for each of the Attributes.
> E.g. OffsetAttribute would then be an interface with all current methods,
> and OffsetAttributeImpl would be its implementation. The user would still
> use the API in exactly the same way as now, that is, by e.g. calling
> addAttribute(OffsetAttribute.class), and the code takes care of
> instantiating the right class. However, there would then also be an API to
> pass in an actual instance, and this API would use reflection to find all
> interfaces that the instance implements. All of those interfaces that
> extend the Attribute interface would be added to the AttributeSource map,
> with the instance as the value.
>
> Then the Token class would implement all six attribute interfaces. An expert
> user could decide to pass in a Token instance instead of calling
> addAttribute(TermAttribute.class), addAttribute(PayloadAttribute.class), ...
> Then the attribute source would only contain a single instance that needs to
> be cloned in captureState(), making cloning much faster. And a (probably
> also expert) user could even implement an own class that implements exactly
> the necessary interfaces (maybe only 3 of the 6 provided), and make cloning
> faster than it is even with the old Token-based API.
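
A minimal sketch of the idea described above, assuming nothing about the eventual Lucene implementation: every type here (AttributeSketch, the two attribute interfaces, Token, AttributeSource, addAttributeImpl, instancesToClone) is a hypothetical stand-in, and only two of the six attribute interfaces are modelled. It shows how passing in one instance and discovering its Attribute interfaces by reflection leaves a single object to clone.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-ins for illustration only; not Lucene's actual classes.
class AttributeSketch {

    /** Marker interface that every attribute interface extends. */
    interface Attribute {}

    interface TermAttribute extends Attribute {
        String term();
        void setTerm(String term);
    }

    interface OffsetAttribute extends Attribute {
        int startOffset();
        int endOffset();
        void setOffset(int start, int end);
    }

    /** One object backing several attribute interfaces, like the proposed Token. */
    static final class Token implements TermAttribute, OffsetAttribute {
        private String term = "";
        private int start;
        private int end;

        @Override public String term() { return term; }
        @Override public void setTerm(String term) { this.term = term; }
        @Override public int startOffset() { return start; }
        @Override public int endOffset() { return end; }
        @Override public void setOffset(int start, int end) { this.start = start; this.end = end; }
    }

    static final class AttributeSource {
        private final Map<Class<? extends Attribute>, Attribute> attributes = new LinkedHashMap<>();

        /** Registers impl under every interface it implements that extends Attribute. */
        void addAttributeImpl(Attribute impl) {
            register(impl.getClass(), impl);
        }

        // Walks interfaces, super-interfaces and superclasses of the given type.
        private void register(Class<?> type, Attribute impl) {
            if (type == null) {
                return;
            }
            for (Class<?> iface : type.getInterfaces()) {
                if (iface != Attribute.class && Attribute.class.isAssignableFrom(iface)) {
                    attributes.putIfAbsent(iface.asSubclass(Attribute.class), impl);
                }
                register(iface, impl);
            }
            register(type.getSuperclass(), impl);
        }

        <T extends Attribute> T getAttribute(Class<T> key) {
            return key.cast(attributes.get(key));
        }

        /** Number of distinct objects a captureState()-style copy would have to clone. */
        int instancesToClone() {
            Set<Attribute> distinct =
                Collections.newSetFromMap(new IdentityHashMap<Attribute, Boolean>());
            distinct.addAll(attributes.values());
            return distinct.size();
        }
    }

    public static void main(String[] args) {
        AttributeSource source = new AttributeSource();
        source.addAttributeImpl(new Token());
        source.getAttribute(TermAttribute.class).setTerm("lucene");
        System.out.println(source.getAttribute(OffsetAttribute.class) instanceof Token);
        System.out.println(source.instancesToClone());
    }
}
```

Consumers still look attributes up by interface, so code that only knows TermAttribute or OffsetAttribute is unaffected by whether one object or several sit behind the map.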
>
> And of course in your case you could also just create a different
> implementation of such an interface, right? I think what's nice about this
> change is that it doesn't make it more complicated to use the TokenStream
> API, and the indexing pipeline still uses it the same way too, yet it's more
> extensible for expert users and makes it possible to achieve the same or even
> better cloning performance.
>
> I will open a new Jira issue for this soon. But I'd be happy to hear
> feedback about the proposed changes, and especially if you think these
> changes would help with your use case.
>
> -Michael
>
> On 4/27/09 1:49 PM, eks dev wrote:
>
> Should I create a patch with something like this?
>
> With "Expert" javadoc and an explanation of what it is good for, this should
> be a nice addition to the Attribute cases.
> Practically, it would enable specialization of "hard linked" Attributes like
> TermAttribute.
>
> The only preconditions are:
>
> - "Specialized Attribute" must extend one of the "hard linked" ones, and
> provide its class
> - must provide a default constructor
> - should extend without introducing new state (the big majority of cases),
> so as not to break captureState()
>
> The last one could be relaxed, I guess, but I am not yet 100% familiar with
> this code.
>
> Use cases for this are along the lines of my example, smaller, easier user
> code and performance (token filters mainly)
>
>
>
> ----- Original Message ----
>
>
> From: Uwe Schindler <u...@thetaphi.de>
> To: java-dev@lucene.apache.org
> Sent: Sunday, 26 April, 2009 23:03:06
> Subject: RE: new TokenStream api Question
>
> There is one problem: if you extend TermAttribute, the class is different
> (which is the key in the attributes list). So when you initialize the
> TokenStream and do a
>
> YourClass termAtt = (YourClass) addAttribute(YourClass.class)
>
> ...you create a new attribute. So one possibility would be to also specify
> the instance and save the attribute by class (as key), but with your
> instance. If you are the first one that creates the attribute (if it is a
> token stream and not a filter it is ok; you will be the first, as it adds the
> attribute in the ctor), everything is ok. Register the attribute yourself
> (maybe we should add a specialized addAttribute that can specify an instance
> as default?):
>
> YourClass termAtt = new YourClass();
> attributes.put(TermAttribute.class, termAtt);
>
> In this case, for the indexer it is a standard TermAttribute, but you can
> do more with it.
>
> Replacing TermAttribute by an own class is not possible, as the indexer will
> get a ClassCastException when using the instance retrieved with
> getAttribute(TermAttribute.class).
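
A small self-contained sketch of the point above, using hypothetical stand-in classes rather than Lucene's real ones (ClassKeySketch, PrefixTermAttribute, UnrelatedTermAttribute and lookupTerm are made up for illustration). It assumes a plain map keyed by class: a subclass instance stored under the TermAttribute key is usable by a consumer that casts to TermAttribute, while an unrelated class under the same key fails with a ClassCastException.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; not Lucene's actual classes.
class ClassKeySketch {

    static class TermAttribute {
        private char[] buffer = new char[0];
        private int length;

        void setTermBuffer(String term) {
            buffer = term.toCharArray();
            length = buffer.length;
        }

        char[] termBuffer() { return buffer; }
        int termLength() { return length; }
    }

    /** Adds behaviour but no new state, so copying the base fields remains sufficient. */
    static class PrefixTermAttribute extends TermAttribute {
        boolean startsWith(char c) {
            return termLength() > 0 && termBuffer()[0] == c;
        }
    }

    /** Not a subclass of TermAttribute, so it cannot stand in for one. */
    static class UnrelatedTermAttribute {}

    /** What a consumer such as an indexer would do: look up by the base key and cast. */
    static TermAttribute lookupTerm(Map<Class<?>, Object> attributes) {
        return (TermAttribute) attributes.get(TermAttribute.class);
    }

    public static void main(String[] args) {
        Map<Class<?>, Object> attributes = new HashMap<>();

        // Specialized instance registered under the base class key.
        PrefixTermAttribute termAtt = new PrefixTermAttribute();
        attributes.put(TermAttribute.class, termAtt);
        termAtt.setTermBuffer("-skip");
        System.out.println(lookupTerm(attributes) == termAtt);
        System.out.println(termAtt.startsWith('-'));

        // An unrelated class under the same key breaks the consumer's cast.
        attributes.put(TermAttribute.class, new UnrelatedTermAttribute());
        try {
            lookupTerm(attributes);
        } catch (ClassCastException e) {
            System.out.println("ClassCastException");
        }
    }
}
```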
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
>
> -----Original Message-----
> From: eks dev [mailto:eks...@yahoo.co.uk]
> Sent: Sunday, April 26, 2009 10:39 PM
> To: java-dev@lucene.apache.org
> Subject: new TokenStream api Question
>
>
> I am just looking into new TermAttribute usage and wonder what would be
> the best way to implement PrefixFilter that would filter out some Terms
> that have some prefix,
>
> something like this, where '-' represents my prefix:
>
>   public final boolean incrementToken() throws IOException {
>     // the first word we found
>     while (input.incrementToken()) {
>       int len = termAtt.termLength();
>
>       // only length > 0 and non LFs
>       if (len > 0 && termAtt.termBuffer()[0] != '-')
>         return true;
>       // note: else we ignore it
>     }
>     // reached EOS
>     return false;
>   }
>
>
>
>
>
> The question would be:
>
> can I extend TermAttribute and add boolean startsWith(char c);
>
> The point is speed and my code gets smaller.
> TermAttribute calls one method in both termLength() and termBuffer() that I
> do not understand (back compatibility, I guess):
>   public int termLength() {
>     initTermBuffer(); // I'd like to avoid it...
>     return termLength;
>   }
>
>
> I'd like to get rid of initTermBuffer(). The first option is to *extend*
> TermAttribute (but its fields are private, so no help there); or can I
> implement my own MyTermAttribute (will the Indexer know how to deal with it)?
>
> Must I extend TermAttribute, or can I add my own?
>
> thanks,
> eks
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
