[jira] Commented: (LUCENE-1826) All Tokenizer implementations should have constructors that take AttributeSource and AttributeFactory

Tim Smith (JIRA) Fri, 21 Aug 2009 06:36:41 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745969#action_12745969 ]


Tim Smith commented on LUCENE-1826:
-----------------------------------

bq. This is not possible per design. The AttributeSource cannot be changed.
I fully understand why.

But it should be fairly easy to add a reset(AttributeSource input) method to 
AttributeSource:
{code}
public void reset(AttributeSource input) {
    if (input == null) {
      throw new IllegalArgumentException("input AttributeSource must not be null");
    }
    this.attributes = input.attributes;
    this.attributeImpls = input.attributeImpls;
    this.factory = input.factory;
}
{code}

This would require making attributes and attributeImpls non-final (potentially 
reducing what the JVM can cache or optimize).

However, it would also make even more Attribute reuse possible.
For example, if this method existed, the indexer could keep a ThreadLocal of raw 
AttributeSources (one AttributeSource per thread).
Then, prior to calling TokenStream.reset(), it could call 
TokenStream.reset(AttributeSource), passing in that thread's AttributeSource.

This would result in all token streams for the same document using the same 
AttributeSource (reusing TermAttribute, etc).

This would require that no TokenStreams/Filters/Tokenizers call 
addAttribute() in the constructor (they would have to do this in reset() instead).
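
To make the idea concrete, here is a minimal, self-contained sketch of the reuse pattern described above. It does not use Lucene itself: ToyAttributeSource, ToyTermAttribute, ToyStream and SharedSourceDemo are hypothetical stand-ins invented for illustration, and the two-argument addAttribute() is a simplification, not the Lucene signature. It only shows why streams that look up their attributes in reset() end up sharing one instance once they adopt the same per-thread source.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical stand-in for a term attribute; not Lucene's TermAttribute. */
class ToyTermAttribute {
    final StringBuilder buffer = new StringBuilder();
}

/** Hypothetical stand-in for AttributeSource, reduced to a map of attribute instances. */
class ToyAttributeSource {
    // Non-final so that reset(ToyAttributeSource) can rebind it, as the proposal requires.
    private Map<Class<?>, Object> attributes = new HashMap<>();

    /** Returns the instance registered for this type, creating it on first request. */
    <T> T addAttribute(Class<T> type, Supplier<T> factory) {
        return type.cast(attributes.computeIfAbsent(type, k -> factory.get()));
    }

    /** Adopts the attribute map of another source, mirroring the proposed reset. */
    void reset(ToyAttributeSource input) {
        if (input == null) {
            throw new IllegalArgumentException("input ToyAttributeSource must not be null");
        }
        this.attributes = input.attributes;
    }
}

/** Hypothetical stream that looks up its attribute in reset(), not in the constructor. */
class ToyStream extends ToyAttributeSource {
    ToyTermAttribute term;

    @Override
    void reset(ToyAttributeSource input) {
        super.reset(input);
        // Re-acquire after rebinding; a reference cached in the constructor would be stale.
        term = addAttribute(ToyTermAttribute.class, ToyTermAttribute::new);
    }
}

class SharedSourceDemo {
    // One raw attribute source per indexing thread.
    static final ThreadLocal<ToyAttributeSource> PER_THREAD =
            ThreadLocal.withInitial(ToyAttributeSource::new);

    /** True if two streams reset against this thread's source share one term attribute. */
    static boolean streamsShareTerm() {
        ToyStream title = new ToyStream();
        ToyStream body = new ToyStream();
        title.reset(PER_THREAD.get());
        body.reset(PER_THREAD.get());
        title.term.buffer.setLength(0);
        title.term.buffer.append("shared");
        return title.term == body.term && body.term.buffer.toString().equals("shared");
    }

    public static void main(String[] args) {
        System.out.println(streamsShareTerm());
    }
}
```

If ToyStream called addAttribute() in its constructor instead, its term field would keep pointing at an attribute in the map it had before reset(), which is the reason for the restriction above.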

I totally get that this is a tall order.
If you want, I can open a separate ticket for this 
(AttributeSource.reset(AttributeSource)) for further consideration.



> All Tokenizer implementations should have constructors that take 
> AttributeSource and AttributeFactory
> -----------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1826
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1826
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>            Assignee: Michael Busch
>             Fix For: 2.9
>
>
> I have a TokenStream implementation that joins together multiple sub 
> TokenStreams (I then do additional filtering on top of this, so I can't just 
> have the indexer do the merging).
> In 2.4, this worked fine: once one sub stream was exhausted, I just started 
> using the next stream.
> In 2.9, however, this is very difficult, and requires copying Term buffers 
> for every token being aggregated.
> However, if all the sub TokenStreams share the same AttributeSource, and my 
> "concat" TokenStream shares the same AttributeSource, this goes back to being 
> very simple (and very efficient).
> So, for example, I would like to see the following constructor added to 
> StandardTokenizer:
> {code}
>   public StandardTokenizer(AttributeSource source, Reader input, boolean replaceInvalidAcronym) {
>     super(source);
>     ...
>   }
> {code}
> I would likewise want similar constructors added to all Tokenizer subclasses 
> provided by Lucene.
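
The concatenation use case in the issue description can be sketched roughly as follows. This is a toy model under stated assumptions, not Lucene code: SharedTerm, WordStream, ConcatStream and ConcatDemo are hypothetical names, and whitespace splitting stands in for real tokenization. The point is only that when every sub stream is constructed against the same attribute holder, the concatenating stream can delegate to each sub stream in turn without copying term buffers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical shared attribute holder; not Lucene's AttributeSource. */
class SharedTerm {
    final StringBuilder buffer = new StringBuilder();
}

/** Hypothetical sub stream that writes each token into the holder it was given. */
class WordStream {
    private final SharedTerm term;
    private final Iterator<String> words;

    // Mirrors the requested constructor shape: the shared source is passed in.
    WordStream(SharedTerm source, String text) {
        this.term = source;
        this.words = Arrays.asList(text.split(" ")).iterator();
    }

    boolean incrementToken() {
        if (!words.hasNext()) {
            return false;
        }
        term.buffer.setLength(0);
        term.buffer.append(words.next());
        return true;
    }
}

/** Hypothetical concatenating stream: moves to the next sub stream when one is exhausted. */
class ConcatStream {
    private final Iterator<WordStream> subs;
    private WordStream current;

    ConcatStream(List<WordStream> subStreams) {
        this.subs = subStreams.iterator();
        this.current = subs.hasNext() ? subs.next() : null;
    }

    boolean incrementToken() {
        while (current != null) {
            if (current.incrementToken()) {
                return true; // the token is already in the shared buffer, nothing to copy
            }
            current = subs.hasNext() ? subs.next() : null;
        }
        return false;
    }
}

class ConcatDemo {
    /** Runs the concatenated stream over the given texts and collects the tokens seen. */
    static List<String> collect(String... texts) {
        SharedTerm term = new SharedTerm();
        List<WordStream> subs = new ArrayList<>();
        for (String text : texts) {
            subs.add(new WordStream(term, text));
        }
        ConcatStream concat = new ConcatStream(subs);
        List<String> out = new ArrayList<>();
        while (concat.incrementToken()) {
            out.add(term.buffer.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collect("quick brown", "fox"));
    }
}
```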

