[
https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546050
]
Michael Busch commented on LUCENE-1058:
---------------------------------------
We need to change the CachingTokenFilter a bit (untested code):
{code:java}
public class CachingTokenFilter extends TokenFilter {
  // protected so subclasses like ProperNounTF below can decide what to cache
  protected List cache;
  private Iterator iterator;

  public CachingTokenFilter(TokenStream input) {
    super(input);
    this.cache = new LinkedList();
  }

  public Token next() throws IOException {
    if (iterator != null) {
      if (!iterator.hasNext()) {
        // the cache is exhausted, return null
        return null;
      }
      return (Token) iterator.next();
    } else {
      // first pass: pull from the underlying stream and record the token
      Token token = input.next();
      addTokenToCache(token);
      return token;
    }
  }

  public void reset() throws IOException {
    // switch to replaying the cached tokens
    iterator = cache.iterator();
  }

  protected void addTokenToCache(Token token) {
    if (token != null) {
      cache.add(token);
    }
  }
}
{code}
Then you can implement the ProperNounTF:
{code:java}
class ProperNounTF extends CachingTokenFilter {
  public ProperNounTF(TokenStream input) {
    super(input);
  }

  protected void addTokenToCache(Token token) {
    // cache only the proper nouns; every token still flows downstream
    if (token != null && isProperNoun(token)) {
      cache.add(token);
    }
  }

  private boolean isProperNoun(Token token) {...}
}
{code}
And then you add everything to Document:
{code:java}
Document d = new Document();
TokenStream properNounTf = new ProperNounTF(new StandardTokenizer(reader));
TokenStream stdTf = new CachingTokenFilter(new StopTokenFilter(properNounTf));
TokenStream lowerCaseTf = new LowerCaseTokenFilter(stdTf);
d.add(new Field("std", stdTf));
d.add(new Field("nouns", properNounTf));
d.add(new Field("lowerCase", lowerCaseTf));
{code}
Again, this is untested, but I believe it should work.
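To make the replay-after-reset behavior concrete, here is a simplified, Lucene-free sketch of the same pattern (plain String tokens instead of Token/TokenStream; the class names here are illustrative, not Lucene API): the first pass returns every token while recording a chosen subset, and reset() switches to replaying only what was cached.
{code:java}
import java.util.*;

/** Lucene-free sketch: first pass records tokens, reset() replays the cache. */
class CachingFilter {
  private final Iterator<String> input;
  protected final List<String> cache = new LinkedList<>();
  private Iterator<String> replay;

  CachingFilter(Iterator<String> input) { this.input = input; }

  public String next() {
    if (replay != null) {
      return replay.hasNext() ? replay.next() : null; // cache exhausted
    }
    String token = input.hasNext() ? input.next() : null;
    addTokenToCache(token);
    return token; // every token still flows downstream on the first pass
  }

  public void reset() { replay = cache.iterator(); }

  protected void addTokenToCache(String token) {
    if (token != null) cache.add(token);
  }
}

/** Mirrors ProperNounTF: only capitalized tokens are recorded for replay. */
class ProperNounFilter extends CachingFilter {
  ProperNounFilter(Iterator<String> input) { super(input); }

  protected void addTokenToCache(String token) {
    if (token != null && Character.isUpperCase(token.charAt(0))) {
      cache.add(token);
    }
  }
}
{code}
So for input "Grant wrote a Lucene patch", the first pass yields all five tokens (feeding the downstream field), and after reset() the replayed stream yields only "Grant" and "Lucene".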
> New Analyzer for buffering tokens
> ---------------------------------
>
> Key: LUCENE-1058
> URL: https://issues.apache.org/jira/browse/LUCENE-1058
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch,
> LUCENE-1058.patch, LUCENE-1058.patch
>
>
> In some cases, it would be handy to have Analyzer/Tokenizer/TokenFilters that
> could siphon off certain tokens and store them in a buffer to be used later
> in the processing pipeline.
> For example, if you want to have two fields, one lowercased and one not, but
> all the other analysis is the same, then you could save off the tokens to be
> output for a different field.
> Patch to follow, but I am still not sure about a couple of things, mostly how
> it plays with the new reuse API.
> See
> http://www.gossamer-threads.com/lists/lucene/java-dev/54397?search_string=BufferingAnalyzer;#54397