Re: Access next token in a stream

Damerian Thu, 09 Feb 2012 14:00:38 -0800

On 9/2/2012 10:51 PM, Steven A Rowe wrote:

Damerian,


The technique I mentioned would work for you with a little tweaking: when you 
see consecutive capitalized tokens, then just set the CharTermAttribute to the 
joined tokens, and clear the previous token.
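To make the joining idea concrete, here is a minimal sketch in plain Java, outside Lucene. Tokens are modeled as strings in a list, and the names NameJoiner, isCapitalized and joinCapitalizedRuns are hypothetical, invented for illustration. A real TokenFilter would presumably do the same with CharTermAttribute and saved attribute state instead of a list, so treat this as the logic only, not as Lucene code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public final class NameJoiner {
    /** Returns true if the token starts with an uppercase letter. */
    public static boolean isCapitalized(String token) {
        return !token.isEmpty() && Character.isUpperCase(token.charAt(0));
    }

    /** Joins each run of consecutive capitalized tokens into a single token. */
    public static List<String> joinCapitalizedRuns(List<String> tokens) {
        List<String> out = new ArrayList<>();
        StringBuilder pending = null; // capitalized token(s) held back, not yet emitted
        for (String token : tokens) {
            if (isCapitalized(token)) {
                if (pending == null) {
                    pending = new StringBuilder(token);   // start holding a candidate name
                } else {
                    pending.append(' ').append(token);    // extend the held name
                }
            } else {
                if (pending != null) {
                    out.add(pending.toString());          // emit the held name first
                    pending = null;
                }
                out.add(token);
            }
        }
        if (pending != null) {
            out.add(pending.toString());                  // flush at end of stream
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(joinCapitalizedRuns(
            Arrays.asList("my", "name", "is", "James", "Bond")));
    }
}
```

Because a capitalized token is held back until the following token has been seen, [James] is never emitted on its own when [Bond] follows it.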

Another idea: you could use ShingleFilter with min size = max size = 2, and then use a 
following Filter extending FilteringTokenFilter, with an accept() method that examines 
shingles and rejects ones that don't qualify, something like the following.  (Notes: this 
is untested; I assume you will use the default shingle token separator " "; and 
this filter will reject all non-shingle terms, so you won't get anything but names, even 
if you configure ShingleFilter to emit single tokens):

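public final class MyNameFilter extends FilteringTokenFilter {
   // Matches two or more whitespace-separated words that each start with an uppercase letter
   private static final Pattern NAME_PATTERN
       = Pattern.compile("\\p{Lu}\\S*(?:\\s\\p{Lu}\\S*)+");
   private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

   // Constructor assumed for Lucene 3.x (enablePositionIncrements, input);
   // other versions take different arguments.
   public MyNameFilter(TokenStream input) {
     super(true, input);
   }

   // CharTermAttribute is a CharSequence, so it can be matched directly
   @Override public boolean accept() throws IOException {
     return NAME_PATTERN.matcher(termAtt).matches();
   }
}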
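The pattern itself can be tried without Lucene. The following is a small stand-alone sketch that applies the same regular expression to the two-word shingles of "My name is James Bond", assuming the default " " shingle separator. The class name NamePatternDemo and the method isName are hypothetical names used only for this illustration:

```java
import java.util.regex.Pattern;

public final class NamePatternDemo {
    // Same pattern as in the filter above
    static final Pattern NAME_PATTERN =
        Pattern.compile("\\p{Lu}\\S*(?:\\s\\p{Lu}\\S*)+");

    /** Returns true if every word of the shingle starts with an uppercase letter. */
    public static boolean isName(CharSequence shingle) {
        return NAME_PATTERN.matcher(shingle).matches();
    }

    public static void main(String[] args) {
        // Two-word shingles of "My name is James Bond", plus one single token
        String[] shingles = {"My name", "name is", "is James", "James Bond", "Bond"};
        for (String s : shingles) {
            System.out.println(s + " -> " + isName(s));
        }
    }
}
```

Only "James Bond" is accepted here. The single token "Bond" is rejected because the pattern requires at least two words, which is why the filter drops all non-shingle terms.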

Steve

-----Original Message-----
From: Damerian [mailto:dameria...@gmail.com]
Sent: Thursday, February 09, 2012 4:15 PM
To: java-user@lucene.apache.org
Subject: Re: Access next token in a stream

On 9/2/2012 8:54 PM, Steven A Rowe wrote:

Hi Damerian,

One way to handle your scenario is to hold on to the previous token, and
only emit a token after you reach at least the second token (or at
end-of-stream).  Your incrementToken() method could look something like:

1. Get current attributes: input.incrementToken()
2. If previous token does not exist:
        2a. Store current attributes as previous token (see AttributeSource#cloneAttributes)
        2b. Get current attributes: input.incrementToken()
3. Check for and store conditions that will affect previous token's attributes
4. Store current attributes as next token (see AttributeSource#cloneAttributes)
5. Copy previous token into current attributes (see AttributeSource#copyTo);
     the target will be "this", which is an AttributeSource.
6. Make changes based on conditions found in step #3 above
7. Set previous token = next token
8. Return true

(Everywhere I say "token" I mean "instance of AttributeSource".)

The final token in the input stream will need special handling, as will
single-token input streams.
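The steps above can be sketched in plain Java with an Iterator standing in for the input TokenStream. This is only an illustration of the hold-back pattern, not Lucene code: the class LookaheadDemo and the method annotate are hypothetical, and the "condition" checked on the next token (whether it is capitalized, marked with a trailing "+") is an assumption chosen for the example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public final class LookaheadDemo {
    /** Emits each token, appending "+" when the token that follows it is capitalized. */
    public static List<String> annotate(Iterator<String> input) {
        List<String> out = new ArrayList<>();
        if (!input.hasNext()) {
            return out;                          // empty stream: nothing to emit
        }
        String previous = input.next();          // step 2: store first token, emit nothing yet
        while (input.hasNext()) {
            String current = input.next();       // step 1: advance the input
            boolean nextIsCapitalized = !current.isEmpty()
                && Character.isUpperCase(current.charAt(0));      // step 3: check condition
            out.add(nextIsCapitalized ? previous + "+" : previous); // steps 5-6: emit previous
            previous = current;                  // step 7: current becomes previous
        }
        out.add(previous);                       // final token: no successor to inspect
        return out;
    }

    public static void main(String[] args) {
        System.out.println(annotate(
            Arrays.asList("My", "name", "is", "James", "Bond").iterator()));
    }
}
```

The output always lags the input by one token, which is why the last token (and a single-token stream) is handled after the loop rather than inside it.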

Good luck,
Steve

-----Original Message-----
From: Damerian [mailto:dameria...@gmail.com]
Sent: Thursday, February 09, 2012 2:19 PM
To: java-user@lucene.apache.org
Subject: Access next token in a stream

Hello, I want to implement my own custom filter. My question is quite simple,
but I cannot find a solution to it no matter how I try:

How can I access the TermAttribute of the token after the one I
currently have in my stream?

For example, in the phrase "My name is James Bond", if I am at
the token [My], I would like to be able to check the TermAttribute of
the following token [name] and fix my position increment accordingly.

Thank you in advance!

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Hi Steve,
Thank you for your immediate reply. I will try your solution, but I feel
that it does not solve my case.
What I am trying to make is a filter that joins together two
terms/tokens that start with a capital letter (it tries to find all
the names/surnames and make each one a single token). So in my aforementioned
example, when I examine [James], even if I store the TermAttribute in a
temporary token, how can I check the next one, [Bond], and join them
without actually emitting [James] on its own (and therefore creating a
term for it in my inverted index)?
Thank you again for your insight; I would really appreciate any other
views on the matter.

Regards, Damerian



I think my solution is almost complete now. Only one question: you mentioned
"clear the previous token". Is there a built-in method for doing that? In the
beginning I thought that if I put my new token at the same position increment
it would "overwrite" the previous one, but what I succeeded in was simply to
inject code. My method that does that so far is this:


    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }

        // Case where the previous token was NOT starting with a capital
        // letter followed by small letters
        if (previousTokenCanditateMainName == false) {
            if (CheckIfMainName(termAtt.term())) {
                previousTokenCanditateMainName = true;
                tempString = this.termAtt.term();        /* This is the token  */
                //myToken.offsetAtt = this.offsetAtt;    /* I need to "delete" */
                tempStartOffset = this.offsetAtt.startOffset();
                tempEndOffset = this.offsetAtt.endOffset();
                //this.nextInputStreamToken.clearAttributes();
                return true;
            } else {
                return true;
            }
        }
        // Case where the previous token WAS a proper name (starting with a
        // capital and continuing with small letters)
        else {
            if (CheckIfMainName(termAtt.term())) {
                previousTokenCanditateMainName = false;
                posIncrAtt.setPositionIncrement(0);
                String myString = tempString + TOKEN_SEPARATOR + this.termAtt.term();
                //termAtt.setTermBuffer(myString, tempStartOffset, myString.length());
                termAtt.setTermBuffer(tempString + TOKEN_SEPARATOR + this.termAtt.term());
                offsetAtt.setOffset(tempStartOffset, this.offsetAtt.endOffset());
                return true;
            } else {
                previousTokenCanditateMainName = false;
                return true;
            }
        }
    }

The CheckIfMainName() method is a simple custom method that decides
whether the token fulfills the criteria.

Once again, thank you very much for your help and the time that you
spend helping me.


regards
/Damerian

