On 9/2/2012 10:51 PM, Steven A Rowe wrote:
Damerian,

The technique I mentioned would work for you with a little tweaking: when you 
see consecutive capitalized tokens, then just set the CharTermAttribute to the 
joined tokens, and clear the previous token.

Another idea: you could use ShingleFilter with min size = max size = 2, and then use a 
following Filter extending FilteringTokenFilter, with an accept() method that examines 
shingles and rejects ones that don't qualify, something like the following.  (Notes: this 
is untested; I assume you will use the default shingle token separator " "; and 
this filter will reject all non-shingle terms, so you won't get anything but names, even 
if you configure ShingleFilter to emit single tokens):

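public final class MyNameFilter extends FilteringTokenFilter {
   // One capitalized word followed by one or more whitespace-separated capitalized words.
   private static final Pattern NAME_PATTERN
       = Pattern.compile("\\p{Lu}\\S*(?:\\s\\p{Lu}\\S*)+");
   private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
   // A constructor that passes the input TokenStream to super(...) is also required;
   // its exact signature depends on the Lucene version.
   @Override public boolean accept() throws IOException {
     // CharTermAttribute implements CharSequence, so it can be matched directly.
     return NAME_PATTERN.matcher(termAtt).matches();
   }
}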
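
To see what the shingle-plus-pattern idea keeps and rejects, here is a rough plain-Java sketch that does not touch Lucene at all. The class NameShingleDemo and its methods bigrams() and names() are hypothetical names made up for this illustration; the shingling only approximates what ShingleFilter with min size = max size = 2 and the default " " separator would emit, and whitespace splitting stands in for a real tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java illustration only: no Lucene classes are used here.
final class NameShingleDemo {
    // Same pattern as in MyNameFilter above.
    static final Pattern NAME_PATTERN =
        Pattern.compile("\\p{Lu}\\S*(?:\\s\\p{Lu}\\S*)+");

    // Builds two-word shingles joined by a single space (an approximation
    // of ShingleFilter output with min size = max size = 2).
    static List<String> bigrams(String[] tokens) {
        List<String> shingles = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            shingles.add(tokens[i] + " " + tokens[i + 1]);
        }
        return shingles;
    }

    // Keeps only the shingles that the pattern accepts.
    static List<String> names(String text) {
        List<String> kept = new ArrayList<>();
        for (String shingle : bigrams(text.split("\\s+"))) {
            if (NAME_PATTERN.matcher(shingle).matches()) {
                kept.add(shingle);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(names("My name is James Bond"));
    }
}
```

Note that a run of three capitalized words, such as "Sir James Bond", would yield two overlapping shingles with this approach, so longer names would need a larger max shingle size or extra handling.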

Steve

-----Original Message-----
From: Damerian [mailto:dameria...@gmail.com]
Sent: Thursday, February 09, 2012 4:15 PM
To: java-user@lucene.apache.org
Subject: Re: Access next token in a stream

On 9/2/2012 8:54 PM, Steven A Rowe wrote:
Hi Damerian,

One way to handle your scenario is to hold on to the previous token, and
only emit a token after you reach at least the second token (or at end-of-
stream).  Your incrementToken() method could look something like:
1. Get current attributes: input.incrementToken()
2. If previous token does not exist:
        2a. Store current attributes as previous token (see AttributeSource#cloneAttributes)
        2b. Get current attributes: input.incrementToken()
3. Check for & store conditions that will affect previous token's attributes
4. Store current attributes as next token (see AttributeSource#cloneAttributes)
5. Copy previous token into current attributes (see AttributeSource#copyTo);
     the target will be "this", which is an AttributeSource.
6. Make changes based on conditions found in step #3 above
7. Set previous token = next token
8. Return true

(Everywhere I say "token" I mean "instance of AttributeSource".)

The final token in the input stream will need special handling, as will
single-token input streams.
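
The hold-one-token-back idea can be sketched in plain Java, with strings standing in for attribute states. This is only an illustration under assumptions: LookaheadJoiner, isCapitalized() and joinAll() are hypothetical names, the capitalization test is a placeholder criterion, and no Lucene API is used. In a real TokenFilter the buffering would be done on attribute state instead (for example with AttributeSource#captureState and AttributeSource#restoreState, or the cloneAttributes/copyTo calls listed above).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Plain-Java illustration only: strings stand in for Lucene attribute states.
final class LookaheadJoiner {
    private final Iterator<String> input;
    private String pending; // token already read from input but not yet emitted

    LookaheadJoiner(Iterator<String> input) {
        this.input = input;
    }

    // Hypothetical criterion: the token starts with an uppercase letter.
    static boolean isCapitalized(String token) {
        return !token.isEmpty() && Character.isUpperCase(token.charAt(0));
    }

    // Returns the next output token, or null at end of stream.
    String next() {
        String current = pending != null ? pending
                       : input.hasNext() ? input.next() : null;
        pending = null;
        if (current == null) {
            return null; // end of stream, nothing buffered
        }
        if (isCapitalized(current) && input.hasNext()) {
            String following = input.next(); // look one token ahead
            if (isCapitalized(following)) {
                return current + " " + following; // emit only the joined token
            }
            pending = following; // not a name: keep it for the next call
        }
        return current; // also covers the final token and single-token streams
    }

    // Drains the joiner into a list, for demonstration.
    static List<String> joinAll(List<String> tokens) {
        LookaheadJoiner joiner = new LookaheadJoiner(tokens.iterator());
        List<String> out = new ArrayList<>();
        for (String t = joiner.next(); t != null; t = joiner.next()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(joinAll(Arrays.asList("My", "name", "is", "James", "Bond")));
    }
}
```

Because the first capitalized token is only emitted after the following token has been inspected, it never appears on its own when a join happens.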
Good luck,
Steve

-----Original Message-----
From: Damerian [mailto:dameria...@gmail.com]
Sent: Thursday, February 09, 2012 2:19 PM
To: java-user@lucene.apache.org
Subject: Access next token in a stream

Hello, I want to implement my own custom filter. My question is quite simple,
but I cannot find a solution to it no matter how I try:

How can I access the TermAttribute of the token that follows the one I
currently have in my stream?

For example, in the phrase "My name is James Bond", if, say, I am at
the token [My], I would like to be able to check the TermAttribute of
the following token [name] and fix my position increment accordingly.

Thank you in advance!

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
Hi Steve,
Thank you for your immediate reply. I will try your solution, but I feel
that it does not solve my case.
What I am trying to make is a filter that joins together two consecutive
terms/tokens that start with a capital letter (it is trying to find all
the names/surnames and make each pair one token). So in my aforementioned
example, when I examine [James], even if I store the TermAttribute in a
temporary token, how can I check the next one, [Bond], and join them
without actually emitting a token (and therefore creating a term in my
inverted index) that has [James] on its own?
Thank you again for your insight; I would really appreciate any other
views on the matter.

Regards, Damerian


I think my solution is almost complete now. Only one question: you mentioned
"clear the previous token". Is there a built-in method for doing that? In the
beginning I thought that if I put my new token at the same position (position
increment 0) it would "overwrite" the previous one, but all I achieved was to
inject an additional token. My method so far is this:

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        // Case where the previous token was NOT a name candidate
        // (starting with a capital letter, the rest lowercase)
        if (previousTokenCanditateMainName == false) {
            if (CheckIfMainName(termAtt.term())) {
                previousTokenCanditateMainName = true;
                // This is the token I need to "delete"
                tempString = this.termAtt.term();
                // myToken.offsetAtt=this.offsetAtt;
                tempStartOffset = this.offsetAtt.startOffset();
                tempEndOffset = this.offsetAtt.endOffset();
                //this.nextInputStreamToken.clearAttributes();

                return true;
            } else {
                return true;
            }
        }
        // Case where the previous token WAS a proper name
        // (starting with a capital letter and continuing with lowercase letters)
        else {
            if (CheckIfMainName(termAtt.term())) {
                previousTokenCanditateMainName = false;
                posIncrAtt.setPositionIncrement(0);
                String myString = tempString + TOKEN_SEPARATOR + this.termAtt.term();

                //termAtt.setTermBuffer(myString, tempStartOffset, myString.length());
                termAtt.setTermBuffer(myString);
                offsetAtt.setOffset(tempStartOffset, this.offsetAtt.endOffset());
                return true;
            } else {
                previousTokenCanditateMainName = false;
                return true;
            }
        }
    }

The CheckIfMainName() method is a simple custom-made method that decides whether the token fulfills the criteria.
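
For illustration only, such a check could look roughly like the sketch below. It assumes the criterion described in the code comments above (one capital letter followed by lowercase letters); the class name MainNameCheck and the exact pattern are hypothetical and not the actual method used in the filter.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the custom name check; the real criteria may differ.
final class MainNameCheck {
    // Assumed criterion: one uppercase letter followed by one or more lowercase letters.
    private static final Pattern MAIN_NAME = Pattern.compile("\\p{Lu}\\p{Ll}+");

    static boolean checkIfMainName(String term) {
        return term != null && MAIN_NAME.matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(checkIfMainName("James"));
        System.out.println(checkIfMainName("name"));
    }
}
```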

Once again, thank you very much for your help and for the time you have spent helping me.

regards
/Damerian

