Damerian,

When I said "clear the previous token", I was referring to the pseudo-code I gave in my first response to you. There is no built-in method to do that. If you want to conditionally output tokens, you should store AttributeSource clones, as in my pseudo-code.
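
A minimal sketch of that buffering idea, written in plain Java rather than against the Lucene API so that it runs on its own. Everything named here (NameJoiner, join, isCapitalized) is hypothetical and only illustrates the control flow: in a real filter the held-back "previous" value would be an AttributeSource clone (see AttributeSource#cloneAttributes and AttributeSource#copyTo) instead of a String, and the results would be emitted one at a time from incrementToken(). The capitalization test is an assumption standing in for whatever criteria you actually use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java stand-in for the "hold on to the previous token" logic; not Lucene API. */
public class NameJoiner {

    /** Assumed criterion: a term qualifies if its first character is uppercase. */
    static boolean isCapitalized(String term) {
        return !term.isEmpty() && Character.isUpperCase(term.charAt(0));
    }

    /**
     * Emits each term one step late, so that a capitalized term can be merged
     * with a capitalized successor instead of being emitted on its own.
     * Only pairs are joined; a third consecutive capitalized term starts over.
     */
    static List<String> join(List<String> terms, String separator) {
        List<String> out = new ArrayList<>();
        String previous = null;                  // the held-back token
        for (String current : terms) {
            if (previous == null) {
                previous = current;              // nothing to emit yet
            } else if (isCapitalized(previous) && isCapitalized(current)) {
                out.add(previous + separator + current);
                previous = null;                 // "clear the previous token"
            } else {
                out.add(previous);               // emit the held-back token unchanged
                previous = current;
            }
        }
        if (previous != null) {
            out.add(previous);                   // end of stream: flush the last token
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("My", "name", "is", "James", "Bond");
        System.out.println(join(terms, " "));    // [My, name, is, James Bond]
    }
}
```

The point of the sketch is that [James] is never emitted by itself: nothing is output for a token until the following token (or end of stream) has been seen.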
Steve

> -----Original Message-----
> From: Damerian [mailto:dameria...@gmail.com]
> Sent: Thursday, February 09, 2012 5:00 PM
> To: java-user@lucene.apache.org
> Subject: Re: Access next token in a stream
>
> On 9/2/2012 10:51 PM, Steven A Rowe wrote:
> > Damerian,
> >
> > The technique I mentioned would work for you with a little tweaking:
> > when you see consecutive capitalized tokens, then just set the
> > CharTermAttribute to the joined tokens, and clear the previous token.
> >
> > Another idea: you could use ShingleFilter with min size = max size = 2,
> > and then use a following Filter extending FilteringTokenFilter, with an
> > accept() method that examines shingles and rejects ones that don't
> > qualify, something like the following. (Notes: this is untested; I
> > assume you will use the default shingle token separator " "; and this
> > filter will reject all non-shingle terms, so you won't get anything but
> > names, even if you configure ShingleFilter to emit single tokens):
> >
> > public final class MyNameFilter extends FilteringTokenFilter {
> >   private static final Pattern NAME_PATTERN
> >       = Pattern.compile("\\p{Lu}\\S*(?:\\s\\p{Lu}\\S*)+");
> >   private final CharTermAttribute termAtt =
> >       addAttribute(CharTermAttribute.class);
> >   @Override public boolean accept() throws IOException {
> >     return NAME_PATTERN.matcher(termAtt).matches();
> >   }
> > }
> >
> > Steve
> >
> >> -----Original Message-----
> >> From: Damerian [mailto:dameria...@gmail.com]
> >> Sent: Thursday, February 09, 2012 4:15 PM
> >> To: java-user@lucene.apache.org
> >> Subject: Re: Access next token in a stream
> >>
> >> On 9/2/2012 8:54 PM, Steven A Rowe wrote:
> >>> Hi Damerian,
> >>>
> >>> One way to handle your scenario is to hold on to the previous token,
> >>> and only emit a token after you reach at least the second token (or
> >>> at end-of-stream). Your incrementToken() method could look something
> >>> like:
> >>>
> >>> 1. Get current attributes: input.incrementToken()
> >>> 2. If previous token does not exist:
> >>>    2a. Store current attributes as previous token
> >>>        (see AttributeSource#cloneAttributes)
> >>>    2b. Get current attributes: input.incrementToken()
> >>> 3. Check for & store conditions that will affect previous token's
> >>>    attributes
> >>> 4. Store current attributes as next token
> >>>    (see AttributeSource#cloneAttributes)
> >>> 5. Copy previous token into current attributes
> >>>    (see AttributeSource#copyTo); the target will be "this", which is
> >>>    an AttributeSource.
> >>> 6. Make changes based on conditions found in step #3 above
> >>> 7. set previous token = next token
> >>> 8. return true
> >>>
> >>> (Everywhere I say "token" I mean "instance of AttributeSource".)
> >>>
> >>> The final token in the input stream will need special handling, as
> >>> will single-token input streams.
> >>>
> >>> Good luck,
> >>> Steve
> >>>
> >>>> -----Original Message-----
> >>>> From: Damerian [mailto:dameria...@gmail.com]
> >>>> Sent: Thursday, February 09, 2012 2:19 PM
> >>>> To: java-user@lucene.apache.org
> >>>> Subject: Access next token in a stream
> >>>>
> >>>> Hello, I want to implement my custom filter. My question is quite
> >>>> simple, but I cannot find a solution to it no matter how I try:
> >>>>
> >>>> How can I access the TermAttribute of the token that follows the
> >>>> one I currently have in my stream?
> >>>>
> >>>> For example, in the phrase "My name is James Bond", if let's say I
> >>>> am at the token [My], I would like to be able to check the
> >>>> TermAttribute of the following token [name] and fix my position
> >>>> increment accordingly.
> >>>>
> >>>> Thank you in advance!
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >> Hi Steve,
> >> Thank you for your immediate reply. I will try your solution, but I
> >> feel that it does not solve my case.
> >> What I am trying to make is a filter that joins together two
> >> terms/tokens that start with a capital letter (it is trying to find
> >> all the names/surnames and make them one token). So in my
> >> aforementioned example, when I examine [James], even if I store the
> >> TermAttribute in a temporary token, how can I check the next one,
> >> [Bond], to join them without actually emitting (and therefore
> >> creating a term in my inverted index for) [James] on its own?
> >> Thank you again for your insight, and I would really appreciate any
> >> other views on the matter.
> >>
> >> Regards,
> >> Damerian
>
> I think my solution is almost complete now. Only one question: you
> mentioned "clear the previous token". Is there a built-in method for
> doing that? In the beginning I thought that if I put my new token at the
> same position increment it would "overwrite" the previous one, but all I
> succeeded in doing was to simply inject code. My method that does that
> so far is this:
>
> @Override
> public boolean incrementToken() throws IOException {
>     if (!input.incrementToken()) {
>         return false;
>     }
>     // Case where the previous token WAS NOT starting with a capital
>     // letter and the rest small
>     if (previousTokenCanditateMainName == false) {
>         if (CheckIfMainName(termAtt.term())) {
>             previousTokenCanditateMainName = true;
>             tempString = this.termAtt.term();      /*This is the*/
>             // myToken.offsetAtt=this.offsetAtt;   /*Token I need to "delete"*/
>             tempStartOffset = this.offsetAtt.startOffset();
>             tempEndOffset = this.offsetAtt.endOffset();
>             //this.nextInputStreamToken.clearAttributes();
>
>             return true;
>         } else {
>             return true;
>         }
>     } // Case where the previous token WAS a proper name (starting with
>       // a capital and continuing with small letters)
>     else {
>         if (CheckIfMainName(termAtt.term())) {
>             previousTokenCanditateMainName = false;
>             posIncrAtt.setPositionIncrement(0);
>             String myString = tempString + TOKEN_SEPARATOR
>                     + this.termAtt.term();
>
>             //termAtt.setTermBuffer(myString, tempStartOffset,
>             //        myString.length());
>             termAtt.setTermBuffer(tempString + TOKEN_SEPARATOR
>                     + this.termAtt.term());
>             offsetAtt.setOffset(tempStartOffset,
>                     this.offsetAtt.endOffset());
>             return true;
>         } else {
>             previousTokenCanditateMainName = false;
>             return true;
>         }
>     }
>
> }
>
> The CheckIfMainName() method is a simple custom-made method to decide
> whether the token fulfills the criteria.
>
> Once again, thank you very much for your help and the time that you
> spent helping me.
>
> Regards,
> /Damerian
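
As a side note on the ShingleFilter idea quoted above, the regular expression from MyNameFilter can be tried out on its own with java.util.regex, without any Lucene classes. The sketch below is only a check of the pattern against hand-written two-word strings; NamePatternCheck and isName are hypothetical names, and the sample inputs assume the default shingle separator " ".

```java
import java.util.regex.Pattern;

/** Stand-alone check of the name pattern from the quoted MyNameFilter; not Lucene API. */
public class NamePatternCheck {

    // Same expression as in the quoted filter: two or more whitespace-separated
    // words, each starting with an uppercase letter (\p{Lu}).
    private static final Pattern NAME_PATTERN =
            Pattern.compile("\\p{Lu}\\S*(?:\\s\\p{Lu}\\S*)+");

    /** Returns true if the whole candidate string matches the name pattern. */
    static boolean isName(CharSequence candidate) {
        return NAME_PATTERN.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        // Size-2 shingles of "My name is James Bond", plus one single term.
        String[] candidates = {"My name", "name is", "is James", "James Bond", "James"};
        for (String candidate : candidates) {
            System.out.println(candidate + " -> " + isName(candidate));
        }
    }
}
```

Of these inputs only "James Bond" is accepted; the single term "James" is rejected because the pattern requires at least two capitalized words, which is why such a filter would drop all non-shingle terms.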