RE: Access next token in a stream

Steven A Rowe Thu, 09 Feb 2012 14:13:03 -0800

Damerian,

When I said "clear the previous token", I was referring to the pseudo-code I 
gave in my first response to you.  There is no built-in method to do that.  If 
you want to conditionally output tokens, you should store AttributeSource 
clones, as in my pseudo-code.
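
As a rough illustration of that hold-back idea, here is a minimal plain-Java
model of it. It deliberately uses no Lucene classes: tokens are plain Strings,
whereas in a real TokenFilter the held-back value would be an AttributeSource
clone. The class name NameJoiner and the helper isNameCandidate() (a stand-in
for your CheckIfMainName()) are hypothetical, and the name test is only an
assumption about your criteria.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Plain-Java model of one-token lookahead; no Lucene classes are used. */
class NameJoiner {
    // Hypothetical stand-in for CheckIfMainName(): one uppercase letter
    // followed by one or more lowercase letters.
    static boolean isNameCandidate(String term) {
        return term.matches("\\p{Lu}\\p{Ll}+");
    }

    /** Joins each pair of consecutive name candidates into a single token. */
    static List<String> join(Iterator<String> input) {
        List<String> out = new ArrayList<>();
        String previous = null;                 // held-back token, not yet emitted
        while (input.hasNext()) {
            String current = input.next();
            if (previous == null) {
                previous = current;             // nothing to compare against yet
            } else if (isNameCandidate(previous) && isNameCandidate(current)) {
                out.add(previous + " " + current);  // emit only the joined token
                previous = null;                // both halves are consumed
            } else {
                out.add(previous);              // previous stands on its own
                previous = current;
            }
        }
        if (previous != null) {
            out.add(previous);                  // end of stream: flush held token
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("My", "name", "is", "James", "Bond");
        System.out.println(join(tokens.iterator()));  // prints [My, name, is, James Bond]
    }
}
```

[James] is never emitted on its own here because it is held until the token
after it has been examined; that is the effect "clearing the previous token"
is meant to have.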


Steve

> -----Original Message-----
> From: Damerian [mailto:dameria...@gmail.com]
> Sent: Thursday, February 09, 2012 5:00 PM
> To: java-user@lucene.apache.org
> Subject: Re: Access next token in a stream
> 
> On 9/2/2012 10:51 PM, Steven A Rowe wrote:
> > Damerian,
> >
> > The technique I mentioned would work for you with a little tweaking:
> > when you see consecutive capitalized tokens, then just set the
> > CharTermAttribute to the joined tokens, and clear the previous token.
> >
> > Another idea: you could use ShingleFilter with min size = max size = 2,
> > and then use a following Filter extending FilteringTokenFilter, with an
> > accept() method that examines shingles and rejects ones that don't
> > qualify, something like the following.  (Notes: this is untested; I assume
> > you will use the default shingle token separator " "; and this filter will
> > reject all non-shingle terms, so you won't get anything but names, even if
> > you configure ShingleFilter to emit single tokens):
> >
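> > // Imports and constructor assume Lucene 3.x; later versions differ.
> > import java.io.IOException;
> > import java.util.regex.Pattern;
> > import org.apache.lucene.analysis.FilteringTokenFilter;
> > import org.apache.lucene.analysis.TokenStream;
> > import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> > public final class MyNameFilter extends FilteringTokenFilter {
> >    private static final Pattern NAME_PATTERN
> >        = Pattern.compile("\\p{Lu}\\S*(?:\\s\\p{Lu}\\S*)+");
> >    private final CharTermAttribute termAtt
> >        = addAttribute(CharTermAttribute.class);
> >    public MyNameFilter(boolean enablePositionIncrements, TokenStream input) {
> >      super(enablePositionIncrements, input);
> >    }
> >    @Override public boolean accept() throws IOException {
> >      return NAME_PATTERN.matcher(termAtt).matches();
> >    }
> > }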
> >
> > Steve
> >
> >> -----Original Message-----
> >> From: Damerian [mailto:dameria...@gmail.com]
> >> Sent: Thursday, February 09, 2012 4:15 PM
> >> To: java-user@lucene.apache.org
> >> Subject: Re: Access next token in a stream
> >>
> >> On 9/2/2012 8:54 PM, Steven A Rowe wrote:
> >>> Hi Damerian,
> >>>
> >>> One way to handle your scenario is to hold on to the previous token,
> >>> and only emit a token after you reach at least the second token (or at
> >>> end-of-stream).  Your incrementToken() method could look something like:
> >>> 1. Get current attributes: input.incrementToken()
> >>> 2. If previous token does not exist:
> >>>    2a. Store current attributes as previous token
> >>>        (see AttributeSource#cloneAttributes)
> >>>    2b. Get current attributes: input.incrementToken()
> >>> 3. Check for & store conditions that will affect previous token's
> >>>    attributes
> >>> 4. Store current attributes as next token
> >>>    (see AttributeSource#cloneAttributes)
> >>> 5. Copy previous token into current attributes
> >>>    (see AttributeSource#copyTo);
> >>>    the target will be "this", which is an AttributeSource.
> >>> 6. Make changes based on conditions found in step #3 above
> >>> 7. Set previous token = next token
> >>> 8. Return true
> >>>
> >>> (Everywhere I say "token" I mean "instance of AttributeSource".)
> >>>
> >>> The final token in the input stream will need special handling, as
> >>> will single-token input streams.
> >>>
> >>> Good luck,
> >>> Steve
> >>>
> >>>> -----Original Message-----
> >>>> From: Damerian [mailto:dameria...@gmail.com]
> >>>> Sent: Thursday, February 09, 2012 2:19 PM
> >>>> To: java-user@lucene.apache.org
> >>>> Subject: Access next token in a stream
> >>>>
> >>>> Hello, I want to implement my custom filter. My question is quite
> >>>> simple, but I cannot find a solution to it no matter how I try:
> >>>>
> >>>> How can I access the TermAttribute of the token that follows the one
> >>>> I currently have in my stream?
> >>>>
> >>>> For example, in the phrase "My name is James Bond", if I am at the
> >>>> token [My], I would like to be able to check the TermAttribute of
> >>>> the following token [name] and fix my position increment accordingly.
> >>>>
> >>>> Thank you in advance!
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >> Hi Steve,
> >> Thank you for your immediate reply. I will try your solution, but I feel
> >> that it does not solve my case.
> >> What I am trying to make is a filter that joins together two
> >> terms/tokens that start with a capital letter (it is trying to find all
> >> the names/surnames and make them one token). So in my aforementioned
> >> example, when I examine [James], even if I store the TermAttribute in a
> >> temporary token, how can I check the next one, [Bond], and join them
> >> without actually emitting a token (and therefore creating a term in my
> >> inverted index) that has [James] on its own?
> >> Thank you again for your insight; I would really appreciate any other
> >> views on the matter.
> >>
> >> Regards, Damerian
> >>
> >>
> I think my solution is almost complete now, with only one question: you
> mentioned "clear the previous token". Is there a built-in method for doing
> that? In the beginning I thought that if I gave my new token a position
> increment of 0 it would "overwrite" the previous one, but all I managed
> to do was inject the joined token alongside it. My method so far is this:
> 
> @Override
>      public boolean incrementToken() throws IOException {
>          if (!input.incrementToken()) {
>              return false;
>          }
>          // Case where the previous token did NOT start with a capital
>          // letter followed by lowercase letters
>          if (previousTokenCanditateMainName == false) {
>              if (CheckIfMainName(termAtt.term())) {
>                  previousTokenCanditateMainName = true;
>                  // This is the token I need to "delete"
>                  tempString = this.termAtt.term();
>                  // myToken.offsetAtt = this.offsetAtt;
>                  tempStartOffset = this.offsetAtt.startOffset();
>                  tempEndOffset = this.offsetAtt.endOffset();
>                  // this.nextInputStreamToken.clearAttributes();
>
>                  return true;
>              } else {
>                  return true;
>              }
>          } // Case where the previous token WAS a proper name (starting
>            // with a capital and continuing with lowercase letters)
>          else {
>              if (CheckIfMainName(termAtt.term())) {
>                  previousTokenCanditateMainName = false;
>                  posIncrAtt.setPositionIncrement(0);
>                  String myString = tempString + TOKEN_SEPARATOR
>                          + this.termAtt.term();
>
>                  // termAtt.setTermBuffer(myString, tempStartOffset,
>                  //         myString.length());
>                  termAtt.setTermBuffer(tempString + TOKEN_SEPARATOR
>                          + this.termAtt.term());
>                  offsetAtt.setOffset(tempStartOffset,
>                          this.offsetAtt.endOffset());
>                  return true;
>              } else {
>                  previousTokenCanditateMainName = false;
>                  return true;
>              }
>          }
>
>      }
> 
> The CheckIfMainName() method is a simple custom method that decides
> whether the token fulfills the criteria.
> 
> Once again, thank you very much for your help and for the time you
> spent helping me.
> 
> regards
> /Damerian
> 
