Damerian,
The technique I mentioned would work for you with a little tweaking: when you
see consecutive capitalized tokens, set the CharTermAttribute to the joined
tokens and clear the previous (held-back) token instead of emitting it.
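To illustrate just the joining logic, here is a minimal plain-Java sketch that
works on a list of strings rather than a real TokenStream. The class and
method names (CapitalizedJoiner, join) are made up for this example, and it
leaves out everything Lucene-specific (attributes, position increments,
offsets), so treat it as an outline of the idea rather than a working filter:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: shows the "hold back, then join" idea.
public final class CapitalizedJoiner {

  /** Joins each run of consecutive capitalized tokens into a single token. */
  public static List<String> join(List<String> tokens) {
    List<String> out = new ArrayList<String>();
    StringBuilder pending = null; // held-back run of capitalized tokens
    for (String token : tokens) {
      boolean capitalized =
          !token.isEmpty() && Character.isUpperCase(token.codePointAt(0));
      if (capitalized) {
        // Do not emit yet: the next token may belong to the same name.
        if (pending == null) {
          pending = new StringBuilder(token);
        } else {
          pending.append(' ').append(token);
        }
      } else {
        // A non-capitalized token ends the run, so flush it first.
        if (pending != null) {
          out.add(pending.toString());
          pending = null;
        }
        out.add(token);
      }
    }
    // End of stream: flush whatever is still held back.
    if (pending != null) {
      out.add(pending.toString());
    }
    return out;
  }
}
```

For the tokens [My] [name] [is] [James] [Bond] this yields [My] [name] [is]
[James Bond]; [James] is never emitted on its own. A lone capitalized token
such as [My] passes through unchanged here; whether to keep or drop those is a
choice you would make in the real filter.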
Another idea: you could use ShingleFilter with min size = max size = 2, and
then use a following Filter extending FilteringTokenFilter, with an accept()
method that examines shingles and rejects ones that don't qualify, something
like the following. (Notes: this is untested; I assume you will use the
default shingle token separator " "; and this filter will reject all
non-shingle terms, so you won't get anything but names, even if you configure
ShingleFilter to emit single tokens):
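import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.lucene.analysis.FilteringTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class MyNameFilter extends FilteringTokenFilter {
  private static final Pattern NAME_PATTERN
    = Pattern.compile("\\p{Lu}\\S*(?:\\s\\p{Lu}\\S*)+");
  private final CharTermAttribute termAtt =
    addAttribute(CharTermAttribute.class);
  // Assumed Lucene 3.x constructor form; other versions take different args.
  public MyNameFilter(TokenStream input) { super(true, input); }
  @Override public boolean accept() throws IOException {
    // Keep only terms made of 2+ whitespace-separated capitalized words.
    return NAME_PATTERN.matcher(termAtt).matches();
  }
}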
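
Since the filter itself is untested, a quick way to sanity-check the regular
expression on its own is to run it against strings shaped like the two-word
shingles ShingleFilter would produce with the " " separator. This is a plain
Java sketch with no Lucene dependency; the class name NamePatternCheck and the
sample strings are made up for illustration:

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone check of the name pattern; not a Lucene class.
public final class NamePatternCheck {
  // Same pattern as in the filter: two or more whitespace-separated words,
  // each starting with an uppercase letter.
  public static final Pattern NAME_PATTERN =
      Pattern.compile("\\p{Lu}\\S*(?:\\s\\p{Lu}\\S*)+");

  public static boolean isName(CharSequence term) {
    // matches() requires the whole term to fit the pattern.
    return NAME_PATTERN.matcher(term).matches();
  }

  public static void main(String[] args) {
    String[] samples = {"James Bond", "is James", "My name", "James"};
    for (String sample : samples) {
      System.out.println(sample + " -> " + isName(sample));
    }
  }
}
```

Of these samples only "James Bond" is accepted: "is James" and "My name" each
contain a lowercase word, and the single token "James" fails because the
pattern requires at least two words.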
Steve
> -----Original Message-----
> From: Damerian [mailto:[email protected]]
> Sent: Thursday, February 09, 2012 4:15 PM
> To: [email protected]
> Subject: Re: Access next token in a stream
>
> On 9/2/2012 8:54 PM, Steven A Rowe wrote:
> > Hi Damerian,
> >
> > One way to handle your scenario is to hold on to the previous token, and
> > only emit a token after you reach at least the second token (or at
> > end-of-stream). Your incrementToken() method could look something like:
> >
> > 1. Get current attributes: input.incrementToken()
> > 2. If previous token does not exist:
> >    2a. Store current attributes as previous token (see
> >        AttributeSource#cloneAttributes)
> >    2b. Get current attributes: input.incrementToken()
> > 3. Check for & store conditions that will affect previous token's
> >    attributes
> > 4. Store current attributes as next token (see
> >    AttributeSource#cloneAttributes)
> > 5. Copy previous token into current attributes (see
> >    AttributeSource#copyTo); the target will be "this", which is an
> >    AttributeSource.
> > 6. Make changes based on conditions found in step #3 above
> > 7. Set previous token = next token
> > 8. Return true
> >
> > (Everywhere I say "token" I mean "instance of AttributeSource".)
> >
> > The final token in the input stream will need special handling, as will
> > single-token input streams.
> >
> > Good luck,
> > Steve
> >
> >> -----Original Message-----
> >> From: Damerian [mailto:[email protected]]
> >> Sent: Thursday, February 09, 2012 2:19 PM
> >> To: [email protected]
> >> Subject: Access next token in a stream
> >>
> >> Hello, I want to implement my custom filter. My question is quite simple,
> >> but I cannot find a solution to it no matter how I try:
> >>
> >> How can I access the TermAttribute of the token after the one I
> >> currently have in my stream?
> >>
> >> For example, in the phrase "My name is James Bond", if I am at the token
> >> [My], I would like to be able to check the TermAttribute of the
> >> following token [name] and fix my position increment accordingly.
> >>
> >> Thank you in advance!
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> Hi Steve,
> Thank you for your immediate reply. I will try your solution, but I feel
> that it does not solve my case.
> What I am trying to make is a filter that joins together two
> terms/tokens that start with a capital letter (it is trying to find all
> the names/surnames and make each one a single token). So in my
> aforementioned example, when I examine [James], even if I store the
> TermAttribute in a temporary token, how can I check the next one, [Bond],
> and join them without actually emitting [James] on its own (and therefore
> creating a term for it in my inverted index)?
> Thank you again for your insight, and I would really appreciate any other
> views on the matter.
>
> Regards, Damerian