On Thu, May 12, 2011 at 1:03 PM, Steven A Rowe <[email protected]> wrote:
> A thought: one way to do #1 without modifying ShingleFilter: if there were a
> StopFilter variant that accepted regular expressions instead of a stopword
> list, you could configure it with a regex like /_ .*|.* _| _ / (assuming a
> full match is required, i.e. implicit beginning and end anchors), and place
> it in the analysis pipeline after ShingleFilter to throw out shingles with
> filler tokens in them.
>
> (I think it would be useful to generalize StopFilter to allow for more
> sources of stoppage, rather than just creating a StopRegexFilter with no
> relation to StopFilter.)
>
We already did this in 3.1 by adding a base FilteringTokenFilter class.
A regex filter is trivial if you subclass it (we could add something
like this untested code to the .pattern package or wherever):
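import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// Lucene imports assume the 3.1 package layout.
import org.apache.lucene.analysis.FilteringTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/** Removes tokens whose entire term text matches the given pattern. */
public class PatternRemoveFilter extends FilteringTokenFilter {
  private final Matcher matcher;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public PatternRemoveFilter(boolean enablePositionIncrements, TokenStream input, Pattern pattern) {
    super(enablePositionIncrements, input);
    // The matcher wraps termAtt itself (a CharSequence), so it always sees the current term.
    matcher = pattern.matcher(termAtt);
  }

  @Override
  protected boolean accept() throws IOException {
    matcher.reset(); // re-read the term text for the current token
    return !matcher.matches(); // drop tokens that match the pattern in full
  }
}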
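
To make the full-match behaviour concrete, here is a rough standalone sketch using only java.util.regex, with no Lucene involved. It applies the regex from the quoted mail the same way accept() above would. The class name ShingleRegexDemo, the helper keep(), and the sample shingles are made up for illustration, and it assumes "_" is the filler token and a single space is the token separator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ShingleRegexDemo {
    // Regex as given in the quoted mail; "_" is assumed to be the filler
    // token and " " the token separator.
    static final Pattern FILLER = Pattern.compile("_ .*|.* _| _ ");

    // Hypothetical helper mirroring accept(): a term is kept only when the
    // pattern does NOT match the whole term.
    static List<String> keep(List<String> shingles) {
        Matcher matcher = FILLER.matcher("");
        List<String> kept = new ArrayList<>();
        for (String shingle : shingles) {
            if (!matcher.reset(shingle).matches()) {
                kept.add(shingle);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> shingles =
            Arrays.asList("_ wizard", "wizard _", "the wizard", "wizard");
        System.out.println(keep(shingles)); // prints [the wizard, wizard]
    }
}
```

Since matches() anchors at both ends, a shingle such as "a _ b" with the filler in the middle would not be removed by this particular pattern; only a leading or trailing filler is caught.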