Hello, I want to align the output of two different analysis pipelines, but I don't know how. We are using Lucene for text analysis. First, every input text is normalized using StandardTokenizer, StandardFilter and LowerCaseFilter. This yields a list of tokens (list1). Second, the same input text is also stemmed and stopwords are removed, yielding list2:
list1: [this text contains stopwords i need to align them]
list2: [---- text contain stopword -- need -- align ----]
If I want to align both lists, I need to know which tokens were removed
by the StopFilter. The following code works, but not for the last token
("them"):
while (tokenStream.incrementToken()) {
    // number of tokens the StopFilter removed directly before this token
    int skippedTokens = tokenStream
            .getAttribute(PositionIncrementAttribute.class)
            .getPositionIncrement() - 1;
    // process the current token, e.g. we know that "need" is the 6th
    // element in the list because the previous token was removed
}
For stopwords that are at the end of the tokenStream (e.g. "them"), the
positionIncrement is not updated - after leaving the while-loop,
skippedTokens is 0. My workaround is to append a unique number to every
input text, so that every text ends with a non-stopword. Can you think
of a more reasonable approach?
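
To make the problem reproducible without a Lucene setup, here is a minimal sketch in plain Java. The class name AlignSketch, the align helper and the hand-written increments array are assumptions for illustration only: the array stands in for the values that PositionIncrementAttribute reports for the example text above, and none of this is Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

public class AlignSketch {

    /**
     * Rebuilds a gapped token list from the tokens that survived the
     * stop filter and their position increments. A null entry marks a
     * position where a token was removed.
     */
    static List<String> align(String[] tokens, int[] increments) {
        List<String> aligned = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            // same bookkeeping as in the loop above
            int skippedTokens = increments[i] - 1;
            for (int s = 0; s < skippedTokens; s++) {
                aligned.add(null);
            }
            aligned.add(tokens[i]);
        }
        return aligned;
    }

    public static void main(String[] args) {
        // list2 from the example, with hypothetical increments chosen to
        // match the removed tokens "this", "i" and "to"
        String[] list2 = {"text", "contain", "stopword", "need", "align"};
        int[] increments = {2, 1, 1, 2, 2};

        List<String> aligned = align(list2, increments);
        System.out.println(aligned);
        // list1 has 9 tokens, but only 8 positions are reconstructed:
        // nothing in the increments accounts for the trailing "them"
        System.out.println(aligned.size());
    }
}
```

The gaps for "this", "i" and "to" are recovered, but the result is one position shorter than list1, which is exactly the missing information about the last token.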
Thank you,
Hannes
