Aligning text analyses, with and without stopwords

Johannes Neubarth Thu, 26 Jul 2012 09:17:30 -0700

Hello,
I want to align the output of two different analysis pipelines, but I
don't know how.
We are using Lucene for text analysis. First, every input text is
normalized using StandardTokenizer, StandardFilter and LowerCaseFilter.
This yields a list of tokens (list1). Second, the same input text is
also stemmed and stopwords are removed, yielding list2:


list1: [this text contains stopwords i  need to align them]
list2: [---- text contain  stopword  -- need -- align ----]

If I want to align both lists, I need to know which tokens were removed
by the StopFilter. The following code works, but not for the last token
("them"):

while (tokenStream.incrementToken()) {
    // A position increment of 1 means nothing was removed before this
    // token; every additional step is one token dropped by the StopFilter.
    int skippedTokens =
        tokenStream.getAttribute(PositionIncrementAttribute.class)
          .getPositionIncrement() - 1;
    // process the current token, e.g. we know that "need" is the 6th
    // element in the list because the previous token was removed
}

For stopwords that are at the end of the tokenStream (e.g. "them"), the
positionIncrement is not updated - after leaving the while-loop,
skippedTokens is 0. My workaround is to append a unique number to every
input text, so that every text ends with a non-stopword. Can you think
of a more reasonable approach?
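
One possibility, sketched below with plain Java collections rather than
Lucene classes: since list1 is already available, any positions left over
after the last token of list2 could be treated as removed. The class
AlignmentSketch and its align method are hypothetical names, and the
sketch assumes that both pipelines tokenize the text identically, so that
position i means the same token in both lists. The increments in main are
the values I would expect for the example above, not captured output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of Lucene.
class AlignmentSketch {

    /**
     * Returns a list as long as list1. Slot i holds the token from list2
     * that corresponds to list1.get(i), or null if that token was removed.
     */
    static List<String> align(List<String> list1,
                              List<String> list2,
                              List<Integer> positionIncrements) {
        if (list2.size() != positionIncrements.size()) {
            throw new IllegalArgumentException(
                "need exactly one position increment per token in list2");
        }
        List<String> aligned = new ArrayList<>();
        for (int i = 0; i < list2.size(); i++) {
            // Tokens removed directly before this one.
            int skippedTokens = positionIncrements.get(i) - 1;
            for (int s = 0; s < skippedTokens; s++) {
                aligned.add(null);
            }
            aligned.add(list2.get(i));
        }
        // Trailing stopwords: list1 tells us how many positions remain.
        while (aligned.size() < list1.size()) {
            aligned.add(null);
        }
        return aligned;
    }

    public static void main(String[] args) {
        List<String> list1 = Arrays.asList("this", "text", "contains",
            "stopwords", "i", "need", "to", "align", "them");
        List<String> list2 = Arrays.asList("text", "contain", "stopword",
            "need", "align");
        // Assumed increments: "this", "i" and "to" were removed.
        List<Integer> increments = Arrays.asList(2, 1, 1, 2, 2);
        // prints [null, text, contain, stopword, null, need, null, align, null]
        System.out.println(align(list1, list2, increments));
    }
}
```

This would make the appended unique number unnecessary, but it only works
as long as neither pipeline splits or merges tokens differently from the
other.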

Thank you,
Hannes
