The underscores appear because StopFilter defaults to "enable position increments": a removed stop word leaves a gap, so there are no terms at the positions where the stop words appeared in the source text. The shingle filter then fills each empty position with a "_" placeholder.

Unfortunately, SnowballAnalyzer does not expose that setting as a parameter, and it is "final", so you can't subclass it to override the "createComponents" method that creates the StopFilter. You would essentially have to copy the source for SnowballAnalyzer and then add code that invokes StopFilter.setEnablePositionIncrements, the way StopFilterFactory does.
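
To make the effect concrete, here is a minimal plain-Java sketch. It is not Lucene code and calls no Lucene API; it only models the two behaviours. The class and method names (PositionIncrementSketch, removeStopWords, shingles) are made up for illustration, the stop list ("is", "definitely", "to") is inferred from the words that turned into underscores in the output quoted below, and stemming is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model only: it imitates the observable behaviour of a stop
// filter followed by a shingle filter, and does not use Lucene classes.
public class PositionIncrementSketch {

    // Drops stop words. With keepGaps == true each removed word leaves a
    // null slot behind (models "enable position increments"); with
    // keepGaps == false the remaining tokens simply close up.
    public static List<String> removeStopWords(List<String> tokens,
                                               Set<String> stopWords,
                                               boolean keepGaps) {
        List<String> slots = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                slots.add(token);
            } else if (keepGaps) {
                slots.add(null); // hole where the stop word used to be
            }
        }
        return slots;
    }

    // Builds every shingle of 1..maxSize consecutive slots, writing "_"
    // for an empty slot, as a filler placeholder.
    public static List<String> shingles(List<String> slots, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < slots.size(); start++) {
            StringBuilder sb = new StringBuilder();
            for (int n = 1; n <= maxSize && start + n <= slots.size(); n++) {
                String token = slots.get(start + n - 1);
                if (n > 1) {
                    sb.append(' ');
                }
                sb.append(token == null ? "_" : token);
                out.add(sb.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed stop list, inferred from the reported output.
        Set<String> stopWords =
                new HashSet<>(Arrays.asList("is", "definitely", "to"));
        List<String> tokens = Arrays.asList(
                "satellite is definitely falling to earth".split(" "));

        System.out.println("gaps kept:    "
                + shingles(removeStopWords(tokens, stopWords, true), 3));
        System.out.println("gaps removed: "
                + shingles(removeStopWords(tokens, stopWords, false), 3));
    }
}
```

With the gaps kept, the model yields the same set of keys as the underscore output in the question, including "_" three times. With the gaps removed, only "satellite", "falling", "earth", "satellite falling", "falling earth" and "satellite falling earth" remain, which is what a copied analyzer that disables position increments on its StopFilter would be aiming for.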

-- Jack Krupansky

-----Original Message----- From: Martin O'Shea
Sent: Wednesday, September 19, 2012 4:24 AM
To: java-user@lucene.apache.org
Subject: Using stop words with snowball analyzer and shingle filter

I'm currently giving the user an option to include stop words or not when
filtering a body of text for ngram frequencies. Typically, this is done as
follows:



snowballAnalyzer = new SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);

shingleAnalyzer = new ShingleAnalyzerWrapper(snowballAnalyzer, this.getnGramLength());



stopWords is set to a full list of words either to include in ngrams or to
remove from them. this.getnGramLength() simply returns the current ngram
length, up to a maximum of three.



If I include stop words when filtering the text "satellite is definitely
falling to Earth" for trigrams, the output is:



No=1, Key=to, Freq=1
No=2, Key=definitely, Freq=1
No=3, Key=falling to earth, Freq=1
No=4, Key=satellite, Freq=1
No=5, Key=is, Freq=1
No=6, Key=definitely falling to, Freq=1
No=7, Key=definitely falling, Freq=1
No=8, Key=falling, Freq=1
No=9, Key=to earth, Freq=1
No=10, Key=satellite is, Freq=1
No=11, Key=is definitely, Freq=1
No=12, Key=falling to, Freq=1
No=13, Key=is definitely falling, Freq=1
No=14, Key=earth, Freq=1
No=15, Key=satellite is definitely, Freq=1



But if I exclude stop words for trigrams, the output is this:



No=1, Key=satellite, Freq=1
No=2, Key=falling _, Freq=1
No=3, Key=satellite _ _, Freq=1
No=4, Key=_ earth, Freq=1
No=5, Key=falling, Freq=1
No=6, Key=satellite _, Freq=1
No=7, Key=_ _, Freq=1
No=8, Key=_ falling _, Freq=1
No=9, Key=falling _ earth, Freq=1
No=10, Key=_, Freq=3
No=11, Key=earth, Freq=1
No=12, Key=_ _ falling, Freq=1
No=13, Key=_ falling, Freq=1



Why am I seeing underscores? I would have expected to see the simple unigrams,
plus "satellite falling", "falling earth", and "satellite falling earth".







