Hello,
I am very confused about what ShingleFilter seems to be doing in Lucene
4.6. What I would like to do is extract all possible bigrams from a
sentence. So if the sentence is "This is a dog", I want "This is", "is a",
and "a dog".
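To make the expected result concrete, here is a minimal plain-Java sketch of the output I am after. It does not use Lucene at all; splitting on whitespace is a simplifying assumption (a real analyzer would also lowercase and drop punctuation), and the names BigramSketch and bigrams are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class BigramSketch {
    // Hypothetical helper: returns every pair of adjacent words in the sentence.
    // Assumes words are separated by whitespace only.
    static List<String> bigrams(String sentence) {
        List<String> result = new ArrayList<String>();
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return result; // no words, so no bigrams
        }
        String[] words = trimmed.split("\\s+");
        // Pair each word with its immediate successor.
        for (int i = 0; i + 1 < words.length; i++) {
            result.add(words[i] + " " + words[i + 1]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints one bigram per line for the example sentence.
        for (String bigram : bigrams("This is a dog")) {
            System.out.println(bigram);
        }
    }
}
```

A sentence of n words should give n - 1 bigrams, with each word appearing at most twice across the output, and that is what I expected from the shingle output as well.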
Here is my code:
    StringTokenizer itr = new StringTokenizer(theText, "\n");
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
    ShingleAnalyzerWrapper shingleAnalyzer = new ShingleAnalyzerWrapper(analyzer, 2, 2);
    while (itr.hasMoreTokens()) {
        String theSentence = itr.nextToken();
        StringReader reader = new StringReader(theSentence);
        TokenStream tokenStream = shingleAnalyzer.tokenStream("content", reader);
        ShingleFilter theFilter = new ShingleFilter(tokenStream);
        theFilter.setOutputUnigrams(false);
        CharTermAttribute charTermAttribute = theFilter.addAttribute(CharTermAttribute.class);
        theFilter.reset();
        while (theFilter.incrementToken()) {
            System.out.println(charTermAttribute.toString());
        }
        theFilter.end();
        theFilter.close();
    }
What I see in the output is this: suppose the sentence is "resting
comfortably and in no distress". I get the following output:
resting resting comfortably
resting comfortably comfortably
comfortably comfortably _
comfortably _ _ distress
_ distress distress
So it looks like not only do I not get bigrams, I get spurious 3-grams
by repeating words. Could someone please help?
Thanks much,
Natalia Connolly