Hi Robert,

No, I did not… I just needed the filter to stop it from outputting unigrams; otherwise I was getting "This", "this is", "is", "is a ", and so on. Is there another way I could do it?
Thank you,
Natalia

On Wed, Apr 2, 2014 at 2:40 PM, Robert Muir <rcm...@gmail.com> wrote:
> Did you really mean to shingle twice (shingleanalyzerwrapper just
> wraps the analyzer with a shinglefilter, then the code wraps that with
> another shinglefilter again) ?
>
> On Wed, Apr 2, 2014 at 1:42 PM, Natalia Connolly
> <natalia.v.conno...@gmail.com> wrote:
> > Hello,
> >
> > I am very confused about what ShingleFilter seems to be doing in Lucene
> > 4.6. What I would like to do is extract all possible bigrams from a
> > sentence. So if the sentence is "This is a dog", I want "This is",
> > "is a", "a dog".
> >
> > Here is my code:
> >
> > StringTokenizer itr = new StringTokenizer(theText,"\n");
> > Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
> > ShingleAnalyzerWrapper shingleAnalyzer =
> >     new ShingleAnalyzerWrapper(analyzer,2,2);
> >
> > while (itr.hasMoreTokens()) {
> >
> >     String theSentence = itr.nextToken();
> >     StringReader reader = new StringReader(theSentence);
> >     TokenStream tokenStream = shingleAnalyzer.tokenStream("content", reader);
> >     ShingleFilter theFilter = new ShingleFilter(tokenStream);
> >     theFilter.setOutputUnigrams(false);
> >
> >     CharTermAttribute charTermAttribute =
> >         theFilter.addAttribute(CharTermAttribute.class);
> >
> >     theFilter.reset();
> >
> >     while (theFilter.incrementToken()) {
> >
> >         System.out.println(charTermAttribute.toString());
> >
> >     }
> >
> >     theFilter.end();
> >     theFilter.close();
> > }
> >
> > What I see in the output is this: suppose the sentence is "resting
> > comfortably and in no distress". I get the following output:
> >
> > resting resting comfortably
> > resting comfortably comfortably
> > comfortably comfortably _
> > comfortably _ _ distress
> > _ distress distress
> >
> > So it looks like not only do I not get bigrams, I get spurious 3-grams
> > by repeating words. Could someone please help?
> >
> > Thanks much,
> >
> > Natalia Connolly
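
For reference, the result asked for above ("This is", "is a", "a dog") can be sketched in plain Java without Lucene at all. This is only an illustration of the expected bigram output, not what ShingleFilter does internally; the class name BigramSketch and the method bigrams are made up for this example, and the whitespace split is a stand-in for a real analyzer (no lowercasing, no stop-word removal).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: builds word bigrams from a token list.
final class BigramSketch {

    // Joins each pair of adjacent tokens with a single space.
    // Returns an empty list when there are fewer than two tokens.
    static List<String> bigrams(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            result.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumption: plain whitespace tokenization, unlike StandardAnalyzer.
        List<String> tokens = Arrays.asList("This is a dog".split("\\s+"));
        for (String bigram : bigrams(tokens)) {
            System.out.println(bigram);
        }
    }
}
```

Within Lucene itself, the point of the question quoted above is to shingle only once. As far as I recall, ShingleAnalyzerWrapper in the 4.x line also has a longer constructor that takes an outputUnigrams flag, which would make the extra ShingleFilter unnecessary, but please check the 4.6 javadoc before relying on that.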