Hi Robert,

   No, I did not… I just needed the filter to stop it from outputting
unigrams; otherwise I was getting "This", "this is", "is", "is a", and so
on.  Is there another way I could do it?

   Thank you,

   Natalia
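
For reference, the double-shingling effect Robert asks about below can be modelled in plain Java. This is only a sketch: ShingleSketch and its shingles() helper are hypothetical names, not Lucene classes, and the model ignores position increments, offsets, and the "_" filler tokens that Lucene inserts for removed stop words. It assumes the first pass emits unigrams (which matches the unigrams mentioned above) and that the second pass emits only bigrams.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShingleSketch {

    // Simplified model of shingling, NOT Lucene's ShingleFilter:
    // for each position, optionally emit the token itself, then the
    // n-gram of `size` tokens starting there (if enough tokens remain).
    public static List<String> shingles(List<String> tokens, int size, boolean outputUnigrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (outputUnigrams) {
                out.add(tokens.get(i));
            }
            if (i + size <= tokens.size()) {
                out.add(String.join(" ", tokens.subList(i, i + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("This", "is", "a", "dog");

        // One pass with unigrams off: the bigrams that were wanted.
        System.out.println(shingles(words, 2, false));
        // [This is, is a, a dog]

        // Pass 1 with unigrams on (a model of the wrapper's stream) ...
        List<String> firstPass = shingles(words, 2, true);
        System.out.println(firstPass);
        // [This, This is, is, is a, a, a dog, dog]

        // ... then pass 2 over that stream with unigrams off.
        System.out.println(shingles(firstPass, 2, false));
        // [This This is, This is is, is is a, is a a, a a dog, a dog dog]
    }
}
```

The last line has the same shape as "resting resting comfortably" and "resting comfortably comfortably" in the output quoted below: each second-pass bigram pairs a unigram with a first-pass bigram. In Lucene itself, the likely fix is to shingle only once. I believe the 4.x ShingleAnalyzerWrapper also has a longer constructor that takes an outputUnigrams flag, but that should be checked against the 4.6 javadoc.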



On Wed, Apr 2, 2014 at 2:40 PM, Robert Muir <rcm...@gmail.com> wrote:

> Did you really mean to shingle twice? ShingleAnalyzerWrapper just
> wraps the analyzer with a ShingleFilter, and then the code wraps that
> with another ShingleFilter.
>
> On Wed, Apr 2, 2014 at 1:42 PM, Natalia Connolly
> <natalia.v.conno...@gmail.com> wrote:
> > Hello,
> >
> >    I am very confused about what ShingleFilter seems to be doing in
> > Lucene 4.6.  What I would like to do is extract all possible bigrams
> > from a sentence.  So if the sentence is "This is a dog", I want
> > "This is", "is a", "a dog".
> >
> >     Here is my code:
> >
> >    StringTokenizer itr = new StringTokenizer(theText,"\n");
> >    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
> >    ShingleAnalyzerWrapper shingleAnalyzer = new ShingleAnalyzerWrapper(analyzer, 2, 2);
> >
> >    while (itr.hasMoreTokens()) {
> >
> >     String theSentence = itr.nextToken();
> >     StringReader reader = new StringReader(theSentence);
> >     TokenStream tokenStream = shingleAnalyzer.tokenStream("content", reader);
> >     ShingleFilter theFilter = new ShingleFilter(tokenStream);
> >     theFilter.setOutputUnigrams(false);
> >
> >     CharTermAttribute charTermAttribute = theFilter.addAttribute(CharTermAttribute.class);
> >
> >     theFilter.reset();
> >
> >      while (theFilter.incrementToken()) {
> >
> >                 System.out.println(charTermAttribute.toString());
> >
> >      }
> >
> >      theFilter.end();
> >      theFilter.close();
> >   }
> >
> >
> >    What I see in the output is this: suppose the sentence is "resting
> > comfortably and in no distress".  I get the following output:
> >
> > resting resting comfortably
> > resting comfortably comfortably
> > comfortably comfortably _
> > comfortably _ _ distress
> > _ distress distress
> >
> >    So it looks like not only do I not get bigrams, I get spurious 3-grams
> > by repeating words.  Could someone please help?
> >
> >     Thanks much,
> >
> >     Natalia Connolly
>
