Strange behavior of ShingleFilter in Lucene 4.6

Natalia Connolly Wed, 02 Apr 2014 10:43:27 -0700

Hello,

   I am confused by what ShingleFilter is doing in Lucene 4.6.  I would
like to extract every bigram from a sentence.  So if the sentence is
"This is a dog", I want "This is", "is a", and "a dog".
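
   To make the goal concrete, here is a minimal plain-Java sketch of the
output I am after.  It does not use Lucene at all; the class name
BigramSketch and the method bigrams are hypothetical names made up for
this illustration, and it simply splits on whitespace rather than
running a real analyzer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: it only illustrates the
// bigram output described above.
final class BigramSketch {

    // Splits the sentence on whitespace and joins each adjacent pair
    // of words with a single space.
    static List<String> bigrams(String sentence) {
        List<String> result = new ArrayList<>();
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return result; // no words, so no bigrams
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            result.add(words[i] + " " + words[i + 1]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints one bigram per line: "This is", "is a", "a dog".
        for (String bigram : bigrams("This is a dog")) {
            System.out.println(bigram);
        }
    }
}
```

   A sentence with fewer than two words yields an empty list in this
sketch.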


    Here is my code:

   StringTokenizer itr = new StringTokenizer(theText, "\n");
   Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
   ShingleAnalyzerWrapper shingleAnalyzer =
       new ShingleAnalyzerWrapper(analyzer, 2, 2);

   while (itr.hasMoreTokens()) {
     String theSentence = itr.nextToken();
     StringReader reader = new StringReader(theSentence);
     TokenStream tokenStream =
         shingleAnalyzer.tokenStream("content", reader);
     ShingleFilter theFilter = new ShingleFilter(tokenStream);
     theFilter.setOutputUnigrams(false);

     CharTermAttribute charTermAttribute =
         theFilter.addAttribute(CharTermAttribute.class);

     theFilter.reset();

     while (theFilter.incrementToken()) {
       System.out.println(charTermAttribute.toString());
     }

     theFilter.end();
     theFilter.close();
   }


   Here is what I see.  Suppose the sentence is "resting comfortably and
in no distress".  I get the following output:

resting resting comfortably
resting comfortably comfortably
comfortably comfortably _
comfortably _ _ distress
_ distress distress

   So it looks like not only do I not get bigrams, I get spurious 3-grams
by repeating words.  Could someone please help?

    Thanks much,

    Natalia Connolly
