I am using Lucene 2.9.1 and, while trying to understand ShingleFilter, I
wrote the code below:
String test = "please divide this sentence";
Tokenizer wsTokenizer = new WhitespaceTokenizer(new StringReader(test));
ShingleFilter filter = new ShingleFilter(wsTokenizer, 3);
filter.setOutputUnigrams(false);
TermAttribute termAtt = (TermAttribute) filter.getAttribute(TermAttribute.class);
while (filter.incrementToken()) {
    System.out.println(termAtt.term());
}
I noticed that with outputUnigrams set to false I get the same output for
maxShingleSize=2 and maxShingleSize=3:
please divide
divide this
this sentence
When I set maxShingleSize to 4, the output is:
please divide
please divide this sentence
divide this
this sentence
I was expecting the following output with maxShingleSize=3 and
outputUnigrams=false:
please divide this
divide this sentence
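To make that expectation concrete, here is a minimal plain-Java sketch that does not use Lucene at all. The class name ShingleSketch, the method shingles, and its minSize parameter are made up for illustration; this is not how ShingleFilter is implemented, and the ordering (by start position, then by size) is just a choice of the sketch. It assumes minSize >= 1 and whitespace-separated words.

```java
import java.util.ArrayList;
import java.util.List;

class ShingleSketch {
    // Hypothetical helper, not part of Lucene: returns every run of
    // minSize..maxSize consecutive whitespace-separated words, joined by a space.
    static List<String> shingles(String text, int minSize, int maxSize) {
        String[] words = text.trim().split("\\s+");
        List<String> result = new ArrayList<String>();
        for (int start = 0; start < words.length; start++) {
            // Stop growing the shingle once it would run past the last word.
            for (int size = minSize; size <= maxSize && start + size <= words.length; size++) {
                StringBuilder sb = new StringBuilder(words[start]);
                for (int i = start + 1; i < start + size; i++) {
                    sb.append(' ').append(words[i]);
                }
                result.add(sb.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Trigrams only, which is the output I was expecting above.
        for (String s : shingles("please divide this sentence", 3, 3)) {
            System.out.println(s);
        }
    }
}
```

Calling shingles(test, 3, 3) yields exactly the two trigrams listed above. Calling shingles(test, 2, 3) additionally yields the three bigrams, which is what one would see if maxShingleSize means "all sizes from 2 up to the maximum". Either way, the trigrams should appear somewhere in the output.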
Am I missing something, or is this the expected behavior?
I checked the source code of ShingleFilterTest (Lucene 3.0.0) and saw that
TRI_GRAM_TOKENS are tested only with outputUnigrams=true, not with
outputUnigrams=false.