Shingles Filter problems

Peyman Faratin Tue, 11 Oct 2011 07:26:04 -0700

Hi

I have the following ShingleFilter analyzer chain (Lucene 3.2)


          public TokenStream tokenStream(String fieldName, Reader reader) {
                  StandardTokenizer first = new StandardTokenizer(Version.LUCENE_32, reader);
                  StandardFilter second = new StandardFilter(Version.LUCENE_32, first);
                  LowerCaseFilter third = new LowerCaseFilter(Version.LUCENE_32, second);
                  StopFilter fourth = new StopFilter(Version.LUCENE_32, third, Stopwords);
                  PositionFilter fifth = new PositionFilter(fourth);
                  ShingleFilter filter = new ShingleFilter(fifth, shingleSize);

                  return filter;
          }

which produces the following token stream for the sentence

"please parse this sentence into a shingle of size 2. I'll pay $2 for it"

1: [_ parse:7->12:shingle] 
2: [parse:7->12:<ALPHANUM>] [parse sentence:7->26:shingle] 
3: [sentence:18->26:<ALPHANUM>] [sentence shingle:18->41:shingle] 
4: [shingle:34->41:<ALPHANUM>] [shingle size:34->49:shingle] 
5: [size:45->49:<ALPHANUM>] [size 2:45->51:shingle] 
6: [2:50->51:<NUM>] [2 pay:50->61:shingle] 
7: [pay:58->61:<ALPHANUM>] [pay 2:58->64:shingle] 
8: [2:63->64:<NUM>] 
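
For reference, here is a minimal plain-Java sketch of how the unigram and bigram entries in the listing above pair up. It is not Lucene code: the class ShingleSketch and the method shingle are hypothetical names used only for illustration, and the leading "_ parse" entry (which presumably reflects the removed stop word "please") is not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this does not use or reproduce Lucene's ShingleFilter.
class ShingleSketch {

    // For each position, emit the unigram (if requested) followed by the
    // bigram formed with the next token, when a next token exists.
    static List<String> shingle(List<String> tokens, boolean outputUnigrams) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < tokens.size(); i++) {
            if (outputUnigrams) {
                out.add(tokens.get(i));
            }
            if (i + 1 < tokens.size()) {
                out.add(tokens.get(i) + " " + tokens.get(i + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Tokens left after tokenizing, lower-casing and stop word removal,
        // taken from the listing in the post.
        List<String> tokens =
                Arrays.asList("parse", "sentence", "shingle", "size", "2", "pay", "2");
        System.out.println(shingle(tokens, true));   // unigrams and bigrams
        System.out.println(shingle(tokens, false));  // bigrams only
    }
}
```

With the tokens from the post, the first call yields the same unigram and bigram pairs as entries 2 to 8 of the listing.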

The query analyzer produces the following analyzed query for the field
"titleShingled" for the above sentence:

...... analyzed query:titleShingled:parse titleShingled:sentence 
titleShingled:shingle titleShingled:size titleShingled:2 titleShingled:pay 
titleShingled:2

As you can see, there are no bigram shingles in the query. I tried removing the
unigrams from the token stream (using filter.setOutputUnigrams(false) in the
above shingle filter), but even though the shingles look fine in the token
stream, the analyzed query is empty:


1: [_ parse:7->12:shingle] 
2: [parse sentence:7->26:shingle] 
3: [sentence shingle:18->41:shingle] 
4: [shingle size:34->49:shingle] 
5: [size 2:45->51:shingle] 
6: [2 pay:50->61:shingle] 
7: [pay 2:58->64:shingle] 

...... analyzed query: 

My goal is to index both unigrams and bigrams, but to search on bigrams first.
I suspect the QueryParser is handling the shingles in a way that I do not
understand properly. The parser is constructed as follows:

                  QueryParser parser = new QueryParser(Version.LUCENE_32, "titleShingled",
                                  new ShinglesAnalyzer(2, Stopwords));
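
One possible explanation, stated here as an assumption rather than something verified against Lucene 3.2 in this post: if the query parser splits the query text on whitespace before handing each piece to the analyzer, the shingle filter only ever sees one word at a time and can never form a bigram. The plain-Java sketch below simulates that behaviour. It does not call Lucene, and the names QuerySplitSketch, analyze and parseLikeQueryParser are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation; none of this is Lucene API.
class QuerySplitSketch {

    // Stand-in for the analyzer: lower-case, split on whitespace, then emit
    // unigrams (optional) and bigrams, roughly as a size-2 shingle filter would.
    static List<String> analyze(String text, boolean outputUnigrams) {
        List<String> out = new ArrayList<String>();
        String trimmed = text.trim().toLowerCase();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            if (outputUnigrams) {
                out.add(tokens[i]);
            }
            if (i + 1 < tokens.length) {
                out.add(tokens[i] + " " + tokens[i + 1]);
            }
        }
        return out;
    }

    // ASSUMED parser behaviour: split the query on whitespace first and
    // analyze every piece on its own, so the analyzer never sees two words.
    static List<String> parseLikeQueryParser(String query, boolean outputUnigrams) {
        List<String> terms = new ArrayList<String>();
        for (String piece : query.trim().split("\\s+")) {
            terms.addAll(analyze(piece, outputUnigrams));
        }
        return terms;
    }

    public static void main(String[] args) {
        String query = "parse sentence shingle";
        System.out.println(parseLikeQueryParser(query, true));   // unigrams only
        System.out.println(parseLikeQueryParser(query, false));  // nothing at all
        System.out.println(analyze(query, false));               // bigrams, whole text at once
    }
}
```

Under that assumption, the simulation shows the same two symptoms as above: only unigrams when unigram output is on, and an empty query when it is off. If the assumption holds, it may be worth trying the query text wrapped in double quotes, or running the analyzer over the whole string directly, to see whether bigrams then appear.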

Any help would be very much appreciated

Peyman
