Hi,
I have the following analyzer method, which ends in a ShingleFilter (Lucene 3.2):
public TokenStream tokenStream(String fieldName, Reader reader) {
    StandardTokenizer first = new StandardTokenizer(Version.LUCENE_32, reader);
    StandardFilter second = new StandardFilter(Version.LUCENE_32, first);
    LowerCaseFilter third = new LowerCaseFilter(Version.LUCENE_32, second);
    StopFilter fourth = new StopFilter(Version.LUCENE_32, third, Stopwords);
    PositionFilter fifth = new PositionFilter(fourth);
    ShingleFilter filter = new ShingleFilter(fifth, shingleSize);
    return filter;
}
It produces the following token stream for the sentence
"please parse this sentence into a shingle of size 2. I'll pay $2 for it":
1: [_ parse:7->12:shingle]
2: [parse:7->12:<ALPHANUM>] [parse sentence:7->26:shingle]
3: [sentence:18->26:<ALPHANUM>] [sentence shingle:18->41:shingle]
4: [shingle:34->41:<ALPHANUM>] [shingle size:34->49:shingle]
5: [size:45->49:<ALPHANUM>] [size 2:45->51:shingle]
6: [2:50->51:<NUM>] [2 pay:50->61:shingle]
7: [pay:58->61:<ALPHANUM>] [pay 2:58->64:shingle]
8: [2:63->64:<NUM>]
The query analyzer produces the following analyzed query on the field
"titleShingled" for the above sentence:
...... analyzed query:titleShingled:parse titleShingled:sentence
titleShingled:shingle titleShingled:size titleShingled:2 titleShingled:pay
titleShingled:2
As you can see, there are no bigram shingles in the query. I tried removing the
unigrams from the token stream (by calling filter.setOutputUnigrams(false) on the
ShingleFilter above), but even though the shingles look fine, the analyzed query
is empty:
1: [_ parse:7->12:shingle]
2: [parse sentence:7->26:shingle]
3: [sentence shingle:18->41:shingle]
4: [shingle size:34->49:shingle]
5: [size 2:45->51:shingle]
6: [2 pay:50->61:shingle]
7: [pay 2:58->64:shingle]
...... analyzed query:
My goal is to index both unigrams and bigrams, but to search on bigrams first.
I suspect the QueryParser is handling the shingles in a way that I do not
properly understand. This is how I create it:
QueryParser parser = new QueryParser(Version.LUCENE_32, "titleShingled",
        new ShinglesAnalyzer(2, Stopwords));
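To illustrate what I suspect might be going on, here is a minimal plain-Java
sketch (no Lucene involved). It assumes, and this is only my guess, that the
QueryParser splits the query text on whitespace and runs the analyzer on each
word separately. The class ShingleSketch and its methods are hypothetical names
made up for this illustration; bigrams() is only a stand-in for a ShingleFilter
of size 2 with setOutputUnigrams(false), not the real filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShingleSketch {

    // Hypothetical stand-in for a size-2 ShingleFilter without unigrams:
    // joins each pair of adjacent tokens into one bigram.
    static List<String> bigrams(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return out;
    }

    // The whole text goes through the analyzer in one pass,
    // so adjacent words are visible to the shingling step.
    static List<String> analyzeWhole(String text) {
        return bigrams(Arrays.asList(text.trim().split("\\s+")));
    }

    // ASSUMPTION: the parser splits on whitespace first and analyzes
    // each word on its own, so the shingling step only ever sees one token.
    static List<String> analyzePerWord(String text) {
        List<String> out = new ArrayList<String>();
        for (String word : text.trim().split("\\s+")) {
            out.addAll(bigrams(Arrays.asList(word)));
        }
        return out;
    }

    public static void main(String[] args) {
        String query = "parse sentence shingle";
        System.out.println(analyzeWhole(query));   // [parse sentence, sentence shingle]
        System.out.println(analyzePerWord(query)); // []
    }
}
```

If that assumption holds, it would be consistent with both results above: with
unigrams enabled only single terms reach the query, and with unigrams disabled
nothing is left at all.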
Any help would be very much appreciated.
Peyman