Something does appear dodgy here. Using 3.4.0, the following very simple code, with no custom classes,

    ShingleAnalyzerWrapper saw = new ShingleAnalyzerWrapper(LUCENE_34);
    QueryParser qp = new QueryParser(LUCENE_34, "t", saw);
    String s = "simple sentences rule";
    Query q = qp.parse(s);
    System.out.printf("%s parsed to %s\n", s, q);

produces

    simple sentences rule parsed to t:simple t:sentences t:rule

Like you, I would have expected there to be some shingles in there. Are we both missing something?

--
Ian.

On Tue, Oct 11, 2011 at 3:25 PM, Peyman Faratin <pey...@robustlinks.com> wrote:
> Hi
>
> I have the following ShingleFilter chain (Lucene 3.2)
>
>     public TokenStream tokenStream(String fieldName, Reader reader) {
>         StandardTokenizer first = new StandardTokenizer(Version.LUCENE_32, reader);
>         StandardFilter second = new StandardFilter(Version.LUCENE_32, first);
>         LowerCaseFilter third = new LowerCaseFilter(Version.LUCENE_32, second);
>         StopFilter fourth = new StopFilter(Version.LUCENE_32, third, Stopwords);
>         PositionFilter fifth = new PositionFilter(fourth);
>         ShingleFilter filter = new ShingleFilter(fifth, shingleSize);
>         return filter;
>     }
>
> that produces the following token stream given the sentence
>
> "please parse this sentence into a shingle of size 2. I'll pay $2 for it"
>
> 1: [_ parse:7->12:shingle]
> 2: [parse:7->12:<ALPHANUM>] [parse sentence:7->26:shingle]
> 3: [sentence:18->26:<ALPHANUM>] [sentence shingle:18->41:shingle]
> 4: [shingle:34->41:<ALPHANUM>] [shingle size:34->49:shingle]
> 5: [size:45->49:<ALPHANUM>] [size 2:45->51:shingle]
> 6: [2:50->51:<NUM>] [2 pay:50->61:shingle]
> 7: [pay:58->61:<ALPHANUM>] [pay 2:58->64:shingle]
> 8: [2:63->64:<NUM>]
>
> The query analyzer produces the following analyzed query for the field
> "titleShingled" for the above sentence:
>
> ...... analyzed query:titleShingled:parse titleShingled:sentence
> titleShingled:shingle titleShingled:size titleShingled:2 titleShingled:pay
> titleShingled:2
>
> As you can see, there are no bigram shingles in the query. I tried removing
> the unigrams from the token stream (using filter.setOutputUnigrams(false) in
> the above shingle filter), but even though the shingles seem to be fine, the
> query is empty:
>
> 1: [_ parse:7->12:shingle]
> 2: [parse sentence:7->26:shingle]
> 3: [sentence shingle:18->41:shingle]
> 4: [shingle size:34->49:shingle]
> 5: [size 2:45->51:shingle]
> 6: [2 pay:50->61:shingle]
> 7: [pay 2:58->64:shingle]
>
> ...... analyzed query:
>
> My goal is to index both unigrams and bigrams but first try to search on
> bigrams. I think it is the QueryParser that is parsing the shingles in a
> manner that I am not understanding properly.
>
>     QueryParser parser = new QueryParser(Version.LUCENE_32, "titleShingled",
>             new ShinglesAnalyzer(2, Stopwords));
>
> Any help would be very much appreciated
>
> Peyman
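
One possible explanation, offered as an assumption since nothing in this thread confirms it: the classic QueryParser breaks the query string on whitespace before it calls the analyzer, so the analyzer only ever sees one word at a time and a shingle filter has no neighbouring token to pair it with. The sketch below uses plain Java only and no Lucene classes. The class ShingleSketch and its methods are hypothetical stand-ins for the analysis chain, written just to show how per-word analysis would give both results reported above (unigrams only, and an empty query once unigrams are switched off).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShingleSketch {

    // Hypothetical stand-in for a size-2 shingle filter (not a Lucene API).
    // Emits each token (if outputUnigrams is true) followed by the bigram
    // that starts at that token, when a following token exists.
    static List<String> shingles(List<String> tokens, boolean outputUnigrams) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < tokens.size(); i++) {
            if (outputUnigrams) {
                out.add(tokens.get(i));
            }
            if (i + 1 < tokens.size()) {
                out.add(tokens.get(i) + " " + tokens.get(i + 1));
            }
        }
        return out;
    }

    // The whole text reaches the analysis chain in a single call.
    static List<String> analyzeWhole(String text, boolean outputUnigrams) {
        return shingles(Arrays.asList(text.split("\\s+")), outputUnigrams);
    }

    // Assumed parser behaviour: split on whitespace first, then analyze
    // each word on its own, so no bigram can ever be formed.
    static List<String> analyzePerWord(String text, boolean outputUnigrams) {
        List<String> out = new ArrayList<String>();
        for (String word : text.split("\\s+")) {
            out.addAll(shingles(Arrays.asList(word), outputUnigrams));
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "simple sentences rule";
        System.out.println(analyzeWhole(s, true));
        // [simple, simple sentences, sentences, sentences rule, rule]
        System.out.println(analyzePerWord(s, true));
        // [simple, sentences, rule]
        System.out.println(analyzePerWord(s, false));
        // []
    }
}
```

If that assumption holds, putting the text in double quotes in the query string should hand the whole phrase to the analyzer in one call, though I have not run that against 3.2 or 3.4 here.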