Something does appear dodgy here. Using 3.4.0, the following very
simple code, with no custom classes:

    ShingleAnalyzerWrapper saw = new ShingleAnalyzerWrapper(LUCENE_34);
    QueryParser qp = new QueryParser(LUCENE_34, "t", saw);
    String s = "simple sentences rule";
    Query q = qp.parse(s);
    System.out.printf("%s parsed to %s\n", s, q);

produces

    simple sentences rule parsed to t:simple t:sentences t:rule

Like you, I would have expected there to be some shingles in there.
Are we both missing something?
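
One possible explanation (an assumption, not verified against the QueryParser source here): the parser may split the query string on whitespace and hand each chunk to the analyzer separately, so the shingle filter never sees two adjacent terms. The plain-Java sketch below only illustrates that idea. ShingleSketch and its methods are hypothetical names made up for this illustration, not Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; none of these names are Lucene API.
final class ShingleSketch {

    // Emit each token, followed by the bigram it starts (if there is a next token).
    static List<String> shingle(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            out.add(tokens.get(i));
            if (i + 1 < tokens.size()) {
                out.add(tokens.get(i) + " " + tokens.get(i + 1));
            }
        }
        return out;
    }

    // The whole string goes through the shingler at once, so bigrams can form.
    static List<String> analyzeWhole(String text) {
        return shingle(Arrays.asList(text.trim().split("\\s+")));
    }

    // ASSUMPTION being illustrated: the text is split on whitespace first and
    // each chunk is shingled alone, so no chunk ever has a neighbour.
    static List<String> analyzePerChunk(String text) {
        List<String> out = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            out.addAll(shingle(Arrays.asList(chunk)));
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "simple sentences rule";
        // prints [simple, simple sentences, sentences, sentences rule, rule]
        System.out.println(analyzeWhole(s));
        // prints [simple, sentences, rule]
        System.out.println(analyzePerChunk(s));
    }
}
```

The per-chunk output has the same shape as the parsed query above (unigrams only). If the assumption holds, it may be worth trying the query wrapped in double quotes, since a quoted phrase would presumably reach the analyzer as a single piece of text.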
--
Ian.
On Tue, Oct 11, 2011 at 3:25 PM, Peyman Faratin wrote:
> Hi
>
> I have the following shingle filter (Lucene 3.2):
>
> public TokenStream tokenStream(String fieldName, Reader reader) {
>     // tokenize, normalize, lowercase, then drop stopwords
>     StandardTokenizer first = new StandardTokenizer(Version.LUCENE_32, reader);
>     StandardFilter second = new StandardFilter(Version.LUCENE_32, first);
>     LowerCaseFilter third = new LowerCaseFilter(Version.LUCENE_32, second);
>     StopFilter fourth = new StopFilter(Version.LUCENE_32, third, Stopwords);
>     // PositionFilter defaults to position increment 0 for every token after the first
>     PositionFilter fifth = new PositionFilter(fourth);
>     // build shingles of up to shingleSize tokens
>     ShingleFilter filter = new ShingleFilter(fifth, shingleSize);
>     return filter;
> }
>
> It produces the following token stream for the sentence
>
> "please parse this sentence into a shingle of size 2. I'll pay $2 for it"
>
> 1: [_ parse:7->12:shingle]
> 2: [parse:7->12:<ALPHANUM>] [parse sentence:7->26:shingle]
> 3: [sentence:18->26:<ALPHANUM>] [sentence shingle:18->41:shingle]
> 4: [shingle:34->41:<ALPHANUM>] [shingle size:34->49:shingle]
> 5: [size:45->49:<ALPHANUM>] [size 2:45->51:shingle]
> 6: [2:50->51:<NUM>] [2 pay:50->61:shingle]
> 7: [pay:58->61:<ALPHANUM>] [pay 2:58->64:shingle]
> 8: [2:63->64:<NUM>]
>
> The query analyzer produces the following analyzed query for the field
> "titleShingled" for the above sentence:
>
> ...... analyzed query:titleShingled:parse titleShingled:sentence
> titleShingled:shingle titleShingled:size titleShingled:2 titleShingled:pay
> titleShingled:2
>
> As you can see, there are no bigram shingles in the query. I tried removing the
> unigrams from the token stream (using filter.setOutputUnigrams(false) in the
> above shingle filter), but even though the shingles seem to be fine, the query
> is empty:
>
>
> 1: [_ parse:7->12:shingle]
> 2: [parse sentence:7->26:shingle]
> 3: [sentence shingle:18->41:shingle]
> 4: [shingle size:34->49:shingle]
> 5: [size 2:45->51:shingle]
> 6: [2 pay:50->61:shingle]
> 7: [pay 2:58->64:shingle]
>
> ...... analyzed query:
>
> My goal is to index both unigrams and bigrams, but to search on bigrams
> first. I think the QueryParser is handling the shingles in a way that I do
> not properly understand.
>
> QueryParser parser = new QueryParser(Version.LUCENE_32, "titleShingled",
>     new ShinglesAnalyzer(2, Stopwords));
>
> Any help would be very much appreciated.
>
> Peyman
>
>