Hi Ian,
I think I found the problem (based on the tests here:
http://www.devdaily.com/java/jwarehouse/lucene/contrib/analyzers/common/src/test/org/apache/lucene/analysis/shingle/ShingleAnalyzerWrapperTest.java.shtml).
If you build the query as a BooleanQuery directly from the analyzer's token
stream, instead of going through QueryParser, it seems to work. The following works:
    BooleanQuery query = getShingleBooleanQuery(analyzer, title, fieldToSearch);
    TopDocs hits = searcher.search(query, 10);

where

    private static BooleanQuery getShingleBooleanQuery(Analyzer analyzer,
            String qs, String fieldToSearch) throws Exception {
        BooleanQuery q = new BooleanQuery();
        // Analyze the whole string in one pass so the ShingleFilter sees adjacent tokens.
        TokenStream ts = analyzer.tokenStream(fieldToSearch, new StringReader(qs));
        CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            // Each emitted token (unigram or shingle) becomes an optional clause.
            String termText = termAtt.toString();
            q.add(new TermQuery(new Term(fieldToSearch, termText)),
                    BooleanClause.Occur.SHOULD);
        }
        // Finish and release the stream once all tokens are consumed.
        ts.end();
        ts.close();
        System.out.println("... parsed query: " + q);
        return q;
    }
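
A likely explanation for the difference (an assumption, not confirmed in this
thread): QueryParser splits the query string on whitespace before it calls the
analyzer, so the shingle filter only ever sees one word at a time and has no
neighbour to pair it with. The plain-Java sketch below does not call Lucene at
all; the class and method names (ShingleSketch, analyzeWhole, analyzePerChunk)
are hypothetical and only mimic bigram shingling to show the effect.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; none of this is Lucene API.
final class ShingleSketch {

    // Emit optional unigrams plus bigrams, interleaved by position.
    static List<String> shingles(List<String> tokens, boolean outputUnigrams) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < tokens.size(); i++) {
            if (outputUnigrams) {
                out.add(tokens.get(i));
            }
            if (i + 1 < tokens.size()) {
                out.add(tokens.get(i) + " " + tokens.get(i + 1));
            }
        }
        return out;
    }

    // Whole string analyzed in one pass, as getShingleBooleanQuery does.
    static List<String> analyzeWhole(String text, boolean outputUnigrams) {
        return shingles(split(text), outputUnigrams);
    }

    // Each whitespace-separated chunk analyzed on its own
    // (the assumed QueryParser behaviour).
    static List<String> analyzePerChunk(String text, boolean outputUnigrams) {
        List<String> out = new ArrayList<String>();
        for (String chunk : split(text)) {
            out.addAll(shingles(Arrays.asList(chunk), outputUnigrams));
        }
        return out;
    }

    // Split on runs of whitespace; an empty input yields no tokens.
    static List<String> split(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<String>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        String s = "simple sentences rule";
        System.out.println(analyzeWhole(s, true));
        System.out.println(analyzePerChunk(s, true));
        System.out.println(analyzePerChunk(s, false));
    }
}
```

Under that assumption, per-chunk analysis yields only the single words when
unigrams are on (matching "t:simple t:sentences t:rule") and nothing at all
when unigrams are off (matching the empty query), while whole-string analysis
yields the bigrams as well.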
Thank you (again) for your help
Peyman
On Oct 11, 2011, at 3:51 PM, Ian Lea wrote:
> Something does appear dodgy here. Using 3.4.0 the following very
> simple code, with no custom classes
>
> ShingleAnalyzerWrapper saw = new ShingleAnalyzerWrapper(LUCENE_34);
> QueryParser qp = new QueryParser(LUCENE_34, "t", saw);
> String s = "simple sentences rule";
> Query q = qp.parse(s);
> System.out.printf("%s parsed to %s\n", s, q);
>
> produces
>
> simple sentences rule parsed to t:simple t:sentences t:rule
>
> Like you, I would have expected there to be some shingles in there.
> Are we both missing something?
>
>
> --
> Ian.
>
>
> On Tue, Oct 11, 2011 at 3:25 PM, Peyman Faratin <[email protected]>
> wrote:
>> Hi
>>
>> I have the following shinglefilter (Lucene 3.2)
>>
>> public TokenStream tokenStream(String fieldName, Reader reader) {
>> StandardTokenizer first = new
>> StandardTokenizer(Version.LUCENE_32, reader);
>> StandardFilter second = new
>> StandardFilter(Version.LUCENE_32,first);
>> LowerCaseFilter third = new
>> LowerCaseFilter(Version.LUCENE_32,second);
>> StopFilter fourth = new
>> StopFilter(Version.LUCENE_32,third,Stopwords);
>> PositionFilter fifth = new PositionFilter(fourth);
>> ShingleFilter filter = new ShingleFilter(fifth,shingleSize);
>> return filter;
>> }
>>
>> that produces the following token stream given the sentence
>>
>> "please parse this sentence into a shingle of size 2. I'll pay $2 for it"
>>
>> 1: [_ parse:7->12:shingle]
>> 2: [parse:7->12:<ALPHANUM>] [parse sentence:7->26:shingle]
>> 3: [sentence:18->26:<ALPHANUM>] [sentence shingle:18->41:shingle]
>> 4: [shingle:34->41:<ALPHANUM>] [shingle size:34->49:shingle]
>> 5: [size:45->49:<ALPHANUM>] [size 2:45->51:shingle]
>> 6: [2:50->51:<NUM>] [2 pay:50->61:shingle]
>> 7: [pay:58->61:<ALPHANUM>] [pay 2:58->64:shingle]
>> 8: [2:63->64:<NUM>]
>>
>> The query analyzer produces the following analyzed query for the field
>> "titleShingled" for above sentence:
>>
>> ...... analyzed query:titleShingled:parse titleShingled:sentence
>> titleShingled:shingle titleShingled:size titleShingled:2 titleShingled:pay
>> titleShingled:2
>>
>> As you can see there are no bigram shingles in the query. I tried removing the
>> unigrams from the token stream (using filter.setOutputUnigrams(false) in the
>> above shingle filter), but even though the shingles seem to be fine, the query
>> is empty:
>>
>>
>> 1: [_ parse:7->12:shingle]
>> 2: [parse sentence:7->26:shingle]
>> 3: [sentence shingle:18->41:shingle]
>> 4: [shingle size:34->49:shingle]
>> 5: [size 2:45->51:shingle]
>> 6: [2 pay:50->61:shingle]
>> 7: [pay 2:58->64:shingle]
>>
>> ...... analyzed query:
>>
>> My goal is to index both unigrams and bigrams but first try to search on
>> bigrams. I think it is the queryparser that is parsing the shingles in a
>> manner that I am not understanding properly.
>>
>> QueryParser parser = new
>> QueryParser(Version.LUCENE_32,"titleShingled",new
>> ShinglesAnalyzer(2,Stopwords));
>>
>> Any help would be very much appreciated
>>
>> Peyman
>>
>>
>