I've got an index building with the shingle filter, and I can see the compound terms with Luke, etc. So far, so good. One detail: I told it not to emit unigrams, since I've got single words covered in a normal field.
A bit of poking around the other day explained why shingle queries weren't working with the dismax handler in 1.4, and I believe I understand that now. But after switching to the standard query handler, I still don't get proper multi-word shingle handling in any query, either via the web interface or through the various Java calls. I'm guessing it has to do with the order in which tokens are parsed, but if so I'm not sure what the workaround is.

Some things I've tried:

Standard Solr query:

    ...&q=shingle_field:hello+world&debugQuery=true

Standard Solr query, with the default field set to the shingle field:

    ...&q=hello+world&debugQuery=true

Standard Solr query, with the default field set to the shingle field, this time as a quoted phrase:

    ...&q="hello+world"&debugQuery=true

Then I switched over to Java. Regular queries worked pretty easily, and I could print them out. But attempts to conjure a shingle query always produce nothing.

    // fieldName = shingle field
    SolrQueryParser qp = new SolrQueryParser( schema, fieldName );
    Query q = qp.parse( "hello world" );
    System.out.println( "Query Object = " + q );

    SolrQuery q = new SolrQuery();
    q.addField( fieldName );  // Just setting a return field I think....
    q.setQuery( "hello world" );
    System.out.println( "Query Object = " + q );

    // And I figured this one wouldn't work:
    SolrQueryParser qp = new SolrPluginUtils.DisjunctionMaxQueryParser( schema, fieldName );
    Query q = qp.parse( "hello world" );
    System.out.println( "Query Object = " + q );

Looking at the constructors for org.apache.lucene.analysis.shingle.ShingleFilter, they all seem to want a token stream rather than a string. But I think the default query entry points into Solr are what's leading me to the single-token-at-a-time problem.

I did verify that it's finding my schema: if I put a non-existent field name in there, it certainly notices. I've tried with and without the PositionFilterFactory filter. If I comment out the shingle stage, everything works.
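To make my guess about token order concrete, here is a minimal sketch in plain Java. It is not Lucene's ShingleFilter and does not use any Solr classes; the class and method names (ShingleSketch, shingles, analyzeWordByWord, analyzeWholeString) are made up for illustration, and it ignores the stop word, word delimiter, lowercase and stemming stages. It assumes, and this is only my assumption, that the query parser splits on whitespace first and hands the analyzer one word at a time. Under that assumption, a shingle stage that emits no unigrams has nothing to emit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShingleSketch {

    // Build word n-grams of size 2..maxSize from a token list.
    // Unigrams are never emitted, in the spirit of outputUnigrams="false".
    static List<String> shingles(List<String> tokens, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder sb = new StringBuilder(tokens.get(start));
            for (int size = 2; size <= maxSize && start + size <= tokens.size(); size++) {
                sb.append(' ').append(tokens.get(start + size - 1));
                out.add(sb.toString());
            }
        }
        return out;
    }

    // Hypothetical model of "parser splits first": each word is analyzed alone,
    // so the shingle stage only ever sees a one-token stream.
    static List<String> analyzeWordByWord(String query, int maxSize) {
        List<String> out = new ArrayList<>();
        for (String word : query.trim().split("\\s+")) {
            out.addAll(shingles(Arrays.asList(word), maxSize));
        }
        return out;
    }

    // Model of what happens at index time: the whole text goes through as one stream.
    static List<String> analyzeWholeString(String query, int maxSize) {
        return shingles(Arrays.asList(query.trim().split("\\s+")), maxSize);
    }

    public static void main(String[] args) {
        System.out.println("word by word: " + analyzeWordByWord("hello world", 2));
        System.out.println("whole string: " + analyzeWholeString("hello world", 2));
    }
}
```

If the real parser behaves like analyzeWordByWord, that would match the empty queries I'm seeing, and the index-time terms would look like the analyzeWholeString result.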
Here is the field type definition:

<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="false" />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0"
            generateNumberParts="0"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="0"
            splitOnCaseChange="0"
            stemEnglishPossessive="0"
            preserveOriginal="0" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="false"/>
    <filter class="solr.PositionFilterFactory" />
  </analyzer>
</fieldType>

--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513