Trouble with Shingle filter and query parsing / expansion

Mark Bennett Tue, 11 Aug 2009 09:17:54 -0700

I've got an index building with the shingle filter and I can see the
compound terms with Luke, etc.  So far so good.  One detail, I did tell it
to not emit unigrams - I've got single words covered in a normal field.


And a bit of poking around the other day explained why shingle queries
weren't working with the dismax handler in 1.4, also fine, I believe I
understand now.

But switching to the standard query handler, I still don't get proper
multi-word shingle handling in any query, either via the web interface or
the various Java calls.  I'm guessing it has to do with the order tokens are
parsed in, but if so I'm not sure what the workaround is.

Some things I've tried:

Standard Solr query:
...&q=shingle_field:hello+world&debugQuery=true

Standard Solr query, with the default field set to the shingle field:
...&q=hello+world&debugQuery=true

Standard Solr query, with the default field set to the shingle field and the
terms quoted as a phrase:
...&q="hello+world"&debugQuery=true

I switched over to Java.  Regular queries worked pretty easily, I could
print them out.  But attempts to conjure a shingle query always produce
nothing.

// fieldName = shingle field
SolrQueryParser qp = new SolrQueryParser( schema, fieldName );
Query q = qp.parse( "hello world" );
System.out.println( "Query Object = " + q );

SolrQuery q = new SolrQuery();
q.addField( fieldName );  // Just setting a return field I think....
q.setQuery( "hello world" );
System.out.println( "Query Object = " + q );

// And I figured this one wouldn't work:
SolrQueryParser qp = new SolrPluginUtils.DisjunctionMaxQueryParser(
                     schema, fieldName );
Query q = qp.parse( "hello world" );
System.out.println( "Query Object = " + q );

Looking at the constructors for
org.apache.lucene.analysis.shingle.ShingleFilter they all seem to want a
token stream, vs. a string.  But I think the default query entry points into
Solr are what's getting me to the single token at a time problem.
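
A toy model of the suspected "single token at a time" problem may make it
easier to reason about.  This is plain Java with no Lucene or Solr classes;
the class and method names (ShingleSketch, analyzeWhole, analyzePerChunk) are
made up for illustration, and the assumption that the query parser splits on
whitespace before analysis is a guess about the cause, not a confirmed
diagnosis.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: hypothetical names, no Lucene or Solr code involved.
class ShingleSketch {

    // Bigram shingles over a token list with unigrams suppressed
    // (loosely mirrors maxShingleSize="2" outputUnigrams="false").
    static List<String> shingles(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return out;
    }

    // Index-time view: the whole field value is one token stream.
    static List<String> analyzeWhole(String text) {
        return shingles(Arrays.asList(text.trim().split("\\s+")));
    }

    // Assumed query-time view: the parser splits on whitespace first,
    // then analyzes each chunk separately, so no chunk has a neighbor.
    static List<String> analyzePerChunk(String query) {
        List<String> out = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            out.addAll(analyzeWhole(chunk));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("whole stream: " + analyzeWhole("hello world"));
        System.out.println("per chunk:    " + analyzePerChunk("hello world"));
    }
}
```

Under that assumption the whole-stream path yields the single shingle
"hello world", while the per-chunk path yields no terms at all, because a
lone word has nothing to pair with once unigrams are switched off.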

I did verify that it's finding my schema, and if I put a non-existent field
name in there, it certainly notices.    I've tried with and without the
PositionFilterFactory filter.  If I comment out the shingle stage everything
works.

    <fieldType name="text_shingle" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="false"
                />
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="0"
                generateNumberParts="0"
                catenateWords="1"
                catenateNumbers="1"
                catenateAll="0"
                splitOnCaseChange="0"
                stemEnglishPossessive="0"
                preserveOriginal="0"
        />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory"
                language="English"
                protected="protwords.txt"/>
        <filter class="solr.ShingleFilterFactory"
                maxShingleSize="2"
                outputUnigrams="false"/>
        <filter class="solr.PositionFilterFactory" />
      </analyzer>
    </fieldType>


--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
