Hello. I have an application where I try to match longer queries (sentences)
to short documents (search phrases). Typically, the documents are 3-5 terms
in length. My problem is that the phrase field specified via "pf" does not
seem to match in most cases, and I am stumped.
Please help!

For instance, when my query is "should I buy a house now while the rates are
low. We filed BR 2 yrs ago. Rent now, w/ some sch loan debt"

I expect the document "buy a house" to score much higher than "house
loan rates".
However, the latter is the document that always scores higher.


I tried to do this the following way (Solr 3.1):
1. Score phrase matches high
2. Score single word matches lower
3. Use dismax with a "mm" of 1, and very high boost for exact phrase match.

I used the "text" definition in the schema for the single words, and the
following for the phrase:

    <fieldType name="shingle" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1" catenateAll="0"
                splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
                outputUnigrams="false"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="0" catenateNumbers="0" catenateAll="0"
                splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
                outputUnigrams="false"/>
      </analyzer>
    </fieldType>

and my schema fields look like this:

   <field name="kw_stopped" type="text_en" indexed="true" omitNorms="True"/>

   <!-- keywords almost as is - to provide truer match for full phrases -->
   <field name="kw_phrases" type="shingle" indexed="true" omitNorms="True"/>
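
To spell out what I expect the "shingle" type to put into kw_phrases, here
is a rough Python sketch. It is not Solr code: the function name is mine, it
ignores the word delimiter, synonym and protected-words filters, and it
assumes a minimum shingle size of 2 (which I believe is the default) and a
single space as the token separator. The query is a lowercased, shortened
version of my real one.

```python
def shingles(tokens, min_size=2, max_size=3):
    """Return word n-grams of min_size..max_size tokens, without unigrams.

    Rough stand-in for ShingleFilterFactory with maxShingleSize=3 and
    outputUnigrams=false; min_size=2 is an assumption about the default.
    """
    out = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    return out


# Shingles I expect to be indexed for the document "buy a house".
doc_shingles = shingles("buy a house".split())

# Shingles I expect at query time (simplified, lowercased query).
query_shingles = shingles(
    "should i buy a house now while the rates are low".split()
)

print(doc_shingles)
print(sorted(set(doc_shingles) & set(query_shingles)))
```

If the filter behaves the way this sketch assumes, every shingle of
"buy a house" also appears among the query shingles, which is why I expected
kw_phrases to contribute to the score.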

This is my search handler config:

  <requestHandler name="edismax" class="solr.SearchHandler" default="true">
    <lst name="defaults">
     <str name="defType">edismax</str>
     <str name="echoParams">explicit</str>
     <float name="tie">0.1</float>
     <str name="fl">
       kpid,advid,campaign,keywords
     </str>
     <str name="mm">1</str>
     <str name="qf">
       kw_stopped^1.0
     </str>
     <str name="pf">
       kw_phrases^50.0
     </str>
     <int name="ps">3</int>
     <int name="qs">3</int>
     <str name="q.alt">*:*</str>
     <!-- example highlighter config, enable per-query with hl=true -->
     <str name="hl.fl">keywords</str>
     <!-- for this field, we want no fragmenting, just highlighting -->
     <str name="f.name.hl.fragsize">0</str>
     <!-- instructs Solr to return the field itself if no query terms are
          found -->
     <str name="f.name.hl.alternateField">title</str>
     <str name="f.text.hl.fragmenter">regex</str> <!-- defined below -->
    </lst>
  </requestHandler>

These are the match score debugQuery explanations:

8.480054E-4 = (MATCH) sum of:
  8.480054E-4 = (MATCH) product of:
    0.0031093531 = (MATCH) sum of:
      0.0015556295 = (MATCH) weight(kw_stopped:hous in 1812), product of:
        2.8209004E-4 = queryWeight(kw_stopped:hous), product of:
          5.514656 = idf(docFreq=25, maxDocs=2375)
          5.1152787E-5 = queryNorm
        5.514656 = (MATCH) fieldWeight(kw_stopped:hous in 1812), product of:
          1.0 = tf(termFreq(kw_stopped:hous)=1)
          5.514656 = idf(docFreq=25, maxDocs=2375)
          1.0 = fieldNorm(field=kw_stopped, doc=1812)
      8.192911E-4 = (MATCH) weight(kw_stopped:rate in 1812), product of:
        2.0471694E-4 = queryWeight(kw_stopped:rate), product of:
          4.002068 = idf(docFreq=117, maxDocs=2375)
          5.1152787E-5 = queryNorm
        4.002068 = (MATCH) fieldWeight(kw_stopped:rate in 1812), product of:
          1.0 = tf(termFreq(kw_stopped:rate)=1)
          4.002068 = idf(docFreq=117, maxDocs=2375)
          1.0 = fieldNorm(field=kw_stopped, doc=1812)
      7.344327E-4 = (MATCH) weight(kw_stopped:loan in 1812), product of:
        1.9382538E-4 = queryWeight(kw_stopped:loan), product of:
          3.7891462 = idf(docFreq=145, maxDocs=2375)
          5.1152787E-5 = queryNorm
        3.7891462 = (MATCH) fieldWeight(kw_stopped:loan in 1812), product of:
          1.0 = tf(termFreq(kw_stopped:loan)=1)
          3.7891462 = idf(docFreq=145, maxDocs=2375)
          1.0 = fieldNorm(field=kw_stopped, doc=1812)
    0.27272728 = coord(3/11)

for "house loan rates" vs

8.480054E-4 = (MATCH) sum of:
  8.480054E-4 = (MATCH) product of:
    0.0031093531 = (MATCH) sum of:
      0.0015556295 = (MATCH) weight(kw_stopped:hous in 1812), product of:
        2.8209004E-4 = queryWeight(kw_stopped:hous), product of:
          5.514656 = idf(docFreq=25, maxDocs=2375)
          5.1152787E-5 = queryNorm
        5.514656 = (MATCH) fieldWeight(kw_stopped:hous in 1812), product of:
          1.0 = tf(termFreq(kw_stopped:hous)=1)
          5.514656 = idf(docFreq=25, maxDocs=2375)
          1.0 = fieldNorm(field=kw_stopped, doc=1812)
      8.192911E-4 = (MATCH) weight(kw_stopped:rate in 1812), product of:
        2.0471694E-4 = queryWeight(kw_stopped:rate), product of:
          4.002068 = idf(docFreq=117, maxDocs=2375)
          5.1152787E-5 = queryNorm
        4.002068 = (MATCH) fieldWeight(kw_stopped:rate in 1812), product of:
          1.0 = tf(termFreq(kw_stopped:rate)=1)
          4.002068 = idf(docFreq=117, maxDocs=2375)
          1.0 = fieldNorm(field=kw_stopped, doc=1812)
      7.344327E-4 = (MATCH) weight(kw_stopped:loan in 1812), product of:
        1.9382538E-4 = queryWeight(kw_stopped:loan), product of:
          3.7891462 = idf(docFreq=145, maxDocs=2375)
          5.1152787E-5 = queryNorm
        3.7891462 = (MATCH) fieldWeight(kw_stopped:loan in 1812), product of:
          1.0 = tf(termFreq(kw_stopped:loan)=1)
          3.7891462 = idf(docFreq=145, maxDocs=2375)
          1.0 = fieldNorm(field=kw_stopped, doc=1812)
    0.27272728 = coord(3/11)

for "buy a house".
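
To check that I am reading the explanation correctly, here is a small Python
sketch that recomputes the score from the numbers printed above. The
variable and function names are mine; the formula is simply my reading of
the explain tree (queryWeight = idf * queryNorm, fieldWeight = tf * idf *
fieldNorm, the per-term weights summed and then multiplied by coord).

```python
# Values copied from the debugQuery explanation above.
query_norm = 5.1152787e-5
coord = 3 / 11  # 3 of 11 optional clauses matched
idf = {"hous": 5.514656, "rate": 4.002068, "loan": 3.7891462}


def term_weight(idf_value, tf=1.0, field_norm=1.0):
    """Weight of one kw_stopped term as laid out in the explain tree."""
    query_weight = idf_value * query_norm
    field_weight = tf * idf_value * field_norm
    return query_weight * field_weight


# Sum of the three kw_stopped term weights, then scaled by coord.
total = sum(term_weight(value) for value in idf.values())
score = total * coord

print(f"sum of term weights = {total:.7e}")
print(f"final score         = {score:.7e}")
```

The result agrees with the reported score up to rounding, so the whole score
comes from the three kw_stopped terms and nothing is contributed by
kw_phrases.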

Unless I use the exact phrase "buy a house" as the query, the kw_phrases
field never shows up in the explanation.

What am I doing wrong? Please help!

thanks,
Vijay
