Hello. I have an application where I try to match longer queries (sentences) to short documents (search phrases). Typically, the documents are 3-5 terms in length. My problem is that the phrase match on the phrase fields I specify via "pf" doesn't seem to happen in most cases, and I am stumped. Please help!
For instance, when my query is "should I buy a house now while the rates are low. We filed BR 2 yrs ago. Rent now, w/ some sch loan debt" I expect the document "buy a house" to score much higher than "house loan rates". However, the latter is the document that always scores higher.

I tried to do this the following way (Solr 3.1):

1. Score phrase matches high
2. Score single word matches lower
3. Use dismax with a "mm" of 1, and a very high boost for exact phrase match.

I used the "text" definition in the schema for the single words, and the following for the phrase:

<fieldType name="shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="false"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="false"/>
  </analyzer>
</fieldType>

and my schema fields look like this:

<field name="kw_stopped" type="text_en" indexed="true" omitNorms="True" />
<!-- keywords almost as is - to provide truer match for full phrases -->
<field name="kw_phrases" type="shingle" indexed="true" omitNorms="True" />

This is my search handler config:
<requestHandler name="edismax" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.1</float>
    <str name="fl"> kpid,advid,campaign,keywords </str>
    <str name="mm">1</str>
    <str name="qf"> kw_stopped^1.0 </str>
    <str name="pf"> kw_phrases^50.0 </str>
    <int name="ps">3</int>
    <int name="qs">3</int>
    <str name="q.alt">*:*</str>
    <!-- example highlighter config, enable per-query with hl=true -->
    <str name="hl.fl">keywords</str>
    <!-- for this field, we want no fragmenting, just highlighting -->
    <str name="f.name.hl.fragsize">0</str>
    <!-- instructs Solr to return the field itself if no query terms are found -->
    <str name="f.name.hl.alternateField">title</str>
    <str name="f.text.hl.fragmenter">regex</str> <!-- defined below -->
  </lst>
</requestHandler>

These are the match score debugQuery explanations:

8.480054E-4 = (MATCH) sum of:
  8.480054E-4 = (MATCH) product of:
    0.0031093531 = (MATCH) sum of:
      0.0015556295 = (MATCH) weight(kw_stopped:hous in 1812), product of:
        2.8209004E-4 = queryWeight(kw_stopped:hous), product of:
          5.514656 = idf(docFreq=25, maxDocs=2375)
          5.1152787E-5 = queryNorm
        5.514656 = (MATCH) fieldWeight(kw_stopped:hous in 1812), product of:
          1.0 = tf(termFreq(kw_stopped:hous)=1)
          5.514656 = idf(docFreq=25, maxDocs=2375)
          1.0 = fieldNorm(field=kw_stopped, doc=1812)
      8.192911E-4 = (MATCH) weight(kw_stopped:rate in 1812), product of:
        2.0471694E-4 = queryWeight(kw_stopped:rate), product of:
          4.002068 = idf(docFreq=117, maxDocs=2375)
          5.1152787E-5 = queryNorm
        4.002068 = (MATCH) fieldWeight(kw_stopped:rate in 1812), product of:
          1.0 = tf(termFreq(kw_stopped:rate)=1)
          4.002068 = idf(docFreq=117, maxDocs=2375)
          1.0 = fieldNorm(field=kw_stopped, doc=1812)
      7.344327E-4 = (MATCH) weight(kw_stopped:loan in 1812), product of:
        1.9382538E-4 = queryWeight(kw_stopped:loan), product of:
          3.7891462 = idf(docFreq=145, maxDocs=2375)
          5.1152787E-5 = queryNorm
        3.7891462 = (MATCH) fieldWeight(kw_stopped:loan in 1812), product of:
          1.0 = tf(termFreq(kw_stopped:loan)=1)
          3.7891462 = idf(docFreq=145, maxDocs=2375)
          1.0 = fieldNorm(field=kw_stopped, doc=1812)
    0.27272728 = coord(3/11)

for "house loan rates" vs

8.480054E-4 = (MATCH) sum of:
  8.480054E-4 = (MATCH) product of:
    0.0031093531 = (MATCH) sum of:
      0.0015556295 = (MATCH) weight(kw_stopped:hous in 1812), product of:
        2.8209004E-4 = queryWeight(kw_stopped:hous), product of:
          5.514656 = idf(docFreq=25, maxDocs=2375)
          5.1152787E-5 = queryNorm
        5.514656 = (MATCH) fieldWeight(kw_stopped:hous in 1812), product of:
          1.0 = tf(termFreq(kw_stopped:hous)=1)
          5.514656 = idf(docFreq=25, maxDocs=2375)
          1.0 = fieldNorm(field=kw_stopped, doc=1812)
      8.192911E-4 = (MATCH) weight(kw_stopped:rate in 1812), product of:
        2.0471694E-4 = queryWeight(kw_stopped:rate), product of:
          4.002068 = idf(docFreq=117, maxDocs=2375)
          5.1152787E-5 = queryNorm
        4.002068 = (MATCH) fieldWeight(kw_stopped:rate in 1812), product of:
          1.0 = tf(termFreq(kw_stopped:rate)=1)
          4.002068 = idf(docFreq=117, maxDocs=2375)
          1.0 = fieldNorm(field=kw_stopped, doc=1812)
      7.344327E-4 = (MATCH) weight(kw_stopped:loan in 1812), product of:
        1.9382538E-4 = queryWeight(kw_stopped:loan), product of:
          3.7891462 = idf(docFreq=145, maxDocs=2375)
          5.1152787E-5 = queryNorm
        3.7891462 = (MATCH) fieldWeight(kw_stopped:loan in 1812), product of:
          1.0 = tf(termFreq(kw_stopped:loan)=1)
          3.7891462 = idf(docFreq=145, maxDocs=2375)
          1.0 = fieldNorm(field=kw_stopped, doc=1812)
    0.27272728 = coord(3/11)

for "buy a house".

Unless I try an exact phrase "buy a house" as the query, the kw_phrases never shows up in the explanation. What am I doing wrong? Please help!

Thanks,
Vijay
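
P.S. To make my expectation concrete, here is a rough Python sketch (not Solr code) of the shingle overlap I would expect between the query and the two documents. It only lowercases and splits on whitespace, then builds 2- and 3-word shingles, which is my assumption of what ShingleFilterFactory emits with maxShingleSize="3" and outputUnigrams="false". It ignores WordDelimiterFilterFactory, the synonyms, and protwords.txt, and the function name is made up for illustration.

```python
def shingles(text, min_size=2, max_size=3):
    """Return the set of word n-grams (min_size..max_size words) in text.

    Assumption: a crude stand-in for whitespace tokenizing + lowercasing +
    shingling; it does not reproduce the rest of the Solr analyzer chain.
    """
    tokens = text.lower().split()
    result = set()
    for n in range(min_size, max_size + 1):
        for i in range(len(tokens) - n + 1):
            result.add(" ".join(tokens[i:i + n]))
    return result


query = ("should I buy a house now while the rates are low. "
         "We filed BR 2 yrs ago. Rent now, w/ some sch loan debt")

# Compare the query's shingles with each short document's shingles.
for doc in ("buy a house", "house loan rates"):
    shared = shingles(query) & shingles(doc)
    print(doc, "->", sorted(shared))
```

Under these assumptions "buy a house" shares all of its shingles with the query and "house loan rates" shares none, which is why I expected kw_phrases to contribute to the score of the first document.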