mm, tie, qs, ps and CJKBigramFilter and edismax and dismax

Naomi Dushay Tue, 03 Sep 2013 17:56:01 -0700

When I have a field using CJKBigramFilter, CJK queries get a different 
parsedquery than non-CJK queries.


  (旧小说 is 3 chars, so 2 bigrams)
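
As a rough illustration of that count (a minimal sketch, not the CJKBigramFilter implementation): a run of n characters yields n - 1 overlapping bigrams. The helper name `bigrams` is made up for this example, and it ignores everything else the real analysis chain does (width and case folding, script detection, unigram output).

```python
def bigrams(text: str) -> list[str]:
    """Return the overlapping two-character windows of `text`.

    Hypothetical helper for illustration only; it assumes every
    character in `text` belongs to one unbroken CJK run.
    """
    return [text[i:i + 2] for i in range(len(text) - 1)]


if __name__ == "__main__":
    # The three-character query from this message.
    print(bigrams("旧小说"))
```

The two strings it returns are the same two terms that appear as bi_fld clauses in the parsedquery output.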

args sent in:       q={!qf=bi_fld}旧小说&pf=&pf2=&pf3=

debugQuery:
   <str name="rawquerystring">{!qf=bi_fld}旧小说</str>
   <str name="querystring">{!qf=bi_fld}旧小说</str>
   <str name="parsedquery">(+DisjunctionMaxQuery((((bi_fld:旧小 bi_fld:小说)~2))~0.01) ())/no_coord</str>
   <str name="parsedquery_toString">+(((bi_fld:旧小 bi_fld:小说)~2))~0.01 ()</str>


If I use a non-CJK query string with the same field:

args sent in:      q={!qf=bi_fld}foo bar&pf=&pf2=&pf3=

debugQuery:
   <str name="rawquerystring">{!qf=bi_fld}foo bar</str>
   <str name="querystring">{!qf=bi_fld}foo bar</str>
   <str name="parsedquery">(+((DisjunctionMaxQuery((bi_fld:foo)~0.01) DisjunctionMaxQuery((bi_fld:bar)~0.01))~2))/no_coord</str>
   <str name="parsedquery_toString">+(((bi_fld:foo)~0.01 (bi_fld:bar)~0.01)~2)</str>


Why are the parsedquery_toString formulas different? And is there any 
difference in the actual relevancy formula?

How can you tell the difference between the MinNrShouldMatch and a qs, ps, or 
tie value, if they are all represented as ~n in the parsedquery string?


To try to get a handle on qs, ps, tie and mm:

 args:  q={!qf=bi_fld pf=bi_fld}"a b" c d&qs=5&ps=4

debugQuery:
  <str name="rawquerystring">{!qf=bi_fld pf=bi_fld}"a b" c d</str>
  <str name="querystring">{!qf=bi_fld pf=bi_fld}"a b" c d</str>
  <str name="parsedquery">(+((DisjunctionMaxQuery((bi_fld:"a b"~5)~0.01) DisjunctionMaxQuery((bi_fld:c)~0.01) DisjunctionMaxQuery((bi_fld:d)~0.01))~3) DisjunctionMaxQuery((bi_fld:"c d"~4)~0.01))/no_coord</str>
  <str name="parsedquery_toString">+(((bi_fld:"a b"~5)~0.01 (bi_fld:c)~0.01 (bi_fld:d)~0.01)~3) (bi_fld:"c d"~4)~0.01</str>


I get that qs, the query slop, is for explicit phrases in the query, so 
"a b"~5 makes sense. I also get that ps is for boosting of phrases, so I get 
(bi_fld:"c d"~4) … but where is (bi_fld:"a b c d"~4)?


Using dismax (instead of edismax):

args:  q={!dismax qf=bi_fld pf=bi_fld}"a b" c d&qs=5&ps=4

debugQuery:
  <str name="rawquerystring">{!dismax qf=bi_fld pf=bi_fld}"a b" c d</str>
  <str name="querystring">{!dismax qf=bi_fld pf=bi_fld}"a b" c d</str>
  <str name="parsedquery">(+((DisjunctionMaxQuery((bi_fld:"a b"~5)~0.01) DisjunctionMaxQuery((bi_fld:c)~0.01) DisjunctionMaxQuery((bi_fld:d)~0.01))~3) DisjunctionMaxQuery((bi_fld:"a b c d"~4)~0.01))/no_coord</str>
  <str name="parsedquery_toString">+(((bi_fld:"a b"~5)~0.01 (bi_fld:c)~0.01 (bi_fld:d)~0.01)~3) (bi_fld:"a b c d"~4)~0.01</str>
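
The two requests above differ only in which parser handles the query. A minimal sketch for generating both debug URLs side by side follows. The endpoint in `BASE_URL` and the helper name `debug_url` are assumptions for illustration, and the edismax request names the parser explicitly in the local params rather than relying on the handler's defType as my request did.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the real host, core, and handler path.
BASE_URL = "http://localhost:8983/solr/select"


def debug_url(parser: str, user_query: str, qs: int, ps: int) -> str:
    """Build a debugQuery URL that runs `user_query` through `parser`."""
    local_params = "{!%s qf=bi_fld pf=bi_fld}" % parser
    params = {
        "q": local_params + user_query,
        "qs": qs,  # slop for explicit phrases in the query
        "ps": ps,  # slop for the pf phrase boost
        "debugQuery": "true",
    }
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    for parser in ("edismax", "dismax"):
        print(debug_url(parser, '"a b" c d', qs=5, ps=4))
```

Comparing the two parsedquery values from those responses shows whether the phrase boost covers "c d" or "a b c d".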


So is this an edismax bug?



FYI, I am running Solr 4.4. I have fields defined like so:
<fieldtype name="text_cjk_bi" class="solr.TextField" positionIncrementGap="10000" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory" />
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
    <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="false" />
  </analyzer>
</fieldtype>

The request handler uses edismax:

<requestHandler name="search" class="solr.SearchHandler" default="true">
<lst name="defaults">
<str name="defType">edismax</str>
<str name="q.alt">*:*</str>
<str name="mm">6&lt;-1 6&lt;90%</str>
<int name="qs">1</int>
<int name="ps">0</int>
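
For reference, here is a sketch of how I understand a conditional mm spec like the one above to be evaluated. It is a reading of the documented mm behavior, not Solr's code. The function name `min_should_match` is made up, and the rounding (truncation toward zero) is an assumption.

```python
def min_should_match(clause_count: int, spec: str) -> int:
    """Estimate how many of `clause_count` optional clauses must match.

    Illustrative only: conditional parts ("N<expr") are read left to
    right, and the expression of the last bound the clause count
    exceeds wins. Rounding is assumed to truncate toward zero.
    """
    spec = spec.strip()
    if "<" in spec:
        result = clause_count  # at or below the first bound, all are required
        for part in spec.split():
            bound, expr = part.split("<", 1)
            if clause_count <= int(bound):
                return result
            result = min_should_match(clause_count, expr)
        return result
    if spec.endswith("%"):
        percent = int(spec[:-1])
        magnitude = clause_count * abs(percent) // 100
        # A negative percentage means "all but that share of the clauses".
        result = clause_count - magnitude if percent < 0 else magnitude
    else:
        value = int(spec)
        # A negative integer means "all but that many clauses".
        result = clause_count + value if value < 0 else value
    return max(result, 0)


if __name__ == "__main__":
    for n in (2, 3, 7, 10):
        print(n, min_should_match(n, "6<-1 6<90%"))
```

Under that reading, the 2-clause and 3-clause queries in this message require all of their clauses, which is consistent with the ~2 and ~3 on the outer groups in the non-CJK parsedquery output.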
