[ 
https://issues.apache.org/jira/browse/SOLR-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13199147#comment-13199147
 ] 

Jan Høydahl commented on SOLR-3085:
-----------------------------------

bq. i have a nagging feeling that there are non-stopword cases that would be 
indistinguishable (to the parser) from this type of stopword case, and thus 
would also trigger this logic undesirably, but i can't articulate what they 
might be off the top of my head.

A potentially difficult one is this multi-language example: {{&qf=title_no 
title_en tags}}. Each of these fields may have its own stopword list; say 
title_no has the stopword "men" (Norwegian for "but") and title_en has the 
stopword "the". Then we query {{q=the men}}. The user would expect it to 
return ENGLISH docs matching "men", since "the" is an English stopword.

Today we'd get:
{noformat}
+((DisjunctionMaxQuery((title_no:the | tags:the)~0.01) 
DisjunctionMaxQuery((title_en:men | tags:men)~0.01))~2)
{noformat}

In this case with mm=100% we'd likely get 0 hits, given that "the" is not 
common in either title_no or tags. However, the parser cannot know whether 
the user's real information need is "the" or "men", since each is a stopword 
for a different field.
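
To make the clause structure above concrete, here is a minimal Python sketch of 
how per-field stopword removal leaves each query term with a different set of 
fields. It only mimics the effect and is not Solr's implementation; the 
STOPWORDS table and the dismax_clauses helper are hypothetical names, with the 
stopword lists taken from the example.

```python
# Hypothetical per-field stopword lists, taken from the example above.
STOPWORDS = {
    "title_no": {"men"},
    "title_en": {"the"},
    "tags": set(),
}


def dismax_clauses(query, fields):
    """Return one list of 'field:term' strings per query term.

    A field drops out of a term's clause when the term is a stopword
    for that field, imitating what per-field analysis would do.
    """
    clauses = []
    for term in query.split():
        clause = [f"{field}:{term}" for field in fields
                  if term not in STOPWORDS[field]]
        if clause:  # a term that is a stopword in every field yields no clause
            clauses.append(clause)
    return clauses


clauses = dismax_clauses("the men", ["title_no", "title_en", "tags"])
for clause in clauses:
    print(" | ".join(clause))
```

Both terms survive as clauses, so mm=100% still requires 2 matches, but 
neither clause covers all three fields.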

Now, all DisMax clauses in this example have had at least one stopword pruned, 
so using the "mm decrement" strategy would change mm from 2 to 0, which would 
turn this into an OR query and of course return results. This is a 
compromise, so a better option in this special case would probably be to use 
eDisMax's "smart" conditional stopword removal [1], but that requires a change 
of fieldType.
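
The "mm decrement" arithmetic can be sketched as follows. This assumes the 
rule is "lower mm by one for every clause that lost at least one field to 
stopword pruning", which is one reading of the strategy and not a confirmed 
implementation; decremented_mm and its parameters are hypothetical names.

```python
def decremented_mm(mm, fields_per_clause, total_fields):
    """Lower mm by one for each clause that is missing at least one field.

    fields_per_clause holds the number of fields left in each DisMax clause;
    a clause with fewer than total_fields is assumed to have had a stopword
    pruned in some field.
    """
    pruned = sum(1 for n in fields_per_clause if n < total_fields)
    return max(0, mm - pruned)


# Two clauses, each with 2 of the 3 qf fields left, starting from mm=2.
print(decremented_mm(2, [2, 2], 3))
```

With the example above, both clauses count as pruned, so mm drops from 2 to 0.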

The "convert to boost query" approach would only work when we have at least one 
clause without stop words, since we cannot query ONLY with bq. Say two of my 
four query terms {{q=the best cheap holiday}} are stop words, and mm=100%. So 
we remove the two stop clauses from the BooleanQuery and reduce mm accordingly 
from 4 (100%) to 2, and add the two stop clauses as BQs. This approach would 
also work for mm<100% cases, since we only count mm clauses from the non-stop 
clauses.
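
A rough sketch of the "convert to boost query" bookkeeping, assuming we already 
know which terms produced stop clauses. The function name, the choice of "the" 
and "best" as the two stop terms, and rounding the percentage down are all 
assumptions made for illustration.

```python
def split_stop_clauses(terms, stop_terms, mm_percent):
    """Split terms into main clauses and boost clauses, then recompute mm.

    mm is counted over the non-stop clauses only, rounded down.
    """
    main = [t for t in terms if t not in stop_terms]
    boosts = [t for t in terms if t in stop_terms]
    if not main:
        # We cannot query with boost queries only.
        raise ValueError("need at least one clause without stopwords")
    mm = len(main) * mm_percent // 100
    return main, boosts, mm


terms = "the best cheap holiday".split()
main, boosts, mm = split_stop_clauses(terms, {"the", "best"}, 100)
print(main, boosts, mm)
```

Here the two remaining clauses are both required (mm=2), while the two stop 
clauses only influence scoring.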

----
[1] For the special case of all clauses being stop clauses, eDisMax's existing 
"smart" conditional stopword handling could perhaps be another solution? For 
those unfamiliar with it, you can specify {{&stopwords=true}} (which is the 
default) and eDisMax will remove stopwords for you instead of letting Analysis 
do it. It requires that you don't have StopFilterFactory in your Analysis. Now, 
if ALL query terms are stopwords, eDisMax will not remove them, to support 
queries like "Who is the who?". (Q: How does eDisMax pick up which stopword 
dictionary(ies) to use?). It's of no use to those removing stopwords in their 
"index" analysis though.
                
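The conditional removal described in [1] boils down to "drop stopwords unless 
that would leave nothing". A minimal sketch of that behaviour, with a 
hypothetical function name and stopword set, and not taken from the eDisMax 
source:

```python
def conditional_stopword_removal(terms, stopwords):
    """Drop stopwords, but keep every term if all of them are stopwords."""
    kept = [t for t in terms if t.lower() not in stopwords]
    return kept if kept else list(terms)


stopwords = {"who", "is", "the"}
print(conditional_stopword_removal(["the", "men"], stopwords))
print(conditional_stopword_removal(["Who", "is", "the", "who"], stopwords))
```

The first call keeps only "men"; the second keeps all four terms because 
removing them would empty the query.
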
> Fix the dismax/edismax stopwords mm issue
> -----------------------------------------
>
>                 Key: SOLR-3085
>                 URL: https://issues.apache.org/jira/browse/SOLR-3085
>             Project: Solr
>          Issue Type: Bug
>          Components: search
>            Reporter: Jan Høydahl
>              Labels: MinimumShouldMatch, dismax, stopwords
>             Fix For: 3.6, 4.0
>
>
> As discussed here http://search-lucene.com/m/Wr7iz1a95jx and here 
> http://search-lucene.com/m/Yne042qEyCq1 and here 
> http://search-lucene.com/m/RfAp82nSsla DisMax has an issue with stopwords if 
> not all fields used in QF have exactly the same stopword lists.
> The typical solution is to not use stopwords, to harmonize stopword lists 
> across all fields in your QF, or to relax the MM to a lower percentage. 
> Sometimes these are not acceptable workarounds, and we should find a better 
> solution.
