[
https://issues.apache.org/jira/browse/SOLR-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13199147#comment-13199147
]
Jan Høydahl commented on SOLR-3085:
-----------------------------------
bq. i have a nagging feeling that there are non-stopword cases that would be
indistinguishable (to the parser) from this type of stopword case, and thus
would also trigger this logic undesirably, but i can't articulate what they
might be off the top of my head.
A potentially difficult one is this multi-language example: {{&qf=title_no
title_en tags}}. Each of these fields may have its own stopword list; say
title_no has the stopword "men" (Norwegian for "but") and title_en has the
stopword "the". Then we query {{q=the men}}. The user would expect this to
return ENGLISH docs matching "men", since "the" is an English stopword.
Today we'd get:
{noformat}
+((DisjunctionMaxQuery((title_no:the | tags:the)~0.01)
DisjunctionMaxQuery((title_en:men | tags:men)~0.01))~2)
{noformat}
In this case with mm=100% we'd likely get 0 hits, given that "the" is not
common in either title_no or tags. However, the parser cannot know whether
the user's real information need is "the" or "men", since each of them is a
stopword for a different field.
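To make the mechanics concrete, here is a minimal Python sketch (not Solr code) of how per-field stopword lists end up producing DisMax clauses that cover different field sets. The {{STOPWORDS}} table and the helper names {{dismax_clauses}} and {{render}} are made up for illustration; in Solr the stopwords come from each fieldType's analysis chain.

```python
# Hypothetical per-field stopword lists, mirroring the example above.
STOPWORDS = {
    "title_no": {"men"},   # Norwegian stopword
    "title_en": {"the"},   # English stopword
    "tags": set(),         # no stopword filtering on this field
}


def dismax_clauses(query, qf):
    """Return one (term, surviving_fields) pair per query term.

    A field drops out of a clause when the term is a stopword for that field.
    A term that is a stopword in every field produces no clause at all.
    """
    clauses = []
    for term in query.lower().split():
        fields = [f for f in qf if term not in STOPWORDS[f]]
        if fields:
            clauses.append((term, fields))
    return clauses


def render(clauses):
    """Render the clauses in a DisjunctionMaxQuery-like notation."""
    return " ".join(
        "(" + " | ".join(f"{field}:{term}" for field in fields) + ")"
        for term, fields in clauses
    )


clauses = dismax_clauses("the men", ["title_no", "title_en", "tags"])
print(render(clauses))
# (title_no:the | tags:the) (title_en:men | tags:men)
```

Both clauses survive, so mm=100% still requires both of them to match, even though each term was silently dropped from one field.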
Now, every DisMax clause in this example has had at least one stopword pruned,
so using the "mm decrement" strategy would change mm from 2 to 0, which would
turn this into an OR query and of course return results. This is a
compromise, so a better option in this special case would probably be to use
eDisMax's "smart" conditional stopword removal [1], but that requires a change
of fieldType.
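A rough sketch of the "mm decrement" idea, under the assumption that mm is lowered by one for every clause in which at least one field pruned the term as a stopword. This is an illustration only, not the parser's implementation; {{effective_mm}} and the stopword table are hypothetical names.

```python
# Hypothetical per-field stopword lists (same example as above).
STOPWORDS = {"title_no": {"men"}, "title_en": {"the"}, "tags": set()}
QF = ["title_no", "title_en", "tags"]


def effective_mm(terms, qf, stopwords, mm_percent=100):
    """Return (mm, decremented_mm) for the given query terms.

    mm is computed from the number of surviving clauses; the decremented
    value subtracts one for each clause that lost a field to a stopword.
    """
    clause_count = 0
    pruned_count = 0
    for term in terms:
        kept = [f for f in qf if term not in stopwords.get(f, set())]
        if not kept:
            continue  # stopword in every field: no clause is created
        clause_count += 1
        if len(kept) < len(qf):
            pruned_count += 1  # assumption: one decrement per pruned clause
    mm = clause_count * mm_percent // 100  # percentages round down
    return mm, max(mm - pruned_count, 0)


mm, decremented = effective_mm(["the", "men"], QF, STOPWORDS)
print(mm, decremented)
# 2 0
```

With both clauses pruned, mm drops from 2 to 0 and every clause becomes optional, which is the OR behaviour described above.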
The "convert to boost query" approach would only work when we have at least one
clause without stopwords, since we cannot query ONLY with bq. Say two of the
four terms in {{q=the best cheap holiday}} are stopwords, and mm=100%. We then
remove the two stop clauses from the BooleanQuery, reduce mm accordingly
from 4 (100%) to 2, and add the two stop clauses as BQs. This approach would
also work for mm<100% cases, since mm is computed only over the non-stop
clauses.
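The "convert to boost query" idea could be sketched roughly like this. It is only an illustration: the field names, the choice of "the" and "best" as the two stopwords, and the helper {{split_stop_clauses}} are assumptions, and a "stop clause" is taken to mean a clause in which at least one qf field pruned the term.

```python
# Hypothetical setup: "the" and "best" are stopwords for title but not body.
STOPWORDS = {"title": {"the", "best"}, "body": set()}
QF = ["title", "body"]


def split_stop_clauses(terms, qf, stopwords, mm_percent=100):
    """Split clauses into main-query clauses and boost-query clauses.

    Returns (main, boosts, mm), or None when every clause is a stop clause,
    because a query cannot consist of boost queries only.
    """
    main, boosts = [], []
    for term in terms:
        kept = [f for f in qf if term not in stopwords.get(f, set())]
        if not kept:
            continue  # stopword in every field: nothing left to query
        if len(kept) == len(qf):
            main.append((term, kept))    # non-stop clause, counts towards mm
        else:
            boosts.append((term, kept))  # stop clause, demoted to a bq
    if not main:
        return None  # caller has to fall back to some other strategy
    mm = len(main) * mm_percent // 100   # mm counted over non-stop clauses only
    return main, boosts, mm


main, boosts, mm = split_stop_clauses("the best cheap holiday".split(), QF, STOPWORDS)
print([t for t, _ in main], [t for t, _ in boosts], mm)
# ['cheap', 'holiday'] ['the', 'best'] 2
```

Because mm is derived from the non-stop clauses only, the same split works unchanged for mm<100%.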
----
[1] For the special case of all clauses being stop clauses, eDisMax's existing
"smart" conditional stopword handling could perhaps be another solution? For
those unfamiliar with it, you can specify {{&stopwords=true}} (which is the
default) and eDisMax will remove stopwords for you instead of letting Analysis
do it. It requires that you don't have StopFilterFactory in your Analysis. Now,
if ALL query terms are stopwords, eDisMax will not remove them, to support
queries like "Who is the who?". (Q: How does eDisMax pick up which stopword
dictionary(ies) to use?) It's of no use to those removing stopwords in their
"index" analysis though.
> Fix the dismax/edismax stopwords mm issue
> -----------------------------------------
>
> Key: SOLR-3085
> URL: https://issues.apache.org/jira/browse/SOLR-3085
> Project: Solr
> Issue Type: Bug
> Components: search
> Reporter: Jan Høydahl
> Labels: MinimumShouldMatch, dismax, stopwords
> Fix For: 3.6, 4.0
>
>
> As discussed here http://search-lucene.com/m/Wr7iz1a95jx and here
> http://search-lucene.com/m/Yne042qEyCq1 and here
> http://search-lucene.com/m/RfAp82nSsla, DisMax has an issue with stopwords if
> not all fields used in QF have exactly the same stopword lists.
> The typical solution is to not use stopwords, to harmonize stopword lists
> across all fields in your QF, or to relax the MM to a lower percentage.
> Sometimes these are not acceptable workarounds, and we should find a better
> solution.