[ 
https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492591#comment-13492591
 ] 

Tom Burton-West commented on SOLR-3589:
---------------------------------------

Hi Robert,

I just put the backport to 3.6 up on our test server and pointed it to one of 
our production shards.  The improvement for Chinese queries is dramatic, 
especially for longer queries such as the TREC 5 queries; see the examples below.

When you have time, please look over the backport of the patch.  I believe it 
is fine, but I would appreciate a review.  My understanding is that your patch 
affects only a small portion of the edismax logic, but I don't understand the 
edismax parser well enough to be sure there isn't some difference between 3.6 
and 4.0 that I failed to account for in the patch.

Thanks for working on this.   Naomi and I are both very excited about this bug 
finally being fixed and want to put the fix into production soon.
---
Example TREC 5 Chinese queries:

<num> Number: CH4
<E-title> The newly discovered oil fields in China.
<C-title> 中国大陆新发现的油田   
40,135 items found for 中国大陆新发现的油田 with the current implementation (due to the dismax bug)
78 items found for 中国大陆新发现的油田 with the patch

<num> Number: CH10
<E-title> Border Trade in Xinjiang
<C-title> 新疆的边境贸易  
20,249 items found for 新疆的边境贸易 with the current implementation (with the bug)
243 items found for 新疆的边境贸易 with the patch.

                
> Edismax parser does not honor mm parameter if analyzer splits a token
> ---------------------------------------------------------------------
>
>                 Key: SOLR-3589
>                 URL: https://issues.apache.org/jira/browse/SOLR-3589
>             Project: Solr
>          Issue Type: Bug
>          Components: search
>    Affects Versions: 3.6, 4.0-BETA
>            Reporter: Tom Burton-West
>            Assignee: Robert Muir
>         Attachments: SOLR-3589-3.6.PATCH, SOLR-3589.patch, SOLR-3589.patch, 
> SOLR-3589.patch, SOLR-3589.patch, SOLR-3589.patch, SOLR-3589_test.patch, 
> testSolr3589.xml.gz, testSolr3589.xml.gz
>
>
> With edismax mm set to 100%, if one of the tokens is split into two tokens by 
> the analyzer chain (e.g. "fire-fly" => fire fly), the mm parameter is 
> ignored and the equivalent of an OR query for "fire OR fly" is produced.
> This is a particular problem for languages that do not use white space to 
> separate words, such as Chinese or Japanese.
> See these messages for more discussion:
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-hypenated-words-WDF-splitting-etc-tc3991911.html
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-i-e-CJK-tc3991438.html
> http://lucene.472066.n3.nabble.com/Why-won-t-dismax-create-multiple-DisjunctionMaxQueries-when-autoGeneratePhraseQueries-is-false-tc3992109.html
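
The quoted description can be illustrated with a minimal sketch (not Solr code; 
the function name and shapes are illustrative only). With mm=100%, every 
analyzed token should be required, so a document containing only one of the 
split tokens must not match; the bug produced the OR behavior instead.

```python
# Illustrative sketch of minimum-should-match (mm) semantics, assuming a
# simple term-set model of documents and queries. Not Solr/Lucene code.

def matches(doc_terms, query_terms, mm_percent):
    """Return True if the document satisfies the mm constraint."""
    # Number of query terms that must match, rounded up (ceiling division).
    required = -(-len(query_terms) * mm_percent // 100)
    hits = sum(1 for t in query_terms if t in doc_terms)
    return hits >= required

# The analyzer splits "fire-fly" into two tokens: ["fire", "fly"].
query = ["fire", "fly"]

# Correct behavior with mm=100%: both tokens must be present.
print(matches({"fire"}, query, 100))          # → False (doc missing "fly")
print(matches({"fire", "fly"}, query, 100))   # → True

# The bug effectively treated the split tokens as optional OR clauses,
# so the first document above would (wrongly) have matched.
```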

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
