[ https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492591#comment-13492591 ]
Tom Burton-West commented on SOLR-3589:
---------------------------------------

Hi Robert,

I just put the backport to 3.6 up on our test server and pointed it to one of our production shards. The improvement for Chinese queries is dramatic, especially for longer queries such as the TREC 5 queries (see examples below). When you have time, please look over the backport of the patch. I think it is fine, but I would appreciate you checking it. My understanding of your patch is that it affects only a small portion of the edismax logic, but I don't understand the edismax parser well enough to be sure there isn't some difference between 3.6 and 4.0 that I didn't account for in the patch.

Thanks for working on this. Naomi and I are both very excited about this bug finally being fixed, and we want to put the fix into production soon.

---
Example TREC 5 Chinese queries:

<num> Number: CH4
<E-title> The newly discovered oil fields in China.
<C-title> 中国大陆新发现的油田
40,135 items found for 中国大陆新发现的油田 with the current implementation (due to the dismax bug)
78 items found for 中国大陆新发现的油田 with the patch

<num> Number: CH10
<E-title> Border Trade in Xinjiang
<C-title> 新疆的边境贸易
20,249 items found for 新疆的边境贸易 with the current implementation (with the bug)
243 items found for 新疆的边境贸易 with the patch

> Edismax parser does not honor mm parameter if analyzer splits a token
> ---------------------------------------------------------------------
>
>                 Key: SOLR-3589
>                 URL: https://issues.apache.org/jira/browse/SOLR-3589
>             Project: Solr
>          Issue Type: Bug
>          Components: search
>    Affects Versions: 3.6, 4.0-BETA
>            Reporter: Tom Burton-West
>            Assignee: Robert Muir
>         Attachments: SOLR-3589-3.6.PATCH, SOLR-3589.patch, SOLR-3589.patch, SOLR-3589.patch, SOLR-3589.patch, SOLR-3589.patch, SOLR-3589_test.patch, testSolr3589.xml.gz, testSolr3589.xml.gz
>
> With edismax mm set to 100%, if one of the tokens is split into two tokens by the analyzer chain (i.e. "fire-fly" => fire fly), the mm parameter is ignored and the equivalent of an OR query for "fire OR fly" is produced.
> This is particularly a problem for languages that do not use white space to separate words, such as Chinese or Japanese.
> See these messages for more discussion:
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-hypenated-words-WDF-splitting-etc-tc3991911.html
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-i-e-CJK-tc3991438.html
> http://lucene.472066.n3.nabble.com/Why-won-t-dismax-create-multiple-DisjunctionMaxQueries-when-autoGeneratePhraseQueries-is-false-tc3992109.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org