[jira] Updated: (SOLR-1799) enable matching of CamelCase with camelcase in WordDelimiterFilter

2010-03-15 Thread Shalin Shekhar Mangar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-1799:


Fix Version/s: 1.5 (was: 1.3)

 enable matching of CamelCase with camelcase in WordDelimiterFilter
 --

 Key: SOLR-1799
 URL: https://issues.apache.org/jira/browse/SOLR-1799
 Project: Solr
  Issue Type: Improvement
  Components: search
Affects Versions: 1.3, 1.4
Reporter: Chris Darroch
Priority: Minor
 Fix For: 1.5

 Attachments: SOLR-1799.patch


 At the bottom of the WordDelimiterFilter.java code there's the following 
 comment:
 // downsides:  if source text is powershot then a query of PowerShot 
 won't match!
 Another serious example for us might be something like an indexed document 
 containing the word Tribeca or Soho, and then a user trying to search for 
 TriBeCa or SoHo.
 This issue has turned up in a couple of recent mailing list threads:
 http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200908.mbox/%3cfe4f94830908201429j3ffbcdd3s3cb7d80542b31...@mail.gmail.com%3e
 http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200905.mbox/%3c72d9e9500905121619p68c27099ibc7079e52cb0e...@mail.gmail.com%3e
 In the first thread I found the best explication of what my own 
 misunderstanding was, and it's something I'm sure must trip up other people 
 as well:
 {quote}
 I've misunderstood WordDelimiterFilter.  You might think that catenateAll=1 
 would append the full phrase (sans delimiters) as an OR against the query.  
 So jOkersWild would produce:
 j (okers wild) OR jokerswild
 But you thought wrong.  It's actually:
 j (okers wild jokerswild)
 Which is confusing and won't match...
 {quote}
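 To make the quoted behaviour concrete, here is a minimal toy model of positional phrase matching. It is not Lucene's implementation; the class and method names are hypothetical, and it assumes the catenated term overlaps the position of the last generated part.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of positional phrase matching; NOT Lucene's actual implementation.
class PhraseSketch {
    /**
     * doc:   one indexed token per position.
     * query: for each consecutive position, the list of acceptable terms.
     * Returns true if some window of the document satisfies every query
     * position in order.
     */
    static boolean phraseMatches(List<String> doc, List<List<String>> query) {
        for (int start = 0; start + query.size() <= doc.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < query.size(); i++) {
                if (!query.get(i).contains(doc.get(start + i))) {
                    ok = false;
                    break;
                }
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Query "jOkersWild": the catenated term is assumed to share the
        // position of the last part, so it is only an alternative to "wild".
        List<List<String>> query = Arrays.asList(
            Arrays.asList("j"),
            Arrays.asList("okers"),
            Arrays.asList("wild", "jokerswild"));
        List<String> indexedAsOneWord = Arrays.asList("jokerswild");
        List<String> indexedAsParts = Arrays.asList("j", "okers", "wild");
        System.out.println(phraseMatches(indexedAsOneWord, query)); // false
        System.out.println(phraseMatches(indexedAsParts, query));   // true
    }
}
```

 Under this model the catenated term can only stand in for the final part of the phrase, so a document that contains just the single word never satisfies the earlier positions.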
 In the second thread, Yonik Seeley gives a good explanation of why this 
 occurs, and provides a suggested workaround where you duplicate your data 
 fields and then query on one using generateWordParts=1 and on the other 
 using catenateWords=1.  That works, but obviously requires data 
 duplication.  In our case, we are also following what I believe is 
 recommended practice and duplicating our data already into stemmed and 
 unstemmed indexes.  To my mind, to further duplicate both of these fields a 
 second time, with no difference in the indexed data of the additional copy, 
 seems needlessly wasteful when the problem lies entirely in the query side of 
 things.
 At any rate, I'm attaching a patch against Solr 1.3 which is rather hacky, 
 but seems to work for us.  In WordDelimiterFilter, if generateWordParts=1 
 and catenateWords=2, then we move the concatenated word to overlap its 
 position with the first generated token instead of the last (which is the 
 behaviour with catenateWords=1).  We further insert a preceding dummy flag 
 token with the special type CATENATE_FIRST.
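 The token layout described above can be sketched roughly as follows. This is not the code from the attached patch: the Tok class, the emit() helper, the "word" type name, the empty flag text and the flag token's position increment are all assumptions made for illustration. Only the CATENATE_FIRST type name and the overlap behaviour come from the description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the token layout; not the code from the patch.
class CatenateSketch {
    static final class Tok {
        final String text;
        final String type;
        final int posInc; // position increment relative to the previous token

        Tok(String text, String type, int posInc) {
            this.text = text;
            this.type = type;
            this.posInc = posInc;
        }

        @Override
        public String toString() {
            return text + "<" + type + ",+" + posInc + ">";
        }
    }

    /** Emits the generated parts plus the catenated word for mode 1 or 2. */
    static List<Tok> emit(List<String> parts, int catenateWords) {
        String joined = String.join("", parts);
        List<Tok> out = new ArrayList<>();
        if (catenateWords == 2) {
            // Dummy flag token first (its increment of 1 is an assumption),
            // then the catenated word at the same position.
            out.add(new Tok("", "CATENATE_FIRST", 1));
            out.add(new Tok(joined, "word", 0));
        }
        for (int i = 0; i < parts.size(); i++) {
            // In mode 2 the first part overlaps the catenated word.
            int inc = (i == 0 && catenateWords == 2) ? 0 : 1;
            out.add(new Tok(parts.get(i), "word", inc));
        }
        if (catenateWords == 1) {
            // Mode 1: the catenated word overlaps the last part.
            out.add(new Tok(joined, "word", 0));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> parts = Arrays.asList("power", "shot");
        System.out.println(emit(parts, 1));
        // [power<word,+1>, shot<word,+1>, powershot<word,+0>]
        System.out.println(emit(parts, 2));
        // [<CATENATE_FIRST,+1>, powershot<word,+0>, power<word,+0>, shot<word,+1>]
    }
}
```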
 In SolrPluginUtils in the DisjunctionMaxQueryParser class we just copy in the 
 entirety of the getFieldQuery() code from Lucene's QueryParser.  This is 
 ugly, I know.  This code is then tweaked so that in the case where the dummy 
 flag token is seen, it creates a BooleanQuery with the following token (the 
 concatenated word) as a conditional TermQuery clause, and then adds the 
 generated terms in their usual MultiPhraseQuery as a second conditional 
 clause.
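 As a rough illustration of the query shape this produces, the sketch below evaluates "catenated term OR phrase of parts" against a list of indexed tokens. It does not use Lucene's classes; Tok and matches() are hypothetical names, the "conditional" clauses are read here as optional (either one may match), and overlapping tokens in the non-flag path are ignored for simplicity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the query logic; not the patched getFieldQuery().
class FlagQuerySketch {
    static final class Tok {
        final String text;
        final String type;

        Tok(String text, String type) {
            this.text = text;
            this.type = type;
        }
    }

    /** Returns true if the analyzed query tokens match the indexed tokens. */
    static boolean matches(List<String> doc, List<Tok> queryTokens) {
        boolean flagged = queryTokens.size() >= 2
            && "CATENATE_FIRST".equals(queryTokens.get(0).type);
        if (flagged) {
            // The token after the flag is the catenated word; the rest are parts.
            String catenated = queryTokens.get(1).text;
            List<String> parts = new ArrayList<>();
            for (Tok t : queryTokens.subList(2, queryTokens.size())) {
                parts.add(t.text);
            }
            // Two optional clauses: a single term, or the parts as a phrase.
            return doc.contains(catenated)
                || Collections.indexOfSubList(doc, parts) >= 0;
        }
        // No flag: treat all tokens as one plain phrase.
        List<String> phrase = new ArrayList<>();
        for (Tok t : queryTokens) {
            phrase.add(t.text);
        }
        return Collections.indexOfSubList(doc, phrase) >= 0;
    }

    public static void main(String[] args) {
        List<Tok> flaggedQuery = Arrays.asList(
            new Tok("", "CATENATE_FIRST"),
            new Tok("powershot", "word"),
            new Tok("power", "word"),
            new Tok("shot", "word"));
        System.out.println(matches(Arrays.asList("powershot"), flaggedQuery));     // true
        System.out.println(matches(Arrays.asList("power", "shot"), flaggedQuery)); // true
        System.out.println(matches(Arrays.asList("canon"), flaggedQuery));         // false
    }
}
```

 With this shape a query of PowerShot can match a document indexed as either powershot or power shot, without a second copy of the field.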
 Now I realize this patch is (a) not likely acceptable on style and elegance 
 grounds, and (b) only against Solr 1.3, not trunk.  My apologies for both; 
 after I'd spent most of what time I had available tracking down the source of 
 the problem, I just needed to get something working quickly.  Perhaps this 
 patch will inspire others to greatness, though, or at a minimum provide a 
 starting point for those who stumble over this same issue.
 Thanks for a great application!  Cheers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1799) enable matching of CamelCase with camelcase in WordDelimiterFilter

2010-02-27 Thread Chris Darroch (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Darroch updated SOLR-1799:


Attachment: SOLR-1799.patch

 enable matching of CamelCase with camelcase in WordDelimiterFilter
 --

 Key: SOLR-1799
 URL: https://issues.apache.org/jira/browse/SOLR-1799
 Project: Solr
  Issue Type: Improvement
  Components: search
Affects Versions: 1.3, 1.4
Reporter: Chris Darroch
Priority: Minor
 Fix For: 1.3

 Attachments: SOLR-1799.patch


