[jira] Updated: (SOLR-1799) enable matching of "CamelCase" with "camelcase" in WordDelimiterFilter

Shalin Shekhar Mangar (JIRA) Mon, 15 Mar 2010 02:09:52 -0700

     [ https://issues.apache.org/jira/browse/SOLR-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Shalin Shekhar Mangar updated SOLR-1799:
----------------------------------------

    Fix Version/s:     (was: 1.3)
                   1.5

> enable matching of "CamelCase" with "camelcase" in WordDelimiterFilter
> ----------------------------------------------------------------------
>
>                 Key: SOLR-1799
>                 URL: https://issues.apache.org/jira/browse/SOLR-1799
>             Project: Solr
>          Issue Type: Improvement
>          Components: search
>    Affects Versions: 1.3, 1.4
>            Reporter: Chris Darroch
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1799.patch
>
>
> At the bottom of the WordDelimiterFilter.java code there's the following 
> comment:
> // downsides:  if source text is "powershot" then a query of "PowerShot" 
> won't match!
> Another serious example for us might be something like an indexed document 
> containing the word "Tribeca" or "Soho", and then a user trying to search for 
> "TriBeCa" or "SoHo".
> This issue has turned up in a couple of recent mailing list threads:
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200908.mbox/%3cfe4f94830908201429j3ffbcdd3s3cb7d80542b31...@mail.gmail.com%3e
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200905.mbox/%3c72d9e9500905121619p68c27099ibc7079e52cb0e...@mail.gmail.com%3e
> In the first thread I found the best explication of what my own 
> misunderstanding was, and it's something I'm sure must trip up other people 
> as well:
> {quote}
> I've misunderstood WordDelimiterFilter.  You might think that catenateAll="1" 
> would append the full phrase (sans delimiters) as an OR against the query.  
> So "jOkersWild" would produce:
> "j (okers wild)" OR "jokerswild"
> But you thought wrong.  It's actually:
> "j (okers wild jokerswild)"
> Which is confusing and won't match...
> {quote}
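
The mismatch described in the quote above can be reproduced with a small simulation. This is only a sketch in plain Python, not Solr or Lucene code: split_parts, analyze and phrase_matches are hypothetical helpers, the splitter handles only lower-to-upper case transitions, and the concatenated token is assumed to overlap the last generated part, as the issue text describes for the stock filter.

```python
import re


def split_parts(word):
    """Split on lower-to-upper case transitions, then lowercase the parts."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", word).lower().split()


def analyze(word, catenate_all=False):
    """Return one set of terms per token position."""
    parts = split_parts(word)
    slots = [{p} for p in parts]
    if catenate_all and len(parts) > 1:
        # Assumption: the concatenated token shares the LAST part's position.
        slots[-1].add("".join(parts))
    return slots


def phrase_matches(query_slots, doc_slots):
    """Phrase-style check: every query position must be satisfied by the
    corresponding consecutive document position."""
    n = len(query_slots)
    return any(
        all(query_slots[i] & doc_slots[start + i] for i in range(n))
        for start in range(len(doc_slots) - n + 1)
    )


doc = analyze("jokerswild", catenate_all=True)      # a single position
query = analyze("jOkersWild", catenate_all=True)    # three positions

print(phrase_matches(query, doc))                   # False
print(phrase_matches([{"jokerswild"}], doc))        # True
```

The query occupies three positions while the document has only one, so the phrase cannot match even though "jokerswild" is one of the query's terms. A separate OR clause on the concatenated term, which is what the quoted poster expected, would match.
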
> In the second thread, Yonik Seeley gives a good explanation of why this 
> occurs, and provides a suggested workaround where you duplicate your data 
> fields and then query on one using generateWordParts="1" and on the other 
> using catenateWords="1".  That works, but obviously requires data 
> duplication.  In our case, we are also following what I believe is 
> recommended practice and duplicating our data already into stemmed and 
> unstemmed indexes.  To my mind, to further duplicate both of these fields a 
> second time, with no difference in the indexed data of the additional copy, 
> seems needlessly wasteful when the problem lies entirely in the query side of 
> things.
> At any rate, I'm attaching a patch against Solr 1.3 which is rather hacky, 
> but seems to work for us.  In WordDelimiterFilter, if generateWordParts="1" 
> and catenateWords="2", then we move the concatenated word to overlap its 
> position with the first generated token instead of the last (which is the 
> behaviour with catenateWords="1").  We further insert a preceding dummy flag 
> token with the special type "CATENATE_FIRST".
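
To make the token layout concrete, here is a rough Python model of the two catenateWords modes described above. It is a sketch, not the patch itself: Token, catenate_words and positions are hypothetical names, and the flag token's empty text and zero position increment are assumptions, since the description only gives its type.

```python
from collections import namedtuple

Token = namedtuple("Token", "term type pos_inc")
FLAG_TYPE = "CATENATE_FIRST"  # type name taken from the issue description


def catenate_words(parts, mode):
    """Model the token stream for catenateWords=1 (stock) and =2 (patched)."""
    words = [Token(p, "word", 1) for p in parts]
    if len(parts) < 2:
        return words
    joined = "".join(parts)
    if mode == 1:
        # Concatenated token overlaps the last generated part.
        return words + [Token(joined, "word", 0)]
    if mode == 2:
        # Flag token first, then the concatenated token, with the first
        # generated part stacked on the same position.
        first = words[0]._replace(pos_inc=0)
        return [Token("", FLAG_TYPE, 0), Token(joined, "word", 1), first] + words[1:]
    raise ValueError("mode must be 1 or 2")


def positions(tokens):
    """Map each term to its absolute position, skipping the flag token."""
    pos, out = 0, {}
    for tok in tokens:
        if tok.type == FLAG_TYPE:
            continue
        pos += tok.pos_inc
        out[tok.term] = pos
    return out


stock = positions(catenate_words(["power", "shot"], 1))
patched = positions(catenate_words(["power", "shot"], 2))
print(stock["powershot"], patched["powershot"])  # 2 1
```

In this model the only difference a consumer sees is where "powershot" sits relative to "power" and "shot", plus the leading flag token that tells the query parser to treat the next token specially.
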
> In SolrPluginUtils in the DisjunctionMaxQueryParser class we just copy in the 
> entirety of the getFieldQuery() code from Lucene's QueryParser.  This is 
> ugly, I know.  This code is then tweaked so that in the case where the dummy 
> flag token is seen, it creates a BooleanQuery with the following token (the 
> concatenated word) as a conditional TermQuery clause, and then adds the 
> generated terms in their usual MultiPhraseQuery as a second conditional 
> clause.
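
The query-side change can be sketched the same way. The Python below is an illustration only and does not use Lucene's API: build_query and matches are hypothetical helpers, "conditional" is read here as an optional (SHOULD) clause, and a plain exact phrase stands in for the MultiPhraseQuery mentioned above.

```python
def build_query(tokens):
    """tokens: list of (term, type) pairs in stream order."""
    if tokens and tokens[0][1] == "CATENATE_FIRST":
        joined = tokens[1][0]                      # the concatenated word
        parts = [term for term, _ in tokens[2:]]   # the generated parts
        # Two optional clauses: either one is enough to match.
        return ("OR", ("TERM", joined), ("PHRASE", parts))
    return ("PHRASE", [term for term, _ in tokens])


def matches(query, doc_terms):
    """Evaluate the toy query against a list of document terms."""
    kind = query[0]
    if kind == "TERM":
        return query[1] in doc_terms
    if kind == "PHRASE":
        n = len(query[1])
        return any(doc_terms[i:i + n] == query[1]
                   for i in range(len(doc_terms) - n + 1))
    if kind == "OR":
        return any(matches(q, doc_terms) for q in query[1:])
    raise ValueError(kind)


flagged = [("", "CATENATE_FIRST"), ("powershot", "word"),
           ("power", "word"), ("shot", "word")]
plain = [("power", "word"), ("shot", "word")]

print(matches(build_query(flagged), ["canon", "powershot"]))      # True
print(matches(build_query(flagged), ["canon", "power", "shot"]))  # True
print(matches(build_query(plain), ["canon", "powershot"]))        # False
```

With the flag token present, a query for "PowerShot" matches documents indexed either as "powershot" or as "power shot"; without it, only the phrase form matches.
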
> Now I realize this patch is (a) not likely acceptable on style and elegance 
> grounds, and (b) only against Solr 1.3, not trunk.  My apologies for both; 
> after I'd spent most of what time I had available tracking down the source of 
> the problem, I just needed to get something working quickly.  Perhaps this 
> patch will inspire others to greatness, though, or at a minimum provide a 
> starting point for those who stumble over this same issue.
> Thanks for a great application!  Cheers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
