[ https://issues.apache.org/jira/browse/SOLR-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shalin Shekhar Mangar updated SOLR-1799:
----------------------------------------
Fix Version/s: 1.5 (was: 1.3)
> enable matching of "CamelCase" with "camelcase" in WordDelimiterFilter
> ----------------------------------------------------------------------
>
> Key: SOLR-1799
> URL: https://issues.apache.org/jira/browse/SOLR-1799
> Project: Solr
> Issue Type: Improvement
> Components: search
> Affects Versions: 1.3, 1.4
> Reporter: Chris Darroch
> Priority: Minor
> Fix For: 1.5
>
> Attachments: SOLR-1799.patch
>
>
> At the bottom of the WordDelimiterFilter.java code there's the following
> comment:
> // downsides: if source text is "powershot" then a query of "PowerShot"
> // won't match!
> Another serious example for us might be something like an indexed document
> containing the word "Tribeca" or "Soho", and then a user trying to search for
> "TriBeCa" or "SoHo".
> This issue has turned up in a couple of recent mailing list threads:
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200908.mbox/%[email protected]%3e
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200905.mbox/%[email protected]%3e
> In the first thread I found the best explication of what my own
> misunderstanding was, and it's something I'm sure must trip up other people
> as well:
> {quote}
> I've misunderstood WordDelimiterFilter. You might think that catenateAll="1"
> would append the full phrase (sans delimiters) as an OR against the query.
> So "jOkersWild" would produce:
> "j (okers wild)" OR "jokerswild"
> But you thought wrong. Its actually:
> "j (okers wild jokerswild)"
> Which is confusing and won't match...
> {quote}
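
To make the position problem concrete, here is a minimal sketch in plain Java. It does not use Lucene or the real WordDelimiterFilter; the class and method names (CatenatePositions, catenatedAtLast, phraseMatches) are hypothetical, and the word parts are hard-coded rather than derived by a real splitter. It assumes the layout described later in this issue, where the catenated form shares a position with the last generated part, and models a phrase-style query as a list of position slots holding alternative terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CatenatePositions {

    // One slot per position; each slot lists the alternative terms at that position.
    // Assumption: the catenated form is stacked on the LAST part's position.
    static List<List<String>> catenatedAtLast(List<String> parts) {
        if (parts.isEmpty()) {
            throw new IllegalArgumentException("parts must not be empty");
        }
        List<List<String>> slots = new ArrayList<>();
        for (String part : parts) {
            List<String> slot = new ArrayList<>();
            slot.add(part);
            slots.add(slot);
        }
        slots.get(slots.size() - 1).add(String.join("", parts));
        return slots;
    }

    // Simplified phrase semantics: every slot must be satisfied by the
    // document term at the corresponding consecutive position.
    static boolean phraseMatches(List<List<String>> slots, List<String> doc) {
        for (int start = 0; start + slots.size() <= doc.size(); start++) {
            boolean all = true;
            for (int p = 0; p < slots.size() && all; p++) {
                all = slots.get(p).contains(doc.get(start + p));
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<List<String>> query = catenatedAtLast(Arrays.asList("j", "okers", "wild"));
        System.out.println(query);
        System.out.println(phraseMatches(query, Arrays.asList("j", "okers", "wild")));
        System.out.println(phraseMatches(query, Arrays.asList("jokerswild")));
    }
}
```

In this model the query still spans three positions, so a document that contains only the single token "jokerswild" cannot satisfy it, which is the mismatch the quoted message complains about.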
> In the second thread, Yonik Seeley gives a good explanation of why this
> occurs, and provides a suggested workaround where you duplicate your data
> fields and then query on one using generateWordParts="1" and on the other
> using catenateWords="1". That works, but obviously requires data
> duplication. In our case, we are also following what I believe is
> recommended practice and duplicating our data already into stemmed and
> unstemmed indexes. To my mind, to further duplicate both of these fields a
> second time, with no difference in the indexed data of the additional copy,
> seems needlessly wasteful when the problem lies entirely in the query side of
> things.
> At any rate, I'm attaching a patch against Solr 1.3 which is rather hacky,
> but seems to work for us. In WordDelimiterFilter, if generateWordParts="1"
> and catenateWords="2", then we move the concatenated word to overlap its
> position with the first generated token instead of the last (which is the
> behaviour with catenateWords="1"). We further insert a preceding dummy flag
> token with the special type "CATENATE_FIRST".
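
The token layout just described might look roughly like the sketch below. This is not the code from the attached patch: the class CatenateFirstStream, the Tok type, the "word" type name, the empty text of the flag token, and the position given to the flag token are all assumptions made for illustration. Only the type name "CATENATE_FIRST" and the idea of placing the catenated word on the first part's position come from the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CatenateFirstStream {
    static final String CATENATE_FIRST = "CATENATE_FIRST"; // type name from the issue text
    static final String WORD = "word";                     // hypothetical type for ordinary tokens

    static final class Tok {
        final String term;
        final String type;
        final int position; // absolute position, used here instead of increments

        Tok(String term, String type, int position) {
            this.term = term;
            this.type = type;
            this.position = position;
        }

        @Override
        public String toString() {
            return term + "@" + position + ":" + type;
        }
    }

    // Emits: flag token, catenated word, then the generated parts.
    // The catenated word shares position 0 with the first part.
    static List<Tok> emit(List<String> parts) {
        List<Tok> out = new ArrayList<>();
        out.add(new Tok("", CATENATE_FIRST, 0)); // assumption: empty text, position 0
        out.add(new Tok(String.join("", parts), WORD, 0));
        for (int i = 0; i < parts.size(); i++) {
            out.add(new Tok(parts.get(i), WORD, i));
        }
        return out;
    }

    public static void main(String[] args) {
        for (Tok t : emit(Arrays.asList("tri", "be", "ca"))) {
            System.out.println(t);
        }
    }
}
```

Putting the flag token first means a consumer can decide how to treat the catenated word before it has seen any of the generated parts.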
> In SolrPluginUtils in the DisjunctionMaxQueryParser class we just copy in the
> entirety of the getFieldQuery() code from Lucene's QueryParser. This is
> ugly, I know. This code is then tweaked so that in the case where the dummy
> flag token is seen, it creates a BooleanQuery with the following token (the
> concatenated word) as a conditional TermQuery clause, and then adds the
> generated terms in their usual MultiPhraseQuery as a second conditional
> clause.
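
The query shape described above can be sketched without Lucene as follows. This is an illustration only, not the patched getFieldQuery(): the names CatenateFirstQuery, Query and build are hypothetical, "conditional clause" is read here as an optional (OR-style) clause, and the phrase part is reduced to a plain consecutive-term match rather than a real MultiPhraseQuery. Tokens are passed as {term, type} pairs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class CatenateFirstQuery {
    static final String CATENATE_FIRST = "CATENATE_FIRST";

    static final class Query {
        final String term;         // optional single-term clause, or null if absent
        final List<String> phrase; // optional phrase clause over the generated parts

        Query(String term, List<String> phrase) {
            this.term = term;
            this.phrase = phrase;
        }

        // Either optional clause is enough for a match.
        boolean matches(List<String> doc) {
            boolean termHit = term != null && doc.contains(term);
            boolean phraseHit = !phrase.isEmpty()
                    && Collections.indexOfSubList(doc, phrase) >= 0;
            return termHit || phraseHit;
        }
    }

    // If the stream starts with the flag token, the next token becomes the
    // term clause and the remaining tokens form the phrase clause.
    static Query build(List<String[]> tokens) {
        boolean flagged = tokens.size() >= 2 && CATENATE_FIRST.equals(tokens.get(0)[1]);
        String term = flagged ? tokens.get(1)[0] : null;
        List<String> phrase = new ArrayList<>();
        for (int i = flagged ? 2 : 0; i < tokens.size(); i++) {
            phrase.add(tokens.get(i)[0]);
        }
        return new Query(term, phrase);
    }

    public static void main(String[] args) {
        List<String[]> tokens = Arrays.asList(
                new String[] {"", CATENATE_FIRST},
                new String[] {"jokerswild", "word"},
                new String[] {"j", "word"},
                new String[] {"okers", "word"},
                new String[] {"wild", "word"});
        Query q = build(tokens);
        System.out.println(q.matches(Arrays.asList("jokerswild")));
        System.out.println(q.matches(Arrays.asList("j", "okers", "wild")));
        System.out.println(q.matches(Arrays.asList("okers")));
    }
}
```

With the two clauses kept separate, a document indexed as the single token "jokerswild" and a document indexed as the three parts both match, while a document containing only one of the parts does not.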
> Now I realize this patch is (a) not likely acceptable on style and elegance
> grounds, and (b) only against Solr 1.3, not trunk. My apologies for both;
> after I'd spent most of what time I had available tracking down the source of
> the problem, I just needed to get something working quickly. Perhaps this
> patch will inspire others to greatness, though, or at a minimum provide a
> starting point for those who stumble over this same issue.
> Thanks for a great application! Cheers.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.