[jira] Updated: (SOLR-1799) enable matching of "CamelCase" with "camelcase" in WordDelimiterFilter

2010-02-27 Thread Chris Darroch (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Darroch updated SOLR-1799:


Attachment: SOLR-1799.patch

> enable matching of "CamelCase" with "camelcase" in WordDelimiterFilter
> --
>
> Key: SOLR-1799
> URL: https://issues.apache.org/jira/browse/SOLR-1799
> Project: Solr
> Issue Type: Improvement
> Components: search
> Affects Versions: 1.3, 1.4
> Reporter: Chris Darroch
> Priority: Minor
> Fix For: 1.3
>
> Attachments: SOLR-1799.patch
>
>
> At the bottom of the WordDelimiterFilter.java code there's the following 
> comment:
> // downsides:  if source text is "powershot" then a query of "PowerShot" won't match!
> Another serious example for us might be something like an indexed document 
> containing the word "Tribeca" or "Soho", and then a user trying to search for 
> "TriBeCa" or "SoHo".
> This issue has turned up in a couple of recent mailing list threads:
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200908.mbox/%3cfe4f94830908201429j3ffbcdd3s3cb7d80542b31...@mail.gmail.com%3e
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200905.mbox/%3c72d9e9500905121619p68c27099ibc7079e52cb0e...@mail.gmail.com%3e
> In the first thread I found the best explication of what my own 
> misunderstanding was, and it's something I'm sure must trip up other people 
> as well:
> {quote}
> I've misunderstood WordDelimiterFilter.  You might think that catenateAll="1" 
> would append the full phrase (sans delimiters) as an OR against the query.  
> So "jOkersWild" would produce:
> "j (okers wild)" OR "jokerswild"
> But you thought wrong.  It's actually:
> "j (okers wild jokerswild)"
> Which is confusing and won't match...
> {quote}
> In the second thread, Yonik Seeley gives a good explanation of why this 
> occurs, and provides a suggested workaround where you duplicate your data 
> fields and then query on one using generateWordParts="1" and on the other 
> using catenateWords="1".  That works, but obviously requires data 
> duplication.  In our case, we are also following what I believe is 
> recommended practice and duplicating our data already into stemmed and 
> unstemmed indexes.  To my mind, to further duplicate both of these fields a 
> second time, with no difference in the indexed data of the additional copy, 
> seems needlessly wasteful when the problem lies entirely in the query side of 
> things.
> At any rate, I'm attaching a patch against Solr 1.3 which is rather hacky, 
> but seems to work for us.  In WordDelimiterFilter, if generateWordParts="1" 
> and catenateWords="2", then we move the concatenated word to overlap its 
> position with the first generated token instead of the last (which is the 
> behaviour with catenateWords="1").  We further insert a preceding dummy flag 
> token with the special type "CATENATE_FIRST".
> In SolrPluginUtils in the DisjunctionMaxQueryParser class we just copy in the 
> entirety of the getFieldQuery() code from Lucene's QueryParser.  This is 
> ugly, I know.  This code is then tweaked so that in the case where the dummy 
> flag token is seen, it creates a BooleanQuery with the following token (the 
> concatenated word) as a conditional TermQuery clause, and then adds the 
> generated terms in their usual MultiPhraseQuery as a second conditional 
> clause.
> Now I realize this patch is (a) not likely acceptable on style and elegance 
> grounds, and (b) only against Solr 1.3, not trunk.  My apologies for both; 
> after I'd spent most of what time I had available tracking down the source of 
> the problem, I just needed to get something working quickly.  Perhaps this 
> patch will inspire others to greatness, though, or at a minimum provide a 
> starting point for those who stumble over this same issue.
> Thanks for a great application!  Cheers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1799) enable matching of "CamelCase" with "camelcase" in WordDelimiterFilter

2010-02-27 Thread Chris Darroch (JIRA)
enable matching of "CamelCase" with "camelcase" in WordDelimiterFilter
--

Key: SOLR-1799
URL: https://issues.apache.org/jira/browse/SOLR-1799
Project: Solr
Issue Type: Improvement
Components: search
Affects Versions: 1.4, 1.3
Reporter: Chris Darroch
Priority: Minor
Fix For: 1.3


At the bottom of the WordDelimiterFilter.java code there's the following 
comment:

// downsides:  if source text is "powershot" then a query of "PowerShot" won't match!

Another serious example for us might be something like an indexed document 
containing the word "Tribeca" or "Soho", and then a user trying to search for 
"TriBeCa" or "SoHo".

This issue has turned up in a couple of recent mailing list threads:

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200908.mbox/%3cfe4f94830908201429j3ffbcdd3s3cb7d80542b31...@mail.gmail.com%3e
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200905.mbox/%3c72d9e9500905121619p68c27099ibc7079e52cb0e...@mail.gmail.com%3e

In the first thread I found the best explication of what my own 
misunderstanding was, and it's something I'm sure must trip up other people as 
well:

{quote}
I've misunderstood WordDelimiterFilter.  You might think that catenateAll="1" 
would append the full phrase (sans delimiters) as an OR against the query.  So 
"jOkersWild" would produce:

"j (okers wild)" OR "jokerswild"

But you thought wrong.  It's actually:

"j (okers wild jokerswild)"

Which is confusing and won't match...
{quote}
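
The mismatch described in the quote can be reproduced with a small model.  
This is only a sketch in Python, not Solr or Lucene code: split_on_case(), 
query_positions() and phrase_matches() are hypothetical helpers, the splitter 
handles only lower-to-upper case transitions, and it assumes the catenated 
token shares its position with the last generated part, as described below 
for catenateWords="1".

```python
import re


def split_on_case(word):
    """Split at lower-to-upper transitions and lowercase each part."""
    return [p.lower() for p in re.findall(r"[A-Z]?[a-z]+", word)]


def query_positions(word):
    """Model the query-side token stream as a list of phrase positions.

    Each position holds the alternative terms allowed there. The catenated
    token is added as an alternative at the LAST position (assumption).
    """
    parts = split_on_case(word)
    positions = [[p] for p in parts]
    if len(parts) > 1:
        positions[-1].append("".join(parts))
    return positions


def phrase_matches(doc_tokens, positions):
    """True if a run of consecutive doc tokens satisfies every position."""
    n = len(positions)
    return any(
        all(doc_tokens[i + k] in positions[k] for k in range(n))
        for i in range(len(doc_tokens) - n + 1)
    )


positions = query_positions("jOkersWild")
print(positions)                                      # [['j'], ['okers'], ['wild', 'jokerswild']]
print(phrase_matches(["jokerswild"], positions))      # False
print(phrase_matches(["j", "okers", "wild"], positions))  # True
```

The phrase needs three consecutive positions, so a document that holds the 
single token "jokerswild" can never satisfy it, even though that token 
appears among the query terms.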

In the second thread, Yonik Seeley gives a good explanation of why this occurs, 
and provides a suggested workaround where you duplicate your data fields and 
then query on one using generateWordParts="1" and on the other using 
catenateWords="1".  That works, but obviously requires data duplication.  In 
our case, we are also following what I believe is recommended practice and 
duplicating our data already into stemmed and unstemmed indexes.  To my mind, 
to further duplicate both of these fields a second time, with no difference in 
the indexed data of the additional copy, seems needlessly wasteful when the 
problem lies entirely in the query side of things.

At any rate, I'm attaching a patch against Solr 1.3 which is rather hacky, but 
seems to work for us.  In WordDelimiterFilter, if generateWordParts="1" and 
catenateWords="2", then we move the concatenated word to overlap its position 
with the first generated token instead of the last (which is the behaviour with 
catenateWords="1").  We further insert a preceding dummy flag token with the 
special type "CATENATE_FIRST".

In SolrPluginUtils in the DisjunctionMaxQueryParser class we just copy in the 
entirety of the getFieldQuery() code from Lucene's QueryParser.  This is ugly, 
I know.  This code is then tweaked so that in the case where the dummy flag 
token is seen, it creates a BooleanQuery with the following token (the 
concatenated word) as a conditional TermQuery clause, and then adds the 
generated terms in their usual MultiPhraseQuery as a second conditional clause.
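
The intended query shape can be sketched as follows.  Again this is a Python 
model rather than the Java in the patch: patched_tokens(), build_query() and 
matches() are hypothetical names, the tuples stand in for Lucene's 
BooleanQuery, TermQuery and MultiPhraseQuery, and the position given to the 
flag token is an assumption.

```python
import re


def split_on_case(word):
    """Split at lower-to-upper transitions and lowercase each part."""
    return [p.lower() for p in re.findall(r"[A-Z]?[a-z]+", word)]


def patched_tokens(word):
    """Model catenateWords="2": flag token, catenated word, then the parts.

    Tokens are (text, type, position). The catenated word overlaps the
    position of the first generated part.
    """
    parts = split_on_case(word)
    if len(parts) < 2:
        return [(p, "word", 1) for p in parts]
    tokens = [("", "CATENATE_FIRST", 1), ("".join(parts), "word", 1)]
    tokens += [(p, "word", i) for i, p in enumerate(parts, start=1)]
    return tokens


def build_query(tokens):
    """Return ("term", t), ("phrase", [...]) or ("or", [clauses])."""
    if tokens and tokens[0][1] == "CATENATE_FIRST":
        catenated = tokens[1][0]
        parts = [t[0] for t in tokens[2:]]
        # Two optional clauses: the catenated term OR the usual phrase.
        return ("or", [("term", catenated), ("phrase", parts)])
    texts = [t[0] for t in tokens]
    return ("term", texts[0]) if len(texts) == 1 else ("phrase", texts)


def matches(doc_tokens, query):
    """Evaluate the model query against a list of indexed tokens."""
    kind, arg = query
    if kind == "term":
        return arg in doc_tokens
    if kind == "phrase":
        n = len(arg)
        return any(doc_tokens[i:i + n] == arg
                   for i in range(len(doc_tokens) - n + 1))
    return any(matches(doc_tokens, clause) for clause in arg)


query = build_query(patched_tokens("PowerShot"))
print(matches(["powershot"], query))      # True
print(matches(["power", "shot"], query))  # True
```

With both clauses optional, a query of "PowerShot" matches a document indexed 
as "powershot" as well as one indexed as "power shot", with no extra copy of 
the field.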

Now I realize this patch is (a) not likely acceptable on style and elegance 
grounds, and (b) only against Solr 1.3, not trunk.  My apologies for both; 
after I'd spent most of what time I had available tracking down the source of 
the problem, I just needed to get something working quickly.  Perhaps this 
patch will inspire others to greatness, though, or at a minimum provide a 
starting point for those who stumble over this same issue.

Thanks for a great application!  Cheers.




[jira] Updated: (SOLR-1735) shut down TimeLimitedCollection timer thread on application unload

2010-01-26 Thread Chris Darroch (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Darroch updated SOLR-1735:


Attachment: SOLR-1735-1_3.patch

> shut down TimeLimitedCollection timer thread on application unload
> --
>
> Key: SOLR-1735
> URL: https://issues.apache.org/jira/browse/SOLR-1735
> Project: Solr
> Issue Type: Improvement
> Affects Versions: 1.3, 1.4
> Reporter: Chris Darroch
> Attachments: SOLR-1735-1_3.patch
>
>
> As described in https://issues.apache.org/jira/browse/LUCENE-2237, shutting 
> down the timer thread created by Lucene's TimeLimitedCollector allows Tomcat 
> or another application server to cleanly unload solr.war (or any application 
> using Lucene, for that matter).
> I'm attaching two patches for Solr 1.3 which use the patch provided in 
> LUCENE-2237 to shut down the timer thread when a new servlet context listener 
> for the solr.war application is informed the application is about to be 
> unloaded.




[jira] Created: (SOLR-1735) shut down TimeLimitedCollection timer thread on application unload

2010-01-26 Thread Chris Darroch (JIRA)
shut down TimeLimitedCollection timer thread on application unload
--

Key: SOLR-1735
URL: https://issues.apache.org/jira/browse/SOLR-1735
Project: Solr
Issue Type: Improvement
Affects Versions: 1.4, 1.3
Reporter: Chris Darroch


As described in https://issues.apache.org/jira/browse/LUCENE-2237, shutting 
down the timer thread created by Lucene's TimeLimitedCollector allows Tomcat or 
another application server to cleanly unload solr.war (or any application using 
Lucene, for that matter).

I'm attaching two patches for Solr 1.3 which use the patch provided in 
LUCENE-2237 to shut down the timer thread when a new servlet context listener 
for the solr.war application is informed the application is about to be 
unloaded.




[jira] Updated: (SOLR-874) Dismax parser exceptions on trailing OPERATOR

2009-11-20 Thread Chris Darroch (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Darroch updated SOLR-874:
---

Attachment: SOLR-874-1.3.patch

Hi, I'm one of the httpd devs but I thought I'd throw in this patch for Solr 
1.3 (I'll try to make one for trunk later) which handles a number of the issues 
raised in this report for us.

First, & and | are escaped, and the dismax logic is changed a little so that if 
the various query-munging methods return a blank string, we fall back to using 
the configured default query.
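
As a rough illustration of that first step (a Python sketch, not the Java in 
the patch): escape_amp_pipe() and effective_query() are hypothetical names, 
default_q stands for whatever default query is configured, and the backslash 
style of escaping is an assumption based on Lucene's usual query syntax.

```python
import re


def escape_amp_pipe(q):
    """Backslash-escape every & and | so they are read as literal chars."""
    return re.sub(r"([&|])", r"\\\1", q)


def effective_query(user_q, default_q, munge=escape_amp_pipe):
    """Apply the munging step; fall back to default_q if nothing is left."""
    munged = munge(user_q).strip()
    return munged if munged else default_q


print(escape_amp_pipe("rock && roll"))   # rock \&\& roll
print(effective_query("   ", "*:*"))     # *:*
```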

Next, consecutive + or - chars are flattened to a single char; this handles 
cases where a user might accidentally type --foo when they just mean -foo.

Strings of mixed + and - chars are removed, since we have no way of knowing the 
user's intent with something like +-foo or similar.

Together these two steps handle one of the reported cases where the query 
starts with multiple + or - operators.

Any remaining + or - chars which trail the last term, or which have whitespace 
on their right side, are removed.  Our users found it puzzling in the extreme 
that a search on "questions 1 - 10" explicitly excluded results with "10" in 
them, because "- 10" is treated as -10.  So we just remove any + or - operators 
which aren't right up against the following term.
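
The + and - handling above can be sketched in a few lines.  This is a Python 
approximation, not the patch itself: tidy_operators() is a hypothetical name, 
it makes no attempt to treat quoted phrases specially, and the final 
whitespace clean-up is a convenience of the sketch.

```python
import re


def _collapse_run(match):
    """Drop a mixed run of + and -; reduce a uniform run to one char."""
    run = match.group(0)
    if "+" in run and "-" in run:
        return ""
    return run[0]


def tidy_operators(q):
    # Steps 1 and 2: flatten "--" to "-", remove mixed runs such as "+-".
    q = re.sub(r"[+-]+", _collapse_run, q)
    # Step 3: remove an operator that is followed by whitespace or the end.
    q = re.sub(r"[+-](?=\s|$)", "", q)
    # Not part of the description: squeeze the gaps left by removals.
    return " ".join(q.split())


print(tidy_operators("--foo"))             # -foo
print(tidy_operators("+-foo bar"))         # foo bar
print(tidy_operators("questions 1 - 10"))  # questions 1 10
```

An operator that sits right up against its term, as in "+ipod -nano", passes 
through unchanged.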

Finally, we escape AND, OR, and NOT when they appear outside of quotes, and 
remove any trailing unmatched quote.  This changes the previous behaviour which 
removes all quotes if they aren't perfectly balanced; we felt this was more in 
line with what users expect if they mistype and enter an extra quote char.
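
A sketch of this last step, under stated assumptions: escape_bare_operators() 
is a hypothetical name, the backslash prefix is only a guess at how the 
operators are escaped, "trailing unmatched quote" is read as the last quote 
char when the count is odd, and backslash-escaped quotes are ignored.

```python
import re


def escape_bare_operators(q):
    # If the quotes do not pair up, drop the last one and keep the rest.
    if q.count('"') % 2 == 1:
        last = q.rfind('"')
        q = q[:last] + q[last + 1:]
    # The capturing group makes re.split keep quoted phrases at odd indexes.
    pieces = re.split(r'("[^"]*")', q)
    for i in range(0, len(pieces), 2):
        # Only the text outside quotes is rewritten.
        pieces[i] = re.sub(r"\b(AND|OR|NOT)\b", r"\\\1", pieces[i])
    return "".join(pieces)


print(escape_bare_operators("ipod AND"))                 # ipod \AND
print(escape_bare_operators('"this OR that" OR other'))  # "this OR that" \OR other
print(escape_bare_operators('ipod "nano'))               # ipod nano
```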

So far I haven't been able to generate any Lucene query parser exceptions with 
this code, but it doesn't mean it's perfect, obviously -- there may still be 
some way to slip an invalid Lucene query past it.  But I'm cautiously 
optimistic that it covers all or most of the issues raised so far in the thread.

> Dismax parser exceptions on trailing OPERATOR
> -
>
> Key: SOLR-874
> URL: https://issues.apache.org/jira/browse/SOLR-874
> Project: Solr
> Issue Type: Bug
> Components: search
> Affects Versions: 1.3
> Reporter: Erik Hatcher
> Fix For: 1.5
>
> Attachments: SOLR-874-1.3.patch, SOLR-874.patch
>
>
> Dismax is supposed to be immune to parse exceptions, but alas it's not:
> http://localhost:8983/solr/select?defType=dismax&qf=name&q=ipod+AND
> kaboom!
> Caused by: org.apache.lucene.queryParser.ParseException: Cannot parse 'ipod AND': Encountered "" at line 1, column 8.
> Was expecting one of:
>  ...
> "+" ...
> "-" ...
> "(" ...
> "*" ...
>  ...
>  ...
>  ...
>  ...
> "[" ...
> "{" ...
>  ...
>  ...
> "*" ...
> 
>   at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:175)
>   at org.apache.solr.search.DismaxQParser.parse(DisMaxQParserPlugin.java:138)
>   at org.apache.solr.search.QParser.getQuery(QParser.java:88)
