[jira] Updated: (SOLR-2010) Improvements to SpellCheckComponent Collate functionality

2010-10-20 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2010:
-

Attachment: SOLR-2010_141.patch

Updated the 1.4.1 patch to include Yonik's fix.

 Improvements to SpellCheckComponent Collate functionality
 -

 Key: SOLR-2010
 URL: https://issues.apache.org/jira/browse/SOLR-2010
 Project: Solr
  Issue Type: New Feature
  Components: clients - java, spellchecker
Affects Versions: 1.4.1
 Environment: Tested against trunk revision 966633
Reporter: James Dyer
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-2010.patch, SOLR-2010.patch, SOLR-2010.patch, 
 SOLR-2010.patch, SOLR-2010.txt, SOLR-2010_141.patch, SOLR-2010_141.patch, 
 SOLR-2010_shardRecombineCollations_993538.patch, 
 SOLR-2010_shardRecombineCollations_999521.patch, 
 SOLR-2010_shardSearchHandler_993538.patch, 
 SOLR-2010_shardSearchHandler_999521.patch, solr_2010_3x.patch


 Improvements to SpellCheckComponent Collate functionality
 Our project requires a better Spell Check Collator.  I'm contributing this as 
 a patch to get suggestions for improvements and in case there is a broader 
 need for these features.
 1. Only return collations that are guaranteed to result in hits if re-queried 
 (applying original fq params also).  This is especially helpful when there is 
 more than one correction per query.  The 1.4 behavior does not verify that a 
 particular combination will actually return hits.
 2. Provide the option to get multiple collation suggestions
 3. Provide extended collation results including the # of hits re-querying 
 will return and a breakdown of each misspelled word and its correction.
 This patch is similar to what is described in SOLR-507 item #1.  Also, this 
 patch provides a viable workaround for the problem discussed in SOLR-1074.  A 
 dictionary could be created that combines the terms from the multiple fields. 
  The collator then would prune out any spurious suggestions this would cause.
 This patch adds the following spellcheck parameters:
 1. spellcheck.maxCollationTries - maximum # of collation possibilities to try 
 before giving up.  Lower values ensure better performance.  Higher values may 
 be necessary to find a collation that can return results.  Default is 0, 
 which maintains backwards-compatible behavior (do not check collations).
 2. spellcheck.maxCollations - maximum # of collations to return.  Default is 
 1, which maintains backwards-compatible behavior.
 3. spellcheck.collateExtendedResult - if true, returns an expanded response 
 format detailing collations found.  default is false, which maintains 
 backwards-compatible behavior.  When true, output is like this (in context):
 <lst name="spellcheck">
   <lst name="suggestions">
     <lst name="hopq">
       <int name="numFound">94</int>
       <int name="startOffset">7</int>
       <int name="endOffset">11</int>
       <arr name="suggestion">
         <str>hope</str>
         <str>how</str>
         <str>hope</str>
         <str>chops</str>
         <str>hoped</str>
         etc
       </arr>
     </lst>
     <lst name="faill">
       <int name="numFound">100</int>
       <int name="startOffset">16</int>
       <int name="endOffset">21</int>
       <arr name="suggestion">
         <str>fall</str>
         <str>fails</str>
         <str>fail</str>
         <str>fill</str>
         <str>faith</str>
         <str>all</str>
         etc
       </arr>
     </lst>
     <lst name="collation">
       <str name="collationQuery">Title:(how AND fails)</str>
       <int name="hits">2</int>
       <lst name="misspellingsAndCorrections">
         <str name="hopq">how</str>
         <str name="faill">fails</str>
       </lst>
     </lst>
     <lst name="collation">
       <str name="collationQuery">Title:(hope AND faith)</str>
       <int name="hits">2</int>
       <lst name="misspellingsAndCorrections">
         <str name="hopq">hope</str>
         <str name="faill">faith</str>
       </lst>
     </lst>
     <lst name="collation">
       <str name="collationQuery">Title:(chops AND all)</str>
       <int name="hits">1</int>
       <lst name="misspellingsAndCorrections">
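
The interaction between spellcheck.maxCollationTries and
spellcheck.maxCollations described above can be sketched roughly as follows.
This is a minimal illustration, not the code from the attached patches: the
class name, the method signature and the hitCounter callback (a stand-in for
re-running the query with the original fq params) are all assumptions, and the
behavior when maxCollationTries is 0 but maxCollations is greater than 1 is a
guess.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

// Illustrative only: hypothetical names, not the SpellCheckCollator from the patch.
final class CollationTrySketch {

    /**
     * Walks candidate collations in order and keeps the ones that return hits.
     *
     * @param candidates        candidate collation queries, best first (hypothetical input)
     * @param hitCounter        stand-in for re-querying with the original fq params
     * @param maxCollationTries mirrors spellcheck.maxCollationTries; 0 means "do not check"
     * @param maxCollations     mirrors spellcheck.maxCollations
     */
    static List<String> collate(List<String> candidates,
                                ToIntFunction<String> hitCounter,
                                int maxCollationTries,
                                int maxCollations) {
        List<String> kept = new ArrayList<>();
        int tries = 0;
        for (String candidate : candidates) {
            if (kept.size() >= maxCollations) {
                break; // enough collations collected
            }
            if (maxCollationTries <= 0) {
                // Backwards-compatible mode: accept the candidate unchecked.
                kept.add(candidate);
                continue;
            }
            if (tries >= maxCollationTries) {
                break; // give up: the try budget is spent
            }
            tries++;
            if (hitCounter.applyAsInt(candidate) > 0) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> candidates = List.of(
            "Title:(hope AND fall)", "Title:(how AND fails)", "Title:(hope AND faith)");
        // Fake index: the first combination has no hits, the others have two.
        ToIntFunction<String> fakeIndex =
            q -> q.equals("Title:(hope AND fall)") ? 0 : 2;
        System.out.println(collate(candidates, fakeIndex, 5, 2));
        System.out.println(collate(candidates, fakeIndex, 0, 1));
    }
}
```

With a try budget of 5 and maxCollations of 2, the unverifiable first
combination is skipped; with the defaults (0 and 1), the first candidate is
returned without being checked.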
   

[jira] Updated: (SOLR-2010) Improvements to SpellCheckComponent Collate functionality

2010-10-20 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2010:
-

Attachment: multiple_collations_as_an_array.patch

Here's an attempt to implement Yonik's suggestion to have multiple collations 
returned as an Array rather than use repeated keys.  I am not familiar with 
JSON so I didn't realize the original format would cause problems.  

From this perspective, however, I like the original version better.  The 
problem is that, in order to maintain backwards-compatibility, if 
spellcheck.maxCollations is unset or set to 1, we need to return a 
single String with key "collation".  This patch alters the response only if 
spellcheck.maxCollations is greater than 1, instead returning an array with 
key "collations".  

I also changed the distributed code and solrj to cope with the change in 
format.  All tests pass, but maybe someone will find a better solution than 
this, or perhaps we can leave it as is.
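
The backwards-compatibility rule described above can be sketched like this.
It is only an illustration of the two response shapes: the class and method
names are hypothetical, a plain Map stands in for Solr's response structure,
and returning nothing when no collation was found is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a plain Map stands in for the real response object.
final class CollationResponseSketch {

    /**
     * Chooses the response shape from the requested maxCollations:
     * a single String under "collation" when it is unset or 1,
     * an array under "collations" when it is greater than 1.
     */
    static Map<String, Object> shape(List<String> collations, int maxCollations) {
        Map<String, Object> response = new LinkedHashMap<>();
        if (collations.isEmpty()) {
            return response; // assumption: nothing is added when no collation exists
        }
        if (maxCollations <= 1) {
            // Old format: one String value, as clients of 1.4 expect.
            response.put("collation", collations.get(0));
        } else {
            // New format: one key holding a list, so no key is repeated.
            int n = Math.min(maxCollations, collations.size());
            response.put("collations", List.copyOf(collations.subList(0, n)));
        }
        return response;
    }

    public static void main(String[] args) {
        List<String> found = List.of("Title:(how AND fails)", "Title:(hope AND faith)");
        System.out.println(shape(found, 1));
        System.out.println(shape(found, 2));
    }
}
```

A map-like JSON object cannot reliably carry the same key twice, which is why
the multi-collation case needs a single key with a list value.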

 Improvements to SpellCheckComponent Collate functionality
 -

 Key: SOLR-2010
 URL: https://issues.apache.org/jira/browse/SOLR-2010
 Project: Solr
  Issue Type: New Feature
  Components: clients - java, spellchecker
Affects Versions: 1.4.1
 Environment: Tested against trunk revision 966633
Reporter: James Dyer
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: multiple_collations_as_an_array.patch, SOLR-2010.patch, 
 SOLR-2010.patch, SOLR-2010.patch, SOLR-2010.patch, SOLR-2010.txt, 
 SOLR-2010_141.patch, SOLR-2010_141.patch, 
 SOLR-2010_shardRecombineCollations_993538.patch, 
 SOLR-2010_shardRecombineCollations_999521.patch, 
 SOLR-2010_shardSearchHandler_993538.patch, 
 SOLR-2010_shardSearchHandler_999521.patch, solr_2010_3x.patch



[jira] Updated: (SOLR-2010) Improvements to SpellCheckComponent Collate functionality

2010-10-19 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2010:
-

Attachment: solr_2010_3x.patch

Here is a patch for the 3.x branch.  This includes Yonik's fix to close the 
searcher (thanks!).  All tests pass.

Grant, do you feel this is something that can safely go into the 3.x branch in 
addition to Trunk?

(by the way, I am looking into Yonik's suggestion to change multiple collation 
results into an Array.  The trick here, I think, is to not break 
backwards-compatibility...)

 Improvements to SpellCheckComponent Collate functionality
 -

 Key: SOLR-2010
 URL: https://issues.apache.org/jira/browse/SOLR-2010
 Project: Solr
  Issue Type: New Feature
  Components: clients - java, spellchecker
Affects Versions: 1.4.1
 Environment: Tested against trunk revision 966633
Reporter: James Dyer
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-2010.patch, SOLR-2010.patch, SOLR-2010.patch, 
 SOLR-2010.patch, SOLR-2010.txt, SOLR-2010_141.patch, 
 SOLR-2010_shardRecombineCollations_993538.patch, 
 SOLR-2010_shardRecombineCollations_999521.patch, 
 SOLR-2010_shardSearchHandler_993538.patch, 
 SOLR-2010_shardSearchHandler_999521.patch, solr_2010_3x.patch



[jira] Updated: (SOLR-2010) Improvements to SpellCheckComponent Collate functionality

2010-09-22 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2010:
-

Attachment: SOLR-2010_141.patch

This version is for v1.4.1.  No shard support as SpellCheckComponent does not 
have any distributed support in 1.4.  All tests pass.

 Improvements to SpellCheckComponent Collate functionality
 -

 Key: SOLR-2010
 URL: https://issues.apache.org/jira/browse/SOLR-2010
 Project: Solr
  Issue Type: New Feature
  Components: clients - java, spellchecker
Affects Versions: 1.4.1
 Environment: Tested against trunk revision 966633
Reporter: James Dyer
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-2010.patch, SOLR-2010.patch, SOLR-2010.patch, 
 SOLR-2010.patch, SOLR-2010.txt, SOLR-2010_141.patch, 
 SOLR-2010_shardRecombineCollations_993538.patch, 
 SOLR-2010_shardRecombineCollations_999521.patch, 
 SOLR-2010_shardSearchHandler_993538.patch, 
 SOLR-2010_shardSearchHandler_999521.patch



[jira] Updated: (SOLR-2010) Improvements to SpellCheckComponent Collate functionality

2010-09-21 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2010:
-

Attachment: SOLR-2010_shardSearchHandler_999521.patch
SOLR-2010_shardRecombineCollations_999521.patch

Both patch versions sync'ed to Trunk version 999521. (sorry about the many 
filename variants)

 Improvements to SpellCheckComponent Collate functionality
 -

 Key: SOLR-2010
 URL: https://issues.apache.org/jira/browse/SOLR-2010
 Project: Solr
  Issue Type: New Feature
  Components: clients - java, spellchecker
Affects Versions: 1.4.1
 Environment: Tested against trunk revision 966633
Reporter: James Dyer
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-2010.patch, SOLR-2010.patch, SOLR-2010.patch, 
 SOLR-2010.patch, SOLR-2010.txt, 
 SOLR-2010_shardRecombineCollations_993538.patch, 
 SOLR-2010_shardRecombineCollations_999521.patch, 
 SOLR-2010_shardSearchHandler_993538.patch, 
 SOLR-2010_shardSearchHandler_999521.patch



[jira] Updated: (SOLR-2010) Improvements to SpellCheckComponent Collate functionality

2010-09-08 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2010:
-

Attachment: SOLR-2010_shardSearchHandler_993538.patch
SOLR-2010_shardRecombineCollations_993538.patch

Two new versions of the patch:

1. SOLR-2010_shardSearchHandler_993538.patch is the same as the 8/23/2010 
version except it applies cleanly to trunk revision #993538.  In a Distributed 
setup, this version calls an overloaded method on SearchHandler to use its 
logic for combining results from the collation test queries.  This is simpler 
code but requires many more round-trips between shards.  With this version we 
can also guarantee that a Distributed setup will always return exactly the same 
collations, in the same order, as a non-Distributed setup.  

2. SOLR-2010_shardRecombineCollations_993538.patch is similar to the 8/19/2010 
version, with improvements.  This version also applies cleanly to trunk 
revision #993538.  In a Distributed setup, each shard calls QueryComponent 
individually and generates its own list of Collations.  The SpellCheckComponent 
then combines and sorts the resulting collations, returning the best ones, up 
to the client-specified maximum.  This requires more complicated logic in 
SpellCheckComponent.finishStage(), although it does not necessitate changes to 
SearchHandler or ResponseBuilder.  It may be possible to find cases where a 
Distributed setup may return different collations--or the same collations in a 
different order--than a non-distributed setup.  I do not believe this potential 
disparity would ever be very significant.
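
The recombination step in version 2 might look roughly like the sketch below.
This is not the logic from the patch: the class name is hypothetical, and the
ranking rule (sum the hit counts each shard reports for the same collation
query, then sort by total hits, highest first) is an assumption made only to
have something concrete to sort by.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: the ranking rule here is assumed, not taken from the patch.
final class ShardCollationMergeSketch {

    /**
     * Combines per-shard collation lists into one ranked list.
     *
     * @param perShard      one map per shard: collation query -> hits on that shard
     * @param maxCollations client-specified maximum number of collations to return
     */
    static List<String> merge(List<Map<String, Integer>> perShard, int maxCollations) {
        // Sum hits per collation query, remembering first-seen order.
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Map<String, Integer> shard : perShard) {
            for (Map.Entry<String, Integer> e : shard.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        // Sort by total hits, highest first; List.sort is stable, so ties
        // keep their first-seen order.
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(totals.entrySet());
        ranked.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));

        List<String> best = new ArrayList<>();
        for (Map.Entry<String, Integer> e : ranked) {
            if (best.size() >= maxCollations) {
                break;
            }
            best.add(e.getKey());
        }
        return best;
    }
}
```

Because each shard only tests collations against its own documents, a
combination with hits on a single shard can rank differently than it would on
one combined index, which is one way the disparity mentioned above could arise.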

Grant, I believe version 1 is something like what you were thinking of on 8/9 
and 8/19.  Version 2 is more like what you describe in your comment from 8/30.  
Let me know if you think this needs any more tweaking.  Also, if you're 
thinking of possibly committing this someday, you may want to look at 
SOLR-2049.  Based on my understanding, the distributed SpellCheckComponent as 
it currently exists in Trunk is broken.  If I'm right, we may want to fix it 
before adding more functionality.

 Improvements to SpellCheckComponent Collate functionality
 -

 Key: SOLR-2010
 URL: https://issues.apache.org/jira/browse/SOLR-2010
 Project: Solr
  Issue Type: New Feature
  Components: clients - java, spellchecker
Affects Versions: 1.4.1
 Environment: Tested against trunk revision 966633
Reporter: James Dyer
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-2010.patch, SOLR-2010.patch, SOLR-2010.patch, 
 SOLR-2010.patch, SOLR-2010.txt, 
 SOLR-2010_shardRecombineCollations_993538.patch, 
 SOLR-2010_shardSearchHandler_993538.patch



[jira] Updated: (SOLR-2010) Improvements to SpellCheckComponent Collate functionality

2010-08-23 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2010:
-

Attachment: SOLR-2010.patch

New Patch Version with Shard Support.  Grant, I hope I'm getting closer to what 
you have in mind this time around.

I think I've figured how to send the collation test queries back to 
SearchHandler and have it take care of querying the shards individually.  Then 
the collation logic is no different for distributed / non-distributed.

As I would like to eventually use this in production here, any comments as to 
how to further make this a production-quality feature are much appreciated.

 Improvements to SpellCheckComponent Collate functionality
 -

 Key: SOLR-2010
 URL: https://issues.apache.org/jira/browse/SOLR-2010
 Project: Solr
  Issue Type: New Feature
  Components: clients - java, spellchecker
Affects Versions: 1.4.1
 Environment: Tested against trunk revision 966633
Reporter: James Dyer
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-2010.patch, SOLR-2010.patch, SOLR-2010.patch, 
 SOLR-2010.patch, SOLR-2010.txt



[jira] Updated: (SOLR-2010) Improvements to SpellCheckComponent Collate functionality

2010-08-19 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2010:
-

Attachment: SOLR-2010.patch

Third version (with .patch extension.  I had used .txt extension with 2nd 
version).  Works with trunk rev#986945.

This time SpellCheckCollator calls the SearchHandler instead of calling the 
QueryComponent.  This required exposing a reference to the SearchHandler on the 
ResponseBuilder.  Also a new overloaded method in 
SearchHandler.processRequestBody() lets you override the list of components to 
run.  In this case we just have it run QueryComponent.
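
The idea of an overload that runs a caller-supplied component list can be
sketched as follows.  This is a toy model, not Solr's API: Component, Handler
and handle() are hypothetical names, and the real SearchHandler works with
request and response objects rather than a log of strings.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical types, not Solr's SearchHandler or SearchComponent.
final class ComponentOverrideSketch {

    /** Stand-in for a search component; appends its name to a log when run. */
    interface Component {
        void process(List<String> log);
    }

    static final class Handler {
        private final List<Component> defaults;

        Handler(List<Component> defaults) {
            this.defaults = defaults;
        }

        /** Normal path: run every configured component. */
        List<String> handle() {
            return handle(defaults);
        }

        /** Overload: run only the components the caller passes in. */
        List<String> handle(List<Component> override) {
            List<String> log = new ArrayList<>();
            for (Component c : override) {
                c.process(log);
            }
            return log;
        }
    }

    public static void main(String[] args) {
        Component query = log -> log.add("query");
        Component spellcheck = log -> log.add("spellcheck");
        Handler handler = new Handler(List.of(query, spellcheck));
        System.out.println(handler.handle());               // all components
        System.out.println(handler.handle(List.of(query))); // query only
    }
}
```

Restricting a collation test query to the query component avoids re-entering
the spellcheck component while it is still checking its own collations.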

This revision has 2 potential benefits: 
 
(1) the overloaded method in SearchHandler may prove useful to other components 
in the future.  

(2) there may be a way to get SearchHandler to requery all the shards at once 
and then there would be no need to reintegrate the Collations in 
SearchHandler.finishStage().  However, see my comment in SpellCheckCollator 
lines 56-57.  Likely I am calling SpellCheckCollator during the wrong stage 
of the distributed request, but I need to find out more specifically how 
shards work to determine how to improve this further.  As time allows I 
will do my own investigating, but anyone's advice would be greatly appreciated.

Finally, this version corrects a bug that would have caused one of the test 
scenarios in DistributedSpellCheckComponentTest to fail.  Unfortunately in the 
2nd version, I had left some scenarios commented-out and did not catch this 
until now.


 Improvements to SpellCheckComponent Collate functionality
 -

 Key: SOLR-2010
 URL: https://issues.apache.org/jira/browse/SOLR-2010
 Project: Solr
  Issue Type: New Feature
  Components: clients - java, spellchecker
Affects Versions: 1.4.1
 Environment: Tested against trunk revision 966633
Reporter: James Dyer
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-2010.patch, SOLR-2010.patch, SOLR-2010.patch, 
 SOLR-2010.txt



[jira] Updated: (SOLR-2010) Improvements to SpellCheckComponent Collate functionality

2010-08-18 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2010:
-

Attachment: SOLR-2010.txt

Second version of patch.  Updated to trunk rev #986945.

Adds support for shards.  I originally implemented this by passing the 
SearchHandler to the SpellCheckComponent and then using an overloaded version 
of SearchHandler.handleRequestBody() to do the re-queries.  I found this was 
unnecessary as we get the same results by calling the QueryComponent directly.  

I added some test scenarios to DistributedSpellCheckComponentTest and all 
pass.  However, I am a bit disturbed to find that the test fails if I uncomment 
the constructor (added with this patch).  The constructor simply tells it to 
test only with 4 shards rather than trying 1 shard, then 2, etc.  I found 
either way the 4-shard test results in the same docs going to the same shards.  
Yet the results are different.  Specifically the ranking/ordering of the 
collations returned and the # of hits reported are sometimes wrong when the 
constructor is called before the test.  Unfortunately I am at a loss as to why 
I get inconsistent results here and anyone's assistance on this would be most 
helpful. 

I also added a unit test method to verify this works when multiple request 
handlers are configured with different qf params, and another that verifies 
it works when fq is set.
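
As a rough illustration of what "works when fq is set" means for the re-queries, the sketch below derives the parameters for one collation check from the original request: every original parameter (including each fq) is kept, and only q is swapped for the candidate collation.  This is not code from the patch.  The class and method names are hypothetical, the fq values in the example are made up, and setting rows=0 (only the hit count is needed) is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: hypothetical helper, not part of the SOLR-2010 patch. */
class RequeryParamsSketch {

    /**
     * Copies the original request parameters, replaces q with the candidate
     * collation, and leaves every other parameter (fq included) untouched.
     */
    static Map<String, List<String>> forCollation(Map<String, List<String>> original,
                                                  String collation) {
        Map<String, List<String>> copy = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : original.entrySet()) {
            // Defensive copy so the original request is never modified.
            copy.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        copy.put("q", new ArrayList<>(Collections.singletonList(collation)));
        // Assumption: only the hit count matters, so no documents are requested.
        copy.put("rows", new ArrayList<>(Collections.singletonList("0")));
        return copy;
    }

    public static void main(String[] args) {
        Map<String, List<String>> original = new LinkedHashMap<>();
        original.put("q", Collections.singletonList("Title:(hopq AND faill)"));
        // Hypothetical filter queries, for illustration only.
        List<String> fq = new ArrayList<>();
        fq.add("inStock:true");
        fq.add("category:books");
        original.put("fq", fq);

        System.out.println(forCollation(original, "Title:(how AND fails)"));
    }
}
```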



 Improvements to SpellCheckComponent Collate functionality
 -

 Key: SOLR-2010
 URL: https://issues.apache.org/jira/browse/SOLR-2010
 Project: Solr
  Issue Type: New Feature
  Components: clients - java, spellchecker
Affects Versions: 1.4.1
 Environment: Tested against trunk revision 966633
Reporter: James Dyer
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-2010.patch, SOLR-2010.patch, SOLR-2010.txt



[jira] Updated: (SOLR-2010) Improvements to SpellCheckComponent Collate functionality

2010-08-09 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated SOLR-2010:
--

Attachment: SOLR-2010.patch

Added license headers

 Improvements to SpellCheckComponent Collate functionality
 -

 Key: SOLR-2010
 URL: https://issues.apache.org/jira/browse/SOLR-2010
 Project: Solr
  Issue Type: New Feature
  Components: clients - java, spellchecker
Affects Versions: 1.4.1
 Environment: Tested against trunk revision 966633
Reporter: James Dyer
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-2010.patch, SOLR-2010.patch



[jira] Updated: (SOLR-2010) Improvements to SpellCheckComponent Collate functionality

2010-07-22 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2010:
-

Attachment: SOLR-2010.patch

Tested against trunk revision #966633

 Improvements to SpellCheckComponent Collate functionality
 -

 Key: SOLR-2010
 URL: https://issues.apache.org/jira/browse/SOLR-2010
 Project: Solr
  Issue Type: New Feature
  Components: clients - java, spellchecker
Affects Versions: 1.4.1
 Environment: Tested against trunk revision 966633
Reporter: James Dyer
Priority: Minor
 Attachments: SOLR-2010.patch


 Improvements to SpellCheckComponent Collate functionality
 Our project requires a better Spell Check Collator.  I'm contributing this as 
 a patch to get suggestions for improvements and in case there is a broader 
 need for these features.
 1. Only return collations that are guaranteed to result in hits if re-queried 
 (applying original fq params also).  This is especially helpful when there is 
 more than one correction per query.  The 1.4 behavior does not verify that a 
 particular combination will actually return hits.
 2. Provide the option to get multiple collation suggestions
 3. Provide extended collation results including the # of hits re-querying 
 will return and a breakdown of each misspelled word and its correction.
 This patch is similar to what is described in SOLR-507 item #1.  Also, this 
 patch provides a viable workaround for the problem discussed in SOLR-1074.  A 
 dictionary could be created that combines the terms from the multiple fields. 
  The collator then would prune out any spurious suggestions this would cause.
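
The verification idea in item 1 can be sketched roughly as below.  This is an illustration under stated assumptions, not the patch's code.  The class, the `verify` method and the `hitCounter` callback (standing in for the re-query against the index) are hypothetical names.  The two limits mirror the tries and collations parameters this description introduces.  The sketch covers only the case where checking is enabled (a tries limit above 0).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToLongFunction;

/** Illustrative only: hypothetical sketch, not part of the SOLR-2010 patch. */
class CollationVerifierSketch {

    /**
     * Tries candidate collations in order and keeps those whose re-query
     * returns at least one hit.  Stops after maxTries candidates have been
     * checked or maxCollations verified collations have been collected.
     *
     * @return verified collation query mapped to its hit count, in try order
     */
    static Map<String, Long> verify(List<String> candidates,
                                    ToLongFunction<String> hitCounter,
                                    int maxTries,
                                    int maxCollations) {
        Map<String, Long> verified = new LinkedHashMap<>();
        int tries = 0;
        for (String candidate : candidates) {
            if (tries >= maxTries || verified.size() >= maxCollations) {
                break;
            }
            tries++;
            long hits = hitCounter.applyAsLong(candidate); // stands in for the re-query
            if (hits > 0) {
                verified.put(candidate, hits);
            }
        }
        return verified;
    }

    public static void main(String[] args) {
        // Fake "index": the first entry is a made-up zero-hit combination; the
        // other hit counts are copied from the sample output in this description.
        Map<String, Long> fakeHits = new HashMap<>();
        fakeHits.put("Title:(hope AND fall)", 0L);
        fakeHits.put("Title:(how AND fails)", 2L);
        fakeHits.put("Title:(hope AND faith)", 2L);
        fakeHits.put("Title:(chops AND all)", 1L);

        List<String> candidates = Arrays.asList(
                "Title:(hope AND fall)",
                "Title:(how AND fails)",
                "Title:(hope AND faith)",
                "Title:(chops AND all)");

        System.out.println(
                verify(candidates, q -> fakeHits.getOrDefault(q, 0L), 10, 2));
    }
}
```

With a tries limit of 2 instead of 10, the same input would yield only one collation, because the first try is spent on the zero-hit candidate.  That is the performance versus recall trade-off the tries limit controls.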
 This patch adds the following spellcheck parameters:
 1. spellcheck.maxCollationTries - maximum # of collation possibilities to try 
 before giving up.  Lower values ensure better performance.  Higher values may 
 be necessary to find a collation that can return results.  Default is 0, 
 which maintains backwards-compatible behavior (do not check collations).
 2. spellcheck.maxCollations - maximum # of collations to return.  Default is 
 1, which maintains backwards-compatible behavior.
 3. spellcheck.collateExtendedResult - if true, returns an expanded response 
 format detailing collations found.  Default is false, which maintains 
 backwards-compatible behavior.  When true, output is like this (in context):
 <lst name="spellcheck">
   <lst name="suggestions">
     <lst name="hopq">
       <int name="numFound">94</int>
       <int name="startOffset">7</int>
       <int name="endOffset">11</int>
       <arr name="suggestion">
         <str>hope</str>
         <str>how</str>
         <str>hope</str>
         <str>chops</str>
         <str>hoped</str>
         etc
       </arr>
     </lst>
     <lst name="faill">
       <int name="numFound">100</int>
       <int name="startOffset">16</int>
       <int name="endOffset">21</int>
       <arr name="suggestion">
         <str>fall</str>
         <str>fails</str>
         <str>fail</str>
         <str>fill</str>
         <str>faith</str>
         <str>all</str>
         etc
       </arr>
     </lst>
     <lst name="collation">
       <str name="collationQuery">Title:(how AND fails)</str>
       <int name="hits">2</int>
       <lst name="misspellingsAndCorrections">
         <str name="hopq">how</str>
         <str name="faill">fails</str>
       </lst>
     </lst>
     <lst name="collation">
       <str name="collationQuery">Title:(hope AND faith)</str>
       <int name="hits">2</int>
       <lst name="misspellingsAndCorrections">
         <str name="hopq">hope</str>
         <str name="faill">faith</str>
       </lst>
     </lst>
     <lst name="collation">
       <str name="collationQuery">Title:(chops AND all)</str>
       <int name="hits">1</int>
       <lst name="misspellingsAndCorrections">
         <str name="hopq">chops</str>
         <str name="faill">all</str>
       </lst>
     </lst>
   </lst>
 </lst>
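
For reference, a request that turns on all three new parameters might be assembled as in the sketch below.  It only builds a URL query string with the JDK and does not use SolrJ.  The class and method names are hypothetical.  The query text "Title:(hopq AND faill)" is inferred from the offsets in the sample output above and is not stated in this description.  The values 10 and 3 are arbitrary examples.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: hypothetical helper, not part of the SOLR-2010 patch. */
class SpellcheckRequestSketch {

    /** Builds a URL-encoded query string with the three new collation parameters. */
    static String buildQueryString(String q,
                                   int maxCollationTries,
                                   int maxCollations,
                                   boolean collateExtendedResult) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", q);
        params.put("spellcheck", "true");
        params.put("spellcheck.collate", "true");
        // Parameters added by this patch:
        params.put("spellcheck.maxCollationTries", Integer.toString(maxCollationTries));
        params.put("spellcheck.maxCollations", Integer.toString(maxCollations));
        params.put("spellcheck.collateExtendedResult", Boolean.toString(collateExtendedResult));

        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Example values only; 10 tries and 3 collations are arbitrary choices.
        System.out.println(buildQueryString("Title:(hopq AND faill)", 10, 3, true));
    }
}
```

Leaving spellcheck.maxCollationTries at its default of 0 would keep the 1.4 behavior, in which collations are returned without being checked.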
 In addition, SolrJ is updated to include 
 SpellCheckResponse.getCollatedResults(), which will return the expanded 
 Collation format.  getCollatedResult(), which returns a single String, is 
 retained for