[jira] Updated: (SOLR-2010) Improvements to SpellCheckComponent Collate functionality
[ https://issues.apache.org/jira/browse/SOLR-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Dyer updated SOLR-2010:

    Attachment: SOLR-2010_141.patch

Update the 1.4.1 patch to include Yonik's fix.

Improvements to SpellCheckComponent Collate functionality

    Key: SOLR-2010
    URL: https://issues.apache.org/jira/browse/SOLR-2010
    Project: Solr
    Issue Type: New Feature
    Components: clients - java, spellchecker
    Affects Versions: 1.4.1
    Environment: Tested against trunk revision 966633
    Reporter: James Dyer
    Assignee: Grant Ingersoll
    Priority: Minor
    Attachments: SOLR-2010.patch, SOLR-2010.patch, SOLR-2010.patch, SOLR-2010.patch, SOLR-2010.txt, SOLR-2010_141.patch, SOLR-2010_141.patch, SOLR-2010_shardRecombineCollations_993538.patch, SOLR-2010_shardRecombineCollations_999521.patch, SOLR-2010_shardSearchHandler_993538.patch, SOLR-2010_shardSearchHandler_999521.patch, solr_2010_3x.patch

Improvements to SpellCheckComponent Collate functionality

Our project requires a better Spell Check Collator. I'm contributing this as a patch to get suggestions for improvements and in case there is a broader need for these features.

1. Only return collations that are guaranteed to result in hits if re-queried (applying original fq params also). This is especially helpful when there is more than one correction per query. The 1.4 behavior does not verify that a particular combination will actually return hits.
2. Provide the option to get multiple collation suggestions.
3. Provide extended collation results including the # of hits re-querying will return and a breakdown of each misspelled word and its correction.

This patch is similar to what is described in SOLR-507 item #1. Also, this patch provides a viable workaround for the problem discussed in SOLR-1074. A dictionary could be created that combines the terms from the multiple fields. The collator then would prune out any spurious suggestions this would cause.

This patch adds the following spellcheck parameters:

1. spellcheck.maxCollationTries - maximum # of collation possibilities to try before giving up. Lower values ensure better performance. Higher values may be necessary to find a collation that can return results. Default is 0, which maintains backwards-compatible behavior (do not check collations).
2. spellcheck.maxCollations - maximum # of collations to return. Default is 1, which maintains backwards-compatible behavior.
3. spellcheck.collateExtendedResult - if true, returns an expanded response format detailing collations found. Default is false, which maintains backwards-compatible behavior. When true, output is like this (in context):

    <lst name="spellcheck">
      <lst name="suggestions">
        <lst name="hopq">
          <int name="numFound">94</int>
          <int name="startOffset">7</int>
          <int name="endOffset">11</int>
          <arr name="suggestion">
            <str>hope</str> <str>how</str> <str>hope</str> <str>chops</str> <str>hoped</str> etc
          </arr>
        </lst>
        <lst name="faill">
          <int name="numFound">100</int>
          <int name="startOffset">16</int>
          <int name="endOffset">21</int>
          <arr name="suggestion">
            <str>fall</str> <str>fails</str> <str>fail</str> <str>fill</str> <str>faith</str> <str>all</str> etc
          </arr>
        </lst>
        <lst name="collation">
          <str name="collationQuery">Title:(how AND fails)</str>
          <int name="hits">2</int>
          <lst name="misspellingsAndCorrections">
            <str name="hopq">how</str>
            <str name="faill">fails</str>
          </lst>
        </lst>
        <lst name="collation">
          <str name="collationQuery">Title:(hope AND faith)</str>
          <int name="hits">2</int>
          <lst name="misspellingsAndCorrections">
            <str name="hopq">hope</str>
            <str name="faill">faith</str>
          </lst>
        </lst>
        <lst name="collation">
          <str name="collationQuery">Title:(chops AND all)</str>
          <int name="hits">1</int>
          <lst name="misspellingsAndCorrections">
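The interplay of spellcheck.maxCollationTries and spellcheck.maxCollations described in the issue can be illustrated with a small sketch. This is not the patch's code: the function name, the order in which candidate combinations are tried, and the toy hit counter are all assumptions made for illustration. It only covers the verifying case (maxCollationTries greater than 0); with the default of 0 the patch skips verification entirely.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple


def find_collations(
    suggestions: Dict[str, List[str]],
    count_hits: Callable[[Dict[str, str]], int],
    max_tries: int,
    max_collations: int = 1,
) -> List[Tuple[Dict[str, str], int]]:
    """Return up to max_collations correction sets that yield at least one hit.

    Hypothetical sketch: candidates are tried in itertools.product order,
    which is an assumption, not the ordering used by the patch.
    """
    words = list(suggestions)
    verified: List[Tuple[Dict[str, str], int]] = []
    tries = 0
    for combo in product(*(suggestions[w] for w in words)):
        # Stop once the try budget is spent or enough collations were found.
        if tries >= max_tries or len(verified) >= max_collations:
            break
        tries += 1
        corrections = dict(zip(words, combo))
        hits = count_hits(corrections)  # stands in for re-running the query
        if hits > 0:
            verified.append((corrections, hits))
    return verified


# Toy stand-in for the index, using the hit counts from the sample output.
toy_index = {("how", "fails"): 2, ("hope", "faith"): 2, ("chops", "all"): 1}


def toy_count(corrections: Dict[str, str]) -> int:
    return toy_index.get((corrections["hopq"], corrections["faill"]), 0)


suggestions = {
    "hopq": ["hope", "how", "chops"],
    "faill": ["fall", "fails", "faith", "all"],
}
result = find_collations(suggestions, toy_count, max_tries=10, max_collations=3)
```

With a budget of 10 tries the sketch verifies two of the three possible collations; raising max_tries lets it reach the third, which is the performance trade-off the parameter description mentions.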
[jira] Updated: (SOLR-2010) Improvements to SpellCheckComponent Collate functionality
[ https://issues.apache.org/jira/browse/SOLR-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Dyer updated SOLR-2010:

    Attachment: multiple_collations_as_an_array.patch

Here's an attempt to implement Yonik's suggestion to return multiple collations as an array rather than using repeated keys. I am not familiar with JSON, so I didn't realize the original format would cause problems. From this perspective, however, I like the original version better.

The problem is that, in order to maintain backwards-compatibility, if spellcheck.maxCollations is unset or set to 1, we need to return a single String with key "collation". This patch alters the response only if spellcheck.maxCollations is greater than 1, in which case it returns an array with key "collations". I also changed the distributed code and SolrJ to cope with the change in format.

All tests pass, but maybe someone will find a better solution than this, or perhaps we can leave it as is.
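To make the repeated-key problem and the backwards-compatible response shape concrete, here is a minimal sketch. The function name and dictionary layout are hypothetical and are not taken from the patch; only the key names "collation" and "collations" and the rule about spellcheck.maxCollations come from the comment above.

```python
import json
from typing import Any, Dict, List

# A JSON object with a repeated key: Python's json module keeps only the
# last value, so one of the two collations is silently lost.
repeated = '{"collation": "Title:(how AND fails)", "collation": "Title:(hope AND faith)"}'
parsed = json.loads(repeated)


def collation_response(collations: List[str], max_collations: int = 1) -> Dict[str, Any]:
    """Hypothetical helper mirroring the response shape described above."""
    if not collations:
        return {}
    if max_collations <= 1:
        # Backwards-compatible shape: a single string under "collation".
        return {"collation": collations[0]}
    # New shape: an array under "collations", capped at max_collations.
    return {"collations": collations[:max_collations]}


found = ["Title:(how AND fails)", "Title:(hope AND faith)", "Title:(chops AND all)"]
single = collation_response(found)
several = collation_response(found, max_collations=2)
```

Other JSON parsers may handle repeated names differently, which is part of why repeated keys are awkward for clients.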
[jira] Updated: (SOLR-2010) Improvements to SpellCheckComponent Collate functionality
[ https://issues.apache.org/jira/browse/SOLR-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Dyer updated SOLR-2010:

    Attachment: solr_2010_3x.patch

Here is a patch for the 3.x branch. This includes Yonik's fix to close the searcher (thanks!). All tests pass. Grant, do you feel this is something that can safely go into the 3.x branch in addition to trunk?

(By the way, I am looking into Yonik's suggestion to change multiple collation results into an array. The trick here, I think, is to not break backwards-compatibility...)
[jira] Updated: (SOLR-2010) Improvements to SpellCheckComponent Collate functionality
[ https://issues.apache.org/jira/browse/SOLR-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Dyer updated SOLR-2010:

    Attachment: SOLR-2010_141.patch

This version is for v1.4.1. There is no shard support, as SpellCheckComponent does not have any distributed support in 1.4. All tests pass.
[jira] Updated: (SOLR-2010) Improvements to SpellCheckComponent Collate functionality
[ https://issues.apache.org/jira/browse/SOLR-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Dyer updated SOLR-2010:

    Attachment: SOLR-2010_shardSearchHandler_999521.patch
                SOLR-2010_shardRecombineCollations_999521.patch

Both patch versions are synced to trunk revision 999521. (Sorry about the many filename variants.)
[jira] Updated: (SOLR-2010) Improvements to SpellCheckComponent Collate functionality
[ https://issues.apache.org/jira/browse/SOLR-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Dyer updated SOLR-2010:

    Attachment: SOLR-2010_shardSearchHandler_993538.patch
                SOLR-2010_shardRecombineCollations_993538.patch

Two new versions of the patch:

1. SOLR-2010_shardSearchHandler_993538.patch is the same as the 8/23/2010 version except that it applies cleanly to trunk revision #993538. In a distributed setup, this version calls an overloaded method on SearchHandler to use its logic for combining results from the collation test queries. This is simpler code but requires many more round-trips between shards. It also guarantees that a distributed setup will always return exactly the same collations, in the same order, as a non-distributed setup.

2. SOLR-2010_shardRecombineCollations_993538.patch is similar to the 8/19/2010 version, with improvements. This version also applies cleanly to trunk revision #993538. In a distributed setup, each shard calls QueryComponent individually and generates its own list of collations. The SpellCheckComponent then combines and sorts the resulting collations, returning the best ones, up to the client-specified maximum. This requires more complicated logic in SpellCheckComponent.finishStage(), although it does not necessitate changes to SearchHandler or ResponseBuilder. It may be possible to find cases where a distributed setup returns different collations, or the same collations in a different order, than a non-distributed setup. I do not believe this potential disparity would ever be very significant.

Grant, I believe version 1 is something like what you were thinking of on 8/9 and 8/19. Version 2 is more like what you describe in your comment from 8/30. Let me know if you think this needs any more tweaking.

Also, if you're thinking of possibly committing this someday, you may want to look at SOLR-2049. Based on my understanding, distributed SpellCheckComponent as it currently exists in trunk is broken. If I'm right, we may want to fix it before adding more functionality.
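The "combine and sort" step of the shardRecombineCollations approach might look roughly like the sketch below. The comment does not say how collations are ranked, so summing hit counts across shards and ordering by total hits (ties broken by query text) are assumptions, as are the function name and the tuple layout.

```python
from typing import Dict, List, Tuple


def merge_shard_collations(
    per_shard: List[List[Tuple[str, int]]],
    max_collations: int,
) -> List[Tuple[str, int]]:
    """Merge (collation query, hits) pairs reported by each shard.

    Hypothetical sketch: hits for the same collation are summed, then the
    collations are ranked by total hits, highest first.
    """
    totals: Dict[str, int] = {}
    for shard_collations in per_shard:
        for query, hits in shard_collations:
            totals[query] = totals.get(query, 0) + hits
    # Sort by descending hits; break ties by query text for a stable order.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:max_collations]


shard_a = [("Title:(how AND fails)", 1), ("Title:(hope AND faith)", 2)]
shard_b = [("Title:(how AND fails)", 1), ("Title:(chops AND all)", 1)]
merged = merge_shard_collations([shard_a, shard_b], max_collations=2)
```

Because each shard only verifies the candidates it tried locally, a merge like this can differ from a single-index run, which is the disparity the comment mentions.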
[jira] Updated: (SOLR-2010) Improvements to SpellCheckComponent Collate functionality
[ https://issues.apache.org/jira/browse/SOLR-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Dyer updated SOLR-2010:

    Attachment: SOLR-2010.patch

New patch version with shard support. Grant, I hope I'm getting closer to what you have in mind this time around. I think I've figured out how to send the collation test queries back to SearchHandler and have it take care of querying the shards individually. The collation logic is then no different for distributed and non-distributed setups.

As I would like to eventually use this in production here, any comments on how to make this a production-quality feature are much appreciated.
[jira] Updated: (SOLR-2010) Improvements to SpellCheckComponent Collate functionality
[ https://issues.apache.org/jira/browse/SOLR-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Dyer updated SOLR-2010:

    Attachment: SOLR-2010.patch

Third version (with .patch extension; I had used a .txt extension with the 2nd version). Works with trunk rev #986945.

This time SpellCheckCollator calls the SearchHandler instead of calling the QueryComponent. This required exposing a reference to the SearchHandler on the ResponseBuilder. Also, a new overloaded method in SearchHandler.processRequestBody() lets you override the list of components to run. In this case we just have it run QueryComponent.

This revision has 2 potential benefits:

(1) The overloaded method in SearchHandler may prove useful to other components in the future.
(2) There may be a way to get SearchHandler to requery all the shards at once, and then there would be no need to reintegrate the collations in SearchHandler.finishStage().

However, see my comment in SpellCheckCollator lines 56-57. Likely I am calling SpellCheckCollator during the wrong stage of the distributed request, but I need to find out more specifically how shards work to determine how to improve this further. As time allows I will do my own investigating, but anyone's advice would be greatly appreciated.

Finally, this version corrects a bug that would have caused one of the test scenarios in DistributedSpellCheckComponentTest to fail. Unfortunately, in the 2nd version I had left some scenarios commented out and did not catch this until now.
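The idea of an overload that lets a caller override which components run can be sketched in a few lines. Everything here is a toy: the class, the component functions, and the request/response dictionaries are hypothetical and do not reflect Solr's SearchHandler API. Python has no method overloading, so an optional argument plays the role of the overloaded method.

```python
from typing import Callable, Dict, List, Optional

# A component reads the request and writes its part of the response.
Component = Callable[[Dict[str, str], Dict[str, object]], None]


def query_component(request: Dict[str, str], response: Dict[str, object]) -> None:
    response["query"] = request["q"]


def facet_component(request: Dict[str, str], response: Dict[str, object]) -> None:
    response["facets"] = {}


def spellcheck_component(request: Dict[str, str], response: Dict[str, object]) -> None:
    response["spellcheck"] = []


class ToySearchHandler:
    """Hypothetical handler that runs a configurable chain of components."""

    def __init__(self, components: List[Component]) -> None:
        self.components = components

    def handle(
        self,
        request: Dict[str, str],
        components: Optional[List[Component]] = None,
    ) -> Dict[str, object]:
        # Use the caller's component list when given, else the default chain.
        chain = self.components if components is None else components
        response: Dict[str, object] = {}
        for component in chain:
            component(request, response)
        return response


handler = ToySearchHandler([query_component, facet_component, spellcheck_component])
full = handler.handle({"q": "Title:(hopq AND faill)"})
# A collation test query only needs the query step.
query_only = handler.handle({"q": "Title:(how AND fails)"}, components=[query_component])
```

Running only the query step for each collation test avoids repeating faceting or spell checking on every candidate.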
[jira] Updated: (SOLR-2010) Improvements to SpellCheckComponent Collate functionality
[ https://issues.apache.org/jira/browse/SOLR-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Dyer updated SOLR-2010: - Attachment: SOLR-2010.txt Second version of patch. Updated to trunk rev #986945. Adds support for shards. I originally implemented this by passing the SearchHandler to the SpellCheckComponent and then using an overloaded version of SearchHandler.handleRequestBody() to do the re-queries. I found this was unnecessary as we get the same results by calling the QueryComponent directly. I added some test scenarios to DistributedSpellCheckComponentTest and all pass. However, I am a bit disturbed to find that the test fails if I uncomment the constructor (added with this patch). The constructor simply tells it to test only with 4 shards rather than trying 1 shard, then 2, etc. I found either way the 4-shard test results in the same docs going to the same shards. Yet the results are different. Specifically the ranking/ordering of the collations returned and the # of hits reported are sometimes wrong when the constructor is called before the test. Unfortunately I am at a loss as to why I get inconsistent results here and anyone's assistance on this would be most helpful. I also added an additional unit test method to verify this works when multiple request handlers are configured with different qf params. I also added a unit test method that verifies this works when fq is set. Improvements to SpellCheckComponent Collate functionality - Key: SOLR-2010 URL: https://issues.apache.org/jira/browse/SOLR-2010 Project: Solr Issue Type: New Feature Components: clients - java, spellchecker Affects Versions: 1.4.1 Environment: Tested against trunk revision 966633 Reporter: James Dyer Assignee: Grant Ingersoll Priority: Minor Attachments: SOLR-2010.patch, SOLR-2010.patch, SOLR-2010.txt Improvements to SpellCheckComponent Collate functionality Our project requires a better Spell Check Collator. 
I'm contributing this as a patch to get suggestions for improvements and in case there is a broader need for these features.

1. Only return collations that are guaranteed to result in hits if re-queried (applying original fq params also). This is especially helpful when there is more than one correction per query. The 1.4 behavior does not verify that a particular combination will actually return hits.
2. Provide the option to get multiple collation suggestions.
3. Provide extended collation results including the # of hits re-querying will return and a breakdown of each misspelled word and its correction.

This patch is similar to what is described in SOLR-507 item #1. Also, this patch provides a viable workaround for the problem discussed in SOLR-1074. A dictionary could be created that combines the terms from the multiple fields. The collator then would prune out any spurious suggestions this would cause.

This patch adds the following spellcheck parameters:

1. spellcheck.maxCollationTries - maximum # of collation possibilities to try before giving up. Lower values ensure better performance. Higher values may be necessary to find a collation that can return results. Default is 0, which maintains backwards-compatible behavior (do not check collations).
2. spellcheck.maxCollations - maximum # of collations to return. Default is 1, which maintains backwards-compatible behavior.
3. spellcheck.collateExtendedResult - if true, returns an expanded response format detailing collations found. Default is false, which maintains backwards-compatible behavior.
When true, output is like this (in context):

    <lst name="spellcheck">
      <lst name="suggestions">
        <lst name="hopq">
          <int name="numFound">94</int>
          <int name="startOffset">7</int>
          <int name="endOffset">11</int>
          <arr name="suggestion">
            <str>hope</str>
            <str>how</str>
            <str>hope</str>
            <str>chops</str>
            <str>hoped</str>
            etc
          </arr>
        </lst>
        <lst name="faill">
          <int name="numFound">100</int>
          <int name="startOffset">16</int>
          <int name="endOffset">21</int>
          <arr name="suggestion">
            <str>fall</str>
            <str>fails</str>
            <str>fail</str>
            <str>fill</str>
            <str>faith</str>
            <str>all</str>
            etc
          </arr>
        </lst>
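
To illustrate how spellcheck.maxCollationTries and spellcheck.maxCollations could interact, here is a minimal Java sketch of hit-verified collation selection. It is not the code from the patch: the class CollationSketch, the method select, and the hitCounter callback (standing in for a re-query that applies the original fq params) are hypothetical names. Reporting a hit count of -1 for unchecked candidates when maxCollationTries is 0 is also an assumption about what "do not check collations" might look like.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToLongFunction;

// Hypothetical sketch of hit-verified collation selection; not the code from the patch.
final class CollationSketch {

    /**
     * Returns a map of collation query -> hit count, in candidate order.
     * hitCounter is an assumed callback that re-runs a query (with the
     * original fq params) and returns the number of hits.
     */
    static Map<String, Long> select(List<String> candidates,
                                    ToLongFunction<String> hitCounter,
                                    int maxCollationTries,
                                    int maxCollations) {
        Map<String, Long> verified = new LinkedHashMap<>();
        int tries = 0;
        for (String candidate : candidates) {
            if (verified.size() >= maxCollations) {
                break; // enough collations collected
            }
            if (maxCollationTries <= 0) {
                // Assumed backwards-compatible mode: no re-query, hit count unknown (-1).
                verified.put(candidate, -1L);
                continue;
            }
            if (tries >= maxCollationTries) {
                break; // tried as many possibilities as allowed
            }
            tries++;
            long hits = hitCounter.applyAsLong(candidate);
            if (hits > 0) {
                verified.put(candidate, hits); // keep only collations that return hits
            }
        }
        return verified;
    }

    public static void main(String[] args) {
        // Made-up hit counts standing in for a real index.
        Map<String, Long> fakeIndex = new HashMap<>();
        fakeIndex.put("Title:(how AND fails)", 2L);
        fakeIndex.put("Title:(hope AND faith)", 2L);

        List<String> candidates = Arrays.asList(
                "Title:(how AND fails)",
                "Title:(hope AND fall)",
                "Title:(hope AND faith)");

        Map<String, Long> result =
                select(candidates, q -> fakeIndex.getOrDefault(q, 0L), 5, 2);
        result.forEach((q, hits) -> System.out.println(q + " -> " + hits + " hits"));
    }
}
```

In this sketch a lower maxCollationTries bounds the number of re-queries (and so the cost), while a higher value gives later candidates a chance to be tested when earlier ones return nothing.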
[jira] Updated: (SOLR-2010) Improvements to SpellCheckComponent Collate functionality
[ https://issues.apache.org/jira/browse/SOLR-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-2010:

    Attachment: SOLR-2010.patch

Added license headers.

Improvements to SpellCheckComponent Collate functionality

    Key: SOLR-2010
    URL: https://issues.apache.org/jira/browse/SOLR-2010
    Project: Solr
    Issue Type: New Feature
    Components: clients - java, spellchecker
    Affects Versions: 1.4.1
    Environment: Tested against trunk revision 966633
    Reporter: James Dyer
    Assignee: Grant Ingersoll
    Priority: Minor
    Attachments: SOLR-2010.patch, SOLR-2010.patch

Improvements to SpellCheckComponent Collate functionality

Our project requires a better Spell Check Collator. I'm contributing this as a patch to get suggestions for improvements and in case there is a broader need for these features.

1. Only return collations that are guaranteed to result in hits if re-queried (applying original fq params also). This is especially helpful when there is more than one correction per query. The 1.4 behavior does not verify that a particular combination will actually return hits.
2. Provide the option to get multiple collation suggestions.
3. Provide extended collation results including the # of hits re-querying will return and a breakdown of each misspelled word and its correction.

This patch is similar to what is described in SOLR-507 item #1. Also, this patch provides a viable workaround for the problem discussed in SOLR-1074. A dictionary could be created that combines the terms from the multiple fields. The collator then would prune out any spurious suggestions this would cause.

This patch adds the following spellcheck parameters:

1. spellcheck.maxCollationTries - maximum # of collation possibilities to try before giving up. Lower values ensure better performance. Higher values may be necessary to find a collation that can return results. Default is 0, which maintains backwards-compatible behavior (do not check collations).
2. spellcheck.maxCollations - maximum # of collations to return. Default is 1, which maintains backwards-compatible behavior.
3. spellcheck.collateExtendedResult - if true, returns an expanded response format detailing collations found. Default is false, which maintains backwards-compatible behavior.

When true, output is like this (in context):

    <lst name="spellcheck">
      <lst name="suggestions">
        <lst name="hopq">
          <int name="numFound">94</int>
          <int name="startOffset">7</int>
          <int name="endOffset">11</int>
          <arr name="suggestion">
            <str>hope</str>
            <str>how</str>
            <str>hope</str>
            <str>chops</str>
            <str>hoped</str>
            etc
          </arr>
        </lst>
        <lst name="faill">
          <int name="numFound">100</int>
          <int name="startOffset">16</int>
          <int name="endOffset">21</int>
          <arr name="suggestion">
            <str>fall</str>
            <str>fails</str>
            <str>fail</str>
            <str>fill</str>
            <str>faith</str>
            <str>all</str>
            etc
          </arr>
        </lst>
        <lst name="collation">
          <str name="collationQuery">Title:(how AND fails)</str>
          <int name="hits">2</int>
          <lst name="misspellingsAndCorrections">
            <str name="hopq">how</str>
            <str name="faill">fails</str>
          </lst>
        </lst>
        <lst name="collation">
          <str name="collationQuery">Title:(hope AND faith)</str>
          <int name="hits">2</int>
          <lst name="misspellingsAndCorrections">
            <str name="hopq">hope</str>
            <str name="faill">faith</str>
          </lst>
        </lst>
        <lst name="collation">
          <str name="collationQuery">Title:(chops AND all)</str>
          <int name="hits">1</int>
          <lst name="misspellingsAndCorrections">
            <str name="hopq">chops</str>
            <str name="faill">all</str>
          </lst>
        </lst>
      </lst>
    </lst>

In addition, SolrJ is updated to include SpellCheckResponse.getCollatedResults(), which will return the expanded Collation format. getCollatedResult(), which returns a
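
As a rough illustration of the extended collation format, the following Java sketch reads collation entries with the JDK's DOM parser. The angle brackets and attribute quoting in the sample are reconstructed from the output shown in the description, and the class ExtendedCollationParser and its nested Collation type are hypothetical names for this example only; they are not the SolrJ classes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical reader for the extended collation XML; not part of SolrJ.
final class ExtendedCollationParser {

    /** One collation: the corrected query, its hit count, and misspelling -> correction. */
    static final class Collation {
        final String query;
        final long hits;
        final Map<String, String> corrections;

        Collation(String query, long hits, Map<String, String> corrections) {
            this.query = query;
            this.hits = hits;
            this.corrections = corrections;
        }
    }

    static List<Collation> parse(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse spellcheck XML", e);
        }
        List<Collation> collations = new ArrayList<>();
        NodeList lists = doc.getElementsByTagName("lst");
        for (int i = 0; i < lists.getLength(); i++) {
            Element lst = (Element) lists.item(i);
            if (!"collation".equals(lst.getAttribute("name"))) {
                continue; // skip suggestion lists and other containers
            }
            String query = null;
            long hits = 0;
            Map<String, String> corrections = new LinkedHashMap<>();
            for (Node n = lst.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (!(n instanceof Element)) {
                    continue; // ignore whitespace text nodes
                }
                Element child = (Element) n;
                String name = child.getAttribute("name");
                if ("collationQuery".equals(name)) {
                    query = child.getTextContent();
                } else if ("hits".equals(name)) {
                    hits = Long.parseLong(child.getTextContent().trim());
                } else if ("misspellingsAndCorrections".equals(name)) {
                    for (Node c = child.getFirstChild(); c != null; c = c.getNextSibling()) {
                        if (c instanceof Element) {
                            corrections.put(((Element) c).getAttribute("name"), c.getTextContent());
                        }
                    }
                }
            }
            collations.add(new Collation(query, hits, corrections));
        }
        return collations;
    }

    public static void main(String[] args) {
        // Sample trimmed to one collation; single quotes keep the Java string readable.
        String xml = "<lst name='spellcheck'><lst name='suggestions'>"
                + "<lst name='collation'>"
                + "<str name='collationQuery'>Title:(how AND fails)</str>"
                + "<int name='hits'>2</int>"
                + "<lst name='misspellingsAndCorrections'>"
                + "<str name='hopq'>how</str><str name='faill'>fails</str>"
                + "</lst></lst></lst></lst>";
        for (Collation c : parse(xml)) {
            System.out.println(c.query + " (" + c.hits + " hits) " + c.corrections);
        }
    }
}
```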
[jira] Updated: (SOLR-2010) Improvements to SpellCheckComponent Collate functionality
[ https://issues.apache.org/jira/browse/SOLR-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Dyer updated SOLR-2010:

    Attachment: SOLR-2010.patch

Tested against branch version #96633

Improvements to SpellCheckComponent Collate functionality

    Key: SOLR-2010
    URL: https://issues.apache.org/jira/browse/SOLR-2010
    Project: Solr
    Issue Type: New Feature
    Components: clients - java, spellchecker
    Affects Versions: 1.4.1
    Environment: Tested against trunk revision 966633
    Reporter: James Dyer
    Priority: Minor
    Attachments: SOLR-2010.patch

Improvements to SpellCheckComponent Collate functionality

Our project requires a better Spell Check Collator. I'm contributing this as a patch to get suggestions for improvements and in case there is a broader need for these features.

1. Only return collations that are guaranteed to result in hits if re-queried (applying original fq params also). This is especially helpful when there is more than one correction per query. The 1.4 behavior does not verify that a particular combination will actually return hits.
2. Provide the option to get multiple collation suggestions.
3. Provide extended collation results including the # of hits re-querying will return and a breakdown of each misspelled word and its correction.

This patch is similar to what is described in SOLR-507 item #1. Also, this patch provides a viable workaround for the problem discussed in SOLR-1074. A dictionary could be created that combines the terms from the multiple fields. The collator then would prune out any spurious suggestions this would cause.

This patch adds the following spellcheck parameters:

1. spellcheck.maxCollationTries - maximum # of collation possibilities to try before giving up. Lower values ensure better performance. Higher values may be necessary to find a collation that can return results. Default is 0, which maintains backwards-compatible behavior (do not check collations).
2. spellcheck.maxCollations - maximum # of collations to return. Default is 1, which maintains backwards-compatible behavior.
3. spellcheck.collateExtendedResult - if true, returns an expanded response format detailing collations found. Default is false, which maintains backwards-compatible behavior.

When true, output is like this (in context):

    <lst name="spellcheck">
      <lst name="suggestions">
        <lst name="hopq">
          <int name="numFound">94</int>
          <int name="startOffset">7</int>
          <int name="endOffset">11</int>
          <arr name="suggestion">
            <str>hope</str>
            <str>how</str>
            <str>hope</str>
            <str>chops</str>
            <str>hoped</str>
            etc
          </arr>
        </lst>
        <lst name="faill">
          <int name="numFound">100</int>
          <int name="startOffset">16</int>
          <int name="endOffset">21</int>
          <arr name="suggestion">
            <str>fall</str>
            <str>fails</str>
            <str>fail</str>
            <str>fill</str>
            <str>faith</str>
            <str>all</str>
            etc
          </arr>
        </lst>
        <lst name="collation">
          <str name="collationQuery">Title:(how AND fails)</str>
          <int name="hits">2</int>
          <lst name="misspellingsAndCorrections">
            <str name="hopq">how</str>
            <str name="faill">fails</str>
          </lst>
        </lst>
        <lst name="collation">
          <str name="collationQuery">Title:(hope AND faith)</str>
          <int name="hits">2</int>
          <lst name="misspellingsAndCorrections">
            <str name="hopq">hope</str>
            <str name="faill">faith</str>
          </lst>
        </lst>
        <lst name="collation">
          <str name="collationQuery">Title:(chops AND all)</str>
          <int name="hits">1</int>
          <lst name="misspellingsAndCorrections">
            <str name="hopq">chops</str>
            <str name="faill">all</str>
          </lst>
        </lst>
      </lst>
    </lst>

In addition, SolrJ is updated to include SpellCheckResponse.getCollatedResults(), which will return the expanded Collation format. getCollatedResult(), which returns a single String, is retained for