[jira] Updated: (SOLR-303) Distributed Search over HTTP
[ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Whitman updated SOLR-303: --- Attachment: shards.start_rows.patch Attaching patch to add a shards.start and shards.rows optional parameter. If set, they override distributed search's intelligence on setting start and rows per shard. If you set shards.start=10 and shards.rows=10, each shard will be queried with start=10 and rows=10 and you'll get back N*10 results (set rows on the main query to get it all.) [Not a java developer, my patch works but may violate good taste/style] Distributed Search over HTTP Key: SOLR-303 URL: https://issues.apache.org/jira/browse/SOLR-303 Project: Solr Issue Type: New Feature Components: search Reporter: Sharad Agarwal Assignee: Yonik Seeley Fix For: 1.3 Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards.start_rows.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch Searching over multiple shards and aggregating results. Motivated by http://wiki.apache.org/solr/DistributedSearch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
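A rough sketch of how a client might assemble the parameters described in the comment above. Only `shards`, `shards.start`, `shards.rows` and `rows` come from the thread; the class name, method names and host names are hypothetical, and the values are assumed not to need URL escaping.

```java
final class ShardPagingParams {
    /** With N shards each returning shards.rows docs, up to N * shards.rows docs come back. */
    static int rowsToFetchEverything(int shardCount, int shardsRows) {
        return shardCount * shardsRows;
    }

    /** Builds the query-string fragment; values are assumed to need no URL escaping. */
    static String params(String[] shards, int shardsStart, int shardsRows) {
        return "shards=" + String.join(",", shards)
            + "&shards.start=" + shardsStart
            + "&shards.rows=" + shardsRows
            // rows on the main query is raised so the merged list is not truncated
            + "&rows=" + rowsToFetchEverything(shards.length, shardsRows);
    }
}
```

With two shards and `shards.rows=10`, the main query would need `rows=20` to return everything the shards send back.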
[jira] Updated: (SOLR-303) Distributed Search over HTTP
[ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-303:
--
Fix Version/s: 1.3

marking as intended for 1.3 ... i'm not overly familiar with the state of this issue, but i do know that large chunks of functionality have already been committed, so i want to make sure that before 1.3 is released someone consciously decides between:

* DONE ... resolving this issue
* NOT DONE BUT OK ... leaving the issue unresolved and removing the 1.3 designation
* NOT DONE AND NOT OK ... rolling back any/all committed code that is considered detrimental for the 1.3 release.

Distributed Search over HTTP Key: SOLR-303 URL: https://issues.apache.org/jira/browse/SOLR-303 Project: Solr Issue Type: New Feature Components: search Reporter: Sharad Agarwal Assignee: Yonik Seeley Fix For: 1.3 Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch Searching over multiple shards and aggregating results. Motivated by http://wiki.apache.org/solr/DistributedSearch
[jira] Updated: (SOLR-303) Distributed Search over HTTP
[ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Kotthoff updated SOLR-303:
---
Attachment: solr-dist-faceting-non-ascii-all.patch

I've had a couple of issues with the current version. First, the facet queries which are sent to the other shards are put in the URL but aren't URL-encoded, so during the refine stage anything non-ASCII results in facet counts for new values (i.e. the garbled version) coming back, which causes NPEs when trying to update the counts. Furthermore, a negative facet.limit isn't working as expected: instead of all facets it returns none. Also, facet.sort is not automatically enabled for negative values.

I've attached solr-dist-faceting-non-ascii-all.patch, which fixes the above issues. Somebody who understands what everything is supposed to do should have a look over it though :) For example, I've found two linked hash maps in FacetInfo, topFacets and listFacets, which seem to serve the same purpose, so I replaced them with a single hash map. It seems to work just fine this way.

Distributed Search over HTTP Key: SOLR-303 URL: https://issues.apache.org/jira/browse/SOLR-303 Project: Solr Issue Type: New Feature Components: search Reporter: Sharad Agarwal Assignee: Yonik Seeley Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch, shards_qt.patch, solr-dist-faceting-non-ascii-all.patch Searching over multiple shards and aggregating results. Motivated by http://wiki.apache.org/solr/DistributedSearch
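The encoding problem reported above can be illustrated with a small sketch. This is not the code from the attached patch: the class and method names are hypothetical, and only the JDK's `java.net.URLEncoder` is used. It shows a parameter value being percent-encoded as UTF-8 before it is appended to a shard request URL.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class ShardParamEncoding {
    /** Returns "name=value" with both parts encoded so non-ASCII text survives the URL. */
    static String encodeParam(String name, String value) {
        try {
            return URLEncoder.encode(name, "UTF-8") + "=" + URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every JVM is required to support
            throw new IllegalStateException(e);
        }
    }
}
```

Without this step, a facet value containing a non-ASCII character can be decoded differently by the receiving shard, which would match the "garbled version" symptom described in the comment.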
[jira] Updated: (SOLR-303) Distributed Search over HTTP
[ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jayson Minard updated SOLR-303: --- Attachment: distributed_facet_count_bugfix.patch Attached patch to fix issue with distributed search. If you specified a facet.field that was valid for the schema but not contained in a shard, an unintentional exception (array index out of bounds) would be thrown instead of returning the facet as empty. Distributed Search over HTTP Key: SOLR-303 URL: https://issues.apache.org/jira/browse/SOLR-303 Project: Solr Issue Type: New Feature Components: search Reporter: Sharad Agarwal Assignee: Yonik Seeley Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch Searching over multiple shards and aggregating results. Motivated by http://wiki.apache.org/solr/DistributedSearch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-303) Distributed Search over HTTP
[ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jayson Minard updated SOLR-303: --- Attachment: distributed_add_tests_for_intended_behavior.patch A few more tests to show intended behavior when facets differ between shards which is likely in the wild (missing from all but valid in schema, missing from some, and invalid field not in schema). The last test is just to ensure error behavior matches non-distributed searches. Distributed Search over HTTP Key: SOLR-303 URL: https://issues.apache.org/jira/browse/SOLR-303 Project: Solr Issue Type: New Feature Components: search Reporter: Sharad Agarwal Assignee: Yonik Seeley Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_add_tests_for_intended_behavior.patch, distributed_facet_count_bugfix.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch Searching over multiple shards and aggregating results. Motivated by http://wiki.apache.org/solr/DistributedSearch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-303) Distributed Search over HTTP
[ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-303:
--
Attachment: distributed.patch

New patch:
- test framework using multiple embedded jetty servers that adds documents to multiple servers, and also to a control server, then executes both distributed and non-distributed queries and compares the results.
- fixed merging for non-string uniqueKeyFields
- fixed issue when id field was not selected by client
- break facet count ties by label
- added rudimentary duplicate detection in case one accidentally adds the same doc to different shards
- add code to handle index changes between query phases (docs may no longer exist)

Given that most of this is new functionality, I think things are in good enough shape to commit now (making it much easier for others to generate patches against it).

Distributed Search over HTTP Key: SOLR-303 URL: https://issues.apache.org/jira/browse/SOLR-303 Project: Solr Issue Type: New Feature Components: search Reporter: Sharad Agarwal Assignee: Yonik Seeley Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch Searching over multiple shards and aggregating results. Motivated by http://wiki.apache.org/solr/DistributedSearch
[jira] Updated: (SOLR-303) Distributed Search over HTTP
[ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-303:
--
Attachment: distributed.patch

Updated patch:
- refactored some distributed search code to make things easier (added modifyRequest, etc)
- added merging of debugging info (including timing info, via generic recursive merging)
- merge explain info, drops internal id from explain key for easier merging
- Many small changes: don't return scores if they aren't requested (even if needed for shard requests to merge), return maxScore if scores are requested, enable escaping for shards parameter.

Distributed Search over HTTP Key: SOLR-303 URL: https://issues.apache.org/jira/browse/SOLR-303 Project: Solr Issue Type: New Feature Components: search Reporter: Sharad Agarwal Assignee: Yonik Seeley Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch Searching over multiple shards and aggregating results. Motivated by http://wiki.apache.org/solr/DistributedSearch
[jira] Updated: (SOLR-303) Distributed Search over HTTP
[ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley updated SOLR-303: -- Attachment: distributed.patch New patch attached... last one had an unfinished change that prevented compilation (using the generic SolrResponse instead of SolrQueryResponse). Distributed Search over HTTP Key: SOLR-303 URL: https://issues.apache.org/jira/browse/SOLR-303 Project: Solr Issue Type: New Feature Components: search Reporter: Sharad Agarwal Assignee: Yonik Seeley Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch Searching over multiple shards and aggregating results. Motivated by http://wiki.apache.org/solr/DistributedSearch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-303) Distributed Search over HTTP
[ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-303:
--
Attachment: distributed.patch

Updated patch:
- facet refinement requests piggyback on the requests to retrieve stored fields where possible.
- fixed bug when requesting scores... don't include scores even if requested if they are not in the given DocList
- fixed HTTP error codes for query parse errors
- added double/long support in sorting since we've upgraded to lucene 2.3, and changed aggregate numFound to handle long
- escape/unescape comma separated ids string using backslash escaping (used to specify docs from each shard to retrieve)
- other misc cleanups

Distributed Search over HTTP Key: SOLR-303 URL: https://issues.apache.org/jira/browse/SOLR-303 Project: Solr Issue Type: New Feature Components: search Reporter: Sharad Agarwal Assignee: Yonik Seeley Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch Searching over multiple shards and aggregating results. Motivated by http://wiki.apache.org/solr/DistributedSearch
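The backslash escaping of the comma-separated id string mentioned in the update above could look roughly like this. It is a minimal sketch, not the code from the patch: `IdListCodec`, `join` and `split` are hypothetical names, and ids are assumed to be non-empty strings.

```java
import java.util.ArrayList;
import java.util.List;

final class IdListCodec {
    /** Joins ids with ',' and prefixes any ',' or '\' inside an id with a backslash. */
    static String join(List<String> ids) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < ids.size(); i++) {
            if (i > 0) sb.append(',');
            for (char c : ids.get(i).toCharArray()) {
                if (c == ',' || c == '\\') sb.append('\\');
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Reverses join(): splits on unescaped commas and drops the escape characters. */
    static List<String> split(String joined) {
        List<String> ids = new ArrayList<>();
        if (joined.isEmpty()) return ids; // assumption: ids themselves are never empty
        StringBuilder current = new StringBuilder();
        boolean escaped = false;
        for (char c : joined.toCharArray()) {
            if (escaped) {
                current.append(c);      // take the escaped character literally
                escaped = false;
            } else if (c == '\\') {
                escaped = true;         // next character is literal
            } else if (c == ',') {
                ids.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        ids.add(current.toString());
        return ids;
    }
}
```

Escaping matters here because a unique key may itself contain a comma, which would otherwise be read as a separator between two ids.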
[jira] Updated: (SOLR-303) Distributed Search over HTTP
[ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-303:
--
Attachment: distributed.patch

This update adds parallel requests.
- a singleton communications thread pool (executor) is added... currently static, but it should be *per core* and have a way of shutting down.
- a singleton HttpClient for use by all SolrServer instances, currently static, probably fine to remain so (unless there needs to be core specific config?)
- an exception causes everything to be aborted
- all requests in a phase are sent out in parallel
- a completion service is used for grabbing completed requests, so the first requests back can start being processed.
- while receiving responses, if any new requests are put on the outgoing queue, they are immediately sent out before waiting for any further responses.

Distributed Search over HTTP Key: SOLR-303 URL: https://issues.apache.org/jira/browse/SOLR-303 Project: Solr Issue Type: New Feature Components: search Reporter: Sharad Agarwal Assignee: Yonik Seeley Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch Searching over multiple shards and aggregating results. Motivated by http://wiki.apache.org/solr/DistributedSearch
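The parallel request handling described above maps naturally onto the JDK's `ExecutorCompletionService`. The following is only a sketch under that assumption: `ParallelShardCalls` and `queryAll` are hypothetical names, shard responses are modelled as plain strings, and the patch's actual classes are not reproduced. It covers sending all requests of a phase in parallel, processing responses in completion order, and aborting on the first exception; sending newly queued requests mid-phase is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

final class ParallelShardCalls {
    /** Submits every shard call at once and returns responses in completion order. */
    static List<String> queryAll(List<Callable<String>> shardCalls, ExecutorService pool) {
        CompletionService<String> completion = new ExecutorCompletionService<>(pool);
        List<Future<String>> pending = new ArrayList<>();
        for (Callable<String> call : shardCalls) {
            pending.add(completion.submit(call));
        }
        List<String> responses = new ArrayList<>();
        try {
            for (int i = 0; i < shardCalls.size(); i++) {
                // take() blocks until the next request finishes, whichever shard it is
                responses.add(completion.take().get());
            }
        } catch (InterruptedException | ExecutionException e) {
            // one failure aborts the whole phase: cancel whatever is still running
            for (Future<String> f : pending) f.cancel(true);
            if (e instanceof InterruptedException) Thread.currentThread().interrupt();
            throw new IllegalStateException("shard request failed, aborting", e);
        }
        return responses;
    }
}
```

The pool is passed in rather than held in a static field, which is one way to address the "should be per core and have a way of shutting down" concern raised in the comment.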
[jira] Updated: (SOLR-303) Distributed Search over HTTP
[ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

patrick o'leary updated SOLR-303:
-
Attachment: distributed_pjaol.patch

Hey Yonik,

I needed to make a couple of updates to ShardDoc, as the nested outer classes were preventing me from using the patch. This patch also includes SOLR-457, with a multi-threaded implementation of solrj to query the shards.

P

Distributed Search over HTTP Key: SOLR-303 URL: https://issues.apache.org/jira/browse/SOLR-303 Project: Solr Issue Type: New Feature Components: search Reporter: Sharad Agarwal Assignee: Yonik Seeley Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed_pjaol.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch Searching over multiple shards and aggregating results. Motivated by http://wiki.apache.org/solr/DistributedSearch
[jira] Updated: (SOLR-303) Distributed Search over HTTP
[ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-303:
--
Attachment: distributed.patch

New patch attached... this one implements count tiebreaking by index order (to match the non-distributed faceting).

Distributed Search over HTTP Key: SOLR-303 URL: https://issues.apache.org/jira/browse/SOLR-303 Project: Solr Issue Type: New Feature Components: search Reporter: Sharad Agarwal Assignee: Yonik Seeley Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch Searching over multiple shards and aggregating results. Motivated by http://wiki.apache.org/solr/DistributedSearch
[jira] Updated: (SOLR-303) Distributed Search over HTTP
[ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-303:
--
Attachment: distributed.patch

OK, this version patches cleanly and includes some distributed faceting code.
- facet.query and facet.field sorted by count are mostly handled
- breaking ties by natural (index) sort order is not yet implemented
- date faceting and unsorted (index order) facet.field are not implemented

Assuming the user asks for the top 10 terms of a field:
1) The first facet queries piggyback on the queries to get the top ids and sort field values.
2) Counts are merged, and new refinement requests are sent out for those terms in the top 10 where a count was not received from some shards. Also, for terms below the top 10, we calculate the maximum count a term could have based on the shards we have not heard from, and if that boosts it into the top 10, we include that term for refinement.
3) Refinement responses are used to adjust the counts, and we are done.

Note that it is theoretically possible to miss terms. A term could be just below the threshold of each shard (and thus not returned by any shard), but the total count could boost it into the top. This could be rectified by retrieving *all* terms above a specified count, but it could be expensive. The counts that are currently returned are exact.

Distributed Search over HTTP Key: SOLR-303 URL: https://issues.apache.org/jira/browse/SOLR-303 Project: Solr Issue Type: New Feature Components: search Reporter: Sharad Agarwal Assignee: Yonik Seeley Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch Searching over multiple shards and aggregating results. Motivated by http://wiki.apache.org/solr/DistributedSearch
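Step 2 of the procedure above (deciding which terms need a refinement request) can be sketched as follows. This is an illustration under simplifying assumptions, not the patch's implementation: `FacetRefinementSketch` and `termsToRefine` are hypothetical names, each shard's first-phase response is modelled as a map from term to count, and `limit` is assumed to be at least 1.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class FacetRefinementSketch {
    /** Returns the terms for which at least one shard must be asked for an exact count. */
    static Set<String> termsToRefine(List<Map<String, Integer>> shardCounts, int limit) {
        int[] maxMissing = new int[shardCounts.size()];
        Map<String, Integer> known = new HashMap<>();
        for (int s = 0; s < shardCounts.size(); s++) {
            Map<String, Integer> counts = shardCounts.get(s);
            // A shard that returned fewer terms than requested has nothing left to add;
            // otherwise an unreturned term can have at most the smallest returned count.
            maxMissing[s] = counts.size() < limit ? 0 : Collections.min(counts.values());
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                known.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        // Merged count of the term currently holding the last top slot.
        List<Integer> sorted = new ArrayList<>(known.values());
        sorted.sort(Collections.reverseOrder());
        int threshold = sorted.size() < limit ? 0 : sorted.get(limit - 1);

        Set<String> refine = new TreeSet<>();
        for (Map.Entry<String, Integer> e : known.entrySet()) {
            int upperBound = e.getValue();
            boolean missingSomewhere = false;
            for (int s = 0; s < shardCounts.size(); s++) {
                if (!shardCounts.get(s).containsKey(e.getKey())) {
                    missingSomewhere = true;
                    upperBound += maxMissing[s];
                }
            }
            // Refine when a count is missing and the term could still reach the top.
            if (missingSomewhere && upperBound >= threshold) refine.add(e.getKey());
        }
        return refine;
    }
}
```

As the comment notes, a term that no shard returned at all never appears in `known`, so this scheme can still miss it.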
[jira] Updated: (SOLR-303) Distributed Search over HTTP
[ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-303:
--
Attachment: distributed.patch

New patch attached... I just discovered that refinement queries weren't working because filter.query doesn't accept the new query syntax I was using to avoid having to escape field values: !field f=myfieldvalue (this should probably be committed separately, but it's in this patch for now).

I put in code to over-request the facet.field limit, but then commented it out for now, since it too easily covers up bugs: it often prevents any refinement query logic from being exercised.

Also corrected the code that always used the last element as the max possible missing count. If we requested 10 terms and only got 6, then we know that the max possible missing count is zero.

Distributed Search over HTTP Key: SOLR-303 URL: https://issues.apache.org/jira/browse/SOLR-303 Project: Solr Issue Type: New Feature Components: search Reporter: Sharad Agarwal Assignee: Yonik Seeley Attachments: distributed.patch, distributed.patch, distributed.patch, distributed.patch, distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch Searching over multiple shards and aggregating results. Motivated by http://wiki.apache.org/solr/DistributedSearch
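The corrected "max possible missing count" rule from the last paragraph above can be written as a small function. This is a sketch with hypothetical names, assuming the shard's counts arrive sorted in descending order.

```java
final class MissingCountBound {
    /**
     * Upper bound on the count of any term a shard did NOT return.
     * returnedCountsDescending holds the counts the shard sent back, largest first.
     */
    static int maxPossibleMissingCount(int requestedLimit, int[] returnedCountsDescending) {
        int returned = returnedCountsDescending.length;
        // Fewer terms than requested: the shard has no further terms, so nothing is missing.
        if (returned < requestedLimit || returned == 0) return 0;
        // Otherwise an unreturned term can be at most as frequent as the last returned one.
        return returnedCountsDescending[returned - 1];
    }
}
```

For the example in the comment, requesting 10 terms and receiving 6 yields a bound of zero, whatever the count of the sixth term was.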
[jira] Updated: (SOLR-303) Distributed Search over HTTP
[ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

patrick o'leary updated SOLR-303:
-
Attachment: distributed_trunk.patch

This might help: I merged the distributed federated patches with trunk last night and fixed the rejects. It appears to work. The only things not included are the distributed searcher unit tests from the previous patch. Only the deltas were in the patch, so I had no way to rebuild them.

Hope this helps

P

Distributed Search over HTTP Key: SOLR-303 URL: https://issues.apache.org/jira/browse/SOLR-303 Project: Solr Issue Type: New Feature Components: search Reporter: Sharad Agarwal Assignee: Yonik Seeley Attachments: distributed.patch, distributed.patch, distributed.patch, distributed_trunk.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch Searching over multiple shards and aggregating results. Motivated by http://wiki.apache.org/solr/DistributedSearch
[jira] Updated: (SOLR-303) Distributed Search over HTTP
[ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Timm updated SOLR-303: --- Comment: was deleted Distributed Search over HTTP Key: SOLR-303 URL: https://issues.apache.org/jira/browse/SOLR-303 Project: Solr Issue Type: New Feature Components: search Reporter: Sharad Agarwal Assignee: Yonik Seeley Attachments: distributed.patch, distributed.patch, distributed.patch, distributed_trunk.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch Searching over multiple shards and aggregating results. Motivated by http://wiki.apache.org/solr/DistributedSearch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: [jira] Updated: (SOLR-303) Distributed Search over HTTP
User error. I thought I had a clean sandbox, but I didn't. So, the only issues I have with the patch are the 2 *Test* files previously reported, and the o.a.s.handler.SearchHandler patching file src/java/org/apache/solr/handler/SearchHandler.java Reversed (or previously applied) patch detected! Assume -R? [n] y -Sean Sean Timm (JIRA) wrote: [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Timm updated SOLR-303: --- Comment: was deleted Distributed Search over HTTP Key: SOLR-303 URL: https://issues.apache.org/jira/browse/SOLR-303 Project: Solr Issue Type: New Feature Components: search Reporter: Sharad Agarwal Assignee: Yonik Seeley Attachments: distributed.patch, distributed.patch, distributed.patch, distributed_trunk.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch Searching over multiple shards and aggregating results. Motivated by http://wiki.apache.org/solr/DistributedSearch
[jira] Updated: (SOLR-303) Distributed Search over HTTP
[ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] patrick o'leary updated SOLR-303: - Attachment: (was: distributed_trunk.patch) Distributed Search over HTTP Key: SOLR-303 URL: https://issues.apache.org/jira/browse/SOLR-303 Project: Solr Issue Type: New Feature Components: search Reporter: Sharad Agarwal Assignee: Yonik Seeley Attachments: distributed.patch, distributed.patch, distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch Searching over multiple shards and aggregating results. Motivated by http://wiki.apache.org/solr/DistributedSearch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-303) Distributed Search over HTTP
[ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-303:
--
Attachment: distributed.patch

Small update, mostly to sorting
- This changes sorting to get values from the Sort comparators (thus supporting custom sorts)
- uses external values that can be supported by XML, also nicer for debugging
- returns sort field values in an array per-field {price=[10,20,30,40,50]}
- merging should be faster... lookup of sort values is by index number instead of searching for the field name.
- merging short-circuits comparisons for docs in the same shard
- sorting null values now works and respects sortMissingFirst/Last, etc
- if a shard request, don't pre-fetch docs for highlighter

Distributed Search over HTTP Key: SOLR-303 URL: https://issues.apache.org/jira/browse/SOLR-303 Project: Solr Issue Type: New Feature Components: search Reporter: Sharad Agarwal Assignee: Yonik Seeley Attachments: distributed.patch, distributed.patch, distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch Searching over multiple shards and aggregating results. Motivated by http://wiki.apache.org/solr/DistributedSearch
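Two of the merging points above (looking up sort values by index number, and short-circuiting comparisons for docs from the same shard) can be sketched like this. The names are hypothetical and the model is deliberately reduced to a single ascending `long` sort field; the patch's ShardDoc handling is not reproduced.

```java
import java.util.Comparator;

final class ShardDocMergeSketch {
    /** A document reference: which shard it came from and its position in that shard's results. */
    static final class Doc {
        final int shard;
        final int position;

        Doc(int shard, int position) {
            this.shard = shard;
            this.position = position;
        }
    }

    /** sortValuesByShard[s][i] is the sort value of the i-th doc returned by shard s. */
    static Comparator<Doc> mergeOrder(long[][] sortValuesByShard) {
        return (a, b) -> {
            // Same shard: the shard already sorted these, so position alone decides.
            if (a.shard == b.shard) return Integer.compare(a.position, b.position);
            // Different shards: look the values up by index, no field-name search needed.
            int c = Long.compare(sortValuesByShard[a.shard][a.position],
                                 sortValuesByShard[b.shard][b.position]);
            // Deterministic tie-break across shards (an assumption of this sketch).
            return c != 0 ? c : Integer.compare(a.shard, b.shard);
        };
    }
}
```

The same-shard shortcut relies on each shard returning its docs already in sort order, which is what makes the position a valid stand-in for the full comparison.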
[jira] Updated: (SOLR-303) Distributed Search over HTTP
[ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-303:
--
Attachment: distributed.patch

OK, here is a *draft* that mostly works for searches and highlighting.

There are stages in the request:
{code}
public static int STAGE_START = 0;
public static int STAGE_PARSE_QUERY = 1000;
public static int STAGE_EXECUTE_QUERY = 2000;
public static int STAGE_GET_FIELDS = 3000;
public static int STAGE_DONE = Integer.MAX_VALUE;
{code}

When a component wants to send a request, it adds it to the outgoing queue. Other components can inspect and modify these shard requests. All components get a callback when the shard response is received. All shard requests have purposes (to aid in both correlation and inspection/modification by other components).

This is what a ShardRequest looks like:
{code}
public class ShardRequest {
  public final static String[] ALL_SHARDS = null;

  public final static int PURPOSE_PRIVATE = 0x01;
  public final static int PURPOSE_GET_TERM_DFS = 0x02;
  public final static int PURPOSE_GET_TOP_IDS = 0x04;
  public final static int PURPOSE_REFINE_TOP_IDS = 0x08;
  public final static int PURPOSE_GET_FACETS = 0x10;
  public final static int PURPOSE_REFINE_FACETS = 0x20;
  public final static int PURPOSE_GET_FIELDS = 0x40;
  public final static int PURPOSE_GET_HIGHLIGHTS = 0x80;

  public int purpose;      // the purpose of this request
  public String[] shards;  // the shards this request should be sent to
  // TODO: how to request a specific shard address?

  public ModifiableSolrParams params;
  public List<ShardResponse> responses = new ArrayList<ShardResponse>();
}
{code}

Components are responsible for themselves... the highlighting component is responsible for turning itself on/off at the appropriate time... the query component has no knowledge of the highlight component. This will make it so that custom components can be developed that can work in a distributed environment w/o explicit support for that component baked into the other components.

Distributed Search over HTTP Key: SOLR-303 URL: https://issues.apache.org/jira/browse/SOLR-303 Project: Solr Issue Type: New Feature Components: search Reporter: Sharad Agarwal Assignee: Yonik Seeley Attachments: distributed.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch Searching over multiple shards and aggregating results. Motivated by http://wiki.apache.org/solr/DistributedSearch
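Because the PURPOSE_* constants above are distinct bits, one shard request can presumably carry several purposes at once, and each component can test for the bit it cares about. A minimal sketch of that idea follows; the constant values are copied from the comment, while the class and the `hasPurpose` helper are hypothetical.

```java
final class PurposeFlagsSketch {
    // Values as listed in the ShardRequest draft quoted in the comment.
    static final int PURPOSE_GET_TOP_IDS = 0x04;
    static final int PURPOSE_GET_FACETS = 0x10;
    static final int PURPOSE_GET_FIELDS = 0x40;

    /** True if the request's purpose bit set includes the given flag. */
    static boolean hasPurpose(int purpose, int flag) {
        return (purpose & flag) != 0;
    }

    /** A component piggybacks on an existing request by OR-ing in its own flag. */
    static int addPurpose(int purpose, int flag) {
        return purpose | flag;
    }
}
```

For example, a facet component could add `PURPOSE_GET_FACETS` to a request the query component created with `PURPOSE_GET_TOP_IDS`, and later recognize the matching response by checking the same bit.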
[jira] Updated: (SOLR-303) Distributed Search over HTTP
[ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sabyasachi Dalal updated SOLR-303: -- Comment: was deleted Distributed Search over HTTP Key: SOLR-303 URL: https://issues.apache.org/jira/browse/SOLR-303 Project: Solr Issue Type: New Feature Components: search Reporter: Sharad Agarwal Assignee: Yonik Seeley Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch Searching over multiple shards and aggregating results. Motivated by http://wiki.apache.org/solr/DistributedSearch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-303) Distributed Search over HTTP
[ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sabyasachi Dalal updated SOLR-303:
--
Attachment: fedsearch.patch

I have fixed and updated the patch with trunk version 600419. It is integrated with the re-opened SOLR-281 patch. I have added the configuration for the three distributed-search components in solrconfig.xml, under the /search request handler. So, distributed search works with /search requests only.

A couple of issues:
1. The dist search components need a reference to the SearchHandler. So for now, I have hard-coded the /search pattern in the FedSearchComponent.
2. We need a clean way to load common init params for the dist search components, such as timeout, thread pool size and search handler pattern.

Distributed Search over HTTP Key: SOLR-303 URL: https://issues.apache.org/jira/browse/SOLR-303 Project: Solr Issue Type: New Feature Components: search Reporter: Sharad Agarwal Assignee: Yonik Seeley Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch Searching over multiple shards and aggregating results. Motivated by http://wiki.apache.org/solr/DistributedSearch
[jira] Updated: (SOLR-303) Distributed Search over HTTP
[ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sabyasachi Dalal updated SOLR-303: -- Attachment: (was: fedsearch.patch) Distributed Search over HTTP Key: SOLR-303 URL: https://issues.apache.org/jira/browse/SOLR-303 Project: Solr Issue Type: New Feature Components: search Reporter: Sharad Agarwal Assignee: Yonik Seeley Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch Searching over multiple shards and aggregating results. Motivated by http://wiki.apache.org/solr/DistributedSearch
[jira] Updated: (SOLR-303) Distributed Search over HTTP
[ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sabyasachi Dalal updated SOLR-303: -- Attachment: fedsearch.patch I made a mistake and uploaded the wrong patch file. Now uploading the correct file. I have fixed and updated the patch against trunk revision 600419. It is integrated with the re-opened SOLR-281 patch. I have added the configuration for the three distributed-search components to solrconfig.xml, under the /search request handler, so distributed search works with /search requests only. A couple of issues: 1. The distributed search components need a reference to the SearchHandler, so for now I have hard-coded the /search pattern in the FedSearchComponent. 2. We need a clean way to load common init params for the distributed search components, such as timeout, thread pool size and search handler pattern. Distributed Search over HTTP Key: SOLR-303 URL: https://issues.apache.org/jira/browse/SOLR-303 Project: Solr Issue Type: New Feature Components: search Reporter: Sharad Agarwal Assignee: Yonik Seeley Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch Searching over multiple shards and aggregating results. Motivated by http://wiki.apache.org/solr/DistributedSearch
[jira] Updated: (SOLR-303) Distributed Search over HTTP
[ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sabyasachi Dalal updated SOLR-303: -- Attachment: fedsearch.patch Removed the commented-out line from SolrCore.loadSearchComponents and a couple of debug statements. Distributed Search over HTTP Key: SOLR-303 URL: https://issues.apache.org/jira/browse/SOLR-303 Project: Solr Issue Type: New Feature Components: search Reporter: Sharad Agarwal Assignee: Yonik Seeley Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch Searching over multiple shards and aggregating results. Motivated by http://wiki.apache.org/solr/DistributedSearch
[jira] Updated: (SOLR-303) Distributed Search over HTTP
[ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley updated SOLR-303: -- Description: Searching over multiple shards and aggregating results. Motivated by http://wiki.apache.org/solr/DistributedSearch

was:

Motivated by http://wiki.apache.org/solr/FederatedSearch

The requirement for index view consistency between multiple requests is relaxed in this implementation. This covers the query side of federated search; the update side is not yet done.

Tries to achieve:
- Client applications are totally agnostic to federated search. The federated search and the merging of results happen entirely behind the scenes, in the Solr request handler. The response format remains the same after merging of results. The response from each individual shard is deserialized into a SolrQueryResponse object. The collection of SolrQueryResponse objects is merged to produce a single SolrQueryResponse object. This allows the response writers to be used as they are, or with minimal change.
- Efficient query processing, with highlighting and fields generated only for merged documents. The query is executed in 2 phases. The first phase gets the doc unique keys together with the sort criteria. The second phase brings in all requested fields and highlighting information. This saves a lot of CPU when there is a good number of shards and highlighting info is requested. It should be easy to customize the query execution. For example, the user can specify that the query be executed in just 1 phase. (For some queries, when highlighting info is not required and the number of fields requested is small, this can be more efficient.)
- Ability to easily override the default federated capability through appropriate plugins and request parameters. As federated search is performed by the RequestHandler itself, multiple request handlers can easily be pre-configured with different federated search settings in solrconfig.xml.
- Global weight calculation is done by querying the terms' doc frequencies from all shards.
- Federated search works over HTTP transport, so an individual shard's VIP can be queried. Load balancing and failover are taken care of by the VIP as usual.
- Sub-searcher response parsing is a plugin interface. Different implementations could be written based on JSON, XML SAX, etc. The current one is based on XML DOM.

HOW:
---
A new RequestHandler called MultiSearchRequestHandler does the federated search on multiple sub-searchers (referred to as shards from here on). It extends RequestHandlerBase. The handleRequestBody method in RequestHandlerBase has been divided into query building and execute methods. This has been done to calculate global numDocs and docFreqs, and to execute the query efficiently on multiple shards. All search request handlers are expected to extend the MultiSearchRequestHandler class in order to enable federated capability for the handler. StandardRequestHandler and DisMaxRequestHandler have been changed to extend this class.

The federated search kicks in if shards is present in the request parameters. Otherwise the search is performed as usual on the local index. E.g. shards=local,host1:port1,host2:port2 will search the local index and 2 remote indexes. The search responses from all 3 shards are merged and served back to the client.

The search request processing on the set of shards is performed as follows:

STEP 1: The query is built and terms are extracted. Global numDocs and docFreqs are calculated by querying all the shards and adding up the numDocs and docFreqs from each shard.
STEP 2: (FirstQueryPhase) All shards are queried. Global numDocs and docFreqs are passed as request parameters. Not all document fields are requested: only the document uniqFields and sort fields are. MoreLikeThis and highlighting information are NOT requested.
STEP 3: Responses from FirstQueryPhase are merged based on the sort, start and rows params. Merged doc uniqFields and sort fields are collected. Other information, like facet and debug, is also merged.
STEP 4: (SecondQueryPhase) Merged doc uniqFields and sort fields are grouped by shard. All shards in the grouping are queried for the merged doc uniqFields (from FirstQueryPhase), highlighting and moreLikeThis info.
STEP 5: Responses from all shards from SecondQueryPhase are merged.
STEP 6: Document fields, highlighting and moreLikeThis info from SecondQueryPhase are merged into the FirstQueryPhase response.

TODO:
- Support sort fields other than the default score
- Support ResponseDocs in writers other than XMLWriter
- HTTP connection timeouts

OPEN ISSUES:
- Merging of facets by top n terms of field f

Scope for performance optimization:
- Search shards in parallel threads
- HTTP connection Keep-Alive?
- Cache global numDocs and docFreqs
- Cache Query objects in handlers?

Would appreciate feedback on my approach. I understand that there may be a lot of things I have overlooked.
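
To make the merge logic in STEPs 1, 3 and 4 above a little more concrete, here is a minimal Java sketch. It is not taken from any of the attached patches: the class TwoPhaseMergeSketch, the ShardDoc type and all method names are hypothetical, shards are identified by plain strings, and sorting is assumed to be by descending score only (the TODO list notes that other sort fields were not yet supported). Ties are broken by unique key purely to keep the output deterministic, which is also an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from the patch.
class TwoPhaseMergeSketch {

    /** One first-phase hit: just the unique key, the score, and the shard it came from. */
    static final class ShardDoc {
        final String shard;
        final String id;
        final float score;

        ShardDoc(String shard, String id, float score) {
            this.shard = shard;
            this.id = id;
            this.score = score;
        }
    }

    /** STEP 1: add up per-shard document frequencies to get global ones. */
    static Map<String, Long> sumDocFreqs(List<Map<String, Long>> perShard) {
        Map<String, Long> global = new LinkedHashMap<>();
        for (Map<String, Long> shardFreqs : perShard) {
            for (Map.Entry<String, Long> e : shardFreqs.entrySet()) {
                global.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return global;
    }

    /** STEP 3: merge first-phase hits by descending score, then apply start and rows. */
    static List<ShardDoc> mergeTopDocs(List<List<ShardDoc>> perShard, int start, int rows) {
        List<ShardDoc> all = new ArrayList<>();
        for (List<ShardDoc> hits : perShard) {
            all.addAll(hits);
        }
        all.sort((a, b) -> {
            int byScore = Float.compare(b.score, a.score); // higher score first
            return byScore != 0 ? byScore : a.id.compareTo(b.id); // assumed tie-break
        });
        int from = Math.min(start, all.size());
        int to = Math.min(from + rows, all.size());
        return new ArrayList<>(all.subList(from, to));
    }

    /** STEP 4: group the merged unique keys by owning shard for the second phase. */
    static Map<String, List<String>> groupIdsByShard(List<ShardDoc> merged) {
        Map<String, List<String>> byShard = new LinkedHashMap<>();
        for (ShardDoc d : merged) {
            byShard.computeIfAbsent(d.shard, k -> new ArrayList<>()).add(d.id);
        }
        return byShard;
    }

    public static void main(String[] args) {
        List<List<ShardDoc>> firstPhase = Arrays.asList(
            Arrays.asList(new ShardDoc("local", "a", 0.9f), new ShardDoc("local", "b", 0.4f)),
            Collections.singletonList(new ShardDoc("host1:port1", "c", 0.7f)));
        List<ShardDoc> merged = mergeTopDocs(firstPhase, 0, 2);
        // Only the shards that own a merged document need a second-phase request.
        System.out.println(groupIdsByShard(merged));
    }
}
```

Note that for the merged window to be correct, each shard would have to return at least its top start + rows hits in the first phase, since in the worst case all of the requested documents live on a single shard.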