[ 
https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543553
 ] 

Sabyasachi Dalal commented on SOLR-303:
---------------------------------------

I mean I removed the files pertaining to 281. If you follow the development 
above, the files pertaining to 281 were added to this patch to make it easier 
to apply.

> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, 
> fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Motivated by http://wiki.apache.org/solr/FederatedSearch
> The "Index view consistency between multiple requests" requirement is 
> relaxed in this implementation.
> This patch covers the query side of federated search. The update side is not 
> yet done.
> It tries to achieve:
> --------------------
> - The client applications are totally agnostic to federated search. The 
> federated search and the merging of results happen entirely behind the 
> scenes, inside the Solr request handler. The response format remains the 
> same after merging of results.
> The response from each individual shard is deserialized into a 
> SolrQueryResponse object. The collection of SolrQueryResponse objects is 
> merged to produce a single SolrQueryResponse object. This allows the 
> response writers to be used as they are, or with minimal change.
> - Efficient query processing, with highlighting and fields generated only 
> for merged documents. The query is executed in 2 phases. The first phase 
> gets the doc unique keys with the sort criteria. The second phase brings all 
> requested fields and highlighting information. This saves a lot of CPU when 
> there are many shards and highlighting info is requested.
> It should be easy to customize the query execution. For example, the user 
> can specify that the query be executed in just 1 phase. (For some queries, 
> when highlighting info is not required and the number of fields requested is 
> small, this can be more efficient.)
> - Ability to easily override the default federated capability with 
> appropriate plugins and request parameters. As federated search is performed 
> by the RequestHandler itself, multiple request handlers can easily be 
> pre-configured with different federated search settings in solrconfig.xml.
> - Global weight calculation is done by querying the terms' doc frequencies 
> from all shards.
> - Federated search works over HTTP transport, so each individual shard's VIP 
> can be queried. Load balancing and failover are taken care of by the VIP as 
> usual.
> - Sub-searcher response parsing is a plugin interface. Different 
> implementations could be written based on JSON, XML SAX, etc. The current 
> one is based on XML DOM.
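
A minimal sketch of the global weight calculation mentioned in the list above, in Python rather than the patch's Java. It is not the patch's code: the per-shard stats layout and the names `merge_shard_stats` and `idf` are assumptions made for illustration. The idf formula shown is the classic Lucene one (1 + ln(numDocs / (docFreq + 1))), used here only to show why the summed values matter.

```python
import math
from collections import Counter


def merge_shard_stats(shard_stats):
    """Sum numDocs and per-term docFreqs over all shards.

    shard_stats is a hypothetical list of dicts, one per shard, e.g.
    {"numDocs": 100, "docFreqs": {"solr": 10}}.
    """
    total_docs = 0
    doc_freqs = Counter()
    for stats in shard_stats:
        total_docs += stats["numDocs"]
        # Counter.update adds counts for keys that are already present.
        doc_freqs.update(stats["docFreqs"])
    return total_docs, dict(doc_freqs)


def idf(doc_freq, num_docs):
    """Classic Lucene-style idf, computed from the global statistics."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))
```

With the summed values, every shard can score a term against the same global numDocs and docFreq instead of its own local ones, which is what keeps scores comparable when results are merged.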
> HOW:
> -------
> A new RequestHandler called MultiSearchRequestHandler does the federated 
> search on multiple sub-searchers (referred to as "shards" from here on). It 
> extends RequestHandlerBase. The handleRequestBody method in 
> RequestHandlerBase has been divided into query-building and execute methods. 
> This has been done to calculate global numDocs and docFreqs, and to execute 
> the query efficiently on multiple shards.
> All the "search" request handlers are expected to extend the 
> MultiSearchRequestHandler class in order to enable federated capability for 
> the handler. StandardRequestHandler and DisMaxRequestHandler have been 
> changed to extend this class.
>  
> The federated search kicks in if a "shards" parameter is present in the 
> request. Otherwise the search is performed as usual on the local index. For 
> example, shards=local,host1:port1,host2:port2 will search the local index 
> and 2 remote indexes. The search responses from all 3 shards are merged and 
> served back to the client.
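
To make the parameter format concrete, here is a rough Python sketch of how a "shards" value could be split into targets. This is an illustration only, not the parsing code from the patch: the function name `parse_shards` and the tuple layout it returns are assumptions.

```python
def parse_shards(param):
    """Split a "shards" request parameter into a list of targets.

    Returns an empty list when the parameter is absent, which stands for
    "search the local index as usual". Each target is a hypothetical
    (kind, host, port) tuple.
    """
    if not param:
        return []
    targets = []
    for entry in param.split(","):
        entry = entry.strip()
        if not entry:
            continue
        if entry == "local":
            targets.append(("local", None, None))
            continue
        # Split on the last colon so the port is always the final piece.
        host, sep, port = entry.rpartition(":")
        if not sep or not host:
            raise ValueError("expected host:port, got %r" % entry)
        targets.append(("remote", host, int(port)))
    return targets
```

For instance, `parse_shards("local,host1:8983,host2:8984")` yields one local target and two remote ones (the port numbers are made up for the example).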
> The search request processing on the set of shards is performed as follows:
> STEP 1: The query is built and its terms are extracted. Global numDocs and 
> docFreqs are calculated by requesting numDocs and docFreqs from every shard 
> and adding them up.
> STEP 2: (FirstQueryPhase) All shards are queried. Global numDocs and docFreqs 
> are passed as request parameters. The full set of document fields is NOT 
> requested; only the document uniqFields and sort fields are. MoreLikeThis and 
> highlighting information are NOT requested.
> STEP 3: Responses from FirstQueryPhase are merged based on the "sort", 
> "start" and "rows" params. The merged doc uniqFields and sort fields are 
> collected. Other information such as facet and debug info is also merged.
> STEP 4: (SecondQueryPhase) The merged doc uniqFields and sort fields are 
> grouped by shard. Each shard in the grouping is queried for its merged doc 
> uniqFields (from FirstQueryPhase), along with highlighting and moreLikeThis 
> info.
> STEP 5: Responses from all shards in SecondQueryPhase are merged.
> STEP 6: Document fields, highlighting and moreLikeThis info from 
> SecondQueryPhase are merged into the FirstQueryPhase response.
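
Steps 3 to 6 above can be sketched roughly as follows, in Python and with in-memory data standing in for the HTTP responses. This is only an illustration under stated assumptions, not the patch's implementation: the function names, the (doc id, score) hit layout and the `fetched` structure are all hypothetical, and sorting is by score only, in line with the TODO on other sort fields.

```python
from collections import defaultdict


def merge_first_phase(responses, start, rows):
    """STEP 3: merge per-shard hits by score, then apply start/rows.

    responses is a hypothetical {shard: [(doc_id, score), ...]} mapping.
    """
    hits = [
        (shard, doc_id, score)
        for shard, shard_hits in responses.items()
        for doc_id, score in shard_hits
    ]
    # Highest score first; Python's sort is stable, so ties keep input order.
    hits.sort(key=lambda h: h[2], reverse=True)
    return hits[start:start + rows]


def group_by_shard(page):
    """STEP 4: group the merged doc ids by the shard that owns them."""
    groups = defaultdict(list)
    for shard, doc_id, _score in page:
        groups[shard].append(doc_id)
    return dict(groups)


def merge_second_phase(page, fetched):
    """STEPS 5-6: attach the fetched fields to the first-phase ordering.

    fetched is a hypothetical {shard: {doc_id: {field: value}}} mapping
    built from the second-phase responses.
    """
    return [
        {"id": doc_id, "score": score, **fetched[shard][doc_id]}
        for shard, doc_id, score in page
    ]
```

Only the documents that survive the merge are asked for their fields and highlighting in the second phase, which is where the CPU saving described earlier comes from.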
> TODO:
> - Support sort fields other than the default score
> - Support ResponseDocs in writers other than XMLWriter
> - HTTP connection timeouts
> OPEN ISSUES:
> - Merging of facets by "top n terms of field f"
> Scope for performance optimization:
> - Search shards in parallel threads
> - HTTP connection Keep-Alive?
> - Cache global numDocs and docFreqs
> - Cache Query objects in handlers?
> I would appreciate feedback on my approach. I understand there may be a lot 
> of things I have overlooked.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
