[ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543553 ]
Sabyasachi Dalal commented on SOLR-303: --------------------------------------- I mean i removed the files pertaining to 281. If you follow the development above, the files pertaining to 281 were added to this patch to make it easier to apply this patch. > Federated Search over HTTP > -------------------------- > > Key: SOLR-303 > URL: https://issues.apache.org/jira/browse/SOLR-303 > Project: Solr > Issue Type: New Feature > Components: search > Reporter: Sharad Agarwal > Priority: Minor > Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, > fedsearch.patch, fedsearch.patch, fedsearch.stu.patch, fedsearch.stu.patch > > > Motivated by http://wiki.apache.org/solr/FederatedSearch > "Index view consistency between multiple requests" requirement is relaxed in > this implementation. > Does the federated search query side. Update not yet done. > Tries to achieve:- > ------------------------ > - The client applications are totally agnostic to federated search. The > federated search and merging of results are totally behind the scene in Solr > in request handler . Response format remains the same after merging of > results. > The response from individual shard is deserialized into SolrQueryResponse > object. The collection of SolrQueryResponse objects are merged to produce a > single SolrQueryResponse object. This enables to use the Response writers as > it is; or with minimal change. > - Efficient query processing with highlighting and fields getting generated > only for merged documents. The query is executed in 2 phases. First phase > gets the doc unique keys with sort criteria. Second phase brings all > requested fields and highlighting information. This saves lot of CPU in case > there are good number of shards and highlighting info is requested. > Should be easy to customize the query execution. For example: user can > specify to execute query in just 1 phase itself. (For some queries when > highlighting info is not required and number of fields requested are small; > this can be more efficient.) > - Ability to easily overwrite the default Federated capability by appropriate > plugins and request parameters. As federated search is performed by the > RequestHandler itself, multiple request handlers can easily be pre-configured > with different federated search settings in solrconfig.xml > - Global weight calculation is done by querying the terms' doc frequencies > from all shards. > - Federated search works on Http transport. So individual shard's VIP can be > queried. Load-balancing and Fail-over taken care by VIP as usual. > -Sub-searcher response parsing as a plugin interface. Different > implementation could be written based on JSON, xml SAX etc. Current one based > on XML DOM. > HOW: > ------- > A new RequestHandler called MultiSearchRequestHandler does the federated > search on multiple sub-searchers, (referred as "shards" going forward). It > extends the RequestHandlerBase. handleRequestBody method in > RequestHandlerBase has been divided into query building and execute methods. > This has been done to calculate global numDocs and docFreqs; and execute the > query efficiently on multiple shards. > All the "search" request handlers are expected to extend > MultiSearchRequestHandler class in order to enable federated capability for > the handler. StandardRequestHandler and DisMaxRequestHandler have been > changed to extend this class. > > The federated search kicks in if "shards" is present in the request > parameter. Otherwise search is performed as usual on the local index. eg. > shards=local,host1:port1,host2:port2 will search on the local index and 2 > remote indexes. The search response from all 3 shards are merged and serviced > back to the client. > The search request processing on the set of shards is performed as follows: > STEP 1: The query is built, terms are extracted. Global numDocs and docFreqs > are calculated by requesting all the shards and adding up numDocs and > docFreqs from each shard. > STEP 2: (FirstQueryPhase) All shards are queried. Global numDocs and docFreqs > are passed as request parameters. All document fields are NOT requested, only > document uniqFields and sort fields are requested. MoreLikeThis and > Highlighting information are NOT requested. > STEP 3: Responses from FirstQueryPhase are merged based on "sort", "start" > and "rows" params. Merged doc uniqField and sort fields are collected. Other > information like facet and debug is also merged. > STEP 4: (SecondQueryPhase) Merged doc uniqFields and sort fields are grouped > based on shards. All shards in the grouping are queried for the merged doc > uniqFields (from FirstQueryPhase), highlighting and moreLikeThis info. > STEP 5: Responses from all shards from SecondQueryPhase are merged. > STEP 6: Document fields , highlighting and moreLikeThis info from > SecondQueryPhase are merged into FirstQueryPhase response. > TODO: > -Support sort field other than default score > -Support ResponseDocs in writers other than XMLWriter > -Http connection timeouts > OPEN ISSUES; > -Merging of facets by "top n terms of field f" > Scope for Performance optimization:- > -Search shards in parallel threads > -Http connection Keep-Alive ? > -Cache global numDocs and docFreqs > -Cache Query objects in handlers ?? > Would appreciate feedback on my approach. I understand that there would be > lot things I might have over-looked. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.