Re: [jira] Updated: (SOLR-303) Federated Search over HTTP

Sharad Agarwal Wed, 19 Sep 2007 21:38:20 -0700

That's correct. I am working on making the federation components fully pluggable with other handler components, removing the need for handlers to extend MultiSearchRequestHandler.


-sharad


Stu Hood wrote:

Yes: my patch does not resolve the issue, it is merely to show the changes I 
had to go through to get Sharad's most recent version running properly so that 
he can incorporate them into his next revision.

It sounds like he is still working on a fully component-based version of 
the patch, and thus the issue is still blocked on SOLR-281 being committed.

Thanks,
Stu


-----Original Message-----

From: patrick o'leary

Sent: Wednesday, September 19, 2007 2:48pm
To: solr-dev@lucene.apache.org
Subject: Re: [jira] Updated: (SOLR-303) Federated Search over HTTP

Is this still blocked by SOLR-281?



Stu Hood (JIRA) wrote:

[ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Stu Hood updated SOLR-303:
--------------------------

    Attachment: fedsearch.stu.patch

I got the rest of the DF issues resolved: please refer to the attached and 
ignore my earlier comments (some of them were faulty).

Here is a patch that is very similar to your last patch, but with my fixes 
included. If you `diff fedsearch.stu.patch fedsearch.patch` you should be able 
to see what I did.

The final (minor) issue I've found is that when I strip the 'start' parameter 
in SecondQPhaseComponent.createSecondPhaseParams, it gets stripped from the 
response that is returned to the user as well (although it is honored in the 
results).

Thanks again!

Federated Search over HTTP

--------------------------

                Key: SOLR-303
                URL: https://issues.apache.org/jira/browse/SOLR-303
            Project: Solr
         Issue Type: New Feature
         Components: search
           Reporter: Sharad Agarwal
           Priority: Minor
        Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, 
fedsearch.stu.patch


Motivated by http://wiki.apache.org/solr/FederatedSearch
The "Index view consistency between multiple requests" requirement is relaxed in 
this implementation.
This covers the query side of federated search. The update side is not yet done.
Tries to achieve:-
------------------------
- Client applications are completely agnostic to federated search. The 
federated search and the merging of results happen entirely behind the scenes, 
inside the Solr request handler. The response format remains the same after 
merging. The response from each individual shard is deserialized into a 
SolrQueryResponse object, and the collection of SolrQueryResponse objects is 
merged to produce a single SolrQueryResponse object. This allows the response 
writers to be used as is, or with minimal change.
- Efficient query processing, with highlighting and fields generated only for 
merged documents. The query is executed in 2 phases. The first phase gets the 
doc unique keys with the sort criteria. The second phase brings in all requested 
fields and highlighting information. This saves a lot of CPU when there are 
many shards and highlighting info is requested.
Query execution should be easy to customize. For example, the user can specify 
that the query be executed in just 1 phase. (For some queries, when highlighting 
info is not required and the number of fields requested is small, this can be 
more efficient.)
- Ability to easily override the default federated capability through appropriate 
plugins and request parameters. As federated search is performed by the 
RequestHandler itself, multiple request handlers can easily be pre-configured 
with different federated search settings in solrconfig.xml.
- Global weight calculation is done by querying the terms' doc frequencies from 
all shards.
- Federated search works over HTTP transport, so an individual shard's VIP can be 
queried. Load balancing and failover are taken care of by the VIP as usual.
- Sub-searcher response parsing is a plugin interface. Different implementations 
could be written based on JSON, XML SAX, etc. The current one is based on XML DOM.
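
To make the global weight calculation concrete, here is a minimal Java sketch of adding up numDocs and per-term docFreqs across shards. ShardStats and GlobalStats are hypothetical names used only for illustration; they are not classes from the patch. The idf method uses the classic Lucene-style formula as an assumption about how the summed values would feed into scoring.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical per-shard statistics holder (not part of the patch). */
final class ShardStats {
    final long numDocs;
    final Map<String, Long> docFreqs;

    ShardStats(long numDocs, Map<String, Long> docFreqs) {
        this.numDocs = numDocs;
        this.docFreqs = docFreqs;
    }
}

final class GlobalStats {
    /** Adds up numDocs and the docFreq of every term over all shards. */
    static ShardStats sum(List<ShardStats> shards) {
        long numDocs = 0;
        Map<String, Long> docFreqs = new HashMap<>();
        for (ShardStats s : shards) {
            numDocs += s.numDocs;
            for (Map.Entry<String, Long> e : s.docFreqs.entrySet()) {
                // A term missing on one shard simply contributes nothing there.
                docFreqs.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return new ShardStats(numDocs, docFreqs);
    }

    /** Assumed classic idf: 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(long numDocs, long docFreq) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }
}
```

Passing the summed values to every shard means each shard scores a term with the same idf, so scores from different shards are comparable when merged.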
HOW:
-------
A new RequestHandler called MultiSearchRequestHandler does the federated search on 
multiple sub-searchers (referred to as "shards" going forward). It extends 
RequestHandlerBase. The handleRequestBody method in RequestHandlerBase has been divided into 
query building and execute methods. This has been done to calculate global numDocs and 
docFreqs, and to execute the query efficiently on multiple shards.
All the "search" request handlers are expected to extend the 
MultiSearchRequestHandler class in order to enable federated capability for the handler. 
StandardRequestHandler and DisMaxRequestHandler have been changed to extend this class.

Federated search kicks in if "shards" is present in the request parameters. Otherwise the search is performed as usual on the local index. E.g. shards=local,host1:port1,host2:port2 will search the local index and 2 remote indexes. The search responses from all 3 shards are merged and served back to the client.

The search request processing on the set of shards is performed as follows:

STEP 1: The query is built, terms are extracted. Global numDocs and docFreqs 
are calculated by requesting all the shards and adding up numDocs and docFreqs 
from each shard.
STEP 2: (FirstQueryPhase) All shards are queried. Global numDocs and docFreqs 
are passed as request parameters. Not all document fields are requested: only 
the document uniqFields and sort fields are. MoreLikeThis and 
Highlighting information are NOT requested.
STEP 3: Responses from FirstQueryPhase are merged based on "sort", "start" and 
"rows" params. Merged doc uniqField and sort fields are collected. Other information like facet and 
debug is also merged.
STEP 4: (SecondQueryPhase) Merged doc uniqFields and sort fields are grouped 
based on shards. All shards in the grouping are queried for the merged doc 
uniqFields (from FirstQueryPhase), highlighting and moreLikeThis info.
STEP 5: Responses from all shards from SecondQueryPhase are merged.
STEP 6: Document fields, highlighting and moreLikeThis info from 
SecondQueryPhase are merged into FirstQueryPhase response.
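
Steps 3 and 4 above can be sketched roughly as follows. This is only an illustration under stated assumptions: Hit and TwoPhaseMerge are hypothetical names, not classes from the patch; sorting is by descending score only (the default sort); and each shard is assumed to have returned at least its top start+rows hits in the first phase.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical first-phase hit: unique key, score, and originating shard. */
final class Hit {
    final String id;
    final float score;
    final String shard;

    Hit(String id, float score, String shard) {
        this.id = id;
        this.score = score;
        this.shard = shard;
    }
}

final class TwoPhaseMerge {
    /** Step 3: merge all shard hits by descending score, then apply start/rows. */
    static List<Hit> mergeFirstPhase(List<List<Hit>> perShard, int start, int rows) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perShard) {
            all.addAll(hits);
        }
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        int from = Math.min(start, all.size());
        int to = Math.min(from + rows, all.size());
        return new ArrayList<>(all.subList(from, to));
    }

    /** Step 4: group the merged unique keys by the shard that owns them. */
    static Map<String, List<String>> groupByShard(List<Hit> merged) {
        Map<String, List<String>> byShard = new LinkedHashMap<>();
        for (Hit h : merged) {
            byShard.computeIfAbsent(h.shard, k -> new ArrayList<>()).add(h.id);
        }
        return byShard;
    }
}
```

Only the shards that appear as keys in the grouping need to be contacted in the second phase, and each one is asked only for its own unique keys.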
TODO:
-Support sort field other than default score
-Support ResponseDocs in writers other than XMLWriter
-Http connection timeouts
OPEN ISSUES:
-Merging of facets by "top n terms of field f"

Scope for Performance optimization:-

-Search shards in parallel threads
-Http connection Keep-Alive ?
-Cache global numDocs and docFreqs
-Cache Query objects in handlers ??
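
As a rough idea of what "search shards in parallel threads" could look like, here is a small Java sketch that splits a "shards" value and queries every shard concurrently with an ExecutorService. ParallelShardQuery is a hypothetical name, and the per-shard query is passed in as a function because the actual HTTP call is outside the scope of this sketch. None of this is taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ParallelShardQuery {
    /** Splits a "shards" value such as "local,host1:port1,host2:port2". */
    static List<String> parseShards(String param) {
        List<String> shards = new ArrayList<>();
        for (String s : param.split(",")) {
            String trimmed = s.trim();
            if (!trimmed.isEmpty()) {
                shards.add(trimmed);
            }
        }
        return shards;
    }

    /** Queries every shard on its own thread; results come back in shard order. */
    static <R> List<R> queryAll(List<String> shards, Function<String, R> queryOneShard) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, shards.size()));
        try {
            List<Callable<R>> tasks = new ArrayList<>();
            for (String shard : shards) {
                tasks.add(() -> queryOneShard.apply(shard));
            }
            List<R> results = new ArrayList<>();
            // invokeAll blocks until all tasks finish and preserves task order.
            for (Future<R> f : pool.invokeAll(tasks)) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With this shape the total latency is roughly that of the slowest shard rather than the sum over all shards. A shared, long-lived pool would probably be preferable to creating one per request.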

Would appreciate feedback on my approach. I understand that there may be a lot of things I have overlooked.
