[ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513167 ]
Sharad Agarwal edited comment on SOLR-303 at 7/17/07 12:27 AM:
---------------------------------------------------------------

> "Index view consistency between multiple requests" requirement is relaxed in
> this implementation.
>> Do you have plans to remedy that? Or do you think that most people are OK
>> with inconsistencies that could arise?

The thing to note here is that the current multi-phase execution is based on document unique fields, NOT on internal doc ids. So there won't be many inconsistencies between requests, as it does not depend on changing internal doc ids. One possibility is that a particular document has been deleted by the time the second phase executes; in my opinion that is OK to live with. Another possibility is that the document has changed and the original query terms are no longer present in it. This can be solved by doing an AND of the original query with the unique-field document query.

If people think it is really crucial to have index view consistency, then it should be easy to implement "Consistency via Retry" as mentioned in http://wiki.apache.org/solr/FederatedSearch

>> It might also be the case that a custom partitioning function could be
>> implemented (such as improving caching by partitioning queries, etc) or it may
>> be more efficient to do the second phase of a query on the same shard
>> copy as the first phase.
>> In that case it might make sense load balancing across shards from Solr.

For the second phase of a query to execute on the same shard copy, third-party "sticky" load balancers can be used. I believe Apache already does that. All copies of a single partition can sit behind the Apache load balancer (doing the stickiness). The merger just needs to know the load balancer's ip/port for each partition; based on the query, the merger can then search only the appropriate partitions. To improve caching, Solr itself has to do the load balancing. Another option would be to introduce a query result cache at the merger itself.
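The "AND with the unique-field document query" fix above can be sketched as follows. This is a hypothetical illustration, not code from the patch: the class, method, and field names are invented, and it only builds the query string for the second phase.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: restrict the SecondQueryPhase to docs that STILL match
// the original query, so a document that changed between phases (and no longer
// matches) is silently dropped instead of being returned stale.
class SecondPhaseQueryBuilder {
    static String build(String originalQuery, List<String> uniqueKeys, String uniqueKeyField) {
        // id:K1 OR id:K2 ... -- the unique keys selected by the FirstQueryPhase merge
        String keyClause = uniqueKeys.stream()
            .map(k -> uniqueKeyField + ":" + k)
            .collect(Collectors.joining(" OR "));
        // Both clauses are required, i.e. (original AND keys)
        return "+(" + originalQuery + ") +(" + keyClause + ")";
    }
}
```

The cost is that the second-phase query is no longer a pure primary-key lookup, so it is a trade between consistency and second-phase cheapness.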
>> Where are terms extracted from (some queries require index access)? This
>> should be delegated to the shards, no? It can be the same step that gets the
>> docFreqs from the shards (pass the query, *not* the terms).

Yes, if that's the case, it should be easy to implement as you have suggested.

>> I think we should base the solution on something like
>> https://issues.apache.org/jira/browse/SOLR-281

Cool, I was looking for something like this. This looks like the way to go.

>> Any thoughts on RMI vs HTTP for the searcher-subsearcher interface?

RMI could be supported as an option by enhancing the ResponseParser (better name??) interface. The remote search server can directly return the SolrQueryResponse object. I understand there would be some performance benefit from native Java marshalling/unmarshalling of the object, instead of Solr response writing and then parsing (the HTTP way). The question we need to answer is: is the effort/complexity worth it? In our organization we made a conscious decision to go with HTTP. The operations folks like HTTP as it is standard stuff: load balancing, monitoring, etc., with lots of tools already available for it. With RMI, I am not sure external sticky load balancing is possible; the merger itself would have to build that logic. Moreover, I think HTTP fits more naturally with Solr's request handler model.
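The pluggable sub-searcher response parsing discussed in this thread might look roughly like the sketch below. The interface and class names are illustrative only (not the patch's actual API); the point is that the merger depends only on the interface, so XML DOM, SAX, JSON, or even RMI-backed implementations could be swapped in via configuration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical plugin boundary between the merger and a shard's response.
interface ShardResponseParser {
    long numFound(String rawResponse);
}

// Illustrative HTTP/XML implementation: pull numFound out of the standard
// <result name="response" numFound="..."> element. A real implementation
// would use DOM or SAX as discussed above; a regex keeps this sketch short.
class XmlShardResponseParser implements ShardResponseParser {
    private static final Pattern NUM_FOUND = Pattern.compile("numFound=\"(\\d+)\"");

    public long numFound(String rawResponse) {
        Matcher m = NUM_FOUND.matcher(rawResponse);
        if (!m.find()) {
            throw new IllegalArgumentException("no numFound in shard response");
        }
        return Long.parseLong(m.group(1));
    }
}
```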
> Federated Search over HTTP
> --------------------------
>
>          Key: SOLR-303
>          URL: https://issues.apache.org/jira/browse/SOLR-303
>      Project: Solr
>   Issue Type: New Feature
>   Components: search
>     Reporter: Sharad Agarwal
>     Priority: Minor
>  Attachments: fedsearch.patch
>
> Motivated by http://wiki.apache.org/solr/FederatedSearch
> The "index view consistency between multiple requests" requirement is relaxed
> in this implementation.
> Does the federated search query side. The update side is not yet done.
>
> Tries to achieve:
> ------------------------
> - The client applications are totally agnostic to federated search. The
> federated search and merging of results happen entirely behind the scenes in
> Solr, in the request handler. The response format remains the same after
> merging of results. The response from each individual shard is deserialized
> into a SolrQueryResponse object, and the collection of SolrQueryResponse
> objects is merged to produce a single SolrQueryResponse object. This makes it
> possible to use the response writers as-is, or with minimal change.
> - Efficient query processing, with highlighting and fields generated only for
> merged documents. The query is executed in 2 phases. The first phase gets the
> doc unique keys with sort criteria. The second phase fetches all requested
> fields and highlighting information. This saves a lot of CPU when there are a
> good number of shards and highlighting info is requested. It should be easy
> to customize the query execution. For example, a user can specify to execute
> the query in just 1 phase. (For some queries, when highlighting info is not
> required and the number of requested fields is small, this can be more
> efficient.)
> - Ability to easily override the default federated capability via appropriate
> plugins and request parameters. As federated search is performed by the
> RequestHandler itself, multiple request handlers can easily be pre-configured
> with different federated search settings in solrconfig.xml.
> - Global weight calculation is done by querying the terms' doc frequencies
> from all shards.
> - Federated search works over HTTP transport, so an individual shard's VIP
> can be queried. Load balancing and failover are taken care of by the VIP as
> usual.
> - Sub-searcher response parsing is a plugin interface. Different
> implementations could be written based on JSON, XML SAX, etc. The current one
> is based on XML DOM.
>
> HOW:
> -------
> A new RequestHandler called MultiSearchRequestHandler does the federated
> search over multiple sub-searchers (referred to as "shards" going forward).
> It extends RequestHandlerBase. The handleRequestBody method in
> RequestHandlerBase has been divided into query building and execute methods.
> This has been done to calculate global numDocs and docFreqs, and to execute
> the query efficiently on multiple shards.
> All "search" request handlers are expected to extend the
> MultiSearchRequestHandler class in order to enable federated capability for
> the handler. StandardRequestHandler and DisMaxRequestHandler have been
> changed to extend this class.
>
> The federated search kicks in if "shards" is present in the request
> parameters; otherwise the search is performed as usual on the local index.
> E.g. shards=local,host1:port1,host2:port2 will search the local index and 2
> remote indexes. The search responses from all 3 shards are merged and served
> back to the client.
>
> The search request processing on the set of shards is performed as follows:
> STEP 1: The query is built and terms are extracted. Global numDocs and
> docFreqs are calculated by requesting all the shards and adding up the
> numDocs and docFreqs from each shard.
> STEP 2: (FirstQueryPhase) All shards are queried. Global numDocs and docFreqs
> are passed as request parameters. Document fields are NOT requested; only
> document uniqFields and sort fields are requested. MoreLikeThis and
> highlighting information are NOT requested.
> STEP 3: Responses from the FirstQueryPhase are merged based on the "sort",
> "start" and "rows" params. The merged doc uniqFields and sort fields are
> collected. Other information, like facets and debug info, is also merged.
> STEP 4: (SecondQueryPhase) The merged doc uniqFields and sort fields are
> grouped by shard. All shards in the grouping are queried for the merged doc
> uniqFields (from the FirstQueryPhase), highlighting and moreLikeThis info.
> STEP 5: Responses from all shards from the SecondQueryPhase are merged.
> STEP 6: Document fields, highlighting and moreLikeThis info from the
> SecondQueryPhase are merged into the FirstQueryPhase response.
>
> TODO:
> - Support sort fields other than the default score
> - Support ResponseDocs in writers other than XMLWriter
> - Http connection timeouts
>
> OPEN ISSUES:
> - Merging of facets by "top n terms of field f"
>
> Scope for performance optimization:
> - Search shards in parallel threads
> - Http connection keep-alive?
> - Cache global numDocs and docFreqs
> - Cache Query objects in handlers??
>
> Would appreciate feedback on my approach. I understand that there are a lot
> of things I might have overlooked.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
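The STEP 3 merge of FirstQueryPhase responses can be sketched as below. This is a hedged illustration under stated assumptions, not the patch's actual code: the class and record names are invented, the sort criterion is fixed to the default score (descending), and each shard response is reduced to (uniqKey, score) pairs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the FirstQueryPhase merge: pool every shard's
// (uniqKey, score) pairs, sort by the default score criterion, and apply the
// client's "start"/"rows" paging params to select the globally top documents.
class FirstPhaseMerger {
    record ShardDoc(String shard, String uniqKey, float score) {}

    static List<ShardDoc> merge(List<List<ShardDoc>> shardResults, int start, int rows) {
        List<ShardDoc> all = new ArrayList<>();
        shardResults.forEach(all::addAll);
        // Default sort: score descending (other sort fields are a TODO above)
        all.sort(Comparator.comparingDouble((ShardDoc d) -> d.score).reversed());
        int from = Math.min(start, all.size());
        int to = Math.min(start + rows, all.size());
        return all.subList(from, to);
    }
}
```

The surviving ShardDoc entries carry their shard of origin, which is exactly what STEP 4 needs to group the uniqFields by shard for the SecondQueryPhase.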