Re: Querying locally before sending a distributed request
I have submitted a patch for the ticket at https://issues.apache.org/jira/browse/SOLR-6832

The patch adds an option, *preferLocalShards*, that can be set in solrconfig.xml and in the query request params (the value in the query takes precedence). If this option is set, HttpShardHandler.preferCurrentHostForDistributedReq() tries to find a local URL and puts that URL first in the list of URLs sent to LBHttpSolrServer. This ensures that the current host's cores are given preference for distributed queries. The current host's URL is found by ResponseBuilder.findCurrentHostAddress(), which searches for the current core's name in the list of shards.

The default value of the option is 'false', so the existing behavior is unchanged unless the option is enabled.

Before putting more effort into writing test cases, I would like some comments on this patch so that I know I am heading in the right direction.

Thanks
Sachin

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
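The reordering step described in the patch announcement can be illustrated with a small sketch. This is not the code from the SOLR-6832 patch: the class name, the method name, and the prefix-matching rule are assumptions made for illustration only. It shows the general idea of moving the local replica's URL to the front of the list that a load balancer would try in order, and leaving the list untouched when the option is off.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Solr or of the patch.
final class PreferLocalSketch {
    private PreferLocalSketch() {}

    /**
     * Returns a copy of replicaUrls in which every URL that starts with
     * localBaseUrl comes first. Relative order is otherwise preserved.
     * When preferLocal is false (the default in the proposal), the list
     * is returned in its original order.
     *
     * Assumption: localBaseUrl ends with "/" so that prefix matching
     * cannot confuse two different web contexts on the same host.
     */
    static List<String> orderReplicas(List<String> replicaUrls,
                                      String localBaseUrl,
                                      boolean preferLocal) {
        if (!preferLocal || localBaseUrl == null) {
            return new ArrayList<>(replicaUrls);
        }
        List<String> local = new ArrayList<>();
        List<String> remote = new ArrayList<>();
        for (String url : replicaUrls) {
            if (url.startsWith(localBaseUrl)) {
                local.add(url);
            } else {
                remote.add(url);
            }
        }
        local.addAll(remote); // local replicas first, then everything else
        return local;
    }
}
```

Keeping the remote replicas in the list, rather than dropping them, means a failover target is still available if the local core cannot answer.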
Re: Querying locally before sending a distributed request
I have submitted a patch for this at https://issues.apache.org/jira/browse/SOLR-6832

I would appreciate it if someone could review it.

Thanks
SG
Re: Querying locally before sending a distributed request
Just skimming, but if I'm reading this right, your suggestion is that queries be served locally rather than being forwarded to another replica when possible. So let's take the one-shard case with N replicas to make sure I understand. In a one-shard case, no query really needs to be forwarded, since any replica can fully get the results, so in this case no query would be forwarded.

If this is a fair summary, then consider the situation where the outside world connects to a single server rather than to a fronting load balancer. Then only one replica would be doing any work. Or am I off in the weeds?

That aside, if I've gotten it wrong and you want to put up a patch (or even just outline a better approach), feel free to open a JIRA and attach a patch...

Best,
Erick
Re: Querying locally before sending a distributed request
This is a cool idea, and there are two extremes to consider.

1. One shard, N replicas, single connection point for consumers. This case needs forwarding.
2. Many shards, 2 replicas each, random connection points for consumers. I think this is the case that SG had in mind.

In order to meet both use cases, would it make sense to have a "prefer local reads" configuration option where a Core can check itself if instructed to?

Mike
Re: Querying locally before sending a distributed request
bq. In a one-shard case, no query really needs to be forwarded, since any replica can fully get the results so in this case no query would be forwarded.

You can pass the request param distrib=false to not distribute the request in that particular case, at which point it will only gather results from that particular host.

As for the SolrCloud example with n-shards > 1, your overall search request time is limited by the slowest shard's response time. So you would potentially be saving one hop, but you are still making n-1 other hops to gather all of the other shards' results, which makes it a moot point since you will be waiting on the other shards to respond before you can return the aggregated result list. You will then be on the hook to set up the load balancing across replicas of that one particular host you have chosen to query, as Erick said, which could have some gotchas for people not expecting that behavior.

-Steve
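Steve's latency argument can be made concrete with a toy model. The sketch below assumes the coordinator waits for every shard and that total wait time equals the slowest shard's response time. It ignores merge cost, and all names and numbers are made up for illustration. It shows that removing one network hop only shortens the overall request when the locally served shard happened to be the slowest one.

```java
// Toy latency model for illustration; not Solr code.
final class FanOutLatencySketch {
    private FanOutLatencySketch() {}

    /** Overall wait of a scatter-gather request: the slowest shard response. */
    static long overallMillis(long[] shardMillis) {
        long max = 0;
        for (long millis : shardMillis) {
            max = Math.max(max, millis);
        }
        return max;
    }

    /**
     * Same model, but the shard at localIndex is answered by the local core,
     * so its response time shrinks by the (assumed) network hop cost.
     */
    static long overallWithLocalShard(long[] shardMillis, int localIndex, long hopMillis) {
        long[] adjusted = shardMillis.clone();
        adjusted[localIndex] = Math.max(0, adjusted[localIndex] - hopMillis);
        return overallMillis(adjusted);
    }
}
```

For example, with hypothetical shard times of 40, 55 and 70 ms and a 5 ms hop, serving the 40 ms shard locally still yields 70 ms overall, while serving the 70 ms shard locally yields 65 ms.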
Re: Querying locally before sending a distributed request
I have opened https://issues.apache.org/jira/browse/SOLR-6832 to track this.

The performance gain increases if coresPerMachine is > 1 and a single JVM has cores from 'k' shards. We can also look into giving more preference to machines with the same IP address as the current machine (when multiple Tomcat instances are running on the same machine).
Re: Querying locally before sending a distributed request
On 12/9/2014 10:55 PM, S G wrote:
> For a distributed query, the request is always sent to all the shards even if the originating SolrCore (handling the original distributed query) is a replica of one of the shards. If the original Solr-Core can check itself before sending http requests for any shard, we can probably save some network hopping and gain some performance.

I have to agree with the other replies you've gotten.

Consider a SolrCloud that is handling 5000 requests per second with a replicationFactor of 20 or 30. This could be one shard or multiple shards. Currently, those requests will be load balanced to the entire cluster. If this option is implemented, suddenly EVERY request will have at least one part handled locally ... and unless the index is very tiny or 99 percent of the queries hit a Solr cache, one index core simply won't be able to handle 5000 queries per second. Getting a single machine capable of handling that load MIGHT be possible, but it would likely be *VERY* expensive.

This would be great as an *OPTION* that can be enabled when the index composition and query patterns dictate it will be beneficial ... but it definitely should not be default behavior.

Thanks,
Shawn
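The load concern in Shawn's reply comes down to simple arithmetic, sketched below. The model is a deliberate simplification and the method names are hypothetical: it assumes a single shard, perfectly even load balancing in the current behavior, and that with local preference every request is served by the node it first arrives at.

```java
// Back-of-the-envelope load model for illustration; not Solr code.
final class LoadSketch {
    private LoadSketch() {}

    /** Current behavior: queries for a shard are spread over all its replicas. */
    static double perReplicaQpsBalanced(double totalQps, int replicationFactor) {
        return totalQps / replicationFactor;
    }

    /**
     * Local preference: each entry node serves its own traffic, so the load
     * per core depends on how many nodes the clients actually connect to.
     */
    static double perEntryNodeQpsPreferLocal(double totalQps, int entryNodes) {
        return totalQps / entryNodes;
    }
}
```

With 5000 requests per second and a replicationFactor of 20, the balanced case works out to 250 queries per second per replica. If all clients connect to a single node and local preference is on, that one core sees the full 5000, which is the situation the reply warns about. A fronting load balancer that spreads clients over all 20 nodes would bring the figure back to 250.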
Querying locally before sending a distributed request
Hello Solr Devs,

I am a developer using Solr and would like some opinions on a performance change request.

Currently, I see that the code flow for a query in SolrCloud is as follows:

For a distributed query:
SolrCore -> SearchHandler.handleRequestBody() -> HttpShardHandler.submit()

For a non-distributed query:
SolrCore -> SearchHandler.handleRequestBody() -> QueryComponent.process()

For a distributed query, the request is always sent to all the shards even if the originating SolrCore (handling the original distributed query) is a replica of one of the shards. If the original SolrCore can check itself before sending HTTP requests for any shard, we can probably save some network hopping and gain some performance.

If this idea seems feasible, I can submit a JIRA ticket and work on it. I am planning to change SearchHandler.handleRequestBody() or HttpShardHandler.submit().

Thanks
SG