[jira] Commented: (SOLR-1044) Use Hadoop RPC for inter Solr communication
[ https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12679466#action_12679466 ]

Mike Klaas commented on SOLR-1044:

{quote}
I haven't yet seen an HTTP server serving more than around 1200 req/sec (Apache HTTPD). A call-based server can serve 4k-5k messages easily. (I am yet to test Hadoop RPC.) The proliferation of a large number of frameworks around that is a testimony to the superiority of that approach.
{/quote}

Up to 50,000 req/sec, with keepalive: http://www.litespeedtech.com/web-server-performance-comparison-litespeed-2.0-vs.html

Use Hadoop RPC for inter Solr communication
-------------------------------------------

Key: SOLR-1044
URL: https://issues.apache.org/jira/browse/SOLR-1044
Project: Solr
Issue Type: New Feature
Components: search
Reporter: Noble Paul

Solr uses HTTP for distributed search. We can make it a whole lot faster if we use an RPC mechanism which is more lightweight/efficient. Hadoop RPC looks like a good candidate for this.

The implementation should have just one protocol. It should follow Solr's idiom of making remote calls: a uri + params + [optional stream(s)]. The response can be a stream of bytes.

To make this work we must make the SolrServer implementation pluggable in distributed search. Users should be able to choose between the current CommonsHttpSolrServer and a HadoopRpcSolrServer.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
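The call shape in the issue description (a uri, params, optional streams, and a byte-stream response behind a pluggable client) can be illustrated with a minimal Python sketch. The issue itself targets Java; every name below (RemoteRequest, Transport, LoopbackTransport, distributed_query) is hypothetical and is not part of Solr or Hadoop. The loopback class only echoes the call so the example runs without a network.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RemoteRequest:
    """One remote call: uri + params + optional stream(s) (hypothetical type)."""
    uri: str
    params: Dict[str, str] = field(default_factory=dict)
    streams: List[bytes] = field(default_factory=list)


class Transport(ABC):
    """The pluggable piece: HTTP or an RPC mechanism would each implement this."""

    @abstractmethod
    def request(self, req: RemoteRequest) -> bytes:
        """Send one call and return the response as raw bytes."""


class LoopbackTransport(Transport):
    """Stand-in transport that echoes the call instead of using the network."""

    def __init__(self, name: str) -> None:
        self.name = name

    def request(self, req: RemoteRequest) -> bytes:
        # Sort params so the echoed query string is deterministic.
        query = "&".join(f"{k}={v}" for k, v in sorted(req.params.items()))
        return f"{self.name}:{req.uri}?{query}".encode()


def distributed_query(transport: Transport, shards: List[str],
                      params: Dict[str, str]) -> List[bytes]:
    """Fan one query out to every shard through whichever transport was chosen."""
    return [transport.request(RemoteRequest(uri=f"{shard}/select", params=params))
            for shard in shards]
```

Because distributed_query depends only on the Transport interface, swapping LoopbackTransport("http") for LoopbackTransport("rpc") changes nothing in the caller, which is the property the issue asks for.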
[jira] Commented: (SOLR-1044) Use Hadoop RPC for inter Solr communication
[ https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678987#action_12678987 ]

Yonik Seeley commented on SOLR-1044:

bq. An HTTP connection can be re-used only after the request-response is complete. Meanwhile, if there is another request to be fired to the same server from the same client, a new connection will have to be created.

But the system quickly reaches steady state, right? That new connection will be persistent and hang around for a while to be reused again when needed.

For a high-fanout distributed search, a more important part might actually be message parsing (independent of transport used). I think we've done a decent job with the binary protocol for both CPU and network bandwidth... the actual requests themselves (hitting the Lucene index, doing faceting and highlighting, retrieving stored fields) should hopefully be the bottleneck.
[jira] Commented: (SOLR-1044) Use Hadoop RPC for inter Solr communication
[ https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12679061#action_12679061 ]

Noble Paul commented on SOLR-1044:

bq. But the system quickly reaches steady state, right? That new connection will be persistent and hang around for a while to be reused again when needed.

The system reaches a steady state where the number of connections is slightly greater than the maximum number of parallel requests, whereas a system using a message-based RPC will still have only a single connection between two Solr instances.

bq. For a high-fanout distributed search, a more important part might actually be message parsing (independent of transport used)

As Solr moves towards other applications such as MapReduce/Mahout, where the operations do not involve disk I/O and the payload is small, this can be a problem. My tests with Hadoop RPC showed it outperforming Tomcat when I used a small payload (5 bytes).
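The connection-count argument above can be made concrete with a small Python sketch. It assumes a simplified model in which each request occupies one HTTP connection from start to finish and a multiplexed RPC link carries every request over one socket. The function names and the sample timings are made up for illustration, and the model ignores the few spare connections a real pool keeps around.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_time, end_time) of one request


def http_connections_needed(requests: List[Interval]) -> int:
    """Peak number of overlapping requests, i.e. connections a keep-alive pool must hold."""
    events = []
    for start, end in requests:
        events.append((start, 1))    # request begins: one more connection in use
        events.append((end, -1))     # response complete: connection is reusable
    # At equal timestamps process the release (-1) first, so a connection freed
    # at time t can be reused by a request that starts at time t.
    events.sort(key=lambda e: (e[0], e[1]))
    in_use = peak = 0
    for _, delta in events:
        in_use += delta
        peak = max(peak, in_use)
    return peak


def multiplexed_connections_needed(requests: List[Interval]) -> int:
    """A message-based RPC shares one socket between the two nodes."""
    return 1 if requests else 0


sample = [(0, 5), (1, 3), (2, 6), (5, 7)]
print(http_connections_needed(sample))         # 3
print(multiplexed_connections_needed(sample))  # 1
```

Under this model the HTTP pool size tracks peak concurrency between a pair of nodes, while the multiplexed link stays at one connection regardless of load.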
[jira] Commented: (SOLR-1044) Use Hadoop RPC for inter Solr communication
[ https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678585#action_12678585 ]

Noble Paul commented on SOLR-1044:

bq. do not know much about Solr needs there, but we are using one of prehistoric versions of hadoop RPC (no NIO version)

Disclaimer: I am not a Hadoop user. Have you used the NIO version? How is the perf?
[jira] Commented: (SOLR-1044) Use Hadoop RPC for inter Solr communication
[ https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678587#action_12678587 ]

Yonik Seeley commented on SOLR-1044:

We're using persistent HTTP connections, so socket creation overhead should not be much of an issue.

As far as NIO goes, servlet containers often have NIO connectors (I guess so that idle persistent connections don't take up a thread to listen on them). That handles the receive side. On the sender side, NIO shouldn't matter... all of our clients need a thread to keep the request context anyway, so we really have no way of using NIO there.

There could be an issue surrounding the number of TCP connections in a large cluster (that's an orthogonal issue to NIO), but modern OSs seem to handle high numbers of connections efficiently. Do switches? Or perhaps the real limit has to do with exhausting port numbers (65536)?
[jira] Commented: (SOLR-1044) Use Hadoop RPC for inter Solr communication
[ https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678597#action_12678597 ]

Noble Paul commented on SOLR-1044:

bq. We're using persistent HTTP connections, so socket creation overhead should not be much of an issue.

An HTTP connection can be re-used only after the request-response is complete. Meanwhile, if there is another request to be fired to the same server from the same client, a new connection will have to be created. So the number of connections we create will be quite high if we have a large number of nodes in distributed search.

I haven't yet seen an HTTP server serving more than around 1200 req/sec (Apache HTTPD). A call-based server can serve 4k-5k messages easily. (I am yet to test Hadoop RPC.) The proliferation of a large number of frameworks around that is a testimony to the superiority of that approach.
[jira] Commented: (SOLR-1044) Use Hadoop RPC for inter Solr communication
[ https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678601#action_12678601 ]

Walter Underwood commented on SOLR-1044:

During the Oscars, the HTTP cache in front of our Solr farm had a 90% hit rate. I think a 10X reduction in server load is a testimony to the superiority of the HTTP approach.
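The 90%-to-10X step follows from a simple relation: if a cache answers a fraction h of requests, only 1 - h reach the servers, so load drops by a factor of 1 / (1 - h). A minimal Python sketch of that arithmetic follows. The function name is made up for illustration, and the model assumes every cache hit removes exactly one backend request.

```python
def backend_load_factor(hit_rate: float) -> float:
    """Factor by which backend load shrinks when a cache absorbs `hit_rate` of requests."""
    if not 0.0 <= hit_rate < 1.0:
        raise ValueError("hit_rate must be in [0, 1)")
    # Only the misses (1 - hit_rate) reach the backend.
    return 1.0 / (1.0 - hit_rate)


print(round(backend_load_factor(0.90), 6))  # 10.0
```

The relation is steeply nonlinear near 1: under the same assumption, a 95% hit rate would give roughly a 20X reduction.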
[jira] Commented: (SOLR-1044) Use Hadoop RPC for inter Solr communication
[ https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678605#action_12678605 ]

Shalin Shekhar Mangar commented on SOLR-1044:

bq. During the Oscars, the HTTP cache in front of our Solr farm had a 90% hit rate. I think a 10X reduction in server load is a testimony to the superiority of the HTTP approach.

Nobody is replacing HTTP with RPC :)

HTTP is great, but in a distributed Solr deployment it can be a bottleneck, I guess. I think if we do find RPC giving better throughput than HTTP, the distributed search part is the right place to start using it. We do not need to move to non-HTTP communication (at least not now).
[jira] Commented: (SOLR-1044) Use Hadoop RPC for inter Solr communication
[ https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678105#action_12678105 ]

Yonik Seeley commented on SOLR-1044:

Is our use of HTTP really a bottleneck?

My feeling has been that if we go to a call mechanism, it should be based on something more standard that will have many off-the-shelf bindings: Perl, Python, PHP, C, etc.

On the plus side of Hadoop RPC, it could handle multiple requests per socket. That can also be a potential weakness though, I think... a slow reader or writer for one request/response hangs up all the others.
[jira] Commented: (SOLR-1044) Use Hadoop RPC for inter Solr communication
[ https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678108#action_12678108 ]

Ken Krugler commented on SOLR-1044:

I agree with both of Yonik's points:

# We'd first want to measure real-world performance before deciding that using something other than HTTP was important.
# Using something other than HTTP has related costs that should be considered.

At Krugle we used Hadoop RPC to handle remote searchers. In general it worked well, but we did run into a problem similar to what Yonik voiced as a potential concern: occasionally a remote searcher would hang, and when that happened the socket would essentially become a zombie. Under very heavy load testing this eventually caused the entire system to lock up.

We heard that subsequent changes to Hadoop RPC fixed a number of similar bugs. I'm not sure about any details, though, and we never re-ran tests with the latest Hadoop (at that time, which was about a year ago).

If there are performance issues, I would be curious whether using a long-lasting connection via keep-alive significantly reduces the overhead. I know that Jetty (for example) has a very efficient implementation of the Comet web app model, where you don't wind up needing a gazillion threads to handle many requests/second.
[jira] Commented: (SOLR-1044) Use Hadoop RPC for inter Solr communication
[ https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678121#action_12678121 ]

Eks Dev commented on SOLR-1044:

I do not know much about Solr's needs there, but we are using one of the prehistoric versions of Hadoop RPC (the non-NIO version), as everything else proved to eat far too much time (in an 800+ req/sec environment every millisecond counts). Creating new Sockets does not work there, as OSs start having problems keeping up with this rate (especially with Java, where Socket release is slower due to gc() latency).

We are anyhow contemplating giving Etch (or Thrift) a try. Etch looks like a really good piece of work, with great flexibility. Has anyone tried it?
[jira] Commented: (SOLR-1044) Use Hadoop RPC for inter Solr communication
[ https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678242#action_12678242 ]

Noble Paul commented on SOLR-1044:

bq. Is our use of HTTP really a bottleneck?

We are limited by the servlet engine's ability to serve requests. I guess it would easily peak at 600-800 req/sec, whereas an NIO-based system can serve far more with lower latency (http://www.jboss.org/netty/performance.html). If we have a request served out of cache (no Lucene search involved), the only overhead will be that of HTTP. Then there is the overhead of the servlet engine itself. Moreover, HTTP is not very efficient for a large volume of small requests.

bq. My feeling has been that if we go to a call mechanism, it should be based on something more standard that will have many off the shelf bindings - perl, python, php, C, etc.

I agree. Hadoop looked like a simple RPC mechanism.

bq. That can also be a potential weakness though I think... a slow reader or writer for one request/response hangs up all the others.

The requests on the server are served by multiple handlers (each one is a thread). One request will not block another if there are enough handlers/threads.
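The handler argument above can be illustrated with a small Python sketch that uses a thread pool as a stand-in for server-side handlers. This is not Hadoop RPC code: completion_order and the request names are hypothetical, time.sleep stands in for request work, and the sketch models only handler scheduling, not a slow reader or writer on the shared socket itself.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def completion_order(handler_count: int,
                     requests: List[Tuple[str, float]]) -> List[str]:
    """Run (name, seconds) requests on `handler_count` threads; return names in finish order."""
    finished: List[str] = []
    lock = threading.Lock()

    def handle(name: str, seconds: float) -> None:
        time.sleep(seconds)          # simulated work for this request
        with lock:                   # guard the shared result list
            finished.append(name)

    # Leaving the with-block waits for every submitted request to finish.
    with ThreadPoolExecutor(max_workers=handler_count) as pool:
        for name, seconds in requests:
            pool.submit(handle, name, seconds)
    return finished


load = [("slow", 0.3), ("fast", 0.01)]
print(completion_order(1, load))  # one handler: the fast request waits behind the slow one
print(completion_order(2, load))  # two handlers: the fast request finishes first
```

With a single handler the fast request queues behind the slow one. With two handlers it overtakes it, which matches the "enough handlers/threads" condition in the comment.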