[jira] Commented: (SOLR-1044) Use Hadoop RPC for inter Solr communication

2009-03-05 Thread Mike Klaas (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12679466#action_12679466
 ] 

Mike Klaas commented on SOLR-1044:
--

{quote} I haven't yet seen an HTTP server serving more than around 1200 req/sec 
(Apache HTTPD). A call-based server can serve 4k-5k messages easily. (I am 
yet to test Hadoop RPC.) The proliferation of a large number of frameworks 
around that is a testimony to the superiority of that approach. {/quote}

HTTP servers can reach up to 50,000 req/sec with keepalive: 
http://www.litespeedtech.com/web-server-performance-comparison-litespeed-2.0-vs.html

 Use Hadoop RPC for inter Solr communication
 ---

 Key: SOLR-1044
 URL: https://issues.apache.org/jira/browse/SOLR-1044
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Noble Paul

 Solr uses HTTP for distributed search. We can make it a whole lot faster if 
 we use an RPC mechanism which is more lightweight/efficient. 
 Hadoop RPC looks like a good candidate for this. 
 The implementation should have just one protocol. It should follow Solr's 
 idiom of making remote calls: a URI + params + [optional stream(s)]. The 
 response can be a stream of bytes. 
 To make this work we must make the SolrServer implementation pluggable in 
 distributed search. Users should be able to choose between the current 
 CommonsHttpSolrServer and a HadoopRpcSolrServer. 
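
As a rough illustration of the "one protocol, pluggable transport" idea in the description above, here is a minimal Java sketch. Every name in it (`ShardTransport`, `EchoTransport`, `ShardTransports`, the `"http"` and `"hadoop-rpc"` keys) is hypothetical and exists in neither Solr nor Hadoop; the transports only echo their input rather than talk to a network.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-protocol contract: uri + params + optional streams in,
// a stream of bytes out. These names are assumptions, not Solr or Hadoop APIs.
interface ShardTransport {
    InputStream call(String uri, Map<String, String> params, List<InputStream> streams);
}

// Stand-in for a real transport; it only echoes what it was asked to send.
final class EchoTransport implements ShardTransport {
    private final String name;

    EchoTransport(String name) {
        this.name = name;
    }

    @Override
    public InputStream call(String uri, Map<String, String> params, List<InputStream> streams) {
        String reply = name + " " + uri + " " + params + " streams=" + streams.size();
        return new ByteArrayInputStream(reply.getBytes(StandardCharsets.UTF_8));
    }
}

// Hypothetical selector: distributed search would pick the transport by a configured name.
final class ShardTransports {
    private ShardTransports() {}

    static ShardTransport forName(String configured) {
        switch (configured) {
            case "http":
                return new EchoTransport("http");
            case "hadoop-rpc":
                return new EchoTransport("hadoop-rpc");
            default:
                throw new IllegalArgumentException("unknown transport: " + configured);
        }
    }

    // Drains the response stream into a string (requires Java 9+ for readAllBytes).
    static String readAll(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The point of the sketch is only that callers depend on the interface, so the choice between an HTTP-backed and an RPC-backed implementation becomes configuration rather than code.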

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1044) Use Hadoop RPC for inter Solr communication

2009-03-04 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678987#action_12678987
 ] 

Yonik Seeley commented on SOLR-1044:


bq. An HTTP connection can be reused only after the request-response is 
complete. Meanwhile, if another request has to be fired at the same server 
from the same client, a new connection will have to be created.

But the system quickly reaches steady state, right?  That new connection will 
be persistent and hang around for a while to be reused again when needed.

For a high-fanout distributed search, a more important part might actually be 
message parsing (independent of transport used).  I think we've done a decent 
job with the binary protocol for both CPU and network bandwidth... the actual 
requests themselves (hitting the lucene index, doing faceting and highlighting, 
retrieving stored fields) should hopefully be the bottleneck.




[jira] Commented: (SOLR-1044) Use Hadoop RPC for inter Solr communication

2009-03-04 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12679061#action_12679061
 ] 

Noble Paul commented on SOLR-1044:
--

bq. But the system quickly reaches steady state, right? That new connection will 
be persistent and hang around for a while to be reused again when needed.

The system reaches a steady state where the number of connections is slightly 
greater than the maximum number of parallel requests, whereas a system using a 
message-based RPC will still have only a single connection between two Solr 
instances. 
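
To make the connection-count argument concrete, the sketch below computes how many persistent connections a non-multiplexed client needs for a given set of overlapping requests. It is an idealized model under stated assumptions: `ConnectionCount` and its input format are made up for illustration, a connection is treated as reusable the instant its request ends, and real pools typically hold a few more than this minimum.

```java
import java.util.Arrays;

final class ConnectionCount {
    private ConnectionCount() {}

    // Each request is {startMillis, endMillis}; its connection is busy for [start, end).
    // Returns the largest number of requests in flight at once, i.e. the minimum
    // number of connections a pool without multiplexing must hold open.
    static int peakConnections(long[][] requests) {
        long[] starts = new long[requests.length];
        long[] ends = new long[requests.length];
        for (int i = 0; i < requests.length; i++) {
            starts[i] = requests[i][0];
            ends[i] = requests[i][1];
        }
        Arrays.sort(starts);
        Arrays.sort(ends);

        int inFlight = 0;
        int peak = 0;
        int e = 0;
        for (long start : starts) {
            // Release every connection whose request finished by the time this one starts.
            while (e < ends.length && ends[e] <= start) {
                inFlight--;
                e++;
            }
            inFlight++;
            peak = Math.max(peak, inFlight);
        }
        return peak;
    }
}
```

For five requests spanning 0-10, 2-5, 3-8, 6-9 and 12-15 ms, the method returns 3: three requests overlap at the busiest moment, so three connections stay open, where a multiplexed RPC channel would need one.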

bq.For a high-fanout distributed search, a more important part might actually 
be message parsing (independent of transport used)

As Solr moves towards other applications such as MapReduce/Mahout, where the 
operations do not involve disk I/O and the payload is small, the per-request 
overhead can become a problem.

My tests with Hadoop RPC showed it outperforming Tomcat when I used a small 
payload (5 bytes).




[jira] Commented: (SOLR-1044) Use Hadoop RPC for inter Solr communication

2009-03-03 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678585#action_12678585
 ] 

Noble Paul commented on SOLR-1044:
--

bq.do not know much about Solr needs there, but we are using one of prehistoric 
versions of hadoop RPC (no NIO version)

Disclaimer: I am not a Hadoop user. Have you used the NIO version? How is 
the performance?






[jira] Commented: (SOLR-1044) Use Hadoop RPC for inter Solr communication

2009-03-03 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678587#action_12678587
 ] 

Yonik Seeley commented on SOLR-1044:


We're using persistent HTTP connections, so socket creation overhead should not 
be much of an issue.
As far as NIO - servlet containers often have NIO connectors (I guess so idle 
persistent connections don't take up a thread to listen on them).  That handles 
the receive-side.  On the sender-side, NIO shouldn't matter... all of our 
clients need a thread to keep the request context anyway - we really have no 
way of using NIO there.

There could be an issue surrounding the number of TCP connections in a large 
cluster (that's an orthogonal issue to NIO), but modern OSs seem to handle high 
numbers of connections efficiently. Do switches? Or perhaps the real limit 
has to do with exhausting port numbers (65536)?




[jira] Commented: (SOLR-1044) Use Hadoop RPC for inter Solr communication

2009-03-03 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678597#action_12678597
 ] 

Noble Paul commented on SOLR-1044:
--

bq.We're using persistent HTTP connections, so socket creation overhead should 
not be much of an issue.

An HTTP connection can be reused only after the request-response is complete. 
Meanwhile, if another request has to be fired at the same server from the 
same client, a new connection will have to be created. So the number of 
connections we create will be quite high if we have a large number of nodes in 
distributed search. 

I haven't yet seen an HTTP server serving more than around 1200 req/sec (Apache 
HTTPD). A call-based server can serve 4k-5k messages easily. (I am yet to test 
Hadoop RPC.) The proliferation of a large number of frameworks around that is a 
testimony to the superiority of that approach.




[jira] Commented: (SOLR-1044) Use Hadoop RPC for inter Solr communication

2009-03-03 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678601#action_12678601
 ] 

Walter Underwood commented on SOLR-1044:


During the Oscars, the HTTP cache in front of our Solr farm had a 90% hit rate. 
I think a 10X reduction in server load is a testimony to the superiority of the 
HTTP approach.
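
The arithmetic behind the 10X figure, written out with assumed symbols ($R$ for the request rate reaching the cache, $h$ for the cache hit rate):

```latex
\[
  \text{backend load} = (1 - h)\,R,
  \qquad
  \text{reduction factor} = \frac{R}{(1 - h)\,R} = \frac{1}{1 - h}.
\]
\[
  h = 0.9 \;\Longrightarrow\; \text{backend load} = 0.1\,R,
  \qquad
  \frac{1}{1 - 0.9} = 10.
\]
```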





[jira] Commented: (SOLR-1044) Use Hadoop RPC for inter Solr communication

2009-03-03 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678605#action_12678605
 ] 

Shalin Shekhar Mangar commented on SOLR-1044:
-

bq. During the Oscars, the HTTP cache in front of our Solr farm had a 90% hit 
rate. I think a 10X reduction in server load is a testimony to the superiority 
of the HTTP approach.

Nobody is replacing HTTP with RPC :)

HTTP is great, but in a distributed Solr deployment it can be a bottleneck, I 
guess. I think if we do find RPC giving better throughput than HTTP, the 
distributed search part is the right place to start using it. We do not need to 
move to non-HTTP communication elsewhere (at least not now).




[jira] Commented: (SOLR-1044) Use Hadoop RPC for inter Solr communication

2009-03-02 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678105#action_12678105
 ] 

Yonik Seeley commented on SOLR-1044:


Is our use of HTTP really a bottleneck?

My feeling has been that if we go to a call mechanism, it should be based on 
something more standard that will have many off the shelf bindings - perl, 
python, php, C, etc.

On the plus side of hadoop RPC, it could handle multiple requests per socket.  
That can also be a potential weakness though I think... a slow reader or writer 
for one request/response hangs up all the others.




[jira] Commented: (SOLR-1044) Use Hadoop RPC for inter Solr communication

2009-03-02 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678108#action_12678108
 ] 

Ken Krugler commented on SOLR-1044:
---

I agree with both of Yonik's points:

# We'd first want to measure real-world performance before deciding that using 
something other than HTTP was important.
# Using something other than HTTP has related costs that should be considered.

At Krugle we used Hadoop RPC to handle remote searchers. In general it worked 
well, but we did run into a problem similar to what Yonik voiced as a 
potential concern: occasionally a remote searcher would hang, and when that 
happened the socket would essentially become a zombie. Under very heavy load 
testing this wound up eventually causing the entire system to lock up.

We heard that subsequent changes to Hadoop RPC fixed a number of similar 
bugs. I'm not sure about the details, though, and we never re-ran tests with 
the latest Hadoop (at that time, which was about a year ago).

If there are performance issues, I would be curious if using a long-lasting 
connection via keep-alive significantly reduces the overhead. I know that Jetty 
(for example) has a very efficient implementation of the Comet web app model, 
where you don't wind up needing a gazillion threads to handle many 
requests/second.




[jira] Commented: (SOLR-1044) Use Hadoop RPC for inter Solr communication

2009-03-02 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678121#action_12678121
 ] 

Eks Dev commented on SOLR-1044:
---

I do not know much about Solr's needs there, but we are using one of the 
prehistoric versions of Hadoop RPC (the non-NIO version), as everything else 
proved to eat far too much time (in an 800+ req/sec environment every 
millisecond counts). Creating new Sockets does not work there, as OSs start 
having problems keeping up with this rate (especially with Java, where Socket 
release is slower due to gc() latency). 

We are anyhow contemplating giving Etch (or Thrift) a try. Etch looks like a 
really good piece of work, with great flexibility. Has anyone tried it? 




[jira] Commented: (SOLR-1044) Use Hadoop RPC for inter Solr communication

2009-03-02 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678242#action_12678242
 ] 

Noble Paul commented on SOLR-1044:
--

bq. Is our use of HTTP really a bottleneck? 
We are limited by the servlet engine's ability to serve requests. I guess it 
would easily peak out at 600-800 req/sec, whereas an NIO-based system can serve 
far more with lower latency (http://www.jboss.org/netty/performance.html). If 
a request is served out of cache (no Lucene search involved), the only 
overhead will be that of HTTP. Then there is the overhead of the servlet 
engine itself. Moreover, HTTP is not very efficient for a large volume of 
small requests.
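
To put a number on the small-request point, the sketch below compares the bytes on the wire for a 5-byte payload sent as a bare-minimum HTTP/1.1 POST versus a simple length-prefixed binary frame. Both shapes are assumptions for illustration: the host and path are made up, real HTTP clients add more headers than this, and the 4-byte length prefix is a generic framing scheme, not Hadoop RPC's actual wire format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class FramingOverhead {
    private FramingOverhead() {}

    // Smallest plausible HTTP/1.1 POST carrying the payload; host and path are made up.
    static byte[] httpRequest(byte[] payload) {
        String head = "POST /solr/select HTTP/1.1\r\n"
                + "Host: shard1:8983\r\n"
                + "Content-Length: " + payload.length + "\r\n"
                + "\r\n";
        byte[] headBytes = head.getBytes(StandardCharsets.US_ASCII);
        return ByteBuffer.allocate(headBytes.length + payload.length)
                .put(headBytes)
                .put(payload)
                .array();
    }

    // Hypothetical binary frame: 4-byte big-endian length, then the payload.
    static byte[] lengthPrefixedFrame(byte[] payload) {
        return ByteBuffer.allocate(4 + payload.length)
                .putInt(payload.length)
                .put(payload)
                .array();
    }
}
```

With the payload `hello`, the HTTP request comes to 73 bytes and the framed message to 9, so even this stripped-down request spends 68 bytes on framing for 5 bytes of data. The gap shrinks quickly as payloads grow, which is why it only matters for high volumes of very small messages.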

bq.My feeling has been that if we go to a call mechanism, it should be based on 
something more standard that will have many off the shelf bindings - perl, 
python, php, C, etc.

I agree. Hadoop RPC looked like a simple RPC mechanism.

bq. That can also be a potential weakness though I think... a slow reader or 
writer for one request/response hangs up all the others.

The requests on the server are served by multiple handlers (each one is a 
thread). One request will not block another if there are enough 
handlers/threads. 
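
The "enough handlers" condition can be sketched with a small deterministic model. It assumes all requests arrive together and each goes to whichever handler frees up first; `HandlerPool` is a made-up name, and the model covers only server-side handler threads, not reads and writes on the shared socket.

```java
import java.util.PriorityQueue;

final class HandlerPool {
    private HandlerPool() {}

    // All requests arrive at t=0 in the given order; request i needs serviceMillis[i].
    // Each request is handed to whichever handler becomes free first.
    // Returns the time at which each request completes.
    static long[] completionTimes(long[] serviceMillis, int handlers) {
        if (handlers < 1) {
            throw new IllegalArgumentException("need at least one handler");
        }
        // Min-heap of the times at which each handler next becomes free.
        PriorityQueue<Long> freeAt = new PriorityQueue<>();
        for (int h = 0; h < handlers; h++) {
            freeAt.add(0L);
        }
        long[] done = new long[serviceMillis.length];
        for (int i = 0; i < serviceMillis.length; i++) {
            long start = freeAt.poll();
            done[i] = start + serviceMillis[i];
            freeAt.add(done[i]);
        }
        return done;
    }
}
```

For service times of 1000, 10, 10 and 10 ms, one handler yields completions at 1000, 1010, 1020 and 1030 ms, so the slow request holds up everything behind it. Two handlers yield 1000, 10, 20 and 30 ms, and four handlers yield 1000, 10, 10 and 10 ms.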

