Re: HTTP or RMI, Jini, JavaSpaces for distributed search
Yonik Seeley wrote: Even if we go HTTP, I'm not sure it will be async at first - does HTTPClient even support async? yes. The HTTPCore (lower level component of HTTPClient) has the implementation for non-blocking IO. Pl checkout http://jakarta.apache.org/httpcomponents/httpcomponents-core/httpcore-nio/index.html If using jdk HTTPConnection, the connection handling is done by jdk. System properties *http.keepAlive* and *http.maxConnections* can be used to control connections. Refer http://java.sun.com/j2se/1.5.0/docs/guide/net/properties.html -Sharad
RE: HTTP or RMI, Jini, JavaSpaces for distributed search
Hello! What about JMX Remoting (JSR-160)? See http://mx4j.sourceforge.net/docs/ch03.html for an implementation with compatible licenses. It is used by Apache Geronimo as well. The advantage from my perspective: all the communication hassle is handled behind the scenes with configurable transports (RMI, HTTP, ...) and it's a standard. It even supports callbacks from the server to the registered client! This is great for asynchronous request handling. Disadvantage: adds a dependency to an extra lib. About Jini/Javaspaces: To see an implementation with that would be interesting. ;-) Just my .02 Thomas -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Friday, September 21, 2007 8:08 PM To: solr-dev@lucene.apache.org Subject: HTTP or RMI, Jini, JavaSpaces for distributed search I wanted to take a step back for a second and think about if HTTP was really the right choice for the transport for distributed search. I think the high-level approach in SOLR-303 is the right way to go about it, but I'm unsure if HTTP is the right transport. Pro HTTP: - using HTTP allows one to use an http load-balancer to distribute load across multiple copies of the same shard by assigning a VIP (virtual IP) to each shard. - because you do pretty much everything by hand, you know that there isn't some hidden limitation that will jump out and bite you later. Cons HTTP: - you end up doing everything by hand... connection handling, request serialization, response parsing, etc... - goes through normal servlet channels... every sub-request will be logged to the access logs, slowing things down. - more network bandwidth used unless we come up with a new BinaryResponseWriter and Parser Currently, SOLR-303 uses and parses the XML response format, which has some serious downsides: - response size limits scalability and how deep in responses you can go... If you want to retrieve documents 5000 through 5009, even though the user only requested 10 documents, the top-level searcher needs to get the top 5009 documents from *each* shard... and that can quickly exhaust the network bandwidth of the NIC. XML parsing on the order of nShards*5009 entries won't be any picnic either. I'm thinking the load-balancing of HTTP is overrated also, because it's inflexible. Adding another shard requires adding another VIP in the load-balancer, and changing which servers have which shards or adding new copies of a shard also requires load-balancer configuration. Everything points to Solr being able to do the load-balancing itself in the future, and there wouldn't seem to be much benefit to using a load-balancer w/ VIPS for each shard vs having Solr do it. So even if we stuck with HTTP, Solr would need - a binary protocol to minimize network bandwidth use - load balancing across shard copies itself Given that, would it make sense to just go with RMI instead? And perhaps leverage some other higher level services (Jini? JavaSpaces?) I'd like to hear from people with more experience with RMI friends, and what the potential downsides are to using these technologies. -Yonik
Re: HTTP or RMI, Jini, JavaSpaces for distributed search
Yonik, why not using hadoop's IPC/RPC? We use some of its (pre)historical versions with some small mods and it works like a charm. Nutch uses it as well. I would say, pretty clean and easy to use it. We started with RMI, but we gave it up due to huge latency in range of 50ms per hop. For our use case (well under 100ms response times) far too much. We spent a few weeks fighting with RMI, and I am not sure if it was us making it not working for us. maybe an option is to have a peek at the Apache Mina project, but you will have to figure out your serialization in that case (hmm, if you think speed, you will have to do it anyhow). - Original Message From: Yonik Seeley [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Friday, 21 September, 2007 8:08:05 PM Subject: HTTP or RMI, Jini, JavaSpaces for distributed search I wanted to take a step back for a second and think about if HTTP was really the right choice for the transport for distributed search. I think the high-level approach in SOLR-303 is the right way to go about it, but I'm unsure if HTTP is the right transport. Pro HTTP: - using HTTP allows one to use an http load-balancer to distribute load across multiple copies of the same shard by assigning a VIP (virtual IP) to each shard. - because you do pretty much everything by hand, you know that there isn't some hidden limitation that will jump out and bite you later. Cons HTTP: - you end up doing everything by hand... connection handling, request serialization, response parsing, etc... - goes through normal servlet channels... every sub-request will be logged to the access logs, slowing things down. - more network bandwidth used unless we come up with a new BinaryResponseWriter and Parser Currently, SOLR-303 uses and parses the XML response format, which has some serious downsides: - response size limits scalability and how deep in responses you can go... If you want to retrieve documents 5000 through 5009, even though the user only requested 10 documents, the top-level searcher needs to get the top 5009 documents from *each* shard... and that can quickly exhaust the network bandwidth of the NIC. XML parsing on the order of nShards*5009 entries won't be any picnic either. I'm thinking the load-balancing of HTTP is overrated also, because it's inflexible. Adding another shard requires adding another VIP in the load-balancer, and changing which servers have which shards or adding new copies of a shard also requires load-balancer configuration.. Everything points to Solr being able to do the load-balancing itself in the future, and there wouldn't seem to be much benefit to using a load-balancer w/ VIPS for each shard vs having Solr do it. So even if we stuck with HTTP, Solr would need - a binary protocol to minimize network bandwidth use - load balancing across shard copies itself Given that, would it make sense to just go with RMI instead? And perhaps leverage some other higher level services (Jini? JavaSpaces?) I'd like to hear from people with more experience with RMI friends, and what the potential downsides are to using these technologies. -Yonik ___ Yahoo! Answers - Got a question? Someone out there knows the answer. Try it now. http://uk.answers.yahoo.com/
Re: HTTP or RMI, Jini, JavaSpaces for distributed search
On 9/21/07, eks dev [EMAIL PROTECTED] wrote: why not using hadoop's IPC/RPC? It looks pretty close to RMI to me. Right now, one can put Maps, Lists, Integers, arrays, etc, in a response and have it serialized how they want (JSON, XML, etc). If we went the HadoopRPC way, we would need to make everything we used implement Writeable AFAIK. Regardless of what ends up happening, I think the communication/RPC mechanism should be at the same level it is now for requests - that should keep things loosely coupled and make it easy to add/change communication mechanisms in the future if desired. -Yonik
Re: HTTP or RMI, Jini, JavaSpaces for distributed search
Please don't switch to RMI. We've spent the past year converting our entire middle tier from RMI to HTTP. We are so glad that we no longer have any RMI servers. The big advantage of HTTP is that there are hundreds, maybe thousands, of engineers working on making it fast, on tools for it, on caches, etc. If you really need more compact responses, I would recommend coding the JSON output in Python marshal format. That is compact, fast, and easy to parse. We used that for a Java client in Ultraseek. wunder On 9/21/07 11:08 AM, Yonik Seeley [EMAIL PROTECTED] wrote: I wanted to take a step back for a second and think about if HTTP was really the right choice for the transport for distributed search. I think the high-level approach in SOLR-303 is the right way to go about it, but I'm unsure if HTTP is the right transport. Pro HTTP: - using HTTP allows one to use an http load-balancer to distribute load across multiple copies of the same shard by assigning a VIP (virtual IP) to each shard. - because you do pretty much everything by hand, you know that there isn't some hidden limitation that will jump out and bite you later. Cons HTTP: - you end up doing everything by hand... connection handling, request serialization, response parsing, etc... - goes through normal servlet channels... every sub-request will be logged to the access logs, slowing things down. - more network bandwidth used unless we come up with a new BinaryResponseWriter and Parser Currently, SOLR-303 uses and parses the XML response format, which has some serious downsides: - response size limits scalability and how deep in responses you can go... If you want to retrieve documents 5000 through 5009, even though the user only requested 10 documents, the top-level searcher needs to get the top 5009 documents from *each* shard... and that can quickly exhaust the network bandwidth of the NIC. XML parsing on the order of nShards*5009 entries won't be any picnic either. I'm thinking the load-balancing of HTTP is overrated also, because it's inflexible. Adding another shard requires adding another VIP in the load-balancer, and changing which servers have which shards or adding new copies of a shard also requires load-balancer configuration. Everything points to Solr being able to do the load-balancing itself in the future, and there wouldn't seem to be much benefit to using a load-balancer w/ VIPS for each shard vs having Solr do it. So even if we stuck with HTTP, Solr would need - a binary protocol to minimize network bandwidth use - load balancing across shard copies itself Given that, would it make sense to just go with RMI instead? And perhaps leverage some other higher level services (Jini? JavaSpaces?) I'd like to hear from people with more experience with RMI friends, and what the potential downsides are to using these technologies. -Yonik
Re: HTTP or RMI, Jini, JavaSpaces for distributed search
On 9/21/07, Walter Underwood [EMAIL PROTECTED] wrote: Please don't switch to RMI. We've spent the past year converting our entire middle tier from RMI to HTTP. We are so glad that we no longer have any RMI servers. Just to be clear for everyone, this wouldn't be a front-end change... HTTP load balancer over top-level searches would still be the normal way to do HA / query-load scaling. This is more about traffic between Solr servers themselves for distributed search (something that doesn't even exist yet). -Yonik
Re: HTTP or RMI, Jini, JavaSpaces for distributed search
just small a notice regarding load balancing, we have used for long time one simple scheme where all shards were identical. With MMAP Directory and our load balancer had rather simple clustering logic to direct queries to preffered Index Searcher Units (when available, when not to the first free neighbor), this simple clustering helped OS to cache Index much much better ... super simple setup (identical shards), but big win ... the point I am trying to make here; with http load balancer something like this is rather difficult (I am not too familiar with http load balancers, but...) - Original Message From: Yonik Seeley [EMAIL PROTECTED] To: solr-dev@lucene.apache.org Sent: Friday, 21 September, 2007 8:08:05 PM Subject: HTTP or RMI, Jini, JavaSpaces for distributed search I wanted to take a step back for a second and think about if HTTP was really the right choice for the transport for distributed search. I think the high-level approach in SOLR-303 is the right way to go about it, but I'm unsure if HTTP is the right transport. Pro HTTP: - using HTTP allows one to use an http load-balancer to distribute load across multiple copies of the same shard by assigning a VIP (virtual IP) to each shard. - because you do pretty much everything by hand, you know that there isn't some hidden limitation that will jump out and bite you later. Cons HTTP: - you end up doing everything by hand... connection handling, request serialization, response parsing, etc... - goes through normal servlet channels... every sub-request will be logged to the access logs, slowing things down. - more network bandwidth used unless we come up with a new BinaryResponseWriter and Parser Currently, SOLR-303 uses and parses the XML response format, which has some serious downsides: - response size limits scalability and how deep in responses you can go... If you want to retrieve documents 5000 through 5009, even though the user only requested 10 documents, the top-level searcher needs to get the top 5009 documents from *each* shard... and that can quickly exhaust the network bandwidth of the NIC. XML parsing on the order of nShards*5009 entries won't be any picnic either. I'm thinking the load-balancing of HTTP is overrated also, because it's inflexible. Adding another shard requires adding another VIP in the load-balancer, and changing which servers have which shards or adding new copies of a shard also requires load-balancer configuration. Everything points to Solr being able to do the load-balancing itself in the future, and there wouldn't seem to be much benefit to using a load-balancer w/ VIPS for each shard vs having Solr do it. So even if we stuck with HTTP, Solr would need - a binary protocol to minimize network bandwidth use - load balancing across shard copies itself Given that, would it make sense to just go with RMI instead? And perhaps leverage some other higher level services (Jini? JavaSpaces?) I'd like to hear from people with more experience with RMI friends, and what the potential downsides are to using these technologies. -Yonik ___ Want ideas for reducing your carbon footprint? Visit Yahoo! For Good http://uk.promotions.yahoo.com/forgood/environment.html
Re: HTTP or RMI, Jini, JavaSpaces for distributed search
There is an additional limitation to the current HTTP limitation, in that it does a boolean OR of all the uniq keys it needs to fetch on remote shards Once you query for ~100 or so documents, the URL can grow to 1 characters long, and Java refuses to open it. To make it work we'd need to be able to POST to Solr for results, which I don't think is possible yet... I completely agree that using some kind of remote procedure call for queries to other shards makes sense. Perhaps we should attempt to merge the RMI backend of SOLR-255 with the high level approach taken in SOLR-303? Or do you have more experience with other brands of RPC? Thanks a lot, Stu -Original Message- From: Yonik Seeley [EMAIL PROTECTED] Sent: Friday, September 21, 2007 2:53pm To: [EMAIL PROTECTED] Subject: Re: HTTP or RMI, Jini, JavaSpaces for distributed search On 9/21/07, Walter Underwood wrote: Please don't switch to RMI. We've spent the past year converting our entire middle tier from RMI to HTTP. We are so glad that we no longer have any RMI servers. Just to be clear for everyone, this wouldn't be a front-end change... HTTP load balancer over top-level searches would still be the normal way to do HA / query-load scaling. This is more about traffic between Solr servers themselves for distributed search (something that doesn't even exist yet). -Yonik
Re: HTTP or RMI, Jini, JavaSpaces for distributed search
On 21-Sep-07, at 11:08 AM, Yonik Seeley wrote: I wanted to take a step back for a second and think about if HTTP was really the right choice for the transport for distributed search. I think the high-level approach in SOLR-303 is the right way to go about it, but I'm unsure if HTTP is the right transport. I don't know anything about RMI, but is it possible to do 100's of simultaneous asynchronous requests cheaply? (It is with http, as you can use a select() loop on the sockets). FWIW, our distributed search uses http over 120+ shards... and is written in python. We will likely end up using a java solution eventually, though--despite great optimization efforts, python is having trouble keeping up. -Mike
Re: HTTP or RMI, Jini, JavaSpaces for distributed search
On 9/21/07, Stu Hood [EMAIL PROTECTED] wrote: Perhaps we should attempt to merge the RMI backend of SOLR-255 with the high level approach taken in SOLR-303? Or do you have more experience with other brands of RPC? I think any approach is best done at that high level... so an RMI interface would share some of the same logic with the dispatch filter. It's the high-level coarsed grained approach that will get us good performance... trying to pretend that remote objects are local and executing the same logic one would in a single server isn't a good idea. I have very little RMI experience... which is why I initially thought of HTTP too, but that's not the best excuse :-) -Yonik
Re: HTTP or RMI, Jini, JavaSpaces for distributed search
why not taking a look at hadoop's IPC/RPC ? it is small, simple and elegant with no latency like on RMI (we could not do better than 30-50ms per hop), Nutch uses it ___ Yahoo! Answers - Got a question? Someone out there knows the answer. Try it now. http://uk.answers.yahoo.com/
Re: HTTP or RMI, Jini, JavaSpaces for distributed search
On 9/21/07, Mike Klaas [EMAIL PROTECTED] wrote: On 21-Sep-07, at 11:08 AM, Yonik Seeley wrote: I wanted to take a step back for a second and think about if HTTP was really the right choice for the transport for distributed search. I think the high-level approach in SOLR-303 is the right way to go about it, but I'm unsure if HTTP is the right transport. I don't know anything about RMI, but is it possible to do 100's of simultaneous asynchronous requests cheaply? Good question... probably only important for really big clusters (like yours), but it would be nice. Even if we go HTTP, I'm not sure it will be async at first - does HTTPClient even support async? I assume when you say async that you mean getting rid of the thread-per-connection via NIO. Some protocols do async by handing off the request to another thread to wait on the response and then do a callback to the original thread - this is async with respect to the original calling thread, but still requires a thread-per-connection. Of course HTTP has some issues too - you effectively need a separate connection per outstanding request. Pipelining won't work well because things need to come back in-order. I'm not sure if RMI has this limitation as well. FWIW, our distributed search uses http over 120+ shards... and is written in python. That would be an awesome test case if you were able to use what Solr is going to provide out-of-the-box. Any unusual requirements? -Yonik
Re: HTTP or RMI, Jini, JavaSpaces for distributed search
Oddly enough, the system we are integrating Solr into already uses Hadoop on every node. It is definitely worth checking out their IPC/RPC. I wonder how easy it is to use Hadoop RPC code without running any of the porcelain around it...? Thanks, Stu -Original Message- From: eks dev [EMAIL PROTECTED] Sent: Friday, September 21, 2007 5:32pm To: solr-dev@lucene.apache.org, [EMAIL PROTECTED] Subject: Re: HTTP or RMI, Jini, JavaSpaces for distributed search why not taking a look at hadoop's IPC/RPC ? it is small, simple and elegant with no latency like on RMI (we could not do better than 30-50ms per hop), Nutch uses it ___ Yahoo! Answers - Got a question? Someone out there knows the answer. Try it now. http://uk.answers.yahoo.com/
Re: HTTP or RMI, Jini, JavaSpaces for distributed search
On 9/21/07, eks dev [EMAIL PROTECTED] wrote: why not taking a look at hadoop's IPC/RPC ? it is small, simple and elegant with no latency like on RMI (we could not do better than 30-50ms per hop) Interesting... I wonder if that 30-50ms is due to Naggel (which always seemed to cause a 40ms delay w/ python's http lib on Linux when doing a POST since the headers were sent separately from the body). I've seen reports that RMI was faster: http://mail-archives.apache.org/mod_mbox/lucene-solr-dev/200609.mbox/[EMAIL PROTECTED] -Yonik
Re: HTTP or RMI, Jini, JavaSpaces for distributed search
On 21-Sep-07, at 2:34 PM, Yonik Seeley wrote: On 9/21/07, Mike Klaas [EMAIL PROTECTED] wrote: On 21-Sep-07, at 11:08 AM, Yonik Seeley wrote: I wanted to take a step back for a second and think about if HTTP was really the right choice for the transport for distributed search. I think the high-level approach in SOLR-303 is the right way to go about it, but I'm unsure if HTTP is the right transport. I don't know anything about RMI, but is it possible to do 100's of simultaneous asynchronous requests cheaply? Good question... probably only important for really big clusters (like yours), but it would be nice. Even if we go HTTP, I'm not sure it will be async at first - does HTTPClient even support async? I don't think so. In fact, I need to make a small admendment to my original claim: the distribution code actually uses our internal rpc (which is pure python), but the other end is a python client that connects with solr via http (persistent, localhost connection). I wrote it this way because it was easier, as our internal rpc library already has functionality for spitting out requests to 100's of clients and collecting the results asynchronously. I figured that directly connecting to Solr via http would be cheaper, but perhaps it wouldn't be. Both the rpc and http levels use connection-pooled persistent connections. I assume when you say async that you mean getting rid of the thread-per-connection via NIO. Some protocols do async by handing off the request to another thread to wait on the response and then do a callback to the original thread - this is async with respect to the original calling thread, but still requires a thread-per-connection. Right; this helps but doesn't scale too far. Of course HTTP has some issues too - you effectively need a separate connection per outstanding request. Pipelining won't work well because things need to come back in-order. I'm not sure if RMI has this limitation as well. FWIW, our distributed search uses http over 120+ shards... and is written in python. That would be an awesome test case if you were able to use what Solr is going to provide out-of-the-box. Any unusual requirements? The biggest point of customization is that we run two Solrs in a single webapp, one for querying and one for highlighting. The highlighter Solr uses a set of custom parameters to determine the docs to use (I imagine the current patch does something like this as well). Splitting the content from the rest of the stored fields is a huge win. There is also lots of custom deduplication and caching logic, but this could be done as a post-processing step. In case anything is thinking of building something this huge, I'll mention that it is a bad idea to try to have a single point try to manage so many shards. It is preferable to go hierarchical (could be accomplished relatively easily if the query distributor could easily query other query distributor nodes). -Mike
Re: HTTP or RMI, Jini, JavaSpaces for distributed search
I wonder how easy it is to use Hadoop RPC code without running any of the porcelain around it...? very easy! ___ Want ideas for reducing your carbon footprint? Visit Yahoo! For Good http://uk.promotions.yahoo.com/forgood/environment.html
Re: HTTP or RMI, Jini, JavaSpaces for distributed search
Interesting... I wonder if that 30-50ms is due to Naggel (which always seemed to cause a 40ms delay w/ python's http lib on Linux when doing a POST since the headers were sent separately from the body). it's been a while since we played with it, a year or two ago, maybe something changed in meantime or we did it wrong than. But I am sure we tried to disable Naggel optimization tcp_no_delay... anyhow, hadoop's code did the trick, I guess it is even better now ___ Yahoo! Answers - Got a question? Someone out there knows the answer. Try it now. http://uk.answers.yahoo.com/