Just a small note regarding load balancing: for a long time we used one simple scheme where all shards were identical. With MMapDirectory, our "load balancer" had rather simple "clustering" logic to direct queries to preferred Index Searcher Units (when available; when not, to the first free neighbor). This simple clustering helped the OS cache the index much, much better... a super simple setup (identical shards), but a big win.
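A minimal sketch of the "preferred unit, else first free neighbor" routing described above. The function name `pick_searcher`, the `is_free` callback, and the CRC32 hash of the query string as the routing key are all assumptions for illustration; the actual key used in our setup is not spelled out here.

```python
import zlib


def pick_searcher(query, units, is_free):
    """Pick a searcher unit for a query (hypothetical sketch).

    units:   ordered list of searcher unit names (identical shards)
    is_free: callable taking a unit name and returning True if it can serve now
    """
    if not units:
        raise ValueError("no searcher units configured")

    # Assumption: a stable hash of the query decides the preferred unit, so
    # the same query keeps hitting the same unit and its OS page cache.
    preferred = zlib.crc32(query.encode("utf-8")) % len(units)

    # Try the preferred unit first, then walk its neighbors in order.
    for offset in range(len(units)):
        candidate = units[(preferred + offset) % len(units)]
        if is_free(candidate):
            return candidate

    # Every unit is busy; the caller decides whether to queue or reject.
    return None
```

Because the preferred unit depends only on the query, repeated queries tend to touch the same index pages on the same machine, which is where the caching win comes from.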
The point I am trying to make here: with an HTTP load balancer, something like this is rather difficult (I am not too familiar with HTTP load balancers, but...).

----- Original Message ----
From: Yonik Seeley <[EMAIL PROTECTED]>
To: solr-dev@lucene.apache.org
Sent: Friday, 21 September, 2007 8:08:05 PM
Subject: HTTP or RMI, Jini, JavaSpaces for distributed search

I wanted to take a step back for a second and think about if HTTP was really the right choice for the transport for distributed search. I think the high-level approach in SOLR-303 is the right way to go about it, but I'm unsure if HTTP is the right transport.

Pro HTTP:
- using HTTP allows one to use an http load-balancer to distribute load across multiple copies of the same shard by assigning a VIP (virtual IP) to each shard.
- because you do pretty much everything by hand, you know that there isn't some hidden limitation that will jump out and bite you later.

Cons HTTP:
- you end up doing everything by hand... connection handling, request serialization, response parsing, etc...
- goes through normal servlet channels... every sub-request will be logged to the access logs, slowing things down.
- more network bandwidth used unless we come up with a new BinaryResponseWriter and Parser

Currently, SOLR-303 uses and parses the XML response format, which has some serious downsides:
- response size limits scalability and how deep in responses you can go... If you want to retrieve documents 5000 through 5009, even though the user only requested 10 documents, the top-level searcher needs to get the top 5009 documents from *each* shard... and that can quickly exhaust the network bandwidth of the NIC. XML parsing on the order of nShards*5009 entries won't be any picnic either.

I'm thinking the load-balancing of HTTP is overrated also, because it's inflexible.
Adding another shard requires adding another VIP in the load-balancer, and changing which servers have which shards or adding new copies of a shard also requires load-balancer configuration. Everything points to Solr being able to do the load-balancing itself in the future, and there wouldn't seem to be much benefit to using a load-balancer w/ VIPS for each shard vs having Solr do it.

So even if we stuck with HTTP, Solr would need
- a binary protocol to minimize network bandwidth use
- load balancing across shard copies itself

Given that, would it make sense to just go with RMI instead? And perhaps leverage some other higher level services (Jini? JavaSpaces?)

I'd like to hear from people with more experience with RMI & friends, and what the potential downsides are to using these technologies.

-Yonik
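To put a number on the deep-paging cost raised in the quoted message (documents 5000 through 5009 requiring the top 5009 from each shard), here is a small sketch of the top-level merge. The function names and the `(score, doc_id)` tuple layout are hypothetical and are not taken from SOLR-303; ranks are treated as 1-based and inclusive, matching the example in the message.

```python
import heapq
from itertools import islice


def entries_to_merge(n_shards, last_rank):
    """Entries the top-level searcher must receive and parse.

    Each shard has to return its own top `last_rank` hits, because any
    one of them could end up inside the requested window.
    """
    return n_shards * last_rank


def merge_page(shard_results, first_rank, last_rank):
    """Merge per-shard hit lists and return ranks first_rank..last_rank.

    shard_results: one list per shard of (score, doc_id) tuples, each
    already sorted by descending score (an assumption of this sketch).
    """
    # heapq.merge expects ascending inputs, so order by negated score.
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[0])
    return list(islice(merged, first_rank - 1, last_rank))
```

For instance, `entries_to_merge(4, 5009)` gives the number of entries a four-shard setup would ship and parse just to return ten documents, which is the nShards*5009 figure from the message with nShards set to 4.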