Re: HTTP or RMI, Jini, JavaSpaces for distributed search

eks dev Fri, 21 Sep 2007 11:41:44 -0700

Just a small note regarding load balancing: for a long time we used one
simple scheme where all shards were identical. We used MMapDirectory, and our
"load balancer" had rather simple "clustering" logic that directed queries to
preferred Index Searcher Units (when available; when not, to the first free
neighbor). This simple clustering helped the OS cache the index much, much
better. A super simple setup (identical shards), but a big win.

The point I am trying to make here: with an HTTP load balancer, something like
this is rather difficult (I am not too familiar with HTTP load balancers, but...)
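
For illustration, a minimal sketch of the "preferred unit, else first free
neighbor" routing described above. The message does not say how a preferred
unit was chosen, so hashing the query string is purely an assumption here, and
PreferredUnitBalancer, SearcherUnit, and isFree() are hypothetical names, not
part of Lucene or Solr.

```java
import java.util.List;

/** Hypothetical sketch: route a query to its preferred unit, else the first free neighbor. */
final class PreferredUnitBalancer {

    /** Hypothetical view of one Index Searcher Unit; every unit holds an identical shard. */
    interface SearcherUnit {
        boolean isFree();
    }

    private final List<? extends SearcherUnit> units;

    PreferredUnitBalancer(List<? extends SearcherUnit> units) {
        if (units.isEmpty()) {
            throw new IllegalArgumentException("need at least one searcher unit");
        }
        this.units = units;
    }

    /**
     * Assumption: the same query text always maps to the same preferred unit,
     * so that unit's OS page cache stays warm for the index regions it touches.
     */
    int preferredIndex(String query) {
        return Math.floorMod(query.hashCode(), units.size());
    }

    /** Returns the index of the unit to use, or -1 if every unit is busy. */
    int pick(String query) {
        int preferred = preferredIndex(query);
        // Start at the preferred unit, then walk its neighbors in order.
        for (int step = 0; step < units.size(); step++) {
            int candidate = (preferred + step) % units.size();
            if (units.get(candidate).isFree()) {
                return candidate;
            }
        }
        return -1;
    }
}
```

Because all shards are identical, any unit can answer any query; the preference
only affects which machine's cache is likely to be warm.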

----- Original Message ----
From: Yonik Seeley <[EMAIL PROTECTED]>
To: solr-dev@lucene.apache.org
Sent: Friday, 21 September, 2007 8:08:05 PM
Subject: HTTP or RMI, Jini, JavaSpaces for distributed search

I wanted to take a step back for a second and think about whether HTTP is
really the right choice for the transport for distributed search.

I think the high-level approach in SOLR-303 is the right way to go
about it, but I'm unsure if HTTP is the right transport.

Pro HTTP:
 - using HTTP allows one to use an HTTP load-balancer to distribute
   load across multiple copies of the same shard by assigning a VIP
   (virtual IP) to each shard.
 - because you do pretty much everything by hand, you know that there
   isn't some hidden limitation that will jump out and bite you later.

Cons HTTP:
 - you end up doing everything by hand... connection handling, request
   serialization, response parsing, etc...
 - goes through normal servlet channels... every sub-request will be
   logged to the access logs, slowing things down.
 - more network bandwidth used unless we come up with a new
   BinaryResponseWriter and Parser

Currently, SOLR-303 uses and parses the XML response format, which has
some serious downsides:
- response size limits scalability and how deep in responses you can go...
  If you want to retrieve documents 5000 through 5009, even though the
user only requested 10 documents, the top-level searcher needs to get
the top 5009 documents from *each* shard... and that can quickly
exhaust the network bandwidth of the NIC.  XML parsing on the order of
nShards*5009 entries won't be any picnic either.
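
To make the cost concrete, here is a minimal sketch of the merge step, not the
SOLR-303 code. DeepPagingMerge, Hit, perShardFetch, and mergePage are
hypothetical names, and the sketch assumes each shard returns its own top
(start + rows) hits already scored on a comparable scale.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of merging per-shard top lists for a deep page. */
final class DeepPagingMerge {

    /** Hypothetical minimal hit: a document id and its score. */
    static final class Hit {
        final String id;
        final float score;

        Hit(String id, float score) {
            this.id = id;
            this.score = score;
        }
    }

    /**
     * Entries each shard must return to serve the page [start, start + rows).
     * With a 0-based start of 4999 and rows of 10 (documents 5000 through 5009),
     * that is 5009 entries per shard, so nShards * 5009 entries in total.
     */
    static int perShardFetch(int start, int rows) {
        return start + rows;
    }

    /** Merges the per-shard lists by descending score and cuts out the requested page. */
    static List<Hit> mergePage(List<List<Hit>> shardTopLists, int start, int rows) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> shard : shardTopLists) {
            all.addAll(shard);
        }
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        int from = Math.min(start, all.size());
        int to = Math.min(start + rows, all.size());
        return new ArrayList<>(all.subList(from, to));
    }
}
```

The coordinator cannot know in advance which shard holds the hits for a deep
page, which is why every shard has to ship its full top (start + rows) list
even though only `rows` documents are finally returned.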

I'm thinking the load-balancing of HTTP is overrated also, because
it's inflexible.  Adding another shard requires adding another VIP in
the load-balancer, and changing which servers have which shards or
adding new copies of a shard also requires load-balancer
configuration.  Everything points to Solr being able to do the
load-balancing itself in the future, and there wouldn't seem to be
much benefit to using a load-balancer with VIPs for each shard vs. having
Solr do it.

So even if we stuck with HTTP, Solr would need
 - a binary protocol to minimize network bandwidth use
 - load balancing across shard copies itself

Given that, would it make sense to just go with RMI instead?
And perhaps leverage some other higher level services (Jini? JavaSpaces?)

I'd like to hear from people with more experience with RMI & friends,
and what the potential downsides are to using these technologies.

-Yonik

