Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-26 Thread Sharad Agarwal

Yonik Seeley wrote:

> Even if we go HTTP, I'm not sure it will be async at first - does
> HTTPClient even support async?
  
Yes. HttpCore (the lower-level component of HttpClient) has an
implementation of non-blocking I/O. Please check out

http://jakarta.apache.org/httpcomponents/httpcomponents-core/httpcore-nio/index.html

If you use the JDK's HttpURLConnection, connection handling is done by the JDK.
The system properties *http.keepAlive* and *http.maxConnections* can be used
to control connections. Refer to
http://java.sun.com/j2se/1.5.0/docs/guide/net/properties.html
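
For the JDK route, a minimal sketch of how those two properties might be set. Only the property names come from the page linked above; the class name, the timeouts and the value passed in are illustrative assumptions, and the JDK typically reads the properties when its HTTP classes are first used, so they should be set early.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper; only the two property names come from the JDK docs.
final class ShardHttpSettings {

    // Call before the first HTTP request is made in this JVM.
    static void configureKeepAlive(int maxIdleConnectionsPerHost) {
        // Reuse connections instead of opening one per request.
        System.setProperty("http.keepAlive", "true");
        // Upper bound on idle keep-alive connections kept per destination.
        System.setProperty("http.maxConnections",
                Integer.toString(maxIdleConnectionsPerHost));
    }

    // Prepares (but does not yet send) a GET against a shard URL.
    static HttpURLConnection open(String shardUrl) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(shardUrl).openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(2000);   // assumed timeout, in milliseconds
        conn.setReadTimeout(10000);     // assumed timeout, in milliseconds
        return conn;
    }
}
```

Calling `ShardHttpSettings.configureKeepAlive(20)` once at startup would be enough; the connection cache itself stays inside the JDK.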


-Sharad




RE: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-22 Thread Peuss, Thomas
Hello!

What about JMX Remoting (JSR-160)? See
http://mx4j.sourceforge.net/docs/ch03.html for an implementation with
compatible licenses. It is used by Apache Geronimo as well.

The advantage from my perspective: all the communication hassle is
handled behind the scenes with configurable transports (RMI, HTTP, ...)
and it's a standard. It even supports callbacks from the server to the
registered client! This is great for asynchronous request handling.

Disadvantage: adds a dependency on an extra lib.

About Jini/JavaSpaces: It would be interesting to see an implementation
with that. ;-)

Just my .02
Thomas

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
Seeley
Sent: Friday, September 21, 2007 8:08 PM
To: solr-dev@lucene.apache.org
Subject: HTTP or RMI, Jini, JavaSpaces for distributed search

I wanted to take a step back for a second and think about if HTTP was
really the right choice for the transport for distributed search.

I think the high-level approach in SOLR-303 is the right way to go about
it, but I'm unsure if HTTP is the right transport.

Pro HTTP:
  - using HTTP allows one to use an http load-balancer to distribute
load across multiple copies of the same shard by assigning a VIP
(virtual IP) to each shard.
  - because you do pretty much everything by hand, you know that there
isn't some hidden limitation that will jump out and bite you later.

Cons HTTP:
 - you end up doing everything by hand... connection handling, request
serialization, response parsing, etc...
 - goes through normal servlet channels... every sub-request will be
logged to the access logs, slowing things down.
- more network bandwidth used unless we come up with a new
BinaryResponseWriter and Parser

Currently, SOLR-303 uses and parses the XML response format, which has
some serious downsides:
- response size limits scalability and how deep in responses you can
go...
  If you want to retrieve documents 5000 through 5009, even though the
user only requested 10 documents, the top-level searcher needs to get
the top 5009 documents from *each* shard... and that can quickly exhaust
the network bandwidth of the NIC.  XML parsing on the order of
nShards*5009 entries won't be any picnic either.
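
To put numbers on the deep-paging cost described above, here is a small arithmetic sketch. The class and method names are made up for illustration, and the shard count and bytes-per-entry figures in the usage note are assumptions, not measurements.

```java
// Hypothetical helper for estimating the cost of deep paging across shards.
final class DeepPagingCost {

    // Entries one shard must return so the top-level searcher can rank
    // results up to lastRank (1-based, inclusive): everything up to lastRank.
    static long entriesPerShard(long lastRank) {
        return lastRank;
    }

    // Entries the top-level searcher has to receive and parse in total.
    static long entriesToMerge(int nShards, long lastRank) {
        return nShards * entriesPerShard(lastRank);
    }

    // Rough payload size for an assumed average size per serialized entry.
    static long approxBytes(int nShards, long lastRank, long bytesPerEntry) {
        return entriesToMerge(nShards, lastRank) * bytesPerEntry;
    }
}
```

With an assumed 10 shards, `entriesToMerge(10, 5009)` is 50090 entries for a page of only 10 documents; at an assumed 200 bytes per XML entry, `approxBytes(10, 5009, 200)` is about 10 MB per request.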

I'm thinking the load-balancing of HTTP is overrated also, because it's
inflexible.  Adding another shard requires adding another VIP in the
load-balancer, and changing which servers have which shards or adding
new copies of a shard also requires load-balancer configuration.
Everything points to Solr being able to do the load-balancing itself in
the future, and there wouldn't seem to be much benefit to using a
load-balancer w/ VIPS for each shard vs having Solr do it.
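
As a rough illustration of Solr doing this itself, a minimal round-robin sketch over the copies of one shard. Everything here (class name, URL list) is hypothetical, and a real version would also need health checks and failover.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical: the known copies (replicas) of a single shard.
final class ShardReplicas {
    private final List<String> replicaUrls;
    private final AtomicInteger next = new AtomicInteger();

    ShardReplicas(List<String> replicaUrls) {
        if (replicaUrls.isEmpty()) {
            throw new IllegalArgumentException("need at least one replica");
        }
        this.replicaUrls = List.copyOf(replicaUrls);
    }

    // Returns the replicas in turn; floorMod keeps the index valid
    // even after the counter overflows to a negative value.
    String pick() {
        int i = Math.floorMod(next.getAndIncrement(), replicaUrls.size());
        return replicaUrls.get(i);
    }
}
```

Adding a copy of a shard then means changing a list inside Solr rather than reconfiguring a VIP on the load-balancer.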

So even if we stuck with HTTP, Solr would need
 - a binary protocol to minimize network bandwidth use
 - load balancing across shard copies itself

Given that, would it make sense to just go with RMI instead?
And perhaps leverage some other higher level services (Jini?
JavaSpaces?)

I'd like to hear from people with more experience with RMI & friends,
and what the potential downsides are to using these technologies.

-Yonik


Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-22 Thread eks dev
Yonik,
why not use Hadoop's IPC/RPC? We use one of its (pre)historic versions
with some small mods and it works like a charm. Nutch uses it as well. I would
say it is pretty clean and easy to use.

We started with RMI, but we gave it up due to huge latency in the range of 50ms per
hop.  For our use case (well under 100ms response times) that is far too much.  We
spent a few weeks fighting with RMI, and I am not sure whether it was us who failed
to make it work.

Maybe an option is to have a peek at the Apache MINA project, but you will have
to figure out your serialization in that case (hmm, if you care about speed, you
will have to do it anyhow).






Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-22 Thread Yonik Seeley
On 9/21/07, eks dev [EMAIL PROTECTED] wrote:
> why not using hadoop's IPC/RPC?

It looks pretty close to RMI to me.

Right now, one can put Maps, Lists, Integers, arrays, etc, in a
response and have it serialized how they want (JSON, XML, etc).  If we
went the HadoopRPC way, we would need to make everything we used
implement Writable AFAIK.
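
To make that cost concrete, a sketch of what each response type would have to provide. `WritableLike` is a local stand-in with the same two-method shape as Hadoop's `Writable` (so the example compiles without Hadoop on the classpath), and `ShardHit` is a hypothetical response entry, not an existing Solr class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Local stand-in mirroring the two methods of Hadoop's Writable.
interface WritableLike {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical per-document entry a shard might send back.
final class ShardHit implements WritableLike {
    String id = "";
    float score;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(id);        // fields are written in a fixed order...
        out.writeFloat(score);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        id = in.readUTF();       // ...and must be read back in the same order
        score = in.readFloat();
    }

    // Serializes a hit to bytes and reads it back into a fresh object.
    static ShardHit roundTrip(ShardHit hit) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            hit.write(new DataOutputStream(bytes));
            ShardHit copy = new ShardHit();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The point of the sketch is the contrast: generic Maps and Lists would each need a wrapper like this, whereas the current response writers serialize them as-is.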

Regardless of what ends up happening, I think the communication/RPC
mechanism should be at the same level it is now for requests - that
should keep things loosely coupled and make it easy to add/change
communication mechanisms in the future if desired.

-Yonik


Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-21 Thread Walter Underwood
Please don't switch to RMI. We've spent the past year converting
our entire middle tier from RMI to HTTP. We are so glad that we
no longer have any RMI servers.

The big advantage of HTTP is that there are hundreds, maybe
thousands, of engineers working on making it fast, on tools for it,
on caches, etc.

If you really need more compact responses, I would recommend
coding the JSON output in Python marshal format. That is compact,
fast, and easy to parse. We used that for a Java client in Ultraseek.

wunder




Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-21 Thread Yonik Seeley
On 9/21/07, Walter Underwood [EMAIL PROTECTED] wrote:
> Please don't switch to RMI. We've spent the past year converting
> our entire middle tier from RMI to HTTP. We are so glad that we
> no longer have any RMI servers.

Just to be clear for everyone, this wouldn't be a front-end change...
HTTP load balancer over top-level searches would still be the normal
way to do HA / query-load scaling.

This is more about traffic between Solr servers themselves for
distributed search (something that doesn't even exist yet).

-Yonik


Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-21 Thread eks dev
Just a small note regarding load balancing: for a long time we used one
simple scheme where all shards were identical.  We used MMAP Directory, and our
load balancer had rather simple clustering logic to direct queries to
preferred Index Searcher Units (when available; when not, to the first free
neighbor).  This simple clustering helped the OS cache the index much, much better...
A super simple setup (identical shards), but a big win...

The point I am trying to make here: with an HTTP load balancer something like this
is rather difficult (I am not too familiar with HTTP load balancers, but...)




Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-21 Thread Stu Hood
There is an additional limitation in the current HTTP implementation, in that it
does a boolean OR of all the unique keys it needs to fetch on remote shards. Once
you query for ~100 or so documents, the URL can grow to  1 characters
long, and Java refuses to open it. To make it work we'd need to be able to POST
to Solr for results, which I don't think is possible yet...
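
A sketch of what moving the key list into a POST body could look like on the client side. It assumes the receiving end accepts a form-encoded `q` parameter via POST, which is exactly the open question here; the class name and the `id` key field are placeholders, and the ids are not escaped for query syntax.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper for fetching documents by unique key.
final class IdFetchRequest {

    // Builds e.g. "id:(a OR b OR c)"; ids are assumed to need no escaping.
    static String idQuery(String keyField, List<String> ids) {
        return keyField + ":(" + String.join(" OR ", ids) + ")";
    }

    // Form-encodes the query so it travels in the request body, not the URL.
    static byte[] formBody(String keyField, List<String> ids) {
        String q = URLEncoder.encode(idQuery(keyField, ids), StandardCharsets.UTF_8);
        return ("q=" + q).getBytes(StandardCharsets.UTF_8);
    }

    // Sends the body as a POST and returns the HTTP status code.
    static int post(String solrUrl, byte[] body) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(solrUrl).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type",
                "application/x-www-form-urlencoded; charset=UTF-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }
        return conn.getResponseCode();
    }
}
```

With the keys in the body, the URL length no longer depends on how many documents are being fetched.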

I completely agree that using some kind of remote procedure call for queries to 
other shards makes sense.

Perhaps we should attempt to merge the RMI backend of SOLR-255 with the high 
level approach taken in SOLR-303? Or do you have more experience with other 
brands of RPC?

Thanks a lot,
Stu





Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-21 Thread Mike Klaas

On 21-Sep-07, at 11:08 AM, Yonik Seeley wrote:

> I wanted to take a step back for a second and think about if HTTP was
> really the right choice for the transport for distributed search.
>
> I think the high-level approach in SOLR-303 is the right way to go
> about it, but I'm unsure if HTTP is the right transport.


I don't know anything about RMI, but is it possible to do 100's of  
simultaneous asynchronous requests cheaply?  (It is with http, as you  
can use a select() loop on the sockets).


FWIW, our distributed search uses http over 120+ shards... and is  
written in python.  We will likely end up using a java solution  
eventually, though--despite great optimization efforts, python is  
having trouble keeping up.


-Mike


Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-21 Thread Yonik Seeley
On 9/21/07, Stu Hood [EMAIL PROTECTED] wrote:
> Perhaps we should attempt to merge the RMI backend of SOLR-255 with the
> high level approach taken in SOLR-303? Or do you have more experience with
> other brands of RPC?

I think any approach is best done at that high level... so an RMI
interface would share some of the same logic with the dispatch filter.
It's the high-level, coarse-grained approach that will get us good
performance... trying to pretend that remote objects are local and
executing the same logic one would in a single server isn't a good
idea.

I have very little RMI experience... which is why I initially thought
of HTTP too, but that's not the best excuse :-)

-Yonik


Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-21 Thread eks dev
Why not take a look at Hadoop's IPC/RPC? It is small, simple and elegant,
without the latency we saw with RMI (we could not do better than 30-50ms per hop).
Nutch uses it.







Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-21 Thread Yonik Seeley
On 9/21/07, Mike Klaas [EMAIL PROTECTED] wrote:
> On 21-Sep-07, at 11:08 AM, Yonik Seeley wrote:
>
>> I wanted to take a step back for a second and think about if HTTP was
>> really the right choice for the transport for distributed search.
>>
>> I think the high-level approach in SOLR-303 is the right way to go
>> about it, but I'm unsure if HTTP is the right transport.
>
> I don't know anything about RMI, but is it possible to do 100's of
> simultaneous asynchronous requests cheaply?

Good question... probably only important for really big clusters (like
yours), but it would be nice.

Even if we go HTTP, I'm not sure it will be async at first - does
HTTPClient even support async?

I assume when you say async that you mean getting rid of the
thread-per-connection via NIO.  Some protocols do async by handing
off the request to another thread to wait on the response and then do
a callback to the original thread - this is async with respect to the
original calling thread, but still requires a thread-per-connection.
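
The thread-handoff style of "async" described above can be sketched as follows. This is the variant that still ties up one pool thread per outstanding request, not the NIO variant; the class name and the `blockingCall` function (standing in for whatever blocking HTTP or RPC call is made per shard) are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical fan-out: async for the caller, but one thread per request.
final class ThreadPerRequestFanOut {

    static List<String> queryAll(List<String> shards,
                                 Function<String, String> blockingCall) {
        // One worker per shard: each blocks until its shard has answered.
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.max(1, shards.size()));
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String shard : shards) {
                pending.add(pool.submit(() -> blockingCall.apply(shard)));
            }
            // The calling thread only waits here, after all requests are sent.
            List<String> results = new ArrayList<>();
            for (Future<String> f : pending) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Results come back in shard order. With 100's of shards the pool size is the cost being discussed; an NIO selector loop would replace the pool with a single thread watching all sockets.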

Of course HTTP has some issues too - you effectively need a separate
connection per outstanding request.  Pipelining won't work well
because things need to come back in-order.  I'm not sure if RMI has
this limitation as well.

> FWIW, our distributed search uses http over 120+ shards... and is
> written in python.

That would be an awesome test case if you were able to use what Solr
is going to provide out-of-the-box.  Any unusual requirements?

-Yonik


Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-21 Thread Stu Hood
Oddly enough, the system we are integrating Solr into already uses Hadoop on 
every node. It is definitely worth checking out their IPC/RPC.

I wonder how easy it is to use Hadoop RPC code without running any of the 
porcelain around it...?

Thanks,
Stu





Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-21 Thread Yonik Seeley
On 9/21/07, eks dev [EMAIL PROTECTED] wrote:
> why not taking a look at hadoop's IPC/RPC ? it is small, simple and elegant
> with no latency like on RMI  (we could not do better than 30-50ms per hop)

Interesting... I wonder if that 30-50ms is due to Nagle (which always
seemed to cause a 40ms delay w/ python's http lib on Linux when doing
a POST since the headers were sent separately from the body).
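
For reference, this is how Nagle's algorithm is switched off on a plain Java socket. It is only a sketch of the socket option itself: whether RMI exposes a hook for this, and whether Nagle was the actual cause of the delay, are not established here, and the class and method names are made up.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical demo of the TCP_NODELAY socket option.
final class NoDelayDemo {

    // Connects with Nagle's algorithm disabled, so small writes
    // (e.g. headers followed by a body) are sent without coalescing delay.
    static Socket connectNoDelay(InetAddress host, int port) throws IOException {
        Socket socket = new Socket(host, port);
        socket.setTcpNoDelay(true);
        return socket;
    }

    // Self-check against a throwaway loopback listener:
    // returns the option value as reported back by the socket.
    static boolean loopbackCheck() {
        try (ServerSocket server =
                     new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = connectNoDelay(server.getInetAddress(),
                                            server.getLocalPort())) {
            return client.getTcpNoDelay();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The other common fix is to write headers and body through one buffered stream so they leave in a single segment.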

I've seen reports that RMI was faster:
http://mail-archives.apache.org/mod_mbox/lucene-solr-dev/200609.mbox/[EMAIL PROTECTED]

-Yonik


Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-21 Thread Mike Klaas

On 21-Sep-07, at 2:34 PM, Yonik Seeley wrote:

> On 9/21/07, Mike Klaas [EMAIL PROTECTED] wrote:
>> On 21-Sep-07, at 11:08 AM, Yonik Seeley wrote:
>>> I wanted to take a step back for a second and think about if HTTP was
>>> really the right choice for the transport for distributed search.
>>>
>>> I think the high-level approach in SOLR-303 is the right way to go
>>> about it, but I'm unsure if HTTP is the right transport.
>>
>> I don't know anything about RMI, but is it possible to do 100's of
>> simultaneous asynchronous requests cheaply?
>
> Good question... probably only important for really big clusters (like
> yours), but it would be nice.
>
> Even if we go HTTP, I'm not sure it will be async at first - does
> HTTPClient even support async?

I don't think so.  In fact, I need to make a small amendment to my
original claim: the distribution code actually uses our internal RPC
(which is pure Python), but the other end is a Python client that
connects with Solr via HTTP (persistent, localhost connection).  I
wrote it this way because it was easier, as our internal RPC library
already has functionality for spitting out requests to 100's of
clients and collecting the results asynchronously.  I figured that
directly connecting to Solr via HTTP would be cheaper, but perhaps it
wouldn't be.


Both the rpc and http levels use connection-pooled persistent  
connections.



> I assume when you say async that you mean getting rid of the
> thread-per-connection via NIO.  Some protocols do async by handing
> off the request to another thread to wait on the response and then do
> a callback to the original thread - this is async with respect to the
> original calling thread, but still requires a thread-per-connection.


Right; this helps but doesn't scale too far.


> Of course HTTP has some issues too - you effectively need a separate
> connection per outstanding request.  Pipelining won't work well
> because things need to come back in-order.  I'm not sure if RMI has
> this limitation as well.
>
>> FWIW, our distributed search uses http over 120+ shards... and is
>> written in python.
>
> That would be an awesome test case if you were able to use what Solr
> is going to provide out-of-the-box.  Any unusual requirements?


The biggest point of customization is that we run two Solrs in a  
single webapp, one for querying and one for highlighting.  The  
highlighter Solr uses a set of custom parameters to determine the  
docs to use (I imagine the current patch does something like this as  
well).  Splitting the content from the rest of the stored fields is a  
huge win.  There is also lots of custom deduplication and caching  
logic, but this could be done as a post-processing step.


In case anyone is thinking of building something this huge, I'll
mention that it is a bad idea to have a single point try to
manage so many shards.  It is preferable to go hierarchical (this could be
accomplished relatively easily if the query distributor could
query other query distributor nodes).
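
A toy sketch of the hierarchical idea: each mid-level distributor merges its own shards and passes up only its top k, so the root merges a few short lists instead of one per shard. Hits are reduced to bare scores here, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical merge step shared by mid-level and root distributors.
final class HierarchicalMerge {

    // Merges several partial result lists and keeps the k best scores,
    // highest first. The same step runs at every level of the tree.
    static List<Double> topK(List<List<Double>> partials, int k) {
        List<Double> all = new ArrayList<>();
        for (List<Double> partial : partials) {
            all.addAll(partial);
        }
        all.sort(Comparator.reverseOrder());
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }
}
```

For example, if one mid-level node computes `topK` over its shards and a second node does the same, the root only calls `topK` on those two results; no single node has to hold a connection to every shard.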


-Mike


Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-21 Thread eks dev


> I wonder how easy it is to use Hadoop RPC code without running any of the
> porcelain around it...?

very easy!








Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-21 Thread eks dev

> Interesting... I wonder if that 30-50ms is due to Nagle (which always
> seemed to cause a 40ms delay w/ python's http lib on Linux when doing
> a POST since the headers were sent separately from the body).

It's been a while since we played with it, a year or two ago; maybe something
changed in the meantime, or we did it wrong then. But I am sure we tried to disable
Nagle's algorithm (TCP_NODELAY)... anyhow, Hadoop's code did the trick, and I
guess it is even better now.








