I believe Hadoop RPC was originally built for distributed search for
Nutch. Here's some core code I think Nutch still uses
http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/searcher/DistributedSearch.java?revision=619648&view=markup
Hadoop RPC is used for distributed search, but at a layer above
Lucene - search requests are sent via RPC to remote "searchers",
which are Java processes running on multiple boxes. These in turn
make Lucene queries and send back results.
You might want to look at the Katta project
(http://katta.wiki.sourceforge.net/), which uses Hadoop to handle
distributed Lucene indexes.
-- Ken
One thing I wanted to add to the original email: if some of the
core Query and Filter classes implemented java.io.Externalizable,
serialization would speed up to roughly what Hadoop's Writable
achieves. It would also stay backwards compatible with, and
improve, the existing RMI-based distributed search. Classes that
do not implement Externalizable would simply fall back to the
default reflection-based serialization.
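To make the Externalizable point concrete, here is a minimal
sketch. TermQueryStub is a hypothetical stand-in for a Lucene query
class (not a real Lucene class); the idea is that TermQuery,
BooleanQuery, etc. would gain readExternal/writeExternal the same
way, so standard Java serialization (and thus RMI) writes only the
fields instead of going through reflection:

```java
import java.io.*;

// Hypothetical stand-in for a core Lucene query class. With
// Externalizable, only the fields written below cross the wire;
// default reflection-based serialization still works for classes
// that don't opt in.
public class TermQueryStub implements Externalizable {
    private String field;
    private String text;
    private float boost;

    // Externalizable requires a public no-arg constructor.
    public TermQueryStub() {}

    public TermQueryStub(String field, String text, float boost) {
        this.field = field;
        this.text = text;
        this.boost = boost;
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeUTF(field);
        out.writeUTF(text);
        out.writeFloat(boost);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        field = in.readUTF();
        text = in.readUTF();
        boost = in.readFloat();
    }

    public String getField() { return field; }
    public String getText() { return text; }

    // Round-trip through standard Java serialization, the same path
    // RMI uses for call arguments.
    public static TermQueryStub roundTrip(TermQueryStub q) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
            oos.writeObject(q);
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (TermQueryStub) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        TermQueryStub copy = roundTrip(new TermQueryStub("title", "hadoop", 2.0f));
        System.out.println(copy.getField() + ":" + copy.getText());  // prints "title:hadoop"
    }
}
```

Because Externalizable extends Serializable, this is a drop-in
change for the existing RMI search path - older callers are
unaffected.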
On Fri, Jul 11, 2008 at 9:13 AM, Grant Ingersoll
<[EMAIL PROTECTED]> wrote:
I believe there is a subproject over at Hadoop for doing distributed
stuff w/ Lucene, but I am not sure whether they cover the search
side or only indexing. I was always under the impression that it
was too slow for the search side - I don't think Nutch even uses it
for search - but I don't know if that is still the case.
On Jul 10, 2008, at 10:16 PM, Jason Rutherglen wrote:
Has anyone taken a look at using Hadoop RPC for enabling distributed
Lucene? I am thinking it would implement the Searchable interface
and use Java serialization to stay compatible with the current RMI
version. That somewhat defeats the purpose of using Hadoop RPC,
but Hadoop RPC scales far beyond what RMI can at the networking
level: RMI uses a thread per socket and reportedly has latency
issues, while Hadoop RPC uses NIO and is proven to scale to
thousands of servers. Serialization unfortunately must be used with
Lucene because of the Weight, Query and Filter classes. There could
be an extended version of Searchable that accepts Weight, Query,
and Filter classes implementing Hadoop's Writable interface, for
users who want to bypass serialization.
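For reference, Hadoop's Writable contract is just hand-written
binary encode/decode over DataOutput/DataInput - no reflection, no
class descriptors on the wire, which is where the speedup over
default serialization comes from. A stdlib-only sketch (the
Writable interface is mirrored locally so this compiles without
Hadoop on the classpath, and WritableTermQuery is a hypothetical
example of what an extended-Searchable parameter could look like):

```java
import java.io.*;

// Local mirror of org.apache.hadoop.io.Writable's two methods, so
// the sketch is self-contained.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical Writable wrapper for a term query - the kind of
// argument an extended Searchable could accept to bypass Java
// serialization entirely.
public class WritableTermQuery implements Writable {
    private String field = "";
    private String text = "";

    public WritableTermQuery() {}

    public WritableTermQuery(String field, String text) {
        this.field = field;
        this.text = text;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(field);
        out.writeUTF(text);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        field = in.readUTF();
        text = in.readUTF();
    }

    public String getField() { return field; }
    public String getText() { return text; }

    // Encode one instance and decode into a fresh one, the way an
    // RPC server reconstructs a Writable argument from the stream.
    public static WritableTermQuery roundTrip(WritableTermQuery q) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        q.write(new DataOutputStream(bytes));
        WritableTermQuery decoded = new WritableTermQuery();
        decoded.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        return decoded;
    }
}
```

The wire format here is only the UTF-encoded field and term bytes,
versus full class metadata for default serialization - which is the
tradeoff the extended-Searchable idea is about.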
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"