I believe Hadoop RPC was originally built for Nutch's distributed search. Here's some core code I think Nutch still uses: <http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/searcher/DistributedSearch.java?revision=619648&view=markup>

Hadoop RPC is used for distributed search, but at a layer above Lucene - search requests are sent via RPC to remote "searchers", which are Java processes running on multiple boxes. These in turn make Lucene queries and send back results.

You might want to look at the Katta project (http://katta.wiki.sourceforge.net/), which uses Hadoop to handle distributed Lucene indexes.

-- Ken

One thing I wanted to add to the original email: if some of the core Query and Filter classes implemented java.io.Externalizable, there would be a speedup in serialization comparable to using Writable. It would also remain backwards compatible with, and improve, the existing RMI-based distributed search. Classes that do not implement Externalizable would simply fall back to the default reflection-based serialization.
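For illustration, here is a minimal sketch of what that could look like, using the plain JDK. SimpleTermQuery is a hypothetical stand-in, not a real Lucene class; the point is that Externalizable lets a class write its fields explicitly instead of going through reflective default serialization:

```java
import java.io.*;

// Hypothetical sketch only: SimpleTermQuery stands in for a real Lucene
// query class. Externalizable writes fields explicitly, skipping the
// reflection-based field discovery of default serialization.
public class SimpleTermQuery implements Externalizable {
    private String field;
    private String term;
    private float boost;

    // Externalizable requires a public no-arg constructor.
    public SimpleTermQuery() {}

    public SimpleTermQuery(String field, String term, float boost) {
        this.field = field;
        this.term = term;
        this.boost = boost;
    }

    public String getField() { return field; }
    public String getTerm()  { return term; }
    public float getBoost()  { return boost; }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        // Each field is written by hand; no per-field metadata is emitted.
        out.writeUTF(field);
        out.writeUTF(term);
        out.writeFloat(boost);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        field = in.readUTF();
        term = in.readUTF();
        boost = in.readFloat();
    }

    public static void main(String[] args) throws Exception {
        // Round-trip through standard Java serialization; writeExternal/
        // readExternal are invoked instead of the reflective default.
        SimpleTermQuery q = new SimpleTermQuery("title", "hadoop", 2.0f);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(q);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            SimpleTermQuery copy = (SimpleTermQuery) in.readObject();
            System.out.println(copy.getField() + ":" + copy.getTerm());
        }
    }
}
```

Because the object still travels through ObjectOutputStream, this stays wire-compatible with the existing RMI transport; only the per-object encoding gets cheaper.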

On Fri, Jul 11, 2008 at 9:13 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

I believe there is a subproject over at Hadoop for doing distributed stuff w/ Lucene, but I am not sure if they are doing the search side or only indexing. I was always under the impression that it was too slow for the search side; I don't think Nutch even uses it for the search side of the equation, but I don't know if that is still the case.



On Jul 10, 2008, at 10:16 PM, Jason Rutherglen wrote:

Has anyone taken a look at using Hadoop RPC for enabling distributed Lucene? I am thinking it would implement the Searchable interface and use Java serialization to stay compatible with the current RMI version. That somewhat defeats the purpose of using Hadoop RPC; however, Hadoop RPC scales far beyond what RMI can at the networking level. RMI uses a thread per socket and reportedly has latency issues, while Hadoop RPC uses NIO and is proven to scale to thousands of servers. Serialization unfortunately must be used with Lucene due to the Weight, Query, and Filter classes. There could be an extended version of Searchable that allows passing Weight, Query, and Filter classes that implement Hadoop's Writable interface if a user wants to bypass serialization.
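As a sketch of what a Writable-capable query might look like: Hadoop's org.apache.hadoop.io.Writable declares write(DataOutput) and readFields(DataInput), and the local interface below mirrors that signature so the example compiles without Hadoop on the classpath. WritableTermQuery is a hypothetical stand-in, not an actual Lucene or Hadoop class:

```java
import java.io.*;

// Local mirror of Hadoop's Writable signature (write/readFields), so this
// sketch is self-contained. A real implementation would implement
// org.apache.hadoop.io.Writable directly.
interface WritableQuery {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

public class WritableTermQuery implements WritableQuery {
    private String field;
    private String term;

    public WritableTermQuery() {}  // Writables need a no-arg constructor

    public WritableTermQuery(String field, String term) {
        this.field = field;
        this.term = term;
    }

    public String getField() { return field; }
    public String getTerm()  { return term; }

    @Override
    public void write(DataOutput out) throws IOException {
        // Compact, hand-rolled encoding: no class descriptors and no
        // reflection, which is where the win over serialization comes from.
        out.writeUTF(field);
        out.writeUTF(term);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        field = in.readUTF();
        term = in.readUTF();
    }

    public static void main(String[] args) throws IOException {
        WritableTermQuery q = new WritableTermQuery("content", "nio");
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        q.write(new DataOutputStream(bytes));

        WritableTermQuery copy = new WritableTermQuery();
        copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(copy.getField() + ":" + copy.getTerm());
    }
}
```

An extended Searchable could then accept such Writable queries directly on the Hadoop RPC path, while plain Query objects continue to go through serialization for RMI compatibility.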






--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
