[Nutch-general] Re: Using Nutch's distributed search server mode

Doug Cutting Thu, 20 Apr 2006 15:09:06 -0700

Scott Simpson wrote:

I don't quite understand how to set up distributed searching with
relation to DFS (and the Tom White documents don't discuss this either).
There are three databases with relation to Nutch:


1. Web database (dfs)
2. Segments (regular fs)
3. The index (regular fs)

From your message above, I assume that the segments and index go in the
regular file system and the web database is distributed across dfs. We
put only a portion of the segments and index on each node and the search
is distributed from Tomcat to all the nodes at once.

If we don't use DFS for the segments and index, we'll lose the
redundancy if a node is dead and we may lose search results. Is this
true?

The distributed search code is currently a bit neglected. It doesn't
yet take advantage of MapReduce. The best way to use it today is to
keep the master copy of your segments and indexes in dfs. Then, when
you're (manually) starting distributed search servers, copy segments
and indexes from dfs to temporary local storage and start the
distributed search servers against those. Then construct a
search-servers.txt that will be picked up by NutchBean to construct
the DistributedSearch.Client.
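
As a rough illustration of that last step, here is a minimal sketch that
turns a list of "host:port" strings into the text of a
search-servers.txt. It assumes the file holds one "host port" pair per
line, which should be checked against the Nutch version in use. The
class name SearchServersFile and the example host names are
hypothetical and not part of Nutch.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Nutch: renders search-servers.txt content. */
public class SearchServersFile {

    /**
     * Converts "host:port" strings into one "host port" line each.
     * The "host port" line format is an assumption about search-servers.txt.
     */
    public static String render(List<String> hostPorts) {
        StringBuilder out = new StringBuilder();
        for (String hp : hostPorts) {
            int colon = hp.lastIndexOf(':');
            if (colon <= 0 || colon == hp.length() - 1) {
                throw new IllegalArgumentException("expected host:port, got: " + hp);
            }
            String host = hp.substring(0, colon);
            // parseInt rejects non-numeric ports early
            int port = Integer.parseInt(hp.substring(colon + 1));
            out.append(host).append(' ').append(port).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Example hosts are placeholders for the nodes running search servers.
        System.out.print(render(Arrays.asList(
                "node1.example.com:9999",
                "node2.example.com:9999")));
    }
}
```

The output would be saved as search-servers.txt in the directory the
web application's searcher is configured to read.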

Long-term, I think we should automate this by having a distributed
search MapReduce task. Each task will start by copying required data to
local disk, starting a search server on that data, then reporting that
search server back through the job tracker. Currently this can be done
by setting the task's status to be the host:port string of the search
server, then calling getMapTaskReports() to get the host:port of all
servers. The "map" task can then simply loop forever doing nothing. If
a search server dies, then the MapReduce system will automatically start
a new one. To launch a new version of the index, start a new such
MapReduce job, and, once it is running, switch the
DistributedSearch.Client to use its servers and kill the old job. The
temporary space will be reclaimed when the job is killed. One will have
to be sure that the number of input "splits" naming search server tasks
is no greater than numNodes*mapred.tasktracker.tasks.maximum, so that
all of the tasks will run simultaneously.
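
Two pieces of that scheme can be sketched without any Hadoop
dependency: the capacity check on the number of splits, and turning the
host:port status strings into socket addresses. This is only an
illustration in plain Java. The class and method names are
hypothetical, and the status strings are assumed to have already been
collected from the task reports.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical planning helpers for a search-server MapReduce job. */
public class SearchJobPlanning {

    /**
     * True when every search-server task can run at the same time, i.e.
     * numSplits <= numNodes * tasksMaximum, where tasksMaximum stands for
     * the value of mapred.tasktracker.tasks.maximum.
     */
    public static boolean allTasksRunAtOnce(int numSplits, int numNodes, int tasksMaximum) {
        // Widen to long so the product cannot overflow int.
        return numSplits <= (long) numNodes * tasksMaximum;
    }

    /** Parses "host:port" task status strings into unresolved socket addresses. */
    public static List<InetSocketAddress> parseStatuses(List<String> statuses) {
        List<InetSocketAddress> servers = new ArrayList<InetSocketAddress>();
        for (String status : statuses) {
            int colon = status.lastIndexOf(':');
            if (colon <= 0 || colon == status.length() - 1) {
                // A task that has not reported its server yet; skip it.
                continue;
            }
            String host = status.substring(0, colon);
            int port;
            try {
                port = Integer.parseInt(status.substring(colon + 1));
            } catch (NumberFormatException e) {
                continue; // status is not a host:port string
            }
            servers.add(InetSocketAddress.createUnresolved(host, port));
        }
        return servers;
    }
}
```

For example, with 4 nodes and a per-node maximum of 2 tasks, a job with
8 splits passes the check and a job with 9 does not, since the ninth
task would wait for a slot that the forever-looping tasks never free.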


But none of that's implemented yet!

Doug

