Scott Simpson wrote:
I don't quite understand how to set up distributed searching with
relation to DFS (and the Tom White documents don't discuss this either).
There are three databases with relation to Nutch:

1. Web database (dfs)
2. Segments (regular fs)
3. The index (regular fs)

From your message above, I assume that the segments and index go in the
regular file system and the web database is distributed across dfs. We
put only a portion of the segments and index on each node and the search
is distributed from Tomcat to all the nodes at once.

If we don't use DFS for the segments and index, we'll lose the
redundancy if a node is dead and we may lose search results. Is this
true?

The distributed search code is currently a bit neglected. It doesn't yet take advantage of MapReduce. The best way to use it today is to keep the master copy of your segments and indexes in dfs. Then, when you're (manually) starting distributed search servers, copy the segments and indexes from dfs to temporary local storage and start the distributed search servers against those local copies. Finally, construct a search-servers.txt that will be picked up by NutchBean to construct the DistributedSearch.Client.
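As a rough illustration of the last step, here is a minimal Java sketch that renders the contents of a search-servers.txt from a list of server addresses. It assumes the file holds one whitespace-separated "host port" pair per line, which is my understanding of what DistributedSearch.Client reads; check that against your Nutch version. The class name SearchServersFile, the method render, and the host names are hypothetical and not part of Nutch.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch.
class SearchServersFile {

    /**
     * Turns "host:port" entries into search-servers.txt content,
     * assuming the expected format is one "host port" pair per line.
     */
    static String render(List<String> hostPorts) {
        StringBuilder sb = new StringBuilder();
        for (String hp : hostPorts) {
            int colon = hp.lastIndexOf(':');
            if (colon <= 0 || colon == hp.length() - 1) {
                throw new IllegalArgumentException("expected host:port, got: " + hp);
            }
            String host = hp.substring(0, colon);
            // parseInt rejects non-numeric ports with NumberFormatException
            int port = Integer.parseInt(hp.substring(colon + 1));
            sb.append(host).append(' ').append(port).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Example node names and port are made up for illustration.
        List<String> servers = Arrays.asList("node1:9999", "node2:9999");
        System.out.print(render(servers));
    }
}
```

The output would be saved as search-servers.txt in the directory the search front end is configured to read, after each listed node has its local copy of the segments and index and a search server running on the given port.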

Long-term, I think we should automate this by having a distributed search MapReduce task. Each task will start by copying required data to local disk, starting a search server on that data, then reporting that search server back through the job tracker. Currently this can be done by setting the task's status to be the host:port string of the search server, then calling getMapTaskReports() to get the host:port of all servers. The "map" task can then simply loop forever doing nothing. If a search server dies, then the MapReduce system will automatically start a new one. To launch a new version of the index, start a new such MapReduce job, and, once it is running, switch the DistributedSearch.Client to use its servers and kill the old job. The temporary space will be reclaimed when the job is killed. One will have to be sure that the number of input "splits" naming search server tasks is no greater than numNodes*mapred.tasktracker.tasks.maximum, so that all of the tasks will run simultaneously.
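To make the sizing constraint concrete, here is a small Java sketch of the check, plus a parser for the host:port status strings the tasks would report. It uses no MapReduce APIs; the class SearchJobPlanner and its methods are hypothetical names for illustration, and the numbers in main are made up.

```java
// Hypothetical helper, not part of Nutch or its MapReduce code.
class SearchJobPlanner {

    /**
     * True when every search-server task can hold a slot at the same time,
     * i.e. numSplits <= numNodes * tasksMaximum, where tasksMaximum stands
     * for the value of mapred.tasktracker.tasks.maximum.
     */
    static boolean allTasksRunSimultaneously(int numSplits, int numNodes, int tasksMaximum) {
        // Widen to long so the product cannot overflow.
        return numSplits <= (long) numNodes * tasksMaximum;
    }

    /** Splits a task status of the form "host:port" into {host, port}. */
    static String[] parseStatus(String status) {
        int colon = status.lastIndexOf(':');
        if (colon <= 0 || colon == status.length() - 1) {
            throw new IllegalArgumentException("expected host:port, got: " + status);
        }
        return new String[] { status.substring(0, colon), status.substring(colon + 1) };
    }

    public static void main(String[] args) {
        // 10 nodes with 2 task slots each give 20 slots in total.
        System.out.println(allTasksRunSimultaneously(20, 10, 2));
        System.out.println(allTasksRunSimultaneously(21, 10, 2));
        String[] hp = parseStatus("node7:9999");
        System.out.println(hp[0] + " " + hp[1]);
    }
}
```

If the check fails, some search-server tasks would sit queued behind others that never finish, so part of the index would never be served.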

But none of that's implemented yet!

Doug


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
