Scott Simpson wrote:
I don't quite understand how to set up distributed searching in
relation to DFS (and the Tom White documents don't discuss this either).
There are three databases in Nutch:
1. Web database (dfs)
2. Segments (regular fs)
3. The index (regular fs)
From your message above, I assume that the segments and index go in the
regular file system and the web database is distributed across dfs. We
put only a portion of the segments and index on each node and the search
is distributed from Tomcat to all the nodes at once.
If we don't use DFS for the segments and index, we'll lose the
redundancy if a node is dead and we may lose search results. Is this
true?
The distributed search code is currently a bit neglected. It doesn't
yet take advantage of MapReduce. The best way to use it today is to
keep the master copy of your segments and indexes in dfs. Then, when
you're (manually) starting distributed search servers, copy segments and
indexes from dfs to temporary local storage and start the distributed
search servers against those. Finally, construct a search-servers.txt
listing those servers; NutchBean picks it up to construct the
DistributedSearch.Client.
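As a rough illustration of that last step, here is a minimal Python sketch that writes a search-servers.txt. It assumes the file holds one whitespace-separated "host port" pair per line, which you should check against your Nutch version. The helper name write_search_servers and the host names and port in the usage note are hypothetical.

```python
from pathlib import Path


def write_search_servers(servers, directory):
    """Write search-servers.txt into `directory` and return its path.

    `servers` is an iterable of (host, port) pairs. The one-pair-per-line
    "host port" layout is an assumption, not taken from the message above.
    """
    lines = []
    for host, port in servers:
        port = int(port)
        if not 0 < port < 65536:
            raise ValueError(f"invalid port for {host}: {port}")
        lines.append(f"{host} {port}")
    path = Path(directory) / "search-servers.txt"
    path.write_text("\n".join(lines) + "\n")
    return path
```

For example, write_search_servers([("node1", 9999), ("node2", 9999)], "conf") would list two servers, one per line, in conf/search-servers.txt.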
Long-term, I think we should automate this by having a distributed
search MapReduce task. Each task will start by copying required data to
local disk, starting a search server on that data, then reporting that
search server back through the job tracker. Currently this can be done
by setting the task's status to be the host:port string of the search
server, then calling getMapTaskReports() to get the host:port of all
servers. The "map" task can then simply loop forever doing nothing. If
a search server dies, then the MapReduce system will automatically start
a new one. To launch a new version of the index, start a new such
MapReduce job, and, once it is running, switch the
DistributedSearch.Client to use its servers and kill the old job. The
temporary space will be reclaimed when the job is killed. One will have
to be sure that the number of input "splits" naming search server tasks
is no greater than numNodes*mapred.tasktracker.tasks.maximum, so that
all of the tasks will run simultaneously.
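To make the two bits of bookkeeping above concrete, here is a small Python sketch: one function turns a task status string of the form host:port into a (host, port) pair, and the other checks the split-count constraint. Both function names are hypothetical, and nothing here talks to the job tracker; the status strings are assumed to have been fetched already.

```python
def parse_server_status(status):
    """Split a "host:port" task status into (host, port)."""
    host, sep, port = status.strip().rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"not a host:port status: {status!r}")
    return host, int(port)


def all_tasks_fit(num_splits, num_nodes, tasks_maximum):
    """True if every search-server task can run at the same time.

    `tasks_maximum` stands for the value of
    mapred.tasktracker.tasks.maximum on each node.
    """
    return num_splits <= num_nodes * tasks_maximum
```

With 4 nodes and a per-node maximum of 2, up to 8 splits fit; a ninth search-server task would have to wait for a slot and so would never start, since the running tasks loop forever.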
But none of that's implemented yet!
Doug
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general