The distributed searching section assumes that you have split the index
into multiple pieces and that there is a piece on each machine. The
tutorial doesn't tell you how to split the indexes because there is no
tool to do that yet. I was trying to lay out a general architecture for
how to do distributed searching instead of giving a step-by-step method.
What I would do for now is to create multiple indexes of, say, 2-4
million pages each and put each index on a separate machine. You would
also need to copy all of the supporting database files, such as the
crawl db and link db, to each machine.
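
As a rough sketch of that manual approach, assuming you have built two
separate crawl directories in DFS (the names crawled1 and crawled2 and
the /d01/local path below are just placeholders following the tutorial),
it might look something like this:

  # on search server 1: pull its own index and supporting dbs out of DFS
  bin/hadoop dfs -copyToLocal crawled1 /d01/local/crawled

  # on search server 2: copy a different index, not a duplicate of the first
  bin/hadoop dfs -copyToLocal crawled2 /d01/local/crawled

  # then start the distributed search server on each machine
  bin/nutch server 1234 /d01/local/crawled

The important part is that each machine serves a different slice of the
pages; copying the same crawled directory to every server is what gives
you the duplicate results you are seeing.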
Having to build a separate index for each machine is not the most
elegant way to accomplish this architecture. The best way we have found
is to have a splitter job that indexes and splits the index and
supporting databases into multiple parts on the fly. These parts are
then moved out to the search servers. We have some base code for this,
but it is not in the Nutch codebase yet. If you want to move down this
path, send me an email.
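
For completeness, once each server is running its own part, the web
front end is pointed at them through a search-servers.txt file in the
directory named by the searcher.dir property (this is how it works in
the versions I have used; check the tutorial for your release). The
host names below are just examples:

  # search-servers.txt: one "host port" line per search server
  search01 1234
  search02 1234

With that in place the front end fans each query out to every server
and merges the results, instead of searching a single local copy.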
Dennis
Giuseppe Cannella wrote:
On the http://wiki.apache.org/nutch/NutchHadoopTutorial page, in the
'Distributed Searching' section, I read:
"On each of the search servers you would use the startup the distributed search
server by using the nutch server command like this:
bin/nutch server 1234 /d01/local/crawled"
But /d01/local/crawled has been created only on the first server; how can I create it on all of the servers?
If I use "bin/hadoop dfs -copyToLocal crawled /d01/local/" on every server, the search finds N identical results (where N is the number of servers in the cluster).