The distributed searching section assumes that you have split the index into multiple pieces and that there is one piece on each machine. The tutorial doesn't tell you how to split the indexes because there is no tool to do that yet. I was trying to lay out a general architecture for distributed searching rather than give a step-by-step method. What I would do for now is create multiple indexes of, say, 2-4 million pages each and put each index on a separate machine. You would also need to copy the supporting database files, such as the crawl db and link db, to each machine.
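
As a rough sketch (the crawled-part1/crawled-part2 directory names and the /d01/local paths below are just placeholders), with two separate indexes it would look something like this:

On search server 1:
  bin/hadoop dfs -copyToLocal crawled-part1 /d01/local/crawled
  bin/nutch server 1234 /d01/local/crawled

On search server 2:
  bin/hadoop dfs -copyToLocal crawled-part2 /d01/local/crawled
  bin/nutch server 1234 /d01/local/crawled

Then list both machines in the search-servers file that the web front end reads, one "host port" pair per line:
  server1 1234
  server2 1234

The important point is that each machine serves a different slice of the index. If you copy the same crawled directory to every machine, every server answers with the same documents, which is why you see N identical results.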

Having to create separate indexes and place one on each machine by hand is not the most elegant way to accomplish this architecture. The best way that we have found is to have a splitter job that indexes and splits the index and supporting databases into multiple parts on the fly. These parts are then moved out to the search servers. We have some base code for this but it is not in the Nutch codebase yet. If you want to move down this path, send me an email.

Dennis

Giuseppe Cannella wrote:
On the http://wiki.apache.org/nutch/NutchHadoopTutorial page, in the 'Distributed Searching' section, I read:

"On each of the search servers you would use the startup the distributed search 
server by using the nutch server command like this:
bin/nutch server 1234 /d01/local/crawled"

but /d01/local/crawled has been created only on the first server; how can I create it for all servers? If I use "bin/hadoop dfs -copyToLocal crawled /d01/local/" on every server, the search finds N identical results (where N is the number of servers in the cluster).

