The distributed searching section assumes that you have split the index
into multiple pieces and that there is a piece on each machine. The
tutorial doesn't tell you how to split the indexes because there is no
tool to do that yet. I was trying to lay out a general architecture for
how to do distributed searching instead of giving a step-by-step method.
What I would do for now is to create multiple indexes of, say, 2-4
million pages each and put each index on a separate machine. You would
also need to copy all of the supporting database files, such as the
crawl db and link db, to each machine.
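
As a rough sketch of that manual approach, assuming you have built two
separate crawl directories in DFS (the names crawled1 and crawled2 and
the /d01/local path below are just placeholders following the tutorial),
it might look something like this:

  # on search server 1: pull its own index and supporting dbs out of DFS
  bin/hadoop dfs -copyToLocal crawled1 /d01/local/crawled

  # on search server 2: copy a different index, not a duplicate of the first
  bin/hadoop dfs -copyToLocal crawled2 /d01/local/crawled

  # then start the distributed search server on each machine
  bin/nutch server 1234 /d01/local/crawled

The important part is that each machine serves a different slice of the
pages; copying the same crawled directory to every server is what gives
you the duplicate results you are seeing.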
Having to build a separate index for each machine is not the most
elegant way to accomplish this architecture. The best way we have found
is to have a splitter job that indexes and splits the index and
supporting databases into multiple parts on the fly. These parts are
then moved out to the search servers. We have some base code for this,
but it is not in the Nutch codebase yet. If you want to move down this
path, send me an email.
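
For completeness, once each server is running its own part, the web
front end is pointed at them through a search-servers.txt file in the
directory named by the searcher.dir property (this is how it works in
the versions I have used; check the tutorial for your release). The
host names below are just examples:

  # search-servers.txt: one "host port" line per search server
  search01 1234
  search02 1234

With that in place the front end fans each query out to every server
and merges the results, instead of searching a single local copy.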
Dennis
Giuseppe Cannella wrote:
On the http://wiki.apache.org/nutch/NutchHadoopTutorial page, in the
'Distributed Searching' section, I read:
"On each of the search servers you would use the startup the distributed search
server by using the nutch server command like this:
bin/nutch server 1234 /d01/local/crawled"
But /d01/local/crawled has been created only on the first server; how can I create it on all of the servers?
If I use "bin/hadoop dfs -copyToLocal crawled /d01/local/" on every server, the search finds N identical results (where N is the number of servers in the cluster).