Jack Tang wrote:
Below is the Google architecture as I picture it:

                DataNode A
Master      DataNode B               GoogleCrawler
                DataNode C
                ......
GoogleCrawler is kept running all the time. One day, it gets a fetchlist
from DataNode A, crawls all the pages and indexes them, then it tells Master
"I want to update DataNode A's index", finally it acquires the "read
lock" and "write lock", and the index is updated. Then a similar operation
is applied to DataNodes B and C.

Do you have evidence that this is how Google updates their index? I've never seen much published about that.

In the future I would like to implement a more automated distributed search system than Nutch currently has. One way to do this might be to use MapReduce. Each map task's input could be an index and some segment data. The map method would serve queries, i.e., run a Nutch DistributedSearch.Server. It would first copy the index out of NDFS to the local disk, for better performance. It would never exit normally, but rather "map" forever. When a new version of the index (new set of segments, new boosts, and/or new deletions, etc.) is ready to deploy, then a new job could be submitted. If the number of map tasks (i.e., indexes) is kept equal to or less than the number of nodes, and each node is permitted to run two or more tasks, then two versions of the index can be served at once. Once the new version has been deployed (listening for searches on different ports), and search front-ends are using it, then the old version can be stopped by killing its MapReduce job. If a node dies, the MapReduce job tracker would automatically re-start its task on another node.
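To make the "map forever" idea concrete, here is a minimal Java sketch, not actual Nutch or Hadoop code: copyToLocal() and startSearchServer() are hypothetical helpers standing in for the NDFS-to-local copy and the launch of a Nutch DistributedSearch.Server, and the local path and heartbeat interval are made up for illustration.

import java.io.File;
import java.io.IOException;

public class IndexServingTask {

  /** Stand-in for copying the index out of NDFS to local disk for faster searching. */
  static File copyToLocal(String ndfsIndexPath, File localDir) throws IOException {
    // ... an NDFS client copy would go here; omitted for brevity ...
    return new File(localDir, new File(ndfsIndexPath).getName());
  }

  /** Stand-in for starting a Nutch DistributedSearch.Server over the local index. */
  static void startSearchServer(File localIndex, int port) {
    // ... open the index and begin answering search requests on the given port ...
  }

  /**
   * The "map" method: deploy one index, then serve queries forever and never
   * exit normally.  A new index version is deployed by submitting a new job
   * (listening on a different port); this task runs until its job is killed.
   */
  public void map(String ndfsIndexPath, int port) throws Exception {
    File local = copyToLocal(ndfsIndexPath, new File("/tmp/search"));  // hypothetical local dir
    startSearchServer(local, port);
    while (true) {              // "map" forever
      Thread.sleep(10_000);     // in a real task, report progress so the tracker keeps it alive
    }
  }
}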

If there were an affinity mechanism between tasks and task trackers, then new versions of indexes where, e.g., only the boosts or deletions have changed could be re-deployed to the same nodes as before. Then the copy of the index to the local disk could be incremental, only copying the parts of the index/segment that have changed.
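As a rough illustration of that incremental copy, the sketch below compares the new index files against whatever is already on the node and only re-copies files that are missing or changed. It uses the local filesystem as a stand-in for NDFS, and comparing by length and modification time is just an assumption; a checksum would be more robust.

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

public class IncrementalIndexCopy {

  /** Copy only the files of newIndexDir that are missing or changed in localDir. */
  static void syncIndex(File newIndexDir, File localDir) throws IOException {
    localDir.mkdirs();
    File[] files = newIndexDir.listFiles();
    if (files == null) return;
    for (File src : files) {
      File dst = new File(localDir, src.getName());
      // Skip files that look unchanged, e.g. the large postings files when only
      // boosts or deletions were updated; copy everything else.
      if (dst.exists() && dst.length() == src.length()
          && dst.lastModified() >= src.lastModified()) {
        continue;
      }
      Files.copy(src.toPath(), dst.toPath(), StandardCopyOption.REPLACE_EXISTING);
    }
  }
}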

Doug
