We have been using Nutch for whole-web crawling. In an infinite loop we run the following tasks:

1) bin/nutch generate db <segmentsPath> -topN 10000
2) bin/nutch fetch <segment name>
3) bin/nutch updatedb db <segment name>
4) bin/nutch analyze db <segment name>
5) bin/nutch index <segment name>
6) bin/nutch dedup segments dedup.tmp

After each iteration we produce a new segment and can already use it for search (a rough sketch of this loop is shown below). Now we are trying the mapred version. How can we use crawl in a similar way? We need results while the crawl is in progress, not only at the end, since whole-web crawling is a very long process (weeks).
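To make the workflow concrete, here is a rough shell sketch of the loop we run now. The "db" and "segments" directory names and the way the newest segment is picked (segments are named by timestamp, so sorting works) are assumptions for illustration, not exact paths from our setup:

    #!/bin/sh
    # Repeat the generate/fetch/updatedb/analyze/index/dedup cycle forever;
    # after every pass the newest segment is indexed and searchable.
    while true; do
      bin/nutch generate db segments -topN 10000
      # assumption: the segment just generated is the newest directory
      segment=`ls -d segments/* | sort | tail -1`
      bin/nutch fetch $segment
      bin/nutch updatedb db $segment
      bin/nutch analyze db $segment
      bin/nutch index $segment
      bin/nutch dedup segments dedup.tmp
    done

With the crawl command we do not see an obvious way to get this incremental behaviour, which is what we are asking about.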