We have been using Nutch for whole-web crawling. In an infinite loop we run the following tasks:

1) bin/nutch generate db <segmentsPath> -topN 10000
2) bin/nutch fetch <segment name>
3) bin/nutch updatedb db <segment name>
4) bin/nutch analyze db <segment name>
5) bin/nutch index <segment name>
6) bin/nutch dedup segments dedup.tmp

After each iteration we produce a new segment and can already use it for search (a rough sketch of this loop is shown below). Now we are trying the mapred version. How can we use crawl in a similar way? We need results while the crawl is in progress, not only at the end, since whole-web crawling is a very long process (weeks).
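To make the workflow concrete, here is a rough shell sketch of the loop we run now. The "db" and "segments" directory names and the way the newest segment is picked (segments are named by timestamp, so sorting works) are assumptions for illustration, not exact paths from our setup:

    #!/bin/sh
    # Repeat the generate/fetch/updatedb/analyze/index/dedup cycle forever;
    # after every pass the newest segment is indexed and searchable.
    while true; do
      bin/nutch generate db segments -topN 10000
      # assumption: the segment just generated is the newest directory
      segment=`ls -d segments/* | sort | tail -1`
      bin/nutch fetch $segment
      bin/nutch updatedb db $segment
      bin/nutch analyze db $segment
      bin/nutch index $segment
      bin/nutch dedup segments dedup.tmp
    done

With the crawl command we do not see an obvious way to get this incremental behaviour, which is what we are asking about.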