Background of my problem: I am running Nutch 1.4 on Hadoop 0.20.203. There
is a series of MapReduce jobs that I run on the Nutch segments to get the
final output. But waiting for the whole crawl to finish before running
MapReduce makes the solution take much longer, so I now trigger the
MapReduce jobs on each segment as soon as it is dumped (see the sketch
after Case 2 below). To do that, I run the crawl in a loop ('N = depth'
times), giving depth=1 on each pass. The problem: some URLs get lost when
I crawl with depth 1 in a loop N times, compared to a single crawl with
depth N.
Please find the pseudo code for both cases below:
*Case 1*: Nutch crawl on Hadoop with depth=3.
// Build the list of arguments that we pass to Nutch
List<String> nutchArgsList = new ArrayList<String>();
nutchArgsList.add("-depth");
nutchArgsList.add(Integer.toString(3));
<...other nutch args...>
ToolRunner.run(nutchConf, new Crawl(),
        nutchArgsList.toArray(new String[nutchArgsList.size()]));
*Case 2*: Crawling in a loop 3 times with depth=1.
for (int depthRun = 0; depthRun < 3; depthRun++) {
    // Build the list of arguments that we pass to Nutch
    List<String> nutchArgsList = new ArrayList<String>();
    nutchArgsList.add("-depth");
    nutchArgsList.add(Integer.toString(1)); // *NOTE* depth is given as 1 here
    <...other nutch args...>
    ToolRunner.run(nutchConf, new Crawl(),
            nutchArgsList.toArray(new String[nutchArgsList.size()]));
}
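For context, this is roughly how I pick up the segment that each depth=1
iteration has just dumped and hand it to my downstream MapReduce job. The
"crawl/segments" path and runMySegmentJob() are placeholders for my actual
directory layout and job driver, not the exact names:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Find the segment the last depth=1 iteration dumped. Segment directories
// are named by timestamp (yyyyMMddHHmmss), so the lexicographically largest
// name under the segments directory is the newest one.
private Path findLatestSegment(Configuration conf) throws IOException {
    Path segmentsDir = new Path("crawl/segments"); // placeholder path
    FileSystem fs = FileSystem.get(conf);
    Path latest = null;
    for (FileStatus status : fs.listStatus(segmentsDir)) {
        if (latest == null
                || status.getPath().getName().compareTo(latest.getName()) > 0) {
            latest = status.getPath();
        }
    }
    return latest;
}

// Inside the loop, right after ToolRunner.run(...):
// runMySegmentJob(nutchConf, findLatestSegment(nutchConf)); // placeholder job driver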
Some URLs get lost (db_unfetched) when I crawl in a loop as many times as
the depth value.
I have tried this on standalone Nutch, running once with depth 3 vs running
3 times over the same URLs with depth 1. Comparing the crawldbs, the
difference is only 12 URLs. But when I do the same on Hadoop using
ToolRunner, I get around 1000 URLs as db_unfetched.
As far as I understand, Nutch itself runs the crawl in a loop as many times
as the depth value, so the two cases should be equivalent. Please suggest
what I might be missing.
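For reference, this is how I understand one depth iteration inside Crawl
(generate -> fetch -> parse -> updatedb), and what I expect my depth=1 loop
to be equivalent to. The class names are the Nutch 1.4 tools, but the paths
and option strings below are illustrative and from memory, so please correct
me if they are off:

import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.parse.ParseSegment;

String crawlDb = "crawl/crawldb";       // placeholder layout
String segmentsDir = "crawl/segments";

// Generate a fetch list for this depth iteration
ToolRunner.run(nutchConf, new Generator(),
        new String[] { crawlDb, segmentsDir, "-topN", "1000" });

// The segment Generator just created is the newest one under segmentsDir
// (see findLatestSegment() above)
String segment = findLatestSegment(nutchConf).toString();

// Fetch and parse the segment, then fold the outlinks back into the crawldb
ToolRunner.run(nutchConf, new Fetcher(), new String[] { segment });
ToolRunner.run(nutchConf, new ParseSegment(), new String[] { segment });
ToolRunner.run(nutchConf, new CrawlDb(), new String[] { crawlDb, segment });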
Also, please let me know why the difference is so large when I do this on
Hadoop using ToolRunner vs doing the same on standalone Nutch.
Thanks in advance.
Regards,
Ashish V