Hi, I have had the same problem in one of my instances. Let's dig into it together, at least. I tried to re-crawl the URL list into the same crawl directory (crawl-301 in your case) and got the same error. Can you confirm that the same happens in your case?
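A quick way to see which segments are affected is to list the segment directories that have no parse_data subdirectory, since those are the paths the exception complains about. The following is only a minimal sketch: the function name is hypothetical (it is not part of Nutch), and the crawl-301/segments path is assumed from the quoted log.

```python
from pathlib import Path


def segments_missing_parse_data(segments_dir):
    """Return the names of segment directories that lack a parse_data subdirectory."""
    root = Path(segments_dir)
    return sorted(
        seg.name
        for seg in root.iterdir()
        if seg.is_dir() and not (seg / "parse_data").is_dir()
    )


if __name__ == "__main__":
    # Path assumed from the quoted log; adjust it to your own crawl directory.
    segments = Path("crawl-301/segments")
    if segments.is_dir():
        for name in segments_missing_parse_data(segments):
            print("missing parse_data:", name)
```

Any segment printed here is one that the LinkDb step will fail on. A plausible cause is a segment whose fetch or parse step did not finish, but that is an assumption to verify against your own crawl logs.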
Best,
Dincer

2011/8/1 Christian Weiske <[email protected]>

> Hello,
>
>
> I setup nutch 1.3 to crawl our mediawiki instance.
> Somewhere during the crawling process I get an error that stops
> everything:
>
> ---------
> LinkDb: starting at 2011-08-01 09:27:51
> LinkDb: linkdb: crawl-301/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801084037
> LinkDb: adding segment:
> [20 more of that]
> file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801090707
> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
> Input path does not exist: file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801083518/parse_data
> Input path does not exist: file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801091638/parse_data
> Input path does not exist: file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801091806/parse_data
>     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>     at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:142)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
> ---------
>
>
> What can I do to fix this?
>
>
> --
> Best regards
> Christian Weiske
>

