Hello,
I set up Nutch 1.3 to crawl our MediaWiki instance.
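For reference, I start the crawl from runtime/local roughly like this (the urls seed directory and the depth/topN values below are placeholders, not my exact settings):
---------
bin/nutch crawl urls -dir crawl-301 -depth 25 -topN 1000
---------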
Somewhere during the crawl, the LinkDb step fails with an error that stops
everything:
---------
LinkDb: starting at 2011-08-01 09:27:51
LinkDb: linkdb: crawl-301/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801084037
[20 more lines like this]
LinkDb: adding segment: file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801090707
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801083518/parse_data
Input path does not exist: file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801091638/parse_data
Input path does not exist: file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801091806/parse_data
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:142)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
---------
What can I do to fix this?
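The missing parse_data directories look like parsing failed or never ran for those three segments. Would it be safe to simply drop the incomplete segments and rebuild the linkdb by hand, along these lines (untested sketch, assuming the local layout shown above):
---------
# drop segments that have no parse_data (untested)
for seg in crawl-301/segments/*; do
    if [ ! -d "$seg/parse_data" ]; then
        rm -r "$seg"
    fi
done
bin/nutch invertlinks crawl-301/linkdb -dir crawl-301/segments
---------
Or would that only hide a fetch/parse failure that I should fix instead?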
--
Best regards
Christian Weiske