Hi,

I have had the same problem in one of my instances. Let's dig into it
together, at least. I tried to re-crawl the URL list into the same crawl
directory (crawl-301 in your case) and got the same error; can you
confirm this for your case?
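
In the meantime, it might help to see which segments actually lack the
parse_data directory that LinkDb is looking for. Here is a minimal shell
sketch (the crawl-301 segment names are taken from your trace; the demo
directory created below is only a stand-in so the loop is runnable
anywhere -- for a real crawl you would point it at crawl-301/segments):

```shell
# Sketch: list segments that are missing parse_data -- these are the
# ones the LinkDb step trips over.
demo=$(mktemp -d)
mkdir -p "$demo/segments/20110801084037/parse_data"  # complete segment
mkdir -p "$demo/segments/20110801083518"             # incomplete segment

for seg in "$demo"/segments/*; do
    # A segment without parse_data was fetched but never parsed.
    [ -d "$seg/parse_data" ] || echo "incomplete: $seg"
done

rm -rf "$demo"
```

Segments flagged as incomplete were fetched but not parsed; I think
re-running "bin/nutch parse" on them (or removing them before the
invertlinks step) should get past the exception, but I have not
verified that yet.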

Best,
Dincer

2011/8/1 Christian Weiske <[email protected]>

> Hello,
>
>
> I set up nutch 1.3 to crawl our mediawiki instance.
> Somewhere during the crawling process I get an error that stops
> everything:
>
> ---------
> LinkDb: starting at 2011-08-01 09:27:51
> LinkDb: linkdb: crawl-301/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801084037
> [20 more of that]
> LinkDb: adding segment: file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801090707
> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
> Input path does not exist: file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801083518/parse_data
> Input path does not exist: file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801091638/parse_data
> Input path does not exist: file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801091806/parse_data
>        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
>        at org.apache.nutch.crawl.Crawl.run(Crawl.java:142)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
> ---------
>
>
> What can I do to fix this?
>
>
> --
> Viele Grüße
> Christian Weiske
>