It looks like you have run the crawl several times, and some of those runs failed before the parse phase, so parse_data was never generated for those segments. The simplest fix is to delete the whole directory file:/Users/ptomblin/nutch-1.0/crawl.blog and recrawl; the output of the fresh run should then show you exactly where the parse phase fails.
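If you want to confirm which segments are missing parse output before deleting anything, something along these lines should work (just a rough sketch, using the paths from your log):

    for s in /Users/ptomblin/nutch-1.0/crawl.blog/segments/*; do
        [ -d "$s/parse_data" ] || echo "missing parse_data: $s"
    done

Then remove the directory and rerun your original crawl command:

    rm -rf /Users/ptomblin/nutch-1.0/crawl.blog
    bin/nutch crawl urls -dir crawl.blog -depth 10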
Xiao

On Fri, Jul 24, 2009 at 10:53 PM, Paul Tomblin<[email protected]> wrote:
> I installed nutch 1.0 on my laptop last night and set it running to crawl my
> blog with the command: bin/nutch crawl urls -dir crawl.blog -depth 10
> It was still running strong when I went to bed several hours later, and this
> morning I woke up to this:
>
> activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl.blog/crawldb
> CrawlDb update: segments: [crawl.blog/segments/20090724010303]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> LinkDb: starting
> LinkDb: linkdb: crawl.blog/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155106
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155122
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155303
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155812
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723161808
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723171215
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723193543
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723224936
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724004250
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724010303
> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
> Input path does not exist:
> file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530/parse_data
>         at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
>         at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
>         at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
>         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
>         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)
>
>
> --
> http://www.linkedin.com/in/paultomblin
