Hello, I'm trying to get whole-web crawling working, but the final indexing steps fail with this error (full output below):

LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/data/crawl/segments/20090903093154/parse_data/parse_data
Input path does not exist: file:/data/crawl/segments/20090903093154/parse_text/parse_data
Can anyone help? This is running Nutch 1.0 from a clean install on Fedora Linux. I've verified the same error with the nutch-2009-09-03_05-18-47 release as well. My script and the full error output are below. Thanks.

--------------------------------------------------------------------
Nutch Script
--------------------------------------------------------------------

#!/bin/bash

export JAVA_HOME=/usr/local/jdk

# Clean up from the last run
/bin/rm -rf crawl seed
mkdir seed

# Copy the list of urls to the seed directory
cp urls seed/urls.txt

# Inject the urls in the 'seed' directory into the crawldb
/usr/local/nutch/bin/nutch inject crawl/crawldb seed

# Generate the fetch list
/usr/local/nutch/bin/nutch generate crawl/crawldb crawl/segments
echo DONE GENERATE 1

# The command above creates a new segment directory under
# crawl/segments that at this point contains the files storing the
# url(s) to be fetched. The following commands need the latest
# segment directory as a parameter, so we'll store it in an
# environment variable:
SEGMENT=`ls -d crawl/segments/2* | tail -1`
echo SEGMENT 1: $SEGMENT

# Now launch the fetcher that actually goes and gets the content
/usr/local/nutch/bin/nutch fetch $SEGMENT -noParsing
echo DONE FETCH 1

# Next, parse the content
/usr/local/nutch/bin/nutch parse $SEGMENT
echo DONE PARSE 1

# Then update the Nutch crawldb. The updatedb command stores all new
# urls discovered during the fetch and parse of the previous segment
# in the Nutch database so they can be fetched later. Nutch also
# stores information about the pages that were fetched, so the same
# urls won't be fetched again and again.
/usr/local/nutch/bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
echo DONE UPDATEDB 1

# The database now has entries for all of the pages referenced by the
# initial set.

# Fetch a new segment (uncomment the -topN variant to limit it to the
# top-scoring 1000 pages).
# /usr/local/nutch/bin/nutch generate crawl/crawldb crawl/segments -topN 1000
/usr/local/nutch/bin/nutch generate crawl/crawldb crawl/segments
echo DONE GENERATE 2

# Reset SEGMENT to the latest segment directory
SEGMENT=`ls -d crawl/segments/2* | tail -1`
echo SEGMENT 2: $SEGMENT

# Re-launch the fetcher to get the content
/usr/local/nutch/bin/nutch fetch $SEGMENT -noParsing
echo DONE FETCH 2

# Next, parse the content
/usr/local/nutch/bin/nutch parse $SEGMENT
echo DONE PARSE 2

# Update the db
/usr/local/nutch/bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
echo DONE UPDATEDB 2

# Fetch another round
# /usr/local/nutch/bin/nutch generate crawl/crawldb crawl/segments -topN 1000
/usr/local/nutch/bin/nutch generate crawl/crawldb crawl/segments
echo DONE GENERATE 3

# Reset SEGMENT
SEGMENT=`ls -d crawl/segments/2* | tail -1`
echo SEGMENT 3: $SEGMENT

# Re-launch the fetcher to get the content
/usr/local/nutch/bin/nutch fetch $SEGMENT -noParsing
echo DONE FETCH 3

# Next, parse the content
/usr/local/nutch/bin/nutch parse $SEGMENT
echo DONE PARSE 3

# Update the db
/usr/local/nutch/bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
echo DONE UPDATEDB 3

#
# We now index what we've gotten.
#
# Before indexing we first invert all of the links, so that we may
# index incoming anchor text with the pages.
/usr/local/nutch/bin/nutch invertlinks crawl/linkdb -dir crawl/segments/*

# Then index
/usr/local/nutch/bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
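In case it helps anyone reproduce this, the quickest way I know to see what a segment contains after the fetch and parse steps is simply to list it; the expected subdirectory names below are the same ones LinkDb itself reports in the output further down, so this is just a manual spot-check and not part of the script above:

# Quick check of the newest segment's layout (not part of the script above)
SEGMENT=`ls -d crawl/segments/2* | tail -1`
ls $SEGMENT
# I would expect: content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text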
--------------------------------------------------------------------
Nutch Errors
--------------------------------------------------------------------

-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: done
DONE FETCH 3
DONE PARSE 3
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20090903093336]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
DONE UPDATEDB 3
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/parse_data
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/parse_text
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/crawl_fetch
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/crawl_generate
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/content
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/crawl_parse
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/data/crawl/segments/20090903093154/parse_data/parse_data
Input path does not exist: file:/data/crawl/segments/20090903093154/parse_text/parse_data
Input path does not exist: file:/data/crawl/segments/20090903093154/crawl_fetch/parse_data
Input path does not exist: file:/data/crawl/segments/20090903093154/crawl_generate/parse_data
Input path does not exist: file:/data/crawl/segments/20090903093154/content/parse_data
Input path does not exist: file:/data/crawl/segments/20090903093154/crawl_parse/parse_data
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
        at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)
Indexer: starting
Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/data/crawl/linkdb/current
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:72)
        at org.apache.nutch.indexer.Indexer.run(Indexer.java:92)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.Indexer.main(Indexer.java:101)
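One thing I noticed while staring at the output: LinkDb seems to be treating each subdirectory of segment 20090903093154 as if it were a segment of its own (it is looking for parse_data/parse_data, crawl_fetch/parse_data, and so on). I wonder whether the shell glob on my invertlinks line is involved, since bash expands it before nutch ever sees the arguments. Roughly (second segment name taken from the CrawlDb output above, the rest elided):

# What the script says:
/usr/local/nutch/bin/nutch invertlinks crawl/linkdb -dir crawl/segments/*
# What nutch actually receives after bash expands the glob, more or less:
#   invertlinks crawl/linkdb -dir crawl/segments/20090903093154 crawl/segments/20090903093336 ...
# i.e. -dir ends up pointing at a single segment directory rather than the
# parent crawl/segments directory.

Is that a plausible cause, or is something else going on here?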
