That worked perfectly! Thanks

On Thu, Sep 3, 2009 at 12:03 PM, Julien Nioche <[email protected]> wrote:
> Haven't checked but I expect the correct command to be:
>
>   /usr/local/nutch/bin/nutch invertlinks crawl/linkdb -dir crawl/segments
>
> without the trailing /*
>
> J.
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
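
A guess at what is going on, based on the .../parse_data/parse_data paths in the error output further down: with the trailing /*, the shell expands the glob before Nutch ever sees it, so -dir ends up pointing at a segment directory itself (e.g. crawl/segments/20090903093154), and LinkDb then treats that segment's subdirectories (parse_data, content, crawl_fetch, ...) as if they were segments and looks for a parse_data inside each of them. A quick way to see what the tool would actually receive (assuming the crawl/segments layout produced by the script below):

  # Show the argument list after the shell has expanded the glob:
  echo /usr/local/nutch/bin/nutch invertlinks crawl/linkdb -dir crawl/segments/*
  # ...versus the suggested form, where -dir keeps pointing at the parent directory:
  echo /usr/local/nutch/bin/nutch invertlinks crawl/linkdb -dir crawl/segments
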
> 2009/9/3 Tom Gardner <[email protected]>
>
> > Hello,
> >
> > I'm trying to get whole-web crawling working, but I'm getting this error
> > in the final indexing steps:
> >
> > LinkDb: starting
> > LinkDb: linkdb: crawl/linkdb
> > LinkDb: URL normalize: true
> > LinkDb: URL filter: true
> > LinkDb: adding segment: file:/data/crawl/segments/20090903093154/parse_data
> > LinkDb: adding segment: file:/data/crawl/segments/20090903093154/parse_text
> > LinkDb: adding segment: file:/data/crawl/segments/20090903093154/crawl_fetch
> > LinkDb: adding segment: file:/data/crawl/segments/20090903093154/crawl_generate
> > LinkDb: adding segment: file:/data/crawl/segments/20090903093154/content
> > LinkDb: adding segment: file:/data/crawl/segments/20090903093154/crawl_parse
> > LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/data/crawl/segments/20090903093154/parse_data/parse_data
> > Input path does not exist: file:/data/crawl/segments/20090903093154/parse_text/parse_data
> >
> > Can anyone help? This is running Nutch 1.0 on a clean Nutch install on
> > Fedora Linux.
> >
> > I've verified the same error using the nutch-2009-09-03_05-18-47 release
> > as well.
> >
> > My script and full error output are below.
> >
> > Thanks
> >
> > ----------------------------------- Nutch Script -----------------------------------
> >
> > #!/bin/bash
> > export JAVA_HOME=/usr/local/jdk
> >
> > # Clean up from the last run
> > /bin/rm -rf crawl seed
> > mkdir seed
> > # Copy the list of URLs to the seed directory
> > cp urls seed/urls.txt
> > # Inject the URLs in the 'seed' directory into the crawldb
> > /usr/local/nutch/bin/nutch inject crawl/crawldb seed
> > # Generate the fetch list, then fetch and parse content
> > /usr/local/nutch/bin/nutch generate crawl/crawldb crawl/segments
> > echo DONE GENERATE 1
> > # The above command generates a new segment directory under
> > # crawl/segments that at this point contains files storing the
> > # URL(s) to be fetched. The following commands need the latest
> > # segment directory as a parameter, so we'll store it in an
> > # environment variable:
> > SEGMENT=`ls -d crawl/segments/2* | tail -1`
> > echo SEGMENT 1: $SEGMENT
> > # Now launch the fetcher that actually goes to get the content
> > /usr/local/nutch/bin/nutch fetch $SEGMENT -noParsing
> > echo DONE FETCH 1
> > # Next, parse the content
> > /usr/local/nutch/bin/nutch parse $SEGMENT
> > echo DONE PARSE 1
> > # Then update the Nutch crawldb. The updatedb command stores all new
> > # URLs discovered during the fetch and parse of the previous segment
> > # into the Nutch database so they can be fetched later. Nutch also
> > # stores information about the pages that were fetched, so the same
> > # URLs won't be fetched again and again.
> > /usr/local/nutch/bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
> > echo DONE UPDATEDB 1
> > # Now the database has entries for all of the pages referenced by the
> > # initial set.
> >
> > # Now we fetch a new segment with the top-scoring 1000 pages
> > # /usr/local/nutch/bin/nutch generate crawl/crawldb crawl/segments -topN 1000
> > /usr/local/nutch/bin/nutch generate crawl/crawldb crawl/segments
> > echo DONE GENERATE 2
> > # Reset SEGMENT
> > SEGMENT=`ls -d crawl/segments/2* | tail -1`
> > echo SEGMENT 2: $SEGMENT
> > # Now re-launch the fetcher to get the content
> > /usr/local/nutch/bin/nutch fetch $SEGMENT -noParsing
> > echo DONE FETCH 2
> > # Next, parse the content
> > /usr/local/nutch/bin/nutch parse $SEGMENT
> > echo DONE PARSE 2
> > # Update the db
> > /usr/local/nutch/bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
> > echo DONE UPDATE 2
> >
> > # Fetch another round
> > # /usr/local/nutch/bin/nutch generate crawl/crawldb crawl/segments -topN 1000
> > /usr/local/nutch/bin/nutch generate crawl/crawldb crawl/segments
> > echo DONE GENERATE 3
> > # Reset SEGMENT
> > SEGMENT=`ls -d crawl/segments/2* | tail -1`
> > echo SEGMENT 3: $SEGMENT
> > # Now re-launch the fetcher to get the content
> > /usr/local/nutch/bin/nutch fetch $SEGMENT -noParsing
> > echo DONE FETCH 3
> > # Next, parse the content
> > /usr/local/nutch/bin/nutch parse $SEGMENT
> > echo DONE PARSE 3
> > # Update the db
> > /usr/local/nutch/bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
> > echo DONE UPDATEDB 3
> >
> > #
> > # We now index what we've gotten
> > #
> > # Before indexing, we first invert all of the links so that we may
> > # index incoming anchor text with the pages.
> > /usr/local/nutch/bin/nutch invertlinks crawl/linkdb -dir crawl/segments/*
> > # Then index
> > /usr/local/nutch/bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
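
Side note, not part of the fix: the three hand-unrolled rounds above could be collapsed into a loop. A sketch using only commands already present in the script; the NUTCH shorthand variable and the fixed round count of 3 are assumptions added here:

  #!/bin/bash
  export JAVA_HOME=/usr/local/jdk
  NUTCH=/usr/local/nutch/bin/nutch   # shorthand for the binary the script already calls

  for ROUND in 1 2 3; do
    # Generate a fetch list from the current crawldb
    $NUTCH generate crawl/crawldb crawl/segments
    # Pick up the segment directory that generate just created
    SEGMENT=`ls -d crawl/segments/2* | tail -1`
    echo ROUND $ROUND SEGMENT: $SEGMENT
    # Fetch, parse, and fold the results back into the crawldb
    $NUTCH fetch $SEGMENT -noParsing
    $NUTCH parse $SEGMENT
    $NUTCH updatedb crawl/crawldb $SEGMENT -filter -normalize
  done
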
> > ----------------------------------- Nutch Errors -----------------------------------
> >
> > -finishing thread FetcherThread, activeThreads=9
> > -finishing thread FetcherThread, activeThreads=8
> > -finishing thread FetcherThread, activeThreads=7
> > -finishing thread FetcherThread, activeThreads=6
> > -finishing thread FetcherThread, activeThreads=5
> > -finishing thread FetcherThread, activeThreads=4
> > -finishing thread FetcherThread, activeThreads=3
> > -finishing thread FetcherThread, activeThreads=2
> > -activeThreads=2, spinWaiting=2, fetchQueues.totalSize=0
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=0
> > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > -activeThreads=0
> > Fetcher: done
> > DONE FETCH 3
> > DONE PARSE 3
> > CrawlDb update: starting
> > CrawlDb update: db: crawl/crawldb
> > CrawlDb update: segments: [crawl/segments/20090903093336]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: true
> > CrawlDb update: URL filtering: true
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: done
> > DONE UPDATEDB 3
> > LinkDb: starting
> > LinkDb: linkdb: crawl/linkdb
> > LinkDb: URL normalize: true
> > LinkDb: URL filter: true
> > LinkDb: adding segment: file:/data/crawl/segments/20090903093154/parse_data
> > LinkDb: adding segment: file:/data/crawl/segments/20090903093154/parse_text
> > LinkDb: adding segment: file:/data/crawl/segments/20090903093154/crawl_fetch
> > LinkDb: adding segment: file:/data/crawl/segments/20090903093154/crawl_generate
> > LinkDb: adding segment: file:/data/crawl/segments/20090903093154/content
> > LinkDb: adding segment: file:/data/crawl/segments/20090903093154/crawl_parse
> > LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/data/crawl/segments/20090903093154/parse_data/parse_data
> > Input path does not exist: file:/data/crawl/segments/20090903093154/parse_text/parse_data
> > Input path does not exist: file:/data/crawl/segments/20090903093154/crawl_fetch/parse_data
> > Input path does not exist: file:/data/crawl/segments/20090903093154/crawl_generate/parse_data
> > Input path does not exist: file:/data/crawl/segments/20090903093154/content/parse_data
> > Input path does not exist: file:/data/crawl/segments/20090903093154/crawl_parse/parse_data
> >     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
> >     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
> >     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
> >     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
> >     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
> >     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
> >     at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >     at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)
> > Indexer: starting
> > Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/data/crawl/linkdb/current
> >     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
> >     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
> >     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
> >     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
> >     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
> >     at org.apache.nutch.indexer.Indexer.index(Indexer.java:72)
> >     at org.apache.nutch.indexer.Indexer.run(Indexer.java:92)
> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >     at org.apache.nutch.indexer.Indexer.main(Indexer.java:101)
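
For completeness: the second stack trace looks like fallout from the first. Because invertlinks failed, crawl/linkdb/current was never written, so the index step has nothing to read. A sketch of the corrected ending of the script, per Julien's suggestion; the ls line is only an assumed sanity check and is not part of the original script:

  # Build the linkdb from all segments under crawl/segments (no trailing /*)
  /usr/local/nutch/bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  # Assumed sanity check: the linkdb data should now exist
  ls crawl/linkdb/current
  # Then index as before
  /usr/local/nutch/bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
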
