Hi Patricio,

It seems to be quite a lot, but whether it is enough depends on your data size.
Regards,
Arkadi

> -----Original Message-----
> From: Patricio Galeas [mailto:pgal...@yahoo.de]
> Sent: Sunday, March 21, 2010 1:40 AM
> To: nutch-user@lucene.apache.org
> Subject: Re: invertlinks: Input path does not exist
>
> Hello Arkadi,
>
> I ran the crawl with hadoop.tmp.dir set to a partition with 1.5 TB.
> Do you think that is enough space for a web crawl?
>
> Thanks
> Pato
>
>
> ----- Original Message ----
> From: "arkadi.kosmy...@csiro.au" <arkadi.kosmy...@csiro.au>
> To: nutch-user@lucene.apache.org
> Sent: Friday, March 19, 2010, 3:56:23
> Subject: RE: invertlinks: Input path does not exist
>
> I had similar problems caused by lack of space in the temp directory. To
> solve this, I edited hadoop-site.xml and set hadoop.tmp.dir to a
> directory with plenty of space.
>
> > -----Original Message-----
> > From: kevin chen [mailto:kevinc...@bdsing.com]
> > Sent: Friday, March 19, 2010 1:42 PM
> > To: nutch-user@lucene.apache.org
> > Subject: Re: invertlinks: Input path does not exist
> >
> > Sounds like the last segment is corrupted.
> > Did you try to remove the last segment?
> >
> > On Wed, 2010-03-17 at 16:10 +0000, Patricio Galeas wrote:
> > > Hello all,
> > >
> > > I am crawling the web using the LanguageIdentifier plugin, but I get
> > > an error when running nutch invertlinks. The error always occurs
> > > while processing the last segment (20100317010313-81).
> > >
> > > The problem is the same as described in
> > > http://www.mail-archive.com/nutch-user@lucene.apache.org/msg14776.html
> > > With both syntax variants of invertlinks I get the same error:
> > > a) nutch invertlinks crawl/linkdb -dir crawl/segments
> > > b) nutch invertlinks crawl/linkdb crawl/segments/*
> > >
> > > I applied https://issues.apache.org/jira/browse/NUTCH-356 to avoid
> > > some Java heap problems when using the Language Identifier, but I
> > > got the same error. ;-(
> > >
> > > I set NUTCH_HEAPSIZE to 6000 (the physical memory) and I merged the
> > > segments using slice=50000.
> > >
> > > Any idea where to look?
> > >
> > > Thanks
> > > Pato
> > >
> > > --------------------hadoop.log----------------------------------
> > > ..
> > > 2010-03-17 02:33:25,107 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-47
> > > 2010-03-17 02:33:25,107 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-68
> > > 2010-03-17 02:33:25,108 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-56
> > > 2010-03-17 02:33:25,108 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-12
> > > 2010-03-17 02:33:25,108 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-26
> > > 2010-03-17 02:33:25,109 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-73
> > > 2010-03-17 02:33:25,109 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-59
> > > 2010-03-17 02:33:25,109 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-30
> > > 2010-03-17 02:33:25,110 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-2
> > > 2010-03-17 02:33:25,110 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-34
> > > 2010-03-17 02:33:25,111 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-52
> > > 2010-03-17 02:33:25,111 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-29
> > > 2010-03-17 02:33:25,111 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-24
> > > 2010-03-17 02:33:25,610 FATAL crawl.LinkDb - LinkDb: org.apache.hadoop.mapred.InvalidInputException:
> > > Input path does not exist: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-81/parse_data
> > >     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
> > >     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
> > >     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
> > >     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
> > >     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
> > >     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
> > >     at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
> > >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >     at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)
> > >
> > > __________________________________________________
> > > Do You Yahoo!?
> > > Tired of spam? Yahoo! Mail features outstanding protection against
> > > mass mailings. http://mail.yahoo.com
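[Editor's note] The FATAL entry in the log above shows why invertlinks dies: one segment (20100317010313-81) has no parse_data directory, so the job's input path does not exist. A minimal shell sketch for spotting such segments before running invertlinks — the helper name find_bad_segments is my own, not part of Nutch:

```shell
# Print every segment directory under $1 that lacks a parse_data
# subdirectory (the condition that makes invertlinks fail above).
find_bad_segments() {
  for seg in "$1"/*; do
    [ -d "$seg" ] || continue            # skip non-directories
    [ -d "$seg/parse_data" ] || echo "$seg"
  done
}

# Example: find_bad_segments crawl/segments
```

Any segment it prints can be moved aside (e.g. out of crawl/segments) before rerunning nutch invertlinks crawl/linkdb crawl/segments/*, as kevin chen suggests. And per Arkadi's advice, check that hadoop.tmp.dir in hadoop-site.xml points at a partition with enough free space: a full temp partition during parsing is one plausible way a segment ends up half-written in the first place.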