Hi Patricio,

It seems to be quite a lot, but whether it is enough depends on your data size.
Regards,
Arkadi

> -----Original Message-----
> From: Patricio Galeas [mailto:pgal...@yahoo.de]
> Sent: Sunday, March 21, 2010 1:40 AM
> To: nutch-user@lucene.apache.org
> Subject: Re: invertlinks: Input path does not exist
>
> Hello Arkadi,
>
> I ran the crawl with hadoop.tmp.dir set to a partition with 1.5 TB.
> Do you think that is enough space for a web crawl?
>
> Thanks
> Pato
>
>
> ----- Original Message ----
> From: "arkadi.kosmy...@csiro.au" <arkadi.kosmy...@csiro.au>
> To: nutch-user@lucene.apache.org
> Sent: Friday, March 19, 2010, 3:56:23
> Subject: RE: invertlinks: Input path does not exist
>
> I had similar problems caused by lack of space in the temp directory. To
> solve this, I edited hadoop-site.xml and set hadoop.tmp.dir to a
> directory with plenty of space.
>
> > -----Original Message-----
> > From: kevin chen [mailto:kevinc...@bdsing.com]
> > Sent: Friday, March 19, 2010 1:42 PM
> > To: nutch-user@lucene.apache.org
> > Subject: Re: invertlinks: Input path does not exist
> >
> > Sounds like the last segment is corrupted.
> > Did you try to remove the last segment?
> >
> > On Wed, 2010-03-17 at 16:10 +0000, Patricio Galeas wrote:
> > > Hello all,
> > >
> > > I am crawling the web using the LanguageIdentifier plugin, but I get
> > > an error when running nutch invertlinks. The error always occurs
> > > while processing the last segment (20100317010313-81).
> > >
> > > The problem is the same as described in
> > > http://www.mail-archive.com/nutch-user@lucene.apache.org/msg14776.html
> > > With both syntax variants of invertlinks I get the same error:
> > > a) nutch invertlinks crawl/linkdb -dir crawl/segments
> > > b) nutch invertlinks crawl/linkdb crawl/segments/*
> > >
> > > I applied https://issues.apache.org/jira/browse/NUTCH-356 to avoid
> > > some Java heap problems when using the Language Identifier, but I
> > > got the same error. ;-(
> > >
> > > I set NUTCH_HEAPSIZE to 6000 (the physical memory) and I merged the
> > > segments using slice=50000.
> > >
> > > Any idea where to look?
> > >
> > > Thanks
> > > Pato
> > >
> > > --------------------hadoop.log----------------------------------
> > > ..
> > > 2010-03-17 02:33:25,107 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-47
> > > 2010-03-17 02:33:25,107 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-68
> > > 2010-03-17 02:33:25,108 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-56
> > > 2010-03-17 02:33:25,108 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-12
> > > 2010-03-17 02:33:25,108 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-26
> > > 2010-03-17 02:33:25,109 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-73
> > > 2010-03-17 02:33:25,109 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-59
> > > 2010-03-17 02:33:25,109 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-30
> > > 2010-03-17 02:33:25,110 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-2
> > > 2010-03-17 02:33:25,110 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-34
> > > 2010-03-17 02:33:25,111 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-52
> > > 2010-03-17 02:33:25,111 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-29
> > > 2010-03-17 02:33:25,111 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-24
> > > 2010-03-17 02:33:25,610 FATAL crawl.LinkDb - LinkDb: org.apache.hadoop.mapred.InvalidInputException:
> > > Input path does not exist: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-81/parse_data
> > >     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
> > >     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
> > >     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
> > >     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
> > >     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
> > >     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
> > >     at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
> > >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >     at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)
> > >
> > > __________________________________________________
> > > Do You Yahoo!?
> > > Tired of spam? Yahoo! Mail features outstanding protection against
> > > mass mailings. http://mail.yahoo.com
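[Editor's note] The FATAL entry in the log above shows why invertlinks dies: one segment (20100317010313-81) has no parse_data directory, so the job's input path does not exist. A minimal shell sketch for spotting such segments before running invertlinks — the helper name find_bad_segments is my own, not part of Nutch:

```shell
# Print every segment directory under $1 that lacks a parse_data
# subdirectory (the condition that makes invertlinks fail above).
find_bad_segments() {
  for seg in "$1"/*; do
    [ -d "$seg" ] || continue            # skip non-directories
    [ -d "$seg/parse_data" ] || echo "$seg"
  done
}

# Example: find_bad_segments crawl/segments
```

Any segment it prints can be moved aside (e.g. out of crawl/segments) before rerunning nutch invertlinks crawl/linkdb crawl/segments/*, as kevin chen suggests. And per Arkadi's advice, check that hadoop.tmp.dir in hadoop-site.xml points at a partition with enough free space: a full temp partition during parsing is one plausible way a segment ends up half-written in the first place.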