Well, do you have enough space in the filesystem where Nutch is installed? I noticed that Nutch creates some temp files in the Nutch default directory; they are not all under the hadoop.tmp.dir location (not sure if that is a bug or not). I moved my Nutch installation to a bigger filesystem to avoid this potential problem.
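As a quick sanity check (the paths below are just the ones from your setup further down in this thread; adjust them if they differ), something like this would show whether the Nutch install directory, the hadoop.tmp.dir location and your home directory still have free space:

  df -h /webroot/oscrawlers/nutch       # Nutch install / default working dir
  df -h /webroot/oscrawlers/nutch/tmp   # hadoop.tmp.dir from your hadoop-site.xml
  df -h ~                               # home dir, where linkdb-<number> appeared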
I assume you are using a local filesystem and not the Hadoop distributed mode.

2009/7/16 Doğacan Güney <[email protected]>

> On Thu, Jul 16, 2009 at 17:25, Jake Jacobson <[email protected]> wrote:
> > Thanks, but isn't there an option to tell nutch where to write these
> > files to?
> >
>
> There is an option for where to write temporary mapred files, but not (for
> the most part) where to write output files for jobs. However, you can
> change the nutch code to write linkdb-<number> to another directory (take
> a look at the LinkDb#createJob method).
>
> > Jake Jacobson
> >
> > http://www.linkedin.com/in/jakejacobson
> > http://www.facebook.com/jakecjacobson
> > http://twitter.com/jakejacobson
> >
> > Our greatest fear should not be of failure,
> > but of succeeding at something that doesn't really matter.
> > -- ANONYMOUS
> >
> > 2009/7/16 Doğacan Güney <[email protected]>:
> >> On Wed, Jul 15, 2009 at 15:41, Jake Jacobson <[email protected]> wrote:
> >>> Did this with the same results.
> >>>
> >>> In my home directory a directory named "linkdb-1292468754" was
> >>> created, which caused the process to run out of disk space.
> >>>
> >>
> >> linkdb-<number> is not a temporary linkdb. There are two jobs that run
> >> when you run invertlinks. First is the inverting of new segments (which
> >> creates the output dir linkdb-<number>). Then the new linkdb and the
> >> old one are merged.
> >>
> >> I suggest playing with the hadoop compress options. They are discussed
> >> in another mail in this list (chronologically just a few emails down).
> >>
> >>> In hadoop-site.xml I have this set up:
> >>>
> >>> <configuration>
> >>>   <property>
> >>>     <name>hadoop.tmp.dir</name>
> >>>     <value>/webroot/oscrawlers/nutch/tmp/</value>
> >>>     <description>A base for other temporary directories.</description>
> >>>   </property>
> >>> </configuration>
> >>>
> >>> I am using the following command line options to run Nutch 1.0:
> >>>
> >>> /webroot/oscrawlers/nutch/bin/nutch crawl
> >>>   /webroot/oscrawlers/nutch/urls/seed.txt -dir
> >>>   /webroot/oscrawlers/nutch/crawl -depth 10 >&
> >>>   /webroot/oscrawlers/nutch/logs/crawl_log.txt
> >>>
> >>> In my log file I see this error message:
> >>>
> >>> LinkDb: adding segment:
> >>> file:/webroot/oscrawlers/nutch/crawl/segments/20090714095100
> >>> Exception in thread "main" java.io.IOException: Job failed!
> >>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
> >>>         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
> >>>         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
> >>>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)
> >>>
> >>> Jake Jacobson
> >>>
> >>> http://www.linkedin.com/in/jakejacobson
> >>> http://www.facebook.com/jakecjacobson
> >>> http://twitter.com/jakejacobson
> >>>
> >>> Our greatest fear should not be of failure,
> >>> but of succeeding at something that doesn't really matter.
> >>> -- ANONYMOUS
> >>>
> >>> On Mon, Jul 13, 2009 at 9:00 AM, SunGod <[email protected]> wrote:
> >>>> if you use hadoop to run nutch,
> >>>>
> >>>> please add
> >>>>
> >>>> <property>
> >>>>   <name>hadoop.tmp.dir</name>
> >>>>   <value>/youtempfs/hadoop-${user.name}</value>
> >>>>   <description>A base for other temporary directories.</description>
> >>>> </property>
> >>>>
> >>>> to your hadoop-site.xml
> >>>>
> >>>> 2009/7/13 Jake Jacobson <[email protected]>
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I have tried to run nutch 1.0 several times and it fails due to lack
> >>>>> of disk space.
> >>>>> I have defined the crawl to place all files on a disk that has
> >>>>> plenty of space, but when it starts building the linkdb it wants to
> >>>>> put temp files in the home dir, which doesn't have enough space. How
> >>>>> can I force Nutch not to do this?
> >>>>>
> >>>>> Jake Jacobson
> >>>>>
> >>>>> http://www.linkedin.com/in/jakejacobson
> >>>>> http://www.facebook.com/jakecjacobson
> >>>>> http://twitter.com/jakejacobson
> >>>>>
> >>>>> Our greatest fear should not be of failure,
> >>>>> but of succeeding at something that doesn't really matter.
> >>>>> -- ANONYMOUS
> >>>>>
> >>>>
> >>>
> >>
> >>
> >> --
> >> Doğacan Güney
> >>
> >
> >
> --
> Doğacan Güney

--
-MilleBii-
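A follow-up on Doğacan's suggestion to change where LinkDb#createJob writes the temporary linkdb-<number> output: below is a minimal sketch of that kind of change, not the actual Nutch 1.0 code. The "linkdb.temp.dir" property name and the LinkDbTempDirSketch class are invented for illustration; the real createJob method differs in its details.

import java.util.Random;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

// Sketch only: "linkdb.temp.dir" is an invented property, not an existing
// Nutch option. The idea is to resolve the temporary linkdb-<number> output
// under a configurable base directory instead of the default working directory.
public class LinkDbTempDirSketch {

  static void setTempOutputDir(JobConf job) {
    // Fall back to the current directory, which is roughly the stock behaviour.
    Path base = new Path(job.get("linkdb.temp.dir", "."));
    Path newLinkDb = new Path(base,
        "linkdb-" + Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
    FileOutputFormat.setOutputPath(job, newLinkDb);
  }
}

As for the compression suggestion, the usual knobs in the Hadoop version that Nutch 1.0 ships with are mapred.compress.map.output and mapred.output.compress in hadoop-site.xml; they shrink the intermediate and output data rather than moving it elsewhere.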
