On Thu, Jul 16, 2009 at 17:25, Jake Jacobson<[email protected]> wrote:
> Thanks but isn't there an option to tell nutch where to write these files to?
>

There is an option to write temporary mapred files but not (for the most part)
where to write output files for jobs. However, you can change nutch code to
write linkdb-<number> to another directory (take a look at the LinkDb#createJob
method).
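Roughly, the change would be something like this (just a sketch from memory,
not a tested patch -- check it against the LinkDb.java in your tree; reusing
hadoop.tmp.dir here is only an example, any directory with enough space will do):

    import java.util.Random;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileOutputFormat;

    // inside LinkDb#createJob, using the Configuration ("config") and
    // JobConf ("job") that the method already sets up:

    // anchor the temporary linkdb-<number> under hadoop.tmp.dir instead of
    // a relative path (a relative path resolves against the working
    // directory, i.e. your home dir when you launch the crawl from there)
    Path tmpBase = new Path(config.get("hadoop.tmp.dir"));
    Path newLinkDb = new Path(tmpBase,
        "linkdb-" + Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

    FileOutputFormat.setOutputPath(job, newLinkDb);

If I remember right, the finished linkdb still gets installed into the location
you pass on the command line; only this intermediate output moves.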
> Jake Jacobson
>
> http://www.linkedin.com/in/jakejacobson
> http://www.facebook.com/jakecjacobson
> http://twitter.com/jakejacobson
>
> Our greatest fear should not be of failure,
> but of succeeding at something that doesn't really matter.
>   -- ANONYMOUS
>
>
>
> 2009/7/16 Doğacan Güney <[email protected]>:
>> On Wed, Jul 15, 2009 at 15:41, Jake Jacobson<[email protected]> wrote:
>>> Did this with the same results.
>>>
>>> In my home directory I had a directory name "linkdb-1292468754"
>>> created with caused the process to run out of disk space.
>>>
>>
>> linkdb-<number> is not a temporary linkdb. There are two jobs that run
>> when you run invertlinks. First is the inverting of new segments (which
>> creates the output dir linkdb-<number>). Then new linkdb and old one is
>> merged.
>>
>> I suggest playing with the hadoop compress options. It is discussed in
>> another mail in this list (chronologically just a few email down).
>>
>>> In the hadoop-site.xml I have this set up
>>>
>>> <configuration>
>>>   <property>
>>>     <name>hadoop.tmp.dir</name>
>>>     <value>/webroot/oscrawlers/nutch/tmp/</value>
>>>     <description>A base for other temporary
>>>     directories.</description>
>>>   </property>
>>> </configuration>
>>>
>>> I am using the following command line options to run Nutch 1.0
>>>
>>> /webroot/oscrawlers/nutch/bin/nutch crawl
>>> /webroot/oscrawlers/nutch/urls/seed.txt -dir
>>> /webroot/oscrawlers/nutch/crawl -depth 10 >&
>>> /webroot/oscrawlers/nutch/logs/crawl_log.txt
>>>
>>> In my log file I see this error message:
>>>
>>> LinkDb: adding segment:
>>> file:/webroot/oscrawlers/nutch/crawl/segments/20090714095100
>>> Exception in thread "main" java.io.IOException: Job failed!
>>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
>>>         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
>>>         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
>>>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)
>>>
>>> Jake Jacobson
>>>
>>> http://www.linkedin.com/in/jakejacobson
>>> http://www.facebook.com/jakecjacobson
>>> http://twitter.com/jakejacobson
>>>
>>> Our greatest fear should not be of failure,
>>> but of succeeding at something that doesn't really matter.
>>>   -- ANONYMOUS
>>>
>>>
>>>
>>> On Mon, Jul 13, 2009 at 9:00 AM, SunGod<[email protected]> wrote:
>>>> if you use hadoop run nutch
>>>>
>>>> please add
>>>>
>>>> <property>
>>>>   <name>hadoop.tmp.dir</name>
>>>>   <value>/youtempfs/hadoop-${user.name}</value>
>>>>   <description>A base for other temporary directories.</description>
>>>> </property>
>>>>
>>>> to you hadoop-site.xml
>>>>
>>>> 2009/7/13 Jake Jacobson <[email protected]>
>>>>
>>>>> Hi,
>>>>>
>>>>> I have tried to run nutch 1.0 several times and it fails due to lack
>>>>> of disk space. I have defined the crawl to place all files on a disk
>>>>> that has plenty of space but when it starts building the linkdb it
>>>>> wants to put temp files in the home dir which doesn't have enough
>>>>> space. How can I force Nutch not to do this?
>>>>>
>>>>> Jake Jacobson
>>>>>
>>>>> http://www.linkedin.com/in/jakejacobson
>>>>> http://www.facebook.com/jakecjacobson
>>>>> http://twitter.com/jakejacobson
>>>>>
>>>>> Our greatest fear should not be of failure,
>>>>> but of succeeding at something that doesn't really matter.
>>>>>   -- ANONYMOUS
>>>>>
>>>>
>>>
>>
>>
>>
>> --
>> Doğacan Güney
>>
>

--
Doğacan Güney
