Specify temp/working directory for crawl
----------------------------------------

         Key: NUTCH-159
         URL: http://issues.apache.org/jira/browse/NUTCH-159
     Project: Nutch
        Type: Bug
  Components: fetcher, indexer  
    Versions: 0.8-dev    
 Environment: Linux/Debian
    Reporter: byron miller


I ran a crawl of 100k web pages and got:

org.apache.nutch.fs.FSError: java.io.IOException: No space left on device
        at org.apache.nutch.fs.LocalFileSystem$LocalNFSFileOutputStream.write(LocalFileSystem.java:149)
        at org.apache.nutch.fs.FileUtil.copyContents(FileUtil.java:65)
        at org.apache.nutch.fs.LocalFileSystem.renameRaw(LocalFileSystem.java:178)
        at org.apache.nutch.fs.NutchFileSystem.rename(NutchFileSystem.java:224)
        at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:80)
Caused by: java.io.IOException: No space left on device
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:260)
        at org.apache.nutch.fs.LocalFileSystem$LocalNFSFileOutputStream.write(LocalFileSystem.java:147)
        ... 4 more
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
        at org.apache.nutch.crawl.Fetcher.fetch(Fetcher.java:335)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:107)
[EMAIL PROTECTED]:/data/nutch$ df -k


It appears the crawl created a /tmp/nutch directory that filled up, even though
I specified a db directory.

Nutch needs either a command-line parameter or a globally configurable temp
(work area) directory so that crawls won't fail this way.
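
As a rough illustration of the request, the sketch below resolves a work
directory from a configuration property and falls back to the JVM temp
directory, then checks free space before a job would start. It assumes plain
java.util.Properties rather than Nutch's own configuration classes; the
property name "crawl.work.dir" and the class WorkDirResolver are hypothetical
and are not existing Nutch settings or code.

```java
import java.io.File;
import java.util.Properties;

// Hypothetical helper; not part of Nutch.
class WorkDirResolver {
    // Assumed property name, chosen for illustration only.
    static final String WORK_DIR_KEY = "crawl.work.dir";

    // Return the configured work directory, or <java.io.tmpdir>/nutch if unset.
    static File resolve(Properties conf) {
        String fallback =
            new File(System.getProperty("java.io.tmpdir"), "nutch").getPath();
        return new File(conf.getProperty(WORK_DIR_KEY, fallback));
    }

    // True if the partition holding dir (or its nearest existing ancestor)
    // reports at least requiredBytes of usable space.
    static boolean hasRoom(File dir, long requiredBytes) {
        File probe = dir.getAbsoluteFile();
        while (probe != null && !probe.exists()) {
            probe = probe.getParentFile();
        }
        return probe != null && probe.getUsableSpace() >= requiredBytes;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        if (args.length > 0) {
            // A command-line argument overrides the default location.
            conf.setProperty(WORK_DIR_KEY, args[0]);
        }
        File workDir = resolve(conf);
        System.out.println("work dir: " + workDir
            + ", room for 1 GiB: " + hasRoom(workDir, 1L << 30));
    }
}
```

Checking space up front would turn the mid-crawl "No space left on device"
failure into an early, explicit error, though it cannot guarantee the space
stays available for the whole run.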


