Nutch 0.9 can apply zlib or lzo2 compression to your linkdb (and crawldb) to
reduce overall disk usage. The average compression ratio with zlib on those two
databases is about 6:1, and it doesn't slow down additions or segment creation.
Keep in mind that this currently works officially only on Linux and
unofficially on FreeBSD.
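If you want to try compression, the relevant knobs live in Hadoop's SequenceFile and job-output settings rather than in Nutch itself. The fragment below is only a sketch: the property names (io.seqfile.compression.type, mapred.output.compress, mapred.output.compression.codec) are from the Hadoop 0.x line that Nutch 0.9 ships with, so verify them against your own hadoop-default.xml, and note that an LZO codec would additionally require the native LZO library.

```xml
<!-- hadoop-site.xml (or nutch-site.xml) — a sketch, not a verified recipe.
     Enables block compression for the SequenceFiles that back the
     crawldb/linkdb. DefaultCodec is the built-in zlib-based codec. -->
<property>
  <name>io.seqfile.compression.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>
```

With BLOCK compression, many records are compressed together, which typically gives a better ratio than per-record compression on the small, repetitive entries in the crawldb and linkdb.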
----- Original Message ----
From: qi wu <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, April 11, 2007 9:01:30 AM
Subject: How to reduce the tmp disk space usage during the linkdb process?
Hi,
I have crawled nearly 3 million pages, kept in 13 segments, and there are
about 10 million entries in the crawldb. I am using Nutch 0.81 on a single
Linux box; the disk space occupied by the crawldb and segments is currently
about 20 GB, and the machine still has 36 GB free. Building the linkdb always
fails, and the error is caused by running out of space during the reduce
phase. The exception is listed below:
job_f506pk
org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
    at org.apache.hadoop.fs.LocalFileSystem$LocalFSFileOutputStream.write(LocalFileSystem.java:150)
    at org.apache.hadoop.fs.FSDataOutputStream$Summer.write(FSDataOutputStream.java:83)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:112)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:208)
    at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:913)
    at org.apache.hadoop.io.SequenceFile$Sorter$MergePass.run(SequenceFile.java:800)
    at org.apache.hadoop.io.SequenceFile$Sorter.mergePass(SequenceFile.java:738)
    at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:542)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:218)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:112)
I wonder why the linkdb reduce job requires so much space. Can I configure
some Nutch or Hadoop setting to reduce the disk space usage for building the
linkdb? Any hints for overcoming the problem? //bow
Thanks
-Qi
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general