Nutch 0.9 can apply zlib or lzo2 compression to your linkdb (and crawldb) to reduce their overall size. The average compression ratio with zlib is about 6:1 on those two databases, and it doesn't slow down additions or segment creation. Keep in mind that this currently works officially only on Linux, and unofficially on FreeBSD.
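In case it's useful, this is roughly what enabling it looks like in conf/hadoop-site.xml. A minimal sketch only; the property names below are from the Hadoop 0.x line that Nutch 0.9 bundles, so double-check them against the hadoop-default.xml shipped with your release:

    <!-- Write SequenceFiles (the crawldb/linkdb storage format)
         block-compressed rather than uncompressed. -->
    <property>
      <name>io.seqfile.compression.type</name>
      <value>BLOCK</value>
    </property>

    <!-- Compress job output; DefaultCodec is zlib. Swap in
         org.apache.hadoop.io.compress.LzoCodec if the native LZO
         library is installed on the node. -->
    <property>
      <name>mapred.output.compress</name>
      <value>true</value>
    </property>
    <property>
      <name>mapred.output.compression.codec</name>
      <value>org.apache.hadoop.io.compress.DefaultCodec</value>
    </property>

The Linux/FreeBSD caveat above comes from the native libhadoop compression library, which these codecs use when it is available.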
----- Original Message -----
From: qi wu <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, April 11, 2007 9:01:30 AM
Subject: How to reduce the tmp disk space usage during the linkdb process?

Hi,

I have crawled nearly 3 million pages, which are kept in 13 segments, and there are 10 million entries in the crawldb. I am using Nutch 0.8.1 on a single Linux box. Currently the disk space occupied by the crawldb and segments is about 20 GB, and the machine still has 36 GB left.

I always fail when building the linkdb; the error is caused by there being no space left for the reduce phase. The exception is listed below:

job_f506pk org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
        at org.apache.hadoop.fs.LocalFileSystem$LocalFSFileOutputStream.write(LocalFileSystem.java:150)
        at org.apache.hadoop.fs.FSDataOutputStream$Summer.write(FSDataOutputStream.java:83)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:112)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:208)
        at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:913)
        at org.apache.hadoop.io.SequenceFile$Sorter$MergePass.run(SequenceFile.java:800)
        at org.apache.hadoop.io.SequenceFile$Sorter.mergePass(SequenceFile.java:738)
        at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:542)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:218)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:112)

I wonder why so much space is required by the linkdb reduce job. Can I configure some Nutch or Hadoop setting to reduce the disk space usage for the linkdb? Any hints for overcoming the problem?

//bow
Thanks
-Qi
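For anyone hitting the same failure: the reduce-side sort/merge in the trace above spills its intermediate runs to the job's local directories, so besides compressing the databases themselves there are a few knobs that directly cut that temporary footprint. A rough sketch for hadoop-site.xml; again, these property names are from the Hadoop 0.x line bundled with Nutch 0.8/0.9, so verify them against your hadoop-default.xml (the path below is just a placeholder, substitute your own):

    <!-- Point intermediate/spill data at the partition with the
         most free space. -->
    <property>
      <name>mapred.local.dir</name>
      <value>/big_disk/hadoop/mapred/local</value>
    </property>

    <!-- Compress map outputs so the spilled runs being merged
         are smaller on disk. -->
    <property>
      <name>mapred.compress.map.output</name>
      <value>true</value>
    </property>

    <!-- Merge more streams per pass: fewer intermediate merge
         passes means fewer temporary copies of the data on disk
         at once (the default in this era was 10). -->
    <property>
      <name>io.sort.factor</name>
      <value>100</value>
    </property>

Building the linkdb over smaller batches of segments and combining the results afterwards with LinkDbMerger (bin/nutch mergelinkdb) also keeps any single reduce from needing the whole dataset's worth of temporary space.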
