Sean Dean wrote:
> I think the general rule is that you will require about 2.5 to 3 times the
> size of the final product. This is because Hadoop creates the reduce files
> after the map outputs are produced, and before those map outputs can be
> removed, so both sit on disk at once.
>
> I'm not aware of any way to change this; I think it's just "normal"
> functionality.
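To put rough numbers on that rule for the case in this thread: the crawldb
plus segments take about 20G and 36G of disk are free, but the final linkdb
size is never stated, so the figure below is purely an illustrative guess,
not a measurement.

    assumed final linkdb size     ~15 GB   (illustrative guess)
    temp needed at 2.5x to 3x     ~38-45 GB
    free disk reported             36 GB   -> a few GB short

If the final linkdb lands anywhere near the size of the crawl data itself,
the 2.5-3x rule is already enough to explain the "No space left on device"
failure below.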
The space consumption is at its worst in a single-machine configuration,
where you have to process all of the data on one machine. If you have more
machines to spare, then the space required per machine can (obviously) be
divided roughly by the number of machines.

I think the only way to cut down your temp space requirements (after
compression; I think it's possible to compress the temp data? see the note
right below) is to do your work in smaller slices; a rough per-segment
sketch follows at the end of this mail.

--
 Sami Siren
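On compressing the temp data: Hadoop can compress the intermediate map
output, which is exactly what fills the local disk during the sort/merge
phase of the reduce. Whether this is available to you depends on the Hadoop
version bundled with your Nutch build, so treat the snippet below as a
sketch and check your hadoop-default.xml for the property before relying on
it. In conf/hadoop-site.xml:

    <!-- Sketch only: compress map output before it is spilled to local
         disk; verify that the bundled Hadoop version ships this property. -->
    <property>
      <name>mapred.compress.map.output</name>
      <value>true</value>
    </property>

This trades some CPU for disk. Separately, if you have a second disk or a
bigger partition, pointing mapred.local.dir (or hadoop.tmp.dir) at it only
moves the temp files rather than shrinking them, but that can be enough to
let the reduce finish.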
> ----- Original Message ----
> From: qi wu <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Wednesday, April 11, 2007 10:41:35 AM
> Subject: Re: How to reduce the tmp disk space usage during linkdb process?
>
> One more general question related to this issue: how do I estimate the tmp
> space required by the overall process, which includes fetching, updating
> the crawldb, building the linkdb, and indexing?
> For my case, 20G of crawldb and segments require more than 36G of linkdb
> tmp space, which sounds unreasonable!
>
> ----- Original Message -----
> From: "qi wu" <[EMAIL PROTECTED]>
> To: <[email protected]>
> Sent: Wednesday, April 11, 2007 10:15 PM
> Subject: Re: How to reduce the tmp disk space usage during linkdb process?
>
>> It's impossible for me to change to 0.9 now. Anyway, thank you!
>>
>> ----- Original Message -----
>> From: "Sean Dean" <[EMAIL PROTECTED]>
>> To: <[email protected]>
>> Sent: Wednesday, April 11, 2007 9:33 PM
>> Subject: Re: How to reduce the tmp disk space usage during linkdb process?
>>
>>> Nutch 0.9 can apply zlib or lzo2 compression to your linkdb (and crawldb)
>>> to reduce overall space. The average compression ratio using zlib is
>>> about 6:1 on those two databases, and it doesn't slow down additions or
>>> segment creation.
>>>
>>> Keep in mind, this currently only works officially on Linux and
>>> unofficially on FreeBSD.
>>>
>>> ----- Original Message ----
>>> From: qi wu <[EMAIL PROTECTED]>
>>> To: [email protected]
>>> Sent: Wednesday, April 11, 2007 9:01:30 AM
>>> Subject: How to reduce the tmp disk space usage during linkdb process?
>>>
>>> Hi,
>>> I have crawled nearly 3 million pages, which are kept in 13 segments, and
>>> there are 10 million entries in the crawldb. I use Nutch 0.8.1 on a
>>> single Linux box. Currently the disk space occupied by the crawldb and
>>> segments is about 20G, and the machine still has 36G left.
>>> I have always failed to build the linkdb; the error is caused by no space
>>> being left for the reduce process. The exception is listed below:
>>>
>>> job_f506pk
>>> org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
>>>     at org.apache.hadoop.fs.LocalFileSystem$LocalFSFileOutputStream.write(LocalFileSystem.java:150)
>>>     at org.apache.hadoop.fs.FSDataOutputStream$Summer.write(FSDataOutputStream.java:83)
>>>     at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:112)
>>>     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>>>     at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
>>>     at java.io.DataOutputStream.write(DataOutputStream.java:90)
>>>     at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:208)
>>>     at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:913)
>>>     at org.apache.hadoop.io.SequenceFile$Sorter$MergePass.run(SequenceFile.java:800)
>>>     at org.apache.hadoop.io.SequenceFile$Sorter.mergePass(SequenceFile.java:738)
>>>     at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:542)
>>>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:218)
>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:112)
>>>
>>> I wonder why so much space is required by the linkdb reduce job. Can I
>>> configure some Nutch or Hadoop setting to reduce the disk space usage for
>>> the linkdb? Any hints for me to overcome the problem? //bow
>>>
>>> Thanks
>>> -Qi
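On the "smaller slices" suggestion above: instead of inverting links for all
13 segments in one job, you can build the linkdb a segment (or a few
segments) at a time, so each reduce only has to sort and merge a fraction of
the data in its local temp directory. The sketch below assumes the usual
crawl/linkdb and crawl/segments layout, and it assumes that the 0.8.1
invertlinks command accepts individual segment paths and folds new links
into an existing linkdb; run bin/nutch invertlinks with no arguments first
to confirm the exact usage on your version.

    # Sketch only: build/extend the linkdb one segment at a time so each
    # reduce pass works on a much smaller slice of the data.
    # Paths are illustrative; adjust them to your crawl directory layout.
    for seg in crawl/segments/*; do
      bin/nutch invertlinks crawl/linkdb "$seg"
    done

Each pass still needs temp space, but roughly in proportion to that segment
plus the existing linkdb, rather than to all segments at once.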
