I think the general rule is that you will need about 2.5 to 3 times the 
size of the final product in temporary space. This is because Hadoop 
creates the reduce-side files after the map outputs have been produced, 
and before those map outputs can be removed.

I'm not aware of any way to change this; I think it's just "normal" 
functionality.
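
What you can control is where that temporary data lands. A minimal 
hadoop-site.xml sketch, assuming the stock Hadoop property names 
(hadoop.tmp.dir and mapred.local.dir) are what your build honours, that 
points the scratch directories at whichever partition has the most free 
room:

  <configuration>
    <!-- base directory for Hadoop's temporary files -->
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/big_disk/hadoop-tmp</value>
    </property>
    <!-- local scratch space used while map/reduce output is sorted and merged -->
    <property>
      <name>mapred.local.dir</name>
      <value>/big_disk/mapred-local</value>
    </property>
  </configuration>

The /big_disk paths are just placeholders. This doesn't shrink the 
2.5 to 3 times requirement, it only lets you satisfy it on a disk that 
actually has the space.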

 
----- Original Message ----
From: qi wu <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, April 11, 2007 10:41:35 AM
Subject: Re: How to reduce the tmp disk space usage during the linkdb process?


One more general question related to this issue: how do I estimate the tmp 
space required by the overall process, which includes fetching, updating the 
crawldb, building the linkdb, and indexing?
In my case, 20G of crawldb and segments requires more than 36G of tmp space 
just for building the linkdb, which sounds unreasonable!

----- Original Message ----- 
From: "qi wu" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Wednesday, April 11, 2007 10:15 PM
Subject: Re: How to reduce the tmp disk space usage during the linkdb process?


> It's impossible for me to change to 0.9 now. Anyway, thank you!
> 
> ----- Original Message ----- 
> From: "Sean Dean" <[EMAIL PROTECTED]>
> To: <[email protected]>
> Sent: Wednesday, April 11, 2007 9:33 PM
> Subject: Re: How to reduce the tmp disk space usage during the linkdb process?
> 
> 
>> Nutch 0.9 can apply zlib or lzo2 compression to your linkdb (and crawldb) to 
>> reduce overall space. The average compression ratio with zlib is about 6:1 
>> on those two databases, and the compression doesn't slow down additions or 
>> segment creation.
>> 
>> Keep in mind, this currently only works officially on Linux and unofficially 
>> on FreeBSD.
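>> 
>> A minimal hadoop-site.xml sketch of how that might be switched on, assuming 
>> it is driven by Hadoop's standard SequenceFile settings (the property name 
>> below is the stock Hadoop one; I have not checked exactly how the 0.9 build 
>> wires it up):
>> 
>>   <configuration>
>>     <!-- compress SequenceFile data in blocks rather than per record -->
>>     <property>
>>       <name>io.seqfile.compression.type</name>
>>       <value>BLOCK</value>
>>     </property>
>>   </configuration>
>> 
>> The lzo codec in particular needs the Hadoop native library, which would 
>> explain why it is described above as Linux-only.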
>> 
>> 
>> ----- Original Message ----
>> From: qi wu <[EMAIL PROTECTED]>
>> To: [email protected]
>> Sent: Wednesday, April 11, 2007 9:01:30 AM
>> Subject: How to reduce the tmp disk space usage during the linkdb process?
>> 
>> 
>> Hi,
>>  I have crawled nearly 3 million pages, which are kept in 13 segments, and 
>> there are 10 million entries in the crawldb. I am using Nutch 0.8.1 on a 
>> single Linux box. The disk space currently occupied by the crawldb and 
>> segments is about 20G, and the machine still has 36G free. Building the 
>> linkdb always fails, and the error is caused by running out of space during 
>> the reduce phase. The exception is listed below:
>> job_f506pk
>> org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
>>        at org.apache.hadoop.fs.LocalFileSystem$LocalFSFileOutputStream.write(LocalFileSystem.java:150)
>>        at org.apache.hadoop.fs.FSDataOutputStream$Summer.write(FSDataOutputStream.java:83)
>>        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:112)
>>        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>>        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
>>        at java.io.DataOutputStream.write(DataOutputStream.java:90)
>>        at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:208)
>>        at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:913)
>>        at org.apache.hadoop.io.SequenceFile$Sorter$MergePass.run(SequenceFile.java:800)
>>        at org.apache.hadoop.io.SequenceFile$Sorter.mergePass(SequenceFile.java:738)
>>        at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:542)
>>        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:218)
>>        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:112)
>> 
>> I wonder why so much space is required by the linkdb reduce job. Can I 
>> configure some Nutch or Hadoop setting to reduce the disk space usage for 
>> the linkdb? Any hints for overcoming the problem? //bow
>> 
>> Thanks
>> -Qi