On Sun, Jan 25, 2009 at 1:17 PM, Venkatesh Babu <[email protected]> wrote:
>
> Hello Doğacan,
>
> Thanks for the reply. I had a query on the following:
>
> ">> 1. Does the map-reduce operation involve intermediate data which is as high
> >> ratio wise?
>
> >> Yes."
>
> 1. In your past experience, what is the general ratio of the size of the segments
> being merged to the maximum disk space required during the merge operation?
> 2. If you were using Nutch prior to the Hadoop implementation, was it any better
> when run without Hadoop?
>
I have never used a Hadoop-free Nutch, and I have not been working with large
segments for a looong time, so I don't know :)

> "Compressing temporary outputs may help you here."
>
> 3. I guess compression would have a cost. Since it is already taking me more
> than a day to merge these segments, which are only 3GB, and I have a task to
> merge segments of 40GB or more, I was wondering how long this would take if
> I enabled compression. I guess my question is: would you have any data on how
> much slower the merge would become if I enabled compression of map output?

I have done some analysis in https://issues.apache.org/jira/browse/NUTCH-392.
Short answer: don't worry too much :) Especially if you use LZO, performance
will be very good.

Another thing: if you do not need the "content" directory in your merged
segment, just rename the "content" directories in your segments to something
else. That way mergesegs will not merge the "content" data, and this should
reduce the disk space requirement a lot.

> Thanks,
> VB
>
> --
> View this message in context:
> http://www.nabble.com/Issue-with-merging-segments-with-s-w-built-from-main-trunk-tp21641977p21650571.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
Doğacan Güney
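P.S. As a sketch of the compression suggestion: in the Hadoop line current at
the time (0.19.x), map output compression was controlled by properties in
hadoop-site.xml along the following lines. The property names and the codec
class here are my recollection of that era's defaults, not something from this
thread, so verify them against your own hadoop-default.xml (later Hadoop
releases renamed them, e.g. to mapreduce.map.output.compress).

```xml
<!-- Sketch only: property names assumed from the Hadoop 0.19.x line. -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <!-- LZO may require separately installed (GPL-licensed) codec libraries;
       org.apache.hadoop.io.compress.GzipCodec is a fallback that ships in core. -->
  <value>org.apache.hadoop.io.compress.LzoCodec</value>
</property>
```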
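P.P.S. The "rename content" trick can be scripted. A minimal sketch, assuming
your segments live under a local segments/ directory (the path and the ".skip"
suffix are arbitrary choices, not Nutch conventions):

```shell
#!/bin/sh
# Rename each segment's "content" directory so mergesegs will not merge it.
# The segments/ path and the ".skip" suffix are just examples.
for seg in segments/*; do
  if [ -d "$seg/content" ]; then
    mv "$seg/content" "$seg/content.skip"
  fi
done
```

After the merge finishes, the reverse mv restores the originals if the raw
fetched content is still needed.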
