Replication also has downstream effects: it puts pressure on the available 
network bandwidth and disk I/O bandwidth when the cluster is loaded.
john
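
To put rough numbers on that pressure, here is a small Python sketch. It assumes the first replica lands on the writer's own DataNode (the usual HDFS placement when the client runs on a DataNode), so only the remaining copies cross the network. The function name `replication_io` and the 10 GB output size are illustrative and do not come from the thread.

```python
def replication_io(output_gb, replication):
    """Estimate cluster-wide disk writes and network transfer for one output.

    Assumption: the first replica is written to the local DataNode, so
    only (replication - 1) copies travel over the network.
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    disk_gb = output_gb * replication            # every replica hits a disk
    network_gb = output_gb * (replication - 1)   # remote replicas only
    return disk_gb, network_gb


# Hypothetical 10 GB of reducer output at replication factors 1 to 3.
for rf in (1, 2, 3):
    disk, net = replication_io(10, rf)
    print(f"RF={rf}: {disk} GB written to disk, {net} GB sent over the network")
```

Under these assumptions, going from replication 1 to 3 triples the disk writes and adds two full copies of the output to network traffic.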

From: Mohammad Tariq [mailto:donta...@gmail.com]
Sent: Monday, July 01, 2013 6:35 PM
To: user@hadoop.apache.org
Subject: Re: intermediate results files

I see. The difference arises because the next block of data is not written to 
HDFS until the previous block has been successfully written to 'all' the DNs 
selected for replication. This implies that a higher RF means more time to 
complete each block write.

Warm Regards,
Tariq
http://cloudfront.blogspot.com
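
The reasoning above (a block is done only when every selected DataNode has it) can be sketched as a toy bottleneck model in Python. This is a simplification and not how HDFS measures anything: it assumes the write streams through the replicas and that the slowest disk or network hop sets the sustained rate. The function name and all rates are hypothetical.

```python
def pipeline_throughput(stage_mb_per_s):
    """Sustained write rate of a replication pipeline (toy model).

    Assumption: a block is complete only once every replica has it, so
    the slowest stage (a disk or a network hop) sets the overall rate.
    """
    if not stage_mb_per_s:
        raise ValueError("need at least one stage")
    return min(stage_mb_per_s)


# Hypothetical stage rates in MB/s; these are not measurements.
rf1 = pipeline_throughput([80])                  # local disk only
rf3 = pipeline_throughput([80, 60, 70, 45, 75])  # plus two hops and two remote disks

print(f"RF=1: {rf1} MB/s, RF=3: {rf3} MB/s")
```

With more stages in the pipeline there are more chances for one of them to be slow, which is one way to read why a higher RF tends to lengthen each block write on a loaded cluster.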

On Tue, Jul 2, 2013 at 4:39 AM, John Lilley 
<john.lil...@redpoint.net> wrote:
I've seen some benchmarks where replication=1 runs at about 50MB/sec and 
replication=3 runs at about 33MB/sec, but I can't seem to find that now.
John
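
As a quick worked example of what those two figures would mean in practice, the Python sketch below applies them to a hypothetical 10 GB of intermediate output. The 50 and 33 MB/sec rates are the recalled benchmark numbers from the message above, not verified measurements, and the function name `write_seconds` is illustrative.

```python
def write_seconds(size_mb, throughput_mb_per_s):
    """Time to write size_mb at a constant sustained throughput."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return size_mb / throughput_mb_per_s


size_mb = 10 * 1024                  # hypothetical 10 GB of intermediate output
t_rf1 = write_seconds(size_mb, 50)   # recalled rate at replication=1
t_rf3 = write_seconds(size_mb, 33)   # recalled rate at replication=3
slowdown = t_rf3 / t_rf1

print(f"replication=1: {t_rf1:.0f} s")
print(f"replication=3: {t_rf3:.0f} s")
print(f"slowdown: {slowdown:.2f}x")
```

If the recalled rates are roughly right, the same output takes about one and a half times as long to write at replication=3.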

From: Mohammad Tariq [mailto:donta...@gmail.com]
Sent: Monday, July 01, 2013 5:03 PM
To: user@hadoop.apache.org
Subject: Re: intermediate results files

Hello John,

IMHO, it doesn't matter. Your job will write the result just once. 
Replica creation is handled at the HDFS layer, so it has nothing to do with your 
job. Your job will still be writing at the same speed.

Warm Regards,
Tariq
http://cloudfront.blogspot.com

On Tue, Jul 2, 2013 at 4:16 AM, John Lilley 
<john.lil...@redpoint.net> wrote:
If my reducers are going to create results that are temporary in nature 
(consumed by the next processing stage) is it recommended to use a replication 
factor <3 to improve performance?
Thanks
john


