I assume Karl means 'regions', i.e. Europe or US. I don't think S3 has the same notion of availability zones that EC2 has.

Data transfer between different regions, e.g. between EC2 in the US and S3-Europe, is 1) charged for and 2) likely to be slower.

Transfer between S3-US and EC2 is free of charge, and should be significantly quicker.
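
For example, something along these lines keeps both sides in the same
region (an untested sketch using the AWS SDK for Java; the region, AMI
ID, and bucket name are placeholders):

import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.Placement;
import com.amazonaws.services.ec2.model.RunInstancesRequest;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class SameRegionSetup {
    public static void main(String[] args) {
        // Build both clients against the same region so the bucket and
        // the instances end up co-located (no inter-region transfer fees).
        String region = "us-east-1"; // placeholder region

        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withRegion(region).build();
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.standard()
                .withRegion(region).build();

        // The bucket is created in whatever region the client targets.
        s3.createBucket("my-hadoop-input"); // placeholder bucket name

        // Zones only apply to EC2; pin the instances to an explicit
        // availability zone within the same region.
        RunInstancesRequest req = new RunInstancesRequest()
                .withImageId("ami-12345678") // placeholder AMI
                .withInstanceType("m1.small")
                .withMinCount(1)
                .withMaxCount(1)
                .withPlacement(new Placement(region + "a"));
        ec2.runInstances(req);
    }
}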


Russell

Ryan LeCompte wrote:
How can you ensure that the S3 buckets and EC2 instances belong to a
certain zone?

Ryan


On Tue, Sep 2, 2008 at 2:38 PM, Karl Anderson <[EMAIL PROTECTED]> wrote:
On 2-Sep-08, at 5:22 AM, Ryan LeCompte wrote:

Hi Tim,

Are you mostly just processing/parsing textual log files? How many
maps/reduces did you configure in your hadoop-ec2-env.sh file? How
many did you configure in your JobConf? Just trying to get an idea of
what to expect in terms of performance. I'm noticing that it takes
about 16 minutes to transfer about 15 GB of uncompressed text data
from S3 into HDFS after the cluster has started with 15 nodes. I was
expecting this to take less time, but maybe my assumptions are off.
I'm also noticing that it takes about 15 minutes to parse through the
15 GB of data with a 15-node cluster.

I'm seeing much faster speeds.  With 128 nodes running a mapper-only
downloading job, downloading 30 GB takes roughly a minute, less time
than the end-of-job work (which I assume is HDFS replication and
bookkeeping).  More mappers gives you more parallel downloads, of
course.  I'm using a Python REST client for S3, and only move data to
or from S3 when Hadoop is done with it.
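
For illustration, the JobConf side of a mapper-only download job looks
something like this (a sketch, not my actual job; the class name and
paths are placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class S3DownloadJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(S3DownloadJob.class);
        conf.setJobName("s3-download");

        // More map tasks means more parallel downloads. Note that
        // setNumMapTasks() is only a hint; the actual number of maps
        // is driven by the number of input splits.
        conf.setNumMapTasks(128);

        // Mapper-only: skip the reduce phase entirely.
        conf.setNumReduceTasks(0);

        FileInputFormat.setInputPaths(conf, new Path("/input"));   // placeholder
        FileOutputFormat.setOutputPath(conf, new Path("/output")); // placeholder

        JobClient.runJob(conf);
    }
}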

Make sure your S3 buckets and EC2 instances are in the same zone.
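
You can double-check which region a bucket lives in with something
like this (untested; the bucket name is a placeholder):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class CheckBucketRegion {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        // Prints the bucket's location constraint, e.g. "EU",
        // which you can compare against your instances' region.
        System.out.println(s3.getBucketLocation("my-hadoop-input"));
    }
}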


