Why would you lose the locality of storage-per-machine if one EBS volume is
mounted on each machine instance?  When that machine goes down, you can just
restart the instance and re-mount the exact same volume.  I've tried this
successfully on a 10-node cluster on EC2 and didn't see any adverse
performance effects; Amazon actually claims that EBS I/O should be even
better than the instance stores.  The only concerns I see are cost and
placement.  You pay for EBS storage whether or not you use it.  So, if you
have 10 EBS volumes of 1 TB each, and you're just starting out with your
cluster and using only 50 GB on each volume so far this month, you'd still
pay for 10 TB worth of EBS volumes, which could be a hefty bill each month.
Also, an EBS volume currently has to be created in the same availability
zone as the instance it attaches to, so you need to make sure your volumes
are created in the right zone, as there is no direct migration of EBS
volumes to a different availability zone.
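
To put numbers on the cost point, here is a rough sketch of the arithmetic.
The volume count and sizes are the ones from my example above; the price per
GB-month is only a placeholder I made up for illustration, so check Amazon's
current pricing before relying on the dollar figure.

```python
# Rough sketch: EBS bills for provisioned capacity, not for the bytes you store.

def ebs_monthly_cost(volumes, provisioned_gb_each, price_per_gb_month):
    """Monthly storage charge for a set of equally sized EBS volumes."""
    return volumes * provisioned_gb_each * price_per_gb_month

VOLUMES = 10                 # one volume per node in the cluster
PROVISIONED_GB_EACH = 1000   # 1 TB per volume, counted as 1000 GB
USED_GB_EACH = 50            # data actually written so far this month
PRICE_PER_GB_MONTH = 0.10    # ASSUMPTION: placeholder rate, not a quoted price

billed_gb = VOLUMES * PROVISIONED_GB_EACH   # capacity you are charged for
used_gb = VOLUMES * USED_GB_EACH            # capacity you are actually using
utilization = used_gb / billed_gb           # fraction of paid-for space in use

monthly_cost = ebs_monthly_cost(VOLUMES, PROVISIONED_GB_EACH, PRICE_PER_GB_MONTH)

print(f"billed: {billed_gb} GB, used: {used_gb} GB ({utilization:.0%})")
print(f"monthly storage charge at the assumed rate: ${monthly_cost:.2f}")
```

Starting with smaller volumes and growing them as the data grows would bring
the billed figure closer to the used one.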


On Wed, Mar 11, 2009 at 6:39 AM, Steve Loughran <ste...@apache.org> wrote:

> Malcolm Matalka wrote:
>
>> If this is not the correct place to ask Hadoop + EC2 questions please
>> let me know.
>>
>>
>> I am trying to get a handle on how to use Hadoop on EC2 before
>> committing any money to it.  My question is, how do I maintain a
>> persistent HDFS between restarts of instances.  Most of the tutorials I
>> have found involve the cluster being wiped once all the instances are
>> shut down, but in my particular case I will be feeding the output of the
>> previous day's run as the input of the current day's run, and this data
>> will get large over time.  I see I can use S3 as the file system, or would
>> I just create an EBS volume for each instance?  What are my options?
>>
>
>  EBS would cost you more; you'd lose the locality of storage-per-machine.
>
> If you stick the output of some runs back into S3 then the next jobs have
> no locality and higher startup overhead to pull the data down, but you don't
> pay for that download (just the time it takes).
>
