I have a similar use case, so I wrote a Python script to fix the cluster configuration that spark-ec2 generates when you use Hadoop 2. Start a cluster with enough machines that HDFS can hold 1 TB (so use instance types that have SSDs), then follow the instructions at http://thousandfold.net/cz/2015/07/01/installing-spark-with-hadoop-2-using-spark-ec2/. Let me know if you run into any issues.
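For reference, a launch roughly along these lines is what I had in mind (the key pair name, slave count, and instance type below are just illustrative; pick an SSD-backed type and enough slaves so the total instance-store capacity comfortably exceeds 1 TB times your HDFS replication factor):

    # launch a cluster whose slaves have local SSDs, built against Hadoop 2
    ./spark-ec2 -k my-keypair -i ~/my-keypair.pem \
        -s 4 -t i2.2xlarge \
        --hadoop-major-version=2 \
        launch my-spark-cluster

After the cluster comes up, apply the configuration fix described in the blog post above before loading your data into HDFS.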
On Mon, Jun 29, 2015 at 4:32 PM, manish ranjan <cse1.man...@gmail.com> wrote:
>
> Hi All
>
> Here goes my first question. Here is my use case:
>
> I have 1 TB of data that I want to process on EC2 using Spark.
> I have uploaded the data to an EBS volume.
> The Amazon EC2 setup instructions say:
> "*If your application needs to access large datasets, the fastest way to
> do that is to load them from Amazon S3 or an Amazon EBS device into an
> instance of the Hadoop Distributed File System (HDFS) on your nodes*"
>
> Now the new Amazon instance types don't have any physical volumes:
> http://aws.amazon.com/ec2/instance-types/
>
> So do I need to set up HDFS separately on EC2 (the instructions also
> say "The spark-ec2 script already sets up a HDFS instance for you")? Any
> blog/write-up that can help me understand this better?
>
> ~Manish