Hi all, here goes my first question. Here is my use case:

I have 1TB of data that I want to process on EC2 using Spark, and I have uploaded the data to an EBS volume. The Amazon EC2 setup instructions say: "*If your application needs to access large datasets, the fastest way to do that is to load them from Amazon S3 or an Amazon EBS device into an instance of the Hadoop Distributed File System (HDFS) on your nodes*"

However, the new Amazon instance types don't come with any physical (instance-store) volumes: http://aws.amazon.com/ec2/instance-types/

So do I need to set up HDFS separately on EC2, even though the instructions also say "The spark-ec2 script already sets up a HDFS instance for you"? Is there any blog or write-up that would help me understand this better?

~Manish
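
P.S. For concreteness, here is the rough workflow I was planning to try based on my reading of the docs. This is only a sketch: the keypair, slave count, cluster name, and EBS mount point below are hypothetical placeholders.

    # Launch a cluster with the spark-ec2 script from the ec2/ directory
    # of the Spark distribution; per the docs it also starts an HDFS
    # instance on the nodes:
    ./spark-ec2 -k my-keypair -i ~/my-keypair.pem -s 5 launch my-spark-cluster

    # Log in to the master node:
    ./spark-ec2 -k my-keypair -i ~/my-keypair.pem login my-spark-cluster

    # On the master, assuming my EBS volume is attached and mounted at
    # /vol (hypothetical), copy the dataset into the HDFS that the
    # script set up (ephemeral-hdfs path as on the Spark EC2 AMIs):
    /root/ephemeral-hdfs/bin/hadoop fs -put /vol/mydata /mydata

Is this the right approach, or do I still need a separate HDFS setup given that the new instance types lack instance-store volumes for the ephemeral HDFS to use?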