I want to process about 800 GB of data on an Amazon EC2 cluster. So, I need to store the input in HDFS somehow.
I currently have a cluster of 5 x m3.xlarge instances, each of which has 80 GB of disk. Each HDFS node reports 73 GB, and the total capacity is ~370 GB.

If I want to process 800 GB of data (assuming I can't split the jobs up), I'm guessing I need to get persistent-hdfs involved.

1 - Does persistent-hdfs have noticeably different performance than ephemeral-hdfs?
2 - If so, is there a recommended configuration (like storing input and output on persistent, but persisted RDDs on ephemeral)?

This seems like a common use case, so sorry if this has already been covered.

Joe
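
P.S. To make the capacity gap concrete, here is a rough back-of-the-envelope sketch of the arithmetic. The helper name `nodes_needed` is just something I made up for illustration, and the replication factor is an assumption: I have not checked what `dfs.replication` is actually set to on this cluster. It also ignores any space needed for output or intermediate data.

```python
import math

def nodes_needed(data_gb, usable_gb_per_node, replication=1):
    """Smallest node count whose combined HDFS capacity can hold
    data_gb stored `replication` times (hypothetical helper)."""
    return math.ceil(data_gb * replication / usable_gb_per_node)

# Figures from my cluster: 5 nodes, each reporting 73 GB to HDFS.
current_capacity_gb = 5 * 73

print(current_capacity_gb)                   # raw capacity today, in GB
print(nodes_needed(800, 73))                 # nodes for 800 GB, no replication
print(nodes_needed(800, 73, replication=3))  # assumed replication factor of 3
```

Even with no replication at all, 800 GB does not fit on five nodes of this size, which is why I am looking at persistent-hdfs rather than just tuning the current setup.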