We use the spark-ec2 script to create AWS clusters as needed (we do not use AWS EMR).

1. Will we get better performance if we copy the data to HDFS before running, instead of reading directly from S3?
2. What is a good way to move results from HDFS to S3?
It seems like there are many ways to bulk copy to S3. Many of them require embedding the credentials directly in the URL, e.g. s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@yasemindeneme/deneme.txt. This seems like a bad idea. What would you recommend?

Thanks,
Andy
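For context, one common alternative to putting keys in the URL is to set them in the Hadoop configuration that Spark picks up. A minimal sketch, assuming the older s3n filesystem that spark-ec2-era clusters used (the property names below are Hadoop's standard s3n credential keys; the file location may differ on your cluster):

```xml
<!-- core-site.xml on the cluster (or set the same keys via
     sc.hadoopConfiguration.set(...) in the driver) -->
<configuration>
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>YOUR_SECRET_ACCESS_KEY</value>
  </property>
</configuration>
```

With the credentials configured this way, a bulk copy such as `hadoop distcp hdfs:///path/to/results s3n://your-bucket/results` (bucket name hypothetical) can run without keys appearing in the URL or in shell history.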