For lack of a better solution, I am using aws s3 cp to copy my files locally and hadoop fs -put ./tmp/* to transfer them. In general, put works much better with a small number of big files than with a large number of small files.
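A minimal sketch of that two-step workaround (S3 -> local disk -> HDFS). The bucket, prefix, and HDFS target paths below are hypothetical placeholders, not from the original post, and each external CLI call is guarded so the script degrades gracefully where the tools are missing:

```shell
#!/usr/bin/env bash
# Two-step copy: stage S3 objects on local disk, then put them into HDFS.
# SRC and DEST are hypothetical example paths; substitute your own.
set -euo pipefail

SRC="s3://my-bucket/json-output/"       # hypothetical S3 prefix
LOCAL_TMP="./tmp"                       # local staging directory
DEST="hdfs:///user/hadoop/json-output"  # hypothetical HDFS target

mkdir -p "$LOCAL_TMP"

# Step 1: pull the files down with the AWS CLI (recursive directory copy).
if command -v aws >/dev/null 2>&1; then
    aws s3 cp "$SRC" "$LOCAL_TMP" --recursive
else
    echo "aws CLI not found; skipping S3 download"
fi

# Step 2: push the staged files into HDFS with a single put invocation,
# which performs far better than calling put once per small file.
if command -v hadoop >/dev/null 2>&1; then
    hadoop fs -mkdir -p "$DEST"
    hadoop fs -put "$LOCAL_TMP"/* "$DEST"
else
    echo "hadoop CLI not found; skipping HDFS upload"
fi
```

Note that hadoop fs -put also accepts a directory argument directly, so putting the whole staging directory in one call avoids per-file overhead.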
Your mileage may vary.

Andy

From: Andrew Davidson <a...@santacruzintegration.com>
Date: Wednesday, July 27, 2016 at 4:25 PM
To: "user @spark" <user@spark.apache.org>
Subject: how to copy local files to hdfs quickly?

> I have a spark streaming app that saves JSON files to s3://. It works fine.
>
> Now I need to calculate some basic summary stats and am running into horrible
> performance problems.
>
> I want to run a test to see if reading from hdfs instead of s3 makes a
> difference. I am able to quickly copy the data from s3 to a machine in my
> cluster; however, hadoop fs put is painfully slow. Is there a better way to
> copy large data to hdfs?
>
> I should mention I am not using EMR, i.e. according to AWS support there is
> no way to have $ aws s3 copy a directory to hdfs://
>
> Hadoop distcp can not copy files from the local file system.
>
> Thanks in advance
>
> Andy