Yep, I think if you try spark-1.5.1-hadoop-2.6 you will find that you cannot access S3, unfortunately.
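For context, the additional libraries tracked in SPARK-7481 are, from my understanding, the hadoop-aws module and its AWS SDK dependency, which were split out of Hadoop core as of 2.6. A minimal sketch of getting them onto the classpath of a Hadoop 2.6 build (the version numbers assume Hadoop 2.6.0, and the local jar paths are hypothetical):

    # Fetch hadoop-aws (and, transitively, its aws-java-sdk dependency) from Maven Central
    spark-shell --packages org.apache.hadoop:hadoop-aws:2.6.0

    # Or point at jars already on disk (paths here are hypothetical)
    spark-shell --jars /path/to/hadoop-aws-2.6.0.jar,/path/to/aws-java-sdk-1.7.4.jar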
On Thu, Nov 5, 2015 at 3:53 PM Christian <engr...@gmail.com> wrote:

> I created the cluster with the following:
>
> --hadoop-major-version=2
> --spark-version=1.4.1
>
> from: spark-1.5.1-bin-hadoop1
>
> Are you saying there might be different behavior if I download
> spark-1.5.1-hadoop-2.6 and create my cluster?
>
> On Thu, Nov 5, 2015 at 1:28 PM, Christian <engr...@gmail.com> wrote:
>
>> Spark 1.5.1-hadoop1
>>
>> On Thu, Nov 5, 2015 at 10:28 AM, Nicholas Chammas
>> <nicholas.cham...@gmail.com> wrote:
>>
>>> > I am using both 1.4.1 and 1.5.1.
>>>
>>> That's the Spark version. I'm wondering what version of Hadoop your
>>> Spark is built against.
>>>
>>> For example, when you download Spark
>>> <http://spark.apache.org/downloads.html> you have to select from a
>>> number of packages (under "Choose a package type"), and each is built
>>> against a different version of Hadoop. When Spark is built against
>>> Hadoop 2.6+, from my understanding, you need to install additional
>>> libraries <https://issues.apache.org/jira/browse/SPARK-7481> to
>>> access S3. When Spark is built against Hadoop 2.4 or earlier, you
>>> don't need to do this.
>>>
>>> I'm trying to confirm that this is what is happening in your case.
>>>
>>> Nick
>>>
>>> On Thu, Nov 5, 2015 at 12:17 PM Christian <engr...@gmail.com> wrote:
>>>
>>>> I am using both 1.4.1 and 1.5.1. In the end, we used 1.5.1 because of
>>>> its new instance-profile feature, which greatly helps with this as
>>>> well. Without the instance-profile, we got it working by copying a
>>>> .aws/credentials file up to each node. We could easily automate that
>>>> through the templates.
>>>>
>>>> I don't need any additional libraries. We just need to change the
>>>> core-site.xml.
>>>>
>>>> -Christian
>>>>
>>>> On Thu, Nov 5, 2015 at 9:35 AM, Nicholas Chammas
>>>> <nicholas.cham...@gmail.com> wrote:
>>>>
>>>>> Thanks for sharing this, Christian.
>>>>>
>>>>> What build of Spark are you using? If I understand correctly, if you
>>>>> are using Spark built against Hadoop 2.6+, then additional configs
>>>>> alone won't help, because additional libraries also need to be
>>>>> installed <https://issues.apache.org/jira/browse/SPARK-7481>.
>>>>>
>>>>> Nick
>>>>>
>>>>> On Thu, Nov 5, 2015 at 11:25 AM Christian <engr...@gmail.com> wrote:
>>>>>
>>>>>> We ended up reading from and writing to S3 a lot in our Spark jobs.
>>>>>> For this to work, we had to add s3, s3n, and s3a key/secret pairs.
>>>>>> We also had to add fs.hdfs.impl to get these things to work.
>>>>>>
>>>>>> I thought I'd share what we did; it might be worth adding these to
>>>>>> the Spark conf for out-of-the-box functionality with S3.
>>>>>>
>>>>>> We created:
>>>>>>
>>>>>> ec2/deploy.generic/root/spark-ec2/templates/root/spark/conf/core-site.xml
>>>>>>
>>>>>> We changed the contents from the original, adding in the following:
>>>>>>
>>>>>> <property>
>>>>>>   <name>fs.file.impl</name>
>>>>>>   <value>org.apache.hadoop.fs.LocalFileSystem</value>
>>>>>> </property>
>>>>>>
>>>>>> <property>
>>>>>>   <name>fs.hdfs.impl</name>
>>>>>>   <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
>>>>>> </property>
>>>>>>
>>>>>> <property>
>>>>>>   <name>fs.s3.impl</name>
>>>>>>   <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
>>>>>> </property>
>>>>>>
>>>>>> <property>
>>>>>>   <name>fs.s3.awsAccessKeyId</name>
>>>>>>   <value>{{aws_access_key_id}}</value>
>>>>>> </property>
>>>>>>
>>>>>> <property>
>>>>>>   <name>fs.s3.awsSecretAccessKey</name>
>>>>>>   <value>{{aws_secret_access_key}}</value>
>>>>>> </property>
>>>>>>
>>>>>> <property>
>>>>>>   <name>fs.s3n.awsAccessKeyId</name>
>>>>>>   <value>{{aws_access_key_id}}</value>
>>>>>> </property>
>>>>>>
>>>>>> <property>
>>>>>>   <name>fs.s3n.awsSecretAccessKey</name>
>>>>>>   <value>{{aws_secret_access_key}}</value>
>>>>>> </property>
>>>>>>
>>>>>> <property>
>>>>>>   <name>fs.s3a.awsAccessKeyId</name>
>>>>>>   <value>{{aws_access_key_id}}</value>
>>>>>> </property>
>>>>>>
>>>>>> <property>
>>>>>>   <name>fs.s3a.awsSecretAccessKey</name>
>>>>>>   <value>{{aws_secret_access_key}}</value>
>>>>>> </property>
>>>>>>
>>>>>> This change makes Spark on EC2 work out of the box for us. It took
>>>>>> us several days to figure this out. It works for 1.4.1 and 1.5.1 on
>>>>>> Hadoop version 2.
>>>>>>
>>>>>> Best Regards,
>>>>>> Christian
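With a core-site.xml like the one above in place, jobs can address S3 by URL directly. A minimal sketch from the Spark shell (the bucket and paths are hypothetical):

    // s3n:// resolves to NativeS3FileSystem, with credentials picked up
    // from the fs.s3n.awsAccessKeyId/awsSecretAccessKey properties above
    val events = sc.textFile("s3n://my-bucket/input/events.log")
    events.filter(_.nonEmpty).saveAsTextFile("s3n://my-bucket/output/cleaned")

One caveat: on Hadoop 2.6+ builds, the s3a connector reads fs.s3a.access.key and fs.s3a.secret.key rather than the awsAccessKeyId/awsSecretAccessKey names used above, so the fs.s3a.* entries would likely need renaming on those builds.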