Even with the changes I mentioned above?

On Thu, Nov 5, 2015 at 8:10 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
> Yep, I think if you try spark-1.5.1-hadoop-2.6 you will find that you
> cannot access S3, unfortunately.
>
> On Thu, Nov 5, 2015 at 3:53 PM Christian <engr...@gmail.com> wrote:
>
>> I created the cluster with the following:
>>
>> --hadoop-major-version=2
>> --spark-version=1.4.1
>>
>> from: spark-1.5.1-bin-hadoop1
>>
>> Are you saying there might be different behavior if I download
>> spark-1.5.1-hadoop-2.6 and create my cluster?
>>
>> On Thu, Nov 5, 2015 at 1:28 PM, Christian <engr...@gmail.com> wrote:
>>
>>> Spark 1.5.1-hadoop1
>>>
>>> On Thu, Nov 5, 2015 at 10:28 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>
>>>> > I am using both 1.4.1 and 1.5.1.
>>>>
>>>> That's the Spark version. I'm wondering what version of Hadoop your
>>>> Spark is built against.
>>>>
>>>> For example, when you download Spark
>>>> <http://spark.apache.org/downloads.html>, you have to select from a
>>>> number of packages (under "Choose a package type"), and each is built
>>>> against a different version of Hadoop. When Spark is built against
>>>> Hadoop 2.6+, from my understanding, you need to install additional
>>>> libraries <https://issues.apache.org/jira/browse/SPARK-7481> to access
>>>> S3. When Spark is built against Hadoop 2.4 or earlier, you don't need
>>>> to do this.
>>>>
>>>> I'm trying to confirm that this is what is happening in your case.
>>>>
>>>> Nick
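To make Nick's distinction concrete: with a Hadoop 2.6+ package, the S3
filesystem classes live outside the Spark assembly and have to be supplied
explicitly. A minimal sketch, assuming the hadoop-aws module discussed in
SPARK-7481; the paths and versions are illustrative, not from the thread:

    # Hadoop 2.6+ builds only: put the S3 connector on the classpath.
    # hadoop-aws 2.6.0 was compiled against aws-java-sdk 1.7.4, so this
    # pair is self-consistent; match the versions to your Hadoop build.
    ./bin/spark-shell \
      --jars /path/to/hadoop-aws-2.6.0.jar,/path/to/aws-java-sdk-1.7.4.jar

Spark built against Hadoop 2.4 or earlier already bundles the s3/s3n
support, which is why the config-only approach below works there.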
>>>> On Thu, Nov 5, 2015 at 12:17 PM Christian <engr...@gmail.com> wrote:
>>>>
>>>>> I am using both 1.4.1 and 1.5.1. In the end, we used 1.5.1 because of
>>>>> its new instance-profile feature, which greatly helps with this as
>>>>> well. Without the instance-profile, we got it working by copying a
>>>>> .aws/credentials file up to each node. We could easily automate that
>>>>> through the templates.
>>>>>
>>>>> We don't need any additional libraries. We just need to change the
>>>>> core-site.xml.
>>>>>
>>>>> -Christian
>>>>>
>>>>> On Thu, Nov 5, 2015 at 9:35 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>>>
>>>>>> Thanks for sharing this, Christian.
>>>>>>
>>>>>> What build of Spark are you using? If I understand correctly, when
>>>>>> Spark is built against Hadoop 2.6+, additional configs alone won't
>>>>>> help, because additional libraries also need to be installed
>>>>>> <https://issues.apache.org/jira/browse/SPARK-7481>.
>>>>>>
>>>>>> Nick
>>>>>>
>>>>>> On Thu, Nov 5, 2015 at 11:25 AM Christian <engr...@gmail.com> wrote:
>>>>>>
>>>>>>> We ended up reading and writing to S3 a ton in our Spark jobs. For
>>>>>>> this to work, we had to add s3a and s3 key/secret pairs, and we also
>>>>>>> had to set fs.hdfs.impl. I thought I'd share what we did, since it
>>>>>>> might be worth adding these to the Spark conf for out-of-the-box S3
>>>>>>> functionality.
>>>>>>>
>>>>>>> We created:
>>>>>>>
>>>>>>> ec2/deploy.generic/root/spark-ec2/templates/root/spark/conf/core-site.xml
>>>>>>>
>>>>>>> We changed the contents from the original, adding in the following:
>>>>>>>
>>>>>>> <property>
>>>>>>>   <name>fs.file.impl</name>
>>>>>>>   <value>org.apache.hadoop.fs.LocalFileSystem</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> <property>
>>>>>>>   <name>fs.hdfs.impl</name>
>>>>>>>   <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> <property>
>>>>>>>   <name>fs.s3.impl</name>
>>>>>>>   <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> <property>
>>>>>>>   <name>fs.s3.awsAccessKeyId</name>
>>>>>>>   <value>{{aws_access_key_id}}</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> <property>
>>>>>>>   <name>fs.s3.awsSecretAccessKey</name>
>>>>>>>   <value>{{aws_secret_access_key}}</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> <property>
>>>>>>>   <name>fs.s3n.awsAccessKeyId</name>
>>>>>>>   <value>{{aws_access_key_id}}</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> <property>
>>>>>>>   <name>fs.s3n.awsSecretAccessKey</name>
>>>>>>>   <value>{{aws_secret_access_key}}</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> <property>
>>>>>>>   <name>fs.s3a.awsAccessKeyId</name>
>>>>>>>   <value>{{aws_access_key_id}}</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> <property>
>>>>>>>   <name>fs.s3a.awsSecretAccessKey</name>
>>>>>>>   <value>{{aws_secret_access_key}}</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> This change makes Spark on EC2 work out of the box for us. It took
>>>>>>> us several days to figure this out. It works for 1.4.1 and 1.5.1 on
>>>>>>> Hadoop version 2.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Christian
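For completeness, a sketch of how a template like the one above gets
applied, assuming spark-ec2 is run from the downloaded package and that its
template substitution fills {{aws_access_key_id}} / {{aws_secret_access_key}}
from the launching environment, which matches Christian's report that the
credentials could be automated through the templates. The cluster name, key
pair, and paths are placeholders:

    # Placeholders: use your own credentials and key pair.
    export AWS_ACCESS_KEY_ID=AKIA...
    export AWS_SECRET_ACCESS_KEY=...

    # Launch with the flags Christian used earlier in the thread.
    ./ec2/spark-ec2 \
      --key-pair=my-keypair \
      --identity-file=/path/to/my-keypair.pem \
      --spark-version=1.5.1 \
      --hadoop-major-version=2 \
      launch my-s3-cluster

One caveat worth flagging: the fs.s3a.awsAccessKeyId and
fs.s3a.awsSecretAccessKey entries mirror the s3/s3n naming, but on Hadoop
2.6+ the s3a connector reads fs.s3a.access.key and fs.s3a.secret.key
instead, so those two entries would likely need renaming on such builds. On
a Hadoop 1/2.4-era cluster like Christian's, s3n carries the traffic, and
the instance-profile route he mentions avoids putting keys in core-site.xml
at all.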