Oh right. I forgot about the libraries being removed.

On Thu, Nov 5, 2015 at 10:35 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
> I might be mistaken, but yes, even with the changes you mentioned you will
> not be able to access S3 if Spark is built against Hadoop 2.6+ unless you
> install additional libraries. The issue is explained in SPARK-7481
> <https://issues.apache.org/jira/browse/SPARK-7481> and SPARK-7442
> <https://issues.apache.org/jira/browse/SPARK-7442>.
>
> On Fri, Nov 6, 2015 at 12:22 AM Christian <engr...@gmail.com> wrote:
>
>> Even with the changes I mentioned above?
>>
>> On Thu, Nov 5, 2015 at 8:10 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>
>>> Yep, I think if you try spark-1.5.1-hadoop-2.6 you will find that you
>>> cannot access S3, unfortunately.
>>>
>>> On Thu, Nov 5, 2015 at 3:53 PM Christian <engr...@gmail.com> wrote:
>>>
>>>> I created the cluster with the following:
>>>>
>>>> --hadoop-major-version=2
>>>> --spark-version=1.4.1
>>>>
>>>> from: spark-1.5.1-bin-hadoop1
>>>>
>>>> Are you saying there might be different behavior if I download
>>>> spark-1.5.1-hadoop-2.6 and create my cluster?
>>>>
>>>> On Thu, Nov 5, 2015 at 1:28 PM, Christian <engr...@gmail.com> wrote:
>>>>
>>>>> Spark 1.5.1-hadoop1
>>>>>
>>>>> On Thu, Nov 5, 2015 at 10:28 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>>>
>>>>>> > I am using both 1.4.1 and 1.5.1.
>>>>>>
>>>>>> That's the Spark version. I'm wondering what version of Hadoop your
>>>>>> Spark is built against.
>>>>>>
>>>>>> For example, when you download Spark
>>>>>> <http://spark.apache.org/downloads.html> you have to select from a
>>>>>> number of packages (under "Choose a package type"), and each is built
>>>>>> against a different version of Hadoop. When Spark is built against Hadoop
>>>>>> 2.6+, from my understanding, you need to install additional libraries
>>>>>> <https://issues.apache.org/jira/browse/SPARK-7481> to access S3.
>>>>>> When Spark is built against Hadoop 2.4 or earlier, you don't need to do
>>>>>> this.
>>>>>>
>>>>>> I'm confirming that this is what is happening in your case.
>>>>>>
>>>>>> Nick
>>>>>>
>>>>>> On Thu, Nov 5, 2015 at 12:17 PM Christian <engr...@gmail.com> wrote:
>>>>>>
>>>>>>> I am using both 1.4.1 and 1.5.1. In the end, we used 1.5.1 because
>>>>>>> of the new feature for instance-profile, which greatly helps with this
>>>>>>> as well. Without the instance-profile, we got it working by copying a
>>>>>>> .aws/credentials file up to each node. We could easily automate that
>>>>>>> through the templates.
>>>>>>>
>>>>>>> I don't need any additional libraries. We just need to change the
>>>>>>> core-site.xml.
>>>>>>>
>>>>>>> -Christian
>>>>>>>
>>>>>>> On Thu, Nov 5, 2015 at 9:35 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks for sharing this, Christian.
>>>>>>>>
>>>>>>>> What build of Spark are you using? If I understand correctly, if
>>>>>>>> you are using Spark built against Hadoop 2.6+ then additional configs
>>>>>>>> alone won't help, because additional libraries also need to be installed
>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-7481>.
>>>>>>>>
>>>>>>>> Nick
>>>>>>>>
>>>>>>>> On Thu, Nov 5, 2015 at 11:25 AM Christian <engr...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> We ended up reading and writing to S3 a ton in our Spark jobs.
>>>>>>>>> For this to work, we ended up having to add s3a and s3 key/secret
>>>>>>>>> pairs. We also had to add fs.hdfs.impl to get these things to work.
>>>>>>>>>
>>>>>>>>> I thought maybe I'd share what we did, and it might be worth adding
>>>>>>>>> these to the Spark conf for out-of-the-box functionality with S3.
>>>>>>>>>
>>>>>>>>> We created:
>>>>>>>>>
>>>>>>>>> ec2/deploy.generic/root/spark-ec2/templates/root/spark/conf/core-site.xml
>>>>>>>>>
>>>>>>>>> We changed the contents from the original, adding in the following:
>>>>>>>>>
>>>>>>>>> <property>
>>>>>>>>>   <name>fs.file.impl</name>
>>>>>>>>>   <value>org.apache.hadoop.fs.LocalFileSystem</value>
>>>>>>>>> </property>
>>>>>>>>>
>>>>>>>>> <property>
>>>>>>>>>   <name>fs.hdfs.impl</name>
>>>>>>>>>   <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
>>>>>>>>> </property>
>>>>>>>>>
>>>>>>>>> <property>
>>>>>>>>>   <name>fs.s3.impl</name>
>>>>>>>>>   <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
>>>>>>>>> </property>
>>>>>>>>>
>>>>>>>>> <property>
>>>>>>>>>   <name>fs.s3.awsAccessKeyId</name>
>>>>>>>>>   <value>{{aws_access_key_id}}</value>
>>>>>>>>> </property>
>>>>>>>>>
>>>>>>>>> <property>
>>>>>>>>>   <name>fs.s3.awsSecretAccessKey</name>
>>>>>>>>>   <value>{{aws_secret_access_key}}</value>
>>>>>>>>> </property>
>>>>>>>>>
>>>>>>>>> <property>
>>>>>>>>>   <name>fs.s3n.awsAccessKeyId</name>
>>>>>>>>>   <value>{{aws_access_key_id}}</value>
>>>>>>>>> </property>
>>>>>>>>>
>>>>>>>>> <property>
>>>>>>>>>   <name>fs.s3n.awsSecretAccessKey</name>
>>>>>>>>>   <value>{{aws_secret_access_key}}</value>
>>>>>>>>> </property>
>>>>>>>>>
>>>>>>>>> <property>
>>>>>>>>>   <name>fs.s3a.awsAccessKeyId</name>
>>>>>>>>>   <value>{{aws_access_key_id}}</value>
>>>>>>>>> </property>
>>>>>>>>>
>>>>>>>>> <property>
>>>>>>>>>   <name>fs.s3a.awsSecretAccessKey</name>
>>>>>>>>>   <value>{{aws_secret_access_key}}</value>
>>>>>>>>> </property>
>>>>>>>>>
>>>>>>>>> This change makes Spark on EC2 work out of the box for us. It took
>>>>>>>>> us several days to figure this out. It works for 1.4.1 and 1.5.1 on
>>>>>>>>> Hadoop version 2.
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>> Christian
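For anyone trying to reproduce the setup quoted above, here is a minimal sketch of the spark-ec2 launch the two flags come from. The key pair, identity file, region, and cluster name are placeholders, not values from this thread; only --hadoop-major-version and --spark-version are the ones quoted:

    ./ec2/spark-ec2 \
      --key-pair=my-keypair \
      --identity-file=/path/to/my-keypair.pem \
      --region=us-east-1 \
      --hadoop-major-version=2 \
      --spark-version=1.4.1 \
      launch my-s3-cluster

As I understand it, the files under ec2/deploy.generic/ are copied out to the cluster during launch, which is how a core-site.xml template like the one above ends up as the live Hadoop config on the nodes.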
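With that core-site.xml in place, a quick sanity check from spark-shell would look something like the following; the bucket and paths are hypothetical:

    // s3n:// is the native S3 filesystem, so the fs.s3n.* credentials from
    // core-site.xml (or the instance profile, on 1.5.1) are what get used here
    val lines = sc.textFile("s3n://my-bucket/input/part-*")
    println(lines.count())
    lines.saveAsTextFile("s3n://my-bucket/output/run1")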
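And on the libraries Nick mentions for Hadoop 2.6+ builds (SPARK-7481 / SPARK-7442): as I understand it, those builds no longer bundle the S3 filesystem classes, so something along these lines is needed on top of the configs. This is only a sketch; the artifact versions are assumptions and should be matched to the Hadoop version Spark was built against:

    ./bin/spark-shell \
      --packages org.apache.hadoop:hadoop-aws:2.6.0,com.amazonaws:aws-java-sdk:1.7.4

The same --packages argument works with spark-submit, or the equivalent jars can be placed on the driver and executor classpaths instead.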