Oh right. I forgot about the libraries being removed.
On Thu, Nov 5, 2015 at 10:35 PM Nicholas Chammas <nicholas.cham...@gmail.com>
wrote:

> I might be mistaken, but yes, even with the changes you mentioned, you will
> not be able to access S3 if Spark is built against Hadoop 2.6+ unless you
> install additional libraries. The issue is explained in SPARK-7481
> <https://issues.apache.org/jira/browse/SPARK-7481> and SPARK-7442
> <https://issues.apache.org/jira/browse/SPARK-7442>.
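>
> In case it helps, a quick way to check whether the S3 filesystem classes are
> actually on the classpath of a given build is to probe for them from
> spark-shell. A rough Scala sketch (the class names are the standard Hadoop
> ones; the --packages hint in the comments is only a suggestion, and the
> hadoop-aws version would need to match your Hadoop build):
>
> import scala.util.Try
>
> // On Hadoop 2.6+ builds these classes usually live in the separate
> // hadoop-aws module, so they may not be on the classpath at all.
> Seq(
>   "org.apache.hadoop.fs.s3native.NativeS3FileSystem", // s3n
>   "org.apache.hadoop.fs.s3a.S3AFileSystem"            // s3a
> ).foreach { cls =>
>   val found = Try(Class.forName(cls)).isSuccess
>   // If "missing", launching with something like
>   //   spark-shell --packages org.apache.hadoop:hadoop-aws:<your-hadoop-version>
>   // (plus the matching aws-java-sdk) is one way to pull them in.
>   println(cls + " -> " + (if (found) "found" else "missing"))
> }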
>
> On Fri, Nov 6, 2015 at 12:22 AM Christian <engr...@gmail.com> wrote:
>
>> Even with the changes I mentioned above?
>> On Thu, Nov 5, 2015 at 8:10 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Yep, I think if you try spark-1.5.1-hadoop-2.6 you will find that you
>>> cannot access S3, unfortunately.
>>>
>>> On Thu, Nov 5, 2015 at 3:53 PM Christian <engr...@gmail.com> wrote:
>>>
>>>> I created the cluster with the following:
>>>>
>>>> --hadoop-major-version=2
>>>> --spark-version=1.4.1
>>>>
>>>> from: spark-1.5.1-bin-hadoop1
>>>>
>>>> Are you saying there might be different behavior if I download
>>>> spark-1.5.1-hadoop-2.6 and create my cluster?
>>>>
>>>> On Thu, Nov 5, 2015 at 1:28 PM, Christian <engr...@gmail.com> wrote:
>>>>
>>>>> Spark 1.5.1-hadoop1
>>>>>
>>>>> On Thu, Nov 5, 2015 at 10:28 AM, Nicholas Chammas <
>>>>> nicholas.cham...@gmail.com> wrote:
>>>>>
>>>>>> > I am using both 1.4.1 and 1.5.1.
>>>>>>
>>>>>> That's the Spark version. I'm wondering what version of Hadoop your
>>>>>> Spark is built against.
>>>>>>
>>>>>> For example, when you download Spark
>>>>>> <http://spark.apache.org/downloads.html> you have to select from a
>>>>>> number of packages (under "Choose a package type"), and each is built
>>>>>> against a different version of Hadoop. When Spark is built against Hadoop
>>>>>> 2.6+, from my understanding, you need to install additional libraries
>>>>>> <https://issues.apache.org/jira/browse/SPARK-7481> to access S3.
>>>>>> When Spark is built against Hadoop 2.4 or earlier, you don't need to do
>>>>>> this.
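>>>>>>
>>>>>> If it's easier than checking the package name, one quick way to see which
>>>>>> Hadoop version a Spark build is linked against is to ask Hadoop itself from
>>>>>> spark-shell. A minimal sketch (VersionInfo ships with the Hadoop jars bundled
>>>>>> in the Spark build):
>>>>>>
>>>>>> // Prints the Hadoop version this Spark build was packaged with,
>>>>>> // e.g. 1.x for a hadoop1 package or 2.6.x for a hadoop2.6 package.
>>>>>> println(org.apache.hadoop.util.VersionInfo.getVersion)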
>>>>>>
>>>>>> I'm trying to confirm that this is what is happening in your case.
>>>>>>
>>>>>> Nick
>>>>>>
>>>>>> On Thu, Nov 5, 2015 at 12:17 PM Christian <engr...@gmail.com> wrote:
>>>>>>
>>>>>>> I am using both 1.4.1 and 1.5.1. In the end, we used 1.5.1 because
>>>>>>> of the new instance-profile feature, which greatly helps with this as
>>>>>>> well.
>>>>>>> Without the instance-profile, we got it working by copying a
>>>>>>> .aws/credentials file up to each node. We could easily automate that
>>>>>>> through the templates.
>>>>>>>
>>>>>>> We don't need any additional libraries. We just need to change the
>>>>>>> core-site.xml.
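>>>>>>>
>>>>>>> For anyone who wants to try this before templating core-site.xml, the same
>>>>>>> properties can also be set at runtime on the Hadoop configuration. A rough
>>>>>>> Scala sketch from spark-shell, where sc is already defined (the property
>>>>>>> names mirror the XML further down; the credential values are placeholders):
>>>>>>>
>>>>>>> // Equivalent of the core-site.xml entries, set per SparkContext.
>>>>>>> // Replace the placeholders with real keys (or use an instance profile).
>>>>>>> val hc = sc.hadoopConfiguration
>>>>>>> hc.set("fs.s3n.awsAccessKeyId", "<aws_access_key_id>")
>>>>>>> hc.set("fs.s3n.awsSecretAccessKey", "<aws_secret_access_key>")
>>>>>>> hc.set("fs.s3.awsAccessKeyId", "<aws_access_key_id>")
>>>>>>> hc.set("fs.s3.awsSecretAccessKey", "<aws_secret_access_key>")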
>>>>>>>
>>>>>>> -Christian
>>>>>>>
>>>>>>> On Thu, Nov 5, 2015 at 9:35 AM, Nicholas Chammas <
>>>>>>> nicholas.cham...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks for sharing this, Christian.
>>>>>>>>
>>>>>>>> What build of Spark are you using? If I understand correctly, if you are
>>>>>>>> using Spark built against Hadoop 2.6+ then additional configs alone won't
>>>>>>>> help because additional libraries also need to be installed
>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-7481>.
>>>>>>>>
>>>>>>>> Nick
>>>>>>>>
>>>>>>>> On Thu, Nov 5, 2015 at 11:25 AM Christian <engr...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> We ended up reading and writing to S3 a ton in our Spark jobs.
>>>>>>>>> For this to work, we had to add s3, s3n, and s3a key/secret
>>>>>>>>> pairs. We also had to add fs.hdfs.impl to get these things to work.
>>>>>>>>>
>>>>>>>>> I thought I'd share what we did, since it might be worth adding
>>>>>>>>> these to the Spark conf for out-of-the-box S3 functionality.
>>>>>>>>>
>>>>>>>>> We created:
>>>>>>>>>
>>>>>>>>> ec2/deploy.generic/root/spark-ec2/templates/root/spark/conf/core-site.xml
>>>>>>>>>
>>>>>>>>> We changed the contents from the original, adding in the following:
>>>>>>>>>
>>>>>>>>>   <property>
>>>>>>>>>     <name>fs.file.impl</name>
>>>>>>>>>     <value>org.apache.hadoop.fs.LocalFileSystem</value>
>>>>>>>>>   </property>
>>>>>>>>>
>>>>>>>>>   <property>
>>>>>>>>>     <name>fs.hdfs.impl</name>
>>>>>>>>>     <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
>>>>>>>>>   </property>
>>>>>>>>>
>>>>>>>>>   <property>
>>>>>>>>>     <name>fs.s3.impl</name>
>>>>>>>>>     <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
>>>>>>>>>   </property>
>>>>>>>>>
>>>>>>>>>   <property>
>>>>>>>>>     <name>fs.s3.awsAccessKeyId</name>
>>>>>>>>>     <value>{{aws_access_key_id}}</value>
>>>>>>>>>   </property>
>>>>>>>>>
>>>>>>>>>   <property>
>>>>>>>>>     <name>fs.s3.awsSecretAccessKey</name>
>>>>>>>>>     <value>{{aws_secret_access_key}}</value>
>>>>>>>>>   </property>
>>>>>>>>>
>>>>>>>>>   <property>
>>>>>>>>>     <name>fs.s3n.awsAccessKeyId</name>
>>>>>>>>>     <value>{{aws_access_key_id}}</value>
>>>>>>>>>   </property>
>>>>>>>>>
>>>>>>>>>   <property>
>>>>>>>>>     <name>fs.s3n.awsSecretAccessKey</name>
>>>>>>>>>     <value>{{aws_secret_access_key}}</value>
>>>>>>>>>   </property>
>>>>>>>>>
>>>>>>>>>   <property>
>>>>>>>>>     <name>fs.s3a.awsAccessKeyId</name>
>>>>>>>>>     <value>{{aws_access_key_id}}</value>
>>>>>>>>>   </property>
>>>>>>>>>
>>>>>>>>>   <property>
>>>>>>>>>     <name>fs.s3a.awsSecretAccessKey</name>
>>>>>>>>>     <value>{{aws_secret_access_key}}</value>
>>>>>>>>>   </property>
>>>>>>>>>
>>>>>>>>> This change makes Spark on EC2 work out of the box for us. It took us
>>>>>>>>> several days to figure this out. It works for 1.4.1 and 1.5.1 on Hadoop
>>>>>>>>> version 2.
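>>>>>>>>>
>>>>>>>>> For anyone reproducing this, a rough smoke test from spark-shell once the
>>>>>>>>> cluster is up (the bucket name and paths below are placeholders):
>>>>>>>>>
>>>>>>>>> // Checks that reads and writes against S3 go through using the
>>>>>>>>> // credentials templated into core-site.xml above.
>>>>>>>>> val lines = sc.textFile("s3n://<your-bucket>/<some-input-prefix>")
>>>>>>>>> println("read " + lines.count() + " lines")
>>>>>>>>> lines.saveAsTextFile("s3n://<your-bucket>/<some-output-prefix>")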
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>> Christian
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
