We ended up reading from and writing to S3 heavily in our Spark jobs.
To make that work, we had to add key/secret pairs for the s3, s3n, and
s3a schemes. We also had to set fs.hdfs.impl to get everything working.
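
For context, here is a minimal sketch of the kind of job this enables
(the bucket and paths are hypothetical, not ours):

  import org.apache.spark.{SparkConf, SparkContext}

  object S3ReadWrite {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("s3-read-write"))
      // s3n:// goes through NativeS3FileSystem and s3a:// through
      // S3AFileSystem; each scheme reads its own credential properties.
      val logs = sc.textFile("s3n://some-bucket/input/*.log")
      logs.filter(_.contains("ERROR"))
          .saveAsTextFile("s3a://some-bucket/output/errors")
      sc.stop()
    }
  }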

I thought I'd share what we did, since it might be worth adding these to
the Spark conf for out-of-the-box S3 functionality.
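
For anyone who would rather not bake credentials into the templates, the
same properties can also be set per job. A rough sketch from spark-shell,
assuming the keys are exported as environment variables:

  // sc is the SparkContext that spark-shell provides
  sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
  sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
  sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))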

We created:
ec2/deploy.generic/root/spark-ec2/templates/root/spark/conf/core-site.xml

We changed the contents from the original, adding the following:

  <property>
    <name>fs.file.impl</name>
    <value>org.apache.hadoop.fs.LocalFileSystem</value>
  </property>

  <property>
    <name>fs.hdfs.impl</name>
    <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  </property>

  <property>
    <name>fs.s3.impl</name>
    <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
  </property>

  <property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>{{aws_access_key_id}}</value>
  </property>

  <property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>{{aws_secret_access_key}}</value>
  </property>

  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>{{aws_access_key_id}}</value>
  </property>

  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>{{aws_secret_access_key}}</value>
  </property>

  <!-- Note: s3a uses different credential property names than s3/s3n. -->
  <property>
    <name>fs.s3a.access.key</name>
    <value>{{aws_access_key_id}}</value>
  </property>

  <property>
    <name>fs.s3a.secret.key</name>
    <value>{{aws_secret_access_key}}</value>
  </property>
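
Once the rendered core-site.xml is on the cluster, a quick sanity check
from spark-shell (bucket and prefix are placeholders) shows each scheme
resolving with these credentials:

  sc.textFile("s3://some-bucket/some/prefix").count()   // NativeS3FileSystem via fs.s3.impl
  sc.textFile("s3n://some-bucket/some/prefix").count()  // NativeS3FileSystem
  sc.textFile("s3a://some-bucket/some/prefix").count()  // S3AFileSystem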

This change makes Spark on EC2 work with S3 out of the box for us. It
took us several days to figure out. It works for Spark 1.4.1 and 1.5.1
on Hadoop 2.

Best Regards,
Christian
