I figured it out.  I had to add this line to the script:

sc._jsc.hadoopConfiguration().set("fs.s3.canned.acl", "BucketOwnerFullControl")

Basically, I had to go through the JavaSparkContext wrapped inside the pyspark
SparkContext to get at the Hadoop configuration and set the permissions.
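
For reference, here's a minimal sketch of the relevant part of the script with
that change in place (the app name, bucket, and output path are just
placeholders):

from pyspark import SparkContext

sc = SparkContext(appName="s3-acl-example")  # placeholder app name

# Reach through the py4j-wrapped JavaSparkContext to get at the Hadoop
# configuration and set the canned ACL before writing to S3.
sc._jsc.hadoopConfiguration().set("fs.s3.canned.acl", "BucketOwnerFullControl")

rdd = sc.parallelize(["line 1", "line 2"])
rdd.saveAsTextFile("s3://my-bucket/output/path")  # placeholder bucket/path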

Follow-up question: is there a better way to get the JavaSparkContext or
the Hadoop Configuration from the pyspark SparkContext?  Accessing a
protected variable directly doesn't seem right.
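
To make the question concrete, here's what I'm doing today versus the kind of
accessor I was hoping existed (the getHadoopConfiguration() call below is
hypothetical, not a real pyspark API):

# Today: reaching through the protected _jsc attribute.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.canned.acl", "BucketOwnerFullControl")

# Hoped for: a public accessor along these lines (hypothetical, does not exist):
# hadoop_conf = sc.getHadoopConfiguration()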

Thanks,
Justin

On Fri, Jun 5, 2015 at 3:02 AM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:

> You could try adding the configuration in the spark-defaults.conf file.
> Once you run the application, you can check the Environment tab on the
> driver UI (which runs on port 4040) to see whether the configuration is set
> properly.
>
> Thanks
> Best Regards
>
> On Thu, Jun 4, 2015 at 8:40 PM, Justin Steigel <jsteigs...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I'm running Spark on AWS EMR and I'm having some issues getting the
>> correct permissions on the output files using
>> rdd.saveAsTextFile('<file_dir_name>').  In Hive, I would add a line at the
>> beginning of the script:
>>
>> set fs.s3.canned.acl=BucketOwnerFullControl
>>
>> and that would set the correct grantees for the files. For Spark, I tried
>> adding the permissions as a --conf option:
>>
>> hadoop jar /mnt/var/lib/hadoop/steps/s-3HIRLHJJXV3SJ/script-runner.jar \
>> /home/hadoop/spark/bin/spark-submit --deploy-mode cluster \
>> --master yarn-cluster \
>> --conf "spark.driver.extraJavaOptions -Dfs.s3.canned.acl=BucketOwnerFullControl" \
>> hdfs:///user/hadoop/spark.py
>>
>> But the permissions do not get set properly on the output files. What is
>> the proper way to pass 'fs.s3.canned.acl=BucketOwnerFullControl', or any of
>> the other S3 canned ACLs, to the Spark job?
>>
>> Thanks in advance
>>
>
>
