I figured it out. I had to add this line to the script:

    sc._jsc.hadoopConfiguration().set("fs.s3.canned.acl", "BucketOwnerFullControl")
Basically, I had to get the JavaSparkContext from the SparkContext in order
to access the Hadoop configuration and set the permissions.

Follow-up question: Is there a better way to get the JavaSparkContext or the
Hadoop Configuration from the pyspark SparkContext? Accessing a protected
variable directly doesn't seem right.

Thanks,
Justin

On Fri, Jun 5, 2015 at 3:02 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

> You could try adding the configuration in the spark-defaults.conf file.
> And once you run the application you can actually check on the driver UI
> (runs on 4040) Environment tab to see if the configuration is set properly.
>
> Thanks
> Best Regards
>
> On Thu, Jun 4, 2015 at 8:40 PM, Justin Steigel <jsteigs...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I'm running Spark on AWS EMR and I'm having some issues getting the
>> correct permissions on the output files using
>> rdd.saveAsTextFile('<file_dir_name>'). In Hive, I would add a line at
>> the beginning of the script with
>>
>> set fs.s3.canned.acl=BucketOwnerFullControl
>>
>> and that would set the correct grantees for the files. For Spark, I
>> tried adding the permissions as a --conf option:
>>
>> hadoop jar /mnt/var/lib/hadoop/steps/s-3HIRLHJJXV3SJ/script-runner.jar \
>>   /home/hadoop/spark/bin/spark-submit --deploy-mode cluster \
>>   --master yarn-cluster \
>>   --conf "spark.driver.extraJavaOptions=-Dfs.s3.canned.acl=BucketOwnerFullControl" \
>>   hdfs:///user/hadoop/spark.py
>>
>> But the permissions do not get set properly on the output files. What
>> is the proper way to pass 'fs.s3.canned.acl=BucketOwnerFullControl' or
>> any of the S3 canned permissions to the Spark job?
>>
>> Thanks in advance
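On the follow-up question: one way to avoid touching the protected `_jsc`
attribute at all is Spark's `spark.hadoop.*` configuration prefix. Spark
strips the `spark.hadoop.` prefix and copies the remainder into the Hadoop
Configuration it hands to jobs, so the canned ACL can live in
spark-defaults.conf (the file Akhil mentioned) instead of the script. A
minimal sketch, assuming the same `fs.s3.canned.acl` key discussed above:

    # spark-defaults.conf -- Spark forwards properties with the
    # spark.hadoop. prefix into the job's Hadoop Configuration
    spark.hadoop.fs.s3.canned.acl    BucketOwnerFullControl

The same property can be passed per job on the command line, e.g.
spark-submit --conf spark.hadoop.fs.s3.canned.acl=BucketOwnerFullControl,
which should also make it visible on the driver UI's Environment tab. Note
this differs from the earlier spark.driver.extraJavaOptions attempt: a -D
flag sets a JVM system property, which Hadoop's Configuration does not read.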