Hi Kostiantyn,

You should be able to use spark.conf to specify s3a keys.

I don't remember exactly, but you can set Hadoop properties by prefixing them with 
spark.hadoop.*
where * is the s3a property name. For instance,

spark.hadoop.fs.s3a.access.key wudjgdueyhsj

Of course, you need to make sure the property key is right. I'm using my phone, 
so I cannot easily verify.

Then you can specify a different user by passing a different spark conf file via 
--properties-file when you run spark-submit.
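
Something like this, as a rough sketch (the file, class, and jar names below are 
just placeholders, and please double-check the property keys against the s3a docs):

conf/spark-user-a.conf:
spark.hadoop.fs.s3a.access.key    <user-a-access-key>
spark.hadoop.fs.s3a.secret.key    <user-a-secret-key>

spark-submit --properties-file conf/spark-user-a.conf --class com.example.UserAJob user-a-job.jar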

HTH,

Jerry

Sent from my iPhone

> On 31 Dec, 2015, at 2:06 pm, KOSTIANTYN Kudriavtsev 
> <kudryavtsev.konstan...@gmail.com> wrote:
> 
> Hi Jerry,
> 
> what you suggested appears to be working (I put hdfs-site.xml into the 
> $SPARK_HOME/conf folder), but could you shed some light on how it can be 
> federated per user?
> Thanks in advance!
> 
> Thank you,
> Konstantin Kudryavtsev
> 
>> On Wed, Dec 30, 2015 at 2:37 PM, Jerry Lam <chiling...@gmail.com> wrote:
>> Hi Kostiantyn,
>> 
>> I want to confirm that it works first by using hdfs-site.xml. If yes, you 
>> could define a different spark-{user-x}.conf per user and source it during 
>> spark-submit. Let us know if hdfs-site.xml works first. It should.
>> 
>> Best Regards,
>> 
>> Jerry
>> 
>> Sent from my iPhone
>> 
>>> On 30 Dec, 2015, at 2:31 pm, KOSTIANTYN Kudriavtsev 
>>> <kudryavtsev.konstan...@gmail.com> wrote:
>>> 
>>> Hi Jerry,
>>> 
>>> I want to run different jobs on different S3 buckets - different AWS creds 
>>> - on the same instances. Could you shed some light on whether it's possible 
>>> to achieve this with hdfs-site.xml?
>>> 
>>> Thank you,
>>> Konstantin Kudryavtsev
>>> 
>>>> On Wed, Dec 30, 2015 at 2:10 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>>> Hi Kostiantyn,
>>>> 
>>>> Can you define those properties in hdfs-site.xml and make sure it is 
>>>> visible on the classpath when you run spark-submit? It looks like a conf 
>>>> sourcing issue to me. 
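>>>> 
>>>> For reference, a minimal hdfs-site.xml sketch (the fs.s3a.* keys match the 
>>>> ones in the code further down the thread; the values are placeholders):
>>>> 
>>>> <configuration>
>>>>   <property>
>>>>     <name>fs.s3a.access.key</name>
>>>>     <value>YOUR_ACCESS_KEY</value>
>>>>   </property>
>>>>   <property>
>>>>     <name>fs.s3a.secret.key</name>
>>>>     <value>YOUR_SECRET_KEY</value>
>>>>   </property>
>>>> </configuration>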
>>>> 
>>>> Cheers,
>>>> 
>>>> Sent from my iPhone
>>>> 
>>>>> On 30 Dec, 2015, at 1:59 pm, KOSTIANTYN Kudriavtsev 
>>>>> <kudryavtsev.konstan...@gmail.com> wrote:
>>>>> 
>>>>> Chris,
>>>>> 
>>>>> thanks for the hint with IAM roles, but in my case I need to run 
>>>>> different jobs with different S3 permissions on the same cluster, so this 
>>>>> approach doesn't work for me, as far as I understand it
>>>>> 
>>>>> Thank you,
>>>>> Konstantin Kudryavtsev
>>>>> 
>>>>>> On Wed, Dec 30, 2015 at 1:48 PM, Chris Fregly <ch...@fregly.com> wrote:
>>>>>> couple things:
>>>>>> 
>>>>>> 1) switch to IAM roles if at all possible - explicitly passing AWS 
>>>>>> credentials is a long and lonely road in the end
>>>>>> 
>>>>>> 2) one really bad workaround/hack is to run a job that hits every worker 
>>>>>> and writes the credentials to the proper location (~/.awscredentials or 
>>>>>> whatever)
>>>>>> 
>>>>>> ^^ i wouldn't recommend this. ^^  it's horrible and doesn't handle 
>>>>>> autoscaling, but i'm mentioning it anyway as it is a temporary fix.
>>>>>> 
>>>>>> if you switch to IAM roles, things become a lot easier as you can 
>>>>>> authorize all of the EC2 instances in the cluster - and it handles 
>>>>>> autoscaling very well - and at some point, you will want to autoscale.
>>>>>> 
>>>>>>> On Wed, Dec 30, 2015 at 1:08 PM, KOSTIANTYN Kudriavtsev 
>>>>>>> <kudryavtsev.konstan...@gmail.com> wrote:
>>>>>>> Chris,
>>>>>>> 
>>>>>>> good question, as you can see from the code I set them up on the driver, 
>>>>>>> so I expect they will be propagated to all nodes, won't they?
>>>>>>> 
>>>>>>> Thank you,
>>>>>>> Konstantin Kudryavtsev
>>>>>>> 
>>>>>>>> On Wed, Dec 30, 2015 at 1:06 PM, Chris Fregly <ch...@fregly.com> wrote:
>>>>>>>> are the credentials visible from each Worker node to all the Executor 
>>>>>>>> JVMs on each Worker?
>>>>>>>> 
>>>>>>>>> On Dec 30, 2015, at 12:45 PM, KOSTIANTYN Kudriavtsev 
>>>>>>>>> <kudryavtsev.konstan...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Dear Spark community,
>>>>>>>>> 
>>>>>>>>> I faced the following issue while trying to access data on S3a; my 
>>>>>>>>> code is the following:
>>>>>>>>> 
>>>>>>>>> import org.apache.spark.{SparkConf, SparkContext}
>>>>>>>>> import org.apache.spark.sql.SQLContext
>>>>>>>>> 
>>>>>>>>> val sparkConf = new SparkConf()
>>>>>>>>> val sc = new SparkContext(sparkConf)
>>>>>>>>> // set the S3A credentials on the driver's Hadoop configuration
>>>>>>>>> sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
>>>>>>>>> sc.hadoopConfiguration.set("fs.s3a.access.key", "---")
>>>>>>>>> sc.hadoopConfiguration.set("fs.s3a.secret.key", "---")
>>>>>>>>> val sqlContext = SQLContext.getOrCreate(sc)
>>>>>>>>> val df = sqlContext.read.parquet(...)
>>>>>>>>> df.count
>>>>>>>>> 
>>>>>>>>> It results in the following exception and log messages:
>>>>>>>>> 15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load 
>>>>>>>>> credentials from BasicAWSCredentialsProvider: Access key or secret 
>>>>>>>>> key is null
>>>>>>>>> 15/12/30 17:00:32 DEBUG EC2MetadataClient: Connecting to EC2 instance 
>>>>>>>>> metadata service at URL: 
>>>>>>>>> http://x.x.x.x/latest/meta-data/iam/security-credentials/
>>>>>>>>> 15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load 
>>>>>>>>> credentials from InstanceProfileCredentialsProvider: The requested 
>>>>>>>>> metadata is not found at 
>>>>>>>>> http://x.x.x.x/latest/meta-data/iam/security-credentials/
>>>>>>>>> 15/12/30 17:00:32 ERROR Executor: Exception in task 1.0 in stage 1.0 
>>>>>>>>> (TID 3)
>>>>>>>>> com.amazonaws.AmazonClientException: Unable to load AWS credentials 
>>>>>>>>> from any provider in the chain
>>>>>>>>>       at 
>>>>>>>>> com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
>>>>>>>>>       at 
>>>>>>>>> com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
>>>>>>>>>       at 
>>>>>>>>> com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
>>>>>>>>>       at 
>>>>>>>>> com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
>>>>>>>>>       at 
>>>>>>>>> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
>>>>>>>>> 
>>>>>>>>> I run standalone Spark 1.5.2 with Hadoop 2.7.1.
>>>>>>>>> 
>>>>>>>>> Any ideas or workarounds?
>>>>>>>>> 
>>>>>>>>> The AWS credentials are correct for this bucket.
>>>>>>>>> 
>>>>>>>>> Thank you,
>>>>>>>>> Konstantin Kudryavtsev
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> -- 
>>>>>> 
>>>>>> Chris Fregly
>>>>>> Principal Data Solutions Engineer
>>>>>> IBM Spark Technology Center, San Francisco, CA
>>>>>> http://spark.tc | http://advancedspark.com
> 
