Spark is inventing its own AWS secret key

2017-03-08 Thread Jonhy Stack
Hi,

I'm trying to read an S3 bucket from Spark, and up until today Spark has always
complained that the request returns a 403

hadoopConf = spark_context._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.access.key", "ACCESSKEY")
hadoopConf.set("fs.s3a.secret.key", "SECRETKEY")
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
logs = spark_context.textFile("s3a://mybucket/logs/*")

Spark was saying  Invalid Access key [ACCESSKEY]

However, with the same ACCESSKEY and SECRETKEY this was working with the aws-cli

aws s3 ls mybucket/logs/

and in Python with boto3 this was working

resource = boto3.resource("s3", region_name="us-east-1")
resource.Object("mybucket", "logs/text.py") \
    .put(Body=open("text.py", "rb"), ContentType="text/x-py")

so my credentials are NOT invalid, and the problem is definitely something with
Spark..

Today I decided to turn on DEBUG logging for all of Spark and, to my
surprise... Spark is NOT using the [SECRETKEY] I have provided but
instead... adds a random one???

17/03/08 10:40:04 DEBUG request: Sending Request: HEAD
https://mybucket.s3.amazonaws.com / Headers: (Authorization: AWS
ACCESSKEY:**[RANDOM-SECRET-KEY]**, User-Agent: aws-sdk-java/1.7.4
Mac_OS_X/10.11.6 Java_HotSpot(TM)_64-Bit_Server_VM/25.65-b01/1.8.0_65,
Date: Wed, 08 Mar 2017 10:40:04 GMT, Content-Type:
application/x-www-form-urlencoded; charset=utf-8, )

This is why it still returns 403! Spark is not using the key I provide with
fs.s3a.secret.key but instead invents a random one EACH time (every time I
submit the job the random secret key is different)
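(One possible reading of that header, sketched here as an assumption rather than a diagnosis: under AWS Signature Version 2, the value after the colon in `Authorization: AWS AccessKeyId:...` is a per-request HMAC signature derived from the secret key, not the secret key itself, so it would look different for every request. A minimal stdlib sketch of that computation; the key and string-to-sign below are placeholders, not real values:)

```python
import base64
import hashlib
import hmac

def sigv2_signature(secret_key: str, string_to_sign: str) -> str:
    """AWS Signature Version 2: base64(HMAC-SHA1(secret, StringToSign))."""
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")

# The header is then "Authorization: AWS <AccessKeyId>:<signature>".
# The signature changes whenever the signed string (HTTP verb, date,
# resource path, ...) changes, so it differs on every submission.
sig = sigv2_signature("SECRETKEY",
                      "HEAD\n\n\nWed, 08 Mar 2017 10:40:04 GMT\n/mybucket/")
print("AWS ACCESSKEY:" + sig)
```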

For the record, I'm running this locally on my machine (OSX) with this
command

spark-submit --packages
com.amazonaws:aws-java-sdk-pom:1.11.98,org.apache.hadoop:hadoop-aws:2.7.3
test.py
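(In case the in-code hadoopConfiguration calls are somehow not taking effect, the same credentials can also be passed at submit time via `spark.hadoop.*` properties, which Spark copies into the Hadoop configuration. A sketch of that variant; ACCESSKEY/SECRETKEY are placeholders:)

```shell
spark-submit \
  --packages com.amazonaws:aws-java-sdk-pom:1.11.98,org.apache.hadoop:hadoop-aws:2.7.3 \
  --conf spark.hadoop.fs.s3a.access.key=ACCESSKEY \
  --conf spark.hadoop.fs.s3a.secret.key=SECRETKEY \
  test.py
```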

Could someone enlighten me on this?


(python) Spark .textFile(s3://…) access denied 403 with valid credentials

2017-03-07 Thread Jonhy Stack
In order to access my S3 bucket I have exported my creds

export AWS_SECRET_ACCESS_KEY=
export AWS_ACCESSS_ACCESS_KEY=

I can verify that everything works by doing

aws s3 ls mybucket
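(Worth double-checking which variables are actually exported — per the AWS documentation, the conventional names the CLI and SDKs read are AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. A minimal stdlib sketch of that check; the helper name is made up for illustration:)

```python
import os

# Conventional credential variable names read by the AWS CLI and SDKs.
EXPECTED = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]

def check_aws_env(environ=os.environ):
    """Return (present, missing) lists for the expected AWS variables."""
    present = [name for name in EXPECTED if environ.get(name)]
    missing = [name for name in EXPECTED if not environ.get(name)]
    return present, missing

present, missing = check_aws_env()
print("present:", present)
print("missing:", missing)
```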

I can also verify with boto3 that it works in python

resource = boto3.resource("s3", region_name="us-east-1")
resource.Object("mybucket", "text/text.py") \
.put(Body=open("text.py", "rb"),ContentType="text/x-py")

This works and I can see the file in the bucket.

However, when I do this with Spark:

from pyspark import SparkContext
from pyspark.sql import SQLContext

spark_context = SparkContext()
sql_context = SQLContext(spark_context)
spark_context.textFile("s3://mybucket/my/path/*")

I get a nice

> Caused by: org.jets3t.service.S3ServiceException: Service Error
> Message. -- ResponseCode: 403, ResponseStatus: Forbidden, XML Error
> Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>InvalidAccessKeyId</Code><Message>The
> AWS Access Key Id you provided does not exist in our
> records.</Message><AWSAccessKeyId>[MY_ACCESS_KEY]</AWSAccessKeyId><RequestId>Xxxx</RequestId></Error>

this is how I submit the job locally

spark-submit --packages
com.amazonaws:aws-java-sdk-pom:1.11.98,org.apache.hadoop:hadoop-aws:2.7.3
test.py

Why does it work with the command line + boto3, but Spark is choking?