On 2 Aug 2017, at 10:34, Riccardo Ferrari <ferra...@gmail.com> wrote:
> Hi list!
>
> I am working on a pyspark streaming job (ver 2.2.0) and I need to enable checkpointing. At a high level my python script goes like this:
>
>     class StreamingJob():
>         def __init__(..):
>             ...
>             sparkContext._jsc.hadoopConfiguration().set('fs.s3a.access.key', ....)
>             sparkContext._jsc.hadoopConfiguration().set('fs.s3a.secret.key', ....)
>
>         def doJob(self):
>             ssc = StreamingContext.getOrCreate('<S3-location>', <function to create ssc>)
>
> and I run it:
>
>     myJob = StreamingJob(...)
>     myJob.doJob()
>
> The problem is that StreamingContext.getOrCreate does not have access to the hadoop configuration set in the constructor, and it fails to load from the checkpoint with:
>
>     com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
>
> If I export AWS credentials to the system ENV before starting the script, it works!

Spark magically copies the env vars over for you when you launch a job.

> I see the Scala version has an option to provide the hadoop configuration that is not available in python. I don't have the whole Hadoop, just Spark, so I don't really want to configure hadoop's xmls and such.

You can set them when you set up the context, as in spark-defaults.conf:

    spark.hadoop.fs.s3a.access.key=access key
    spark.hadoop.fs.s3a.secret.key=secret key

Reminder: do keep your secret key a secret; avoid checking it in to any form of revision control.
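For illustration, a minimal sketch of what the pyspark side can look like once the credentials live in spark-defaults.conf; the bucket, checkpoint path, app name and batch interval below are placeholders of mine, not from the original job:

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    CHECKPOINT = 's3a://some-bucket/checkpoints/streaming-job'  # placeholder path

    def create_ssc():
        # Only runs when no checkpoint exists at CHECKPOINT yet.
        sc = SparkContext(conf=SparkConf().setAppName('streaming-job'))
        ssc = StreamingContext(sc, 10)  # 10 s batch interval, illustrative
        # ... define DStreams and output operations here ...
        ssc.checkpoint(CHECKPOINT)
        return ssc

    # On restart this reads the checkpoint back from S3. The s3a credentials
    # come from the spark.hadoop.* entries in spark-defaults.conf, so the job
    # no longer needs to touch hadoopConfiguration() itself.
    ssc = StreamingContext.getOrCreate(CHECKPOINT, create_ssc)
    ssc.start()
    ssc.awaitTermination()

This is consistent with the behaviour described above: the checkpoint reader does not see configuration set on the already-running context, but it does pick up spark.hadoop.* properties and the AWS environment variables.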