Hi list! I am working on a pyspark streaming job (ver 2.2.0) and I need to enable checkpointing. At high level my python script goes like this:
class StreamingJob(): def __init__(..): ... sparkContext._jsc.hadoopConfiguration().set('fs.s3a.access.key',....) sparkContext._jsc.hadoopConfiguration().set('fs.s3a.secret.key',....) def doJob(self): ssc = StreamingContext.getOrCreate('<S3-location>', <function to create ssc>) and I run it: myJob = StreamingJob(...) myJob.doJob() The problem is that StreamingContext.getOrCreate is not able to have access to hadoop configuration configured in the constructor and fails to load from checkpoint with "com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain" If I export AWS credentials to the system ENV before starting the script it works! I see the Scala version has an option to provide the hadoop configuration that is not available in python I don't have the whole Hadoop, just Spark, so I don't really want to configure hadoop's xmls and such What is the cleanest way to achieve my goal? thanks!