Here is an example of how I would pass the S3 parameters to the Hadoop configuration in PySpark. You can do something similar for any other parameter you want to pass to the Hadoop configuration:
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", your_access_key_id)
hadoopConf.set("fs.s3n.awsSecretAccessKey", your_secret_access_key)

lines = sc.textFile(your_dataset_in_S3)
lines.count()

On Thu, May 14, 2015 at 4:17 AM, ayan guha <guha.a...@gmail.com> wrote:

> Jo
>
> Thanks for the reply, but _jsc does not have anything to pass hadoop
> configs. Can you illustrate your answer a bit more? TIA...
>
> On Wed, May 13, 2015 at 12:08 AM, Ram Sriharsha <sriharsha....@gmail.com> wrote:
>
>> Yes, the SparkContext in the Python API has a reference to the
>> JavaSparkContext (jsc)
>>
>> https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext
>>
>> through which you can access the Hadoop configuration.
>>
>> On Tue, May 12, 2015 at 6:39 AM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> I found this method in the Scala API but not in the Python API (1.3.1).
>>>
>>> Basically, I want to change the block size in order to read a binary
>>> file using sc.binaryRecords but with multiple partitions (for testing
>>> I want to generate partitions smaller than the default block size).
>>>
>>> Is it possible in Python? If so, how?
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>
>
> --
> Best Regards,
> Ayan Guha
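
P.S. Applying the same pattern to your original block-size question, here is a minimal, untested sketch. It assumes the fixed-length input format behind sc.binaryRecords honors the standard Hadoop property mapreduce.input.fileinputformat.split.maxsize; the path and record length below are hypothetical placeholders.

hadoopConf = sc._jsc.hadoopConfiguration()
# Cap each input split at 1 MB so a file larger than that should come
# back as multiple partitions rather than one per default block.
hadoopConf.set("mapreduce.input.fileinputformat.split.maxsize", str(1024 * 1024))
records = sc.binaryRecords("/path/to/fixed_length_records.bin", recordLength=512)
print(records.getNumPartitions())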