Here is an example of how I would pass the S3 parameters to the Hadoop
configuration in PySpark. You can do something similar for any other
parameters you want to pass to the Hadoop configuration:

# Grab the Hadoop Configuration from the underlying JavaSparkContext
hadoopConf = sc._jsc.hadoopConfiguration()
# Use the native S3 filesystem implementation for s3:// paths
hadoopConf.set("fs.s3.impl",
               "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", your_access_key_id)
hadoopConf.set("fs.s3n.awsSecretAccessKey", your_secret_access_key)

lines = sc.textFile(your_dataset_in_S3)  # e.g. an s3n:// path
lines.count()
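
The same hadoopConfiguration() handle covers your original blocksize
question. Here is a minimal sketch; note the property names below are
assumptions on my part (which one is honored depends on the filesystem
and input format, e.g. fs.local.block.size for local files vs.
dfs.blocksize on HDFS), so I haven't verified which one binaryRecords
actually picks up:

hadoopConf = sc._jsc.hadoopConfiguration()
# Assumed property names; adjust for the filesystem you read from.
hadoopConf.set("fs.local.block.size", "1048576")  # 1 MB, local filesystem
hadoopConf.set("dfs.blocksize", "1048576")        # 1 MB, HDFS

# recordLength is the fixed record size in bytes
records = sc.binaryRecords(your_binary_file, recordLength=100)
print(records.getNumPartitions())

If the partition count does not change, the split size may be driven by
the input format itself rather than the filesystem blocksize.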


On Thu, May 14, 2015 at 4:17 AM, ayan guha <guha.a...@gmail.com> wrote:

> Jo
>
> Thanks for the reply, but _jsc does not have anything to pass Hadoop
> configs. Can you illustrate your answer a bit more? TIA...
>
> On Wed, May 13, 2015 at 12:08 AM, Ram Sriharsha <sriharsha....@gmail.com>
> wrote:
>
>> Yes, the SparkContext in the Python API has a reference to the
>> JavaSparkContext (jsc)
>>
>> https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext
>>
>> through which you can access the Hadoop configuration.
>>
>> On Tue, May 12, 2015 at 6:39 AM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> I found this method in the Scala API but not in the Python API (1.3.1).
>>>
>>> Basically, I want to change the blocksize in order to read a binary file
>>> using sc.binaryRecords but with multiple partitions (for testing I want to
>>> generate partitions smaller than the default blocksize).
>>>
>>> Is it possible in Python? If so, how?
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>
