I think any request going to an s3*:// path requires credentials. If they have
made the dataset public (accessible over plain HTTP), then you won't need the keys.
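
For example, for a public object you could just fetch it over HTTPS and
parallelize it. This is only a rough sketch and the bucket/key below are made
up, but something along these lines should work without any AWS keys:

from urllib2 import urlopen  # urllib.request.urlopen on Python 3
from pyspark import SparkContext

sc = SparkContext()

# Hypothetical public object, fetched over plain HTTPS instead of s3n://,
# so no AWS credentials are needed.
url = "https://s3.amazonaws.com/some-public-bucket/some_file.txt"
lines = urlopen(url).read().splitlines()

rdd = sc.parallelize(lines)
print(rdd.count())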

Thanks
Best Regards

On Wed, Jul 15, 2015 at 2:26 AM, Pagliari, Roberto <rpagli...@appcomsci.com>
wrote:

> Hi Sujit,
>
> I just wanted to access public datasets on Amazon. Do I still need to
> provide the keys?
>
>
>
> Thank you,
>
>
>
>
>
> *From:* Sujit Pal [mailto:sujitatgt...@gmail.com]
> *Sent:* Tuesday, July 14, 2015 3:14 PM
> *To:* Pagliari, Roberto
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark on EMR with S3 example (Python)
>
>
>
> Hi Roberto,
>
>
>
> I have written PySpark code that reads from private S3 buckets; it should
> be similar for public S3 buckets as well. You need to set the AWS access
> and secret keys on the SparkContext's Hadoop configuration; then you can
> access the S3 folders and files with their s3n:// paths. Something like this:
>
>
>
> from pyspark import SparkContext
>
> sc = SparkContext()
> sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", aws_access_key)
> sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", aws_secret_key)
>
> # saveAsTextFile is an action that returns None, so there is nothing useful
> # to assign its result to.
> sc.textFile("s3n://mybucket/my_input_folder") \
>     .map(lambda x: do_something(x)) \
>     .saveAsTextFile("s3n://mybucket/my_output_folder")
>
> ...
>
>
>
> You can read and write sequence files as well. These are the only two
> formats I have tried, but I'm sure other formats such as JSON would work
> too. Another approach is to embed the AWS access key and secret key
> directly into the s3n:// path, roughly as in the sketch below.
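>
> This is only an illustration; the bucket name is made up, and note that a
> secret key containing "/" characters has to be URL-encoded before it can
> be embedded in the path:
>
> # Hypothetical bucket; credentials are embedded directly in the s3n:// URL.
> path = "s3n://%s:%s@mybucket/my_input_folder" % (aws_access_key, aws_secret_key)
> mydata = sc.textFile(path)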
>
>
>
> I wasn't able to use the s3:// protocol, but s3n:// is more or less
> equivalent (I believe it's an older version, though I'm not sure) and it
> works for access.
>
>
>
> Hope this helps,
>
> Sujit
>
>
>
>
>
> On Tue, Jul 14, 2015 at 10:50 AM, Pagliari, Roberto <
> rpagli...@appcomsci.com> wrote:
>
> Is there an example of how to load data from a public S3 bucket in
> Python? I haven't found any.
>
>
>
> Thank you,
>
>
>
>
>
