I think any request going to an s3*:// path requires credentials. If the bucket has been made public (i.e. readable over plain HTTP), then you won't need the keys.
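For a public bucket, a minimal sketch would be to skip the credential setup entirely and point textFile at the data. Note the bucket name and path below are placeholders, not a real dataset:

from pyspark import SparkContext

sc = SparkContext()
# No fs.s3n access/secret key configuration is needed for a
# world-readable bucket; bucket and path are hypothetical placeholders.
lines = sc.textFile("s3n://some-public-bucket/path/to/data")
print(lines.take(5))  # quick sanity check on the first few records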
Thanks
Best Regards

On Wed, Jul 15, 2015 at 2:26 AM, Pagliari, Roberto <rpagli...@appcomsci.com> wrote:

> Hi Sujit,
>
> I just wanted to access public datasets on Amazon. Do I still need to
> provide the keys?
>
> Thank you,
>
> From: Sujit Pal [mailto:sujitatgt...@gmail.com]
> Sent: Tuesday, July 14, 2015 3:14 PM
> To: Pagliari, Roberto
> Cc: user@spark.apache.org
> Subject: Re: Spark on EMR with S3 example (Python)
>
> Hi Roberto,
>
> I have written PySpark code that reads from private S3 buckets; it should
> be similar for public S3 buckets as well. You need to set the AWS access
> and secret keys into the SparkContext, and then you can access the S3
> folders and files with their s3n:// paths. Something like this:
>
> sc = SparkContext()
> sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", aws_access_key)
> sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", aws_secret_key)
>
> sc.textFile("s3n://mybucket/my_input_folder") \
>     .map(lambda x: do_something(x)) \
>     .saveAsTextFile("s3n://mybucket/my_output_folder")
> ...
>
> You can read and write sequence files as well - these are the only two
> formats I have tried, but I'm sure others like JSON would work too.
> Another approach is to embed the AWS access key and secret key into the
> s3n:// path.
>
> I wasn't able to use the s3 protocol, but s3n is equivalent (I believe
> it's an older version, but I'm not sure) and it works for access.
>
> Hope this helps,
> Sujit
>
> On Tue, Jul 14, 2015 at 10:50 AM, Pagliari, Roberto <rpagli...@appcomsci.com> wrote:
>
> Is there an example about how to load data from a public S3 bucket in
> Python? I haven't found any.
>
> Thank you,
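For the embed-the-keys-in-the-path approach Sujit mentions above, a rough sketch follows. The bucket, folder, and key values are placeholders, and a secret key containing characters like "/" would generally need to be URL-encoded before it can be used in a URI:

from pyspark import SparkContext

sc = SparkContext()
# Credentials embedded directly in the s3n:// URI (placeholder values).
path = "s3n://MY_ACCESS_KEY:MY_SECRET_KEY@mybucket/my_input_folder"
mydata = sc.textFile(path)
print(mydata.count())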