Use s3a://, especially on Hadoop 2.7+. It uses the Amazon libraries and is faster
for directory lookups than jets3t.
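A minimal sketch of what switching to s3a looks like in a Spark job (assuming Hadoop 2.7+ with the hadoop-aws module and the AWS SDK on the classpath; the bucket name and path are placeholders, and credentials can also come from instance profiles instead of environment variables):

```scala
// Sketch only: requires a Spark deployment built against hadoop-2.7+ with hadoop-aws.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("s3a-read")
val sc = new SparkContext(conf)

// Credentials for the s3a filesystem (fs.s3a.* are the standard hadoop-aws properties).
sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

// Same path as before, s3a:// scheme instead of s3n:// / s3://.
val lines = sc.textFile("s3a://my-bucket/some/prefix/*.txt")
```

This only changes the filesystem client; the listing cost of textFile over many keys (discussed below in the thread) is a separate issue.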
> On 13 Jan 2016, at 11:42, Darin McBeath wrote:
>
> I'm looking for some suggestions based on others' experiences.
>
> I currently have a job that I need to run periodically where I need to read on
> the order of 1+ million files from an S3 bucket.
> *Sent:* Wednesday, January 13, 2016 2:48 PM
> *Subject:* Re: Best practice for retrieving over 1 million files from S3
>
> Hi Darin,
>
> You should read this article. textFile is very inefficient on S3.
>
> http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219
Hi Darin,
You should read this article. textFile is very inefficient on S3.
http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219
Cheers
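The pattern that article recommends for this situation: since the job already has an explicit list of keys, skip textFile's S3 listing entirely, parallelize the key list, and fetch each object directly with the AWS SDK inside mapPartitions. A hedged sketch (assumes `sc` is a live SparkContext, the AWS SDK for Java is on the executor classpath, and `loadKeyList`, the bucket name, and the partition count are placeholders to adapt):

```scala
// Sketch of fetching a known key list directly, bypassing textFile's S3 listing.
import com.amazonaws.services.s3.AmazonS3Client
import scala.io.Source

val keys: Seq[String] = loadKeyList()              // hypothetical helper: the 1M+ keys
val keyRdd = sc.parallelize(keys, numSlices = 512) // tune slices to cluster size

val contents = keyRdd.mapPartitions { part =>
  // One S3 client per partition, not one per key.
  val s3 = new AmazonS3Client()
  part.map { key =>
    val obj = s3.getObject("my-bucket", key)
    try Source.fromInputStream(obj.getObjectContent).mkString
    finally obj.close()
  }
}
```

The win is that S3 GETs for the million keys run in parallel across executors, and no per-directory listing calls are made at all.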
On Wed, Jan 13, 2016 at 11:43 AM Darin McBeath wrote:
I'm looking for some suggestions based on others' experiences.
I currently have a job that I need to run periodically where I need to read on
the order of 1+ million files from an S3 bucket. It is not the entire bucket
(nor does it match a pattern). Instead, I have a list of random keys that