Re: Best practice for retrieving over 1 million files from S3

2016-01-13 Thread Steve Loughran
use s3a://, especially on Hadoop 2.7+. It uses the Amazon libs and is faster for directory lookups than jets3t. > On 13 Jan 2016, at 11:42, Darin McBeath wrote: > > I'm looking for some suggestions based on others' experiences. > > I currently have a job that I
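Steve's advice amounts to switching the URL scheme on the paths handed to Spark. A minimal sketch of that, assuming paths currently use the older s3n:// or s3:// schemes (the bucket and key names in the usage comment are hypothetical):

```python
# Prefer the s3a:// connector (available on Hadoop 2.7+) over s3n://,
# which is backed by the older jets3t library and is slower at
# directory listings.

def to_s3a(path: str) -> str:
    """Rewrite an s3:// or s3n:// URL to use the s3a:// scheme."""
    for scheme in ("s3n://", "s3://"):
        if path.startswith(scheme):
            return "s3a://" + path[len(scheme):]
    return path  # already s3a:// (or not an S3 path at all)

# Usage sketch (assumes a SparkContext `sc`; names are illustrative):
# rdd = sc.textFile(to_s3a("s3n://my-bucket/logs/part-00000"))
```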

Re: Best practice for retrieving over 1 million files from S3

2016-01-13 Thread Daniel Imberman
Hi Darin, You should read this article. TextFile is very inefficient in S3. http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219 Cheers On Wed, Jan 13, 2016 at 11:43 AM Darin McBeath wrote: > I'm looking for some suggestions based on

Best practice for retrieving over 1 million files from S3

2016-01-13 Thread Darin McBeath
I'm looking for some suggestions based on others' experiences. I currently have a job that I need to run periodically, in which I need to read on the order of 1+ million files from an S3 bucket. It is not the entire bucket (nor does it match a pattern). Instead, I have a list of random keys that