Use s3a://, especially on Hadoop 2.7+. It uses the Amazon AWS SDK and is faster for directory lookups than jets3t.
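A minimal sketch of what that looks like from Spark (Scala), assuming the hadoop-aws and matching aws-java-sdk jars for your Hadoop 2.7.x build are on the classpath; the bucket and key below are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("s3a-read"))

    // Credentials can also come from environment variables or IAM instance roles;
    // setting them explicitly on the Hadoop configuration is shown only for completeness.
    sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

    // Same calls as before, just the s3a:// scheme instead of s3n://.
    val lines = sc.textFile("s3a://my-bucket/path/to/key.txt")
    println(lines.count())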
> On 13 Jan 2016, at 11:42, Darin McBeath <ddmcbe...@yahoo.com.INVALID> wrote:
>
> I'm looking for some suggestions based on others' experiences.
>
> I currently have a job that I need to run periodically where I need to read
> on the order of 1+ million files from an S3 bucket. It is not the entire
> bucket (nor does it match a pattern). Instead, I have a list of random keys
> that are 'names' for the files in this S3 bucket. The bucket itself will
> contain upwards of 60M or more files.
>
> My current approach has been to get my list of keys, partition on the key,
> and then map this to an underlying class that uses the most recent AWS SDK
> to retrieve the file from S3 using this key, which then returns the file.
> So, in the end, I have an RDD<String>. This works, but I really wonder if
> this is the best way. I suspect there might be a better/faster way.
>
> One thing I've been considering is passing all of the keys (using s3n: urls)
> to sc.textFile or sc.wholeTextFiles (since some of my files can have embedded
> newlines). But, I wonder how either of these would behave if I passed
> literally a million (or more) 'filenames'.
>
> Before I spend time exploring, I wanted to seek some input.
>
> Any thoughts would be appreciated.
>
> Darin.
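For reference, a rough sketch of the approach described above (an RDD of keys, fetched per partition with the AWS Java SDK). The bucket name and partition count are placeholders, loadKeyList() is a hypothetical stand-in for however the key list is obtained, and error handling is omitted:

    import scala.io.Source
    import com.amazonaws.services.s3.AmazonS3Client

    val bucket = "my-bucket"
    val keys: Seq[String] = loadKeyList()   // hypothetical helper returning the ~1M keys

    val contents = sc.parallelize(keys, numSlices = 1000).mapPartitions { iter =>
      // One client per partition rather than one per key; uses the default credentials chain.
      val s3 = new AmazonS3Client()
      iter.map { key =>
        val obj = s3.getObject(bucket, key)
        try Source.fromInputStream(obj.getObjectContent, "UTF-8").mkString
        finally obj.close()
      }
    }

For the sc.textFile / sc.wholeTextFiles route, both accept a comma-separated list of paths in a single call, so the keys could be joined that way, e.g. sc.wholeTextFiles(keys.map(k => s"s3a://$bucket/$k").mkString(",")).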