I'm looking for some suggestions based on others' experiences.

I have a job that I need to run periodically which reads on the order of a 
million (or more) files from an S3 bucket.  It is not the entire bucket, nor 
do the files match a pattern.  Instead, I have a list of arbitrary keys that 
name the files I need in this bucket.  The bucket itself contains upwards of 
60M files.

My current approach has been to take my list of keys, parallelize it into an 
RDD partitioned on the key, and then map each key through a class that uses 
the most recent AWS SDK to fetch the corresponding object from S3 and return 
its contents.  In the end, I have an RDD<String>.  This works, but I wonder 
whether it is the best approach; I suspect there may be a better/faster way.
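Roughly, the approach looks like the sketch below (simplified; the bucket 
name, key list, and partition count are placeholders, and I've elided error 
handling).  This is spark-shell style, so sc is the shell's SparkContext:

    import com.amazonaws.services.s3.AmazonS3ClientBuilder
    import com.amazonaws.util.IOUtils

    val bucket = "my-bucket"                      // placeholder bucket name
    val keys: Seq[String] = Seq("key1", "key2")   // placeholder: my list of ~1M keys

    val contents = sc.parallelize(keys, 400)      // partition the key list
      .mapPartitions { part =>
        // one S3 client per partition rather than per record
        val s3 = AmazonS3ClientBuilder.defaultClient()
        part.map { key =>
          val obj = s3.getObject(bucket, key)
          try IOUtils.toString(obj.getObjectContent)  // whole object as a String
          finally obj.close()
        }
      }
    // contents is an RDD[String], one element per fetched S3 object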

One thing I've been considering is passing all of the keys (as s3n:// URLs) to 
sc.textFile, or to sc.wholeTextFiles since some of my files can have embedded 
newlines.  But I wonder how either of these would behave if I passed it 
literally a million (or more) 'filenames'.
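To frame the question, something like the following is what I have in mind 
(untested at anywhere near this scale; the path-building is just illustrative, 
using the same placeholder bucket and keys as above):

    // Both calls accept a comma-separated list of paths, so in principle:
    val paths = keys.map(k => s"s3n://$bucket/$k").mkString(",")

    // Splits each file on newlines (embedded newlines would break records):
    val lines = sc.textFile(paths)

    // Keeps each file whole, as (path, content) pairs:
    val files = sc.wholeTextFiles(paths)

My concern is mostly what happens to the driver and the underlying file 
listing when that comma-separated path string contains a million entries.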

Before I spend time exploring, I wanted to seek some input.

Any thoughts would be appreciated.

Darin.
