Use s3a://, especially on Hadoop 2.7+. It uses the Amazon AWS SDK and is faster for directory lookups than jets3t.
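A minimal sketch of what that looks like from Spark (Scala), assuming the hadoop-aws and matching aws-java-sdk jars for your Hadoop 2.7.x build are on the classpath; the bucket and key below are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("s3a-read"))

    // Credentials can also come from environment variables or IAM instance roles;
    // setting them explicitly on the Hadoop configuration is shown only for completeness.
    sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

    // Same calls as before, just the s3a:// scheme instead of s3n://.
    val lines = sc.textFile("s3a://my-bucket/path/to/key.txt")
    println(lines.count())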
> On 13 Jan 2016, at 11:42, Darin McBeath <ddmcbe...@yahoo.com.INVALID> wrote:
>
> I'm looking for some suggestions based on others' experiences.
>
> I currently have a job that I need to run periodically where I need to read
> on the order of 1+ million files from an S3 bucket. It is not the entire
> bucket (nor does it match a pattern). Instead, I have a list of random keys
> that are 'names' for the files in this S3 bucket. The bucket itself will
> contain upwards of 60M or more files.
>
> My current approach has been to get my list of keys, partition on the key,
> and then map this to an underlying class that uses the most recent AWS SDK
> to retrieve the file from S3 using this key, which then returns the file.
> So, in the end, I have an RDD<String>. This works, but I really wonder if
> this is the best way. I suspect there might be a better/faster way.
>
> One thing I've been considering is passing all of the keys (using s3n: urls)
> to sc.textFile or sc.wholeTextFiles (since some of my files can have embedded
> newlines). But, I wonder how either of these would behave if I passed
> literally a million (or more) 'filenames'.
>
> Before I spend time exploring, I wanted to seek some input.
>
> Any thoughts would be appreciated.
>
> Darin.
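For reference, a rough sketch of the approach described above (an RDD of keys, fetched per partition with the AWS Java SDK). The bucket name and partition count are placeholders, loadKeyList() is a hypothetical stand-in for however the key list is obtained, and error handling is omitted:

    import scala.io.Source
    import com.amazonaws.services.s3.AmazonS3Client

    val bucket = "my-bucket"
    val keys: Seq[String] = loadKeyList()   // hypothetical helper returning the ~1M keys

    val contents = sc.parallelize(keys, numSlices = 1000).mapPartitions { iter =>
      // One client per partition rather than one per key; uses the default credentials chain.
      val s3 = new AmazonS3Client()
      iter.map { key =>
        val obj = s3.getObject(bucket, key)
        try Source.fromInputStream(obj.getObjectContent, "UTF-8").mkString
        finally obj.close()
      }
    }

For the sc.textFile / sc.wholeTextFiles route, both accept a comma-separated list of paths in a single call, so the keys could be joined that way, e.g. sc.wholeTextFiles(keys.map(k => s"s3a://$bucket/$k").mkString(",")).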