I'm looking for some suggestions based on others' experiences. I currently have a job that I need to run periodically, in which I need to read on the order of 1+ million files from an S3 bucket. It is not the entire bucket (nor do the keys match a pattern); instead, I have a list of arbitrary keys that are the 'names' of the files in this S3 bucket. The bucket itself will contain upwards of 60M files.
My current approach has been to take my list of keys, partition on the key, and then map each key through an underlying class that uses the most recent AWS SDK to retrieve the corresponding file from S3 and return its contents. So, in the end, I have an RDD<String>.

This works, but I really wonder whether it is the best way; I suspect there might be a better/faster one. One thing I've been considering is passing all of the keys (as s3n: URLs) to sc.textFile or sc.wholeTextFiles (since some of my files can have embedded newlines). But I wonder how either of these would behave if I passed literally a million (or more) 'filenames'.

Before I spend time exploring, I wanted to seek some input. Any thoughts would be appreciated.

Darin.
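P.S. For concreteness, the current approach looks roughly like the following (a simplified sketch only, assuming the v1 Java SDK; the bucket name, keyList, and numPartitions are placeholders, not my real values):

    import scala.io.Source
    import com.amazonaws.services.s3.AmazonS3ClientBuilder

    // keyList: the ~1M object keys gathered up front (placeholder name)
    val keys = sc.parallelize(keyList, numPartitions)

    val contents = keys.mapPartitions { iter =>
      // One S3 client per partition, rather than per key,
      // so connections aren't re-created for every object.
      val s3 = AmazonS3ClientBuilder.defaultClient()
      iter.map { key =>
        val obj = s3.getObject("my-bucket", key)   // bucket name is a placeholder
        try Source.fromInputStream(obj.getObjectContent).mkString
        finally obj.close()
      }
    }
    // contents: RDD[String], one element per S3 object

The alternative I mentioned would be building one big comma-separated list of s3n: paths and handing it to sc.textFile or sc.wholeTextFiles (both accept comma-separated path lists, as far as I know); what I don't know is how they cope with a million entries.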