I guess my big question would be why do you have so many files? Is there no possibility that you can merge a lot of those files together before processing them?
On Wed, Jan 13, 2016 at 11:59 AM Darin McBeath <ddmcbe...@yahoo.com> wrote:

> Thanks for the tip, as I had not seen this before. That's pretty much
> what I'm doing already. Was just thinking there might be a better way.
>
> Darin.
> ------------------------------
> *From:* Daniel Imberman <daniel.imber...@gmail.com>
> *To:* Darin McBeath <ddmcbe...@yahoo.com>; User <user@spark.apache.org>
> *Sent:* Wednesday, January 13, 2016 2:48 PM
> *Subject:* Re: Best practice for retrieving over 1 million files from S3
>
> Hi Darin,
>
> You should read this article. textFile is very inefficient on S3.
>
> http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219
>
> Cheers
>
> On Wed, Jan 13, 2016 at 11:43 AM Darin McBeath <ddmcbe...@yahoo.com.invalid>
> wrote:
>
> I'm looking for some suggestions based on others' experiences.
>
> I currently have a job that I need to run periodically where I need to
> read on the order of 1+ million files from an S3 bucket. It is not the
> entire bucket (nor does it match a pattern). Instead, I have a list of
> random keys that are 'names' for the files in this S3 bucket. The bucket
> itself will contain upwards of 60M or more files.
>
> My current approach has been to get my list of keys, partition on the key,
> and then map this to an underlying class that uses the most recent AWS SDK
> to retrieve the file from S3 using this key, which then returns the file.
> So, in the end, I have an RDD<String>. This works, but I really wonder if
> this is the best way. I suspect there might be a better/faster way.
>
> One thing I've been considering is passing all of the keys (using s3n:
> urls) to sc.textFile or sc.wholeTextFiles (since some of my files can have
> embedded newlines). But, I wonder how either of these would behave if I
> passed literally a million (or more) 'filenames'.
>
> Before I spend time exploring, I wanted to seek some input.
>
> Any thoughts would be appreciated.
>
> Darin.
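For reference, the "distribute the key list, fetch each object via the AWS SDK" approach Darin describes (which the linked Kinja article also recommends over sc.textFile) can be sketched roughly like this. This is a minimal Scala sketch, not Darin's actual code: the bucket name, partition count, and loadKeyList helper are hypothetical, and it assumes the AWS SDK for Java (AmazonS3ClientBuilder, available in later 1.x versions) plus commons-io on the classpath. Using mapPartitions lets each partition create and reuse a single S3 client rather than one per key:

```scala
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import org.apache.commons.io.IOUtils
import org.apache.spark.{SparkConf, SparkContext}

object S3KeyListFetch {
  // Hypothetical helper: however you obtain your 1M+ key list.
  def loadKeyList(): Seq[String] = Seq.empty

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("s3-key-list-fetch"))

    val bucket = "my-bucket"          // hypothetical bucket name
    val keys   = loadKeyList()

    // Distribute the keys across many partitions, then fetch inside
    // mapPartitions so each partition reuses one S3 client.
    val contents = sc.parallelize(keys, numSlices = 1000)
      .mapPartitions { keyIter =>
        val s3 = AmazonS3ClientBuilder.defaultClient()
        keyIter.map { key =>
          val obj = s3.getObject(bucket, key)
          try IOUtils.toString(obj.getObjectContent, "UTF-8")
          finally obj.close()
        }
      }
    // contents is an RDD[String], one element per S3 object,
    // so embedded newlines in a file are preserved (unlike sc.textFile).
  }
}
```

The partition count is the main tuning knob here: with ~1M keys and 1000 slices, each task fetches ~1000 objects sequentially, which amortizes client setup while keeping tasks small enough to retry cheaply on failure.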