I guess my big question would be why do you have so many files? Is there no possibility that you can merge a lot of those files together before processing them?
On Wed, Jan 13, 2016 at 11:59 AM Darin McBeath <ddmcbe...@yahoo.com> wrote:

> Thanks for the tip, as I had not seen this before. That's pretty much
> what I'm doing already. Was just thinking there might be a better way.
>
> Darin.
> ------------------------------
> *From:* Daniel Imberman <daniel.imber...@gmail.com>
> *To:* Darin McBeath <ddmcbe...@yahoo.com>; User <user@spark.apache.org>
> *Sent:* Wednesday, January 13, 2016 2:48 PM
> *Subject:* Re: Best practice for retrieving over 1 million files from S3
>
> Hi Darin,
>
> You should read this article. textFile is very inefficient on S3.
>
> http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219
>
> Cheers
>
> On Wed, Jan 13, 2016 at 11:43 AM Darin McBeath <ddmcbe...@yahoo.com.invalid>
> wrote:
>
> I'm looking for some suggestions based on others' experiences.
>
> I currently have a job that I need to run periodically where I need to
> read on the order of 1+ million files from an S3 bucket. It is not the
> entire bucket (nor does it match a pattern). Instead, I have a list of
> random keys that are 'names' for the files in this S3 bucket. The bucket
> itself will contain upwards of 60M or more files.
>
> My current approach has been to get my list of keys, partition on the key,
> and then map this to an underlying class that uses the most recent AWS SDK
> to retrieve the file from S3 using this key, which then returns the file.
> So, in the end, I have an RDD<String>. This works, but I really wonder if
> this is the best way. I suspect there might be a better/faster way.
>
> One thing I've been considering is passing all of the keys (using s3n:
> urls) to sc.textFile or sc.wholeTextFiles (since some of my files can have
> embedded newlines). But, I wonder how either of these would behave if I
> passed literally a million (or more) 'filenames'.
>
> Before I spend time exploring, I wanted to seek some input.
>
> Any thoughts would be appreciated.
>
> Darin.
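For reference, the "distribute the key list, fetch each object via the AWS SDK" approach Darin describes (which the linked Kinja article also recommends over sc.textFile) can be sketched roughly like this. This is a minimal Scala sketch, not Darin's actual code: the bucket name, partition count, and loadKeyList helper are hypothetical, and it assumes the AWS SDK for Java (AmazonS3ClientBuilder, available in later 1.x versions) plus commons-io on the classpath. Using mapPartitions lets each partition create and reuse a single S3 client rather than one per key:

```scala
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import org.apache.commons.io.IOUtils
import org.apache.spark.{SparkConf, SparkContext}

object S3KeyListFetch {
  // Hypothetical helper: however you obtain your 1M+ key list.
  def loadKeyList(): Seq[String] = Seq.empty

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("s3-key-list-fetch"))

    val bucket = "my-bucket"          // hypothetical bucket name
    val keys   = loadKeyList()

    // Distribute the keys across many partitions, then fetch inside
    // mapPartitions so each partition reuses one S3 client.
    val contents = sc.parallelize(keys, numSlices = 1000)
      .mapPartitions { keyIter =>
        val s3 = AmazonS3ClientBuilder.defaultClient()
        keyIter.map { key =>
          val obj = s3.getObject(bucket, key)
          try IOUtils.toString(obj.getObjectContent, "UTF-8")
          finally obj.close()
        }
      }
    // contents is an RDD[String], one element per S3 object,
    // so embedded newlines in a file are preserved (unlike sc.textFile).
  }
}
```

The partition count is the main tuning knob here: with ~1M keys and 1000 slices, each task fetches ~1000 objects sequentially, which amortizes client setup while keeping tasks small enough to retry cheaply on failure.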