Are you using Spark's textFile method? If so, have a read through this blog post first:
http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219
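
Roughly, the pattern that post recommends is to parallelize the list of S3 keys and let each worker fetch and rewrite its own files with boto, rather than pointing sc.textFile at a wildcard. A quick, untested sketch (PySpark with boto3; the bucket names, key listing, and column indices below are placeholders, and it assumes plain CSV input):

import csv
import io

import boto3
from pyspark import SparkContext

SRC_BUCKET = "my-source-bucket"   # placeholder
DST_BUCKET = "my-dest-bucket"     # placeholder
DROP_COLS = {2, 5}                # placeholder: column indices to remove

def process_key(key):
    # Runs on the executors, so each task opens its own boto3 client.
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=SRC_BUCKET, Key=key)["Body"].read().decode("utf-8")
    out = io.StringIO()
    writer = csv.writer(out)
    for row in csv.reader(io.StringIO(body)):
        writer.writerow([v for i, v in enumerate(row) if i not in DROP_COLS])
    s3.put_object(Bucket=DST_BUCKET, Key=key, Body=out.getvalue().encode("utf-8"))
    return key

sc = SparkContext(appName="s3-column-strip")

# List the keys once on the driver, then ship the list to the cluster.
s3 = boto3.client("s3")
keys = [obj["Key"]
        for page in s3.get_paginator("list_objects_v2").paginate(Bucket=SRC_BUCKET)
        for obj in page.get("Contents", [])]

# Roughly 10 keys per partition, so the workers chew through the list in parallel.
done = sc.parallelize(keys, numSlices=max(1, len(keys) // 10)).map(process_key).collect()

The point of doing it this way is that the download, column drop, and upload all happen on the workers, so adding executors scales the whole pipeline rather than bottlenecking on the driver.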

Anubhav

On Mon, Apr 24, 2017 at 12:48 PM, Afshin, Bardia <
bardia.afs...@capitalone.com> wrote:

> Hi there,
>
>
>
> I have a process that downloads thousands of files from an S3 bucket, removes
> a set of columns from each, and uploads the results back to S3.
>
>
>
> S3 is currently not the bottleneck; running Spark on a single master node
> is. One approach is to distribute the files across multiple Spark workers,
> which should make this faster.
>
>
>
> Question:
>
> 1. Is there a way to use Spark's master/worker setup to distribute this
> downloading and processing of files, so it can handle, say, 10 files at a
> time?
>
> 2. Is there a way to scale out workers for downloading and processing
> files with Spark, even when everything runs on a single master node?
>
>
>
> Thanks,
>
> Bardia
>
