Are you using Spark's textFile method to read straight from S3? If so, go through this blog post on why pulling a large number of small files from S3 that way is slow, and what to do instead: http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219
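Roughly, the pattern that post advocates is: list the keys once on the driver, parallelize the key list, and let each task fetch, transform, and re-upload its own files with boto3 instead of funneling everything through the master. A minimal PySpark sketch of that idea; the bucket names, the DROP_COLS indices, and the partition count are placeholders, and it assumes your executors have S3 credentials (IAM role or environment):

import csv
import io

import boto3
from pyspark import SparkContext

SRC_BUCKET = "source-bucket"   # placeholder
DST_BUCKET = "dest-bucket"     # placeholder
DROP_COLS = {2, 5}             # placeholder: indices of the columns to remove

def process_partition(keys):
    # One boto3 client per partition (not per file) to amortize connection setup.
    s3 = boto3.client("s3")
    for key in keys:
        raw = s3.get_object(Bucket=SRC_BUCKET, Key=key)["Body"].read().decode("utf-8")
        out = io.StringIO()
        writer = csv.writer(out)
        for row in csv.reader(io.StringIO(raw)):
            writer.writerow([v for i, v in enumerate(row) if i not in DROP_COLS])
        s3.put_object(Bucket=DST_BUCKET, Key=key, Body=out.getvalue().encode("utf-8"))

sc = SparkContext(appName="s3-column-strip")

# List the keys once on the driver; only key names, not file contents, pass through here.
s3 = boto3.client("s3")
keys = [obj["Key"]
        for page in s3.get_paginator("list_objects_v2").paginate(Bucket=SRC_BUCKET)
        for obj in page.get("Contents", [])]

# Each partition becomes one task, so concurrency is min(numSlices, executor cores);
# with 10 partitions and at least 10 cores, roughly 10 files are in flight at a time.
sc.parallelize(keys, numSlices=10).foreachPartition(process_partition)

Keeping the downloads and uploads inside the tasks means the master only coordinates, so the single master node stops being the data-path bottleneck.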
Anubhav

On Mon, Apr 24, 2017 at 12:48 PM, Afshin, Bardia <bardia.afs...@capitalone.com> wrote:

> Hi there,
>
> I have a process that downloads thousands of files from an S3 bucket,
> removes a set of columns from each, and uploads the results back to S3.
>
> S3 is currently not the bottleneck; the single-master-node Spark instance
> is. One approach is to distribute the files across multiple Spark workers,
> which would make it faster.
>
> Questions:
>
> 1. Is there a way to use Spark's master/slave nodes to distribute the
> downloading and processing of files, so it can, say, do 10 files at a time?
>
> 2. Is there a way to scale workers with Spark downloading and processing
> files, even if they all share a single master node?
>
> Thanks,
> Bardia
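On your second question: since the master only schedules tasks, you can scale throughput by adding workers or cores without touching it. On a standalone cluster, something like the following caps how many tasks run at once; the master URL and script name here are placeholders:

spark-submit \
  --master spark://master-host:7077 \
  --total-executor-cores 10 \
  --executor-memory 4G \
  strip_columns.py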