On 28 Apr 2017, at 16:10, Anubhav Agarwal wrote:
Are you using Spark's textFiles method? If so, go through this blog :-
http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219
That's an old/dated blog post.
If you get the Hadoop 2.8 binaries on your classpath, s3a does a full directory
tree listing if you give it a simple path like "s3a://bucket/events". The
example in that post used a complex wildcard, which hasn't yet been sped up,
as it's pretty hard to do in a way which works effectively everywhere.
Having all your data in one directory works nicely.
Anubhav
On Mon, Apr 24, 2017 at 12:48 PM, Afshin, Bardia wrote:
Hi there,
I have a process that downloads thousands of files from an S3 bucket, removes a
set of columns from each, and uploads the results back to S3.
S3 is currently not the bottleneck; having a Single Master Node Spark instance
is the bottleneck. One approach is to distribute the files across multiple Spark
Master Node workers, which would make it faster.
yes, with > 1 worker, and provided the work can be partitioned
Question:
1. Is there a way to utilize master / slave node on Spark to distribute
this downloading and processing of files – so it can say do 10 files at a time?
yes, they are called RDDs, DataFrames & Datasets.
If you are doing all the processing on the Spark driver, then you aren't really
using Spark much; you're just processing the files in Scala.
To get a DataFrame (assuming `spark` is your SparkSession instance):
val df = spark.read.format("csv").load("s3a://bucket/data")
You now have a dataset over all the files in the directory /data in the bucket,
partitioned however Spark decides (which depends on the number of workers, the
compression format used, and its splittability). Assuming you can configure the
DataFrame with the column structure, you can filter aggressively by selecting
only those columns you want:
val filteredDf = df.select("rental", "start_time")
filteredDf.write.save("hdfs://final/processed")
then, once you've got all the data done, copy them up to S3 via distcp
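The distcp step is just a Hadoop command; a minimal sketch, reusing the
hypothetical HDFS path from above (the destination bucket/path is a
placeholder for wherever your processed data should land):

```shell
# Copy the processed output from HDFS up to S3, parallelised as a
# MapReduce job across the cluster rather than through a single node.
hadoop distcp hdfs://final/processed s3a://bucket/processed
```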
I'd recommend you start doing this with a small number of files locally,
getting the code working, then see if you can use it with s3 as the source/dest
of data, again, locally if you want (it's just slow), then move to in-EC2 for
the bandwidth.
Bandwidth-wise, there are some pretty major performance issues with the s3n
connector. S3a in Hadoop 2.7+ works, with Hadoop 2.8 having a lot more
speedups, especially when using ORC and Parquet as a source, where there's a
special "random access mode".
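That random access mode is switched on via an s3a option; a minimal config
sketch (the property is the Hadoop 2.8 s3a fadvise setting, and choosing
"random" here is an assumption that your reads are the seek-heavy ORC/Parquet
kind rather than full sequential scans):

```
<!-- core-site.xml: switch s3a input streams to random I/O, which suits
     the backwards-seeking read patterns of ORC and Parquet readers -->
<property>
  <name>fs.s3a.experimental.input.fadvise</name>
  <value>random</value>
</property>
```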
further reading
https://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.14.1/bk_hdcloud-aws/content/s3-spark/index.html
https://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.14.1/bk_hdcloud-aws/content/s3-performance/index.html
2. Is there a way to scale workers with Spark downloading and processing
files, even if they are all Single Master Node?
I think there may be some terminology confusion here. You are going to have
one process which is the Spark driver: either on your client machine,
deployed somewhere in the cluster via YARN/Mesos, or running at a static
location within a Spark standalone cluster. Every process other than the
driver is a worker, and the workers do the actual work.
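As a sketch of the cluster-deploy variant (the class name, jar, executor count
and paths are all hypothetical placeholders for your own job):

```shell
# Run the driver inside the YARN cluster; the executors (workers) are
# requested from YARN and do the file processing in parallel.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --class com.example.StripColumns \
  strip-columns.jar s3a://bucket/data s3a://bucket/processed
```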