I have set up a small development cluster using t2.micro machines and an
Amazon Linux AMI (CentOS 6.x based).
The whole setup was done manually, without using the provided scripts,
and consists of 5 instances in total: the first machine has an Elastic
IP and is used as a bridge to access the other 4 machines (which have no
Elastic IPs). The second machine runs a standalone, single-node Spark
cluster (1 master, 1 worker). The other 3 machines are configured as an
Apache Cassandra cluster. I have tuned the JVM and lots of other
parameters. I use neither S3 nor HDFS: I simply write data through Spark
Streaming (from an Apache Flume sink) to the 3 Cassandra nodes, which I
use for data retrieval. The data is then processed through regular Spark
jobs.
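
The streaming part looks more or less like the minimal sketch below (not
my actual job; I'm assuming the DataStax spark-cassandra-connector, and
the connection host, port, keyspace, table, and column names are all
placeholders invented for illustration):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.flume.FlumeUtils
    import com.datastax.spark.connector.SomeColumns
    import com.datastax.spark.connector.streaming._

    object FlumeToCassandra {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("FlumeToCassandra")
          // Placeholder address for one of the Cassandra nodes.
          .set("spark.cassandra.connection.host", "10.0.0.11")
        val ssc = new StreamingContext(conf, Seconds(10))

        // Receive events pushed by the Flume Avro sink (placeholder host/port).
        val events = FlumeUtils.createStream(ssc, "0.0.0.0", 4545)

        // Decode each event body and persist it; the schema is illustrative only.
        events
          .map(e => (System.currentTimeMillis, new String(e.event.getBody.array)))
          .saveToCassandra("my_ks", "raw_events", SomeColumns("ts", "body"))

        ssc.start()
        ssc.awaitTermination()
      }
    }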
Jobs are submitted to the cluster at scheduled intervals through
LinkedIn's Azkaban, which runs custom shell scripts I wrote to wrap the
submission process and handle any command-line arguments.
Results are written either directly to other Cassandra tables or to a
specific folder on the filesystem, in plain CSV format.
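
The batch side is similar; again a minimal sketch (with the same made-up
keyspace, tables, and output path) that reads the raw events back,
aggregates them, and writes both to Cassandra and to CSV:

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    object EventCounts {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("EventCounts"))

        // Read the raw events back from Cassandra (same illustrative schema).
        val counts = sc.cassandraTable[(Long, String)]("my_ks", "raw_events")
          .select("ts", "body")
          .map { case (_, body) => (body, 1L) }
          .reduceByKey(_ + _)

        // Write the aggregates to another Cassandra table...
        counts.saveToCassandra("my_ks", "event_counts", SomeColumns("body", "count"))

        // ...and dump a plain CSV copy to a folder on the local filesystem.
        counts.map { case (body, n) => s"$body,$n" }
          .saveAsTextFile("file:///data/results/event_counts")

        sc.stop()
      }
    }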
The system is completely autonomous and requires little to no manual
administration.
I'm quite satisfied with it, considering how small and limited the
machines involved are, but it required a lot of tuning work because we
are clearly below the recommended requirements. 4 of the 5 machines are
switched off during the night; only the bridge machine stays up 24/7.
The total cost is $12 per month.
Renato Perini.
On 28/04/2016 23:39, Fatma Ozcan wrote:
What is your experience using Spark on AWS? Are you setting up your
own Spark cluster, and using HDFS? Or are you using Spark as a service
from AWS? In the latter case, what is your experience of using S3
directly, without having HDFS in between?
Thanks,
Fatma