I have set up a small development cluster using t2.micro machines and an Amazon Linux AMI (CentOS 6.x). The whole setup was done manually, without using the provided scripts, and consists of 5 instances in total: the first machine has an Elastic IP and is used as a bridge to access the other 4 (which have no Elastic IPs). The second machine runs a standalone single-node Spark cluster (1 master, 1 worker). The remaining 3 machines are configured as an Apache Cassandra cluster. I have tuned the JVM and many other parameters.

I use neither S3 nor HDFS: I write data with Spark Streaming (from an Apache Flume sink) to the 3 Cassandra nodes, which I then use for data retrieval. The data is processed through regular Spark jobs, submitted to the cluster at scheduled intervals by LinkedIn Azkaban, which executes custom shell scripts I wrote to wrap the submission process and handle any command-line arguments. Results are written either to other Cassandra tables or to a specific folder on the filesystem in plain CSV format. The system is completely autonomous and requires little to no manual administration.
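For reference, the ingest path looks roughly like the sketch below. This is a minimal Scala example, assuming Spark 1.x with the spark-streaming-flume module and the DataStax spark-cassandra-connector on the classpath; the IP addresses, port, keyspace, and table names are hypothetical placeholders, not my actual configuration.

import java.util.UUID

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._

object FlumeToCassandra {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("FlumeToCassandra")
      // Hypothetical private IP of one of the Cassandra nodes
      .set("spark.cassandra.connection.host", "10.0.0.21")

    val ssc = new StreamingContext(conf, Seconds(10))

    // Pull events from a Flume sink exposed on a hypothetical host/port
    val events = FlumeUtils.createPollingStream(ssc, "10.0.0.20", 9988)

    events
      .map { e =>
        // Copy the Avro event body out of its ByteBuffer and decode it
        val buf = e.event.getBody
        val bytes = new Array[Byte](buf.remaining())
        buf.get(bytes)
        (UUID.randomUUID(), new String(bytes, "UTF-8"))
      }
      // Hypothetical schema: raw_events(id uuid PRIMARY KEY, body text)
      .saveToCassandra("demo_ks", "raw_events", SomeColumns("id", "body"))

    ssc.start()
    ssc.awaitTermination()
  }
}

A jar built from something like this is what the Azkaban-driven shell wrappers hand to spark-submit.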

I'm quite satisfied with it, considering how small and limited the machines involved are. But it required a lot of tuning work, because we are clearly below the recommended requirements. 4 of the 5 machines are switched off during the night; only the bridge machine stays up 24/7.

$12 per month in total.

Renato Perini.


On 28/04/2016 at 23:39, Fatma Ozcan wrote:
What is your experience using Spark on AWS? Are you setting up your own Spark cluster, and using HDFS? Or are you using Spark as a service from AWS? In the latter case, what is your experience of using S3 directly, without having HDFS in between?

Thanks,
Fatma

