Florian Verhein created SPARK-5552:
--------------------------------------

             Summary: Automated data science AMI creation and cluster deployment on EC2
                 Key: SPARK-5552
                 URL: https://issues.apache.org/jira/browse/SPARK-5552
             Project: Spark
          Issue Type: New Feature
          Components: EC2
            Reporter: Florian Verhein
Issue created RE: https://github.com/mesos/spark-ec2/pull/90#issuecomment-72597154 (please read for background)

Goal: extend the spark-ec2 scripts to support automated data science AMI creation and cluster deployment on EC2, suitable for almost(?)-production use.

Use cases:
- A user can build their own custom data science AMIs from a CentOS minimal image by running a Packer configuration (good defaults should be provided, with some options for flexibility).
- A user can then easily deploy a new, correctly configured cluster using these AMIs, and do so as quickly as possible.

Components/modules: Spark + Tachyon + HDFS (on instance storage) + Python + R + Vowpal Wabbit + any RPMs + ... + Ganglia.

The focus is on reliability (rather than, e.g., supporting many versions / dev testing) and speed of deployment. Use Hadoop 2 so there is the option to lift onto YARN later.

My current solution is here: https://github.com/florianverhein/spark-ec2/tree/packer. It includes other fixes/improvements as needed to get it working. Now that it seems to work (though it has deviated a lot more from the existing code base than I was expecting), I'm wondering what to do with it... Keen to hear ideas if anyone is interested.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
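For illustration, the AMI-build use case above could be sketched as a Packer template using Packer's standard `amazon-ebs` builder with shell provisioners. This is a minimal hypothetical sketch, not the actual configuration from the linked branch: the script names, region, instance type, and the `ami-xxxxxxxx` placeholder (which would be a CentOS minimal base image) are all assumptions.

```json
{
  "builders": [{
    "type": "amazon-ebs",
    "region": "us-east-1",
    "source_ami": "ami-xxxxxxxx",
    "instance_type": "m3.large",
    "ssh_username": "root",
    "ami_name": "spark-datascience-{{timestamp}}"
  }],
  "provisioners": [{
    "type": "shell",
    "scripts": [
      "setup_base_rpms.sh",
      "install_spark_hdfs_tachyon.sh",
      "install_python_r_vw.sh"
    ]
  }]
}
```

Baking the components into the AMI at build time (rather than installing them at launch, as the stock spark-ec2 AMIs do) is what makes the fast, reliable cluster deployment in the second use case possible.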