Florian Verhein created SPARK-5552:
--------------------------------------

             Summary: Automated data science AMIs creation and cluster deployment on EC2
                 Key: SPARK-5552
                 URL: https://issues.apache.org/jira/browse/SPARK-5552
             Project: Spark
          Issue Type: New Feature
          Components: EC2
            Reporter: Florian Verhein


Issue created following the discussion at 
https://github.com/mesos/spark-ec2/pull/90#issuecomment-72597154 (please read 
for background)

Goal:
Extend the spark-ec2 scripts to automate deployment of a data science cluster 
on EC2, suitable for almost(?)-production use.

Use cases: 
- A user can build their own custom data science AMIs from a CentOS minimal 
image by running a Packer configuration (good defaults should be provided, 
along with some options for flexibility).
- A user can then easily deploy a new, correctly configured cluster from 
these AMIs, as quickly as possible.
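
As a rough illustration of the first use case (not the configuration used in 
the branch linked in this issue), a Packer template for an EBS-backed AMI could 
be generated along these lines. The function name, the default region and 
instance type, the AMI name, the ssh user, and the yum commands are all 
assumptions made for the sketch; the source AMI ID is a placeholder that would 
have to be replaced with a real CentOS minimal image.

```python
import json


def build_packer_template(source_ami, region="us-east-1",
                          instance_type="m3.large",
                          ami_name="data-science-{{timestamp}}",
                          extra_rpms=()):
    """Return a Packer template (as a dict) for an EBS-backed AMI.

    Hypothetical helper: every default here is an illustrative assumption.
    """
    # Shell commands run on the temporary build instance.
    install_cmds = ["sudo yum -y update"]
    if extra_rpms:
        install_cmds.append("sudo yum -y install " + " ".join(extra_rpms))

    return {
        "builders": [{
            "type": "amazon-ebs",          # Packer's EBS-backed AMI builder
            "region": region,
            "source_ami": source_ami,      # placeholder, supplied by the caller
            "instance_type": instance_type,
            "ssh_username": "root",        # assumption; depends on the base image
            "ami_name": ami_name,          # {{timestamp}} is expanded by Packer
        }],
        "provisioners": [{
            "type": "shell",
            "inline": install_cmds,
        }],
    }


if __name__ == "__main__":
    template = build_packer_template("ami-xxxxxxxx",
                                     extra_rpms=["R", "python-pip"])
    print(json.dumps(template, indent=2))
```

Saved to a file such as template.json, the output could be checked with 
`packer validate template.json` and built with `packer build template.json`.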

Components/modules: Spark + Tachyon + HDFS (on instance storage) + Python + R + 
Vowpal Wabbit + any RPMs + ... + Ganglia

The focus is on reliability and speed of deployment, rather than on, e.g., 
supporting many versions or dev testing.
Use Hadoop 2, so that there is the option to move to YARN later.

My current solution is here: 
https://github.com/florianverhein/spark-ec2/tree/packer. It includes other 
fixes/improvements as needed to get it working.

Now that it seems to work (but has deviated a lot more from the existing code 
base than I was expecting), I'm wondering what to do with it...

Keen to hear ideas if anyone is interested. 



