[ https://issues.apache.org/jira/browse/AIRFLOW-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15998681#comment-15998681 ]
Al Johri edited comment on AIRFLOW-247 at 5/7/17 11:42 PM:
-----------------------------------------------------------

I'm searching for documentation on how Airflow works with EMR. I'm struggling to find anything here: https://airflow.incubator.apache.org/integration.html#aws

My main question is: can Airflow create an EMR cluster and bring it back down, like AWS Data Pipeline?

Thanks!

EDIT: Found some information here:

Spark, EMR:
- (uses EMR hooks and operators) https://docs.google.com/presentation/d/1NG1P86HRlX43qTVucCTOsFqIbCvYdOhq_np90VlbVRc/edit#slide=id.gd40eeee67_1_0
- (uses shell scripts to launch and terminate EMR clusters) https://www.agari.com/automated-model-building-emr-spark-airflow/
- (uses a shell script to spark-submit on a local Spark installation) https://blog.insightdatascience.com/scheduling-spark-jobs-with-airflow-4c66f3144660
- (installs Spark on each Airflow worker node and runs local Spark jobs without spark-submit) https://medium.com/@calvertmg/airflow-integrating-with-apache-spark-50a7704dcebd
- (alternative Mozilla implementation for an EMR Spark job) https://github.com/mozilla/telemetry-airflow/blob/master/dags/operators/emr_spark_operator.py

EMR:
- https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/emr_hook.py
- https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/emr_create_job_flow_operator.py
- https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/emr_add_steps_operator.py
- https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/emr_terminate_job_flow_operator.py

Spark:
- https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/spark_submit_hook.py
- https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/spark_submit_operator.py

> EMR Hook, Operators, Sensor
> ---------------------------
>
>                 Key: AIRFLOW-247
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-247
>             Project: Apache Airflow
>          Issue Type: New Feature
>            Reporter: Rob Froetscher
>            Assignee: Rob Froetscher
>            Priority: Minor
>
> Substory of https://issues.apache.org/jira/browse/AIRFLOW-115. It would be nice to have an EMR hook and operators.
> Hook to generally interact with EMR.
> Operators to:
> * set up and start a job flow
> * add steps to an existing job flow
> A sensor to:
> * monitor completion and status of EMR jobs
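
For the create-and-tear-down question, here is a minimal sketch of the two payloads involved, assuming the dictionary shapes of the EMR RunJobFlow and AddJobFlowSteps APIs. As far as I can tell, the contrib operators linked above pass dictionaries like these through to EMR, but check the operator source before relying on that. The helper names (`build_job_flow_overrides`, `build_spark_step`), the release label, the instance types, the role names, and the S3 path are all placeholders chosen for illustration, not values taken from this issue.

```python
# Sketch only: plain dictionaries in the shape of the EMR RunJobFlow and
# AddJobFlowSteps request parameters. Every concrete value is a placeholder.


def build_job_flow_overrides(name, release_label="emr-5.5.0", instance_count=3):
    """Return a cluster definition for a transient cluster (hypothetical helper)."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": instance_count,
            # False asks EMR to shut the cluster down once no steps remain.
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def build_spark_step(name, script_uri):
    """Return one step that runs spark-submit on the cluster (hypothetical helper)."""
    return {
        "Name": name,
        # Tear the cluster down if the step fails, so it does not sit idle.
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
        },
    }


if __name__ == "__main__":
    overrides = build_job_flow_overrides("airflow-transient-cluster")
    step = build_spark_step("example-spark-job", "s3://example-bucket/job.py")
    print(overrides["Name"], step["HadoopJarStep"]["Args"])
```

In a DAG, the first dictionary would go to the create-job-flow task and the second to the add-steps task, followed by a sensor on the step and a terminate task, which matches the operator list in the issue description.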