Hey Daniel,

We also run Airflow on Docker and use EMR.
I wrote a PR <https://github.com/apache/incubator-airflow/pull/1630> that adds EMR support to Airflow. It has been merged but has not yet been released. The idea is that you keep your configuration as connections in the DB, then use the operators to interact with your cluster and the sensors to wait for any action to complete. There are two good example DAGs in the PR: https://github.com/apache/incubator-airflow/pull/1630/files

We are currently using this for several jobs. Happy to answer any questions you have about how we use it.

Best,
Rob

On Mon, Sep 12, 2016 at 2:10 PM, Daniel Siegmann <[email protected]> wrote:

> Does anyone have experience using Airflow to launch Spark jobs on an
> Amazon EMR cluster?
>
> I have an Airflow cluster - separate from my EMR cluster - built as
> Docker containers. I want to have Airflow submit jobs to an existing EMR
> cluster (though in the future I want to have Airflow start and stop
> clusters).
>
> I could copy the Hadoop configs from EMR to each of the Airflow nodes,
> but that's a pain. It'll be even more of a pain when I want to have
> Airflow create and destroy clusters, so I'd rather not take this
> approach.
>
> The only alternative I can think of is to use SSH to execute the
> spark-submit command on the EMR master node. This is simple enough,
> except Airflow will need the identity file for SSH access. Just copying
> the identity file to the Airflow nodes is problematic because it's in
> Docker and I don't want this file in my Git repo.
>
> Is there anyone with a similar setup who would care to share their
> solution?
>
> --
> Daniel Siegmann
> Senior Software Engineer
> *SecurityScorecard Inc.*
> 214 W 29th Street, 5th Floor
> New York, NY 10001
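P.S. In case it helps: the operators submit Spark work to EMR as "steps", so you avoid SSH and Hadoop configs entirely. Below is a minimal sketch of building the spark-submit step payload you'd hand to the PR's EmrAddStepsOperator (or to boto3's add_job_flow_steps directly) - the `command-runner.jar` wrapper is the standard EMR mechanism for running spark-submit on the cluster, but the script path and job name here are hypothetical placeholders, not values from the PR.

```python
# Sketch: build one EMR "step" dict that runs spark-submit on the cluster.
# This is the shape of the `steps` argument to EmrAddStepsOperator (and to
# boto3's add_job_flow_steps). Paths/names below are hypothetical.
def spark_submit_step(name, script_s3_path, extra_args=None):
    """Build an EMR step that invokes spark-submit via command-runner.jar."""
    args = ["spark-submit", "--deploy-mode", "cluster", script_s3_path]
    if extra_args:
        args.extend(extra_args)
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # don't tear down the cluster on failure
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # EMR's generic command wrapper
            "Args": args,
        },
    }

# In the DAG you would pass [spark_submit_step(...)] as `steps` to
# EmrAddStepsOperator, then chain an EmrStepSensor on the returned step id.
step = spark_submit_step("my_job", "s3://my-bucket/jobs/my_job.py")
print(step["HadoopJarStep"]["Args"])
```

With the identity file and Hadoop configs out of the picture, the only credential Airflow needs is an AWS connection in its DB, which is exactly what the PR's operators read.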
