Does anyone have experience using Airflow to launch Spark jobs on an Amazon EMR cluster?
I have an Airflow cluster, separate from my EMR cluster, built as Docker containers. I want Airflow to submit jobs to an existing EMR cluster (and eventually to start and stop clusters itself).

I could copy the Hadoop configs from EMR to each of the Airflow nodes, but that's a pain, and it will be even more of a pain once Airflow is creating and destroying clusters, so I'd rather not take this approach.

The only alternative I can think of is to use SSH to execute the spark-submit command on the EMR master node. This is simple enough, except Airflow will need the identity file to authenticate over SSH. Simply copying the identity file onto the Airflow nodes is problematic because they run in Docker and I don't want this file in my Git repo.

Is there anyone with a similar setup who would care to share their solution?
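For reference, here is a minimal sketch of the SSH approach I'm describing, assuming the key is stored in an Airflow SSH connection rather than baked into the image. The connection id, main class, and jar path below are placeholders, and on older Airflow releases the SSHOperator import path is airflow.contrib.operators.ssh_operator instead:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.ssh.operators.ssh import SSHOperator

    with DAG(
        dag_id="emr_spark_submit",
        start_date=datetime(2017, 1, 1),
        schedule_interval=None,  # trigger manually for now
        catchup=False,
    ) as dag:
        # spark-submit runs on the EMR master node, where the Hadoop/YARN
        # configs already live, so nothing needs to be copied to Airflow.
        submit_job = SSHOperator(
            task_id="spark_submit_on_emr_master",
            ssh_conn_id="emr_master_ssh",  # placeholder connection id
            command=(
                "spark-submit --master yarn --deploy-mode cluster "
                "--class com.example.MyJob "  # placeholder main class
                "s3://my-bucket/jars/my-job.jar"  # placeholder jar path
            ),
        )

Keeping the private key in the Airflow connection (e.g. pointing key_file at a mounted path, or with newer SSH providers putting the key material itself in the connection's extra field) would at least keep it out of the Git repo, though it still has to live somewhere the containers can reach.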
--
Daniel Siegmann
Senior Software Engineer
*SecurityScorecard Inc.*
214 W 29th Street, 5th Floor
New York, NY 10001