Hi all,
Posting this to dev instead of users because it crosses into framework
territory.
I've been using Airflow for six months or so and I'm starting to think
about how to better manage Airflow tasks that are proxies for compute
tasks running elsewhere, e.g. steps on Amazon EMR clusters. It's easy
enough to use a DAG with the various existing Emr...Operators to create
clusters, add steps and tear down. However, with large numbers of
parallel steps it's hard to manage the creation of EMR clusters, reuse
them in various steps and potentially even dynamically scale the EMR
cluster count. I want something more akin to a producer/consumer queue
for EMR steps. Before I write an AIP, I'd like to ask: is anyone aware of
other work or development in this area?
*Example Problem 1*: cluster reuse.
A workflow might need to spin up and execute steps on multiple EMR
clusters of different sizes, e.g.
- EMR Cluster A
-- Phase 1 EMR Steps
-- Phase 2 EMR Steps
-- Phase 3 EMR Steps
-- Phase 4 EMR Steps
- EMR Cluster B
-- Phase 5 EMR Steps
-- Phase 6 EMR Steps
- EMR Cluster C
-- Phase 7 EMR Steps
-- Phase 8 EMR Steps
The above steps are serial: each phase requires the previous one to
finish. The most basic way to model this in a DAG is to use
EmrCreateJobFlowOperator for the cluster, then an EmrAddStepsOperator and
EmrStepSensor pair for each phase, and finally EmrTerminateJobFlowOperator
to tear the cluster down. We can use XCom to fetch the cluster id for
AddSteps/TerminateJobFlow and the step ids for the StepSensor (a minimal
sketch follows the list below).
- EmrCreateJobFlowOperator
- Phase 1: EmrAddStepsOperator + EmrStepSensor
- Phase 2: EmrAddStepsOperator + EmrStepSensor
- Phase 3: EmrAddStepsOperator + EmrStepSensor
- Phase 4: EmrAddStepsOperator + EmrStepSensor
- EmrTerminateJobFlowOperator
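A minimal sketch of this shape, assuming the contrib Emr operators from
Airflow 1.10; JOB_FLOW_OVERRIDES and PHASE_STEPS below are placeholders
for a real cluster spec and per-phase step definitions, not working
configs:

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
from airflow.contrib.operators.emr_create_job_flow_operator import EmrCreateJobFlowOperator
from airflow.contrib.operators.emr_terminate_job_flow_operator import EmrTerminateJobFlowOperator
from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor

JOB_FLOW_OVERRIDES = {'Name': 'cluster-a'}  # placeholder cluster spec
PHASE_STEPS = [[], [], [], []]              # placeholder steps for phases 1-4
CLUSTER_ID = "{{ task_instance.xcom_pull(task_ids='create_cluster', key='return_value') }}"

with DAG('emr_serial_phases', start_date=datetime(2019, 1, 1),
         schedule_interval=None) as dag:
    create_cluster = EmrCreateJobFlowOperator(
        task_id='create_cluster',
        job_flow_overrides=JOB_FLOW_OVERRIDES,
    )
    terminate_cluster = EmrTerminateJobFlowOperator(
        task_id='terminate_cluster',
        job_flow_id=CLUSTER_ID,   # cluster id via XCom
        trigger_rule='all_done',  # tear down even after a failed phase
    )

    prev = create_cluster
    for phase, steps in enumerate(PHASE_STEPS, start=1):
        add_steps = EmrAddStepsOperator(
            task_id='phase_%d_add_steps' % phase,
            job_flow_id=CLUSTER_ID,
            steps=steps,
        )
        watch = EmrStepSensor(
            task_id='phase_%d_watch' % phase,
            job_flow_id=CLUSTER_ID,
            # EMR runs a cluster's steps serially, so watching the last
            # step added covers the whole phase.
            step_id="{{ task_instance.xcom_pull(task_ids='phase_%d_add_steps', key='return_value')[-1] }}" % phase,
        )
        prev >> add_steps >> watch
        prev = watch
    prev >> terminate_cluster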
One problem here is that if the underlying EMR cluster fails at any time
- e.g. it uses spot EC2 instances and AWS runs out of capacity - someone
has to attend to the failed task instance manually: restart the EMR
cluster, then reset the state of the failed AddSteps/StepSensor task
pairs. It needs close supervision.
There are other ways to model this workflow in a DAG with different
trade-offs, e.g.
1. Put each phase in a SubDag that creates and terminates its own
cluster, with any failed task causing the whole SubDag to retry (rough
sketch after this list). But this adds significant extra duration because
the cluster is stopped and started for every phase.
2. Write custom operators: one to represent an EMR cluster as an
EmrSubDagOperator that creates and eventually terminates the cluster, and
another to create sub-tasks that use XCom to fetch the cluster id from
the "parent" SubDag, add the EMR steps and wait.
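For option 1, a rough sketch, assuming a SubDagOperator retry re-runs the
inner DAG (in practice inner task states can still need clearing, which
is part of the pain); phase_subdag() would contain the same
create >> add >> watch >> terminate chain as the sketch above, scoped to
one phase:

from airflow.operators.subdag_operator import SubDagOperator

def phase_subdag(parent_dag_id, phase, steps):
    subdag = DAG('%s.phase_%d' % (parent_dag_id, phase),
                 start_date=datetime(2019, 1, 1), schedule_interval=None)
    # create cluster >> add steps >> watch >> terminate cluster, all
    # inside this subdag, so a retry repeats the whole lifecycle.
    return subdag

phase_1 = SubDagOperator(
    task_id='phase_1',
    subdag=phase_subdag('emr_serial_phases', 1, PHASE_STEPS[0]),
    retries=2,  # heals a dead cluster, but pays the start/stop cost
    dag=dag,
)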
Ideally, however, I'd like Airflow to manage this itself: it could know
that task instances require certain resources or resource instances,
then start/stop them as required.
*Example Problem 2*: parallel tasks.
A workflow might be split into many parallel steps, e.g. for different
data partitions or bins.
- Phase P EMR Steps
-- Parallel EMR step for bin 0
-- ...
-- Parallel EMR step for bin N
The above steps are independent, each computing a subset of the larger
problem, so they can run in parallel. A basic way to model this as a DAG
is to create as many branches as the desired parallelism level, e.g. for
a parallelism of two (sketched in code after the list):
- EmrCreateJobFlowOperator
-- Bin 0: EmrAddStepsOperator + EmrStepSensor
-- ...
-- Bin max(even(B)): EmrAddStepsOperator + EmrStepSensor
-- EmrTerminateJobFlowOperator
- EmrCreateJobFlowOperator
-- Bin 1: EmrAddStepsOperator + EmrStepSensor
-- ...
-- Bin max(odd(B)): EmrAddStepsOperator + EmrStepSensor
-- EmrTerminateJobFlowOperator
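Continuing the earlier sketch (same imports, DAG and placeholders), the
static fan-out for a parallelism of two looks roughly like this; note how
the parallelism is frozen into the task ids and edges, so rebalancing
bins means regenerating the DAG:

PARALLELISM = 2
NUM_BINS = 8                               # placeholder
BIN_STEPS = [[] for _ in range(NUM_BINS)]  # placeholder per-bin steps

for branch in range(PARALLELISM):
    cluster_id = ("{{ task_instance.xcom_pull(task_ids='create_cluster_%d', "
                  "key='return_value') }}" % branch)
    create = EmrCreateJobFlowOperator(
        task_id='create_cluster_%d' % branch,
        job_flow_overrides=JOB_FLOW_OVERRIDES,
        dag=dag,
    )
    terminate = EmrTerminateJobFlowOperator(
        task_id='terminate_cluster_%d' % branch,
        job_flow_id=cluster_id,
        trigger_rule='all_done',
        dag=dag,
    )
    # Static round-robin assignment: even bins on branch 0, odd on branch 1.
    for b in range(branch, NUM_BINS, PARALLELISM):
        add = EmrAddStepsOperator(
            task_id='bin_%d_add_steps' % b,
            job_flow_id=cluster_id,
            steps=BIN_STEPS[b],
            dag=dag,
        )
        watch = EmrStepSensor(
            task_id='bin_%d_watch' % b,
            job_flow_id=cluster_id,
            step_id=("{{ task_instance.xcom_pull(task_ids='bin_%d_add_steps', "
                     "key='return_value')[0] }}" % b),
            dag=dag,
        )
        create >> add >> watch >> terminate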
This has the same management problem as the previous example when
clusters fail, but also another challenge: the parallelism level is
statically coded into the DAG topology. It's hard to scale up/down and
rebalance the tasks for the bins.
I could create a separate "EMR cluster" management service outside of
Airflow and write a custom Airflow operator that puts EMR steps into a
queue, then have that service auto-scale on queue depth etc. (rough
sketch below). If the queue were Celery, this starts to look like a
specialised Airflow Executor.
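A loose sketch of that operator; everything here is hypothetical
(EmrStepQueueOperator and the queue message contract are invented), and
the consuming auto-scaler service is left entirely unwritten:

import json

import boto3
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

class EmrStepQueueOperator(BaseOperator):
    """Hypothetical: enqueue EMR steps for an external service to run."""

    @apply_defaults
    def __init__(self, queue_url, steps, cluster_spec, *args, **kwargs):
        super(EmrStepQueueOperator, self).__init__(*args, **kwargs)
        self.queue_url = queue_url
        self.steps = steps
        self.cluster_spec = cluster_spec

    def execute(self, context):
        # The service picks (or creates) a matching cluster and runs the
        # steps; a companion sensor would poll for its completion signal.
        boto3.client('sqs').send_message(
            QueueUrl=self.queue_url,
            MessageBody=json.dumps({
                'cluster_spec': self.cluster_spec,
                'steps': self.steps,
            }),
        )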
*Solutions*:
Without yet doing much research, I've considered some competing
solutions for both problems:
1. Service for managing resources. A custom operator communicates with
the service to schedule an atomic set of "EMR steps" against a given EMR
cluster specification. The service decides whether to reuse an existing
cluster or spin up a new one, can auto-scale, etc. We can represent both
serial and parallel step patterns in the Airflow DAG itself.
2. Build resource management into Airflow.
2a. Allow tasks to specify *resource* dependencies. As described above, a
new dimension of dependencies would let Airflow manage when instances of
a resource should be spun up or otherwise acquired (invented syntax
sketched below).
2b. Allow Airflow to have multiple Executors (which could be an
implementation of 2a), e.g. EmrClusterExecutor. The scheduler still does
its thing, but tasks run on a specialised kind of executor that
understands EMR steps (skeleton below).
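To make 2a concrete, some invented syntax; EmrClusterResource and
resource_deps do not exist in Airflow, this only illustrates tasks
declaring the resources they depend on:

cluster_a = EmrClusterResource(spec=JOB_FLOW_OVERRIDES)  # hypothetical

add_steps = EmrAddStepsOperator(
    task_id='phase_1_add_steps',
    resource_deps=[cluster_a],  # hypothetical: Airflow would acquire or
    steps=PHASE_STEPS[0],       # start the cluster before the task runs,
    dag=dag,                    # and release it once nothing pending
)                               # still needs it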
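And for 2b, a skeleton of what such an executor might look like; the
BaseExecutor hooks are real, but everything EMR-specific is hypothetical
and the interface details vary by Airflow version:

from airflow.executors.base_executor import BaseExecutor

class EmrClusterExecutor(BaseExecutor):
    """Hypothetical executor that runs task work as EMR steps."""

    def start(self):
        # Acquire or create EMR clusters matching the configured specs.
        pass

    def execute_async(self, key, command, queue=None, executor_config=None):
        # Rather than running `command` locally, submit the task's EMR
        # steps to a matching cluster and record the (key, step_id) pair.
        pass

    def sync(self):
        # Poll step states and report back via self.success(key) /
        # self.fail(key).
        pass

    def end(self):
        # Tear down clusters that no longer have pending work.
        pass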
Any thoughts? Has anyone worked on this set of problems before? I'm
specifically looking at EMR right now, but I suspect there are many
other use-cases.
Regards,
Jon