Hello,

My Airflow DAG consists of 2 tasks (roughly sketched below):
1) a map-reduce job (writes its output to S3)
2) a Hive load (using the files from 1)
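
For reference, the DAG looks something like this (the dag_id, commands, and dates are just placeholders, and the imports assume an Airflow 1.x-style layout):

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator  # Airflow 1.x import path

    default_args = {
        "owner": "nadeem",
        "retries": 1,
        "retry_delay": timedelta(minutes=5),
    }

    dag = DAG(
        dag_id="mr_to_hive",              # placeholder dag_id
        default_args=default_args,
        start_date=datetime(2017, 1, 1),  # placeholder start date
        schedule_interval="0 */3 * * *",  # every 3 hours, 8 runs a day
    )

    # Task 1: map-reduce job that writes its output to S3
    mr_job = BashOperator(
        task_id="map_reduce_to_s3",
        bash_command="run_mr_job.sh {{ ds }}",     # placeholder command
        dag=dag,
    )

    # Task 2: Hive load that reads the files produced by task 1
    hive_load = BashOperator(
        task_id="hive_load",
        bash_command="run_hive_load.sh {{ ds }}",  # placeholder command
        dag=dag,
    )

    mr_job >> hive_load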

My EMR Hadoop cluster runs on AWS spot instances, so when spot instance
pricing goes up, the cluster dies and a new one comes up.

When a cluster dies, I clear all the Hive load tasks in Airflow so that they
re-run and rebuild the tables in the new cluster from the files in S3.
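
Concretely, the clearing step amounts to something like this (only a sketch
against the Airflow 1.x-style Python API; the dag_id, task name, and date
window are placeholders):

    from datetime import datetime
    from airflow.models import DagBag

    dag = DagBag().get_dag("mr_to_hive")                  # placeholder dag_id

    # Select only the hive load task and clear its task instances so the
    # scheduler re-runs them against the new cluster.
    hive_only = dag.sub_dag(task_regex="hive_load",
                            include_upstream=False,
                            include_downstream=False)
    hive_only.clear(start_date=datetime(2017, 1, 1),      # placeholder window
                    end_date=datetime(2017, 2, 1))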

But over time this approach becomes inefficient as the backfill grows very
large. My DAG runs every 3 hours (8 runs a day), so if the cluster goes down
after a month, Airflow now has to backfill 240 (8 * 30) cleared tasks, and
this backfill only gets bigger with time.

What would be a better way to handle this? Currently I'm planning to re-base
Airflow manually once a month: bring everything down and restart Airflow with
a new start date of the current day (sketched below). That keeps the backfill
bounded to at most a month, but there has to be a better way of doing this.
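
The re-base would just be redeploying the same DAG with the start date moved
up to the current day, something like (again only a sketch; names and dates
are placeholders):

    from datetime import datetime
    from airflow import DAG

    # Redeploy with start_date bumped forward so the scheduler no longer
    # sees a month's worth of cleared runs to backfill.
    dag = DAG(
        dag_id="mr_to_hive",              # placeholder dag_id
        start_date=datetime(2017, 6, 1),  # moved up to the re-base day
        schedule_interval="0 */3 * * *",  # still every 3 hours
    )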

Any suggestions would be appreciated.

Thanks,
Nadeem
