Hello,

My Airflow DAG consists of two tasks: 1) map-reduce jobs (which write their output to S3), and 2) Hive loads (which use the files produced in step 1).
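For context, the DAG is roughly shaped like this; the task ids, commands, and dates are simplified placeholders, not the real job definitions:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "nadeem",
    "start_date": datetime(2017, 1, 1),  # placeholder
    "retries": 1,
    "retry_delay": timedelta(minutes=10),
}

# Runs every 3 hours, i.e. 8 runs a day
dag = DAG(
    "mr_to_hive_loads",
    default_args=default_args,
    schedule_interval="0 */3 * * *",
)

# Task 1: map-reduce job on the EMR cluster; output lands in S3
map_reduce = BashOperator(
    task_id="map_reduce_to_s3",
    bash_command="echo 'submit map-reduce step to EMR'",  # placeholder
    dag=dag,
)

# Task 2: Hive load that rebuilds tables from the S3 files written by task 1
hive_load = BashOperator(
    task_id="hive_load",
    bash_command="echo 'run hive load from s3'",  # placeholder
    dag=dag,
)

map_reduce >> hive_load
```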
My EMR Hadoop cluster runs on AWS spot instances, so when spot pricing goes up the cluster dies and a new one comes up. After a cluster death, I clear all of the Hive load tasks in Airflow so that the tables are rebuilt in the new cluster from the files in S3. Over time, though, this approach becomes inefficient because the backfill gets very large. My DAG runs every 3 hours (8 runs a day), so if the cluster goes down after a month, Airflow now has to backfill 240 (8 * 30) cleared tasks, and the backfill only grows with time.

What would be a better way to handle this? Currently I'm planning to re-base Airflow manually once a month: bring everything down and restart Airflow with a new start_date of the current day. That would keep the backfill bounded to at most a month, but there has to be a better way of doing this.

Please provide any suggestions.

Thanks,
Nadeem
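P.S. To make the re-base idea concrete, this is roughly what I mean; only the start_date in default_args would be bumped each month, and the dates here are just placeholders:

```python
from datetime import datetime, timedelta

# Monthly "re-base": redeploy the DAG with start_date moved up to the current
# day, so the scheduler never has more than about a month of runs to backfill.
default_args = {
    "owner": "nadeem",
    "start_date": datetime(2017, 2, 1),  # was datetime(2017, 1, 1) last month
    "retries": 1,
    "retry_delay": timedelta(minutes=10),
}
```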
