Is the problem that you lose the data, or the database?
If it's that you lose the DB, try using a permanent MySQL instance (or even
RDS) for your DB.
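
For example (a sketch only -- the user, password, host, and database name
below are placeholders, not details from this thread), pointing the Airflow
metadata DB at an external MySQL/RDS instance is just the sql_alchemy_conn
setting in airflow.cfg:

    [core]
    # hypothetical RDS endpoint; substitute your own credentials and host
    sql_alchemy_conn = mysql://airflow_user:[email protected]:3306/airflow
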
If it's that you lose your "digested" Hive data, you can do snapshots of
the disk set.
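
And for the snapshot route, something along these lines (a rough sketch using
boto3; the region, volume ID, and description are assumptions, not details
from your setup) would snapshot an EBS volume holding the warehouse data:

    import boto3

    # hypothetical region and volume -- replace with the volume that backs
    # your Hive warehouse directory
    ec2 = boto3.client('ec2', region_name='us-east-1')

    snap = ec2.create_snapshot(
        VolumeId='vol-0123456789abcdef0',
        Description='hive-warehouse-backup',
    )
    print(snap['SnapshotId'])

You could run that on a schedule (or even as another Airflow task), so a fresh
cluster can be restored from the latest snapshot instead of replaying months
of loads.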



On Thu, Aug 25, 2016 at 1:37 PM, Nadeem Ahmed Nazeer <[email protected]>
wrote:

> Hello Airflowers,
>
> Does someone see a better way to do this? It would really help my Airflow
> setup.
>
> Thanks,
> Nadeem
>
> On Thu, Aug 11, 2016 at 3:09 PM, Nadeem Ahmed Nazeer <[email protected]>
> wrote:
>
> > Hello,
> >
> > My Airflow DAG consists of 2 tasks:
> > 1) MapReduce jobs (write output to S3)
> > 2) Hive loads (using the files from 1)
> >
> > My EMR Hadoop cluster is running on AWS spot instances. So when spot
> > instance pricing goes up, my cluster dies and a new one comes up.
> >
> > In the event of a cluster death, I clear all the Hive load tasks from
> > Airflow so that it rebuilds the tables in the new cluster from the files
> > in S3.
> >
> > But over time, as the backfill becomes very large, this approach becomes
> > inefficient. My DAG runs every 3 hours (8 runs a day), so if, for example,
> > the cluster goes down after a month, Airflow now has to backfill 240
> > (8 * 30) cleared tasks. This backfill only gets bigger with time.
> >
> > What could be a better way to handle this? Currently, I'm planning to
> > re-base Airflow manually once a month, wherein I will bring everything
> > down and rerun Airflow with a new start date of the current day. This
> > will reduce the backfill and keep it within a month's limit. But there's
> > got to be a better way of doing this.
> >
> > Please provide any suggestions.
> >
> > Thanks,
> > Nadeem
> >
>



-- 
Lance Norskog
[email protected]
Redwood City, CA
