Data Vault on Hive + Airflow example

2018-02-28 Thread Gerard Toonstra
Yesterday I finished the draft of a new example on the "ETL with airflow" site. This example explores the concept of a "Data vault" methodology on top of Hive, 100% orchestrated by airflow: https://gtoonstra.github.io/etl-with-airflow/datavault2.html The theory of the data vault is that you can

Re: How to add hooks for strong deployment consistency?

2018-02-28 Thread David Capwell
Thanks for all the details! With a pluggable fetcher we would be able to add our own logic for how to fetch so sounds like a good place to start for something like this! On Wed, Feb 28, 2018, 4:39 PM Joy Gao wrote: > +1 on DagFetcher abstraction, very airflow-esque :) > > On

Re: How to add hooks for strong deployment consistency?

2018-02-28 Thread Joy Gao
+1 on DagFetcher abstraction, very airflow-esque :) On Wed, Feb 28, 2018 at 11:25 AM, Maxime Beauchemin wrote: > Addressing a few of your questions / concerns: > > * The scheduler uses a multiprocess queue to queue up tasks, each > subprocess is in charge of a single

Re: How to add hooks for strong deployment consistency?

2018-02-28 Thread Maxime Beauchemin
Addressing a few of your questions / concerns: * The scheduler uses a multiprocess queue to queue up tasks, each subprocess is in charge of a single DAG "scheduler cycle" which triggers what it can for active DagRuns. Currently it fills the DagBag from the local file system, looking for a

Re: Ash Berlin-Taylor joins Apache Airflow as committer and PPMC member

2018-02-28 Thread Chris Riccomini
Welcome! :) On Sun, Feb 25, 2018 at 5:16 PM, Maxime Beauchemin < maximebeauche...@gmail.com> wrote: > Congrats and welcome! > > On Sat, Feb 24, 2018 at 1:21 PM, Naik Kaxil wrote: > > > Congrats Ash ( > > > > On 24/02/2018, 20:23, "fo...@driesprongen.nl on behalf of

Re: How to add hooks for strong deployment consistency?

2018-02-28 Thread Chris Palmer
I'll preface this with the fact that I'm relatively new to Airflow, and haven't played around with a lot of the internals. I find the idea of a DagFetcher interesting but would we worry about slowing down the scheduler significantly? If the scheduler is having to "fetch" multiple different DAG