Hello everyone, First of all, I have submitted the idea to JIRA[1], and after speaking with people on Gitter, they told me to bring the discussion here too.
Right now Airflow only understands being a date-based scheduler. It is extremely complete in that sense, and makes it really easy to populate and backfill your DAGs. Monitoring is quite decent and can be improved through plugins. Everything is code, as opposed to most of the alternatives out there[2][3][4], and you may or may not depend on files existing to move to the next step. There is a UI that lets you visualize the status of your systems and trigger jobs manually.

There is a limitation, however, in running on dates only: sometimes there are DAGs that do not depend on the date, but on the dataset. Some examples I am close to:

* Bioinformatics pipelines, where you process samples
* Media pipelines, where you may process different videos/audios in the same way

Right now I am using Snakemake for the first and bash scripts for the second, but I have thought that maybe Airflow could be a solution to both problems. I have been reading the code, and although execution_date is quite tightly coupled to the rest of the codebase, it seems doable to abstract the datatype of this parametrization variable (datetime) and extend it to something that could depend on something else (a string). After all, from what I have seen, execution_date is just the parametrization variable.

Questions I would like to ask:

* Is this a need you have had? If so, how did you solve it? Is there any other tool with the features I described that could help me with it?
* How do you recommend solving this with Airflow?
* On Gitter, people have proposed forgetting about execution_dates, just triggering the DAGs and parametrizing the runs through variables (a rough sketch of this follows below). However, this has the drawback of losing execution tracking, and it makes it impossible to run several runs of the same DAG at the same time for different datasets.
* There was also the proposal to instantiate subDAGs per dataset, and have one DAG whose first step is to read which samples to run on. The problem I see with this is that you lose track of which samples have been run, and you cannot have per-sample historic data.
* Airflow works well when you have datasets that change, so another solution would be to instantiate one DAG per sample and then have a single execution (also sketched below). However, this sounds a bit overkill to me, because you would end up with just one DAGRun per DAG.
* If this is something that would be interesting to you, and you would like to see this use case solved within Airflow, please say so, as I am interested in making a proposal that is both simple and works for everyone.

Right now the best idea I have is:

* Rename execution_date to parametrization_value, changing its datatype to string. We ensure backward compatibility because already existing execution_dates can be serialized.
* Create a new entity called parametrization_group, where we could group these parameters so that the scheduler knows it needs to trigger a DAGRun on every DAG that depends on such a group.
* Extend the CLI a bit to let it modify these parametrization_groups.
* Extend the scheduler to understand which parametrization_group a DAG depends on, and trigger those DAGs to run when new parametrization_group elements are added.
* Enable backfill to run without --start-date and --end-date when the DAGs depend on a parametrization_group, and with an optional --parametrization-values flag that accepts a list of values to work on.

How does all this sound to you? Any ideas?
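To make the "trigger manually and parametrize through variables" approach concrete, here is a rough sketch of how it can be done today with dag_run.conf. The dag_id, the sample name and the process.sh script are made-up placeholders, so please take this as an illustration under my assumptions rather than a recommendation:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# schedule_interval=None: the DAG only runs when triggered externally.
dag = DAG(
    dag_id="process_sample",
    schedule_interval=None,
    start_date=datetime(2018, 1, 1),
)

process = BashOperator(
    task_id="process",
    # dag_run.conf is only populated for externally triggered runs.
    bash_command="process.sh {{ dag_run.conf['sample'] }}",
    dag=dag,
)

and then trigger it per dataset with something like:

airflow trigger_dag process_sample --conf '{"sample": "sample_a"}'

The catch, as mentioned above, is that each run is still identified by its execution_date, so the UI and the history are keyed by date rather than by sample, which is exactly the tracking problem described in that bullet.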
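For the "one DAG per sample" idea, this is a minimal sketch of generating the DAGs dynamically. Again, SAMPLES and process.sh are placeholders; in practice the list would come from a file, a database or an Airflow Variable:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

SAMPLES = ["sample_a", "sample_b", "sample_c"]  # placeholder list of datasets

for sample in SAMPLES:
    dag_id = "process_{}".format(sample)
    dag = DAG(
        dag_id=dag_id,
        schedule_interval="@once",  # each per-sample DAG runs exactly once
        start_date=datetime(2018, 1, 1),
    )
    BashOperator(
        task_id="process",
        bash_command="process.sh {}".format(sample),
        dag=dag,
    )
    # The scheduler only picks up DAG objects reachable at module level,
    # hence the explicit registration in globals().
    globals()[dag_id] = dag

It works, but as said above it leaves you with exactly one DAGRun per DAG, which is why it feels like overkill.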
Cheers,
Javier

[1] JIRA ticket for dataset-related execution: https://issues.apache.org/jira/browse/AIRFLOW-2480
[2] Awesome 1: https://github.com/meirwah/awesome-workflow-engines
[3] Awesome 2: https://github.com/pawl/awesome-etl
[4] Awesome 3: https://github.com/pditommaso/awesome-