We were struggling with the same problem and came up with fileflow
(http://github.com/industrydive/fileflow), which we wrote to deal with
passing data down a DAG in Airflow. We co-opt Airflow's task dependency
system to represent the data dependencies and let fileflow handle
knowing where the data is stored and how downstream tasks get at it.
We've considered rolling something into fileflow that lets you specify
the data dependencies more naturally (i.e.
task.data_dependency(other_task)), since right now the code you write to
manage the data dependencies is split between the co-opted Airflow task
dependency system (.set_upstream() and the like) and operator args, but
we haven't even started to think about how to hook into the task
dependency system as a plugin.

We definitely felt this was missing, especially since we came at it not
from the POV of orchestrating workflows that trigger external tools
(like an external Spark or Hadoop job), but of pipelining together
arbitrary Python scripts that run in our Airflow workers themselves and
pass their outputs to each other; closer to what you would write a
Makefile for, but we wanted all the nice Airflow scheduling, queue
management, and workflow profiling for free :)

Laura

On Mon, Jan 23, 2017 at 11:05 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:

> Hi All,
>
> I came across a write-up of some of the downsides of current workflow
> management systems like Airflow and Luigi
> (http://bionics.it/posts/workflows-dataflow-not-task-deps), where they
> argue that dependencies should be between the inputs and outputs of
> tasks rather than between the tasks themselves (inlets/outlets).
>
> They extended Luigi (https://github.com/pharmbio/sciluigi) to do this
> and even published a scientific paper on it:
> http://jcheminf.springeropen.com/articles/10.1186/s13321-016-0179-6
>
> I kind of like the idea. Has anyone played with it? Any thoughts? I
> might want to try it in Airflow.
>
> Bolke
