We were struggling with the same problem and came up with fileflow <http://github.com/industrydive/fileflow>, which we wrote to deal with passing data down a DAG in Airflow. We co-opt Airflow's task dependency system to represent the data dependencies, and fileflow handles knowing where each task's output is stored and how downstream tasks get at it. We've considered adding something to fileflow that lets you specify the data dependencies more naturally (e.g. task.data_dependency(other_task)), since right now the code you write to manage data dependencies is split between the co-opted Airflow task dependency system (.set_upstream() and the like) and operator args. But we haven't even started to think about how to hook into the task dependency system as a plugin.
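To make that split concrete, here is a minimal sketch of the pattern (the operator class and the data_dependencies argument are illustrative stand-ins, not fileflow's actual API): each task writes its output to a location derived from its task_id, a downstream task is told which upstream outputs to read via an operator arg, and then the very same edge has to be declared again with .set_upstream() so the scheduler orders the tasks.

    import json
    import os
    from datetime import datetime

    from airflow import DAG
    from airflow.models import BaseOperator


    class DataAwareOperator(BaseOperator):
        """Illustrative operator: reads upstream outputs, writes its own."""

        def __init__(self, fn, data_dependencies=None, *args, **kwargs):
            super(DataAwareOperator, self).__init__(*args, **kwargs)
            self.fn = fn
            self.data_dependencies = data_dependencies or {}

        def _path_for(self, task_id):
            # fileflow's job is centralizing this "where does task X's
            # data live?" logic; a local file here, could be an S3 key.
            return '/tmp/fileflow-demo/{}.json'.format(task_id)

        def execute(self, context):
            if not os.path.isdir('/tmp/fileflow-demo'):
                os.makedirs('/tmp/fileflow-demo')
            # Pull in the outputs of the tasks we declared data deps on.
            inputs = {}
            for name, task_id in self.data_dependencies.items():
                with open(self._path_for(task_id)) as f:
                    inputs[name] = json.load(f)
            # Write our own output where downstream tasks can find it.
            with open(self._path_for(self.task_id), 'w') as f:
                json.dump(self.fn(inputs), f)


    dag = DAG('fileflow_sketch', start_date=datetime(2017, 1, 1))

    produce = DataAwareOperator(
        task_id='produce', fn=lambda _: [1, 2, 3], dag=dag)
    consume = DataAwareOperator(
        task_id='consume',
        fn=lambda inputs: sum(inputs['numbers']),
        data_dependencies={'numbers': 'produce'},  # data edge, as an operator arg
        dag=dag)

    # The same edge, declared a second time for the scheduler:
    consume.set_upstream(produce)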
We definitely felt this was missing, especially since we were coming at it not from the point of view of orchestrating workflows that trigger external tools (like an external Spark or Hadoop job), but of pipelining arbitrary Python scripts that run on our Airflow workers themselves and pass their outputs to each other. It's closer to what you would write a Makefile for, but we wanted to get all the nice Airflow scheduling, queue management, and workflow profiling for free :)

Laura

On Mon, Jan 23, 2017 at 11:05 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:

> Hi All,
>
> I came across a write-up of some of the downsides of current workflow
> management systems like Airflow and Luigi
> (http://bionics.it/posts/workflows-dataflow-not-task-deps), where they
> argue that dependencies should be between the inputs and outputs of tasks
> rather than between the tasks themselves (inlets/outlets).
>
> They extended Luigi (https://github.com/pharmbio/sciluigi) to do this and
> even published a scientific paper on it:
> http://jcheminf.springeropen.com/articles/10.1186/s13321-016-0179-6
>
> I kind of like the idea, has anyone played with it, any thoughts? I might
> want to try it in Airflow.
>
> Bolke
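For anyone who hasn't clicked through: the inlet/outlet style sciluigi proposes wires a task's output method directly to another task's input attribute, so the dependency is stated once, between the data. A rough sketch based on the sciluigi README (treat the exact names as approximate):

    import sciluigi


    class FooWriter(sciluigi.Task):
        # An "outlet": a method returning where this task's output lives.
        def out_foo(self):
            return sciluigi.TargetInfo(self, 'foo.txt')

        def run(self):
            with self.out_foo().open('w') as f:
                f.write('foo\n')


    class FooConsumer(sciluigi.Task):
        # An "inlet": assigned when the workflow is wired up.
        in_foo = None

        def out_result(self):
            return sciluigi.TargetInfo(self, 'result.txt')

        def run(self):
            with self.in_foo().open() as fin, \
                    self.out_result().open('w') as fout:
                fout.write(fin.read().upper())


    class MyWorkflow(sciluigi.WorkflowTask):
        def workflow(self):
            writer = self.new_task('writer', FooWriter)
            consumer = self.new_task('consumer', FooConsumer)
            # The dependency is between data, not between tasks:
            consumer.in_foo = writer.out_foo
            return consumer


    if __name__ == '__main__':
        sciluigi.run_local(main_task_cls=MyWorkflow)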