When we first started using Airflow, we had legacy systems running on Autosys and cloud systems running on Airflow, and we needed to bridge the gap between them. We also wanted to move away from time-based scheduling/processing toward a more dependency-driven model. Our answer was to create an external dependency-management microservice that allowed for all of this, but we wanted to integrate it with Airflow somehow, since our teams did not like having to "create" jobs twice.
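(For illustration only: one way an external dependency service like this can be surfaced inside Airflow without defining jobs twice is a thin sensor that polls it. The sketch below is hypothetical; the service endpoint, its /ready contract, and the dependency_key parameter are invented, and the BaseSensorOperator import path varies between Airflow versions.)

```python
# Hypothetical sketch: an Airflow sensor that defers to an external
# dependency-management service. The "depsvc" endpoint and its /ready
# API are invented for illustration purposes.
import requests
from airflow.sensors.base import BaseSensorOperator  # path differs by Airflow version


class ExternalDependencySensor(BaseSensorOperator):
    """Pokes a (hypothetical) dependency service until the upstream
    dependency it tracks is satisfied, then lets the DAG proceed."""

    def __init__(self, dependency_key, endpoint="http://depsvc/ready", **kwargs):
        super().__init__(**kwargs)
        self.dependency_key = dependency_key
        self.endpoint = endpoint

    def poke(self, context):
        # Return True once the external service reports the dependency as ready.
        resp = requests.get(self.endpoint, params={"key": self.dependency_key})
        resp.raise_for_status()
        return resp.json().get("ready", False)
```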
As we've evolved, yet another case has come up: streaming / push-driven workflows. Our existing dependency service doesn't really handle these yet, and we've had some challenges integrating them cleanly with Airflow because, at the time, there was no way to externally trigger DAGs. Ideally it would be great to come up with a solution that covers all three scheduling needs. Where my mind is right now is whether it is best to solve this entirely as an Airflow solution, or as something more external that works across multiple systems with strong integration with Airflow.

Jesse

________________________________
From: Boris Tyukin <bo...@boristyukin.com>
Sent: Monday, January 23, 2017 10:41:47 AM
To: dev@airflow.incubator.apache.org
Subject: Re: Flow-based Airflow?

This is a good discussion. Most traditional ETL tools (SSIS, Informatica, DataStage, etc.) have both control flow (or task dependency) and data flow. Some tools, like SSIS, make a clear distinction between them - you create a control flow that calls data flows as part of the overall control flow. Some tools, like Alteryx, were data-flow only, which is also very limiting.

I am pretty sure the same concepts will come to the Hadoop world, as we saw with SQL (just look how many SQL-on-Hadoop engines are on the market now, and how everyone tried not to do SQL 5 years ago). Same thing with UI drag-and-drop tools - just give it some time and we will see a new wave of tools that do not require Python, Java, or Scala knowledge and lower the skillset requirements.

I do agree that limiting Airflow to a task-dependency tool is not a good strategy in the long run, just as I never liked the concept of backfills in Airflow - ideally the tool should give a choice and support many design patterns. At this very moment, though, I find Airflow to be the best tool for the job, and the fact that it does not support data flows in the Hadoop world is not a deal breaker. Most Hadoop pipelines are about gluing together different steps and tools, which Airflow is extremely good at!

On Mon, Jan 23, 2017 at 11:05 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:
> Hi All,
>
> I came by a write-up of some of the downsides in current workflow
> management systems like Airflow and Luigi
> (http://bionics.it/posts/workflows-dataflow-not-task-deps) where they
> argue dependencies should be between inputs and outputs of tasks rather
> than between tasks (inlets/outlets).
>
> They extended Luigi (https://github.com/pharmbio/sciluigi) to do this and
> even published a scientific paper on it:
> http://jcheminf.springeropen.com/articles/10.1186/s13321-016-0179-6
>
> I kind of like the idea, has anyone played with it, any thoughts? I might
> want to try it in Airflow.
>
> Bolke
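(To make the inlets/outlets idea from Bolke's mail concrete: instead of declaring edges between tasks, each task declares named outputs and inputs, and the dependency graph falls out of how outputs are wired to inputs. The sketch below is a plain-Python toy of that idea, not the actual sciluigi or Airflow API; all class and task names are invented.)

```python
# Toy illustration of dataflow-style wiring (not sciluigi's actual API).
# Dependencies are derived from which task's output feeds which task's
# input, rather than being declared between the tasks themselves.

class Output:
    def __init__(self, task, name):
        self.task, self.name = task, name

class Task:
    def __init__(self, name):
        self.name = name
        self.inputs = {}          # input name -> Output of another task

    def out(self, name):
        return Output(self, name)

    def upstream(self):
        # The task graph is implied by the data wiring.
        return {out.task for out in self.inputs.values()}

extract = Task("extract")
clean = Task("clean")
load = Task("load")

# Wire outputs to inputs; no explicit extract >> clean >> load is needed.
clean.inputs["raw"] = extract.out("raw_rows")
load.inputs["table"] = clean.out("clean_rows")

assert clean.upstream() == {extract}
assert load.upstream() == {clean}
```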