When we first started using Airflow we had legacy systems running on
AutoSys and cloud systems running on Airflow, and we needed to bridge the
gap between them. We also wanted to move away from time-based
scheduling/processing toward a more dependency-driven model. Our answer was
to create an external dependency-management microservice that allowed for
all of this, but we wanted to somehow integrate it with Airflow, as our
teams did not like having to "create" jobs twice.


As we've evolved, yet another case has come up: streaming / push-driven
workflows. Our existing dependency service doesn't really handle these yet,
and we've had some challenges integrating them cleanly with Airflow
because, at the time, there was no way to externally trigger DAGs.
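To sketch what we're after: an event arrives from a stream and kicks off the corresponding workflow directly, instead of a schedule polling for it. This is a minimal plain-Python illustration of that routing idea, not a real API — `trigger_dag`, the event names, and `EVENT_ROUTES` are all hypothetical stand-ins for whatever external trigger hook the scheduler would expose:

```python
# Sketch of push-driven triggering: events arriving from a stream are
# dispatched to workflow triggers instead of waiting on a cron schedule.
# `trigger_dag` is a hypothetical stand-in for a real trigger hook.

triggered = []

def trigger_dag(dag_id, payload):
    """Stand-in for an external 'start this workflow now' call."""
    triggered.append((dag_id, payload))

# Map event types coming off the stream to the workflow they should start.
EVENT_ROUTES = {
    "orders.batch_ready": "process_orders",
    "feeds.file_landed": "ingest_feed",
}

def on_event(event_type, payload):
    """Route an incoming event to its workflow, if one is registered."""
    dag_id = EVENT_ROUTES.get(event_type)
    if dag_id is not None:
        trigger_dag(dag_id, payload)

# Simulate two events arriving from the stream: each push starts a
# workflow immediately, with no schedule involved.
on_event("orders.batch_ready", {"batch": 42})
on_event("feeds.file_landed", {"path": "/data/feed.csv"})
print(triggered)
```

The point is that the dependency service would own this event-to-workflow routing, and Airflow would only need to expose the trigger hook.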


Ideally it would be great to come up with a solution that covers all three
scheduling needs. Where my mind is right now is whether it is best to solve
this entirely within Airflow, or with something more external that can work
across multiple systems while still integrating strongly with Airflow.


Jesse

________________________________
From: Boris Tyukin <bo...@boristyukin.com>
Sent: Monday, January 23, 2017 10:41:47 AM
To: dev@airflow.incubator.apache.org
Subject: Re: Flow-based Airflow?

This is a good discussion. Most traditional ETL tools (SSIS, Informatica,
DataStage, etc.) have both: control flow (or task dependency) and data
flow. Some tools, like SSIS, make a clear distinction between them: you
create a control flow that calls data flows as part of the overall control
flow. Other tools, like Alteryx, are data-flow only, which is also very
limiting.

I am pretty sure the same concepts will come to the Hadoop world, as we saw
with SQL (just look at how many SQL-on-Hadoop engines are on the market
now, and how everyone tried not to do SQL five years ago). The same goes
for drag-and-drop UI tools: give it some time and we will see a new wave of
tools that do not require Python, Java, or Scala knowledge and will lower
the skillset requirements.

I do agree that limiting Airflow to being a task dependency tool is not a
good strategy in the long run; likewise, I never liked the concept of
backfills in Airflow. Ideally the tool should give you a choice and support
many design patterns.

At this very moment, though, I find Airflow to be the best tool for the
job, and the fact that it does not support data flows in the Hadoop world
is not a deal breaker. Most Hadoop pipelines are about gluing together
different steps and tools, which Airflow is extremely good at!

On Mon, Jan 23, 2017 at 11:05 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:

> Hi All,
>
> I came across a write-up of some of the downsides of current workflow
> management systems like Airflow and Luigi
> (http://bionics.it/posts/workflows-dataflow-not-task-deps), where they
> argue that dependencies should be between the inputs and outputs of tasks
> (inlets/outlets) rather than between the tasks themselves.
>
> They extended Luigi to do this (https://github.com/pharmbio/sciluigi) and
> even published a scientific paper on it:
> http://jcheminf.springeropen.com/articles/10.1186/s13321-016-0179-6 .
>
> I kind of like the idea, has anyone played with it, any thoughts? I might
> want to try it in Airflow.
>
> Bolke
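To make the inlets/outlets idea concrete: instead of declaring task-to-task edges, each task declares named inputs and outputs, and the execution order is derived from matching producers to consumers. This is a rough plain-Python sketch of that derivation (illustrative only; it is not sciluigi's or Airflow's actual API, and the task/file names are made up):

```python
# Dataflow-style wiring: tasks declare named inputs and outputs, and the
# dependency graph is derived by matching them, rather than from explicit
# task-to-task edges.

class Task:
    def __init__(self, name, inputs=(), outputs=()):
        self.name = name
        self.inputs = set(inputs)
        self.outputs = set(outputs)

def toposort(tasks):
    """Order tasks so every producer runs before its consumers."""
    # Index which task produces each named output.
    produced = {}
    for t in tasks:
        for out in t.outputs:
            produced[out] = t

    order, seen = [], set()

    def visit(t):
        if t.name in seen:
            return
        seen.add(t.name)
        # Recurse into whichever task produces each of our inputs.
        for inp in t.inputs:
            if inp in produced:
                visit(produced[inp])
        order.append(t.name)

    for t in tasks:
        visit(t)
    return order

# Note the tasks are listed out of order: the schedule below still comes
# out producer-first, because ordering is inferred from the data.
tasks = [
    Task("report",  inputs={"clean.csv"}, outputs={"report.html"}),
    Task("clean",   inputs={"raw.csv"},   outputs={"clean.csv"}),
    Task("extract",                       outputs={"raw.csv"}),
]
print(toposort(tasks))  # ['extract', 'clean', 'report']
```

The appeal for the cases discussed above is that adding a new consumer of `clean.csv` requires no edits to the existing tasks; the edge appears automatically from the declared inputs and outputs.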
