This is a good discussion. Most traditional ETL tools (SSIS, Informatica,
DataStage, etc.) have both: control flow (or task dependency) and data
flow. Some tools, like SSIS, make a clear distinction between them - you
create a control flow that calls data flows as part of the overall control
flow. Other tools, like Alteryx, were data-flow only, which is also very
limiting.
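
To make the distinction concrete, here is a minimal sketch in plain Python
(the step names and the dict-as-storage are hypothetical stand-ins): in the
control-flow style you declare an explicit ordering between opaque steps,
while in the data-flow style the ordering falls out of which step consumes
which output.

    # Control-flow / task-dependency style: steps are opaque and
    # communicate through a side channel (a dict standing in for
    # HDFS or a database); the engine only enforces the run order.
    store = {}
    def extract_cf():   store["raw"] = ["raw record"]
    def transform_cf(): store["clean"] = [r.upper() for r in store["raw"]]
    def load_cf():      print(store["clean"])

    for step in (extract_cf, transform_cf, load_cf):  # explicit ordering
        step()

    # Data-flow style: dependencies are declared between outputs and
    # inputs, so the ordering is implied rather than declared.
    def extract():          return ["raw record"]
    def transform(records): return [r.upper() for r in records]
    def load(records):      print(records)

    load(transform(extract()))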

I am pretty sure the same concepts will come to the Hadoop world, as we saw
with SQL (just look at how many SQL-on-Hadoop engines are on the market now,
and how everyone tried to avoid SQL five years ago). The same goes for
drag-and-drop UI tools - give it some time and we will see a new wave of
tools that do not require Python, Java, or Scala knowledge and that lower
the skillset requirements.

I do agree that limiting Airflow to a task dependency tool is not a good
strategy in the long run, just as I never liked the concept of backfills in
Airflow - ideally, the tool should give you a choice and support many
design patterns.

At this very moment, though, I find Airflow to be the best tool for the
job, and the fact that it does not support data flows in the Hadoop world
is not a deal breaker. Most Hadoop pipelines revolve around gluing together
different steps and tools, which Airflow is extremely good at!
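
For instance, here is a minimal sketch of that gluing pattern with
Airflow's BashOperator (the DAG id, paths, and shell commands are
hypothetical): Airflow only orders the steps, while the data itself stays
inside HDFS, Spark, and Hive.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG(
        dag_id="hadoop_glue",  # hypothetical
        start_date=datetime(2017, 1, 1),
        schedule_interval="@daily",
    )

    # Each step shells out to a Hadoop tool; Airflow never sees the data.
    ingest = BashOperator(
        task_id="ingest",
        bash_command="hdfs dfs -put /staging/events.csv /raw/events/",
        dag=dag,
    )
    transform = BashOperator(
        task_id="transform",
        bash_command="spark-submit /jobs/clean_events.py",
        dag=dag,
    )
    publish = BashOperator(
        task_id="publish",
        bash_command="hive -f /jobs/publish_events.hql",
        dag=dag,
    )

    # Only the task ordering is declared - a pure control-flow pipeline.
    ingest.set_downstream(transform)
    transform.set_downstream(publish)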

On Mon, Jan 23, 2017 at 11:05 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:

> Hi All,
>
> I came across a write-up of some of the downsides of current workflow
> management systems like Airflow and Luigi
> (http://bionics.it/posts/workflows-dataflow-not-task-deps), where they
> argue dependencies should be made between the inputs and outputs of tasks
> (inlets/outlets) rather than between the tasks themselves.
>
> They extended Luigi (https://github.com/pharmbio/sciluigi) to do this and
> even published a scientific paper on it:
> http://jcheminf.springeropen.com/articles/10.1186/s13321-016-0179-6.
>
> I kind of like the idea, has anyone played with it, any thoughts? I might
> want to try it in Airflow.
>
> Bolke