Hi there, I’m one of the maintainers of Apache Airflow (incubating). Airflow is a task-based orchestration (workflow) engine. A long-standing wish of our community is to have lineage information available. I have been tinkering with integrating Apache Atlas for a couple of days now, but some questions remain. I hope you can help.
We have a concept called “Operators”, which translates to a Process most of the time. For example, we have an SFTPOperator, which copies a file to a certain destination. We also have a SparkSubmitOperator, and here it becomes a bit grey to me. Basically, all this operator does is kick off a Spark job, so from a data perspective you can consider the operator to have exactly the same inputs and outputs as the Spark job, and it performs the same processing. There are a couple of challenges:

1. If Spark itself emits lineage information (I know that is being worked on), our SparkSubmitOperator basically emits redundant information. However, the operator cannot know whether Spark emits lineage, and often it does not (older versions). What is the best course of action? Does it make sense for the operator to emit lineage information itself (see the sketch at the end of this mail for what I have in mind)? How will this end up in the Atlas UI?

2. The conventions for what a Process accepts as inputs and supplies as outputs seem to be very loose and tied to a single type of system (Spark, Flink, whatever), and they are also subject to change. So what do I supply as inputs to Spark, and how do I pick up from Spark what it considers its outputs? The qualifiedName for Spark seems to be “ApplicationID.ExecutionId” (at the moment). How do I get that information *outside* of Spark?

3. What do you consider the best way to integrate (related to #1 and #2)? Do we live in our own ecosystem and define our own Processes, Entities, etc., or do we integrate somewhere at the model level?

I’m the first to admit that I could be misunderstanding many things here, but I hope you are willing to educate me :).

Regards,
Bolke
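
P.S. To make question 1 a bit more concrete, below is a rough sketch of what I have in mind for the operator registering lineage itself, by posting a generic Process entity to what I understand to be the Atlas v2 REST entity endpoint. The endpoint URL, the hdfs_path inputs/outputs, and the qualifiedName convention are just assumptions on my part; whether this is the right approach at all is exactly what I am asking.

import requests

ATLAS_URL = "http://atlas-host:21000/api/atlas/v2/entity"  # assumed endpoint


def emit_process_lineage(name, qualified_name, input_paths, output_paths,
                         auth=("admin", "admin")):
    """Register a generic Process entity whose inputs/outputs are hdfs_path
    entities that are assumed to already exist in Atlas."""

    def ref(path):
        # Reference a DataSet entity by its unique attribute (qualifiedName).
        return {"typeName": "hdfs_path",
                "uniqueAttributes": {"qualifiedName": path}}

    payload = {
        "entity": {
            "typeName": "Process",
            "attributes": {
                "qualifiedName": qualified_name,
                "name": name,
                "inputs": [ref(p) for p in input_paths],
                "outputs": [ref(p) for p in output_paths],
            },
        }
    }
    resp = requests.post(ATLAS_URL, json=payload, auth=auth)
    resp.raise_for_status()
    return resp.json()


# Hypothetically called from SparkSubmitOperator.execute() after the job
# finishes; the qualifiedName convention here is made up.
# emit_process_lineage(
#     name="spark_wordcount",
#     qualified_name="airflow.<dag_id>.<task_id>.<execution_date>",
#     input_paths=["hdfs://cluster/data/in@cluster"],
#     output_paths=["hdfs://cluster/data/out@cluster"],
# )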
