Hi there, I’m one of the maintainers of Apache Airflow (incubating). Airflow is a task-based orchestration (workflow) engine. A long-standing wish of our community is to have lineage information available. I have been tinkering with integrating Apache Atlas for a couple of days now, but some questions remain. I hope you can help.
We have a concept called “Operators”, which translates to a Process most of the time. For example, we have an SFTPOperator, which copies a file to a certain destination. We also have a SparkSubmitOperator, and here it becomes a bit grey to me. Basically, all this operator does is kick off a Spark job, so from a data perspective you can consider the operator to have exactly the same inputs and outputs as the Spark job, and it performs the same processing. There are a couple of challenges:

1. If Spark itself emits lineage information (I know that is being worked on), our SparkSubmitOperator basically emits redundant information. However, the operator cannot know whether Spark emits lineage, and often it does not (older versions). What is the best course of action? Does it make sense for the operator to emit lineage information itself (see the sketch at the end of this mail for what I have in mind)? How will this end up in the Atlas UI?

2. The conventions for what a Process accepts as inputs and supplies as outputs seem to be very loose and tied to a single type of system (Spark, Flink, whatever), and they are also subject to change. So what do I supply as inputs to Spark, and how do I pick up from Spark what it considers its outputs? The qualifiedName for Spark seems to be “ApplicationID.ExecutionId” (at the moment). How do I get that information *outside* of Spark?

3. What do you consider the best way to integrate (related to #1 and #2)? Do we live in our own ecosystem and define our own Processes, Entities, etc., or do we integrate somewhere at the model level?

I’m the first to admit that I could be misunderstanding many things here, but I hope you are willing to educate me :).

Regards,
Bolke
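
P.S. To make question 1 a bit more concrete, below is a rough sketch of what I have in mind for the operator registering lineage itself, by posting a generic Process entity to what I understand to be the Atlas v2 REST entity endpoint. The endpoint URL, the hdfs_path inputs/outputs, and the qualifiedName convention are just assumptions on my part; whether this is the right approach at all is exactly what I am asking.

import requests

ATLAS_URL = "http://atlas-host:21000/api/atlas/v2/entity"  # assumed endpoint


def emit_process_lineage(name, qualified_name, input_paths, output_paths,
                         auth=("admin", "admin")):
    """Register a generic Process entity whose inputs/outputs are hdfs_path
    entities that are assumed to already exist in Atlas."""

    def ref(path):
        # Reference a DataSet entity by its unique attribute (qualifiedName).
        return {"typeName": "hdfs_path",
                "uniqueAttributes": {"qualifiedName": path}}

    payload = {
        "entity": {
            "typeName": "Process",
            "attributes": {
                "qualifiedName": qualified_name,
                "name": name,
                "inputs": [ref(p) for p in input_paths],
                "outputs": [ref(p) for p in output_paths],
            },
        }
    }
    resp = requests.post(ATLAS_URL, json=payload, auth=auth)
    resp.raise_for_status()
    return resp.json()


# Hypothetically called from SparkSubmitOperator.execute() after the job
# finishes; the qualifiedName convention here is made up.
# emit_process_lineage(
#     name="spark_wordcount",
#     qualified_name="airflow.<dag_id>.<task_id>.<execution_date>",
#     input_paths=["hdfs://cluster/data/in@cluster"],
#     output_paths=["hdfs://cluster/data/out@cluster"],
# )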
