As far as I know, Atlas entries can be created with a REST call. Can we not create an abstracted Flink operator that makes the REST call on job execution/submission?

Regards,
Taher Koitawala
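For reference, Atlas exposes entity creation through its v2 REST API. A minimal sketch of what such a call could look like, assuming Atlas's default HTTP port 21000 and a hypothetical flink_application type (which would first need a registered type definition in Atlas); this is an illustration, not the proposed operator itself:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    // Sketch: create an Atlas entity over REST using only the JDK HTTP client.
    // The "flink_application" type and its attributes are hypothetical.
    public class AtlasRestCallSketch {
        public static void main(String[] args) throws Exception {
            String entityJson =
                "{\"entity\": {"
                + "\"typeName\": \"flink_application\","            // hypothetical entity type
                + "\"attributes\": {"
                + "\"qualifiedName\": \"my-flink-job@my-cluster\"," // Atlas's unique key
                + "\"name\": \"my-flink-job\""
                + "}}}";

            // 21000 is Atlas's default HTTP port; authentication omitted for brevity.
            URL url = new URL("http://atlas-host:21000/api/atlas/v2/entity");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);
            try (OutputStream os = conn.getOutputStream()) {
                os.write(entityJson.getBytes(StandardCharsets.UTF_8));
            }
            System.out.println("Atlas responded with HTTP " + conn.getResponseCode());
        }
    }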
On Wed, Feb 5, 2020, 10:16 PM Flavio Pompermaier <pomperma...@okkam.it> wrote:

Hi Gyula,
thanks for taking care of integrating Flink with Atlas (and the Egeria initiative, in the end), which is IMHO the most important part of the whole Hadoop ecosystem and which, unfortunately, has been quite overlooked. I can confirm that the integration with Atlas/Egeria is absolutely of big interest.

On Wed, Feb 5, 2020, 17:12 Till Rohrmann <trohrm...@apache.org> wrote:

Hi Gyula,

thanks for starting this discussion. Before diving into the details of how to implement this feature, I wanted to ask whether it is strictly required that the Atlas integration lives within Flink. Could it also work if you had a tool which receives job submissions, extracts the required information, forwards the job submission to Flink, monitors the execution result and finally publishes some information to Atlas (modulo some other steps which are missing in my description)? Having a different layer be responsible for this would keep the complexity out of Flink.

Cheers,
Till

On Wed, Feb 5, 2020 at 12:48 PM Gyula Fóra <gyf...@apache.org> wrote:

Hi all!

We have started some preliminary work on the Flink - Atlas integration at Cloudera. It seems that the integration will require some new hook interfaces at the job graph generation and submission phases, so I figured I would open a discussion thread with my initial ideas to get some early feedback.

*Minimal background*
Very simply put, Apache Atlas is a data governance framework that stores metadata about our data and processing logic to track ownership, lineage etc. It is already integrated with systems like HDFS, Kafka, Hive and many others.

Adding Flink integration would mean that we can track the input and output data of our Flink jobs, their owners, and how different Flink jobs are connected to each other through the data they produce (lineage). This seems to be a very big deal for a lot of companies :)

*Flink - Atlas integration in a nutshell*
In order to integrate with Atlas we basically need 2 things:
- Flink entity definitions
- a Flink Atlas hook

The entity definition is the easy part. It is a JSON document that describes the objects (entities) that we want to store for any given Flink job. As a starter we could have a single FlinkApplication entity that has a set of inputs and outputs. These inputs/outputs are other Atlas entities that are already defined, such as a Kafka topic or an HBase table.
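As a very rough sketch, such a definition might look like the following. The flink_application name and attributes are made up for illustration, and the inputs/outputs attributes would be inherited from Atlas's built-in Process supertype:

    {
      "entityDefs": [
        {
          "name": "flink_application",
          "superTypes": ["Process"],
          "attributeDefs": [
            { "name": "jobId",     "typeName": "string", "isOptional": false },
            { "name": "startTime", "typeName": "date",   "isOptional": true }
          ]
        }
      ]
    }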
The Flink Atlas hook will be the logic that creates the entity instance and uploads it to Atlas when we start a new Flink job. This is the part where we implement the core logic.

*Job submission hook*
In order to implement the Atlas hook we need a place where we can inspect the pipeline, then create and send the metadata when the job starts. When we create the FlinkApplication entity we need to be able to easily determine the sources and sinks (and their properties) of the pipeline.

Unfortunately there is no job submission hook in Flink that could execute this logic, and even if there were one, there would be a mismatch between the abstraction levels needed to implement the integration. We could imagine a JobSubmission hook executed in the JobManager runner like this:

void onSuccessfulSubmission(JobGraph jobGraph, Configuration configuration);

This is nice, but the JobGraph makes it super difficult to extract the sources and UDFs needed to create the metadata entity. The Atlas entity, however, could easily be created from the StreamGraph object (used to represent the logical flow) before the JobGraph is generated. To get around this limitation we could add a JobGraphGeneratorHook interface:

void preProcess(StreamGraph streamGraph);
void postProcess(JobGraph jobGraph);

We could then generate the Atlas entity in the preProcess step and add a job submission hook in the postProcess step that simply sends the already baked entity.

*This kinda works but...*
The approach outlined above seems to work, and we have built a POC using it. Unfortunately it is far from nice, as it exposes non-public APIs such as the StreamGraph. Also, it feels a bit weird to have 2 hooks instead of one.

It would be much nicer if we could somehow go back from the JobGraph to the StreamGraph, or at least have an easy way to access the source/sink UDFs.

What do you think?

Cheers,
Gyula
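To make the JobGraphGeneratorHook idea above concrete, a minimal, non-authoritative sketch of the proposed interface together with an example implementation. StreamGraph and StreamNode are non-public Flink APIs, and the accessors used here (getSourceIDs, getSinkIDs, getStreamNode, getOperatorName) are assumptions based on current Flink internals, not a stable contract:

    import org.apache.flink.runtime.jobgraph.JobGraph;
    import org.apache.flink.streaming.api.graph.StreamGraph;
    import org.apache.flink.streaming.api.graph.StreamNode;

    // The two-phase hook proposed in the mail above.
    public interface JobGraphGeneratorHook {
        void preProcess(StreamGraph streamGraph);
        void postProcess(JobGraph jobGraph);
    }

    // Example: collect source/sink names for the Atlas entity.
    class AtlasEntityHook implements JobGraphGeneratorHook {

        @Override
        public void preProcess(StreamGraph streamGraph) {
            // The StreamGraph still knows which nodes are sources and sinks,
            // so the lineage inputs/outputs are easy to extract here.
            for (Integer id : streamGraph.getSourceIDs()) {
                StreamNode source = streamGraph.getStreamNode(id);
                System.out.println("input:  " + source.getOperatorName());
            }
            for (Integer id : streamGraph.getSinkIDs()) {
                StreamNode sink = streamGraph.getStreamNode(id);
                System.out.println("output: " + sink.getOperatorName());
            }
            // ...build the FlinkApplication entity from this information...
        }

        @Override
        public void postProcess(JobGraph jobGraph) {
            // The entity was already baked in preProcess; here we would only
            // attach the job id and upload the entity to Atlas.
            System.out.println("uploading entity for job " + jobGraph.getJobID());
        }
    }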