Madhan Neethiraj created ATLAS-4985:
---------------------------------------

             Summary: Option to ignore duplicate spark_process entities
                 Key: ATLAS-4985
                 URL: https://issues.apache.org/jira/browse/ATLAS-4985
             Project: Atlas
          Issue Type: Improvement
          Components: spark-integration
            Reporter: Madhan Neethiraj


Spark Atlas Connector creates a {{spark_process}} entity for every Spark SQL
execution, to capture the lineage from the executed SQL. As a result, the same
lineage is recorded in Atlas multiple times, once for every execution of the
SQL. For example, consider a query that reads from 2 tables and inserts into
another table. Executing this query 10 times will generate 10 different
{{spark_process}} entities, each recording the same lineage between the 3
tables referenced.

Atlas should support an option (configuration shown below) to ignore duplicate 
lineages reported from Spark Atlas Connector, for example by checking for 
existing {{spark_process}} entities having the same sets of input and output 
data source entities.

- {{atlas.notification.consumer.preprocess.spark_process.ignore-duplicates}}
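
The duplicate check described above could be sketched roughly as follows. This
is a minimal illustration and not Atlas code: the {{SparkProcess}} class, the
{{lineage_key}} and {{filter_duplicates}} helpers, and the treatment of the
property as a boolean flag are all assumptions made for the example. Two
entities are considered duplicates when their sets of inputs and outputs match,
regardless of their qualified names.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Set, Tuple

# Property name taken from this issue; treating it as a boolean is an assumption.
IGNORE_DUPLICATES_PROPERTY = (
    "atlas.notification.consumer.preprocess.spark_process.ignore-duplicates"
)

LineageKey = Tuple[FrozenSet[str], FrozenSet[str]]


@dataclass(frozen=True)
class SparkProcess:
    """Hypothetical, simplified stand-in for a spark_process entity."""
    qualified_name: str
    inputs: FrozenSet[str]    # qualified names of input data sources
    outputs: FrozenSet[str]   # qualified names of output data sources


def lineage_key(process: SparkProcess) -> LineageKey:
    """Identify a lineage by its input and output sets only."""
    return (process.inputs, process.outputs)


def filter_duplicates(
    incoming: Iterable[SparkProcess],
    existing_keys: Set[LineageKey],
    ignore_duplicates: bool,
) -> List[SparkProcess]:
    """Return the processes to store; existing_keys is updated in place."""
    if not ignore_duplicates:
        return list(incoming)
    kept: List[SparkProcess] = []
    for process in incoming:
        key = lineage_key(process)
        if key in existing_keys:
            continue  # same inputs and outputs already recorded: skip
        existing_keys.add(key)
        kept.append(process)
    return kept


# The example from the description: one query, 2 inputs, 1 output, run 10 times.
executions = [
    SparkProcess(
        qualified_name=f"execution-{i}",
        inputs=frozenset({"db.table_a", "db.table_b"}),
        outputs=frozenset({"db.table_c"}),
    )
    for i in range(10)
]
with_option = filter_duplicates(executions, set(), ignore_duplicates=True)
without_option = filter_duplicates(executions, set(), ignore_duplicates=False)
```

With the option enabled, only the first of the 10 executions is kept; with it
disabled, all 10 are stored, which matches the current behavior. A real
implementation would need to look up existing entities in the Atlas store
rather than in an in-memory set.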



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
