Madhan Neethiraj created ATLAS-4985:
---------------------------------------
Summary: Option to ignore duplicate spark_process entities
Key: ATLAS-4985
URL: https://issues.apache.org/jira/browse/ATLAS-4985
Project: Atlas
Issue Type: Improvement
Components: spark-integration
Reporter: Madhan Neethiraj
A {{spark_process}} entity is created by Spark Atlas Connector for every Spark
SQL execution, to capture the lineage from the executed SQL. This results in
multiple instances of the same lineage being recorded in Atlas - one for every
execution of the SQL. For example, consider a query that reads from 2 tables
and inserts into another table. Executing this query 10 times will generate
10 different {{spark_process}} entities, each recording the same lineage
between the 3 tables referenced.
Atlas should support an option (configuration shown below) to ignore duplicate
lineages reported from Spark Atlas Connector, for example by checking for
existing {{spark_process}} entities having the same sets of input and output
data source entities.
- {{atlas.notification.consumer.preprocess.spark_process.ignore-duplicates}}
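The duplicate check suggested above could look roughly like the following. This is a minimal sketch in Python, not Atlas code: the class {{SparkProcessDeduplicator}}, the helper {{lineage_key}}, the in-memory lookup and the table names are all hypothetical, and the {{ignore_duplicates}} flag is assumed to mirror a boolean value of the configuration above.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

# Hypothetical key: (set of input names, set of output names).
LineageKey = Tuple[FrozenSet[str], FrozenSet[str]]


def lineage_key(inputs: Iterable[str], outputs: Iterable[str]) -> LineageKey:
    """Build an order-insensitive, hashable key from input and output names."""
    return frozenset(inputs), frozenset(outputs)


class SparkProcessDeduplicator:
    """Illustrative only: remembers which input/output combinations were seen."""

    def __init__(self, ignore_duplicates: bool = True) -> None:
        # Assumed to correspond to the ignore-duplicates configuration.
        self.ignore_duplicates = ignore_duplicates
        self._seen: Dict[LineageKey, str] = {}

    def should_ignore(
        self, process_name: str, inputs: Iterable[str], outputs: Iterable[str]
    ) -> bool:
        """Return True if this process repeats an already recorded lineage."""
        key = lineage_key(inputs, outputs)
        if key in self._seen:
            # Same inputs and outputs as an earlier process: a duplicate.
            return self.ignore_duplicates
        # First occurrence of this lineage: record it and keep the process.
        self._seen[key] = process_name
        return False
```

With the example query (2 input tables, 1 output table) executed 10 times, only the first {{spark_process}} would be kept and the remaining 9 would be ignored. A real implementation would presumably look up existing entities in Atlas rather than keep an in-memory map.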