[ https://issues.apache.org/jira/browse/ATLAS-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Umesh Padashetty updated ATLAS-4746:
------------------------------------
    Affects Version/s: 2.3.0
                           (was: 2.1.0)

> hive_process and hive_process_execution (lineage) being generated for simple
> DML UPDATE queries run via hive
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: ATLAS-4746
>                 URL: https://issues.apache.org/jira/browse/ATLAS-4746
>             Project: Atlas
>          Issue Type: Bug
>          Components: atlas-core
>    Affects Versions: 2.3.0
>            Reporter: Umesh Padashetty
>            Priority: Critical
>         Attachments: Screenshot 2023-05-02 at 6.28.18 PM.png, Screenshot
> 2023-05-02 at 6.28.34 PM.png, Screenshot 2023-05-02 at 6.29.05 PM.png,
> Screenshot 2023-05-02 at 6.29.10 PM.png, Screenshot 2023-05-02 at 6.47.19
> PM.png, Screenshot 2023-05-02 at 6.50.16 PM.png
>
>
> Queries run:
> {code:java}
> create table test_hive_lineage_4 (name string, id int) stored as orc;
> insert into test_hive_lineage_4 values ('qwer', '2');
> update test_hive_lineage_4 set name = 'vwxy' where id = 2; {code}
> As you can see, these are simple DML queries, not DDL.
> We should NOT be tracking lineage for any of the DML ({*}SELECT, INSERT,
> DELETE, and UPDATE{*}) queries, NOR should we be tracking the audits.
> Jiras via which audits of DML operations were skipped:
> * https://issues.apache.org/jira/browse/ATLAS-3188
> * https://issues.apache.org/jira/browse/ATLAS-3198
> But both of those issues were related to audits, not lineage. In all those
> cases, lineage was not generated for the DML UPDATE query.
> However, we are now observing that lineage is captured for a simple DML
> UPDATE query.
> Relationships after running:
> {code:java}
> create table test_hive_lineage_4 (name string, id int) stored as orc; {code}
> !Screenshot 2023-05-02 at 6.28.18 PM.png!
> !Screenshot 2023-05-02 at 6.28.34 PM.png!
> As seen, there is no lineage generated. Good so far.
> Then I ran:
> {code:java}
> insert into test_hive_lineage_4 values ('qwer', '2'); {code}
> No lineage was generated. Good so far.
> !Screenshot 2023-05-02 at 6.47.19 PM.png!
> Then I ran:
> {code:java}
> update test_hive_lineage_4 set name = 'vwxy' where id = 2; {code}
> This immediately generated a hive_process and a hive_process_execution.
> Interestingly, a hive_process with the following name was generated. As you
> can see, it has DELETE in the process name, when in reality this was an
> UPDATE DML. Another cause for concern?
> {code:java}
> QUERY:default.test_hive_lineage_4@cm:1683032252000->:DELETE:default.test_hive_lineage_4@cm:1683032252000
> {code}
> !Screenshot 2023-05-02 at 6.29.05 PM.png!
> !Screenshot 2023-05-02 at 6.29.10 PM.png!
> I then ran the same update query 100+ times, and it created 100+ UNIQUE
> (timestamp delimited) hive_process_executions.
> !Screenshot 2023-05-02 at 6.50.16 PM.png!
> This is a disaster, since every UPDATE query now generates a
> process_execution.
> Customers can run 1000s of update queries, which are mostly of no use to
> Atlas, but this issue now leads to the generation of 1000s of
> process_executions.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
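
For anyone triaging similar entities, the hive_process name quoted above can be picked apart with a few lines of Python. This is only a rough sketch: it assumes the name always follows the shape seen in this one example (`<prefix>:<inputs>->:<operation>:<outputs>`), and the field names `query_type`, `inputs`, `write_op` and `outputs` are labels chosen here for illustration, not Atlas terminology.

```python
from typing import NamedTuple


class ProcessName(NamedTuple):
    # Field names are illustrative labels, not Atlas attribute names.
    query_type: str  # token before the first ':' (here "QUERY")
    inputs: str      # everything between that ':' and '->'
    write_op: str    # operation token after '->:' (here "DELETE")
    outputs: str     # remainder after the operation token


def parse_process_name(name: str) -> ProcessName:
    """Split a name shaped like '<prefix>:<inputs>->:<operation>:<outputs>'.

    The shape is inferred from the single example in this report, so other
    process names may not parse cleanly.
    """
    left, arrow, right = name.partition("->")
    if not arrow:
        raise ValueError("expected '->' in process name: %r" % name)
    query_type, _, inputs = left.partition(":")
    write_op, _, outputs = right.lstrip(":").partition(":")
    return ProcessName(query_type, inputs, write_op, outputs)


if __name__ == "__main__":
    name = (
        "QUERY:default.test_hive_lineage_4@cm:1683032252000"
        "->:DELETE:default.test_hive_lineage_4@cm:1683032252000"
    )
    parsed = parse_process_name(name)
    print("operation token:", parsed.write_op)
    print("input equals output:", parsed.inputs == parsed.outputs)
```

For the name in this report, the operation token comes out as DELETE, and the input and output sides are the same table, which matches what the screenshots show for the UPDATE statement.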