[ 
https://issues.apache.org/jira/browse/ATLAS-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Umesh Padashetty updated ATLAS-4746:
------------------------------------
    Affects Version/s: 2.3.0
                           (was: 2.1.0)

> hive_process and hive_process_execution (lineage) being generated for simple 
> DML UPDATE queries run via hive
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: ATLAS-4746
>                 URL: https://issues.apache.org/jira/browse/ATLAS-4746
>             Project: Atlas
>          Issue Type: Bug
>          Components:  atlas-core
>    Affects Versions: 2.3.0
>            Reporter: Umesh Padashetty
>            Priority: Critical
>         Attachments: Screenshot 2023-05-02 at 6.28.18 PM.png, Screenshot 
> 2023-05-02 at 6.28.34 PM.png, Screenshot 2023-05-02 at 6.29.05 PM.png, 
> Screenshot 2023-05-02 at 6.29.10 PM.png, Screenshot 2023-05-02 at 6.47.19 
> PM.png, Screenshot 2023-05-02 at 6.50.16 PM.png
>
>
> Queries ran:
> {code:java}
> create table test_hive_lineage_4 (name string, id int) stored as orc;
> insert into test_hive_lineage_4 values ('qwer', '2');
> update test_hive_lineage_4 set name = 'vwxy' where id = 2; {code}
> As you can see, these are simple DML queries, and not DDL
> We should NOT be tracking lineage for any of the DML ({*}SELECT, INSERT, 
> DELETE, and UPDATE){*} queries NOR should we be tracking the audits. 
> Jiras via which DML operations audits were skipped:
>  * https://issues.apache.org/jira/browse/ATLAS-3188
>  * https://issues.apache.org/jira/browse/ATLAS-3198
> But all the issues were related to audits and not the lineage. In all these 
> cases, lineage was not generated for the DML UPDATE query
> But observing that we are now capturing lineage for simple DML Update query
> Relationship after running
> {code:java}
> create table test_hive_lineage_4 (name string, id int) stored as orc; {code}
> !Screenshot 2023-05-02 at 6.28.18 PM.png!
> !Screenshot 2023-05-02 at 6.28.34 PM.png!
> As seen, there is no lineage generated. Good so far 
> Then I ran
> {code:java}
> insert into test_hive_lineage_4 values ('qwer', '2'); {code}
> No lineage was generated. Good so far 
> !Screenshot 2023-05-02 at 6.47.19 PM.png!
> Then I ran 
> {code:java}
> update test_hive_lineage_4 set name = 'vwxy' where id = 2;  {code}
> This immediately generated a hive_process and a hive_process_execution
> Interestingly, hive_process with the following name was generated. As you can 
> see, it has DELETE in the process name, when in reality this was an UPDATE 
> DML. Another cause of concern?
> {code:java}
> QUERY:default.test_hive_lineage_4@cm:1683032252000->:DELETE:default.test_hive_lineage_4@cm:1683032252000
>  {code}
> !Screenshot 2023-05-02 at 6.29.05 PM.png!
>   !Screenshot 2023-05-02 at 6.29.10 PM.png!
> I then ran the same update query 100+ times, it created 100+ UNIQUE 
> (timestamp delimited) hive_process_executions
> !Screenshot 2023-05-02 at 6.50.16 PM.png!
> This is a disaster since every UPDATE query now generates a 
> process_execution. 
> Customers can run 1000s of update queries, which are mostly of no use for 
> atlas, but this issue is now leading to the generation of 1000s of 
> process_executions



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to