[ 
https://issues.apache.org/jira/browse/ATLAS-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhinav Chandel resolved ATLAS-5274.
------------------------------------
    Resolution: Fixed

> [Impala Hook] Self-referencing INSERT OVERWRITE produces impala_process with 
> empty outputs[], breaking lineage
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: ATLAS-5274
>                 URL: https://issues.apache.org/jira/browse/ATLAS-5274
>             Project: Atlas
>          Issue Type: Bug
>    Affects Versions: 2.5.0
>            Reporter: Abhinav Chandel
>            Assignee: Abhinav Chandel
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Problem
> When a user executes a self-referencing DML query in Impala (i.e., the source
> and destination table are the same), the Atlas Impala hook creates an
> impala_process entity where outputs[] is empty. The target table is recorded
> only in inputs[], not outputs[]. This breaks the lineage graph for that
> operation: the table has inputToProcesses=1 but outputFromProcesses=0, so
> Atlas shows the table being read by the process but never written by it.
>
> Steps to Reproduce
> Run the following on an Impala cluster with the Atlas hook enabled:
> {code:sql}
> CREATE DATABASE IF NOT EXISTS atlas_test_self_only;
> CREATE TABLE IF NOT EXISTS atlas_test_self_only.target_self_ref (
> id INT,
> amount INT
> );
> INSERT INTO atlas_test_self_only.target_self_ref VALUES (1, 100), (2, 200);
> INSERT OVERWRITE TABLE atlas_test_self_only.target_self_ref
> SELECT id, cast(amount + 50 as int)
> FROM atlas_test_self_only.target_self_ref
> WHERE amount > 0;
> {code}
>
> Expected Behavior
> An impala_process entity is created in Atlas with:
>  * inputs: [target_self_ref] ← source table
>  * outputs: [target_self_ref] ← same table, as destination
>
> Actual Behavior
> An impala_process entity IS created in Atlas, but:
>  * inputs: [target_self_ref] ← correct
>  * outputs: [] ← EMPTY — target table missing
>
> Atlas entity observed
>  * typeName: impala_process
>  * inputs: ['c16fc913-3cbc-4d86-9c2a-7610b49e212b']
>  * outputs: []
>
> Impact
>  * Lineage graph is broken for all self-referencing ETL patterns in Impala.
>  * target_self_ref.outputFromProcesses = 0 (should be 1).
>  * Users cannot track data transformation history for in-place update patterns
> such as incremental aggregation, SCD updates, and self-join enrichment.
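
The expected and actual behavior described above can be expressed as a small check. This is only a sketch, not part of the Atlas hook or of the fix: the function name has_broken_lineage is hypothetical, and the entity is modeled as a plain dict with the same three fields shown under "Atlas entity observed" (real Atlas entities carry more structure).

```python
# Sketch only: has_broken_lineage is a hypothetical helper, and the dict
# layout is a simplified stand-in for the entity shown in the report.

def has_broken_lineage(entity: dict) -> bool:
    """Return True for an impala_process that has inputs but no outputs."""
    if entity.get("typeName") != "impala_process":
        return False
    inputs = entity.get("inputs") or []
    outputs = entity.get("outputs") or []
    return bool(inputs) and not outputs


# Entity as observed in the report: target table only in inputs.
observed = {
    "typeName": "impala_process",
    "inputs": ["c16fc913-3cbc-4d86-9c2a-7610b49e212b"],
    "outputs": [],
}

# Expected shape for a self-referencing INSERT OVERWRITE: the same table
# appears in both inputs and outputs.
expected = {**observed, "outputs": list(observed["inputs"])}

print(has_broken_lineage(observed))
print(has_broken_lineage(expected))
```

A check of this kind could be run against exported process entities to find lineage records affected by the bug, assuming they are first reduced to this simplified shape.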



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
