Abhinav Chandel created ATLAS-5274:
--------------------------------------

             Summary: [Impala Hook] Self-referencing INSERT OVERWRITE produces 
impala_process with empty outputs[], breaking lineage
                 Key: ATLAS-5274
                 URL: https://issues.apache.org/jira/browse/ATLAS-5274
             Project: Atlas
          Issue Type: Bug
    Affects Versions: 2.5.0
            Reporter: Abhinav Chandel
            Assignee: Abhinav Chandel


Problem

When a user runs a self-referencing DML statement in Impala (i.e., the source
and destination tables are the same), the Atlas Impala hook creates an
impala_process entity whose outputs[] is empty. The target table is recorded
only in inputs[], not in outputs[]. This breaks the lineage graph for that
operation: the table has inputToProcesses=1 but outputFromProcesses=0, so the
write to the table is never recorded and its data lifecycle cannot be tracked.
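
The relationship between a process's inputs[]/outputs[] and the two counters
can be illustrated with a small Python sketch. This is only a toy model of how
the counts follow from the process entity, not the hook's or Atlas's actual
implementation; the function name relationship_counts and the dict layout are
hypothetical.

```python
from collections import Counter


def relationship_counts(processes):
    """Count, per table GUID, how many processes read it and how many write it.

    Hypothetical helper: each process is a dict with "inputs" and "outputs"
    lists of table GUIDs (an assumed, simplified layout).
    """
    input_to = Counter()     # models inputToProcesses
    output_from = Counter()  # models outputFromProcesses
    for proc in processes:
        for guid in proc["inputs"]:
            input_to[guid] += 1
        for guid in proc["outputs"]:
            output_from[guid] += 1
    return input_to, output_from


TABLE = "c16fc913-3cbc-4d86-9c2a-7610b49e212b"  # GUID observed in this report

# Process as currently emitted for the self-referencing INSERT OVERWRITE.
actual = [{"inputs": [TABLE], "outputs": []}]
# Process as it is expected to look.
expected = [{"inputs": [TABLE], "outputs": [TABLE]}]

actual_in, actual_out = relationship_counts(actual)
expected_in, expected_out = relationship_counts(expected)

print(actual_in[TABLE], actual_out[TABLE])      # 1 0
print(expected_in[TABLE], expected_out[TABLE])  # 1 1
```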

Steps to Reproduce

Run the following on an Impala cluster with the Atlas hook enabled:

{code:sql}
CREATE DATABASE IF NOT EXISTS atlas_test_self_only;

CREATE TABLE IF NOT EXISTS atlas_test_self_only.target_self_ref (
  id INT,
  amount INT
);

-- Seed the table with two rows.
INSERT INTO atlas_test_self_only.target_self_ref VALUES (1, 100), (2, 200);

-- Self-referencing statement: the same table is both source and target.
INSERT OVERWRITE TABLE atlas_test_self_only.target_self_ref
SELECT id, CAST(amount + 50 AS INT)
FROM atlas_test_self_only.target_self_ref
WHERE amount > 0;
{code}

Expected Behavior

An impala_process entity is created in Atlas with:
 * inputs: [target_self_ref] ← source table
 * outputs: [target_self_ref] ← same table, as destination

Actual Behavior

An impala_process entity IS created in Atlas, but:
 * inputs: [target_self_ref] ← correct
 * outputs: [] ← EMPTY, target table missing

Atlas entity observed
 * typeName: impala_process
 * inputs: ['c16fc913-3cbc-4d86-9c2a-7610b49e212b']
 * outputs: []
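
A check for this symptom could be sketched in Python as follows. The function
name has_empty_outputs and the flat dict layout are assumptions that mirror the
three fields listed above; an actual Atlas REST response is likely nested
differently, so the lookups would need adapting before use against a live
server.

```python
def has_empty_outputs(entity):
    """Return True when an impala_process lists inputs but no outputs.

    Hypothetical helper: `entity` is a flat dict with "typeName", "inputs"
    and "outputs" keys (an assumed layout, not the Atlas API schema).
    """
    if entity.get("typeName") != "impala_process":
        return False
    inputs = entity.get("inputs") or []
    outputs = entity.get("outputs") or []
    return bool(inputs) and not outputs


# Entity as observed in this report.
observed = {
    "typeName": "impala_process",
    "inputs": ["c16fc913-3cbc-4d86-9c2a-7610b49e212b"],
    "outputs": [],
}

# Entity as expected once the target table is also recorded as an output.
fixed = dict(observed, outputs=list(observed["inputs"]))

print(has_empty_outputs(observed))  # True
print(has_empty_outputs(fixed))     # False
```

Such a check only flags the symptom (inputs present, outputs missing); it does
not tell a self-referencing statement apart from any other cause of an empty
outputs[].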

Impact
 * Lineage graph is broken for all self-referencing ETL patterns in Impala.
 * target_self_ref.outputFromProcesses = 0 (should be 1).
 * Users cannot track data transformation history for in-place update patterns
   such as incremental aggregation, SCD updates, and self-join enrichment.


