[
https://issues.apache.org/jira/browse/ATLAS-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Abhinav Chandel resolved ATLAS-5274.
------------------------------------
Resolution: Fixed
> [Impala Hook] Self-referencing INSERT OVERWRITE produces impala_process with
> empty outputs[], breaking lineage
> --------------------------------------------------------------------------------------------------------------
>
> Key: ATLAS-5274
> URL: https://issues.apache.org/jira/browse/ATLAS-5274
> Project: Atlas
> Issue Type: Bug
> Affects Versions: 2.5.0
> Reporter: Abhinav Chandel
> Assignee: Abhinav Chandel
> Priority: Major
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Problem
> When a user executes a self-referencing DML query in Impala (i.e., the source
> and destination table are the same), the Atlas Impala hook creates an
> impala_process entity whose outputs[] is empty. The target table is recorded
> only in inputs[], not in outputs[]. This breaks the lineage graph for that
> operation: the table has inputToProcesses=1 but outputFromProcesses=0, so the
> table's data lifecycle cannot be tracked.
> Steps to Reproduce
> Run the following on an Impala cluster with the Atlas hook enabled:
> {code:sql}
> CREATE DATABASE IF NOT EXISTS atlas_test_self_only;
> CREATE TABLE IF NOT EXISTS atlas_test_self_only.target_self_ref (
> id INT,
> amount INT
> );
> INSERT INTO atlas_test_self_only.target_self_ref VALUES (1, 100), (2, 200);
> INSERT OVERWRITE TABLE atlas_test_self_only.target_self_ref
> SELECT id, cast(amount + 50 as int)
> FROM atlas_test_self_only.target_self_ref
> WHERE amount > 0;
> {code}
> Expected Behavior
> An impala_process entity is created in Atlas with:
> * inputs: [target_self_ref] ← source table
> * outputs: [target_self_ref] ← same table, as destination
> Actual Behavior
> An impala_process entity is created in Atlas, but:
> * inputs: [target_self_ref] ← correct
> * outputs: [] ← empty; target table missing
> Atlas entity observed
> * typeName: impala_process
> * inputs: ['c16fc913-3cbc-4d86-9c2a-7610b49e212b']
> * outputs: []
>
> Impact
> * Lineage graph is broken for all self-referencing ETL patterns in Impala.
> * target_self_ref.outputFromProcesses = 0 (should be 1).
> * Users cannot track data transformation history for in-place update patterns
> such as incremental aggregation, SCD updates, and self-join enrichment.
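The symptom described above can be checked mechanically. The following is a minimal sketch, not part of the Atlas code base: it assumes a simplified entity record with only the three fields shown under "Atlas entity observed" (typeName, inputs, outputs), and the helper name lineage_problems is hypothetical. A real Atlas entity carries more structure, so the field access would need adjusting.

```python
from typing import Any, Dict, List


def lineage_problems(entity: Dict[str, Any]) -> List[str]:
    """List lineage problems for a simplified impala_process record.

    Assumption: the record is a flat dict with typeName, inputs and
    outputs, mirroring the fields quoted in the report.
    """
    if entity.get("typeName") != "impala_process":
        return []  # this sketch only inspects impala_process records
    problems: List[str] = []
    if not entity.get("inputs"):
        problems.append("inputs[] is empty")
    if not entity.get("outputs"):
        problems.append("outputs[] is empty")
    return problems


# GUID of target_self_ref as quoted in the report.
TABLE_GUID = "c16fc913-3cbc-4d86-9c2a-7610b49e212b"

# Record as observed with the bug: the target table appears only in inputs.
observed = {"typeName": "impala_process", "inputs": [TABLE_GUID], "outputs": []}

# Record as expected: the same table is both source and destination.
expected = {
    "typeName": "impala_process",
    "inputs": [TABLE_GUID],
    "outputs": [TABLE_GUID],
}

print(lineage_problems(observed))
print(lineage_problems(expected))
```

Running the same check against the process entity created by the reproduction query, before and after applying the fix, should distinguish the two cases, provided the record is first reduced to the simplified shape assumed here.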