[
https://issues.apache.org/jira/browse/ATLAS-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
VINAYAK MARRAIYA updated ATLAS-5238:
------------------------------------
Description:
Lineage events generated by Impala currently do not include explicit
information about the *operation type* of the executed query.
Example lineage event:
{code:java}
{
  "queryText": "create table test_db_01.test_tbl_01 (id int)",
  "queryId": "b44da06a10682ce9:286bd74300000000",
  "hash": "7debad31b299d7cccdf78a67968eb39d",
  "user": "[email protected]",
  "timestamp": 1771622004,
  "endTime": 1771622005,
  "edges": [],
  "vertices": []
}
{code}
When ingesting Impala lineage events, *Apache Atlas* requires the *operation
type* (e.g., {{CREATE}}, {{INSERT}}, {{SELECT}}, {{ALTER}}) to
correctly interpret the query and construct the appropriate lineage
relationships.
Since this information is not present in the lineage event, the Atlas Impala
integration currently attempts to *derive the operation type from the query
text*. This is implemented in the Atlas hook ({{ImpalaLineageHook}})
using regular expression parsing logic in {{ImpalaOperationParser}}.
However, this regex-based approach is not fully reliable and can fail in
certain cases. For example, *single-line comments or other formatting
variations* in a SQL statement may prevent the parser from correctly
identifying the operation type.
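The failure mode can be illustrated with a minimal sketch. The pattern and class below are hypothetical and are not taken from {{ImpalaOperationParser}}; they only assume a parser that anchors its match at the start of the query text, which is enough to show how a leading single-line comment defeats it.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for a regex-based operation parser.
// This is NOT the pattern used by ImpalaOperationParser.
class NaiveOperationParser {
    // Anchored at the start of the text; DOTALL lets ".*" span newlines.
    private static final Pattern CREATE_TABLE = Pattern.compile(
            "^\\s*create\\s+table\\b.*",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static boolean isCreateTable(String queryText) {
        return CREATE_TABLE.matcher(queryText).matches();
    }

    public static void main(String[] args) {
        String plain = "create table test_db_01.test_tbl_01 (id int)";
        // Same statement, preceded by a single-line SQL comment.
        String commented = "-- nightly job\n" + plain;

        System.out.println(isCreateTable(plain));     // prints "true"
        System.out.println(isCreateTable(commented)); // prints "false"
    }
}
```

Stripping comments before matching would handle this particular input, but each new formatting variation needs its own special case, which is the maintenance burden this issue aims to avoid.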
One possible improvement is to ensure that the {{queryText}} included in the
lineage event is always a valid SQL statement (see IMPALA-14741). However, this
still requires Atlas to infer the operation type from the query text.
h3. Proposed Improvement
To improve reliability for downstream lineage consumers such as *Apache
Atlas*, Impala could include an *explicit operation type field* in the
lineage event payload. Providing this information directly would remove the
need for regex-based parsing in Atlas and ensure more accurate lineage
processing.
Once this information is available in the lineage event, the Atlas Impala hook
can be updated to *consume the provided operation type instead of deriving it
from the SQL text*.
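A rough sketch of how the hook side might look, under stated assumptions: the field name {{operationType}}, the {{OperationTypeResolver}} class, and the enum values are all hypothetical (the actual field name would be decided on the Impala side), and the event is modelled as a plain map rather than the hook's real event class. The fallback to the query text is included only so that events from older Impala versions would still be handled.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; none of these names come from the Atlas code base.
class OperationTypeResolver {
    // Illustrative subset of operation types.
    enum OperationType { CREATE, INSERT, SELECT, ALTER, UNKNOWN }

    // Assumed name of the new field in the lineage event payload.
    static final String OPERATION_TYPE_FIELD = "operationType";

    // Prefer the explicit field; fall back to the query text for older events.
    static OperationType resolve(Map<String, Object> event) {
        Object explicit = event.get(OPERATION_TYPE_FIELD);
        if (explicit instanceof String) {
            return parse((String) explicit);
        }
        Object queryText = event.get("queryText");
        if (queryText instanceof String) {
            return deriveFromQueryText((String) queryText);
        }
        return OperationType.UNKNOWN;
    }

    // Maps a name to the enum, returning UNKNOWN for unrecognized values.
    private static OperationType parse(String name) {
        try {
            return OperationType.valueOf(name.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return OperationType.UNKNOWN;
        }
    }

    // Simplistic legacy path: uses the first whitespace-separated token.
    private static OperationType deriveFromQueryText(String queryText) {
        String firstToken = queryText.trim().split("\\s+", 2)[0];
        return parse(firstToken);
    }
}
```

With an explicit field present, a commented query such as {{"-- note\ncreate table t (id int)"}} resolves from the field alone; without it, the legacy path in this sketch returns {{UNKNOWN}} for the same text.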
> Add operation type to the lineage graph
> ---------------------------------------
>
> Key: ATLAS-5238
> URL: https://issues.apache.org/jira/browse/ATLAS-5238
> Project: Atlas
> Issue Type: Task
> Components: atlas-core
> Affects Versions: 3.0.0
> Reporter: VINAYAK MARRAIYA
> Assignee: VINAYAK MARRAIYA
> Priority: Major