[ 
https://issues.apache.org/jira/browse/ATLAS-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

VINAYAK MARRAIYA updated ATLAS-5238:
------------------------------------
    Description: 
Lineage events generated by Impala currently do not include explicit 
information about the *operation type* of the executed query.

 
Example lineage event:
{code:json}
{
  "queryText": "create table test_db_01.test_tbl_01 (id int)",
  "queryId": "b44da06a10682ce9:286bd74300000000",
  "hash": "7debad31b299d7cccdf78a67968eb39d",
  "user": "[email protected]",
  "timestamp": 1771622004,
  "endTime": 1771622005,
  "edges": [],
  "vertices": []
}
{code}
When ingesting Impala lineage events, *Apache Atlas* requires the *operation 
type* (e.g., {{CREATE}}, {{INSERT}}, {{SELECT}}, {{ALTER}}) to correctly 
interpret the query and construct the appropriate lineage relationships.

Since this information is not present in the lineage event, the Atlas Impala 
integration currently attempts to *derive the operation type from the query 
text*. This is implemented in the Atlas hook ({{ImpalaLineageHook}}) using 
regular expression parsing logic in {{ImpalaOperationParser}}.

However, this regex-based approach is not reliable. For example, SQL 
statements that contain *single-line comments or other formatting variations* 
can prevent the parser from identifying the operation type.
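
As a minimal sketch of this failure mode, assume a pattern anchored to the start of the statement. The actual expressions in {{ImpalaOperationParser}} may differ. The class name, pattern, and return values below are hypothetical and only illustrate how a leading single-line comment can defeat such a match:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; this is NOT the actual ImpalaOperationParser code.
class NaiveOperationParserDemo {
    // Assumed pattern: matches only when the statement starts with CREATE TABLE.
    private static final Pattern CREATE_TABLE = Pattern.compile(
            "^\\s*create\\s+(external\\s+)?table\\b.*",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns "CREATE" when the whole query text matches, otherwise "UNKNOWN".
    static String detect(String queryText) {
        return CREATE_TABLE.matcher(queryText).matches() ? "CREATE" : "UNKNOWN";
    }

    public static void main(String[] args) {
        String plain = "create table test_db_01.test_tbl_01 (id int)";
        String commented = "-- nightly setup\n" + plain;

        System.out.println(detect(plain));     // prints CREATE
        System.out.println(detect(commented)); // prints UNKNOWN
    }
}
```

Stripping comments before matching would handle this particular case, but every additional formatting variation needs another special case in the consumer.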

One possible improvement is to ensure that the {{queryText}} included in the 
lineage event is always a valid SQL statement (see IMPALA-14741). However, this 
still requires Atlas to infer the operation type from the query text.

h3. Proposed Improvement

To improve reliability for downstream lineage consumers such as *Apache 
Atlas*, Impala could include an *explicit operation type field* in the 
lineage event payload. Providing this information directly would remove the 
need for regex-based parsing in Atlas and ensure more accurate lineage 
processing.

Once this information is available in the lineage event, the Atlas Impala hook 
can be updated to *consume the provided operation type instead of deriving it 
from the SQL text*.
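
One way the hook could consume such a field is sketched below. This is an illustration under stated assumptions, not the Atlas implementation. The field name {{operationType}}, the class {{OperationTypeResolver}}, and the first-keyword fallback are all hypothetical, since the ticket does not specify a field name. The event is modelled as an already-parsed map so the example does not depend on a JSON library.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; names and behaviour are assumptions for illustration.
class OperationTypeResolver {
    // Assumed field name; the ticket does not define one.
    static final String OPERATION_TYPE_FIELD = "operationType";

    // Prefer the explicit field; fall back to text inference for older events.
    static String resolve(Map<String, ?> event) {
        Object explicit = event.get(OPERATION_TYPE_FIELD);
        if (explicit instanceof String && !((String) explicit).trim().isEmpty()) {
            return ((String) explicit).trim().toUpperCase(Locale.ROOT);
        }
        Object queryText = event.get("queryText");
        return inferFromQueryText(queryText instanceof String ? (String) queryText : "");
    }

    // Placeholder fallback: uses the first whitespace-delimited token.
    // It stands in for the existing regex logic and shares its weaknesses.
    static String inferFromQueryText(String queryText) {
        String trimmed = queryText.trim();
        if (trimmed.isEmpty()) {
            return "UNKNOWN";
        }
        return trimmed.split("\\s+")[0].toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Map<String, Object> event = Map.of(
                OPERATION_TYPE_FIELD, "create",
                "queryText", "-- nightly setup\ncreate table test_db_01.test_tbl_01 (id int)");
        System.out.println(resolve(event)); // prints CREATE
    }
}
```

Keeping a fallback path would let the hook continue to process events from Impala versions that do not yet emit the field.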

> Add operation type to the lineage graph
> ---------------------------------------
>
>                 Key: ATLAS-5238
>                 URL: https://issues.apache.org/jira/browse/ATLAS-5238
>             Project: Atlas
>          Issue Type: Task
>          Components:  atlas-core
>    Affects Versions: 3.0.0
>            Reporter: VINAYAK MARRAIYA
>            Assignee: VINAYAK MARRAIYA
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
