[
https://issues.apache.org/jira/browse/ATLAS-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
VINAYAK MARRAIYA updated ATLAS-5238:
------------------------------------
Description:
Lineage events generated by *Apache Impala* currently do not include explicit
information about the *operation type* of the executed query.
Example lineage event produced by Impala:
{code:java}
{
  "queryText": "create table test_db_01.test_tbl_01 (id int)",
  "queryId": "b44da06a10682ce9:286bd74300000000",
  "hash": "7debad31b299d7cccdf78a67968eb39d",
  "user": "[email protected]",
  "timestamp": 1771622004,
  "endTime": 1771622005,
  "edges": [],
  "vertices": []
}
{code}
h3. What Impala Provides
Impala emits lineage events that include information such as:
* {{queryText}}
* {{queryId}}
* execution timestamps
* lineage graph ({{edges}} and {{vertices}})

However, the event *does not include the operation type* (e.g., {{CREATE}},
{{INSERT}}, {{SELECT}}, {{ALTER}}).
h3. What Atlas Currently Needs to Do
When processing lineage events, *Apache Atlas* requires the *operation type* to
correctly interpret the query and construct lineage relationships.
Since Impala does not provide this information, the Atlas Impala integration
attempts to *derive the operation type from the {{queryText}}*. This is
implemented in the Atlas hook ({{ImpalaLineageHook}}) using regex-based
parsing logic in {{ImpalaOperationParser}}.
This approach is *not fully reliable*, as certain SQL constructs can break
the parsing logic. For example:
* SQL statements containing *single-line comments*
* variations in SQL formatting
* complex query structures
These cases may lead to incorrect or missing operation type detection.
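
To make these failure modes concrete, here is a minimal sketch. It is not the actual {{ImpalaOperationParser}} code: the class {{NaiveOperationDetector}}, its keyword list, and both patterns are hypothetical simplifications. They are chosen only to show how a leading single-line comment can produce a missing result (anchored match) or an incorrect one (unanchored match).

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration only; this is NOT the Atlas ImpalaOperationParser. */
final class NaiveOperationDetector {
    // Assumed keyword list, taken from the examples in this issue.
    private static final String KEYWORDS = "(create|insert|select|alter)";

    // Matches only when the keyword is the first token of the text.
    private static final Pattern ANCHORED =
            Pattern.compile("^\\s*" + KEYWORDS + "\\b", Pattern.CASE_INSENSITIVE);

    // Matches the first keyword found anywhere, including inside comments.
    private static final Pattern ANYWHERE =
            Pattern.compile("\\b" + KEYWORDS + "\\b", Pattern.CASE_INSENSITIVE);

    static Optional<String> detectAnchored(String queryText) {
        return firstKeyword(ANCHORED, queryText);
    }

    static Optional<String> detectAnywhere(String queryText) {
        return firstKeyword(ANYWHERE, queryText);
    }

    private static Optional<String> firstKeyword(Pattern pattern, String queryText) {
        Matcher m = pattern.matcher(queryText);
        return m.find() ? Optional.of(m.group(1).toUpperCase(Locale.ROOT)) : Optional.empty();
    }

    public static void main(String[] args) {
        String plain = "create table test_db_01.test_tbl_01 (id int)";
        String commented = "-- select old rows first\ninsert into t values (1)";

        System.out.println(detectAnchored(plain));      // Optional[CREATE]
        System.out.println(detectAnchored(commented));  // Optional.empty (missing)
        System.out.println(detectAnywhere(commented));  // Optional[SELECT] (incorrect: the statement is an INSERT)
    }
}
```

Stripping comments before matching would help with this particular input, but each such workaround adds another case the parser has to get right.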
h3. Possible Improvements
One option is to ensure that the {{queryText}} included in lineage events is
always a *valid SQL statement* (see IMPALA-14741). However, Atlas would still
need to infer the operation type from the query text.
A more robust approach would be for *Apache Impala* to include an *explicit
operation type field* in the lineage event payload. If this information is
provided directly, *Apache Atlas* can consume it without relying on fragile
regex-based parsing of the SQL text, improving the reliability of lineage
ingestion.
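
A rough sketch of how a consumer might use such a field follows. Everything here is an assumption for illustration: this issue does not name the field, so {{operationType}} is a placeholder, and {{OperationTypeResolver}} is a hypothetical class, not part of {{ImpalaLineageHook}}. The event is modeled as an already-parsed {{Map}} to keep the example free of JSON library dependencies.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical consumer-side sketch; not the actual ImpalaLineageHook code. */
final class OperationTypeResolver {
    // Assumed field name: the issue proposes such a field but does not name it.
    static final String OPERATION_TYPE_FIELD = "operationType";

    // Assumed set of values, taken from the examples in this issue.
    private static final Set<String> KNOWN_TYPES =
            new HashSet<>(Arrays.asList("CREATE", "INSERT", "SELECT", "ALTER"));

    /** Prefers the explicit field; falls back to the first word of queryText. */
    static Optional<String> resolve(Map<String, Object> event) {
        Optional<String> explicit = normalize(event.get(OPERATION_TYPE_FIELD));
        if (explicit.isPresent()) {
            return explicit;
        }
        Object queryText = event.get("queryText");
        if (!(queryText instanceof String)) {
            return Optional.empty();
        }
        // Fragile fallback: a leading comment or unusual formatting defeats it.
        String firstWord = ((String) queryText).trim().split("\\s+", 2)[0];
        return normalize(firstWord);
    }

    // Upper-cases the value and accepts it only if it is a known type.
    private static Optional<String> normalize(Object value) {
        if (!(value instanceof String)) {
            return Optional.empty();
        }
        String upper = ((String) value).trim().toUpperCase(Locale.ROOT);
        return KNOWN_TYPES.contains(upper) ? Optional.of(upper) : Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, Object> event = new HashMap<>();
        event.put("queryText", "-- nightly load\ncreate table test_db_01.test_tbl_01 (id int)");
        System.out.println(resolve(event));   // Optional.empty: the fallback sees "--" first

        event.put(OPERATION_TYPE_FIELD, "CREATE");
        System.out.println(resolve(event));   // Optional[CREATE]: taken from the explicit field
    }
}
```

Keeping the text-based fallback would let a consumer continue to handle events from Impala versions that do not emit the field, at the cost of retaining the fragile path for those events.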
was:
Lineage events generated by Impala currently do not include explicit
information about the *operation type* of the executed query.
Example lineage event:
{code:java}
{
  "queryText": "create table test_db_01.test_tbl_01 (id int)",
  "queryId": "b44da06a10682ce9:286bd74300000000",
  "hash": "7debad31b299d7cccdf78a67968eb39d",
  "user": "[email protected]",
  "timestamp": 1771622004,
  "endTime": 1771622005,
  "edges": [],
  "vertices": []
}
{code}
When ingesting Impala lineage events, *Apache Atlas* requires the *operation
type* (e.g., {{CREATE}}, {{INSERT}}, {{SELECT}}, {{ALTER}}) to
correctly interpret the query and construct the appropriate lineage
relationships.
Since this information is not present in the lineage event, the Atlas Impala
integration currently attempts to *derive the operation type from the query
text*. This is implemented in the Atlas hook ({{ImpalaLineageHook}})
using regular expression parsing logic in {{ImpalaOperationParser}}.
However, this regex-based approach is not fully reliable and can fail in
certain cases. For example, SQL statements that contain *single-line comments
or other formatting variations* may prevent the parser from correctly
identifying the operation type.
One possible improvement is to ensure that the {{queryText}} included in the
lineage event is always a valid SQL statement (see IMPALA-14741). However, this
still requires Atlas to infer the operation type from the query text.
h3. Proposed Improvement
To improve reliability for downstream lineage consumers such as *Apache
Atlas*, Impala could include an *explicit operation type field* in the
lineage event payload. Providing this information directly would remove the
need for regex-based parsing in Atlas and ensure more accurate lineage
processing.
Once this information is available in the lineage event, the Atlas Impala hook
can be updated to *consume the provided operation type instead of deriving it
from the SQL text*.
> Add operation type to the lineage graph
> ---------------------------------------
>
> Key: ATLAS-5238
> URL: https://issues.apache.org/jira/browse/ATLAS-5238
> Project: Atlas
> Issue Type: Task
> Components: atlas-core
> Affects Versions: 3.0.0
> Reporter: VINAYAK MARRAIYA
> Assignee: VINAYAK MARRAIYA
> Priority: Major