[ 
https://issues.apache.org/jira/browse/IMPALA-14741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fang-Yu Rao updated IMPALA-14741:
---------------------------------
    Description: 
Currently, Impala replaces every newline in the given SQL statement with a space in 
[https://github.com/apache/impala/blob/master/be/src/service/impala-server.cc#L1338-L1341].
{code:cpp}
  // Redact the SQL stmt and update the query context
  string stmt = replace_all_copy(query_ctx->client_request.stmt, "\n", " ");
  Redact(&stmt);
  query_ctx->client_request.__set_redacted_stmt((const string) stmt);
{code}
 

Later on, when the front-end prepares the column lineage graph in 
[ColumnLineageGraph#computeLineageGraph()|https://github.com/apache/impala/blob/6302cdadde0a6627761f267bb060f7947a20334a/fe/src/main/java/org/apache/impala/analysis/ColumnLineageGraph.java#L576C15-L576C34],
 it uses {{queryCtx.client_request.redacted_stmt}} to populate {{queryStr_}} of 
{{ColumnLineageGraph}}. {{queryStr_}} can then be written as a log line to a 
lineage log file under the directory specified by {{--lineage_event_log_dir}}, 
or sent to a {{QueryEventHook}} (e.g., the Atlas hook) that publishes the 
corresponding lineage event.
{code:java}
  private void init(Analyzer analyzer) {
    Preconditions.checkNotNull(analyzer);
    Preconditions.checkState(analyzer.isRootAnalyzer());
    TQueryCtx queryCtx = analyzer.getQueryCtx();
    if (queryCtx.client_request.isSetRedacted_stmt()) {
      queryStr_ = queryCtx.client_request.redacted_stmt;
    } else {
      queryStr_ = queryCtx.client_request.stmt;
    }
    ...
  }
{code}

 

However, some external {{QueryEventHook}} implementations, e.g., the Atlas hook, 
need to derive additional information about the lineage event, e.g., the type of 
the query. If the query text ({{queryStr_}}) sent to such a {{QueryEventHook}} 
contains SQL comments, this information cannot be derived correctly because the 
hook cannot parse or match the given query text.
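To make the failure mode concrete, here is a minimal sketch of a naive query-type check. It is purely illustrative: {{LeadingKeyword}} is a hypothetical function, and nothing here is taken from the Atlas hook or any other real {{QueryEventHook}}. It assumes only that a hook classifies a statement by its first whitespace-delimited token.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a hook's query-type detection (an assumption, not
// real hook code): returns the first whitespace-delimited token of
// 'query_text', upper-cased.
std::string LeadingKeyword(const std::string& query_text) {
  // Skip leading whitespace.
  std::size_t begin = 0;
  while (begin < query_text.size() &&
         std::isspace(static_cast<unsigned char>(query_text[begin]))) {
    ++begin;
  }
  // Advance to the end of the first token.
  std::size_t end = begin;
  while (end < query_text.size() &&
         !std::isspace(static_cast<unsigned char>(query_text[end]))) {
    ++end;
  }
  std::string keyword = query_text.substr(begin, end - begin);
  for (char& ch : keyword) {
    ch = static_cast<char>(std::toupper(static_cast<unsigned char>(ch)));
  }
  return keyword;
}
```

Under that assumption, {{create table lineage_test_db.foo (id int)}} yields {{CREATE}}, whereas the same statement prefixed with {{-- do not execute}} yields {{--}}, so the statement type is lost.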

 

Therefore, we should produce a {{redacted_stmt}} without any SQL comments.
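One possible shape for this is sketched below. It is not the actual fix: {{StripSqlComments}} is a hypothetical helper that does not exist in Impala. It assumes that comments are either {{--}} up to the end of the line or {{/* ... */}}, and that text inside single quotes, double quotes, or backticks must be left untouched, with backslash treated as an escape character inside all three.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Impala): returns a copy of 'sql' with
// "-- ..." line comments and "/* ... */" block comments removed. Text inside
// '...', "..." and `...` is copied verbatim.
std::string StripSqlComments(const std::string& sql) {
  std::string out;
  out.reserve(sql.size());
  const std::size_t n = sql.size();
  std::size_t i = 0;
  while (i < n) {
    const char c = sql[i];
    if (c == '\'' || c == '"' || c == '`') {
      // Copy a quoted region, honoring backslash escapes (assumed here).
      const char quote = c;
      out += sql[i++];
      while (i < n) {
        const char q = sql[i++];
        out += q;
        if (q == '\\' && i < n) {
          out += sql[i++];  // Escaped character never closes the quote.
        } else if (q == quote) {
          break;
        }
      }
    } else if (c == '-' && i + 1 < n && sql[i + 1] == '-') {
      // Line comment: skip up to, but not including, the newline.
      while (i < n && sql[i] != '\n') ++i;
    } else if (c == '/' && i + 1 < n && sql[i + 1] == '*') {
      // Block comment: skip through the closing "*/" (or to end of input).
      const std::size_t end = sql.find("*/", i + 2);
      i = (end == std::string::npos) ? n : end + 2;
      out += ' ';  // Keep the neighbouring tokens separated.
    } else {
      out += c;
      ++i;
    }
  }
  return out;
}
```

Such a step would have to run before the newlines are replaced with spaces, because a {{--}} comment can no longer be delimited once its terminating newline is gone. Note also that this sketch would remove comment-style optimizer hints such as {{/* +SHUFFLE */}}, so a real change would need to decide how to treat them.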

 

One way to observe the column lineage events produced by Impala is to 
start the Impala service with the following command. This makes 
Impala write lineage event logs under {{/tmp/impala_test_lineage}}.
{code:bash}
$IMPALA_HOME/bin/start-impala-cluster.py \
'--impalad_args=--lineage_event_log_dir=/tmp/impala_test_lineage'
{code}
 

We then execute the following in impala-shell. Note that there is a newline at 
the end of the {{-- do not execute}} line, so the comment and the 
{{create table}} statement are submitted together as a single statement.
{code:java}
create database lineage_test_db;
[localhost:21050] default> -- do not execute
create table lineage_test_db.foo (id int);
{code}
 

The following lineage event then appears in a lineage log file. The comment 
({{-- do not execute}}) was not removed, even though the newline that 
terminated it was replaced by a space. As a result, the {{--}} comment now 
extends to the end of {{queryText}}, so a SQL parser sees only a comment and 
external hooks cannot derive additional information about the query.
{code:json}
{"queryText":"-- do not execute create table lineage_test_db.foo (id int)","queryId":"f243ce37c242c221:96af573f00000000","hash":"6ae25a9e77fb3a101b5a3e3140ff35b2","user":"fangyurao","timestamp":1770853598,"endTime":1770853598,"edges":[],"vertices":[]}
{code}


> Remove the SQL comments from redacted_stmt of TClientRequest
> ------------------------------------------------------------
>
>                 Key: IMPALA-14741
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14741
>             Project: IMPALA
>          Issue Type: Task
>            Reporter: Fang-Yu Rao
>            Assignee: Fang-Yu Rao
>            Priority: Major


