Fang-Yu Rao created IMPALA-14741:
------------------------------------

             Summary: Remove the SQL comments from redacted_stmt of TClientRequest
                 Key: IMPALA-14741
                 URL: https://issues.apache.org/jira/browse/IMPALA-14741
             Project: IMPALA
          Issue Type: Task
            Reporter: Fang-Yu Rao
            Assignee: Fang-Yu Rao


Currently, Impala replaces every newline in the given SQL statement with a space in 
[https://github.com/apache/impala/blob/master/be/src/service/impala-server.cc#L1338-L1341].
{code:java}
  // Redact the SQL stmt and update the query context
  string stmt = replace_all_copy(query_ctx->client_request.stmt, "\n", " ");
  Redact(&stmt);
  query_ctx->client_request.__set_redacted_stmt((const string) stmt);
{code}
 

Later on, when the front-end prepares the column lineage graph in 
[ColumnLineageGraph#computeLineageGraph()|https://github.com/apache/impala/blob/6302cdadde0a6627761f267bb060f7947a20334a/fe/src/main/java/org/apache/impala/analysis/ColumnLineageGraph.java#L576C15-L576C34],
 it uses {{queryCtx.client_request.redacted_stmt}} to populate {{queryStr_}} of 
{{ColumnLineageGraph}}. That string is in turn either written as a log line to 
the log file under the directory specified by {{--lineage_event_log_dir}}, or 
sent to a {{QueryEventHook}} (e.g., the Atlas hook) for publishing the 
corresponding lineage event.

 

However, some external {{QueryEventHook}} implementations, e.g., the Atlas 
hook, need to derive additional information about the lineage event, e.g., the 
type of the query. If the query text ({{queryStr_}}) sent to such a 
{{QueryEventHook}} contains SQL comments, this information may not be derived 
correctly, because the hook may fail to parse or match the given query text.
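
To make the failure mode concrete, here is a minimal sketch of a hook that guesses the query type from the first keyword of the query text. This is purely an assumption for illustration: {{LeadingKeyword}} is a hypothetical helper, and we are not claiming that the Atlas hook (or any other {{QueryEventHook}}) is implemented this way.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper (not an Impala or Atlas API): returns the first
// whitespace-delimited token of `query_text`, upper-cased.
std::string LeadingKeyword(const std::string& query_text) {
  size_t i = 0;
  const size_t n = query_text.size();
  // Skip leading whitespace.
  while (i < n && std::isspace(static_cast<unsigned char>(query_text[i]))) ++i;
  std::string keyword;
  // Collect characters up to the next whitespace.
  while (i < n && !std::isspace(static_cast<unsigned char>(query_text[i]))) {
    keyword.push_back(static_cast<char>(
        std::toupper(static_cast<unsigned char>(query_text[i]))));
    ++i;
  }
  return keyword;
}

// Usage example: a leading comment hides the real statement keyword.
inline void LeadingKeywordExample() {
  assert(LeadingKeyword("create table lineage_test_db.foo (id int)") == "CREATE");
  assert(LeadingKeyword("-- do not execute create table lineage_test_db.foo (id int)") == "--");
}
```

Under this assumption, a query text that starts with a comment yields {{--}} instead of {{CREATE}}, so the hook would not recognize the statement type.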

 

Therefore, we should produce a {{redacted_stmt}} without any SQL comments.
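
As a rough sketch of what this could look like (not a proposed patch), the comments could be stripped before the newlines are replaced. The order matters: once a newline is gone, a {{--}} comment no longer has a terminator. {{StripSqlComments}} and {{FlattenStatement}} below are hypothetical names, the handling of quoted literals and backslash escapes is a simplifying assumption, and redaction itself is left out.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of Impala: returns `sql` with "--" line
// comments and "/* ... */" block comments removed. Text inside quoted
// literals ('...', "...", `...`) is copied through untouched.
std::string StripSqlComments(const std::string& sql) {
  std::string out;
  out.reserve(sql.size());
  const size_t n = sql.size();
  size_t i = 0;
  while (i < n) {
    const char c = sql[i];
    if (c == '\'' || c == '"' || c == '`') {
      // Copy the whole quoted literal, honoring backslash escapes.
      const char quote = c;
      out.push_back(c);
      ++i;
      while (i < n) {
        out.push_back(sql[i]);
        if (sql[i] == '\\' && i + 1 < n) {
          out.push_back(sql[i + 1]);
          i += 2;
          continue;
        }
        if (sql[i++] == quote) break;
      }
    } else if (c == '-' && i + 1 < n && sql[i + 1] == '-') {
      // Line comment: skip up to, but not including, the newline.
      while (i < n && sql[i] != '\n') ++i;
    } else if (c == '/' && i + 1 < n && sql[i + 1] == '*') {
      // Block comment: skip past "*/", or to the end if it is unterminated.
      const size_t end = sql.find("*/", i + 2);
      i = (end == std::string::npos) ? n : end + 2;
      out.push_back(' ');  // keep the neighboring tokens separated
    } else {
      out.push_back(c);
      ++i;
    }
  }
  return out;
}

// Strips comments first, then replaces newlines with spaces and drops
// leading spaces.
std::string FlattenStatement(const std::string& sql) {
  std::string stmt = StripSqlComments(sql);
  for (char& ch : stmt) {
    if (ch == '\n') ch = ' ';
  }
  const size_t first = stmt.find_first_not_of(' ');
  return first == std::string::npos ? std::string() : stmt.substr(first);
}

// Usage example based on the statement from this issue.
inline void FlattenStatementExample() {
  const std::string in =
      "-- do not execute\ncreate table lineage_test_db.foo (id int)";
  assert(FlattenStatement(in) == "create table lineage_test_db.foo (id int)");
}
```

Note that this sketch would also drop hints written in comment form (e.g., {{/* +SHUFFLE */}}), which should be acceptable for a string used only for logging and lineage events, but is worth deciding explicitly.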

 

One way to observe which column lineage events Impala produces is to start the 
Impala cluster with the following command. This makes Impala write lineage 
event logs under {{/tmp/impala_test_lineage}}.
{code:java}
$IMPALA_HOME/bin/start-impala-cluster.py \
'--impalad_args=--lineage_event_log_dir=/tmp/impala_test_lineage'
{code}
 

We then execute the following in the impala-shell.
{code:java}
create database lineage_test_db;
[localhost:21050] default> -- do not execute <= note that there is a newline at the end of this line.
create table lineage_test_db.foo (id int);
{code}
 

We then see the following lineage event in a lineage log file. The comment 
("{{-- do not execute}}") in the SQL statement above was not removed even 
though the newline was. Because the newline that terminated the comment is 
gone, the comment now runs into the rest of the statement. As a result, 
{{queryText}} cannot be parsed correctly by a SQL parser, and it is difficult 
for external hooks to derive additional information about the query.
{code:java}
{"queryText":"-- do not execute create table lineage_test_db.foo (id int)","queryId":"f243ce37c242c221:96af573f00000000","hash":"6ae25a9e77fb3a101b5a3e3140ff35b2","user":"fangyurao","timestamp":1770853598,"endTime":1770853598,"edges":[],"vertices":[]}
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
