Fang-Yu Rao created IMPALA-14741:
------------------------------------
Summary: Remove the SQL comments from redacted_stmt of TClientRequest
Key: IMPALA-14741
URL: https://issues.apache.org/jira/browse/IMPALA-14741
Project: IMPALA
Issue Type: Task
Reporter: Fang-Yu Rao
Assignee: Fang-Yu Rao
Currently, Impala replaces every newline in the given SQL statement with a space in
[https://github.com/apache/impala/blob/master/be/src/service/impala-server.cc#L1338-L1341].
{code:cpp}
// Redact the SQL stmt and update the query context
string stmt = replace_all_copy(query_ctx->client_request.stmt, "\n", " ");
Redact(&stmt);
query_ctx->client_request.__set_redacted_stmt((const string) stmt);
{code}
Later on, when the front-end is preparing the column lineage graph in
[ColumnLineageGraph#computeLineageGraph()|https://github.com/apache/impala/blob/6302cdadde0a6627761f267bb060f7947a20334a/fe/src/main/java/org/apache/impala/analysis/ColumnLineageGraph.java#L576C15-L576C34],
we use {{queryCtx.client_request.redacted_stmt}} to populate {{queryStr_}} of
{{ColumnLineageGraph}}, which in turn is used to produce a log line in the
log file specified by {{--lineage_event_log_dir}}, or is sent to a
{{QueryEventHook}} (e.g., the Atlas hook) for publishing the corresponding
lineage event.
However, some external {{QueryEventHook}} implementations, e.g., the Atlas hook, need to
derive additional information about the lineage event, e.g., the type of the
query. If the query text ({{queryStr_}}) sent to such a {{QueryEventHook}}
contains SQL comments, that information cannot be derived correctly because
the hook cannot correctly parse or match the query text.
Therefore, we should produce a {{redacted_stmt}} without any SQL comments.
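A minimal sketch of one way to strip the comments before the newlines are flattened is shown below. It is an illustration under stated assumptions, not Impala's actual code. The helper names {{StripSqlComments}} and {{StripCommentsAndFlatten}} are hypothetical. The sketch assumes that only "--" line comments and "/* ... */" block comments need handling and that quoted text may contain backslash escapes. It would also drop optimizer hints written as comments, which a real change would have to consider.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Impala): returns 'sql' with "--" line
// comments and "/* ... */" block comments removed. Text inside single quotes,
// double quotes, or backticks is copied verbatim.
std::string StripSqlComments(const std::string& sql) {
  std::string out;
  out.reserve(sql.size());
  const std::size_t n = sql.size();
  std::size_t i = 0;
  while (i < n) {
    const char c = sql[i];
    if (c == '\'' || c == '"' || c == '`') {
      // Copy the quoted region, honoring backslash escapes (an assumption).
      out.push_back(c);
      ++i;
      while (i < n) {
        const char q = sql[i];
        out.push_back(q);
        ++i;
        if (q == '\\' && i < n) {
          out.push_back(sql[i]);
          ++i;
        } else if (q == c) {
          break;
        }
      }
    } else if (c == '-' && i + 1 < n && sql[i + 1] == '-') {
      // Line comment: skip up to, but not including, the terminating newline.
      while (i < n && sql[i] != '\n') ++i;
    } else if (c == '/' && i + 1 < n && sql[i + 1] == '*') {
      // Block comment: skip through the closing "*/", or to the end of input.
      const std::size_t end = sql.find("*/", i + 2);
      i = (end == std::string::npos) ? n : end + 2;
      out.push_back(' ');  // keep the tokens on either side separated
    } else {
      out.push_back(c);
      ++i;
    }
  }
  return out;
}

// Strips comments first, then replaces newlines with spaces and trims the
// leading spaces left behind by a removed leading comment.
std::string StripCommentsAndFlatten(const std::string& sql) {
  std::string stmt = StripSqlComments(sql);
  for (char& ch : stmt) {
    if (ch == '\n') ch = ' ';
  }
  const std::size_t first = stmt.find_first_not_of(' ');
  return first == std::string::npos ? std::string() : stmt.substr(first);
}
```

The order matters: once the newline that ends a "--" comment has been replaced by a space, the end of the comment can no longer be recovered, so the comments have to be removed while the newlines are still present.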
One way to observe which column lineage events Impala produces is to
start the Impala service with the following command line. This makes
Impala write lineage event logs under {{/tmp/impala_test_lineage}}.
{code:bash}
$IMPALA_HOME/bin/start-impala-cluster.py \
'--impalad_args=--lineage_event_log_dir=/tmp/impala_test_lineage'
{code}
We then execute the following in the impala-shell.
{code:java}
create database lineage_test_db;
[localhost:21050] default> -- do not execute <= note that there is a newline at the end of this line.
create table lineage_test_db.foo (id int);
{code}
We can then see the following lineage event in a lineage log file. The
comment ("{{-- do not execute}}") in the SQL statement above was not removed
even though the newline was. Because the newline that ended the comment is
gone, the comment now extends over the rest of the statement. As a result,
{{queryText}} is a string that a SQL parser cannot correctly parse, and it is
difficult for external hooks to derive additional information about the query.
{code:json}
{"queryText":"-- do not execute create table lineage_test_db.foo (id int)","queryId":"f243ce37c242c221:96af573f00000000","hash":"6ae25a9e77fb3a101b5a3e3140ff35b2","user":"fangyurao","timestamp":1770853598,"endTime":1770853598,"edges":[],"vertices":[]}
{code}
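To illustrate why such a {{queryText}} is hard for a hook to work with, the sketch below classifies a query by its first keyword. This is a hypothetical stand-in, not the Atlas hook's actual logic, and the function name {{FirstKeyword}} is an assumption made for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a hook's query-type check: returns the first
// whitespace-delimited token of 'query_text', upper-cased.
std::string FirstKeyword(const std::string& query_text) {
  const std::size_t n = query_text.size();
  std::size_t i = 0;
  // Skip leading whitespace.
  while (i < n && std::isspace(static_cast<unsigned char>(query_text[i]))) ++i;
  std::string token;
  // Collect characters up to the next whitespace character.
  while (i < n && !std::isspace(static_cast<unsigned char>(query_text[i]))) {
    token.push_back(static_cast<char>(
        std::toupper(static_cast<unsigned char>(query_text[i]))));
    ++i;
  }
  return token;
}
```

With the logged {{queryText}} above, such a check would see "--" as the first token instead of "CREATE", so the statement type could not be determined. With the comment removed, the first token would be "CREATE".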