Eric Sun created SPARK-50303:
--------------------------------
Summary: Enable QUERY_TAG for SQL Session in Spark SQL
Key: SPARK-50303
URL: https://issues.apache.org/jira/browse/SPARK-50303
Project: Spark
Issue Type: Wish
Components: SQL
Affects Versions: 3.5.3, 4.0.0
Reporter: Eric Sun
As Spark SQL becomes more powerful for both analytics and ELT (with a big T),
we see more tools generating and executing SQL to transform data.
*Session* is a very important mechanism for lineage and usage/cost tracking,
especially for multi-statement ELT cases. *Tagging* a *series* of query
statements with higher-level business *context* (such as project, flow_name,
job_name, batch_id, start_data_dt, end_data_dt, owner, cost_group, ...) can
provide a tremendous observability improvement without much overhead. It is
inefficient to collect and analyze scattered query UUIDs and then try to group
them together to reconstruct the SESSION. It is much easier to let the SQL
client set the tags when the session is established.
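To make the idea concrete, here is a minimal sketch in plain Python of how a
client might pack business context into a single tag and how a log consumer
could then group statements by it. Everything here is an assumption for
illustration only: the helper names (build_query_tag, group_by_session_tag,
cost_by_session_tag), the JSON encoding of the tag, and the query-log record
layout are hypothetical and are not an existing Spark API or log format.

```python
import json
from collections import defaultdict


def build_query_tag(**context):
    """Serialize business context into one compact, deterministic tag string.

    Hypothetical helper: the JSON encoding is an assumption, not a Spark format.
    """
    return json.dumps(context, sort_keys=True, separators=(",", ":"))


def group_by_session_tag(query_log):
    """Map each tag to the query IDs that carried it, in log order."""
    groups = defaultdict(list)
    for record in query_log:
        groups[record["query_tag"]].append(record["query_id"])
    return dict(groups)


def cost_by_session_tag(query_log):
    """Sum a per-query cost metric (here: CPU seconds) for each tag."""
    totals = defaultdict(float)
    for record in query_log:
        totals[record["query_tag"]] += record["cpu_seconds"]
    return dict(totals)


# The client sets the tag once, when the session is established.
tag = build_query_tag(project="sales_mart", flow_name="daily_load", batch_id="20241112")
other_tag = build_query_tag(project="adhoc", owner="analyst")

# Hypothetical query log: every statement in a session carries the session's tag.
query_log = [
    {"query_id": "q-001", "query_tag": tag, "cpu_seconds": 12.5},
    {"query_id": "q-002", "query_tag": tag, "cpu_seconds": 7.5},
    {"query_id": "q-003", "query_tag": other_tag, "cpu_seconds": 3.0},
]

sessions = group_by_session_tag(query_log)
cost = cost_by_session_tag(query_log)
```

With the tag attached to every statement, reconstructing a session or its cost
is a single group-by on the log, with no need to correlate query UUIDs after
the fact.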
* Presto has *Session Properties*
* Trino has *X-Trino-Session*, *X-Trino-Client-Info* and
*X-Trino-Client-Tags* to carry a list of key/value pairs
* Snowflake has *QUERY_TAG* to make observability much easier and more efficient
* Redshift supports query tagging as well
It would be great if Spark SQL could set a paved path/recipe for
workload/cost analysis and observability based on the session QUERY_TAG, so
that the whole community can follow it instead of reinventing the wheel.