Eric Sun created SPARK-50303:
--------------------------------

             Summary: Enable QUERY_TAG for SQL Session in Spark SQL
                 Key: SPARK-50303
                 URL: https://issues.apache.org/jira/browse/SPARK-50303
             Project: Spark
          Issue Type: Wish
          Components: SQL
    Affects Versions: 3.5.3, 4.0.0
            Reporter: Eric Sun


As Spark SQL becomes more powerful for both analytics and ELT (with a big T), 
we see more tools generating and executing SQL to transform data.

*Session* is a very important mechanism for lineage and usage/cost tracking, 
especially for multi-statement ELT cases. *Tagging* a 
{color:#ff0000}series{color} of query statements with higher-level business 
*context* (such as project, flow_name, job_name, batch_id, start_data_dt, 
end_data_dt, owner, cost_group, ...) can greatly improve observability with 
little overhead. Collecting scattered query UUIDs and then trying to group 
them together to reconstruct the SESSION is inefficient, whereas allowing the 
SQL client to set the tags when the session is established is quite easy.
 * Presto has *Session Properties*
 * Trino has {*}X-Trino-Session{*}, *X-Trino-Client-Info* and 
*X-Trino-Client-Tags* to carry a list of K/V pairs
 * Snowflake has *QUERY_TAG* to make observability much easier and more 
efficient
 * Redshift supports query tagging as well

It would be great if Spark SQL could provide a paved path/recipe for 
workload/cost analysis/observability based on the session QUERY_TAG, so that 
the whole community can follow it instead of reinventing the wheel.
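
As a rough illustration of the idea only (not an existing Spark API), the 
sketch below serializes the business context into a single tag string and 
then groups query IDs by one field of that tag. The helper names 
build_query_tag and group_by_tag_field, the sample context values, and the 
query_log records are all hypothetical and exist only for this example.

```python
import json
from collections import defaultdict


def build_query_tag(context):
    """Serialize business context into one compact, deterministic tag string."""
    # Drop unset fields so the tag stays short.
    cleaned = {k: v for k, v in context.items() if v is not None}
    return json.dumps(cleaned, sort_keys=True, separators=(",", ":"))


def group_by_tag_field(query_log, field):
    """Group query IDs by one field of their session tag."""
    groups = defaultdict(list)
    for query_id, tag in query_log:
        groups[json.loads(tag).get(field)].append(query_id)
    return dict(groups)


# Hypothetical context that a SQL client would set once per session.
tag_a = build_query_tag(
    {"project": "sales_mart", "job_name": "daily_load", "batch_id": "b-001", "owner": None}
)
tag_b = build_query_tag(
    {"project": "sales_mart", "job_name": "daily_load", "batch_id": "b-002"}
)

# Hypothetical query log: (query_id, session tag) pairs.
query_log = [("q1", tag_a), ("q2", tag_a), ("q3", tag_b)]

by_batch = group_by_tag_field(query_log, "batch_id")
print(by_batch)
```

Because every statement in a session carries the same tag, grouping becomes 
a lookup on the tag rather than a reconstruction from scattered query UUIDs.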



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

