Dear Apache Spark community, I hope this email finds you well. My name is Ruben, and I am an enthusiastic user of Apache Spark, specifically through the Databricks platform. I am reaching out to you today to seek your assistance and guidance regarding a specific use case.
I have been exploring the capabilities of Spark SQL and Databricks, and I have encountered a challenge related to accessing the data objects used by queries in the query history. I am aware that Databricks provides a comprehensive query history containing valuable information about executed queries. However, my objective is to extract the underlying data objects (tables) referenced by each query. By doing so, I aim to analyze and understand the dependencies between queries and the data they operate on. This will give us new insights into how data is used across our data platform.

I have attempted to leverage the Spark SQL ANTLR grammar, available at https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4, to parse the queries retrieved from the query history. Unfortunately, I have run into difficulties when parsing more complex queries. For example, I have struggled with queries containing constructs such as the following:

> SELECT
>   concat(pickup_zip, '-', dropoff_zip) as route,
>   AVG(fare_amount) as average_fare
> FROM
>   `samples`.`nyctaxi`.`trips`
> GROUP BY
>   1
> ORDER BY
>   2 DESC
> LIMIT 1000

I would greatly appreciate any guidance on how to overcome these challenges. Specifically, I would like to know whether there are alternative approaches or existing tools that can help me extract the data objects used by queries from the Databricks query history. If there are any resources, documentation, or examples that shed further light on this topic, I would be very grateful to receive them.

Any insights you can provide would be of immense help in advancing my understanding and enabling me to make the most of the Spark SQL and Databricks ecosystem. Thank you very much for your time and support. I look forward to hearing from you.
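For context, here is a rough illustration of the kind of extraction I have in mind. This is only a simplified sketch using a regular expression over FROM/JOIN clauses; it is not a real parser and will miss constructs like subqueries aliased as relations or CTE names shadowing tables. A robust solution would presumably go through a full SQL parser instead, for example Spark's own Catalyst parser (walking the resulting logical plan for unresolved relations) or a dedicated parsing library:

```python
import re

# Simplified sketch: pull table references that follow FROM/JOIN keywords.
# Handles optionally backtick-quoted, dot-separated multi-part names such
# as `samples`.`nyctaxi`.`trips`. Illustrative only, not a real SQL parser.
TABLE_REF = re.compile(
    r"\b(?:FROM|JOIN)\s+((?:`[^`]+`|\w+)(?:\.(?:`[^`]+`|\w+))*)",
    re.IGNORECASE,
)

def extract_tables(sql: str) -> list[str]:
    """Return fully qualified table names referenced in FROM/JOIN clauses."""
    return [m.group(1).replace("`", "") for m in TABLE_REF.finditer(sql)]

query = """
SELECT concat(pickup_zip, '-', dropoff_zip) as route,
       AVG(fare_amount) as average_fare
FROM `samples`.`nyctaxi`.`trips`
GROUP BY 1 ORDER BY 2 DESC LIMIT 1000
"""
print(extract_tables(query))  # ['samples.nyctaxi.trips']
```

This captures the example query above, but it is exactly the approach I would like to move away from, which is why I am asking about parser-based alternatives.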
Best regards, Ruben Mennes