Dear Apache Spark community, I hope this email finds you well. My name is Ruben, and I am an enthusiastic user of Apache Spark, specifically through the Databricks platform. I am reaching out to you today to seek your assistance and guidance regarding a specific use case.
I have been exploring the capabilities of Spark SQL and Databricks, and I have encountered a challenge related to accessing the data objects used by queries in the query history. I am aware that Databricks provides a comprehensive query history containing valuable information about executed queries. However, my objective is to extract the underlying data objects (tables) referenced by each query. By doing so, I aim to analyze and understand the dependencies between queries and the data they operate on. This will give us new insights into how data is used across our data platform.

I have attempted to leverage the Spark SQL ANTLR grammar, available at https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4, to parse the queries retrieved from the query history. Unfortunately, I have run into difficulties when parsing more complex queries. For example, I have struggled with queries containing constructs such as the following:

> SELECT
>   concat(pickup_zip, '-', dropoff_zip) as route,
>   AVG(fare_amount) as average_fare
> FROM
>   `samples`.`nyctaxi`.`trips`
> GROUP BY
>   1
> ORDER BY
>   2 DESC
> LIMIT 1000

I would greatly appreciate any guidance on how to overcome these challenges. Specifically, I would like to know whether there are alternative approaches or existing tools that can help me extract the data objects used by queries from the Databricks query history. If there are any resources, documentation, or examples that shed further light on this topic, I would be very grateful to receive them.

Any insights you can provide would be of immense help in advancing my understanding and enabling me to make the most of the Spark SQL and Databricks ecosystem. Thank you very much for your time and support. I look forward to hearing from you.
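For context, here is a rough illustration of the kind of extraction I have in mind. This is only a simplified sketch using a regular expression over FROM/JOIN clauses; it is not a real parser and will miss constructs like subqueries aliased as relations or CTE names shadowing tables. A robust solution would presumably go through a full SQL parser instead, for example Spark's own Catalyst parser (walking the resulting logical plan for unresolved relations) or a dedicated parsing library:

```python
import re

# Simplified sketch: pull table references that follow FROM/JOIN keywords.
# Handles optionally backtick-quoted, dot-separated multi-part names such
# as `samples`.`nyctaxi`.`trips`. Illustrative only, not a real SQL parser.
TABLE_REF = re.compile(
    r"\b(?:FROM|JOIN)\s+((?:`[^`]+`|\w+)(?:\.(?:`[^`]+`|\w+))*)",
    re.IGNORECASE,
)

def extract_tables(sql: str) -> list[str]:
    """Return fully qualified table names referenced in FROM/JOIN clauses."""
    return [m.group(1).replace("`", "") for m in TABLE_REF.finditer(sql)]

query = """
SELECT concat(pickup_zip, '-', dropoff_zip) as route,
       AVG(fare_amount) as average_fare
FROM `samples`.`nyctaxi`.`trips`
GROUP BY 1 ORDER BY 2 DESC LIMIT 1000
"""
print(extract_tables(query))  # ['samples.nyctaxi.trips']
```

This captures the example query above, but it is exactly the approach I would like to move away from, which is why I am asking about parser-based alternatives.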
Best regards, Ruben Mennes