Re: [Spark SQL] Data objects from query history

2023-07-03 Thread Jack Wells
Hi Ruben,

I’m not sure if this answers your question, but if you’re interested in
exploring the underlying tables, you could always try something like the
below in a Databricks notebook:

display(spark.read.table('samples.nyctaxi.trips'))

(For vanilla Spark users, it would be
spark.read.table('samples.nyctaxi.trips').show(100, False))
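
If you also want the schema and storage location without leaving the
notebook, DESCRIBE TABLE EXTENDED surfaces both. A minimal sketch, assuming
the samples catalog is attached to your workspace:

# Returns the columns plus extended metadata such as Location and Provider
display(spark.sql('DESCRIBE TABLE EXTENDED samples.nyctaxi.trips'))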

Since you’re using Databricks, you can also find the data under the Data
menu: scroll down to the samples metastore, then click through to trips to
find the file location, schema, and sample data.
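
On the parsing question itself: rather than maintaining your own ANTLR build
of SqlBaseParser.g4, one option is to let Spark’s parser do the work and walk
the unresolved logical plan it produces. Below is a rough sketch of the idea.
Caveats: it reaches internal Spark APIs through py4j (which can change
between releases), it only picks up relations that survive as plan leaves,
and CTE names will show up alongside real tables before analysis, so treat
it as a starting point rather than a complete solution:

query = """
SELECT concat(pickup_zip, '-', dropoff_zip) AS route,
       AVG(fare_amount) AS average_fare
FROM `samples`.`nyctaxi`.`trips`
GROUP BY 1 ORDER BY 2 DESC LIMIT 1000
"""

# Parse without analyzing, so table references remain UnresolvedRelation leaves
plan = spark._jsparkSession.sessionState().sqlParser().parsePlan(query)
leaves = plan.collectLeaves()
tables = []
for i in range(leaves.size()):
    node = leaves.apply(i)
    if node.nodeName() == "UnresolvedRelation":
        tables.append(node.tableName())
print(tables)  # expected: ['samples.nyctaxi.trips']

The same idea is a one-liner in Scala (spark.sessionState.sqlParser.parsePlan
plus a collect over UnresolvedRelation nodes), and if you would rather not
depend on a live SparkSession, third-party SQL parsers such as sqlglot also
understand the Spark dialect and can list the tables in a statement.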

On Jun 29, 2023 at 23:53:25, Ruben Mennes wrote:

> Dear Apache Spark community,
>
> I hope this email finds you well. My name is Ruben, and I am an
> enthusiastic user of Apache Spark, specifically through the Databricks
> platform. I am reaching out to you today to seek your assistance and
> guidance regarding a specific use case.
>
> I have been exploring the capabilities of Spark SQL and Databricks, and I
> have encountered a challenge related to accessing the data objects used by
> queries from the query history. I am aware that Databricks provides a
> comprehensive query history that contains valuable information about
> executed queries.
>
> However, my objective is to extract the underlying data objects (tables)
> involved in each query. By doing so, I aim to analyze and understand the
> dependencies between queries and the data they operate on. This information
> will give us new insights into how data is used across our data platform.
>
> I have attempted to leverage the Spark SQL ANTLR grammar, available at
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4,
> to parse the queries retrieved from the query history. Unfortunately, I
> have encountered difficulties when parsing more complex queries.
>
> As an example, I have struggled to parse queries with intricate constructs
> such as the following:
>
> SELECT
>   concat(pickup_zip, '-', dropoff_zip) as route,
>   AVG(fare_amount) as average_fare
> FROM
>   `samples`.`nyctaxi`.`trips`
> GROUP BY
>   1
> ORDER BY
>   2 DESC
> LIMIT 1000
>
> I would greatly appreciate it if you could provide me with some guidance
> on how to overcome these challenges. Specifically, I am interested in
> understanding if there are alternative approaches or existing tools that
> can help me achieve my goal of extracting the data objects used by queries
> from the Databricks query history.
>
> Additionally, if there are any resources, documentation, or examples that
> provide further clarity on this topic, I would be more than grateful to
> receive them. Any insights you can provide would be of immense help in
> advancing my understanding and enabling me to make the most of the Spark
> SQL and Databricks ecosystem.
>
> Thank you very much for your time and support. I eagerly look forward to
> hearing from you and benefiting from your expertise.
>
> Best regards,
> Ruben Mennes
>

