Observable Metrics on Spark Datasets

Enrico Minack Mon, 15 Mar 2021 02:59:38 -0700

Hi Spark-Devs,

The observable metrics that were added to the Dataset API in 3.0.0 are a great improvement over the Accumulator APIs, which seem to have much weaker guarantees. I have two questions regarding follow-up contributions:


*1. Add observe to Python DataFrame*

As far as I can see from the master branch, there is no equivalent in the Python API. Is this something that is planned, or is it missing because there are no listeners in PySpark, which would render this method useless in Python? I would be happy to contribute here.


*2. Add Observation class to simplify result access*

The Dataset.observe method <https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html#observe(name:String,expr:org.apache.spark.sql.Column,exprs:org.apache.spark.sql.Column*):org.apache.spark.sql.Dataset[T]> requires users to register a QueryExecutionListener or StreamingQueryListener to obtain the result. I think for simple setups this could be hidden behind a common helper class, here called Observation, which reduces the usage of observe to three lines of code:


    // our Dataset (this does not count as a line of code)
    val df = Seq((1, "a"), (2, "b"), (4, "c"), (8, "d")).toDF("id", "value")

    // define the observation we want to make
    val observation = Observation("stats", count($"id"), sum($"id"))

    // add the observation to the Dataset and execute an action on it
    val cnt = df.observe(observation).count()

    // retrieve the result
    assert(observation.get === Row(4, 15))

The Observation class can handle the registration and de-registration of the listener, as well as properly accessing the result across thread boundaries.
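
To make the idea concrete, here is a minimal sketch of how such a helper could be built on the existing Dataset.observe and QueryExecutionListener APIs. This is an illustration under assumptions, not a proposed final implementation: the class body, the method name `on` (standing in for the proposed `df.observe(observation)`, which would need a change to Dataset itself) and the use of a CountDownLatch are all hypothetical choices. It covers batch queries only and blocks forever in `get` if the query fails.

```scala
import java.util.concurrent.CountDownLatch

import org.apache.spark.sql.{Column, Dataset, Row, SparkSession}
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.functions.{col, count, sum}
import org.apache.spark.sql.util.QueryExecutionListener

// Hypothetical helper (not an existing Spark class): hides the listener
// that Dataset.observe requires behind a small object.
class Observation(name: String, expr: Column, exprs: Column*) {
  // The listener runs on a different thread than the caller of `get`;
  // the latch hands the result over safely.
  private val done = new CountDownLatch(1)
  @volatile private var result: Option[Row] = None
  @volatile private var session: Option[SparkSession] = None

  private val listener = new QueryExecutionListener {
    override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
      // Only react to executions that carry the metrics registered under our name.
      qe.observedMetrics.get(name).foreach { row =>
        result = Some(row)
        done.countDown()
      }

    // Sketch limitation: a failed query never releases the latch.
    override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
  }

  // Registers the listener and attaches the metrics to the Dataset.
  def on[T](ds: Dataset[T]): Dataset[T] = {
    session = Some(ds.sparkSession)
    ds.sparkSession.listenerManager.register(listener)
    ds.observe(name, expr, exprs: _*)
  }

  // Blocks until the metrics have arrived, then de-registers the listener.
  def get: Row = {
    done.await()
    session.foreach(_.listenerManager.unregister(listener))
    result.get
  }
}

object ObservationExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("observation-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b"), (4, "c"), (8, "d")).toDF("id", "value")
    val observation = new Observation("stats", count(col("id")).as("count"), sum(col("id")).as("sum"))

    // Any action on the observed Dataset triggers the metric collection.
    observation.on(df).count()
    println(observation.get)

    spark.stop()
  }
}
```

With a helper like this, user code never touches the listener manager, which is what would allow a thin PySpark wrapper around the same JVM object.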

With *2.*, the observe method can be added to PySpark without introducing listeners there at all. All the logic happens in the JVM.


Thanks for your thoughts on this.

Regards,
Enrico
