[GitHub] [spark] HyukjinKwon commented on a change in pull request #33422: [SPARK-34806][SQL] Add Observation helper for Dataset.observe

2021-07-20 Thread GitBox


HyukjinKwon commented on a change in pull request #33422:
URL: https://github.com/apache/spark/pull/33422#discussion_r673587148



##
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##
@@ -1947,6 +1947,31 @@ class Dataset[T] private[sql](
     CollectMetrics(name, (expr +: exprs).map(_.named), logicalPlan)
   }
 
+  /**
+   * Observe (named) metrics through an `org.apache.spark.sql.Observation` instance.
+   * This is equivalent to calling `observe(String, Column, Column*)` but does not require
+   * adding `org.apache.spark.sql.util.QueryExecutionListener` to the spark session.
+   * This method does not support streaming datasets.
+   *
+   * A user can retrieve the metrics by accessing `org.apache.spark.sql.Observation.get`.
+   *
+   * {{{
+   *   // Observe row count (rows) and highest id (maxid) in the Dataset while writing it
+   *   val observation = Observation("my_metrics")
+   *   val observed_ds = ds.observe(observation, count(lit(1)).as("rows"), max($"id").as("maxid"))
+   *   observed_ds.write.parquet("ds.parquet")
+   *   val metrics = observation.get
+   * }}}
+   *
+   * @throws IllegalArgumentException If this is a streaming Dataset (this.isStreaming == true)
+   *
+   * @group typedrel
+   * @since 3.3.0
+   */
+  def observe(observation: Observation, expr: Column, exprs: Column*): Dataset[T] = {

Review comment:
   That's fine. Let's not add it for now.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #33422: [SPARK-34806][SQL] Add Observation helper for Dataset.observe

2021-07-20 Thread GitBox


HyukjinKwon commented on a change in pull request #33422:
URL: https://github.com/apache/spark/pull/33422#discussion_r673112471



##
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##
@@ -1947,6 +1947,31 @@ class Dataset[T] private[sql](
     CollectMetrics(name, (expr +: exprs).map(_.named), logicalPlan)
   }
 
+  /**
+   * Observe (named) metrics through an `org.apache.spark.sql.Observation` instance.
+   * This is equivalent to calling `observe(String, Column, Column*)` but does not require
+   * adding `org.apache.spark.sql.util.QueryExecutionListener` to the spark session.
+   * This method does not support streaming datasets.
+   *
+   * A user can retrieve the metrics by accessing `org.apache.spark.sql.Observation.get`.
+   *
+   * {{{
+   *   // Observe row count (rows) and highest id (maxid) in the Dataset while writing it
+   *   val observation = Observation("my_metrics")
+   *   val observed_ds = ds.observe(observation, count(lit(1)).as("rows"), max($"id").as("maxid"))
+   *   observed_ds.write.parquet("ds.parquet")
+   *   val metrics = observation.get
+   * }}}
+   *
+   * @throws IllegalArgumentException If this is a streaming Dataset (this.isStreaming == true)
+   *
+   * @group typedrel
+   * @since 3.3.0
+   */
+  def observe(observation: Observation, expr: Column, exprs: Column*): Dataset[T] = {

Review comment:
   oh yeah!







[GitHub] [spark] HyukjinKwon commented on a change in pull request #33422: [SPARK-34806][SQL] Add Observation helper for Dataset.observe

2021-07-20 Thread GitBox


HyukjinKwon commented on a change in pull request #33422:
URL: https://github.com/apache/spark/pull/33422#discussion_r672721125



##
File path: sql/core/src/main/scala/org/apache/spark/sql/Observation.scala
##
@@ -0,0 +1,150 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import java.util.UUID
+
+import org.apache.spark.sql.execution.QueryExecution
+import org.apache.spark.sql.util.QueryExecutionListener
+
+
+/**
+ * Helper class to simplify usage of `Dataset.observe(String, Column, Column*)`:
+ *
+ * {{{
+ *   // Observe row count (rows) and highest id (maxid) in the Dataset while writing it
+ *   val observation = Observation("my metrics")
+ *   val observed_ds = ds.observe(observation, count(lit(1)).as("rows"), max($"id").as("maxid"))
+ *   observed_ds.write.parquet("ds.parquet")
+ *   val metrics = observation.get
+ * }}}
+ *
+ * This collects the metrics while the first action is executed on the observed dataset. Subsequent
+ * actions do not modify the metrics returned by [[get]]. Retrieval of the metric via [[get]]
+ * blocks until the first action has finished and metrics become available.
+ *
+ * This class does not support streaming datasets.
+ *
+ * @param name name of the metric
+ * @since 3.3.0
+ */
+class Observation(name: String) {
+
+  private val listener: ObservationListener = ObservationListener(this)
+
+  @volatile private var sparkSession: Option[SparkSession] = None
+
+  @volatile private var row: Option[Row] = None
+
+  /**
+   * Attach this observation to the given [[Dataset]] to observe aggregation expressions.
+   *
+   * @param ds dataset
+   * @param expr first aggregation expression
+   * @param exprs more aggregation expressions
+   * @tparam T dataset type
+   * @return observed dataset
+   * @throws IllegalArgumentException If this is a streaming Dataset (ds.isStreaming == true)
+   */
+  private[spark] def on[T](ds: Dataset[T], expr: Column, exprs: Column*): Dataset[T] = {
+    if (ds.isStreaming) {
+      throw new IllegalArgumentException("Observation does not support streaming Datasets")
+    }
+    register(ds.sparkSession)
+    ds.observe(name, expr, exprs: _*)
+  }
+
+  /**
+   * Get the observed metrics. This waits for the observed dataset to finish its first action.
+   * Only the result of the first action is available. Subsequent actions do not modify the result.
+   *
+   * @return the observed metrics as a [[Row]]
+   * @throws InterruptedException interrupted while waiting
+   */
+  @throws[InterruptedException]
+  def get: Row = {
+    synchronized {
+      // we need to loop as wait might return without us calling notify
+      // https://en.wikipedia.org/w/index.php?title=Spurious_wakeup&oldid=992601610
+      while (this.row.isEmpty) {
+        wait()
+      }
+    }
+
+    this.row.get
+  }
+
+  private def register(sparkSession: SparkSession): Unit = {
+    // makes this class thread-safe:
+    // only the first thread entering this block can set sparkSession
+    // all other threads will see the exception, as it is only allowed to do this once
+    synchronized {
+      if (this.sparkSession.isDefined) {
+        throw new IllegalArgumentException("An Observation can be used with a Dataset only once")
+      }
+      this.sparkSession = Some(sparkSession)
+    }
+
+    sparkSession.listenerManager.register(this.listener)
+  }
+
+  private def unregister(): Unit = {
+    this.sparkSession.foreach(_.listenerManager.unregister(this.listener))
+  }
+
+  private[spark] def onFinish(qe: QueryExecution): Unit = {
+    synchronized {
+      if (this.row.isEmpty) {
+        this.row = qe.observedMetrics.get(name)
+        if (this.row.isDefined) {
+          notifyAll()
+          unregister()
+        }
+      }
+    }
+  }
+
+}
+
+private[sql] case class ObservationListener(observation: Observation)
+  extends QueryExecutionListener {
+
+  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
+    observation.onFinish(qe)
+
+  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
+    observation.onFinish(qe)
+
+}
+
+/**
+ * (Scala-specific) Create a named or anonym
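
The `get` method in the quoted hunk blocks on the object's monitor and re-checks its condition in a loop, because `wait()` may return without a matching `notifyAll()`. The following is a minimal, Spark-free sketch of that guarded-wait pattern. The names `OneShotResult`, `publish` and `OneShotResultDemo` are hypothetical and do not appear in the pull request; only JDK threading primitives are used.

```scala
// Hypothetical stand-in for the blocking part of Observation; not Spark code.
final class OneShotResult[A] {
  @volatile private var value: Option[A] = None

  // Blocks until a value has been published.
  // The while loop re-checks the condition to tolerate spurious wakeups.
  @throws[InterruptedException]
  def get: A = {
    synchronized {
      while (value.isEmpty) {
        wait()
      }
    }
    value.get
  }

  // Stores the first value only and wakes up all waiting threads.
  // Later calls leave the stored value untouched.
  def publish(a: A): Unit = synchronized {
    if (value.isEmpty) {
      value = Some(a)
      notifyAll()
    }
  }
}

object OneShotResultDemo {
  // Publishes from a second thread and reads the result twice.
  def run(): (Int, Int) = {
    val result = new OneShotResult[Int]
    val producer = new Thread(() => {
      Thread.sleep(50)
      result.publish(42)
      result.publish(7) // ignored: a value is already present
    })
    producer.start()
    val first = result.get // blocks until the producer has published
    producer.join()
    (first, result.get)
  }
}
```

Calling `OneShotResultDemo.run()` reads the same value both times, which mirrors the documented behaviour that subsequent actions do not modify the metrics returned by `get`.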

[GitHub] [spark] HyukjinKwon commented on a change in pull request #33422: [SPARK-34806][SQL] Add Observation helper for Dataset.observe

2021-07-20 Thread GitBox


HyukjinKwon commented on a change in pull request #33422:
URL: https://github.com/apache/spark/pull/33422#discussion_r672980755



##
File path: sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java
##
@@ -523,4 +523,55 @@ public void testUDF() {
       .map(row -> row.get(0).toString() + row.getString(1)).toArray(String[]::new);
     Assert.assertArrayEquals(expected, result);
   }
+
+  /**
+   * Tests the Java API of Observation and Dataset.observe(Observation, Column, Column*).
+   */
+  @Test
+  public void testObservation() {
+    Observation namedObservation = new Observation("named");
+    Observation unnamedObservation = new Observation();
+
+    Dataset df = spark
+        .range(100)
+        .observe(
+            namedObservation,
+            min(col("id")).as("min_val"),
+            scala.collection.JavaConverters.asScalaBuffer(Arrays.asList(
+                max(col("id")).as("max_val"),
+                sum(col("id")).as("sum_val"),
+                count(when(pmod(col("id"), lit(2)).$eq$eq$eq(0), 1)).as("num_even")
+            ))
+        )
+        .observe(
+            unnamedObservation,
+            avg(col("id")).cast("int").as("avg_val"),
+            scala.collection.JavaConverters.asScalaBuffer(Arrays.asList())
+        );
+
+    df.collect();
+    Row namedMetrics = null;
+    Row unnamedMetrics = null;
+
+    try {
+      // we can get the result multiple times
+      namedMetrics = namedObservation.get();
+      unnamedMetrics = unnamedObservation.get();
+    } catch (InterruptedException e) {
+      Assert.fail();
+    }
+    Assert.assertEquals(Arrays.asList(0L, 99L, 4950L, 50L), scala.collection.JavaConverters.seqAsJavaList(namedMetrics.toSeq()));

Review comment:
   I think the current way is fine.
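
The expected values in the assertion above can be cross-checked without Spark. The sketch below recomputes the four named metrics over the ids 0 to 99 with plain Scala collections. The object name `ExpectedMetrics` is hypothetical. The value computed for `avg_val` is not shown in the quoted excerpt; it assumes that casting the average 49.5 to int truncates toward zero, which is how `toInt` behaves in Scala.

```scala
// Spark-free recomputation of the metrics observed on spark.range(100).
object ExpectedMetrics {
  // spark.range(100) produces the ids 0, 1, ..., 99
  val ids: List[Long] = (0L until 100L).toList

  // min_val, max_val, sum_val, num_even, in the order used by the test
  val named: List[Long] = List(
    ids.min,
    ids.max,
    ids.sum,
    ids.count(_ % 2 == 0).toLong // for non-negative ids, % agrees with pmod
  )

  // avg_val: the average is 49.5; truncation to int is an assumption here
  val avgAsInt: Int = (ids.sum.toDouble / ids.size).toInt
}
```

`ExpectedMetrics.named` evaluates to `List(0, 99, 4950, 50)`, matching the list asserted in the Java test.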







[GitHub] [spark] HyukjinKwon commented on a change in pull request #33422: [SPARK-34806][SQL] Add Observation helper for Dataset.observe

2021-07-20 Thread GitBox


HyukjinKwon commented on a change in pull request #33422:
URL: https://github.com/apache/spark/pull/33422#discussion_r672937934



##
File path: sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java
##
@@ -523,4 +523,55 @@ public void testUDF() {
       .map(row -> row.get(0).toString() + row.getString(1)).toArray(String[]::new);
     Assert.assertArrayEquals(expected, result);
   }
+
+  /**
+   * Tests the Java API of Observation and Dataset.observe(Observation, Column, Column*).
+   */
+  @Test
+  public void testObservation() {
+    Observation namedObservation = new Observation("named");
+    Observation unnamedObservation = new Observation();
+
+    Dataset df = spark
+        .range(100)
Review comment:
   Actually, we should keep 2-space indentation here, the same as on the Scala side.







[GitHub] [spark] HyukjinKwon commented on a change in pull request #33422: [SPARK-34806][SQL] Add Observation helper for Dataset.observe

2021-07-19 Thread GitBox


HyukjinKwon commented on a change in pull request #33422:
URL: https://github.com/apache/spark/pull/33422#discussion_r672800804



##
File path: sql/core/src/main/scala/org/apache/spark/sql/Observation.scala
##
@@ -0,0 +1,150 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import java.util.UUID
+
+import org.apache.spark.sql.execution.QueryExecution
+import org.apache.spark.sql.util.QueryExecutionListener
+
+
+/**
+ * Helper class to simplify usage of `Dataset.observe(String, Column, Column*)`:
+ *
+ * {{{
+ *   // Observe row count (rows) and highest id (maxid) in the Dataset while writing it
+ *   val observation = Observation("my metrics")
+ *   val observed_ds = ds.observe(observation, count(lit(1)).as("rows"), max($"id").as("maxid"))
+ *   observed_ds.write.parquet("ds.parquet")
+ *   val metrics = observation.get
+ * }}}
+ *
+ * This collects the metrics while the first action is executed on the observed dataset. Subsequent
+ * actions do not modify the metrics returned by [[get]]. Retrieval of the metric via [[get]]
+ * blocks until the first action has finished and metrics become available.
+ *
+ * This class does not support streaming datasets.
+ *
+ * @param name name of the metric
+ * @since 3.3.0
+ */
+class Observation(name: String) {
+
+  private val listener: ObservationListener = ObservationListener(this)
+
+  @volatile private var sparkSession: Option[SparkSession] = None
+
+  @volatile private var row: Option[Row] = None
+
+  /**
+   * Attach this observation to the given [[Dataset]] to observe aggregation expressions.
+   *
+   * @param ds dataset
+   * @param expr first aggregation expression
+   * @param exprs more aggregation expressions
+   * @tparam T dataset type
+   * @return observed dataset
+   * @throws IllegalArgumentException If this is a streaming Dataset (ds.isStreaming == true)
+   */
+  private[spark] def on[T](ds: Dataset[T], expr: Column, exprs: Column*): Dataset[T] = {
+    if (ds.isStreaming) {
+      throw new IllegalArgumentException("Observation does not support streaming Datasets")
+    }
+    register(ds.sparkSession)
+    ds.observe(name, expr, exprs: _*)
+  }
+
+  /**
+   * Get the observed metrics. This waits for the observed dataset to finish its first action.
+   * Only the result of the first action is available. Subsequent actions do not modify the result.
+   *
+   * @return the observed metrics as a [[Row]]
+   * @throws InterruptedException interrupted while waiting
+   */
+  @throws[InterruptedException]
+  def get: Row = {
+    synchronized {
+      // we need to loop as wait might return without us calling notify
+      // https://en.wikipedia.org/w/index.php?title=Spurious_wakeup&oldid=992601610
+      while (this.row.isEmpty) {
+        wait()
+      }
+    }
+
+    this.row.get
+  }
+
+  private def register(sparkSession: SparkSession): Unit = {
+    // makes this class thread-safe:
+    // only the first thread entering this block can set sparkSession
+    // all other threads will see the exception, as it is only allowed to do this once
+    synchronized {
+      if (this.sparkSession.isDefined) {
+        throw new IllegalArgumentException("An Observation can be used with a Dataset only once")
+      }
+      this.sparkSession = Some(sparkSession)
+    }
+
+    sparkSession.listenerManager.register(this.listener)
+  }
+
+  private def unregister(): Unit = {
+    this.sparkSession.foreach(_.listenerManager.unregister(this.listener))
+  }
+
+  private[spark] def onFinish(qe: QueryExecution): Unit = {
+    synchronized {
+      if (this.row.isEmpty) {
+        this.row = qe.observedMetrics.get(name)
+        if (this.row.isDefined) {
+          notifyAll()
+          unregister()
+        }
+      }
+    }
+  }
+
+}
+
+private[sql] case class ObservationListener(observation: Observation)
+  extends QueryExecutionListener {
+
+  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
+    observation.onFinish(qe)
+
+  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
+    observation.onFinish(qe)
+
+}
+
+/**
+ * (Scala-specific) Create a named or anonym
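
The `register` method in the quoted hunk uses a synchronized check-then-set so that an Observation can be attached to a Dataset only once, even when several threads try at the same time. A minimal sketch of that once-only guard, without Spark, could look as follows. `OnceOnly` and `OnceOnlyDemo` are hypothetical names introduced for illustration, and the type parameter `S` stands in for the `SparkSession` held by the real class.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical, Spark-free illustration of the once-only guard.
final class OnceOnly[S] {
  @volatile private var owner: Option[S] = None

  // Only the first caller may set the owner; all later callers see the exception.
  def register(s: S): Unit = synchronized {
    if (owner.isDefined) {
      throw new IllegalArgumentException("can be registered only once")
    }
    owner = Some(s)
  }

  def current: Option[S] = owner
}

object OnceOnlyDemo {
  // Starts n threads that all try to register and counts the successful attempts.
  def race(n: Int): Int = {
    val guard = new OnceOnly[Int]
    val successes = new AtomicInteger(0)
    val threads = (1 to n).map { i =>
      new Thread(() => {
        try {
          guard.register(i)
          successes.incrementAndGet()
          ()
        } catch {
          case _: IllegalArgumentException => () // expected for every loser of the race
        }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    successes.get
  }
}
```

Whichever thread wins the race, exactly one registration succeeds, because the check and the assignment happen under the same monitor.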

[GitHub] [spark] HyukjinKwon commented on a change in pull request #33422: [SPARK-34806][SQL] Add Observation helper for Dataset.observe

2021-07-19 Thread GitBox


HyukjinKwon commented on a change in pull request #33422:
URL: https://github.com/apache/spark/pull/33422#discussion_r672800470



##
File path: sql/core/src/main/scala/org/apache/spark/sql/Observation.scala
##
@@ -0,0 +1,150 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import java.util.UUID
+
+import org.apache.spark.sql.execution.QueryExecution
+import org.apache.spark.sql.util.QueryExecutionListener
+
+
+/**
+ * Helper class to simplify usage of `Dataset.observe(String, Column, Column*)`:
+ *
+ * {{{
+ *   // Observe row count (rows) and highest id (maxid) in the Dataset while writing it
+ *   val observation = Observation("my metrics")
+ *   val observed_ds = ds.observe(observation, count(lit(1)).as("rows"), max($"id").as("maxid"))
+ *   observed_ds.write.parquet("ds.parquet")
+ *   val metrics = observation.get
+ * }}}
+ *
+ * This collects the metrics while the first action is executed on the observed dataset. Subsequent
+ * actions do not modify the metrics returned by [[get]]. Retrieval of the metric via [[get]]
+ * blocks until the first action has finished and metrics become available.
+ *
+ * This class does not support streaming datasets.
+ *
+ * @param name name of the metric
+ * @since 3.3.0
+ */
+class Observation(name: String) {
+
+  private val listener: ObservationListener = ObservationListener(this)
+
+  @volatile private var sparkSession: Option[SparkSession] = None
+
+  @volatile private var row: Option[Row] = None
+
+  /**
+   * Attach this observation to the given [[Dataset]] to observe aggregation expressions.
+   *
+   * @param ds dataset
+   * @param expr first aggregation expression
+   * @param exprs more aggregation expressions
+   * @tparam T dataset type
+   * @return observed dataset
+   * @throws IllegalArgumentException If this is a streaming Dataset (ds.isStreaming == true)
+   */
+  private[spark] def on[T](ds: Dataset[T], expr: Column, exprs: Column*): Dataset[T] = {
+    if (ds.isStreaming) {
+      throw new IllegalArgumentException("Observation does not support streaming Datasets")
+    }
+    register(ds.sparkSession)
+    ds.observe(name, expr, exprs: _*)
+  }
+
+  /**
+   * Get the observed metrics. This waits for the observed dataset to finish its first action.
+   * Only the result of the first action is available. Subsequent actions do not modify the result.
+   *
+   * @return the observed metrics as a [[Row]]
+   * @throws InterruptedException interrupted while waiting
+   */
+  @throws[InterruptedException]
+  def get: Row = {
+    synchronized {
+      // we need to loop as wait might return without us calling notify
+      // https://en.wikipedia.org/w/index.php?title=Spurious_wakeup&oldid=992601610
+      while (this.row.isEmpty) {
+        wait()
+      }
+    }
+
+    this.row.get
+  }
+
+  private def register(sparkSession: SparkSession): Unit = {
+    // makes this class thread-safe:
+    // only the first thread entering this block can set sparkSession
+    // all other threads will see the exception, as it is only allowed to do this once
+    synchronized {
+      if (this.sparkSession.isDefined) {
+        throw new IllegalArgumentException("An Observation can be used with a Dataset only once")
+      }
+      this.sparkSession = Some(sparkSession)
+    }
+
+    sparkSession.listenerManager.register(this.listener)
+  }
+
+  private def unregister(): Unit = {
+    this.sparkSession.foreach(_.listenerManager.unregister(this.listener))
+  }
+
+  private[spark] def onFinish(qe: QueryExecution): Unit = {
+    synchronized {
+      if (this.row.isEmpty) {
+        this.row = qe.observedMetrics.get(name)
+        if (this.row.isDefined) {
+          notifyAll()
+          unregister()
+        }
+      }
+    }
+  }
+
+}
+
+private[sql] case class ObservationListener(observation: Observation)
+  extends QueryExecutionListener {
+
+  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
+    observation.onFinish(qe)
+
+  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
+    observation.onFinish(qe)
+
+}
+
+/**
+ * (Scala-specific) Create a named or anonym
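
In the quoted hunk, `onFinish` is called for every finished query on the session, on success and on failure alike. It keeps listening until a query actually carries metrics under the observation's name, stores that first match, and then unregisters. The sketch below models this first-match-wins logic without Spark. `FirstMetric` and `FirstMetricDemo` are hypothetical names, a plain `Map[String, Seq[Long]]` stands in for `qe.observedMetrics`, and a boolean flag stands in for the listener registration.

```scala
// Hypothetical, Spark-free model of the onFinish logic.
final class FirstMetric(name: String) {
  private var row: Option[Seq[Long]] = None
  private var active: Boolean = true // stands in for "listener is registered"

  // Called once per finished query with that query's named metrics.
  def onFinish(metrics: Map[String, Seq[Long]]): Unit = synchronized {
    if (row.isEmpty) {
      row = metrics.get(name)
      if (row.isDefined) {
        active = false // stands in for unregister()
      }
    }
  }

  def result: Option[Seq[Long]] = synchronized(row)
  def isActive: Boolean = synchronized(active)
}

object FirstMetricDemo {
  def run(): (Option[Seq[Long]], Boolean) = {
    val m = new FirstMetric("rows")
    m.onFinish(Map("other" -> Seq(1L)))  // unrelated query: nothing stored, keep listening
    m.onFinish(Map("rows" -> Seq(100L))) // first query with matching metrics: stored
    m.onFinish(Map("rows" -> Seq(5L)))   // later query: ignored
    (m.result, m.isActive)
  }
}
```

The unrelated first query shows why the check on `row.isDefined` matters: a session-wide listener also sees queries that do not involve the observed dataset, and those must not end the observation.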

[GitHub] [spark] HyukjinKwon commented on a change in pull request #33422: [SPARK-34806][SQL] Add Observation helper for Dataset.observe

2021-07-19 Thread GitBox


HyukjinKwon commented on a change in pull request #33422:
URL: https://github.com/apache/spark/pull/33422#discussion_r672800470



##
File path: sql/core/src/main/scala/org/apache/spark/sql/Observation.scala
##
@@ -0,0 +1,150 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import java.util.UUID
+
+import org.apache.spark.sql.execution.QueryExecution
+import org.apache.spark.sql.util.QueryExecutionListener
+
+
+/**
+ * Helper class to simplify usage of `Dataset.observe(String, Column, 
Column*)`:
+ *
+ * {{{
+ *   // Observe row count (rows) and highest id (maxid) in the Dataset while 
writing it
+ *   val observation = Observation("my metrics")
+ *   val observed_ds = ds.observe(observation, count(lit(1)).as("rows"), 
max($"id").as("maxid"))
+ *   observed_ds.write.parquet("ds.parquet")
+ *   val metrics = observation.get
+ * }}}
+ *
+ * This collects the metrics while the first action is executed on the 
observed dataset. Subsequent
+ * actions do not modify the metrics returned by [[get]]. Retrieval of the 
metric via [[get]]
+ * blocks until the first action has finished and metrics become available.
+ *
+ * This class does not support streaming datasets.
+ *
+ * @param name name of the metric
+ * @since 3.3.0
+ */
+class Observation(name: String) {
+
+  private val listener: ObservationListener = ObservationListener(this)
+
+  @volatile private var sparkSession: Option[SparkSession] = None
+
+  @volatile private var row: Option[Row] = None
+
+  /**
+   * Attach this observation to the given [[Dataset]] to observe aggregation 
expressions.
+   *
+   * @param ds dataset
+   * @param expr first aggregation expression
+   * @param exprs more aggregation expressions
+   * @tparam T dataset type
+   * @return observed dataset
+   * @throws IllegalArgumentException If this is a streaming Dataset 
(ds.isStreaming == true)
+   */
+  private[spark] def on[T](ds: Dataset[T], expr: Column, exprs: Column*): 
Dataset[T] = {
+if (ds.isStreaming) {
+  throw new IllegalArgumentException("Observation does not support 
streaming Datasets")
+}
+register(ds.sparkSession)
+ds.observe(name, expr, exprs: _*)
+  }
+
+  /**
+   * Get the observed metrics. This waits for the observed dataset to finish 
its first action.
+   * Only the result of the first action is available. Subsequent actions do 
not modify the result.
+   *
+   * @return the observed metrics as a [[Row]]
+   * @throws InterruptedException interrupted while waiting
+   */
+  @throws[InterruptedException]
+  def get: Row = {
+synchronized {
+  // we need to loop as wait might return without us calling notify
+  // 
https://en.wikipedia.org/w/index.php?title=Spurious_wakeup&oldid=992601610
+  while (this.row.isEmpty) {
+wait()
+  }
+}
+
+this.row.get
+  }
+
+  private def register(sparkSession: SparkSession): Unit = {
+// makes this class thread-safe:
+// only the first thread entering this block can set sparkSession
+// all other threads will see the exception, as it is only allowed to do 
this once
+synchronized {
+  if (this.sparkSession.isDefined) {
+throw new IllegalArgumentException("An Observation can be used with a 
Dataset only once")
+  }
+  this.sparkSession = Some(sparkSession)
+}
+
+sparkSession.listenerManager.register(this.listener)
+  }
+
+  private def unregister(): Unit = {
+this.sparkSession.foreach(_.listenerManager.unregister(this.listener))
+  }
+
+  private[spark] def onFinish(qe: QueryExecution): Unit = {
+    synchronized {
+      if (this.row.isEmpty) {
+        this.row = qe.observedMetrics.get(name)
+        if (this.row.isDefined) {
+          notifyAll()
+          unregister()
+        }
+      }
+    }
+  }
+
+}
+
+private[sql] case class ObservationListener(observation: Observation)
+  extends QueryExecutionListener {
+
+  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
+    observation.onFinish(qe)
+
+  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
+    observation.onFinish(qe)
+
+}
+
+/**
+ * (Scala-specific) Create a named or anonym
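
The `get` / `onFinish` pair in the diff above is a one-shot wait/notify handshake: `get` blocks in a loop until a value is present, and `onFinish` stores only the first value and wakes the waiters. A minimal sketch of that pattern without any Spark dependency might look like the following. `OneShotResult`, `publish` and `OneShotResultDemo` are hypothetical names used only for illustration and are not part of the pull request.

```scala
// Hypothetical stand-in for the blocking part of Observation (no Spark needed).
final class OneShotResult[A] {
  // volatile so the read after the synchronized block sees the stored value
  @volatile private var value: Option[A] = None

  // Blocks until publish() has stored a value, then returns it.
  def get: A = {
    synchronized {
      // loop because wait() may return without a matching notify (spurious wakeup)
      while (value.isEmpty) {
        wait()
      }
    }
    value.get
  }

  // Stores only the first value and wakes all waiting threads; later calls are ignored.
  def publish(a: A): Unit = synchronized {
    if (value.isEmpty) {
      value = Some(a)
      notifyAll()
    }
  }
}

object OneShotResultDemo {
  def main(args: Array[String]): Unit = {
    val result = new OneShotResult[Int]
    // The producer plays the role of the listener callback.
    val producer = new Thread(() => {
      Thread.sleep(100)
      result.publish(42)
      result.publish(7) // ignored: a value is already present
    })
    producer.start()
    println(result.get) // blocks until the producer has published
    producer.join()
    println(result.get) // still the first published value
  }
}
```

Both the waiter and the publisher synchronize on the same object, which is what makes the `wait()` / `notifyAll()` calls legal here and is the same arrangement the diff uses on the `Observation` instance.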

[GitHub] [spark] HyukjinKwon commented on a change in pull request #33422: [SPARK-34806][SQL] Add Observation helper for Dataset.observe

2021-07-19 Thread GitBox


HyukjinKwon commented on a change in pull request #33422:
URL: https://github.com/apache/spark/pull/33422#discussion_r672721125



##
File path: sql/core/src/main/scala/org/apache/spark/sql/Observation.scala
##
@@ -0,0 +1,150 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import java.util.UUID
+
+import org.apache.spark.sql.execution.QueryExecution
+import org.apache.spark.sql.util.QueryExecutionListener
+
+
+/**
+ * Helper class to simplify usage of `Dataset.observe(String, Column, Column*)`:
+ *
+ * {{{
+ *   // Observe row count (rows) and highest id (maxid) in the Dataset while writing it
+ *   val observation = Observation("my metrics")
+ *   val observed_ds = ds.observe(observation, count(lit(1)).as("rows"), max($"id").as("maxid"))
+ *   observed_ds.write.parquet("ds.parquet")
+ *   val metrics = observation.get
+ * }}}
+ *
+ * This collects the metrics while the first action is executed on the observed dataset. Subsequent
+ * actions do not modify the metrics returned by [[get]]. Retrieval of the metric via [[get]]
+ * blocks until the first action has finished and metrics become available.
+ *
+ * This class does not support streaming datasets.
+ *
+ * @param name name of the metric
+ * @since 3.3.0
+ */
+class Observation(name: String) {
+
+  private val listener: ObservationListener = ObservationListener(this)
+
+  @volatile private var sparkSession: Option[SparkSession] = None
+
+  @volatile private var row: Option[Row] = None
+
+  /**
+   * Attach this observation to the given [[Dataset]] to observe aggregation expressions.
+   *
+   * @param ds dataset
+   * @param expr first aggregation expression
+   * @param exprs more aggregation expressions
+   * @tparam T dataset type
+   * @return observed dataset
+   * @throws IllegalArgumentException If this is a streaming Dataset (ds.isStreaming == true)
+   */
+  private[spark] def on[T](ds: Dataset[T], expr: Column, exprs: Column*): Dataset[T] = {
+    if (ds.isStreaming) {
+      throw new IllegalArgumentException("Observation does not support streaming Datasets")
+    }
+    register(ds.sparkSession)
+    ds.observe(name, expr, exprs: _*)
+  }
+
+  /**
+   * Get the observed metrics. This waits for the observed dataset to finish its first action.
+   * Only the result of the first action is available. Subsequent actions do not modify the result.
+   *
+   * @return the observed metrics as a [[Row]]
+   * @throws InterruptedException interrupted while waiting
+   */
+  @throws[InterruptedException]
+  def get: Row = {
+    synchronized {
+      // we need to loop as wait might return without us calling notify
+      // https://en.wikipedia.org/w/index.php?title=Spurious_wakeup&oldid=992601610
+      while (this.row.isEmpty) {
+        wait()
+      }
+    }
+
+    this.row.get
+  }
+
+  private def register(sparkSession: SparkSession): Unit = {
+    // makes this class thread-safe:
+    // only the first thread entering this block can set sparkSession
+    // all other threads will see the exception, as it is only allowed to do this once
+    synchronized {
+      if (this.sparkSession.isDefined) {
+        throw new IllegalArgumentException("An Observation can be used with a Dataset only once")
+      }
+      this.sparkSession = Some(sparkSession)
+    }
+
+    sparkSession.listenerManager.register(this.listener)
+  }
+
+  private def unregister(): Unit = {
+    this.sparkSession.foreach(_.listenerManager.unregister(this.listener))
+  }
+
+  private[spark] def onFinish(qe: QueryExecution): Unit = {
+    synchronized {
+      if (this.row.isEmpty) {
+        this.row = qe.observedMetrics.get(name)
+        if (this.row.isDefined) {
+          notifyAll()
+          unregister()
+        }
+      }
+    }
+  }
+
+}
+
+private[sql] case class ObservationListener(observation: Observation)
+  extends QueryExecutionListener {
+
+  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
+    observation.onFinish(qe)
+
+  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
+    observation.onFinish(qe)
+
+}
+
+/**
+ * (Scala-specific) Create a named or anonym