subject:"\[jira\] \[Assigned\] \(SPARK\-34806\) Helper class for batch Dataset.observe\(\)"

[jira] [Assigned] (SPARK-34806) Helper class for batch Dataset.observe()

2021-07-19 Thread Wenchen Fan (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-34806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-34806:
---

Assignee: Enrico Minack

> Helper class for batch Dataset.observe()
> 
>
> Key: SPARK-34806
> URL: https://issues.apache.org/jira/browse/SPARK-34806
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Enrico Minack
>Assignee: Enrico Minack
>Priority: Minor
>
> The {{observe}} method has been added to the {{Dataset}} API in 3.0.0. It 
> allows to collect aggregate metrics over data of a Dataset while they are 
> being processed during an action.
> These metrics are collected in a separate thread after registering 
> {{QueryExecutionListener}} for batch datasets and {{StreamingQueryListener}} 
> for stream datasets, respectively. While in streaming context it makes 
> perfectly sense to process incremental metrics in an event-based fashion, for 
> simple batch datatset processing, a single result should be retrievable 
> without the need to register listeners or handling threading.
> Introducing an {{Observation}} helper class can hide that complexity for 
> simple use-cases in batch processing.
> Similar to {{AccumulatorV2}} provided by {{SparkContext}} (e.g. 
> {{SparkContext.LongAccumulator}}), the {{SparkSession}} can provide a method 
> to create a new {{Observation}} instance and register it with the session.
> Alternatively, an {{Observation}} instance could be instantiated on its own 
> which on calling {{Observation.on(Dataset)}} registers with 
> {{Dataset.sparkSession}}. This "registration" registers a listener with the 
> session that retrieves the metrics.
> The {{Observation}} class provides methods to retrieve the metrics. This 
> retrieval has to wait for the listener to be called in a separate thread. So 
> all methods will wait for this, optionally with a timeout:
>  - {{Observation.get}} waits without timeout and returns the metric.
>  - {{Observation.option(time, unit)}} waits at most {{time}}, returns the 
> metric as an {{Option}}, or {{None}} when the timeout occurs.
>  - {{Observation.waitCompleted(time, unit)}} waits for the metrics and 
> indicates timeout by returning {{false}}.
> Obviously, an action has to be called on the observed dataset before any of 
> these methods are called, otherwise a timeout will occur.
> With {{Observation.reset}}, another action can be observed. Finally, 
> {{Observation.close}} unregisters the listener from the session.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-34806) Helper class for batch Dataset.observe()

2021-03-20 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-34806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34806:


Assignee: (was: Apache Spark)

> Helper class for batch Dataset.observe()
> 
>
> Key: SPARK-34806
> URL: https://issues.apache.org/jira/browse/SPARK-34806
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Enrico Minack
>Priority: Minor
>
> The {{observe}} method has been added to the {{Dataset}} API in 3.0.0. It 
> allows to collect aggregate metrics over data of a Dataset while they are 
> being processed during an action.
> These metrics are collected in a separate thread after registering 
> {{QueryExecutionListener}} for batch datasets and {{StreamingQueryListener}} 
> for stream datasets, respectively. While in streaming context it makes 
> perfectly sense to process incremental metrics in an event-based fashion, for 
> simple batch datatset processing, a single result should be retrievable 
> without the need to register listeners or handling threading.
> Introducing an {{Observation}} helper class can hide that complexity for 
> simple use-cases in batch processing.
> Similar to {{AccumulatorV2}} provided by {{SparkContext}} (e.g. 
> {{SparkContext.LongAccumulator}}), the {{SparkSession}} can provide a method 
> to create a new {{Observation}} instance and register it with the session.
> Alternatively, an {{Observation}} instance could be instantiated on its own 
> which on calling {{Observation.on(Dataset)}} registers with 
> {{Dataset.sparkSession}}. This "registration" registers a listener with the 
> session that retrieves the metrics.
> The {{Observation}} class provides methods to retrieve the metrics. This 
> retrieval has to wait for the listener to be called in a separate thread. So 
> all methods will wait for this, optionally with a timeout:
>  - {{Observation.get}} waits without timeout and returns the metric.
>  - {{Observation.option(time, unit)}} waits at most {{time}}, returns the 
> metric as an {{Option}}, or {{None}} when the timeout occurs.
>  - {{Observation.waitCompleted(time, unit)}} waits for the metrics and 
> indicates timeout by returning {{false}}.
> Obviously, an action has to be called on the observed dataset before any of 
> these methods are called, otherwise a timeout will occur.
> With {{Observation.reset}}, another action can be observed. Finally, 
> {{Observation.close}} unregisters the listener from the session.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-34806) Helper class for batch Dataset.observe()

2021-03-20 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-34806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34806:


Assignee: Apache Spark

> Helper class for batch Dataset.observe()
> 
>
> Key: SPARK-34806
> URL: https://issues.apache.org/jira/browse/SPARK-34806
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Enrico Minack
>Assignee: Apache Spark
>Priority: Minor
>
> The {{observe}} method has been added to the {{Dataset}} API in 3.0.0. It 
> allows to collect aggregate metrics over data of a Dataset while they are 
> being processed during an action.
> These metrics are collected in a separate thread after registering 
> {{QueryExecutionListener}} for batch datasets and {{StreamingQueryListener}} 
> for stream datasets, respectively. While in streaming context it makes 
> perfectly sense to process incremental metrics in an event-based fashion, for 
> simple batch datatset processing, a single result should be retrievable 
> without the need to register listeners or handling threading.
> Introducing an {{Observation}} helper class can hide that complexity for 
> simple use-cases in batch processing.
> Similar to {{AccumulatorV2}} provided by {{SparkContext}} (e.g. 
> {{SparkContext.LongAccumulator}}), the {{SparkSession}} can provide a method 
> to create a new {{Observation}} instance and register it with the session.
> Alternatively, an {{Observation}} instance could be instantiated on its own 
> which on calling {{Observation.on(Dataset)}} registers with 
> {{Dataset.sparkSession}}. This "registration" registers a listener with the 
> session that retrieves the metrics.
> The {{Observation}} class provides methods to retrieve the metrics. This 
> retrieval has to wait for the listener to be called in a separate thread. So 
> all methods will wait for this, optionally with a timeout:
>  - {{Observation.get}} waits without timeout and returns the metric.
>  - {{Observation.option(time, unit)}} waits at most {{time}}, returns the 
> metric as an {{Option}}, or {{None}} when the timeout occurs.
>  - {{Observation.waitCompleted(time, unit)}} waits for the metrics and 
> indicates timeout by returning {{false}}.
> Obviously, an action has to be called on the observed dataset before any of 
> these methods are called, otherwise a timeout will occur.
> With {{Observation.reset}}, another action can be observed. Finally, 
> {{Observation.close}} unregisters the listener from the session.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-34806) Helper class for batch Dataset.observe()

[jira] [Assigned] (SPARK-34806) Helper class for batch Dataset.observe()

[jira] [Assigned] (SPARK-34806) Helper class for batch Dataset.observe()

3 matches

Site Navigation

Mail list logo

Footer information