Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23263

> Visualizing a workflow is nice, but Spark's Pipelines are typically pretty straightforward and linear. I could imagine producing a nicer visualization than what you get from reading the Spark UI, although of course we already have some degree of history and data there.

Another thing worth considering is that these events can be combined with other SQL events, for instance, to show where the input `Dataset` originated. With current Apache Spark, I can already visualise SQL operations as below:

![screen shot 2018-12-10 at 9 41 36 am](https://user-images.githubusercontent.com/6477701/49706269-d9bdfe00-fc5f-11e8-943a-3309d1856ba5.png)

I think we can combine those existing lineages together to easily understand where the data comes from and where it goes. (BTW, I hope it doesn't sound like I'm pushing this for my one specific case; I think this hook-like feature can be useful in many ways. I currently have one explicit example to show, so I'm referring to my case.)

> These are just the hooks, right? someone would have to implement something to use these events. I see the value in the API to some degree, but with no concrete implementation, does it add anything for Spark users out of the box?

Yes, right. It does not add anything for Spark users out of the box; it needs a custom listener implementation. For instance, with the custom listener below:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

class CustomSparkListener extends SparkListener {
  // The SparkListener callback for custom events is `onOtherEvent`.
  override def onOtherEvent(e: SparkListenerEvent): Unit = e match {
    case e: MLEvent => // do something
    case _ => // pass
  }
}
```

there are two (existing) ways to use it:

```scala
spark.sparkContext.addSparkListener(new CustomSparkListener)
```

```bash
spark-submit --conf spark.extraListeners=CustomSparkListener ...
```

(A fuller end-to-end sketch is at the end of this comment.)

It's also similar to the other existing listener implementations on the SQL side (the catalog events described above in the PR description). One actual example I had with a SQL query listener was having to close a connection after every SQL execution.

> Is that what someone would likely do? or would someone likely have to run Atlas to use this? If that's a good example of the use case, and Atlas is really about lineage and governance, is that the thrust of this change, to help with something to do with model lineage and reproducibility?

There are two answers:

1. I think someone would in general utilise this feature like the other event listeners. At least, I can see some interest going on outside:
   - SQL listener
     - https://stackoverflow.com/questions/46409339/spark-listener-to-an-sql-query
     - http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-Custom-Query-Execution-listener-via-conf-properties-td30979.html
   - Streaming query listener
     - https://jhui.github.io/2017/01/15/Apache-Spark-Streaming/
     - http://apache-spark-developers-list.1001551.n3.nabble.com/Structured-Streaming-with-Watermark-td25413.html#a25416
2. Someone would likely run this via Atlas so that they could do something about lineage and governance, yes, as you said. I'm trying to show integrated lineages in Apache Spark, but this is currently a missing piece; this change is needed to fill it.
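To make the above concrete, here is a minimal end-to-end sketch of wiring up such a listener. `MLEvent` is the marker trait proposed in this PR (its package, `org.apache.spark.ml`, follows the PR description); `MLEventCollector` itself is a hypothetical example for illustration, not part of this change:

```scala
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.ml.MLEvent // assumption: trait location per this PR
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
import org.apache.spark.sql.SparkSession

// Hypothetical listener that collects ML events for later lineage inspection.
class MLEventCollector extends SparkListener {
  val events = new ArrayBuffer[MLEvent]()

  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    // e.g. the fit/transform/save/load events posted by this PR
    case e: MLEvent => events.synchronized { events += e }
    case _ => // ignore non-ML events
  }
}

object MLEventCollectorExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val collector = new MLEventCollector
    spark.sparkContext.addSparkListener(collector)

    // ... fit a Pipeline / Estimator here; with this PR applied, the ML
    // operations would post MLEvents to the listener bus asynchronously ...

    // Listener events are delivered asynchronously, so a real consumer
    // would wait for the bus to drain (e.g. on shutdown) before reading.
    spark.stop()
    collector.events.synchronized {
      collector.events.foreach(e => println(s"ML event: $e"))
    }
  }
}
```

A governance tool like Atlas would implement essentially this shape of listener, translating each event into its own lineage model instead of printing it.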