GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/23261
[SPARK-23674][ML] Adds Spark ML Events

## What changes were proposed in this pull request?

This PR proposes to add ML events so that other developers can track them and take corresponding actions.

## Introduction

This PR proposes to send ML events in the same way SQL does. This is quite useful when people want to track ML operations and act on them. For instance, I have been working on integrating Spark with Atlas, and with some custom changes on top of this PR, I can visualise an ML pipeline as below:

![spark_ml_streaming_lineage](https://user-images.githubusercontent.com/6477701/49682779-394bca80-faf5-11e8-85b8-5fae28b784b3.png)

It goes without saying how useful it is to track SQL operations. Likewise, I would like to propose ML events as well (as lowest-stability `@Unstable` APIs for now).

## Implementation Details

### Send events (but do not expose an ML-specific listener)

`events.scala` adds:

```scala
@Unstable
case class ...Events

object MLEvents {
  // Wrappers to send events:
  // def with...Event(body) = {
  //   body()
  //   SparkContext.getOrCreate().listenerBus.post(event)
  // }
}
```

This mimics both:

**1. Catalog events (see `org.apache.spark.sql.catalyst.catalog.events.scala`)**

- This allows a catalog-specific listener, `ExternalCatalogEventListener`, to be added.
- It is implemented by wrapping the whole `ExternalCatalog` in `ExternalCatalogWithListener`, which delegates the operations to `ExternalCatalog`.

This is not quite possible in our case because most instances (like `Pipeline`) are created directly in most cases. We might be able to do it by extending `ListenerBus` for all possible instances, but IMHO that is invasive. Also, exposing another ML-specific listener sounds like too much for the current status. Therefore, I simply borrowed the file name and structure here.

**2. SQL execution events (see `org.apache.spark.sql.execution.SQLExecution.scala`)**

- Add an object that wraps a body to send events.

The current approach is rather close to this.
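To make the wrapper idea concrete, here is a minimal, self-contained sketch in plain Scala (no Spark dependency). The event names, the stand-in listener bus, and the `withFitEvent` signature are all illustrative, not the actual PR code:

```scala
// Illustrative ML events; the real PR defines @Unstable case classes.
sealed trait MLEvent
case class FitStart(estimatorName: String) extends MLEvent
case class FitEnd(estimatorName: String) extends MLEvent

// Stand-in for SparkContext's listener bus: just collects posted events.
object ListenerBus {
  private val events = scala.collection.mutable.ArrayBuffer.empty[MLEvent]
  def post(e: MLEvent): Unit = events += e
  def posted: Seq[MLEvent] = events.toSeq
}

object MLEvents {
  // Runs the body and posts events around it, similar in spirit to how
  // SQLExecution wraps a body to send SQL execution events.
  def withFitEvent[T](estimatorName: String)(body: => T): T = {
    ListenerBus.post(FitStart(estimatorName))
    val result = body
    ListenerBus.post(FitEnd(estimatorName))
    result
  }
}
```

For example, `MLEvents.withFitEvent("LogisticRegression") { /* fitImpl(...) */ 42 }` returns the body's result unchanged and leaves a `FitStart`/`FitEnd` pair on the bus.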
It has a `with...` wrapper to send events. I borrowed this approach to be consistent.

### Add `...Impl` methods and wrap each to send events

**1. `mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala`**

```diff
- def save(...) = { saveImpl(...) }
+ def save(...) = MLEvents.withSaveInstanceEvent { saveImpl(...) }
  def saveImpl(...): Unit = ...
```

Note that `saveImpl` was already implemented, unlike the other instances below.

```diff
- def load(...): T
+ def load(...): T = MLEvents.withLoadInstanceEvent { loadImpl(...) }
+ def loadImpl(...): T
```

**2. `mllib/src/main/scala/org/apache/spark/ml/Estimator.scala`**

```diff
- def fit(...): Model
+ def fit(...): Model = MLEvents.withFitEvent { fitImpl(...) }
+ def fitImpl(...): Model
```

**3. `mllib/src/main/scala/org/apache/spark/ml/Transformer.scala`**

```diff
- def transform(...): DataFrame
+ def transform(...): DataFrame = MLEvents.withTransformEvent { transformImpl(...) }
+ def transformImpl(...): DataFrame
```

This approach follows existing patterns in ML:

**1. `transform` and `transformImpl`**

https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L202-L213
https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L190-L196
https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala#L1037-L1042

**2. `save` and `saveImpl`**

https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala#L166-L176

Inherited ones are intentionally omitted here for simplicity; they are inherited and implemented in multiple places.

## Backward Compatibility

This keeps both source and binary backward compatibility.
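A self-contained sketch of this public-method/`...Impl` split, and of how a non-abstract default body keeps existing subclasses compatible. The simplified types (`Seq[Double]` in, `Double` out) and class names are hypothetical; the real `Estimator.fit` takes a `Dataset` and returns a `Model`, and the real wrapper posts events rather than printing:

```scala
// Sketch of the template-method pattern: `fit` is the public entry point
// that fires events; developers override the protected `fitImpl`.
abstract class SketchEstimator {
  def name: String

  def fit(data: Seq[Double]): Double = {
    println(s"posting FitStart for $name")  // stands in for MLEvents.withFitEvent
    val model = fitImpl(data)
    println(s"posting FitEnd for $name")
    model
  }

  // Not abstract: the default body preserves source and binary compatibility
  // for existing subclasses that still override `fit` directly.
  protected def fitImpl(data: Seq[Double]): Double =
    throw new UnsupportedOperationException(s"$name does not implement fitImpl")
}

class MeanEstimator extends SketchEstimator {
  def name: String = "MeanEstimator"
  override protected def fitImpl(data: Seq[Double]): Double =
    data.sum / data.size
}
```

Calling `new MeanEstimator().fit(...)` goes through the event-firing `fit`, while a legacy subclass overriding `fit` itself keeps compiling and running, just without events.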
I considered enforcing `...Impl` by leaving the methods abstract, but decided instead to give them a body that throws `UnsupportedOperationException`, so that we keep full source and binary compatibility.

- From the user-facing API perspective, there is no difference. The `...Impl` methods are protected and not visible to end users.
- From the developer API perspective, if developers want to override the `...` methods instead of `...Impl`, that is still fine; it just does not handle events. If developers want events to be handled from their custom implementation, they should implement `...Impl`. Of course, implementing `...Impl` is encouraged.

## How was this patch tested?

Manually tested, and unit tests were added.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-23674

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23261.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23261

----

commit 56805d60004142106b400b94eb94f8cf87486494
Author: Hyukjin Kwon <gurwls223@...>
Date: 2018-12-05T06:38:05Z

    Adds Spark ML Events

----

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org