GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/23261

    [SPARK-23674][ML] Adds Spark ML Events

    ## What changes were proposed in this pull request?
    
    This PR proposes to add ML events so that other developers can track them and add some actions for them.
    
    ## Introduction
    
    This PR proposes to send some ML events, like SQL does. This is quite useful when people want to track ML operations and take some actions for them. For instance, I have been working on integrating Spark with Atlas, and with some custom changes on top of this PR, I can visualise an ML pipeline as below:
    
    
![spark_ml_streaming_lineage](https://user-images.githubusercontent.com/6477701/49682779-394bca80-faf5-11e8-85b8-5fae28b784b3.png)
    
    Not to mention how useful it is to track SQL operations; likewise, I would like to propose ML events as well (as `@Unstable` APIs, the lowest stability level, for now).
    
    ## Implementation Details
    
    ### Send events (but do not expose an ML-specific listener)
    
    In `events.scala`, this PR adds:
    
    ```scala
    @Unstable
    case class ...Events
    
    object MLEvents {
      // Wrappers to send events:
      // def with...Event(body) = {
      //   body()
      //   SparkContext.getOrCreate().listenerBus.post(event)
      // }
    }
    ```
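
    For illustration, such a wrapper could be fleshed out as in the minimal standalone sketch below. The event and method names (`FitStartEvent`, `FitEndEvent`, `withFitEvent`) are hypothetical placeholders rather than the exact ones in this PR, and the object is assumed to live under `org.apache.spark` so that it can reach the `private[spark]` listener bus:

    ```scala
    package org.apache.spark.ml

    import org.apache.spark.SparkContext
    import org.apache.spark.annotation.Unstable
    import org.apache.spark.scheduler.SparkListenerEvent

    // Hypothetical event pair posted around a fit() call.
    @Unstable
    case class FitStartEvent(estimatorName: String) extends SparkListenerEvent
    @Unstable
    case class FitEndEvent(estimatorName: String) extends SparkListenerEvent

    private[ml] object MLEventsSketch {
      // Runs `body` and posts start/end events to the shared listener bus.
      def withFitEvent[T](estimatorName: String)(body: => T): T = {
        val listenerBus = SparkContext.getOrCreate().listenerBus
        listenerBus.post(FitStartEvent(estimatorName))
        val result = body
        listenerBus.post(FitEndEvent(estimatorName))
        result
      }
    }
    ```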
    
    This approach mimics both of the following:
    
    **1. Catalog events (see `org.apache.spark.sql.catalyst.catalog.events.scala`)**
    
    - This allows a Catalog-specific listener, `ExternalCatalogEventListener`, to be added
    - It is implemented by wrapping the whole `ExternalCatalog` in `ExternalCatalogWithListener`, which delegates the operations to `ExternalCatalog`
    
    This is not quite possible in our case because most instances (like `Pipeline`) are created directly in most cases. We might be able to do that by extending `ListenerBus` for all possible instances, but IMHO that is invasive. Also, exposing another ML-specific listener sounds like a bit too much for the current status. Therefore, I simply borrowed the file name and structure here.
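
    For context, the Catalog approach is roughly the following delegation pattern (a simplified sketch, not Spark's actual `ExternalCatalogWithListener`):

    ```scala
    // Simplified sketch: a wrapper posts events and forwards every call to
    // the wrapped delegate. This works for the Catalog because callers obtain
    // it from one central place, so the wrapper can be installed there.
    trait SimpleCatalog {
      def createTable(name: String): Unit
    }

    class SimpleCatalogWithListener(
        delegate: SimpleCatalog,
        postEvent: String => Unit) extends SimpleCatalog {
      override def createTable(name: String): Unit = {
        postEvent(s"CreateTablePreEvent($name)")
        delegate.createTable(name)
        postEvent(s"CreateTablePostEvent($name)")
      }
    }
    ```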
    
    **2. SQL execution events (see `org.apache.spark.sql.execution.SQLExecution.scala`)**
    
    - Add an object that wraps a body to send events
    
    The current approach is rather close to this one: it uses `with...` wrappers to send events. I borrowed this approach to be consistent.
    
    
    ### Add `...Impl` methods and wrap each to send events
    
    **1. `mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala`**
    
    ```diff
    - def save(...) = { saveImpl(...) }
    + def save(...) = MLEvents.withSaveInstanceEvent { saveImpl(...) }
      def saveImpl(...): Unit = ...
    ```
    
      Note that `saveImpl` was already implemented, unlike the other instances below.
    
    
    ```diff
    - def load(...): T
    + def load(...): T = MLEvents.withLoadInstanceEvent { loadImpl(...) }
    + def loadImpl(...): T
    ```
    
    **2. `mllib/src/main/scala/org/apache/spark/ml/Estimator.scala`**
    
    ```diff
    - def fit(...): Model
    + def fit(...): Model = MLEvents.withFitEvent { fitImpl(...) }
    + def fitImpl(...): Model
    ```
    
    **3. `mllib/src/main/scala/org/apache/spark/ml/Transformer.scala`**
    
    ```diff
    - def transform(...): DataFrame
    + def transform(...): DataFrame = MLEvents.withTransformEvent { transformImpl(...) }
    + def transformImpl(...): DataFrame
    ```
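
    With this in place, a developer-facing custom transformer might look like the sketch below (a hypothetical class; it assumes `transformImpl` is the protected hook shown in the diff above):

    ```scala
    import org.apache.spark.ml.Transformer
    import org.apache.spark.ml.param.ParamMap
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.{DataFrame, Dataset}
    import org.apache.spark.sql.functions.upper
    import org.apache.spark.sql.types.StructType

    // Hypothetical transformer that upper-cases a "value" column. It overrides
    // transformImpl rather than transform, so the inherited transform() can
    // post events around the actual work.
    class UpperCaseTransformer(override val uid: String) extends Transformer {
      def this() = this(Identifiable.randomUID("upperCase"))

      override protected def transformImpl(dataset: Dataset[_]): DataFrame =
        dataset.withColumn("value", upper(dataset("value")))

      override def transformSchema(schema: StructType): StructType = schema

      override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
    }
    ```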
    
    This approach follows the existing pattern in ML, as below:
    
    **1. `transform` and `transformImpl`**
    
    
https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L202-L213
    
    
https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L190-L196
    
    
https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala#L1037-L1042
    
    **2. `save` and `saveImpl`**
    
    
https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala#L166-L176
    
    Inherited ones are intentionally omitted here for simplicity; they are inherited and implemented in multiple places.
    
    ## Backward Compatibility
    
    This keeps both source and binary backward compatibility. I considered enforcing `...Impl` by leaving them as abstract methods, but decided instead to give them a default body that throws `UnsupportedOperationException`, so that we can keep full source and binary compatibility.
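
    Concretely, the compatibility trick is roughly the following (a standalone sketch with simplified types; the event wrapper is elided):

    ```scala
    // Sketch: fit() keeps its existing contract, while fitImpl() gets a
    // default body instead of being abstract, so existing subclasses that
    // only override fit() keep compiling and linking unchanged.
    abstract class EstimatorSketch[M] {
      def fit(dataset: Seq[Double]): M = fitImpl(dataset) // event wrapper elided

      protected def fitImpl(dataset: Seq[Double]): M =
        throw new UnsupportedOperationException("fitImpl is not implemented.")
    }
    ```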
    
    - From the user-facing API perspective, there is no difference: `...Impl` methods are protected and not visible to end users.
    
    - From the developer API perspective, if some developers want to override the `...` methods instead of `...Impl`, that is still fine; it just means no events are handled. If developers want events to be handled from their custom implementation, they should implement `...Impl`. Of course, implementing `...Impl` is encouraged.
    
    ## How was this patch tested?
    
    Manually tested and unit tests were added.
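
    For reference, such a unit test might look like the sketch below: attach a listener that captures posted events, run an ML operation, and assert on what was captured (`EventCapture` and `FitStartEvent` are hypothetical names):

    ```scala
    import scala.collection.mutable

    import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

    // Collects every custom event posted to the listener bus.
    class EventCapture extends SparkListener {
      val events = mutable.ArrayBuffer.empty[SparkListenerEvent]
      override def onOtherEvent(event: SparkListenerEvent): Unit = events += event
    }

    // Usage in a test, given an active SparkContext `sc`:
    //   val capture = new EventCapture
    //   sc.addSparkListener(capture)
    //   ... run fit()/transform()/save()/load() ...
    //   assert(capture.events.exists(_.isInstanceOf[FitStartEvent]))
    ```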


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-23674

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23261.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #23261
    
----
commit 56805d60004142106b400b94eb94f8cf87486494
Author: Hyukjin Kwon <gurwls223@...>
Date:   2018-12-05T06:38:05Z

    Adds Spark ML Events

----

