[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23263#discussion_r240004006

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala ---
    @@ -65,7 +65,19 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage {
        * Fits a model to the input data.
        */
       @Since("2.0.0")
    -  def fit(dataset: Dataset[_]): M
    +  def fit(dataset: Dataset[_]): M = MLEvents.withFitEvent(this, dataset) {
    +    fitImpl(dataset)
    +  }
    +
    +  /**
    +   * `fit()` handles events and then calls this method. Subclasses should override this
    +   * method to implement the actual fiting a model to the input data.
    +   */
    +  @Since("3.0.0")
    +  protected def fitImpl(dataset: Dataset[_]): M = {
    +    // Keep this default body for backward compatibility.
    +    throw new UnsupportedOperationException("fitImpl is not implemented.")
    --- End diff --

    Yes, that was my intention. I wanted to force subclasses to implement `fitImpl`, but I thought that might be too breaking a change (it would at least break source compatibility). I am willing to follow other suggestions; I am pretty sure you and others are more familiar with the ML side.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
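The trade-off discussed in this comment can be illustrated with a minimal, Spark-free sketch. Every name below (`SketchEvents`, `SketchEstimator`, `SketchFitStart`, `SketchFitEnd`, `MeanEstimator`, `EmptyEstimator`) is hypothetical and stands in for the real classes; only the shape of `fit` delegating to a `fitImpl` with a throwing default body is taken from the diff.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-ins for the ML events; these are not Spark classes.
sealed trait SketchEvent
final case class SketchFitStart(estimator: String) extends SketchEvent
final case class SketchFitEnd(estimator: String) extends SketchEvent

object SketchEvents {
  // Records posted events in order; a real implementation would post to a listener bus.
  val posted: ArrayBuffer[SketchEvent] = ArrayBuffer.empty

  // Posts a start event, runs the body, then posts an end event.
  // If the body throws, no end event is posted in this sketch.
  def withFitEvent[M](name: String)(body: => M): M = {
    posted += SketchFitStart(name)
    val model = body
    posted += SketchFitEnd(name)
    model
  }
}

abstract class SketchEstimator[M](val name: String) {
  // Public entry point: wraps the implementation with events.
  def fit(data: Seq[Double]): M = SketchEvents.withFitEvent(name) { fitImpl(data) }

  // A default body instead of an abstract method: subclasses written before
  // fitImpl existed still compile, at the cost of a failure at call time.
  protected def fitImpl(data: Seq[Double]): M =
    throw new UnsupportedOperationException("fitImpl is not implemented.")
}

// Subclass that implements the new hook.
class MeanEstimator extends SketchEstimator[Double]("mean") {
  override protected def fitImpl(data: Seq[Double]): Double = data.sum / data.size
}

// Subclass that overrides nothing: it compiles, but fit() fails when called.
class EmptyEstimator extends SketchEstimator[Double]("empty")
```

Calling `new MeanEstimator().fit(...)` records a start and an end event around the computation, while `new EmptyEstimator().fit(...)` compiles but throws `UnsupportedOperationException` at runtime. Making `fitImpl` abstract would move that failure to compile time, which is the source-compatibility break mentioned above.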
[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23263#discussion_r240003952

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala ---
    @@ -65,7 +65,19 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage {
        * Fits a model to the input data.
        */
       @Since("2.0.0")
    -  def fit(dataset: Dataset[_]): M
    +  def fit(dataset: Dataset[_]): M = MLEvents.withFitEvent(this, dataset) {
    +    fitImpl(dataset)
    +  }
    +
    +  /**
    +   * `fit()` handles events and then calls this method. Subclasses should override this
    +   * method to implement the actual fiting a model to the input data.
    +   */
    +  @Since("3.0.0")
    +  protected def fitImpl(dataset: Dataset[_]): M = {
    +    // Keep this default body for backward compatibility.
    +    throw new UnsupportedOperationException("fitImpl is not implemented.")
    --- End diff --

    With the current change, can Spark ML developers still choose to override `fit` instead of `fitImpl`, so that their ML model works without ML events?
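The scenario raised in this question can be sketched without Spark. The classes below (`EventLog`, `BaseEstimator`, `LegacyEstimator`, `MigratedEstimator`) are hypothetical; the sketch only assumes, as in the diff, that the base `fit` is non-final and posts events around `fitImpl`.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical event log standing in for a listener bus.
object EventLog {
  val names: ListBuffer[String] = ListBuffer.empty
}

abstract class BaseEstimator(val label: String) {
  // Non-final, so subclasses may still override it directly.
  def fit(data: Seq[Int]): Int = {
    EventLog.names += s"fitStart:$label"
    val model = fitImpl(data)
    EventLog.names += s"fitEnd:$label"
    model
  }

  protected def fitImpl(data: Seq[Int]): Int =
    throw new UnsupportedOperationException("fitImpl is not implemented.")
}

// Written against the old contract: overrides fit, so the event wrapper never runs.
class LegacyEstimator extends BaseEstimator("legacy") {
  override def fit(data: Seq[Int]): Int = data.max
}

// Written against the new contract: overrides fitImpl, so events are posted.
class MigratedEstimator extends BaseEstimator("migrated") {
  override protected def fitImpl(data: Seq[Int]): Int = data.max
}
```

In this sketch both estimators return the same model, but only `MigratedEstimator` leaves entries in `EventLog.names`. An estimator that overrides `fit` keeps working and is simply invisible to listeners.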
[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23263#discussion_r240003885

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/MLEventsSuite.scala ---
    @@ -0,0 +1,199 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml
    +
    +import java.io.File
    +
    +import scala.collection.mutable
    +import scala.concurrent.duration._
    +
    +import org.apache.hadoop.fs.Path
    +import org.mockito.Matchers.{any, eq => meq}
    +import org.mockito.Mockito.when
    +import org.scalatest.BeforeAndAfterEach
    +import org.scalatest.concurrent.Eventually
    +import org.scalatest.mockito.MockitoSugar.mock
    +
    +import org.apache.spark.{SparkContext, SparkFunSuite}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.util._
    +import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
    +import org.apache.spark.sql._
    +import org.apache.spark.util.Utils
    +
    +
    +class MLEventsSuite
    +    extends SparkFunSuite
    +    with BeforeAndAfterEach
    +    with DefaultReadWriteTest
    +    with Eventually {
    +
    +  private var spark: SparkSession = _
    +  private var sc: SparkContext = _
    +  private var checkpointDir: String = _
    +  private var listener: SparkListener = _
    +  private val dirName: String = "pipeline"
    +  private val events = mutable.ArrayBuffer.empty[MLEvent]
    +
    +  override def beforeAll(): Unit = {
    +    super.beforeAll()
    +    sc = new SparkContext("local[2]", "SparkListenerSuite")
    +    listener = new SparkListener {
    +      override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    +        case e: FitStart[_] => events.append(e)
    +        case e: FitEnd[_] => events.append(e)
    +        case e: TransformStart => events.append(e)
    +        case e: TransformEnd => events.append(e)
    +        case e: SaveInstanceStart if e.path.endsWith(dirName) => events.append(e)
    +        case e: SaveInstanceEnd if e.path.endsWith(dirName) => events.append(e)
    +        case _ =>
    +      }
    +    }
    +    sc.addSparkListener(listener)
    +
    +    spark = SparkSession.builder()
    +      .sparkContext(sc)
    +      .getOrCreate()
    +
    +    checkpointDir = Utils.createDirectory(tempDir.getCanonicalPath, "checkpoints").toString
    +    sc.setCheckpointDir(checkpointDir)
    --- End diff --

    Let me double check and address this while fixing the test. I just copied this from `MLlibTestSparkContext`.
[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23263#discussion_r240003869

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala ---
    @@ -132,7 +132,8 @@ class Pipeline @Since("1.4.0") (
        * @return fitted pipeline
        */
       @Since("2.0.0")
    -  override def fit(dataset: Dataset[_]): PipelineModel = {
    +  override def fit(dataset: Dataset[_]): PipelineModel = super.fit(dataset)
    --- End diff --

    Ah, it's there only to keep the `@Since`. It looks like some classes don't explicitly note that, so I didn't call `super` in other places.
[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23263#discussion_r240003674

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/MLEventsSuite.scala ---
    @@ -0,0 +1,199 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml
    +
    +import java.io.File
    +
    +import scala.collection.mutable
    +import scala.concurrent.duration._
    +
    +import org.apache.hadoop.fs.Path
    +import org.mockito.Matchers.{any, eq => meq}
    +import org.mockito.Mockito.when
    +import org.scalatest.BeforeAndAfterEach
    +import org.scalatest.concurrent.Eventually
    +import org.scalatest.mockito.MockitoSugar.mock
    +
    +import org.apache.spark.{SparkContext, SparkFunSuite}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.util._
    +import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
    +import org.apache.spark.sql._
    +import org.apache.spark.util.Utils
    +
    +
    +class MLEventsSuite
    +    extends SparkFunSuite
    +    with BeforeAndAfterEach
    +    with DefaultReadWriteTest
    +    with Eventually {
    +
    +  private var spark: SparkSession = _
    +  private var sc: SparkContext = _
    +  private var checkpointDir: String = _
    +  private var listener: SparkListener = _
    +  private val dirName: String = "pipeline"
    +  private val events = mutable.ArrayBuffer.empty[MLEvent]
    +
    +  override def beforeAll(): Unit = {
    +    super.beforeAll()
    +    sc = new SparkContext("local[2]", "SparkListenerSuite")
    +    listener = new SparkListener {
    +      override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    +        case e: FitStart[_] => events.append(e)
    +        case e: FitEnd[_] => events.append(e)
    +        case e: TransformStart => events.append(e)
    +        case e: TransformEnd => events.append(e)
    +        case e: SaveInstanceStart if e.path.endsWith(dirName) => events.append(e)
    +        case e: SaveInstanceEnd if e.path.endsWith(dirName) => events.append(e)
    +        case _ =>
    +      }
    +    }
    +    sc.addSparkListener(listener)
    +
    +    spark = SparkSession.builder()
    +      .sparkContext(sc)
    +      .getOrCreate()
    +
    +    checkpointDir = Utils.createDirectory(tempDir.getCanonicalPath, "checkpoints").toString
    +    sc.setCheckpointDir(checkpointDir)
    --- End diff --

    I may have missed it, but where do we use the checkpoint?
[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23263#discussion_r240003563

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala ---
    @@ -132,7 +132,8 @@ class Pipeline @Since("1.4.0") (
        * @return fitted pipeline
        */
       @Since("2.0.0")
    -  override def fit(dataset: Dataset[_]): PipelineModel = {
    +  override def fit(dataset: Dataset[_]): PipelineModel = super.fit(dataset)
    --- End diff --

    Is there any `fit` method which doesn't do `super.fit()`?
[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23263#discussion_r23747

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala ---
    @@ -210,7 +214,7 @@ abstract class PredictionModel[FeaturesType, M <: PredictionModel[FeaturesType,
         }
       }

    -  protected def transformImpl(dataset: Dataset[_]): DataFrame = {
    +  override protected def transformImpl(dataset: Dataset[_]): DataFrame = {
    --- End diff --

    `transformImpl` (for some abstractions) and `saveImpl` already exist.
[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events
GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/23263

    [SPARK-23674][ML] Adds Spark ML Events

    ## What changes were proposed in this pull request?

    This PR proposes to add ML events so that other developers can track them and act on them.

    ## Introduction

    This PR proposes to send ML events, as is already done for SQL. This is quite useful when people want to track ML operations and take actions on them. For instance, I have been working on integrating Apache Spark with [Apache Atlas](https://atlas.apache.org/QuickStart.html). With some custom changes on top of this PR, I can visualise an ML pipeline as below:

    ![spark_ml_streaming_lineage](https://user-images.githubusercontent.com/6477701/49682779-394bca80-faf5-11e8-85b8-5fae28b784b3.png)

    I think it goes without saying how useful it is to track SQL operations. Likewise, I would like to propose ML events as well (as lowest-stability `@Unstable` APIs for now, with no guarantee about stability).

    ## Implementation Details

    ### Sends events (but does not expose an ML-specific listener)

    In `events.scala`, it adds:

    ```scala
    @Unstable
    case class ...StartEvent(caller, input)
    @Unstable
    case class ...EndEvent(caller, output)

    object MLEvents {
      // Wrappers to send events:
      // def with...Event(body) = {
      //   body()
      //   SparkContext.getOrCreate().listenerBus.post(event)
      // }
    }
    ```

    This mimics both of the following:

    **1. Catalog events (see `org.apache.spark.sql.catalyst.catalog.events.scala`)**

    - This allows a Catalog-specific listener, `ExternalCatalogEventListener`, to be added.
    - It is implemented by wrapping the whole `ExternalCatalog` in `ExternalCatalogWithListener`, which delegates the operations to `ExternalCatalog`.

    This is not quite possible in this case because most instances (like `Pipeline`) are created directly in most cases. We might be able to do that by extending `ListenerBus` for all possible instances, but IMHO it's too invasive. Also, exposing another ML-specific listener sounds a bit too much at this stage. Therefore, I simply borrowed the file name and structure here.

    **2. SQL execution events (see `org.apache.spark.sql.execution.SQLExecution.scala`)**

    - Adds an object that wraps a body to send events.

    The current approach is rather close to this. It has a `with...` wrapper to send events. I borrowed this approach to be consistent.

    ### Add `...Impl` methods to wrap each to send events

    **1. `mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala`**

    ```diff
    - def save(...) = { saveImpl(...) }
    + def save(...) = MLEvents.withSaveInstanceEvent { saveImpl(...) }
      def saveImpl(...): Unit = ...
    ```

    Note that `saveImpl` was already implemented, unlike the other instances below.

    ```diff
    - def load(...): T
    + def load(...): T = MLEvents.withLoadInstanceEvent { loadImpl(...) }
    + def loadImpl(...): T
    ```

    **2. `mllib/src/main/scala/org/apache/spark/ml/Estimator.scala`**

    ```diff
    - def fit(...): Model
    + def fit(...): Model = MLEvents.withFitEvent { fitImpl(...) }
    + def fitImpl(...): Model
    ```

    **3. `mllib/src/main/scala/org/apache/spark/ml/Transformer.scala`**

    ```diff
    - def transform(...): DataFrame
    + def transform(...): DataFrame = MLEvents.withTransformEvent { transformImpl(...) }
    + def transformImpl(...): DataFrame
    ```

    This approach follows the existing patterns in ML, listed below:

    **1. `transform` and `transformImpl`**

    https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L202-L213

    https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L191-L196

    https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala#L1037-L1042

    **2. `save` and `saveImpl`**

    https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala#L166-L176

    Inherited ones are intentionally omitted here for simplicity. They are inherited and implemented in multiple places.

    ## Backward Compatibility

    _This keeps both source and binary backward compatibility_. I considered enforcing `...Impl` by leaving them as abstract methods that subclasses must implement, but decided to leave a body that throws `UnsupportedOperationException` so that we keep full source and binary compatibility.

    - From a user-facing API perspective, _there's no difference_.
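As a rough illustration of the `with...Event` wrapper and the `save`/`saveImpl` split described above, here is a minimal, Spark-free sketch. `BusSketch`, `ListenerSketch`, `MLEventsSketch`, `WriterSketch`, and `RecordingListener` are hypothetical stand-ins, and the event case classes only borrow their names from the test suite in this PR. The sketch also delivers events synchronously, which is a simplification and not how Spark's listener bus behaves.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical event hierarchy; the names echo the PR but these are not Spark classes.
sealed trait MLEventSketch
final case class SaveInstanceStart(path: String) extends MLEventSketch
final case class SaveInstanceEnd(path: String) extends MLEventSketch

trait ListenerSketch {
  def onOtherEvent(event: MLEventSketch): Unit
}

// Stand-in for a listener bus. Delivery is synchronous here for simplicity.
final class BusSketch {
  private val listeners = ArrayBuffer.empty[ListenerSketch]
  def addListener(listener: ListenerSketch): Unit = listeners += listener
  def post(event: MLEventSketch): Unit = listeners.foreach(_.onOtherEvent(event))
}

object MLEventsSketch {
  val bus = new BusSketch

  // Wrapper in the spirit of `with...Event`: posts events around a by-name body.
  def withSaveInstanceEvent(path: String)(body: => Unit): Unit = {
    bus.post(SaveInstanceStart(path))
    body
    bus.post(SaveInstanceEnd(path))
  }
}

class WriterSketch {
  val written = ArrayBuffer.empty[String]

  // Public method only wraps; the actual work lives in saveImpl.
  def save(path: String): Unit =
    MLEventsSketch.withSaveInstanceEvent(path) { saveImpl(path) }

  protected def saveImpl(path: String): Unit = written += path
}

// Listener that keeps only save events whose path ends with a given directory name.
class RecordingListener(dirName: String) extends ListenerSketch {
  val events = ArrayBuffer.empty[MLEventSketch]

  override def onOtherEvent(event: MLEventSketch): Unit = event match {
    case e: SaveInstanceStart if e.path.endsWith(dirName) => events += e
    case e: SaveInstanceEnd if e.path.endsWith(dirName) => events += e
    case _ => // ignore everything else
  }
}
```

After registering a `RecordingListener("pipeline")` on `MLEventsSketch.bus`, saving to a path ending in `pipeline` records a start and an end event, while saves to other paths are filtered out by the pattern guards. Callers of `save` see no API difference, which matches the compatibility claim in the description.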