[GitHub] spark pull request #21719: [SPARK-24747] Make Instrumentation class more fle...

2018-07-05 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21719#discussion_r200498918
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala ---
@@ -148,12 +170,25 @@ private[spark] class Instrumentation[E <: Estimator[_]] private (
   }
 
 
+  // TODO: Remove this (possibly replace with logModel?)
   /**
    * Logs the successful completion of the training session.
    */
   def logSuccess(model: Model[_]): Unit = {
     log(s"training finished")
   }
+
+  def logSuccess(): Unit = {
+    log("training finished")
+  }
+
+  /**
+   * Logs an exception raised during a training session.
+   */
+  def logFailure(e: Throwable): Unit = {
+    val msg = e.getStackTrace.mkString("\n")
+    super.logInfo(msg)
--- End diff --

Failures should go to ERROR level.
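For example, a minimal sketch using the `logError` provided by the `Logging` trait:

~~~
/**
 * Logs an exception raised during a training session.
 */
def logFailure(e: Throwable): Unit = {
  val msg = e.getStackTrace.mkString("\n")
  super.logError(msg)
}
~~~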


---




[GitHub] spark pull request #21719: [SPARK-24747] Make Instrumentation class more fle...

2018-07-05 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21719#discussion_r200499164
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala ---
@@ -169,22 +204,33 @@ private[spark] object Instrumentation {
     val varianceOfLabels = "varianceOfLabels"
   }
 
+  // TODO: Remove these
   /**
    * Creates an instrumentation object for a training session.
    */
-  def create[E <: Estimator[_]](
-      estimator: E, dataset: Dataset[_]): Instrumentation[E] = {
-    create[E](estimator, dataset.rdd)
+  def create(estimator: Estimator[_], dataset: Dataset[_]): Instrumentation = {
+    create(estimator, dataset.rdd)
   }
 
   /**
    * Creates an instrumentation object for a training session.
    */
-  def create[E <: Estimator[_]](
-      estimator: E, dataset: RDD[_]): Instrumentation[E] = {
-    new Instrumentation[E](estimator, dataset)
+  def create(estimator: Estimator[_], dataset: RDD[_]): Instrumentation = {
+    new Instrumentation(estimator, dataset)
+  }
+  // end remove
+
+  def instrumented[T](body: (Instrumentation => T)): T = {
+    val instr = new Instrumentation()
+    Try(body(new Instrumentation())) match {
--- End diff --

Use the already constructed `instr` instead of creating a second instance.
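A sketch of the fixed body (the `Success` branch isn't visible in this hunk, so its shape is an assumption; `result` anticipates the naming comment below):

~~~
import scala.util.{Failure, Success, Try}
import scala.util.control.NonFatal

def instrumented[T](body: (Instrumentation => T)): T = {
  val instr = new Instrumentation()
  Try(body(instr)) match {  // pass the instance constructed above
    case Failure(NonFatal(e)) =>
      instr.logFailure(e)
      throw e
    case Success(result) =>
      result
  }
}
~~~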


---




[GitHub] spark pull request #21719: [SPARK-24747] Make Instrumentation class more fle...

2018-07-05 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21719#discussion_r200497209
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -500,7 +500,7 @@ class LogisticRegression @Since("1.2.0") (
 
     if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
 
-    val instr = Instrumentation.create(this, dataset)
+    instr.logContext(this, dataset)
--- End diff --

It doesn't log anything. I think we should auto-generate `prefix` and keep 
it as a constant. So logs would appear as:

~~~
[PREFIX]: instrumentation started
[PREFIX]: using estimator logReg-abc128
[PREFIX]: using dataset some hashcode
[PREFIX]: param maxIter=10
[PREFIX]: ...
[PREFIX]: run succeeded/failed
[PREFIX]: instrumentation ended
~~~

We can generate 8 random chars as the PREFIX. This is sufficient to correlate 
metrics from the same run. The issue with making it mutable is that we do not 
have a way to guarantee `logContext` is always called.

So I would suggest replacing logContext with the following:

* logEstimator or logPipelineStage
* logDataset
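
A rough sketch of that shape (method names follow the suggestion above; the exact messages and the reuse of `Utils.getSimpleName` are assumptions carried over from the existing code):

~~~
import java.util.UUID

import org.apache.spark.internal.Logging
import org.apache.spark.ml.Estimator
import org.apache.spark.rdd.RDD
import org.apache.spark.util.Utils

private[spark] class Instrumentation extends Logging {

  // Constant, auto-generated prefix: 8 random chars are enough to
  // correlate all log lines emitted by the same run.
  private val prefix = s"[${UUID.randomUUID().toString.take(8)}]: "

  override protected def logInfo(msg: => String): Unit = super.logInfo(prefix + msg)

  def logEstimator(estimator: Estimator[_]): Unit = {
    // Utils.getSimpleName avoids the "Malformed class name" error
    logInfo(s"using estimator ${Utils.getSimpleName(estimator.getClass)}-${estimator.uid}")
  }

  def logDataset(dataset: RDD[_]): Unit = {
    logInfo(s"using dataset ${dataset.hashCode()}")
  }
}
~~~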

Btw, we could log the call site by default. It would provide more info for 
instrumentation, though that's not necessary in this PR.


---




[GitHub] spark pull request #21719: [SPARK-24747] Make Instrumentation class more fle...

2018-07-05 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21719#discussion_r200498732
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala ---
@@ -148,12 +170,25 @@ private[spark] class Instrumentation[E <: Estimator[_]] private (
   }
 
 
+  // TODO: Remove this (possibly replace with logModel?)
   /**
    * Logs the successful completion of the training session.
    */
   def logSuccess(model: Model[_]): Unit = {
     log(s"training finished")
   }
+
+  def logSuccess(): Unit = {
+    log("training finished")
--- End diff --

We shouldn't have this `log` alias; it left me wondering which log level it 
uses. Just use `logInfo` and remove `log`.
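i.e., a sketch:

~~~
def logSuccess(): Unit = {
  logInfo("training finished")
}
~~~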


---




[GitHub] spark pull request #21719: [SPARK-24747] Make Instrumentation class more fle...

2018-07-05 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21719#discussion_r200499315
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala ---
@@ -169,22 +204,33 @@ private[spark] object Instrumentation {
     val varianceOfLabels = "varianceOfLabels"
   }
 
+  // TODO: Remove these
   /**
    * Creates an instrumentation object for a training session.
    */
-  def create[E <: Estimator[_]](
-      estimator: E, dataset: Dataset[_]): Instrumentation[E] = {
-    create[E](estimator, dataset.rdd)
+  def create(estimator: Estimator[_], dataset: Dataset[_]): Instrumentation = {
+    create(estimator, dataset.rdd)
   }
 
   /**
    * Creates an instrumentation object for a training session.
   */
-  def create[E <: Estimator[_]](
-      estimator: E, dataset: RDD[_]): Instrumentation[E] = {
-    new Instrumentation[E](estimator, dataset)
+  def create(estimator: Estimator[_], dataset: RDD[_]): Instrumentation = {
+    new Instrumentation(estimator, dataset)
+  }
+  // end remove
+
+  def instrumented[T](body: (Instrumentation => T)): T = {
+    val instr = new Instrumentation()
+    Try(body(new Instrumentation())) match {
+      case Failure(NonFatal(e)) =>
+        instr.logFailure(e)
+        throw e
+      case Success(model) =>
--- End diff --

`model` -> `result`; it doesn't need to be a model.
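i.e., a sketch of the renamed branch (its body isn't shown in the hunk, so returning the value directly is an assumption):

~~~
case Success(result) =>
  result
~~~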


---




[GitHub] spark pull request #21719: [SPARK-24747] Make Instrumentation class more fle...

2018-07-05 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21719#discussion_r200497861
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala ---
@@ -35,32 +36,47 @@ import org.apache.spark.util.Utils
 /**
  * A small wrapper that defines a training session for an estimator, and some methods to log
  * useful information during this session.
- *
- * A new instance is expected to be created within fit().
- *
- * @param estimator the estimator that is being fit
- * @param dataset the training dataset
- * @tparam E the type of the estimator
  */
-private[spark] class Instrumentation[E <: Estimator[_]] private (
-    val estimator: E,
-    val dataset: RDD[_]) extends Logging {
+private[spark] class Instrumentation extends Logging {
 
   private val id = UUID.randomUUID()
-  private val prefix = {
-    // estimator.getClass.getSimpleName can cause Malformed class name error,
-    // call safer `Utils.getSimpleName` instead
-    val className = Utils.getSimpleName(estimator.getClass)
-    s"$className-${estimator.uid}-${dataset.hashCode()}-$id: "
+  private val shortId = id.toString.take(8)
+  private var prefix = s"$shortId:"
+
+  // TODO: update spark.ml to use new Instrumentation APIs and remove this constructor
+  var estimator: Estimator[_] = _
+  private def this(estimator: Estimator[_], dataset: RDD[_]) = {
+    this()
+    logContext(estimator, dataset)
   }
 
-  init()
+  /**
+   * Log info about the estimator and dataset being fit.
+   *
+   * @param estimator the estimator that is being fit
+   * @param dataset the training dataset
+   */
+  def logContext(estimator: Estimator[_], dataset: RDD[_]): Unit = {
--- End diff --

see my comment above


---




[GitHub] spark pull request #21719: [SPARK-24747] Make Instrumentation class more fle...

2018-07-05 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21719#discussion_r200495412
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -490,7 +490,7 @@ class LogisticRegression @Since("1.2.0") (
 
   protected[spark] def train(
       dataset: Dataset[_],
-      handlePersistence: Boolean): LogisticRegressionModel = {
+      handlePersistence: Boolean): LogisticRegressionModel = Instrumentation.instrumented { instr =>
--- End diff --

To avoid an overly wide line, we might want to import `instrumented` and drop 
the `Instrumentation.` prefix from this line.


---




[GitHub] spark pull request #21719: [SPARK-24747] Make Instrumentation class more fle...

2018-07-05 Thread MrBago
GitHub user MrBago opened a pull request:

https://github.com/apache/spark/pull/21719

[SPARK-24747] Make Instrumentation class more flexible

## What changes were proposed in this pull request?

This PR updates the Instrumentation class to make it more flexible and a 
little easier to use. Once these APIs are merged, I'll follow up with a PR 
that updates the training code to use the new APIs so we can remove the old 
ones. All of these changes are to private APIs, so this PR makes no 
user-facing changes.

## How was this patch tested?

Existing tests.

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MrBago/spark new-instrumentation-apis

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21719.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21719


commit 03c9b0a9fa664171c3a6a8264fe04b3488dae4e4
Author: Bago Amirbekian 
Date:   2018-06-25T22:52:18Z

Added Instrumentation.instrumented API with required changes to
Instrumentation class. Updated LogisticRegression to use this API as an
example.

commit 3a6537d2861c1f6dd65b717772eb4c1f2dc7c174
Author: Bago Amirbekian 
Date:   2018-07-05T18:16:16Z

Allow `instrumented` method to return any type.




---
