[GitHub] spark pull request #18428: [Spark-21221][ML] CrossValidator and TrainValidat...

2017-07-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/18428


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18428: [Spark-21221][ML] CrossValidator and TrainValidat...

2017-07-14 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18428#discussion_r127548975
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/tuning/ValidatorParamsSuiteHelpers.scala
 ---
@@ -0,0 +1,87 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tuning
+
+import java.io.File
+import java.nio.file.{Files, StandardCopyOption}
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.param.{ParamMap, ParamPair, Params}
+import org.apache.spark.ml.util.{DefaultReadWriteTest, Identifiable, 
MLReader, MLWritable}
+
+object ValidatorParamsSuiteHelpers extends SparkFunSuite with 
DefaultReadWriteTest {
+  /**
+   * Assert sequences of estimatorParamMaps are identical.
+   * If the values for a parameter are not directly comparable with ===
+   * and are instead Params types themselves then their corresponding 
paramMaps
+   * are compared against each other.
+   */
+  def compareParamMaps(pMaps: Array[ParamMap], pMaps2: Array[ParamMap]): 
Unit = {
+assert(pMaps.length === pMaps2.length)
+pMaps.zip(pMaps2).foreach { case (pMap, pMap2) =>
+  assert(pMap.size === pMap2.size)
+  pMap.toSeq.foreach { case ParamPair(p, v) =>
+assert(pMap2.contains(p))
+val otherParam = pMap2(p)
+v match {
+  case estimator: Params =>
+otherParam match {
+  case estimator2: Params =>
+val estimatorParamMap = Array(estimator.extractParamMap())
+val estimatorParamMap2 = 
Array(estimator2.extractParamMap())
+compareParamMaps(estimatorParamMap, estimatorParamMap2)
+  case other =>
+throw new AssertionError(s"Expected parameter of type 
Params but" +
+  s" found ${otherParam.getClass.getName}")
+}
+  case _ =>
+assert(otherParam === v)
+}
+  }
+}
+  }
+
+  /**
+   * When nested estimators (ex. OneVsRest) are saved within 
meta-algorithms such as
+   * CrossValidator and TrainValidationSplit, relative paths should be 
used to store
+   * the path of the estimator so that if the parent directory changes, 
loading the
+   * model still works.
+   */
+  def testFileMove[T <: Params with MLWritable](instance: T): Unit = {
+val uid = instance.uid
+val subdirName = Identifiable.randomUID("test")
+
+val subdir = new File(tempDir, subdirName)
+val subDirWithUid = new File(subdir, uid)
+
+instance.save(subDirWithUid.getPath)
+
+val newSubdirName = Identifiable.randomUID("test_moved")
+val newSubdir = new File(tempDir, newSubdirName)
+val newSubdirWithUid = new File(newSubdir, uid)
+
+Files.createDirectory(newSubdir.toPath)
+Files.createDirectory(newSubdirWithUid.toPath)
+Files.move(subDirWithUid.toPath, newSubdirWithUid.toPath, 
StandardCopyOption.ATOMIC_MOVE)
+
+val loader = instance.getClass.getMethod("read")
+  .invoke(null).asInstanceOf[MLReader[T]]
--- End diff --

fix indentation


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18428: [Spark-21221][ML] CrossValidator and TrainValidat...

2017-07-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18428#discussion_r127057083
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala ---
@@ -183,8 +198,15 @@ private[ml] object ValidatorParams {
   val paramPairs = pMap.map { case pInfo: Map[String, String] =>
 val est = uidToParams(pInfo("parent"))
 val param = est.getParam(pInfo("name"))
-val value = param.jsonDecode(pInfo("value"))
-param -> value
+if (!pInfo.contains("isJson") ||
+   (pInfo.contains("isJson") && 
pInfo("isJson").toBoolean.booleanValue())) {
--- End diff --

style nit: indent line 202 +1 space

Also, could you please add a comment saying that SPARK-21221 introduced the 
"isJson" field?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18428: [Spark-21221][ML] CrossValidator and TrainValidat...

2017-07-11 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18428#discussion_r126849117
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala ---
@@ -183,8 +198,14 @@ private[ml] object ValidatorParams {
   val paramPairs = pMap.map { case pInfo: Map[String, String] =>
 val est = uidToParams(pInfo("parent"))
 val param = est.getParam(pInfo("name"))
-val value = param.jsonDecode(pInfo("value"))
-param -> value
+if (pInfo("isJson").toBoolean.booleanValue()) {
--- End diff --

I *think* fixing backwards compatibility will just mean testing for whether 
the field "isJson" is present here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18428: [Spark-21221][ML] CrossValidator and TrainValidat...

2017-07-06 Thread ajaysaini725
Github user ajaysaini725 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18428#discussion_r126014344
  
--- Diff: python/pyspark/ml/classification.py ---
@@ -1630,8 +1614,52 @@ def _to_java(self):
 _java_obj.setPredictionCol(self.getPredictionCol())
 return _java_obj
 
+def _make_java_param_pair(self, param, value):
--- End diff --

I tried making OneVsRest extend JavaParams but for some reason when I set 
self._java_obj in the __init__() function as is done with other classes that 
inherit from JavaParams I get an error from the make_java_param_pair() function 
in wrapper.py that says that _java_obj does not have an id. From what I can 
tell the implementation is identical to that of other estimators so I'm not 
sure what's wrong. To avoid spending too much time on it, I pushed an update 
with all changes made except for this one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18428: [Spark-21221][ML] CrossValidator and TrainValidat...

2017-07-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18428#discussion_r125176264
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -681,6 +682,76 @@ def test_save_load(self):
 self.assertEqual(loadedLrModel.uid, lrModel.uid)
 self.assertEqual(loadedLrModel.intercept, lrModel.intercept)
 
+def test_save_load_simple_estimator(self):
+temp_path = tempfile.mkdtemp()
+dataset = self.spark.createDataFrame(
+[(Vectors.dense([0.0]), 0.0),
+ (Vectors.dense([0.4]), 1.0),
+ (Vectors.dense([0.5]), 0.0),
+ (Vectors.dense([0.6]), 1.0),
+ (Vectors.dense([1.0]), 1.0)] * 10,
+["features", "label"])
+
+lr = LogisticRegression()
+grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
+evaluator = BinaryClassificationEvaluator()
+
+# test save/load of CrossValidator
+cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, 
evaluator=evaluator)
+cvModel = cv.fit(dataset)
+cvPath = temp_path + "/cv"
+cv.save(cvPath)
+loadedCV = CrossValidator.load(cvPath)
+self.assertEqual(loadedCV.getEstimator().uid, 
cv.getEstimator().uid)
+self.assertEqual(loadedCV.getEvaluator().uid, 
cv.getEvaluator().uid)
+self.assertEqual(loadedCV.getEstimatorParamMaps(), 
cv.getEstimatorParamMaps())
+
+# test save/load of CrossValidatorModel
+cvModelPath = temp_path + "/cvModel"
+cvModel.save(cvModelPath)
+loadedModel = CrossValidatorModel.load(cvModelPath)
+self.assertEqual(loadedModel.bestModel.uid, cvModel.bestModel.uid)
+
+def test_save_load_nested_stimator(self):
--- End diff --

fix typo "stimator"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18428: [Spark-21221][ML] CrossValidator and TrainValidat...

2017-07-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18428#discussion_r125176348
  
--- Diff: python/pyspark/ml/tuning.py ---
@@ -263,8 +301,60 @@ def copy(self, extra=None):
 newCV.setEvaluator(self.getEvaluator().copy(extra))
 return newCV
 
+@since("2.3.0")
+def write(self):
+"""Returns an MLWriter instance for this ML instance."""
+return JavaMLWriter(self)
+
+@since("2.3.0")
+def save(self, path):
+"""Save this ML instance to the given path, a shortcut of 
`write().save(path)`."""
+self.write().save(path)
+
+@classmethod
+@since("2.3.0")
+def read(cls):
+"""Returns an MLReader instance for this class."""
+return JavaMLReader(cls)
+
+@classmethod
+@since("2.3.0")
--- End diff --

Don't add since annotations to private methods


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18428: [Spark-21221][ML] CrossValidator and TrainValidat...

2017-07-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18428#discussion_r125176320
  
--- Diff: python/pyspark/ml/tuning.py ---
@@ -137,8 +140,43 @@ def getEvaluator(self):
 """
 return self.getOrDefault(self.evaluator)
 
+def getEvaluator(self):
--- End diff --

duplicate of above method?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18428: [Spark-21221][ML] CrossValidator and TrainValidat...

2017-07-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18428#discussion_r125176253
  
--- Diff: python/pyspark/ml/classification.py ---
@@ -1646,6 +1674,15 @@ class OneVsRestModel(Model, OneVsRestParams, 
MLReadable, MLWritable):
 def __init__(self, models):
 super(OneVsRestModel, self).__init__()
 self.models = models
+java_models = [model._to_java() for model in self.models]
+sc = SparkContext._active_spark_context
+java_models_array = JavaWrapper._new_java_array(java_models,
+
sc._gateway.jvm.org.apache.spark.ml
+
.classification.ClassificationModel)
+metadata = 
JavaParams._new_java_obj("org.apache.spark.sql.types.Metadata")
--- End diff --

Leave a TODO indicating that we need to set metadata.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18428: [Spark-21221][ML] CrossValidator and TrainValidat...

2017-07-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18428#discussion_r125175867
  
--- Diff: python/pyspark/ml/classification.py ---
@@ -1630,8 +1614,52 @@ def _to_java(self):
 _java_obj.setPredictionCol(self.getPredictionCol())
 return _java_obj
 
+def _make_java_param_pair(self, param, value):
+"""
+Makes a Java parm pair.
--- End diff --

correct typo: parm -> param (and in original please)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18428: [Spark-21221][ML] CrossValidator and TrainValidat...

2017-07-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18428#discussion_r125175232
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala 
---
@@ -134,6 +134,59 @@ class TrainValidationSplitSuite
 
 assert(tvs.getTrainRatio === tvs2.getTrainRatio)
 assert(tvs.getSeed === tvs2.getSeed)
+
+TrainValidationSplitSuite
+  .compareParamMaps(tvs.getEstimatorParamMaps, 
tvs2.getEstimatorParamMaps)
+
+tvs2.getEstimator match {
+  case lr2: LogisticRegression =>
+assert(lr.uid === lr2.uid)
+assert(lr.getMaxIter === lr2.getMaxIter)
+  case other =>
+throw new AssertionError(s"Loaded TrainValidationSplit expected 
estimator of type" +
+  s" LogisticRegression but found ${other.getClass.getName}")
+}
+  }
+
+  test("read/write: TrainValidationSplit with nested estimator") {
+val ova = new OneVsRest()
+  .setClassifier(new LogisticRegression)
+val evaluator = new BinaryClassificationEvaluator()
+  .setMetricName("areaUnderPR")  // not default metric
+val classifier1 = new LogisticRegression().setRegParam(2.0)
+val classifier2 = new LogisticRegression().setRegParam(3.0)
+val paramMaps = new ParamGridBuilder()
+  .addGrid(ova.classifier, Array(classifier1, classifier2))
+  .build()
+val tvs = new TrainValidationSplit()
+  .setEstimator(ova)
+  .setEvaluator(evaluator)
+  .setTrainRatio(0.5)
+  .setEstimatorParamMaps(paramMaps)
+  .setSeed(42L)
+
+val tvs2 = testDefaultReadWrite(tvs, testParams = false)
+
+assert(tvs.getTrainRatio === tvs2.getTrainRatio)
+assert(tvs.getSeed === tvs2.getSeed)
+
+tvs2.getEstimator match {
+  case ova2: OneVsRest =>
+assert(ova.uid === ova2.uid)
+val classifier = ova2.getClassifier
+classifier match {
+  case lr: LogisticRegression =>
+
assert(ova.getClassifier.asInstanceOf[LogisticRegression].getMaxIter
+  === lr.asInstanceOf[LogisticRegression].getMaxIter)
--- End diff --

lr is already of type LogisticRegression (no need to cast)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18428: [Spark-21221][ML] CrossValidator and TrainValidat...

2017-07-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18428#discussion_r125176209
  
--- Diff: python/pyspark/ml/classification.py ---
@@ -1630,8 +1614,52 @@ def _to_java(self):
 _java_obj.setPredictionCol(self.getPredictionCol())
 return _java_obj
 
+def _make_java_param_pair(self, param, value):
--- End diff --

I'm not a big fan of copying these methods.  Does it work to use the parent 
versions of this method and _transfer_param_map_to_java if you set 
self._java_obj in this class' __init__ method?  I guess you'll have to override 
_transfer_param_map_from_java, but 1 is better than 3.

Note for other reviewers: This is a short/medium-term fix to permit 
persistence, but we'll change it eventually to permit Python-only base 
estimators.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18428: [Spark-21221][ML] CrossValidator and TrainValidat...

2017-07-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18428#discussion_r125176344
  
--- Diff: python/pyspark/ml/tuning.py ---
@@ -263,8 +301,60 @@ def copy(self, extra=None):
 newCV.setEvaluator(self.getEvaluator().copy(extra))
 return newCV
 
+@since("2.3.0")
+def write(self):
+"""Returns an MLWriter instance for this ML instance."""
+return JavaMLWriter(self)
+
+@since("2.3.0")
+def save(self, path):
--- End diff --

no need to copy this here; it can use the one in MLWritable


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18428: [Spark-21221][ML] CrossValidator and TrainValidat...

2017-07-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18428#discussion_r125175799
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala ---
@@ -156,6 +156,52 @@ class CrossValidatorSuite
 CrossValidatorSuite.compareParamMaps(cv.getEstimatorParamMaps, 
cv2.getEstimatorParamMaps)
   }
 
+  test("read/write: CrossValidator with nested estimator") {
+val ova = new OneVsRest().setClassifier(new LogisticRegression)
+val evaluator = new MulticlassClassificationEvaluator()
+  .setMetricName("accuracy")
+val classifier1 = new LogisticRegression().setRegParam(2.0)
+val classifier2 = new LogisticRegression().setRegParam(3.0)
+// params that are not JSON serializable must inherit from Params
+val paramMaps = new ParamGridBuilder()
+  .addGrid(ova.classifier, Array(classifier1, classifier2))
+  .build()
+val cv = new CrossValidator()
+  .setEstimator(ova)
+  .setEvaluator(evaluator)
+  .setNumFolds(20)
+  .setEstimatorParamMaps(paramMaps)
--- End diff --

Please compare the original + the loaded estimatorParamMaps


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18428: [Spark-21221][ML] CrossValidator and TrainValidat...

2017-07-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18428#discussion_r125176362
  
--- Diff: python/pyspark/ml/wrapper.py ---
@@ -111,7 +111,14 @@ def _make_java_param_pair(self, param, value):
 sc = SparkContext._active_spark_context
 param = self._resolveParam(param)
 java_param = self._java_obj.getParam(param.name)
-java_value = _py2java(sc, value)
+if isinstance(value, Estimator) or isinstance(value, Model):
--- End diff --

check for instances of JavaParams instead?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18428: [Spark-21221][ML] CrossValidator and TrainValidat...

2017-07-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18428#discussion_r125175225
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala ---
@@ -156,6 +156,52 @@ class CrossValidatorSuite
 CrossValidatorSuite.compareParamMaps(cv.getEstimatorParamMaps, 
cv2.getEstimatorParamMaps)
   }
 
+  test("read/write: CrossValidator with nested estimator") {
+val ova = new OneVsRest().setClassifier(new LogisticRegression)
+val evaluator = new MulticlassClassificationEvaluator()
+  .setMetricName("accuracy")
+val classifier1 = new LogisticRegression().setRegParam(2.0)
+val classifier2 = new LogisticRegression().setRegParam(3.0)
+// params that are not JSON serializable must inherit from Params
+val paramMaps = new ParamGridBuilder()
+  .addGrid(ova.classifier, Array(classifier1, classifier2))
+  .build()
+val cv = new CrossValidator()
+  .setEstimator(ova)
+  .setEvaluator(evaluator)
+  .setNumFolds(20)
+  .setEstimatorParamMaps(paramMaps)
+
+val cv2 = testDefaultReadWrite(cv, testParams = false)
+
+assert(cv.uid === cv2.uid)
+assert(cv.getNumFolds === cv2.getNumFolds)
+assert(cv.getSeed === cv2.getSeed)
+
+
assert(cv2.getEvaluator.isInstanceOf[MulticlassClassificationEvaluator])
+val evaluator2 = 
cv2.getEvaluator.asInstanceOf[MulticlassClassificationEvaluator]
+assert(evaluator.uid === evaluator2.uid)
+assert(evaluator.getMetricName === evaluator2.getMetricName)
+
+cv2.getEstimator match {
+  case ova2: OneVsRest =>
+assert(ova.uid === ova2.uid)
+val classifier = ova2.getClassifier
+classifier match {
+  case lr: LogisticRegression =>
+
assert(ova.getClassifier.asInstanceOf[LogisticRegression].getMaxIter
+  === lr.asInstanceOf[LogisticRegression].getMaxIter)
--- End diff --

lr is already of type LogisticRegression (no need to cast)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18428: [Spark-21221][ML] CrossValidator and TrainValidat...

2017-07-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18428#discussion_r125175803
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala ---
@@ -156,6 +156,52 @@ class CrossValidatorSuite
 CrossValidatorSuite.compareParamMaps(cv.getEstimatorParamMaps, 
cv2.getEstimatorParamMaps)
   }
 
+  test("read/write: CrossValidator with nested estimator") {
+val ova = new OneVsRest().setClassifier(new LogisticRegression)
+val evaluator = new MulticlassClassificationEvaluator()
+  .setMetricName("accuracy")
+val classifier1 = new LogisticRegression().setRegParam(2.0)
+val classifier2 = new LogisticRegression().setRegParam(3.0)
+// params that are not JSON serializable must inherit from Params
+val paramMaps = new ParamGridBuilder()
+  .addGrid(ova.classifier, Array(classifier1, classifier2))
+  .build()
+val cv = new CrossValidator()
+  .setEstimator(ova)
+  .setEvaluator(evaluator)
+  .setNumFolds(20)
+  .setEstimatorParamMaps(paramMaps)
--- End diff --

Same for the TrainValidationSplitSuite


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18428: [Spark-21221][ML] CrossValidator and TrainValidat...

2017-07-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18428#discussion_r125175405
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala 
---
@@ -160,8 +213,21 @@ class TrainValidationSplitSuite
   }
 }
 
-object TrainValidationSplitSuite {
-
+object TrainValidationSplitSuite extends SparkFunSuite{
+  /**
+   * Assert sequences of estimatorParamMaps are identical.
+   * Params must be simple types comparable with `===`.
+   */
+  def compareParamMaps(pMaps: Array[ParamMap], pMaps2: Array[ParamMap]): 
Unit = {
--- End diff --

If this is the same as in CrossValidatorSuite, then can you please move 
them to a shared file (maybe ValidatorParamsSuite)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18428: [Spark-21221][ML] CrossValidator and TrainValidat...

2017-06-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18428#discussion_r124184937
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala ---
@@ -126,10 +126,22 @@ private[ml] object ValidatorParams {
   extraMetadata: Option[JObject] = None): Unit = {
 import org.json4s.JsonDSL._
 
+var numParamsNotJson = 0
 val estimatorParamMapsJson = compact(render(
   instance.getEstimatorParamMaps.map { case paramMap =>
 paramMap.toSeq.map { case ParamPair(p, v) =>
-  Map("parent" -> p.parent, "name" -> p.name, "value" -> 
p.jsonEncode(v))
+  v match {
+case writeableObj: MLWritable =>
+  numParamsNotJson += 1
--- End diff --

nit: move this down 1 line to index from 0


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18428: [Spark-21221][ML] CrossValidator and TrainValidat...

2017-06-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18428#discussion_r124161422
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala ---
@@ -156,6 +156,46 @@ class CrossValidatorSuite
 CrossValidatorSuite.compareParamMaps(cv.getEstimatorParamMaps, 
cv2.getEstimatorParamMaps)
   }
 
+  test("read/write: CrossValidator with nested estimator") {
+val ova = new OneVsRest()
+  .setClassifier(new LogisticRegression)
+val evaluator = new BinaryClassificationEvaluator()
+  .setMetricName("areaUnderPR")  // not default metric
+
--- End diff --

style: remove extra newline


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18428: [Spark-21221][ML] CrossValidator and TrainValidat...

2017-06-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18428#discussion_r124185775
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala ---
@@ -183,8 +195,14 @@ private[ml] object ValidatorParams {
   val paramPairs = pMap.map { case pInfo: Map[String, String] =>
 val est = uidToParams(pInfo("parent"))
 val param = est.getParam(pInfo("name"))
-val value = param.jsonDecode(pInfo("value"))
-param -> value
+if (pInfo("isJson").toBoolean.booleanValue()) {
+  val value = param.jsonDecode(pInfo("value"))
+  param -> value
+} else {
+  val path = param.jsonDecode(pInfo("value")).toString
+  val value = 
DefaultParamsReader.loadParamsInstance[MLWritable](path, sc)
--- End diff --

This is OK with me for now since it will address all cases I've seen.  In 
the future, it'd be great to make this more general by allowing it to read any 
MLReadable type (not just DefaultParamsReadable).  I'll comment in the save() 
section above about this too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18428: [Spark-21221][ML] CrossValidator and TrainValidat...

2017-06-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18428#discussion_r124185314
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala ---
@@ -126,10 +126,22 @@ private[ml] object ValidatorParams {
   extraMetadata: Option[JObject] = None): Unit = {
 import org.json4s.JsonDSL._
 
+var numParamsNotJson = 0
 val estimatorParamMapsJson = compact(render(
   instance.getEstimatorParamMaps.map { case paramMap =>
 paramMap.toSeq.map { case ParamPair(p, v) =>
-  Map("parent" -> p.parent, "name" -> p.name, "value" -> 
p.jsonEncode(v))
+  v match {
+case writeableObj: MLWritable =>
+  numParamsNotJson += 1
+  val paramPath = new Path(path, "param" + p.name + 
numParamsNotJson).toString
--- End diff --

How about changing the prefix "param" -> "epm_"?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18428: [Spark-21221][ML] CrossValidator and TrainValidat...

2017-06-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18428#discussion_r124161463
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala ---
@@ -156,6 +156,46 @@ class CrossValidatorSuite
 CrossValidatorSuite.compareParamMaps(cv.getEstimatorParamMaps, 
cv2.getEstimatorParamMaps)
   }
 
+  test("read/write: CrossValidator with nested estimator") {
+val ova = new OneVsRest()
+  .setClassifier(new LogisticRegression)
+val evaluator = new BinaryClassificationEvaluator()
+  .setMetricName("areaUnderPR")  // not default metric
--- End diff --

Is this needed for this unit test?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18428: [Spark-21221][ML] CrossValidator and TrainValidat...

2017-06-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18428#discussion_r124180242
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala 
---
@@ -136,6 +136,29 @@ class TrainValidationSplitSuite
 assert(tvs.getSeed === tvs2.getSeed)
   }
 
+  test("read/write: TrainValidationSplit with nested estimator") {
+val ova = new OneVsRest()
+  .setClassifier(new LogisticRegression)
+val evaluator = new BinaryClassificationEvaluator()
+  .setMetricName("areaUnderPR")  // not default metric
+val classifier1 = new LogisticRegression().setRegParam(2.0)
+val classifier2 = new LogisticRegression().setRegParam(3.0)
+val paramMaps = new ParamGridBuilder()
+  .addGrid(ova.classifier, Array(classifier1, classifier2))
+  .build()
+val tvs = new TrainValidationSplit()
+  .setEstimator(ova)
+  .setEvaluator(evaluator)
+  .setTrainRatio(0.5)
+  .setEstimatorParamMaps(paramMaps)
+  .setSeed(42L)
+
+val tvs2 = testDefaultReadWrite(tvs, testParams = false)
+
+assert(tvs.getTrainRatio === tvs2.getTrainRatio)
+assert(tvs.getSeed === tvs2.getSeed)
--- End diff --

check classifier in paramMaps too


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18428: [Spark-21221][ML] CrossValidator and TrainValidat...

2017-06-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18428#discussion_r124168033
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala ---
@@ -156,6 +156,46 @@ class CrossValidatorSuite
 CrossValidatorSuite.compareParamMaps(cv.getEstimatorParamMaps, 
cv2.getEstimatorParamMaps)
   }
 
+  test("read/write: CrossValidator with nested estimator") {
+val ova = new OneVsRest()
+  .setClassifier(new LogisticRegression)
+val evaluator = new BinaryClassificationEvaluator()
+  .setMetricName("areaUnderPR")  // not default metric
+
+
+val classifier1 = new LogisticRegression().setRegParam(2.0)
+val classifier2 = new LogisticRegression().setRegParam(3.0)
+val paramMaps = new ParamGridBuilder()
+  .addGrid(ova.classifier, Array(classifier1, classifier2))
+  .build()
+val cv = new CrossValidator()
+  .setEstimator(ova)
+  .setEvaluator(evaluator)
+  .setNumFolds(20)
+  .setEstimatorParamMaps(paramMaps)
+
+val cv2 = testDefaultReadWrite(cv, testParams = false)
+
+assert(cv.uid === cv2.uid)
+assert(cv.getNumFolds === cv2.getNumFolds)
+assert(cv.getSeed === cv2.getSeed)
+
+assert(cv2.getEvaluator.isInstanceOf[BinaryClassificationEvaluator])
+val evaluator2 = 
cv2.getEvaluator.asInstanceOf[BinaryClassificationEvaluator]
+assert(evaluator.uid === evaluator2.uid)
+assert(evaluator.getMetricName === evaluator2.getMetricName)
+
+cv2.getEstimator match {
+  case ova2: OneVsRest =>
+assert(ova.uid === ova2.uid)
+
assert(ova.getClassifier.asInstanceOf[LogisticRegression].getMaxIter
--- End diff --

Check type of classifier before casting


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18428: [Spark-21221][ML] CrossValidator and TrainValidat...

2017-06-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18428#discussion_r124161468
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala ---
@@ -156,6 +156,46 @@ class CrossValidatorSuite
 CrossValidatorSuite.compareParamMaps(cv.getEstimatorParamMaps, 
cv2.getEstimatorParamMaps)
   }
 
+  test("read/write: CrossValidator with nested estimator") {
+val ova = new OneVsRest()
+  .setClassifier(new LogisticRegression)
--- End diff --

style: fix indentation


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18428: [Spark-21221][ML] CrossValidator and TrainValidat...

2017-06-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18428#discussion_r124185896
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala ---
@@ -126,10 +126,22 @@ private[ml] object ValidatorParams {
   extraMetadata: Option[JObject] = None): Unit = {
 import org.json4s.JsonDSL._
 
+var numParamsNotJson = 0
 val estimatorParamMapsJson = compact(render(
   instance.getEstimatorParamMaps.map { case paramMap =>
 paramMap.toSeq.map { case ParamPair(p, v) =>
-  Map("parent" -> p.parent, "name" -> p.name, "value" -> 
p.jsonEncode(v))
+  v match {
+case writeableObj: MLWritable =>
--- End diff --

Per my comment below in the load() section, this should be restricted to 
DefaultParamsWritable for now.  Could you please do so, but also add a check 
which throws an error if `v` is MLWritable but not DefaultParamsWritable?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18428: [Spark-21221][ML] CrossValidator and TrainValidat...

2017-06-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18428#discussion_r124161588
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala ---
@@ -156,6 +156,46 @@ class CrossValidatorSuite
 CrossValidatorSuite.compareParamMaps(cv.getEstimatorParamMaps, 
cv2.getEstimatorParamMaps)
   }
 
+  test("read/write: CrossValidator with nested estimator") {
+val ova = new OneVsRest()
+  .setClassifier(new LogisticRegression)
+val evaluator = new BinaryClassificationEvaluator()
+  .setMetricName("areaUnderPR")  // not default metric
+
+
+val classifier1 = new LogisticRegression().setRegParam(2.0)
+val classifier2 = new LogisticRegression().setRegParam(3.0)
+val paramMaps = new ParamGridBuilder()
+  .addGrid(ova.classifier, Array(classifier1, classifier2))
--- End diff --

Add comment that it is important to test Param values which inherit from 
Params.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org