[GitHub] spark pull request #18281: [SPARK-21027][SPARK-21028][ML][PYTHON] Added tuna...

2017-06-13 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18281#discussion_r121734092
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala ---
@@ -65,6 +67,12 @@ private[ml] trait OneVsRestParams extends 
PredictorParams with ClassifierTypeTra
 
   /** @group getParam */
   def getClassifier: ClassifierType = $(classifier)
+
+  val parallelism = new IntParam(this, "parallelism",
+"parallelism parameter for tuning amount of parallelism", 
ParamValidators.gtEq(1))
--- End diff --

Also
* add Scala doc
* add Since annotation
* Mark this as an expertParam (the "group" annotation in the Scala doc); 
see other parts of the code for examples


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18281: [SPARK-21027][SPARK-21028][ML][PYTHON] Added tuna...

2017-06-13 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18281#discussion_r121737343
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1229,7 +1229,35 @@ def test_output_columns(self):
  (2.0, Vectors.dense(0.5, 0.5))],
 ["label", "features"])
 lr = LogisticRegression(maxIter=5, regParam=0.01)
-ovr = OneVsRest(classifier=lr)
+ovr = OneVsRest(classifier=lr, parallelism=1)
+model = ovr.fit(df)
+output = model.transform(df)
+self.assertEqual(output.columns, ["label", "features", 
"prediction"])
+
+
+class ParOneVsRestTests(SparkSessionTestCase):
+
+def test_copy(self):
--- End diff --

What is this testing?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18281: [SPARK-21027][SPARK-21028][ML][PYTHON] Added tuna...

2017-06-13 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18281#discussion_r121735342
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala ---
@@ -101,6 +101,40 @@ class OneVsRestSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defau
 assert(expectedMetrics.confusionMatrix ~== ovaMetrics.confusionMatrix 
absTol 400)
   }
 
+  test("one-vs-rest: tuning parallelism produces correct output") {
+val numClasses = 3
+val ova = new OneVsRest()
+  .setClassifier(new LogisticRegression)
+  .setParallelism(8)
+assert(ova.getLabelCol === "label")
--- End diff --

No need to check Params.  That's not what this unit test is for.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18281: [SPARK-21027][SPARK-21028][ML][PYTHON] Added tuna...

2017-06-13 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18281#discussion_r121735271
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala ---
@@ -101,6 +101,40 @@ class OneVsRestSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defau
 assert(expectedMetrics.confusionMatrix ~== ovaMetrics.confusionMatrix 
absTol 400)
   }
 
+  test("one-vs-rest: tuning parallelism produces correct output") {
+val numClasses = 3
+val ova = new OneVsRest()
+  .setClassifier(new LogisticRegression)
+  .setParallelism(8)
+assert(ova.getLabelCol === "label")
+assert(ova.getPredictionCol === "prediction")
+val ovaModel = ova.fit(dataset)
+
+MLTestingUtils.checkCopyAndUids(ova, ovaModel)
--- End diff --

This is a generic test which only needs to be done in 1 test for each 
algorithm.  You can remove it here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18281: [SPARK-21027][SPARK-21028][ML][PYTHON] Added tuna...

2017-06-13 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18281#discussion_r121737455
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1229,7 +1229,35 @@ def test_output_columns(self):
  (2.0, Vectors.dense(0.5, 0.5))],
 ["label", "features"])
 lr = LogisticRegression(maxIter=5, regParam=0.01)
-ovr = OneVsRest(classifier=lr)
+ovr = OneVsRest(classifier=lr, parallelism=1)
+model = ovr.fit(df)
+output = model.transform(df)
+self.assertEqual(output.columns, ["label", "features", 
"prediction"])
+
+
+class ParOneVsRestTests(SparkSessionTestCase):
+
+def test_copy(self):
+df = self.spark.createDataFrame([(0.0, Vectors.dense(1.0, 0.8)),
+ (1.0, Vectors.sparse(2, [], [])),
+ (2.0, Vectors.dense(0.5, 0.5))],
+["label", "features"])
+lr = LogisticRegression(maxIter=5, regParam=0.01)
+ovr = OneVsRest(classifier=lr, parallelism=8)
+ovr1 = ovr.copy({lr.maxIter: 10})
+self.assertEqual(ovr.getClassifier().getMaxIter(), 5)
+self.assertEqual(ovr1.getClassifier().getMaxIter(), 10)
+model = ovr.fit(df)
+model1 = model.copy({model.predictionCol: "indexed"})
+self.assertEqual(model1.getPredictionCol(), "indexed")
+
+def test_output_columns(self):
--- End diff --

Ditto: is this needed?

How about adding a test like the one I proposed for Scala, which makes sure 
the same model is learned regardless of parallelism?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18281: [SPARK-21027][SPARK-21028][ML][PYTHON] Added tuna...

2017-06-13 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18281#discussion_r121736656
  
--- Diff: python/pyspark/ml/classification.py ---
@@ -1560,14 +1566,27 @@ def trainSingleClass(index):
 (classifier.predictionCol, predictionCol)])
 return classifier.fit(trainingDataset, paramMap)
 
-# TODO: Parallel training for all classes.
-models = [trainSingleClass(i) for i in range(numClasses)]
+pool = ThreadPool(processes=self.getParallelism())
+
+models = pool.map(trainSingleClass, range(numClasses))
 
 if handlePersistence:
 multiclassLabeled.unpersist()
 
 return self._copyValues(OneVsRestModel(models=models))
 
+def setParallelism(self, value):
--- End diff --

Add Since annotations


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18281: [SPARK-21027][SPARK-21028][ML][PYTHON] Added tuna...

2017-06-13 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18281#discussion_r121735870
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala ---
@@ -101,6 +101,40 @@ class OneVsRestSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defau
 assert(expectedMetrics.confusionMatrix ~== ovaMetrics.confusionMatrix 
absTol 400)
   }
 
+  test("one-vs-rest: tuning parallelism produces correct output") {
--- End diff --

"produces correct output" --> "does not affect output"

This test appears to check OVR vs. another algorithm.  I think a more 
precise test would check that tuning parallelism still produces exactly the 
same models.  Could you please update it to do so?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18281: [SPARK-21027][SPARK-21028][ML][PYTHON] Added tuna...

2017-06-13 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18281#discussion_r121736014
  
--- Diff: python/pyspark/ml/classification.py ---
@@ -1510,21 +1511,26 @@ class OneVsRest(Estimator, OneVsRestParams, 
MLReadable, MLWritable):
 
 .. versionadded:: 2.0.0
 """
+parallelism = Param(Params._dummy(), "parallelism",
+"Number of models to fit in parallel",
--- End diff --

When you update the doc in Scala, you can update it here too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18281: [SPARK-21027][SPARK-21028][ML][PYTHON] Added tuna...

2017-06-13 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18281#discussion_r121740343
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala ---
@@ -325,8 +343,13 @@ final class OneVsRest @Since("1.4.0") (
   multiclassLabeled.persist(StorageLevel.MEMORY_AND_DISK)
 }
 
+val iters = Range(0, numClasses).par
--- End diff --

CC @thunterdb just to double-check


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18281: [SPARK-21027][SPARK-21028][ML][PYTHON] Added tuna...

2017-06-13 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18281#discussion_r121733736
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala ---
@@ -65,6 +67,12 @@ private[ml] trait OneVsRestParams extends 
PredictorParams with ClassifierTypeTra
 
   /** @group getParam */
   def getClassifier: ClassifierType = $(classifier)
+
+  val parallelism = new IntParam(this, "parallelism",
+"parallelism parameter for tuning amount of parallelism", 
ParamValidators.gtEq(1))
--- End diff --

This is not very informative.  Can you please make it more explicit?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18281: [SPARK-21027][SPARK-21028][ML][PYTHON] Added tuna...

2017-06-13 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18281#discussion_r121734558
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala ---
@@ -283,6 +295,12 @@ final class OneVsRest @Since("1.4.0") (
   }
 
   /** @group setParam */
+  @Since("1.4.0")
--- End diff --

next release will be 2.3


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18281: [SPARK-21027][SPARK-21028][ML][PYTHON] Added tuna...

2017-06-13 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18281#discussion_r121737179
  
--- Diff: python/pyspark/ml/classification.py ---
@@ -1560,14 +1566,27 @@ def trainSingleClass(index):
 (classifier.predictionCol, predictionCol)])
 return classifier.fit(trainingDataset, paramMap)
 
-# TODO: Parallel training for all classes.
-models = [trainSingleClass(i) for i in range(numClasses)]
+pool = ThreadPool(processes=self.getParallelism())
--- End diff --

One new thought: It'd be good to set processes to min(parallelism, 
numClasses).  Same in Scala.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18281: [SPARK-21027][SPARK-21028][ML][PYTHON] Added tuna...

2017-06-13 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18281#discussion_r121735491
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala ---
@@ -101,6 +101,40 @@ class OneVsRestSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defau
 assert(expectedMetrics.confusionMatrix ~== ovaMetrics.confusionMatrix 
absTol 400)
   }
 
+  test("one-vs-rest: tuning parallelism produces correct output") {
+val numClasses = 3
+val ova = new OneVsRest()
+  .setClassifier(new LogisticRegression)
+  .setParallelism(8)
+assert(ova.getLabelCol === "label")
--- End diff --

There are several things like that below; just go through and remove items 
which are not part of this unit test.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org