[spark] branch branch-3.0 updated: [SPARK-32092][ML][PYSPARK] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()

2020-08-22 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
 new 85c9e8c  [SPARK-32092][ML][PYSPARK] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()
85c9e8c is described below

commit 85c9e8c54c30b69c39075e97cd3cac295be09303
Author: Louiszr 
AuthorDate: Sat Aug 22 09:27:31 2020 -0500

[SPARK-32092][ML][PYSPARK] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()

### What changes were proposed in this pull request?

Changed the definitions of 
`CrossValidatorModel.copy()/_to_java()/_from_java()` so that exposed parameters 
(i.e. parameters with `get()` methods) are copied in these methods.
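
For context, the shape of the `copy()` change is roughly the sketch below. The
`python/pyspark/ml/tuning.py` hunk itself is not included in the truncated diff
further down, so treat this as an approximation of the idea rather than the
verbatim patch: the key point is the final `_copyValues` call, which carries
the exposed params over to the copied model instead of leaving them at their
defaults.

```python
# Sketch only (assumed to live on CrossValidatorModel in pyspark.ml.tuning).
def copy(self, extra=None):
    if extra is None:
        extra = dict()
    bestModel = self.bestModel.copy(extra)
    avgMetrics = list(self.avgMetrics)
    subModels = [
        [sub.copy() for sub in fold] for fold in self.subModels
    ] if self.subModels is not None else None
    # Propagate exposed params (numFolds, foldCol, seed, ...) to the copy.
    return self._copyValues(
        CrossValidatorModel(bestModel, avgMetrics, subModels), extra=extra)
```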

### Why are the changes needed?

Parameters are already copied by the corresponding Scala implementation of
`CrossValidatorModel.copy()`, so the Python API should behave the same way.
It also fits the semantics of `CrossValidatorModel.save()` and
`CrossValidatorModel.load()` to persist these parameters, so that the user
gets back the same model after saving and loading it. Not copying `numFolds`
also causes bugs such as array-index-out-of-bounds errors and lost sub-models,
because that parameter always falls back to its default of 3 (as described in
the JIRA ticket).
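
A hedged illustration of the symptom (the training DataFrame, path and values
below are hypothetical, not taken from the patch):

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder

lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=5, collectSubModels=True)
cvModel = cv.fit(train_df)  # train_df: any labeled DataFrame (hypothetical)

cvModel.write().overwrite().save("/tmp/cv_model")  # hypothetical path
loaded = CrossValidatorModel.load("/tmp/cv_model")
loaded.getNumFolds()  # reported the default 3 before this fix; 5 afterwards
```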

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tests for `CrossValidatorModel.copy()` and `save()`/`load()` are updated so 
that they check parameters before and after function calls.

Closes #29445 from Louiszr/master.

Authored-by: Louiszr 
Signed-off-by: Sean Owen 
(cherry picked from commit d9eb06ea37cab185f1e49c641313be9707270252)
Signed-off-by: Sean Owen 
---
 python/pyspark/ml/tests/test_tuning.py | 131 ++---
 python/pyspark/ml/tuning.py            |  67 +
 2 files changed, 172 insertions(+), 26 deletions(-)

diff --git a/python/pyspark/ml/tests/test_tuning.py b/python/pyspark/ml/tests/test_tuning.py
index 6bcc3f9..b250740 100644
--- a/python/pyspark/ml/tests/test_tuning.py
+++ b/python/pyspark/ml/tests/test_tuning.py
@@ -89,15 +89,50 @@ class CrossValidatorTests(SparkSessionTestCase):
 grid = (ParamGridBuilder()
 .addGrid(iee.inducedError, [100.0, 0.0, 1.0])
 .build())
-cv = CrossValidator(estimator=iee, estimatorParamMaps=grid, evaluator=evaluator)
+cv = CrossValidator(
+estimator=iee,
+estimatorParamMaps=grid,
+evaluator=evaluator,
+collectSubModels=True,
+numFolds=2
+)
 cvCopied = cv.copy()
-self.assertEqual(cv.getEstimator().uid, cvCopied.getEstimator().uid)
+for param in [
+lambda x: x.getEstimator().uid,
+# SPARK-32092: CrossValidator.copy() needs to copy all existing params
+lambda x: x.getNumFolds(),
+lambda x: x.getFoldCol(),
+lambda x: x.getCollectSubModels(),
+lambda x: x.getParallelism(),
+lambda x: x.getSeed()
+]:
+self.assertEqual(param(cv), param(cvCopied))
 
 cvModel = cv.fit(dataset)
 cvModelCopied = cvModel.copy()
 for index in range(len(cvModel.avgMetrics)):
self.assertTrue(abs(cvModel.avgMetrics[index] - cvModelCopied.avgMetrics[index])
 < 0.0001)
+# SPARK-32092: CrossValidatorModel.copy() needs to copy all existing params
+for param in [
+lambda x: x.getNumFolds(),
+lambda x: x.getFoldCol(),
+lambda x: x.getSeed()
+]:
+self.assertEqual(param(cvModel), param(cvModelCopied))
+
+cvModel.avgMetrics[0] = 'foo'
+self.assertNotEqual(
+cvModelCopied.avgMetrics[0],
+'foo',
+"Changing the original avgMetrics should not affect the copied 
model"
+)
+cvModel.subModels[0] = 'foo'
+self.assertNotEqual(
+cvModelCopied.subModels[0],
+'foo',
+"Changing the original subModels should not affect the copied 
model"
+)
 
 def test_fit_minimize_metric(self):
 dataset = self.spark.createDataFrame([
@@ -166,16 +201,39 @@ class CrossValidatorTests(SparkSessionTestCase):
 lr = LogisticRegression()
 grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
 evaluator = BinaryClassificationEvaluator()
-cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
+cv = CrossValidator(
+estimator=lr,
+estimatorParamMaps=grid,
+evaluator=evaluator,
+collectSubModels=True,
+numFolds=4,
+seed=42
+)
 cvModel = cv.fit(dataset)
 
