[spark] branch branch-3.0 updated: [SPARK-32092][ML][PYSPARK] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 85c9e8c  [SPARK-32092][ML][PYSPARK] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()
85c9e8c is described below

commit 85c9e8c54c30b69c39075e97cd3cac295be09303
Author: Louiszr
AuthorDate: Sat Aug 22 09:27:31 2020 -0500

    [SPARK-32092][ML][PYSPARK] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()

    ### What changes were proposed in this pull request?

    Changed the definitions of `CrossValidatorModel.copy()`/`_to_java()`/`_from_java()` so that exposed parameters (i.e. parameters with `get()` methods) are copied in these methods.

    ### Why are the changes needed?

    Parameters are copied in the corresponding Scala implementation of `CrossValidatorModel.copy()`. It also fits the semantics of persistence: after calling `CrossValidatorModel.save()` and then `CrossValidatorModel.load()`, the user should get back the same model. Not copying `numFolds` across also causes bugs such as array-index-out-of-bounds errors and lost sub-models, because this parameter would always fall back to its default of 3 (as described in the JIRA ticket).

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Tests for `CrossValidatorModel.copy()` and `save()`/`load()` are updated so that they check parameters before and after the function calls.

    Closes #29445 from Louiszr/master.
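To make the rationale above concrete, here is a simplified pure-Python sketch (not the actual PySpark source; `Model`, `setParam`, and `copy_buggy`/`copy_fixed` are hypothetical names) of the bug class this patch fixes: a `copy()` that rebuilds an object from only some of its fields silently resets every parameter it forgets to carry over.

```python
# Hypothetical, simplified stand-in for CrossValidatorModel: a copy() that
# forgets the param map makes every explicitly-set param (e.g. numFolds)
# fall back to its default in the copy.

class Model:
    def __init__(self, bestModel, avgMetrics):
        self.bestModel = bestModel
        self.avgMetrics = avgMetrics
        self._paramMap = {}  # explicitly-set params, e.g. {"numFolds": 4}

    def setParam(self, name, value):
        self._paramMap[name] = value
        return self

    def getParam(self, name, default=None):
        return self._paramMap.get(name, default)

    def copy_buggy(self):
        # Forgets _paramMap entirely, like the pre-fix behavior.
        return Model(self.bestModel, list(self.avgMetrics))

    def copy_fixed(self):
        # Carries the full param map across, mirroring the PR's intent.
        copied = Model(self.bestModel, list(self.avgMetrics))
        copied._paramMap = dict(self._paramMap)
        return copied


m = Model("best", [0.8, 0.9]).setParam("numFolds", 4)
assert m.copy_buggy().getParam("numFolds", 3) == 3  # param silently lost
assert m.copy_fixed().getParam("numFolds", 3) == 4  # param preserved
```

The same reasoning applies to persistence: `save()`/`load()` must round-trip the param map, not just the fitted sub-objects.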
Authored-by: Louiszr
Signed-off-by: Sean Owen
(cherry picked from commit d9eb06ea37cab185f1e49c641313be9707270252)
Signed-off-by: Sean Owen
---
 python/pyspark/ml/tests/test_tuning.py | 131 ++---
 python/pyspark/ml/tuning.py            |  67 +
 2 files changed, 172 insertions(+), 26 deletions(-)

diff --git a/python/pyspark/ml/tests/test_tuning.py b/python/pyspark/ml/tests/test_tuning.py
index 6bcc3f9..b250740 100644
--- a/python/pyspark/ml/tests/test_tuning.py
+++ b/python/pyspark/ml/tests/test_tuning.py
@@ -89,15 +89,50 @@ class CrossValidatorTests(SparkSessionTestCase):
         grid = (ParamGridBuilder()
                 .addGrid(iee.inducedError, [100.0, 0.0, 1.0])
                 .build())
-        cv = CrossValidator(estimator=iee, estimatorParamMaps=grid, evaluator=evaluator)
+        cv = CrossValidator(
+            estimator=iee,
+            estimatorParamMaps=grid,
+            evaluator=evaluator,
+            collectSubModels=True,
+            numFolds=2
+        )
         cvCopied = cv.copy()
-        self.assertEqual(cv.getEstimator().uid, cvCopied.getEstimator().uid)
+        for param in [
+            lambda x: x.getEstimator().uid,
+            # SPARK-32092: CrossValidator.copy() needs to copy all existing params
+            lambda x: x.getNumFolds(),
+            lambda x: x.getFoldCol(),
+            lambda x: x.getCollectSubModels(),
+            lambda x: x.getParallelism(),
+            lambda x: x.getSeed()
+        ]:
+            self.assertEqual(param(cv), param(cvCopied))

         cvModel = cv.fit(dataset)
         cvModelCopied = cvModel.copy()
         for index in range(len(cvModel.avgMetrics)):
             self.assertTrue(abs(cvModel.avgMetrics[index]
                                 - cvModelCopied.avgMetrics[index]) < 0.0001)
+        # SPARK-32092: CrossValidatorModel.copy() needs to copy all existing params
+        for param in [
+            lambda x: x.getNumFolds(),
+            lambda x: x.getFoldCol(),
+            lambda x: x.getSeed()
+        ]:
+            self.assertEqual(param(cvModel), param(cvModelCopied))
+
+        cvModel.avgMetrics[0] = 'foo'
+        self.assertNotEqual(
+            cvModelCopied.avgMetrics[0],
+            'foo',
+            "Changing the original avgMetrics should not affect the copied model"
+        )
+        cvModel.subModels[0] = 'foo'
+        self.assertNotEqual(
+            cvModelCopied.subModels[0],
+            'foo',
+            "Changing the original subModels should not affect the copied model"
+        )

     def test_fit_minimize_metric(self):
         dataset = self.spark.createDataFrame([
@@ -166,16 +201,39 @@ class CrossValidatorTests(SparkSessionTestCase):
         lr = LogisticRegression()
         grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
         evaluator = BinaryClassificationEvaluator()
-        cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
+        cv = CrossValidator(
+            estimator=lr,
+            estimatorParamMaps=grid,
+            evaluator=evaluator,
+            collectSubModels=True,
+            numFolds=4,
+            seed=42
+        )
         cvModel = cv.fit(dataset)
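The new assertions that mutate `cvModel.avgMetrics[0]` and `cvModel.subModels[0]` only pass if `copy()` duplicates those mutable lists rather than aliasing them in the copied model. Plain Python shows the distinction the tests are guarding against:

```python
# A copied model must own its own avgMetrics/subModels lists; a shared
# reference would let mutations of the original leak into the copy.

original = [0.8, 0.9]

aliased = original        # shared reference: mutation leaks through
shallow = list(original)  # independent list: mutation stays local

original[0] = "foo"
assert aliased[0] == "foo"  # the alias sees the change
assert shallow[0] == 0.8    # the copy does not
```

This is why the patched `copy()` rebuilds the metric and sub-model collections instead of handing the same objects to the copy.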