[spark] branch branch-3.0 updated: [SPARK-32232][ML][PYSPARK] Make sure ML has the same default solver values between Scala and Python

2020-07-11 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
 new 8a6580c  [SPARK-32232][ML][PYSPARK] Make sure ML has the same default 
solver values between Scala and Python
8a6580c is described below

commit 8a6580cedf509b5f62175e4ed7b2e0882dd89976
Author: Huaxin Gao 
AuthorDate: Sat Jul 11 10:37:26 2020 -0500

[SPARK-32232][ML][PYSPARK] Make sure ML has the same default solver values 
between Scala and Python

### What changes were proposed in this pull request?
Current problems:
```
mlp = MultilayerPerceptronClassifier(layers=[2, 2, 2], seed=123)
model = mlp.fit(df)
path = tempfile.mkdtemp()
model_path = path + "/mlp"
model.save(model_path)
model2 = MultilayerPerceptronClassificationModel.load(model_path)
self.assertEqual(model2.getSolver(), "l-bfgs")  # fails because model2.getSolver() returns 'auto'
model2.transform(df)
# fails with:
# pyspark.sql.utils.IllegalArgumentException:
# MultilayerPerceptronClassifier_dec859ed24ec parameter solver given invalid value auto.
```
FMClassifier/FMRegressor and GeneralizedLinearRegression have the same problems.

Here are the root causes of the problems:
1. In HasSolver, both Scala and Python default solver to 'auto'.

2. On the Scala side, mlp overrides the default of solver to 'l-bfgs', FMClassifier/FMRegressor override it to 'adamW', and glr overrides it to 'irls'.

3. On the Scala side, mlp overrides the default of solver in MultilayerPerceptronClassificationParams, so both MultilayerPerceptronClassifier and MultilayerPerceptronClassificationModel have 'l-bfgs' as the default.

4. On the Python side, mlp overrides the default of solver in MultilayerPerceptronClassifier, so its default is 'l-bfgs', but MultilayerPerceptronClassificationModel doesn't override it, so it gets the default from HasSolver, which is 'auto'. In theory, we don't care about the solver value or any other param values for MultilayerPerceptronClassificationModel, because we already have the fitted model. That's why, on the Python side, we never set default values for any of the XXXModel classes.
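The asymmetry described in points 3 and 4 can be sketched with a toy class hierarchy (hypothetical classes standing in for pyspark's Params mixins, not the actual implementation):

```python
class HasSolver:
    """Shared mixin: holds the generic default, like pyspark's HasSolver."""
    _defaults = {"solver": "auto"}

    def getSolver(self):
        return self._defaults["solver"]


class Estimator(HasSolver):
    """The estimator overrides the default, like MultilayerPerceptronClassifier."""
    _defaults = {"solver": "l-bfgs"}


class Model(HasSolver):
    """The model does NOT override, so it falls back to the mixin's 'auto'."""
    pass


# The estimator and the model it produces disagree on the default:
print(Estimator().getSolver())  # l-bfgs
print(Model().getSolver())      # auto
```

Because the override lives on the estimator class rather than a shared Params mixin, the model silently inherits a different default.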

5. When calling getSolver on the loaded mlp model, this code runs underneath:
```
def _transfer_params_from_java(self):
    """
    Transforms the embedded params from the companion Java object.
    """
    ...
    # SPARK-14931: Only check set params back to avoid default params mismatch.
    if self._java_obj.isSet(java_param):
        value = _java2py(sc, self._java_obj.getOrDefault(java_param))
        self._set(**{param.name: value})
    ...
```
That's why model2.getSolver() returns 'auto'. The code doesn't copy the Scala default (in this case 'l-bfgs') into the Python param, so the Python-side default (in this case 'auto') is used.
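The skip-unset-params behavior can be demonstrated in isolation (a toy stand-in for the JVM-side params object, with hypothetical names, not the real Py4J bridge):

```python
class JavaParams:
    """Toy stand-in for the JVM-side params object (hypothetical)."""
    def __init__(self, defaults):
        self._defaults = dict(defaults)
        self._set_values = {}

    def isSet(self, name):
        # True only if the param was explicitly set, never for defaults
        return name in self._set_values

    def getOrDefault(self, name):
        return self._set_values.get(name, self._defaults[name])


# Scala side: default is 'l-bfgs', but the user never called setSolver()
java_model = JavaParams({"solver": "l-bfgs"})

# Python side mimics _transfer_params_from_java: copy only SET params
python_params = {"solver": "auto"}  # Python-side default from HasSolver
if java_model.isSet("solver"):      # False: solver was never explicitly set
    python_params["solver"] = java_model.getOrDefault("solver")

print(python_params["solver"])  # 'auto' -- the Scala default never crosses over
```

Since `isSet` is false for default values, the Scala-side 'l-bfgs' is never transferred, and the Python-side 'auto' wins.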

6. When calling model2.transform(df), this code runs underneath:
```
def _transfer_params_to_java(self):
    """
    Transforms the embedded params to the companion Java object.
    """
    ...
    if self.hasDefault(param):
        pair = self._make_java_param_pair(param, self._defaultParamMap[param])
        pair_defaults.append(pair)
    ...
```
Again, it gets the Python default solver, which is 'auto', and this causes the exception.
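Why pushing 'auto' fails can be seen with a minimal validation sketch (a toy stand-in for the JVM-side param validator; the allowed-values set and function name are illustrative assumptions):

```python
ALLOWED_SOLVERS = {"l-bfgs", "gd"}  # mlp's Scala-side solver accepts only these


def set_solver_on_java(value):
    """Toy stand-in for JVM-side param validation (hypothetical)."""
    if value not in ALLOWED_SOLVERS:
        # Mirrors the IllegalArgumentException raised on the Scala side
        raise ValueError(f"parameter solver given invalid value {value}.")
    return value


# _transfer_params_to_java pushes the Python default, which is 'auto':
try:
    set_solver_on_java("auto")
except ValueError as e:
    print(e)  # parameter solver given invalid value auto.
```

The mlp-specific solver param does not include 'auto' among its valid values, so the Python-side default is rejected the moment it reaches the JVM.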

7. Currently, on the Scala side, some algorithms set default values in the XXXParams trait, so both the estimator and the model get the default value. However, other algorithms only set the default in the estimator, so the XXXModel doesn't get it. On the Python side, we never set defaults for the XXXModel. This causes the default value inconsistency.

8. My proposed solution: set default params in XXXParams for both Scala and Python, so the estimator and the model share the same default value on both sides. This PR only changes solver; if everyone is OK with the fix, I will change all the other params as well.
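The fix pattern can be sketched with the same toy hierarchy as above (hypothetical classes, not the actual pyspark code; the mixin name echoes the PR's approach):

```python
class HasSolver:
    """Shared generic mixin with default 'auto' (toy version)."""
    _defaults = {"solver": "auto"}

    def getSolver(self):
        return self._defaults["solver"]


class _MultilayerPerceptronParams(HasSolver):
    """Algorithm-specific shared Params mixin: the override now lives HERE."""
    _defaults = {"solver": "l-bfgs"}


class Estimator(_MultilayerPerceptronParams):
    pass


class Model(_MultilayerPerceptronParams):
    pass


# Estimator and model now agree on the default:
print(Estimator().getSolver(), Model().getSolver())  # l-bfgs l-bfgs
```

Moving the override into the shared mixin means both the estimator and the loaded model resolve the same default, so getSolver() and transform() behave consistently after save/load.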

I hope my explanation makes sense to you folks :)

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existing and new tests

Closes #29060 from huaxingao/solver_parity.

Authored-by: Huaxin Gao 
Signed-off-by: Sean Owen 
(cherry picked from commit 99b4b062555329d5da968ad5dbd9e2b22a193a55)
Signed-off-by: 
