[jira] [Commented] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
    [ https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974527#comment-16974527 ]

John Bauer commented on SPARK-29691:
------------------------------------

[[SPARK-29691] ensure Param objects are valid in fit, transform|https://github.com/apache/spark/pull/26527]

> Estimator fit method fails to copy params (in PySpark)
> ------------------------------------------------------
>
> Key: SPARK-29691
> URL: https://issues.apache.org/jira/browse/SPARK-29691
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.4.4
> Reporter: John Bauer
> Priority: Minor
>
> Estimator `fit` method is supposed to copy a dictionary of params,
> overwriting the estimator's previous values, before fitting the model.
> However, the parameter values are not updated. This was observed in PySpark,
> but may be present in the Java objects, as the PySpark code appears to be
> functioning correctly. (The copy method that interacts with Java is
> actually implemented in Params.)
> For example, this prints
> Before: 0.8
> After: 0.8
> but After should be 0.75
> {code:python}
> from pyspark.ml.classification import LogisticRegression
> # Load training data
> training = spark \
>     .read \
>     .format("libsvm") \
>     .load("data/mllib/sample_multiclass_classification_data.txt")
> lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
> print("Before:", lr.getOrDefault("elasticNetParam"))
> # Fit the model, but with an updated parameter setting:
> lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})
> print("After:", lr.getOrDefault("elasticNetParam"))
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
    [ https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967773#comment-16967773 ]

John Bauer commented on SPARK-29691:
------------------------------------

Yes, I can do that. An error message suggesting a call to getParam would get people on track.

(I think that extending the API to include parameter names as above could be done safely, with a check that they could be bound to self, and an additional check in Pipeline.fit to prevent them being broadcast across a pipeline.)
[jira] [Commented] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
    [ https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967748#comment-16967748 ]

Bryan Cutler commented on SPARK-29691:
--------------------------------------

[~JohnHBauer] I'm not sure we should extend the API to accept parameter names too, but it should definitely check that values in {{extra}} are an instance of {{Param}} and raise an error if not. Would that be ok with you and could you do a PR for this?
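The check Bryan describes can be sketched without Spark. This is only an illustration under stated assumptions: `Param` here is a hypothetical stand-in for `pyspark.ml.param.Param`, `validate_extra` is an invented helper name, and the check is applied to the dictionary keys, since those are what `fit` expects to be `Param` objects. The error text follows John's suggestion of pointing users at `getParam`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Param:
    """Hypothetical stand-in for pyspark.ml.param.Param (assumption)."""
    parent: str
    name: str


def validate_extra(extra):
    """Return a copy of `extra`, raising if any key is not a Param."""
    if extra is None:
        return {}
    for key in extra:
        if not isinstance(key, Param):
            # Fail loudly instead of silently ignoring the entry.
            raise TypeError(
                f"expected a Param key, got {type(key).__name__} {key!r}; "
                "look the Param up first, e.g. with getParam(name)"
            )
    return dict(extra)
```

With a check like this, passing `{"elasticNetParam": 0.75}` would fail immediately rather than fitting with the old value.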
[jira] [Commented] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
    [ https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967741#comment-16967741 ]

John Bauer commented on SPARK-29691:
------------------------------------

I wonder if it would make sense to do this:
{code:java}
def _copyValues(self, to, extra=None):
    """
    Copies param values from this instance to another instance for
    params shared by them.

    :param to: the target instance
    :param extra: extra params to be copied
    :return: the target instance with param values copied
    """
    paramMap = self._paramMap.copy()
    if extra is not None:
        paramMap.update(extra)
    for param in self.params:
        # copy default params
        if param in self._defaultParamMap and to.hasParam(param.name):
            to._defaultParamMap[to.getParam(param.name)] = self._defaultParamMap[param]
        # copy explicitly set params
        if param in paramMap and to.hasParam(param.name):
            to._set(**{param.name: paramMap[param]})
        # allow extra to update parameters on self by name, without having to call getParam first
        elif self.hasParam(param):
            to._set(**{param: paramMap[param]})
        else:
            pass
    return to
{code}
This should allow:
lr.fit(df, extra={"elasticNetParam": 0.3})
to produce the same result as:
lr.fit(df, extra={lr.getParam("elasticNetParam"): 0.3})
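One way to picture the name-based lookup John proposes is to resolve string keys to `Param` objects before the values are merged. The sketch below is not the code from the comment and not PySpark code: `Param` is a hypothetical stand-in for `pyspark.ml.param.Param`, and `resolve_keys` is an invented helper. It also rejects names that cannot be bound to the instance, as the later comment in this thread suggests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Param:
    """Hypothetical stand-in for pyspark.ml.param.Param (assumption)."""
    parent: str
    name: str


def resolve_keys(own_params, extra):
    """Map string keys to the matching Param; pass Param keys through."""
    by_name = {p.name: p for p in own_params}
    resolved = {}
    for key, value in extra.items():
        if isinstance(key, str):
            if key not in by_name:
                # The name cannot be bound to this instance.
                raise KeyError(f"no param named {key!r}")
            key = by_name[key]
        resolved[key] = value
    return resolved


elastic = Param("lr", "elasticNetParam")
by_name = resolve_keys([elastic], {"elasticNetParam": 0.3})
by_param = resolve_keys([elastic], {elastic: 0.3})
print(by_name == by_param)
```

Resolving keys up front keeps the copy loop itself unchanged, which is one possible design; the thread does not settle on it.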
[jira] [Commented] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
    [ https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966941#comment-16966941 ]

John Bauer commented on SPARK-29691:
------------------------------------

OK that works. I worked with fit doing a grid search some time ago, and don't remember it working like this. I will check some of my earlier projects to see if my memory fails me...
[jira] [Commented] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
    [ https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966929#comment-16966929 ]

Huaxin Gao commented on SPARK-29691:
------------------------------------

Could you please change this line
{code:java}
lrModel1 = lr.fit(training, params={"elasticNetParam": 0.3}){code}
to
{code:java}
lrModel1 = lr.fit(training, params={lr.elasticNetParam : 0.3}){code}
and try again?
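Why a string key is silently ignored while a `Param` key takes effect can be shown with a small model of the lookup. This is a sketch, not PySpark itself: `Param` is a hypothetical stand-in for `pyspark.ml.param.Param`, and `copy_values` is an invented function that mirrors only the `param in paramMap` test from the `_copyValues` code quoted elsewhere in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Param:
    """Hypothetical stand-in for pyspark.ml.param.Param (assumption)."""
    parent: str
    name: str


def copy_values(own_params, param_map, extra=None):
    """Merge `extra` into `param_map`, then read values back by Param."""
    merged = dict(param_map)
    if extra:
        merged.update(extra)
    # Only keys equal to one of the instance's Param objects are picked up;
    # a plain string never compares equal to a Param, so it is skipped.
    return {p.name: merged[p] for p in own_params if p in merged}


elastic = Param("lr", "elasticNetParam")
current = {elastic: 0.8}

by_name = copy_values([elastic], current, {"elasticNetParam": 0.3})
by_param = copy_values([elastic], current, {elastic: 0.3})
print(by_name)   # {'elasticNetParam': 0.8}
print(by_param)  # {'elasticNetParam': 0.3}
```

The string entry stays in the merged dictionary but is never read, which matches the unchanged results reported in this issue.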
[jira] [Commented] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
    [ https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966768#comment-16966768 ]

John Bauer commented on SPARK-29691:
------------------------------------

I will update the example shortly. I was using this in the context of an MLflow hyperparameter search, and none of the model outputs changed either.
[jira] [Commented] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
    [ https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965128#comment-16965128 ]

Huaxin Gao commented on SPARK-29691:
------------------------------------

I checked the doc and implementation. The Estimator fits the model using the passed-in optional params instead of the embedded params, but it doesn't overwrite the estimator's embedded param values. In your case, the estimator uses 0.75 to fit the model, but it still keeps 0.8 for its own elasticNetParam. If you get the model's parameters, it should have 0.75 for elasticNetParam. This seems to work as designed.

# Fit the model, but with an updated parameter setting:
lrModel = lr.fit(training, params={lr.elasticNetParam : 0.75})
print("After:", lrModel.getOrDefault("elasticNetParam"))  # print 0.75
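The behaviour Huaxin describes, where the override reaches the fitted model but not the estimator, follows naturally if `fit` works on a copy. The toy classes below illustrate that idea only. `ToyEstimator` and `ToyModel` are hypothetical names, nothing here is PySpark code, and plain string keys are used for brevity even though PySpark expects `Param` objects as keys.

```python
class ToyModel:
    """Hypothetical fitted model that records the params it was fit with."""

    def __init__(self, params):
        self.params = dict(params)


class ToyEstimator:
    """Hypothetical estimator; string keys are a simplification."""

    def __init__(self, **params):
        self.params = dict(params)

    def copy(self, extra=None):
        # Build a new estimator so the original is never mutated.
        merged = dict(self.params)
        merged.update(extra or {})
        return ToyEstimator(**merged)

    def fit(self, dataset, params=None):
        if params:
            # Overrides apply to a copy, and the copy does the fitting.
            return self.copy(params).fit(dataset)
        return ToyModel(self.params)


lr = ToyEstimator(elasticNetParam=0.8)
model = lr.fit([], params={"elasticNetParam": 0.75})
print("estimator:", lr.params["elasticNetParam"])  # estimator: 0.8
print("model:", model.params["elasticNetParam"])   # model: 0.75
```

Under this reading, the "After" value should be checked on the returned model rather than on the estimator.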