[jira] [Updated] (SPARK-34429) KMeansSummary class is omitted from PySpark documentation
[ https://issues.apache.org/jira/browse/SPARK-34429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Bauer updated SPARK-34429:
-------------------------------
    Summary: KMeansSummary class is omitted from PySpark documentation  (was: KMeansSummary class is omitted from PySPark documentation)

> KMeansSummary class is omitted from PySpark documentation
> ---------------------------------------------------------
>
>                 Key: SPARK-34429
>                 URL: https://issues.apache.org/jira/browse/SPARK-34429
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.4.7, 3.0.1
>            Reporter: John Bauer
>            Priority: Minor
>
> `KMeansSummary` is missing from `__all__` in clustering.py, so Sphinx omits it
> from the emitted documentation and the class is invisible when imported by
> other modules.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
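The omission matters because names left out of a module's `__all__` are skipped by wildcard imports (and, in its default configuration, by Sphinx's automodule). A throwaway toy module, not Spark's actual clustering.py, sketches the effect:

```python
import sys
import types

# Build a toy module whose __all__ omits one class, mirroring the bug:
# KMeansSummary is defined but not listed, so star-imports skip it.
mod = types.ModuleType("toy_clustering")
exec(
    "class KMeans: pass\n"
    "class KMeansSummary: pass\n"
    "__all__ = ['KMeans']\n",  # KMeansSummary missing, as in the report
    mod.__dict__,
)
sys.modules["toy_clustering"] = mod

ns = {}
exec("from toy_clustering import *", ns)
print("KMeans" in ns)         # True
print("KMeansSummary" in ns)  # False: invisible to importing modules
```

The fix itself is the one the summary implies: list `KMeansSummary` in `__all__` in clustering.py.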
[jira] [Created] (SPARK-34429) KMeansSummary class is omitted from PySPark documentation
John Bauer created SPARK-34429:
-------------------------------

             Summary: KMeansSummary class is omitted from PySPark documentation
                 Key: SPARK-34429
                 URL: https://issues.apache.org/jira/browse/SPARK-34429
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 3.0.1, 2.4.7
            Reporter: John Bauer

`KMeansSummary` is missing from `__all__` in clustering.py, so Sphinx omits it from the emitted documentation and the class is invisible when imported by other modules.
[jira] [Commented] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
[ https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974527#comment-16974527 ]

John Bauer commented on SPARK-29691:
------------------------------------
[[SPARK-29691] ensure Param objects are valid in fit, transform|https://github.com/apache/spark/pull/26527]

> Estimator fit method fails to copy params (in PySpark)
> ------------------------------------------------------
>
>                 Key: SPARK-29691
>                 URL: https://issues.apache.org/jira/browse/SPARK-29691
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.4
>            Reporter: John Bauer
>            Priority: Minor
>
> Estimator `fit` method is supposed to copy a dictionary of params,
> overwriting the estimator's previous values, before fitting the model.
> However, the parameter values are not updated. This was observed in PySpark,
> but may be present in the Java objects, as the PySpark code appears to be
> functioning correctly. (The copy method that interacts with Java is
> actually implemented in Params.)
>
> For example, this prints
> Before: 0.8
> After: 0.8
> but After should be 0.75
> {code:python}
> from pyspark.ml.classification import LogisticRegression
>
> # Load training data
> training = spark \
>     .read \
>     .format("libsvm") \
>     .load("data/mllib/sample_multiclass_classification_data.txt")
>
> lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
> print("Before:", lr.getOrDefault("elasticNetParam"))
>
> # Fit the model, but with an updated parameter setting:
> lrModel = lr.fit(training, params={"elasticNetParam": 0.75})
> print("After:", lr.getOrDefault("elasticNetParam"))
> {code}
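The silent failure comes down to how `Params._copyValues` looks params up: the copy loop iterates the estimator's `Param` objects and tests membership in the extra dict, so a dict keyed by plain strings never matches and no error is raised. A standalone toy (illustrative names, not Spark's classes) reproduces the shape of the bug:

```python
# Toy Param/Estimator pair (illustrative, not Spark's API) showing why
# string keys in the extra-params dict are silently ignored: the loop
# iterates Param objects, and "elasticNetParam" never equals a Param.

class Param:
    def __init__(self, name):
        self.name = name

class ToyEstimator:
    def __init__(self):
        self.elasticNetParam = Param("elasticNetParam")
        self.params = [self.elasticNetParam]
        self._paramMap = {self.elasticNetParam: 0.8}

    def fit(self, extra):
        # Mirrors the membership test in Params._copyValues.
        fitted = dict(self._paramMap)
        for p in self.params:
            if p in extra:  # a plain-string key never matches a Param
                fitted[p] = extra[p]
        return fitted

est = ToyEstimator()
by_string = est.fit({"elasticNetParam": 0.75})   # ignored, stays 0.8
by_param = est.fit({est.elasticNetParam: 0.75})  # honored, becomes 0.75
print(by_string[est.elasticNetParam], by_param[est.elasticNetParam])  # 0.8 0.75
```

In real PySpark the Param-keyed spelling the thread later lands on, `lr.fit(training, params={lr.getParam("elasticNetParam"): 0.75})`, supplies keys the copy loop can actually match.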
[jira] [Commented] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
[ https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967773#comment-16967773 ]

John Bauer commented on SPARK-29691:
------------------------------------
Yes, I can do that. An error message suggesting a call to getParam would get people on track. (I think that extending the API to include parameter names as above could be done safely, with a check that they could be bound to self, and an additional check in Pipeline.fit to prevent them being broadcast across a pipeline.)
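The error message this comment asks for is easy to sketch. The helper below is hypothetical, not Spark code: it rejects string keys with a hint pointing at getParam, the kind of check the linked PR adds to fit and transform:

```python
# Hypothetical validation helper (names are illustrative, not Spark's):
# reject extra-params dicts keyed by strings with a hint to use getParam,
# instead of silently ignoring them.

class Param:
    def __init__(self, name):
        self.name = name

def validate_extra(params, extra):
    """Raise if any key in `extra` is not one of `params`."""
    valid = set(params)
    for key in extra:
        if not isinstance(key, Param):
            raise TypeError(
                "extra params must be keyed by Param objects, got %r; "
                "did you mean estimator.getParam(%r)?" % (key, key))
        if key not in valid:
            raise ValueError("unknown param %r" % key.name)

p = Param("elasticNetParam")
validate_extra([p], {p: 0.3})  # Param key: passes silently
try:
    validate_extra([p], {"elasticNetParam": 0.3})  # string key: rejected
except TypeError as e:
    print(e)
```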
[jira] [Comment Edited] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
[ https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967741#comment-16967741 ]

John Bauer edited comment on SPARK-29691 at 11/5/19 6:13 PM:
-------------------------------------------------------------
I wonder if it would make sense to do this:
{code:python}
def _copyValues(self, to, extra=None):
    """
    Copies param values from this instance to another instance for
    params shared by them.

    :param to: the target instance
    :param extra: extra params to be copied
    :return: the target instance with param values copied
    """
    paramMap = self._paramMap.copy()
    if extra is not None:
        paramMap.update(extra)
    for param in self.params:
        # copy default params
        if param in self._defaultParamMap and to.hasParam(param.name):
            to._defaultParamMap[to.getParam(param.name)] = self._defaultParamMap[param]
        # copy explicitly set params
        if param in paramMap and to.hasParam(param.name):
            to._set(**{param.name: paramMap[param]})
        # allow extra to update parameters on self by name,
        # without having to call getParam first
        elif self.hasParam(param):
            to._set(**{param: paramMap[param]})
        else:
            pass
    return to
{code}
This should allow:
{code:python}
lr.fit(df, extra={"elasticNetParam": 0.3})
{code}
to produce the same result as:
{code:python}
lr.fit(df, extra={lr.getParam("elasticNetParam"): 0.3})
{code}
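The core of that proposal can be sketched standalone (toy classes, not Spark's) as a name-resolution pass over the extra dict: string keys are mapped to the matching Param object before the usual copy, so both spellings end up equivalent:

```python
# Toy normalization pass (illustrative names): resolve string keys in an
# extra-params dict to the owning Param objects, so callers can write
# {"elasticNetParam": 0.3} instead of {est.getParam("elasticNetParam"): 0.3}.

class Param:
    def __init__(self, name):
        self.name = name

def normalize_extra(params, extra):
    by_name = {p.name: p for p in params}
    normalized = {}
    for key, value in extra.items():
        if isinstance(key, str):
            key = by_name[key]  # a KeyError here surfaces unknown names early
        normalized[key] = value
    return normalized

p = Param("elasticNetParam")
print(normalize_extra([p], {"elasticNetParam": 0.3}) == {p: 0.3})  # True
print(normalize_extra([p], {p: 0.3}) == {p: 0.3})                  # True
```

One design note on this sketch: doing the resolution as a pre-pass, rather than inside the copy loop, means an unknown string name fails loudly instead of being dropped.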
[jira] [Commented] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
[ https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16966941#comment-16966941 ]

John Bauer commented on SPARK-29691:
------------------------------------
OK, that works. I worked with fit doing a grid search some time ago, and don't remember it working like this. I will check some of my earlier projects to see if my memory fails me...
[jira] [Comment Edited] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
[ https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16966768#comment-16966768 ]

John Bauer edited comment on SPARK-29691 at 11/4/19 5:20 PM:
-------------------------------------------------------------
I was using this in the context of an MLflow hyperparameter search. None of the model outputs changed whatsoever when fit(params=...) was used. Updated example:
{code:python}
from pyspark.ml.classification import LogisticRegression

# Load training data
training = spark \
    .read \
    .format("libsvm") \
    .load("data/mllib/sample_multiclass_classification_data.txt")

def printStats(lrModel, title):
    trainingSummary = lrModel.summary
    print(title)
    print(lrModel.explainParam("elasticNetParam"))
    print("Accuracy:", trainingSummary.accuracy)
    print("fMeasure:", trainingSummary.weightedFMeasure())
    print("")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(training)
printStats(lrModel, "elasticNetParam = 0.8, set through __init__")

lrModel1 = lr.fit(training, params={"elasticNetParam": 0.3})
printStats(lrModel1, """elasticNetParam still 0.8, after trying to set through fit(..., params={"elasticNetParam": 0.3})""")

lr03 = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.3)
lrModel03 = lr03.fit(training)
printStats(lrModel03, """Correct results for elasticNetParam = 0.3, set through __init__""")
{code}
Output:
{noformat}
elasticNetParam = 0.8, set through __init__
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty (default: 0.0, current: 0.8)
Accuracy: 0.82
fMeasure: 0.8007300232766211

elasticNetParam still 0.8, after trying to set through fit(..., params={"elasticNetParam": 0.3})
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty (default: 0.0, current: 0.8)
Accuracy: 0.82
fMeasure: 0.8007300232766211

Correct results for elasticNetParam = 0.3, set through __init__
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty (default: 0.0, current: 0.3)
Accuracy: 0.8933
fMeasure: 0.8922558922558923
{noformat}
[jira] [Comment Edited] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
[ https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966768#comment-16966768 ] John Bauer edited comment on SPARK-29691 at 11/4/19 4:59 PM: - I was using this in the context of an MLflow Hyperparameter search. None of the model outputs changed whatsoever when fit(params=...) was used. Updated example: {code:python} from pyspark.ml.classification import LogisticRegression # Load training data training = spark \ .read \ .format("libsvm") \ .load("data/mllib/sample_multiclass_classification_data.txt") def printStats(lrModel, title): trainingSummary = lrModel.summary print(title) print(lrModel) print("Accuracy:", trainingSummary.accuracy) print("fMeasure:", trainingSummary.weightedFMeasure()) print("") lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) lrModel = lr.fit(training) printStats(lrModel, "elasticNetParam = 0.8, set through __init__") lrModel1 = lr.fit(training, params={"elasticNetParam": 0.3}) printStats(lrModel1, """elasticNetParam still 0.8, after trying to set through fit(..., params={"elasticNetParam" : 0.3}""") lr03 = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.3) lrModel03 = lr03.fit(training) printStats(lrModel03, """Correct results for elasticNetParam = 0.3, set through __init__""") {code} Output: {noformat} elasticNetParam = 0.8, set through __init__ LogisticRegressionModel: uid = LogisticRegression_2348ddb9c9f6, numClasses = 3, numFeatures = 4 Accuracy: 0.82 fMeasure: 0.8007300232766211 elasticNetParam still 0.8, after trying to set through fit(..., params={"elasticNetParam" : 0.3} LogisticRegressionModel: uid = LogisticRegression_2348ddb9c9f6, numClasses = 3, numFeatures = 4 Accuracy: 0.82 fMeasure: 0.8007300232766211 Correct results for elasticNetParam = 0.3, set through __init__ LogisticRegressionModel: uid = LogisticRegression_cb376c90572e, numClasses = 3, numFeatures = 4 Accuracy: 0.8933 fMeasure: 0.8922558922558923 {noformat} was 
(Author: johnhbauer): I was using this in the context of an MLflow Hyperparameter search. None of the model outputs changed whatsoever when fit(params=...) was used. Updated example: {code:python} from pyspark.ml.classification import LogisticRegression # Load training data training = spark \ .read \ .format("libsvm") \ .load("data/mllib/sample_multiclass_classification_data.txt") def printStats(lrModel, title): trainingSummary = lrModel.summary print(title) print(lrModel) print("Accuracy:", trainingSummary.accuracy) print("fMeasure:", trainingSummary.weightedFMeasure()) print("") lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) lrModel = lr.fit(training) printStats(lrModel, "elasticNetParam = 0.8, set through __init__") lrModel1 = lr.fit(training, params={"elasticNetParam": 0.3}) printStats(lrModel1, """elasticNetParam still 0.8, after trying to set through fit(..., params={"elasticNetParam" : 0.3}""") lr03 = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.3) lrModel03 = lr03.fit(training) printStats(lrModel03, """Correct results for elasticNetParam = 0.3, set through __init__""") {code} > Estimator fit method fails to copy params (in PySpark) > -- > > Key: SPARK-29691 > URL: https://issues.apache.org/jira/browse/SPARK-29691 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 >Reporter: John Bauer >Priority: Minor > > Estimator `fit` method is supposed to copy a dictionary of params, > overwriting the estimator's previous values, before fitting the model. > However, the parameter values are not updated. This was observed in PySpark, > but may be present in the Java objects, as the PySpark code appears to be > functioning correctly. (The copy method that interacts with Java is > actually implemented in Params.) 
> For example, this prints > Before: 0.8 > After: 0.8 > but After should be 0.75 > {code:python} > from pyspark.ml.classification import LogisticRegression > # Load training data > training = spark \ > .read \ > .format("libsvm") \ > .load("data/mllib/sample_multiclass_classification_data.txt") > lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) > print("Before:", lr.getOrDefault("elasticNetParam")) > # Fit the model, but with an updated parameter setting: > lrModel = lr.fit(training, params={"elasticNetParam" : 0.75}) > print("After:", lr.getOrDefault("elasticNetParam")) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail:
[jira] [Comment Edited] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
[ https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966768#comment-16966768 ] John Bauer edited comment on SPARK-29691 at 11/4/19 4:57 PM: - I was using this in the context of an MLflow Hyperparameter search. None of the model outputs changed whatsoever when fit(params=...) was used. Updated example: {code:python} from pyspark.ml.classification import LogisticRegression # Load training data training = spark \ .read \ .format("libsvm") \ .load("data/mllib/sample_multiclass_classification_data.txt") def printStats(lrModel, title): trainingSummary = lrModel.summary print(title) print(lrModel) print("Accuracy:", trainingSummary.accuracy) print("fMeasure:", trainingSummary.weightedFMeasure()) print("") lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) lrModel = lr.fit(training) printStats(lrModel, "elasticNetParam = 0.8, set through __init__") lrModel1 = lr.fit(training, params={"elasticNetParam": 0.3}) printStats(lrModel1, """elasticNetParam still 0.8, after trying to set through fit(..., params={"elasticNetParam" : 0.3}""") lr03 = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.3) lrModel03 = lr03.fit(training) printStats(lrModel03, """Correct results for elasticNetParam = 0.3, set through __init__""") {code} was (Author: johnhbauer): I was using this in the context of an MLflow Hyperparameter search. None of the model outputs changed whatsoever when fit(params={"elasticNetParam": 0.3} was used. 
Updated example: {code:python} from pyspark.ml.classification import LogisticRegression # Load training data training = spark \ .read \ .format("libsvm") \ .load("data/mllib/sample_multiclass_classification_data.txt") def printStats(lrModel, title): trainingSummary = lrModel.summary print(title) print(lrModel) print("Accuracy:", trainingSummary.accuracy) print("fMeasure:", trainingSummary.weightedFMeasure()) print("") lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) lrModel = lr.fit(training) printStats(lrModel, "elasticNetParam = 0.8, set through __init__") lrModel1 = lr.fit(training, params={"elasticNetParam": 0.3}) printStats(lrModel1, """elasticNetParam still 0.8, after trying to set through fit(..., params={"elasticNetParam" : 0.3}""") lr03 = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.3) lrModel03 = lr03.fit(training) printStats(lrModel03, """Correct results for elasticNetParam = 0.3, set through __init__""") {code} > Estimator fit method fails to copy params (in PySpark) > -- > > Key: SPARK-29691 > URL: https://issues.apache.org/jira/browse/SPARK-29691 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 >Reporter: John Bauer >Priority: Minor > > Estimator `fit` method is supposed to copy a dictionary of params, > overwriting the estimator's previous values, before fitting the model. > However, the parameter values are not updated. This was observed in PySpark, > but may be present in the Java objects, as the PySpark code appears to be > functioning correctly. (The copy method that interacts with Java is > actually implemented in Params.) 
> For example, this prints > Before: 0.8 > After: 0.8 > but After should be 0.75 > {code:python} > from pyspark.ml.classification import LogisticRegression > # Load training data > training = spark \ > .read \ > .format("libsvm") \ > .load("data/mllib/sample_multiclass_classification_data.txt") > lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) > print("Before:", lr.getOrDefault("elasticNetParam")) > # Fit the model, but with an updated parameter setting: > lrModel = lr.fit(training, params={"elasticNetParam" : 0.75}) > print("After:", lr.getOrDefault("elasticNetParam")) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
[ https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966768#comment-16966768 ]

John Bauer edited comment on SPARK-29691 at 11/4/19 4:57 PM:
-

I was using this in the context of an MLflow hyperparameter search. None of the model outputs changed whatsoever when fit(training, params={"elasticNetParam": 0.3}) was used.

Updated example:
{code:python}
from pyspark.ml.classification import LogisticRegression

# Load training data
training = spark \
    .read \
    .format("libsvm") \
    .load("data/mllib/sample_multiclass_classification_data.txt")

def printStats(lrModel, title):
    trainingSummary = lrModel.summary
    print(title)
    print(lrModel)
    print("Accuracy:", trainingSummary.accuracy)
    print("fMeasure:", trainingSummary.weightedFMeasure())
    print("")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(training)
printStats(lrModel, "elasticNetParam = 0.8, set through __init__")

lrModel1 = lr.fit(training, params={"elasticNetParam": 0.3})
printStats(lrModel1, """elasticNetParam still 0.8, after trying to set through fit(..., params={"elasticNetParam": 0.3})""")

lr03 = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.3)
lrModel03 = lr03.fit(training)
printStats(lrModel03, "Correct results for elasticNetParam = 0.3, set through __init__")
{code}

was (Author: johnhbauer):
I was using this in the context of an MLflow hyperparameter search. None of the model outputs changed whatsoever when fit(..., params={...}) was used.
[jira] [Commented] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
[ https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966768#comment-16966768 ]

John Bauer commented on SPARK-29691:

I will update the example shortly. I was using this in the context of an MLflow hyperparameter search, and none of the model outputs changed either.
[jira] [Commented] (SPARK-12806) Support SQL expressions extracting values from VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-12806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964481#comment-16964481 ]

John Bauer commented on SPARK-12806:

Also, when using PyArrow to convert a Spark DataFrame for use in a pandas_udf, as soon as a VectorUDT is encountered it reverts to a non-optimized conversion, losing much of the advantage of using PyArrow.

> Support SQL expressions extracting values from VectorUDT
>
> Key: SPARK-12806
> URL: https://issues.apache.org/jira/browse/SPARK-12806
> Project: Spark
> Issue Type: Improvement
> Components: MLlib, SQL
> Affects Versions: 1.6.0
> Reporter: Feynman Liang
> Priority: Major
> Labels: bulk-closed
>
> Use cases exist where a specific index within a {{VectorUDT}} column of a
> {{DataFrame}} is required. For example, we may be interested in extracting a
> specific class probability from the {{probabilityCol}} of a
> {{LogisticRegression}} to compute losses. However, if {{probability}} is a
> column of {{df}} with type {{VectorUDT}}, the following code fails:
> {code}
> df.select("probability.0")
> AnalysisException: u"Can't extract value from probability"
> {code}
> thrown from
> {{sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala}}.
> {{VectorUDT}} essentially wraps a {{StructType}}, hence one would expect it
> to support value extraction Expressions in an analogous way.
[jira] [Commented] (SPARK-12806) Support SQL expressions extracting values from VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-12806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964475#comment-16964475 ]

John Bauer commented on SPARK-12806:

This is still a problem. For example, classification models emit probability as a VectorUDT, which is unusable in PySpark. This makes constructing boosting/bagging algorithms, or even just using the probabilities as additional features in a second model, problematic.
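[Editor's note] The standard interim workaround is to index into the vector inside a Python UDF, e.g. udf(lambda v: float(v[1]), DoubleType()) applied to the probability column, or — on Spark 3.0+ — pyspark.ml.functions.vector_to_array followed by ordinary array indexing. The extraction itself is plain indexing; a Spark-free sketch (column and function names below are illustrative):

```python
# Spark-free sketch of the element extraction a UDF would perform.
# In PySpark this body would be wrapped as (hypothetical column names):
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import DoubleType
#   p1 = udf(lambda v: float(v[1]), DoubleType())   # probability of class 1
#   df = df.withColumn("p1", p1(df["probability"]))

def class_probability(vector, index):
    """Pull one class probability out of a dense probability vector."""
    return float(vector[index])

probs = [0.2, 0.7, 0.1]              # stand-in for one 3-class probabilityCol row
print(class_probability(probs, 1))   # 0.7
```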
[jira] [Updated] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
[ https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Bauer updated SPARK-29691:
---
Description:

Estimator `fit` method is supposed to copy a dictionary of params, overwriting the estimator's previous values, before fitting the model. However, the parameter values are not updated. This was observed in PySpark, but may be present in the Java objects, as the PySpark code appears to be functioning correctly. (The copy method that interacts with Java is actually implemented in Params.)

For example, this prints
Before: 0.8
After: 0.8
but After should be 0.75

{code:python}
from pyspark.ml.classification import LogisticRegression

# Load training data
training = spark \
    .read \
    .format("libsvm") \
    .load("data/mllib/sample_multiclass_classification_data.txt")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
print("Before:", lr.getOrDefault("elasticNetParam"))

# Fit the model, but with an updated parameter setting:
lrModel = lr.fit(training, params={"elasticNetParam": 0.75})
print("After:", lr.getOrDefault("elasticNetParam"))
{code}

was:
Estimator `fit` method (implemented in Params) is supposed to copy a dictionary of params, overwriting the estimator's previous values, before fitting the model. However, the parameter values are not updated. This was observed in PySpark, but may be present in the Java objects, as the PySpark code appears to be functioning correctly. (Followed by the same example as above.)
[jira] [Created] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
John Bauer created SPARK-29691:
--
Summary: Estimator fit method fails to copy params (in PySpark)
Key: SPARK-29691
URL: https://issues.apache.org/jira/browse/SPARK-29691
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.4.4
Reporter: John Bauer

Estimator `fit` method (implemented in Params) is supposed to copy a dictionary of params, overwriting the estimator's previous values, before fitting the model. However, the parameter values are not updated. This was observed in PySpark, but may be present in the Java objects, as the PySpark code appears to be functioning correctly.

For example, this prints
Before: 0.8
After: 0.8
but After should be 0.75

{code:python}
from pyspark.ml.classification import LogisticRegression

# Load training data
training = spark \
    .read \
    .format("libsvm") \
    .load("data/mllib/sample_multiclass_classification_data.txt")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
print("Before:", lr.getOrDefault("elasticNetParam"))

# Fit the model, but with an updated parameter setting:
lrModel = lr.fit(training, params={"elasticNetParam": 0.75})
print("After:", lr.getOrDefault("elasticNetParam"))
{code}
[jira] [Comment Edited] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855948#comment-16855948 ]

John Bauer edited comment on SPARK-17025 at 6/4/19 11:12 PM:
-

[~Hadar] [~yug95] [~ralucamaria.b...@gmail.com] I wrote a minimal example of a PySpark estimator/model pair which can be saved and loaded at [ImputeNormal|https://github.com/JohnHBauer/ImputeNormal]. It imputes missing values from a normal distribution, using mean and standard deviation parameters estimated from the data, so it might be useful for that too. Let me know if it helps you.

was (Author: johnhbauer):
[~Hadar] [~yug95] [~ralucamaria.b...@gmail.com] I wrote a minimal example of a PySpark estimator/model pair which can be saved and loaded at [ImputeNormal|https://github.com/JohnHBauer/ImputeNormal] which imputes missing values from a normal distribution using parameters estimated from the data. Let me know if it helps you.

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
> Issue Type: New Feature
> Components: ML, PySpark
> Affects Versions: 2.0.0
> Reporter: Nicholas Chammas
> Assignee: Ajay Saini
> Priority: Minor
> Fix For: 2.3.0
>
> Following the example in [this Databricks blog post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html] under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the model, the operation fails because the custom transformer doesn't have a {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
>     model.bestModel.save('model')
>   File "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 222, in save
>   File "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 217, in write
>   File "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py", line 93, in __init__
>   File "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py], I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up (i.e. [like this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn up clear results).
[jira] [Comment Edited] (SPARK-21542) Helper functions for custom Python Persistence
[ https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681895#comment-16681895 ]

John Bauer edited comment on SPARK-21542 at 11/9/18 8:07 PM:
-

Compared to the previous, the above example is a) much more minimal, b) genuinely useful, and c) actually works with save and load, for example:
{code:python}
impute.write().save("impute")
imp = ImputeNormal.load("impute")
imp.explainParams()

impute_model.write().save("impute_model")
impm = ImputeNormalModel.load("impute_model")
impm.explainParams()
{code}

was (Author: johnhbauer):
This is a) much more minimal, b) genuinely useful, and c) actually works with save and load, for example:
{code:python}
impute.write().save("impute")
imp = ImputeNormal.load("impute")
imp.explainParams()

impute_model.write().save("impute_model")
impm = ImputeNormalModel.load("impute_model")
impm.explainParams()
{code}

> Helper functions for custom Python Persistence
> -
>
> Key: SPARK-21542
> URL: https://issues.apache.org/jira/browse/SPARK-21542
> Project: Spark
> Issue Type: New Feature
> Components: ML, PySpark
> Affects Versions: 2.2.0
> Reporter: Ajay Saini
> Assignee: Ajay Saini
> Priority: Major
> Fix For: 2.3.0
>
> Currently, there is no way to easily persist JSON-serializable parameters in Python only. All parameters in Python are persisted by converting them to Java objects and using the Java persistence implementation. In order to facilitate the creation of custom Python-only pipeline stages, it would be good to have a Python-only persistence framework so that these stages do not need to be implemented in Scala for persistence.
> This task involves:
> - Adding implementations for DefaultParamsReadable, DefaultParamsWritable, DefaultParamsReader, and DefaultParamsWriter in pyspark.
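The Python-only persistence that makes this save/load round trip work reduces to writing params as JSON metadata and reading them back. The following is a rough plain-Python sketch of that idea only; the function names and file layout here are illustrative assumptions, not pyspark's actual DefaultParamsWriter/DefaultParamsReader implementation.

```python
import json
import os
import tempfile

# Illustrative sketch: Python-only persistence boils down to dumping
# JSON-serializable params to a metadata file and reading them back.

def save_params(path, param_map):
    # Write the param map as JSON metadata under `path`.
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "metadata.json"), "w") as f:
        json.dump({"paramMap": param_map}, f)

def load_params(path):
    # Read the param map back from the JSON metadata.
    with open(os.path.join(path, "metadata.json")) as f:
        return json.load(f)["paramMap"]

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "impute")
        save_params(target, {"inputCol": "input", "outputCol": "output"})
        print(load_params(target))  # the same param map round-trips
```

Because no stage state ever crosses into the JVM, a custom transformer persisted this way needs no `_to_java` at all.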
[jira] [Commented] (SPARK-21542) Helper functions for custom Python Persistence
[ https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681891#comment-16681891 ]

John Bauer commented on SPARK-21542:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, randn
from pyspark import keyword_only
from pyspark.ml import Estimator, Model
#from pyspark.ml.feature import SQLTransformer
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.ml.param import Param, Params, TypeConverters
from pyspark.ml.param.shared import HasInputCol, HasOutputCol

spark = SparkSession\
    .builder\
    .appName("ImputeNormal")\
    .getOrCreate()


class ImputeNormal(Estimator, HasInputCol, HasOutputCol,
                   DefaultParamsReadable, DefaultParamsWritable):

    @keyword_only
    def __init__(self, inputCol="inputCol", outputCol="outputCol"):
        super(ImputeNormal, self).__init__()
        self._setDefault(inputCol="inputCol", outputCol="outputCol")
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol="inputCol", outputCol="outputCol"):
        """
        setParams(self, inputCol="inputCol", outputCol="outputCol")
        """
        kwargs = self._input_kwargs
        self._set(**kwargs)
        return self

    def _fit(self, data):
        inputCol = self.getInputCol()
        outputCol = self.getOutputCol()
        stats = data.select(inputCol).describe()
        mean = stats.where(col("summary") == "mean").take(1)[0][inputCol]
        stddev = stats.where(col("summary") == "stddev").take(1)[0][inputCol]
        return ImputeNormalModel(mean=float(mean), stddev=float(stddev),
                                 inputCol=inputCol, outputCol=outputCol)
        # FOR A TRULY MINIMAL BUT LESS DIDACTICALLY EFFECTIVE DEMO, DO INSTEAD:
        #sql_text = "SELECT *, IF({inputCol} IS NULL, {stddev} * randn() + {mean}, {inputCol}) AS {outputCol} FROM __THIS__"
        #return SQLTransformer(statement=sql_text.format(stddev=stddev, mean=mean, inputCol=inputCol, outputCol=outputCol))


class ImputeNormalModel(Model, HasInputCol, HasOutputCol,
                        DefaultParamsReadable, DefaultParamsWritable):

    mean = Param(Params._dummy(), "mean",
                 "Mean value of imputations. Calculated by fit method.",
                 typeConverter=TypeConverters.toFloat)
    stddev = Param(Params._dummy(), "stddev",
                   "Standard deviation of imputations. Calculated by fit method.",
                   typeConverter=TypeConverters.toFloat)

    @keyword_only
    def __init__(self, mean=0.0, stddev=1.0, inputCol="inputCol", outputCol="outputCol"):
        super(ImputeNormalModel, self).__init__()
        self._setDefault(mean=0.0, stddev=1.0, inputCol="inputCol", outputCol="outputCol")
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, mean=0.0, stddev=1.0, inputCol="inputCol", outputCol="outputCol"):
        """
        setParams(self, mean=0.0, stddev=1.0, inputCol="inputCol", outputCol="outputCol")
        """
        kwargs = self._input_kwargs
        self._set(**kwargs)
        return self

    def getMean(self):
        return self.getOrDefault(self.mean)

    def setMean(self, mean):
        self._set(mean=mean)

    def getStddev(self):
        return self.getOrDefault(self.stddev)

    def setStddev(self, stddev):
        self._set(stddev=stddev)

    def _transform(self, data):
        mean = self.getMean()
        stddev = self.getStddev()
        inputCol = self.getInputCol()
        outputCol = self.getOutputCol()
        df = data.withColumn(outputCol,
                             when(col(inputCol).isNull(),
                                  stddev * randn() + mean)
                             .otherwise(col(inputCol)))
        return df


if __name__ == "__main__":
    train = spark.createDataFrame([[0], [1], [2]] + [[None]] * 100, ['input'])
    impute = ImputeNormal(inputCol='input', outputCol='output')
    impute_model = impute.fit(train)
    print("Input column: {}".format(impute_model.getInputCol()))
    print("Output column: {}".format(impute_model.getOutputCol()))
    print("Mean: {}".format(impute_model.getMean()))
    print("Standard Deviation: {}".format(impute_model.getStddev()))
    test = impute_model.transform(train)
    test.show(10)
    test.describe().show()
    print("mean and stddev for outputCol should be close to those of inputCol")
{code}

> Helper functions for custom Python Persistence
>
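The fit/transform logic of the estimator above can be mimicked in plain Python (no Spark session required) to make the computation concrete: estimate the mean and standard deviation from the observed values, then fill each missing value with a draw from that normal distribution. The function names below are illustrative, not part of the example code.

```python
import random
import statistics

# Plain-Python analogue of ImputeNormal: fit estimates the distribution
# parameters from the non-missing values; transform samples N(mean, stddev)
# wherever the input is None.

def fit_impute_normal(values):
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    stddev = statistics.stdev(observed) if len(observed) > 1 else 0.0
    return mean, stddev

def transform_impute_normal(values, mean, stddev, rng=random):
    return [v if v is not None else rng.gauss(mean, stddev) for v in values]
```

With enough rows, the imputed column's mean and stddev converge to those of the input column, which is what the `__main__` demo above prints.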
[jira] [Commented] (SPARK-21542) Helper functions for custom Python Persistence
[ https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634679#comment-16634679 ]

John Bauer commented on SPARK-21542:

The above is not as minimal as I would have liked. It is based on the unit tests associated with the fix referenced for DefaultParamsReadable and DefaultParamsWritable, which I thought would test the desired behavior, i.e. saving and loading a pipeline after calling fit(). Unfortunately that behavior was not tested, so I flailed at the code for a while until I got something that worked. A lot of the leftover unit-test scaffolding could probably be removed, but at least this seems to work.

> Helper functions for custom Python Persistence
> -
>
> Key: SPARK-21542
> URL: https://issues.apache.org/jira/browse/SPARK-21542
> Project: Spark
> Issue Type: New Feature
> Components: ML, PySpark
> Affects Versions: 2.2.0
> Reporter: Ajay Saini
> Assignee: Ajay Saini
> Priority: Major
> Fix For: 2.3.0
>
> Currently, there is no way to easily persist JSON-serializable parameters in Python only. All parameters in Python are persisted by converting them to Java objects and using the Java persistence implementation. In order to facilitate the creation of custom Python-only pipeline stages, it would be good to have a Python-only persistence framework so that these stages do not need to be implemented in Scala for persistence.
> This task involves:
> - Adding implementations for DefaultParamsReadable, DefaultParamsWritable, DefaultParamsReader, and DefaultParamsWriter in pyspark.
[jira] [Commented] (SPARK-21542) Helper functions for custom Python Persistence
[ https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634677#comment-16634677 ]

John Bauer commented on SPARK-21542:

{code:python}
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Sep 27 10:25:10 2018

@author: JohnBauer
"""
from pyspark.sql import DataFrame, Row
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.sql.functions import udf
from pyspark import keyword_only, SparkContext
from pyspark.ml import Estimator, Model, Pipeline, PipelineModel, Transformer, UnaryTransformer
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
#from pyspark.ml.util import *
from pyspark.ml.param import Param, Params, TypeConverters
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.sql.types import FloatType, DoubleType  #, LongType, ArrayType, StringType, StructType, StructField

spark = SparkSession\
    .builder\
    .appName("Minimal_1")\
    .getOrCreate()

data_path = "/Users/JohnBauer/spark/data/mllib"
# Load training data
data = spark.read.format("libsvm").load("{}/sample_libsvm_data.txt".format(data_path))
train, test = data.randomSplit([0.7, 0.3])
train.show(5)


class MockDataset(DataFrame):
    def __init__(self):
        self.index = 0


class HasFake(Params):
    def __init__(self):
        super(HasFake, self).__init__()
        self.fake = Param(self, "fake", "fake param")

    def getFake(self):
        return self.getOrDefault(self.fake)


class MockTransformer(Transformer, DefaultParamsReadable, DefaultParamsWritable, HasFake):
    def __init__(self):
        super(MockTransformer, self).__init__()
        self.dataset_index = None

    def _transform(self, dataset):
        self.dataset_index = dataset.index
        dataset.index += 1
        return dataset


class MockUnaryTransformer(UnaryTransformer, DefaultParamsReadable,
                           DefaultParamsWritable):  #, HasInputCol):
    shift = Param(Params._dummy(), "shift",
                  "The amount by which to shift data in a DataFrame",
                  typeConverter=TypeConverters.toFloat)
    inputCol = Param(Params._dummy(), "inputCol",
                     "column of DataFrame to transform",
                     typeConverter=TypeConverters.toString)
    outputCol = Param(Params._dummy(), "outputCol",
                      "name of transformed column to be added to DataFrame",
                      typeConverter=TypeConverters.toString)

    @keyword_only
    def __init__(self, shiftVal=1, inputCol="features", outputCol="outputCol"):
        super(MockUnaryTransformer, self).__init__()
        self._setDefault(shift=1)
        self._set(shift=shiftVal)
        self._setDefault(inputCol=inputCol)
        self._setDefault(outputCol=outputCol)

    def getShift(self):
        return self.getOrDefault(self.shift)

    def setShift(self, shift):
        self._set(shift=shift)

    def createTransformFunc(self):
        shiftVal = self.getShift()
        return lambda x: x + shiftVal

    def outputDataType(self):
        return DoubleType()

    def validateInputType(self, inputType):
        if inputType != DoubleType():
            print("input type: {}".format(inputType))
            return
            #raise TypeError("Bad input type: {}. ".format(inputType) +
            #                "Requires Double.")

    def _transform(self, dataset):
        shift = self.getOrDefault("shift")

        def f(v):
            return v + shift

        t = FloatType()
        out_col = self.getOutputCol()
        in_col = dataset[self.getInputCol()]
        return dataset.withColumn(out_col, udf(f, t)(in_col))


class MockEstimator(Estimator, DefaultParamsReadable, DefaultParamsWritable, HasFake):
    def __init__(self):
        super(MockEstimator, self).__init__()
        self.dataset_index = None

    def _fit(self, dataset):
        self.dataset_index = dataset.index
        model = MockModel()
        self._copyValues(model)
        return model


class MockModel(MockTransformer, Model, HasFake):
    pass


#class PipelineTests(PySparkTestCase):
class PipelineTests(object):
    def test_pipeline(self, data=None):
        dataset = MockDataset() if data is None else data
        estimator0 = MockEstimator()
        transformer1 = MockTransformer()
        estimator2 = MockEstimator()
        transformer3 = MockTransformer()
        transformer4 = MockUnaryTransformer(inputCol="label", outputCol="shifted_label")
        pipeline = Pipeline(stages=[estimator0, transformer1, estimator2, transformer3, transformer4])
        pipeline_model = pipeline.fit(dataset,
[jira] [Commented] (SPARK-21542) Helper functions for custom Python Persistence
[ https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611085#comment-16611085 ]

John Bauer commented on SPARK-21542:

You don't show your code for __init__ or setParams. I recall getting this error before using the @keyword_only decorator; for example, see https://stackoverflow.com/questions/32331848/create-a-custom-transformer-in-pyspark-ml

I hope to get my custom transformer pipeline to persist sometime next week. If I succeed, I will try to provide an example if no one else has.

> Helper functions for custom Python Persistence
> -
>
> Key: SPARK-21542
> URL: https://issues.apache.org/jira/browse/SPARK-21542
> Project: Spark
> Issue Type: New Feature
> Components: ML, PySpark
> Affects Versions: 2.2.0
> Reporter: Ajay Saini
> Assignee: Ajay Saini
> Priority: Major
> Fix For: 2.3.0
>
> Currently, there is no way to easily persist JSON-serializable parameters in Python only. All parameters in Python are persisted by converting them to Java objects and using the Java persistence implementation. In order to facilitate the creation of custom Python-only pipeline stages, it would be good to have a Python-only persistence framework so that these stages do not need to be implemented in Scala for persistence.
> This task involves:
> - Adding implementations for DefaultParamsReadable, DefaultParamsWritable, DefaultParamsReader, and DefaultParamsWriter in pyspark.
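For readers unfamiliar with the decorator mentioned above, its mechanics can be sketched in plain Python. This is a simplified illustration of the idea, not pyspark's actual implementation: the decorator forbids positional arguments and stashes the keywords on the instance as `_input_kwargs`, so `__init__` and `setParams` can forward them as a single dict.

```python
import functools

# Simplified sketch of a @keyword_only-style decorator: record the keyword
# arguments on the instance, then call the wrapped method with them.

def keyword_only(func):
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        if args:
            raise TypeError("Method %s only takes keyword arguments." % func.__name__)
        self._input_kwargs = kwargs
        return func(self, **kwargs)
    return wrapper

class MyTransformer:
    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        # Forward the recorded kwargs to setParams, as PySpark stages do.
        self.setParams(**self._input_kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None):
        for name, value in self._input_kwargs.items():
            setattr(self, name, value)
        return self
```

Forgetting the decorator (or positional arguments) is a common source of the constructor errors discussed in this thread, since `_input_kwargs` then never gets set.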
[jira] [Created] (SPARK-23955) typo in parameter name 'rawPredicition'
John Bauer created SPARK-23955:
--

Summary: typo in parameter name 'rawPredicition'
Key: SPARK-23955
URL: https://issues.apache.org/jira/browse/SPARK-23955
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.3.0
Reporter: John Bauer

In classifier.py, the MultilayerPerceptronClassifier.__init__ API call had the typo rawPredicition instead of rawPrediction; the typo is also present in the documentation.
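Such a misspelling can survive because keyword names are just strings until something validates them. The following is a hypothetical helper, not part of pyspark, showing how checking keywords against a known set makes a typo like 'rawPredicition' fail loudly; the `VALID_PARAMS` contents are an illustrative subset, not the classifier's real parameter list.

```python
# Hypothetical validation helper (not part of pyspark): reject keyword
# names that are not in the known parameter set, so a misspelling raises
# instead of silently creating a bogus parameter.

VALID_PARAMS = {"featuresCol", "labelCol", "rawPredictionCol"}  # illustrative subset

def check_params(kwargs):
    unknown = set(kwargs) - VALID_PARAMS
    if unknown:
        raise ValueError("Unknown params: %s" % sorted(unknown))
    return kwargs
```

With a check like this in place, the misspelled keyword would have been caught at the first constructor call rather than shipping in a release.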