[jira] [Updated] (SPARK-34429) KMeansSummary class is omitted from PySpark documentation

2021-02-12 Thread John Bauer (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Bauer updated SPARK-34429:
---
Summary: KMeansSummary class is omitted from PySpark documentation  (was: 
KMeansSummary class is omitted from PySPark documentation)

> KMeansSummary class is omitted from PySpark documentation
> -
>
> Key: SPARK-34429
> URL: https://issues.apache.org/jira/browse/SPARK-34429
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.7, 3.0.1
>Reporter: John Bauer
>Priority: Minor
>
> `KMeansSummary` is missing from `__all__` in clustering.py, so Sphinx omits it
> from the generated documentation and the class is not exported to other modules
> that import from clustering.py.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34429) KMeansSummary class is omitted from PySPark documentation

2021-02-12 Thread John Bauer (Jira)
John Bauer created SPARK-34429:
--

 Summary: KMeansSummary class is omitted from PySPark documentation
 Key: SPARK-34429
 URL: https://issues.apache.org/jira/browse/SPARK-34429
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 3.0.1, 2.4.7
Reporter: John Bauer


`KMeansSummary` is missing from `__all__` in clustering.py, so Sphinx omits it
from the generated documentation and the class is not exported to other modules
that import from clustering.py.
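
A minimal illustration of the symptom (a sketch against PySpark 2.4.x/3.0.x; note that `__all__` only affects wildcard imports and documentation tooling, so the class is still importable by its full path):

{code:python}
from pyspark.ml import clustering

# KMeansSummary is absent from the module's public export list
print("KMeansSummary" in clustering.__all__)   # False before the fix

# but the class itself is defined and can still be imported directly
from pyspark.ml.clustering import KMeansSummary
print(KMeansSummary)
{code}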



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)

2019-11-14 Thread John Bauer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974527#comment-16974527
 ] 

John Bauer commented on SPARK-29691:


[[SPARK-29691] ensure Param objects are valid in fit, 
transform|https://github.com/apache/spark/pull/26527]

> Estimator fit method fails to copy params (in PySpark)
> --
>
> Key: SPARK-29691
> URL: https://issues.apache.org/jira/browse/SPARK-29691
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: John Bauer
>Priority: Minor
>
> Estimator `fit` method is supposed to copy a dictionary of params, 
> overwriting the estimator's previous values, before fitting the model. 
> However, the parameter values are not updated.  This was observed in PySpark, 
> but may be present in the Java objects, as the PySpark code appears to be 
> functioning correctly.   (The copy method that interacts with Java is 
> actually implemented in Params.)
> For example, this prints
> Before: 0.8
> After: 0.8
> but After should be 0.75
> {code:python}
> from pyspark.ml.classification import LogisticRegression
> # Load training data
> training = spark \
> .read \
> .format("libsvm") \
> .load("data/mllib/sample_multiclass_classification_data.txt")
> lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
> print("Before:", lr.getOrDefault("elasticNetParam"))
> # Fit the model, but with an updated parameter setting:
> lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})
> print("After:", lr.getOrDefault("elasticNetParam"))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)

2019-11-05 Thread John Bauer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967773#comment-16967773
 ] 

John Bauer commented on SPARK-29691:


Yes, I can do that.  An error message suggesting a call to getParam would get 
people on track.  (I think that extending the API to include parameter names as 
above could be done safely, with a check that they could be bound to self, and 
an additional check in Pipeline.fit to prevent them being broadcast across a 
pipeline.)
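
A rough sketch of what such a check could look like (hypothetical helper, not the actual patch; the name and wording are illustrative only):

{code:python}
from pyspark.ml.param import Param

def _validate_params(instance, params):
    """Hypothetical guard: reject params-dict keys that are not Param
    objects owned by `instance`, and point the user at getParam()."""
    for key in params:
        if not isinstance(key, Param) or not instance.hasParam(key.name):
            raise TypeError(
                "Invalid param key %r: keys must be Param objects owned by "
                "this instance; call getParam(name) to obtain one." % (key,))
{code}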

> Estimator fit method fails to copy params (in PySpark)
> --
>
> Key: SPARK-29691
> URL: https://issues.apache.org/jira/browse/SPARK-29691
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: John Bauer
>Priority: Minor
>
> Estimator `fit` method is supposed to copy a dictionary of params, 
> overwriting the estimator's previous values, before fitting the model. 
> However, the parameter values are not updated.  This was observed in PySpark, 
> but may be present in the Java objects, as the PySpark code appears to be 
> functioning correctly.   (The copy method that interacts with Java is 
> actually implemented in Params.)
> For example, this prints
> Before: 0.8
> After: 0.8
> but After should be 0.75
> {code:python}
> from pyspark.ml.classification import LogisticRegression
> # Load training data
> training = spark \
> .read \
> .format("libsvm") \
> .load("data/mllib/sample_multiclass_classification_data.txt")
> lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
> print("Before:", lr.getOrDefault("elasticNetParam"))
> # Fit the model, but with an updated parameter setting:
> lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})
> print("After:", lr.getOrDefault("elasticNetParam"))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)

2019-11-05 Thread John Bauer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967741#comment-16967741
 ] 

John Bauer edited comment on SPARK-29691 at 11/5/19 6:13 PM:
-

I wonder if it would make sense to do this:

{code:python}
def _copyValues(self, to, extra=None):
    """
    Copies param values from this instance to another instance for
    params shared by them.

    :param to: the target instance
    :param extra: extra params to be copied
    :return: the target instance with param values copied
    """
    paramMap = self._paramMap.copy()
    if extra is not None:
        paramMap.update(extra)
    for param in self.params:
        # copy default params
        if param in self._defaultParamMap and to.hasParam(param.name):
            to._defaultParamMap[to.getParam(param.name)] = self._defaultParamMap[param]
        # copy explicitly set params
        if param in paramMap and to.hasParam(param.name):
            to._set(**{param.name: paramMap[param]})
        # allow extra to update parameters on self by name,
        # without having to call getParam first
        elif self.hasParam(param):
            to._set(**{param: paramMap[param]})
        else:
            pass
    return to
{code}
This should allow:
{code:python}
lr.fit(df, extra={"elasticNetParam": 0.3})
{code}
to produce the same result as:
{code:python}
lr.fit(df, extra={lr.getParam("elasticNetParam"): 0.3})
{code}


was (Author: johnhbauer):
I wonder if it would make sense to do this:

{code:java}
def _copyValues(self, to, extra=None):
    """
    Copies param values from this instance to another instance for
    params shared by them.

    :param to: the target instance
    :param extra: extra params to be copied
    :return: the target instance with param values copied
    """
    paramMap = self._paramMap.copy()
    if extra is not None:
        paramMap.update(extra)
    for param in self.params:
        # copy default params
        if param in self._defaultParamMap and to.hasParam(param.name):
            to._defaultParamMap[to.getParam(param.name)] = self._defaultParamMap[param]
        # copy explicitly set params
        if param in paramMap and to.hasParam(param.name):
            to._set(**{param.name: paramMap[param]})
        # allow extra to update parameters on self by name,
        # without having to call getParam first
        elif self.hasParam(param):
            to._set(**{param: paramMap[param]})
        else:
            pass
    return to
{code}
This should allow:
lr.fit(df, extra={"elasticNetParam": 0.3}) 
to produce the same result as:
lr.fit(df, extra={lr.getParam("elasticNetParam"): 0.3})

> Estimator fit method fails to copy params (in PySpark)
> --
>
> Key: SPARK-29691
> URL: https://issues.apache.org/jira/browse/SPARK-29691
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: John Bauer
>Priority: Minor
>
> Estimator `fit` method is supposed to copy a dictionary of params, 
> overwriting the estimator's previous values, before fitting the model. 
> However, the parameter values are not updated.  This was observed in PySpark, 
> but may be present in the Java objects, as the PySpark code appears to be 
> functioning correctly.   (The copy method that interacts with Java is 
> actually implemented in Params.)
> For example, this prints
> Before: 0.8
> After: 0.8
> but After should be 0.75
> {code:python}
> from pyspark.ml.classification import LogisticRegression
> # Load training data
> training = spark \
> .read \
> .format("libsvm") \
> .load("data/mllib/sample_multiclass_classification_data.txt")
> lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
> print("Before:", lr.getOrDefault("elasticNetParam"))
> # Fit the model, but with an updated parameter setting:
> lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})
> print("After:", lr.getOrDefault("elasticNetParam"))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)

2019-11-05 Thread John Bauer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967741#comment-16967741
 ] 

John Bauer edited comment on SPARK-29691 at 11/5/19 6:11 PM:
-

I wonder if it would make sense to do this:

{code:java}
def _copyValues(self, to, extra=None):
    """
    Copies param values from this instance to another instance for
    params shared by them.

    :param to: the target instance
    :param extra: extra params to be copied
    :return: the target instance with param values copied
    """
    paramMap = self._paramMap.copy()
    if extra is not None:
        paramMap.update(extra)
    for param in self.params:
        # copy default params
        if param in self._defaultParamMap and to.hasParam(param.name):
            to._defaultParamMap[to.getParam(param.name)] = self._defaultParamMap[param]
        # copy explicitly set params
        if param in paramMap and to.hasParam(param.name):
            to._set(**{param.name: paramMap[param]})
        # allow extra to update parameters on self by name,
        # without having to call getParam first
        elif self.hasParam(param):
            to._set(**{param: paramMap[param]})
        else:
            pass
    return to
{code}
This should allow:
lr.fit(df, extra={"elasticNetParam": 0.3}) 
to produce the same result as:
lr.fit(df, extra={lr.getParam("elasticNetParam"): 0.3})


was (Author: johnhbauer):
I wonder if it would make sense to do this:

{code:java}
def _copyValues(self, to, extra=None):
    """
    Copies param values from this instance to another instance for
    params shared by them.

    :param to: the target instance
    :param extra: extra params to be copied
    :return: the target instance with param values copied
    """
    paramMap = self._paramMap.copy()
    if extra is not None:
        paramMap.update(extra)
    for param in self.params:
        # copy default params
        if param in self._defaultParamMap and to.hasParam(param.name):
            to._defaultParamMap[to.getParam(param.name)] = self._defaultParamMap[param]
        # copy explicitly set params
        if param in paramMap and to.hasParam(param.name):
            to._set(**{param.name: paramMap[param]})
        # allow extra to update parameters on self by name, without having to call getParam first
        elif self.hasParam(param):
            to._set(**{param: paramMap[param]})
        else:
            pass
    return to
{code}
This should allow:
lr.fit(df, extra={"elasticNetParam": 0.3}) 
to produce the same result as:
lr.fit(df, extra={lr.getParam("elasticNetParam"): 0.3})

> Estimator fit method fails to copy params (in PySpark)
> --
>
> Key: SPARK-29691
> URL: https://issues.apache.org/jira/browse/SPARK-29691
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: John Bauer
>Priority: Minor
>
> Estimator `fit` method is supposed to copy a dictionary of params, 
> overwriting the estimator's previous values, before fitting the model. 
> However, the parameter values are not updated.  This was observed in PySpark, 
> but may be present in the Java objects, as the PySpark code appears to be 
> functioning correctly.   (The copy method that interacts with Java is 
> actually implemented in Params.)
> For example, this prints
> Before: 0.8
> After: 0.8
> but After should be 0.75
> {code:python}
> from pyspark.ml.classification import LogisticRegression
> # Load training data
> training = spark \
> .read \
> .format("libsvm") \
> .load("data/mllib/sample_multiclass_classification_data.txt")
> lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
> print("Before:", lr.getOrDefault("elasticNetParam"))
> # Fit the model, but with an updated parameter setting:
> lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})
> print("After:", lr.getOrDefault("elasticNetParam"))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)

2019-11-05 Thread John Bauer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967741#comment-16967741
 ] 

John Bauer commented on SPARK-29691:


I wonder if it would make sense to do this:

{code:java}
def _copyValues(self, to, extra=None):
    """
    Copies param values from this instance to another instance for
    params shared by them.

    :param to: the target instance
    :param extra: extra params to be copied
    :return: the target instance with param values copied
    """
    paramMap = self._paramMap.copy()
    if extra is not None:
        paramMap.update(extra)
    for param in self.params:
        # copy default params
        if param in self._defaultParamMap and to.hasParam(param.name):
            to._defaultParamMap[to.getParam(param.name)] = self._defaultParamMap[param]
        # copy explicitly set params
        if param in paramMap and to.hasParam(param.name):
            to._set(**{param.name: paramMap[param]})
        # allow extra to update parameters on self by name, without having to call getParam first
        elif self.hasParam(param):
            to._set(**{param: paramMap[param]})
        else:
            pass
    return to
{code}
This should allow:
lr.fit(df, extra={"elasticNetParam": 0.3}) 
to produce the same result as:
lr.fit(df, extra={lr.getParam("elasticNetParam"): 0.3})

> Estimator fit method fails to copy params (in PySpark)
> --
>
> Key: SPARK-29691
> URL: https://issues.apache.org/jira/browse/SPARK-29691
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: John Bauer
>Priority: Minor
>
> Estimator `fit` method is supposed to copy a dictionary of params, 
> overwriting the estimator's previous values, before fitting the model. 
> However, the parameter values are not updated.  This was observed in PySpark, 
> but may be present in the Java objects, as the PySpark code appears to be 
> functioning correctly.   (The copy method that interacts with Java is 
> actually implemented in Params.)
> For example, this prints
> Before: 0.8
> After: 0.8
> but After should be 0.75
> {code:python}
> from pyspark.ml.classification import LogisticRegression
> # Load training data
> training = spark \
> .read \
> .format("libsvm") \
> .load("data/mllib/sample_multiclass_classification_data.txt")
> lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
> print("Before:", lr.getOrDefault("elasticNetParam"))
> # Fit the model, but with an updated parameter setting:
> lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})
> print("After:", lr.getOrDefault("elasticNetParam"))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)

2019-11-04 Thread John Bauer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966941#comment-16966941
 ] 

John Bauer commented on SPARK-29691:


OK that works.  I worked with fit doing a grid search some time ago, and don't 
remember it working like this.  I will check some of my earlier projects to see 
if my memory fails me...
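
For reference, the form that does take effect keys the dict with Param objects rather than strings (a sketch reusing the `lr` estimator and `training` DataFrame from the example quoted below):

{code:python}
lrModel = lr.fit(training, params={lr.getParam("elasticNetParam"): 0.75})
print(lrModel.explainParam("elasticNetParam"))   # should now report current: 0.75
{code}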

> Estimator fit method fails to copy params (in PySpark)
> --
>
> Key: SPARK-29691
> URL: https://issues.apache.org/jira/browse/SPARK-29691
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: John Bauer
>Priority: Minor
>
> Estimator `fit` method is supposed to copy a dictionary of params, 
> overwriting the estimator's previous values, before fitting the model. 
> However, the parameter values are not updated.  This was observed in PySpark, 
> but may be present in the Java objects, as the PySpark code appears to be 
> functioning correctly.   (The copy method that interacts with Java is 
> actually implemented in Params.)
> For example, this prints
> Before: 0.8
> After: 0.8
> but After should be 0.75
> {code:python}
> from pyspark.ml.classification import LogisticRegression
> # Load training data
> training = spark \
> .read \
> .format("libsvm") \
> .load("data/mllib/sample_multiclass_classification_data.txt")
> lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
> print("Before:", lr.getOrDefault("elasticNetParam"))
> # Fit the model, but with an updated parameter setting:
> lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})
> print("After:", lr.getOrDefault("elasticNetParam"))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)

2019-11-04 Thread John Bauer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966768#comment-16966768
 ] 

John Bauer edited comment on SPARK-29691 at 11/4/19 5:20 PM:
-

I was using this in the context of an MLflow Hyperparameter search. None of the 
model outputs changed whatsoever when fit(params=...) was used.  Updated 
example:

{code:python}
from pyspark.ml.classification import LogisticRegression

# Load training data
training = spark \
.read \
.format("libsvm") \
.load("data/mllib/sample_multiclass_classification_data.txt")

def printStats(lrModel, title):
    trainingSummary = lrModel.summary
    print(title)
    print(lrModel.explainParam("elasticNetParam"))
    print("Accuracy:", trainingSummary.accuracy)
    print("fMeasure:", trainingSummary.weightedFMeasure())
    print("")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(training)
printStats(lrModel, "elasticNetParam = 0.8, set through __init__")

lrModel1 = lr.fit(training, params={"elasticNetParam": 0.3})
printStats(lrModel1, """elasticNetParam still 0.8, after trying to set through 
fit(..., params={"elasticNetParam" : 0.3}""")

lr03 = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.3)
lrModel03 = lr03.fit(training)
printStats(lrModel03, """Correct results for elasticNetParam = 0.3, set through 
__init__""")
{code}
Output:

{noformat}
elasticNetParam = 0.8, set through __init__
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 
0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty (default: 
0.0, current: 0.8)
Accuracy: 0.82
fMeasure: 0.8007300232766211

elasticNetParam still 0.8, after trying to set through fit(..., 
params={"elasticNetParam" : 0.3}
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 
0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty (default: 
0.0, current: 0.8)
Accuracy: 0.82
fMeasure: 0.8007300232766211

Correct results for elasticNetParam = 0.3, set through __init__
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 
0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty (default: 
0.0, current: 0.3)
Accuracy: 0.8933
fMeasure: 0.8922558922558923
{noformat}



was (Author: johnhbauer):
I was using this in the context of an MLflow Hyperparameter search. None of the 
model outputs changed whatsoever when fit(params=...) was used.  Updated 
example:

{code:python}
from pyspark.ml.classification import LogisticRegression

# Load training data
training = spark \
.read \
.format("libsvm") \
.load("data/mllib/sample_multiclass_classification_data.txt")

def printStats(lrModel, title):
    trainingSummary = lrModel.summary
    print(title)
    print(lrModel)
    print("Accuracy:", trainingSummary.accuracy)
    print("fMeasure:", trainingSummary.weightedFMeasure())
    print("")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(training)
printStats(lrModel, "elasticNetParam = 0.8, set through __init__")

lrModel1 = lr.fit(training, params={"elasticNetParam": 0.3})
printStats(lrModel1, """elasticNetParam still 0.8, after trying to set through 
fit(..., params={"elasticNetParam" : 0.3}""")

lr03 = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.3)
lrModel03 = lr03.fit(training)
printStats(lrModel03, """Correct results for elasticNetParam = 0.3, set through 
__init__""")
{code}
Output:

{noformat}
elasticNetParam = 0.8, set through __init__
LogisticRegressionModel: uid = LogisticRegression_2348ddb9c9f6, numClasses = 3, 
numFeatures = 4
Accuracy: 0.82
fMeasure: 0.8007300232766211

elasticNetParam still 0.8, after trying to set through fit(..., 
params={"elasticNetParam" : 0.3}
LogisticRegressionModel: uid = LogisticRegression_2348ddb9c9f6, numClasses = 3, 
numFeatures = 4
Accuracy: 0.82
fMeasure: 0.8007300232766211

Correct results for elasticNetParam = 0.3, set through __init__
LogisticRegressionModel: uid = LogisticRegression_cb376c90572e, numClasses = 3, 
numFeatures = 4
Accuracy: 0.8933
fMeasure: 0.8922558922558923
{noformat}


> Estimator fit method fails to copy params (in PySpark)
> --
>
> Key: SPARK-29691
> URL: https://issues.apache.org/jira/browse/SPARK-29691
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: John Bauer
>Priority: Minor
>
> Estimator `fit` method is supposed to copy a dictionary of params, 
> overwriting the estimator's previous values, before fitting the model. 
> However, the parameter values are not updated.  This was observed in PySpark, 
> but may be present in the Java objects, as the PySpark code appears to be 
> functioning 

[jira] [Comment Edited] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)

2019-11-04 Thread John Bauer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966768#comment-16966768
 ] 

John Bauer edited comment on SPARK-29691 at 11/4/19 4:59 PM:
-

I was using this in the context of an MLflow Hyperparameter search. None of the 
model outputs changed whatsoever when fit(params=...) was used.  Updated 
example:

{code:python}
from pyspark.ml.classification import LogisticRegression

# Load training data
training = spark \
.read \
.format("libsvm") \
.load("data/mllib/sample_multiclass_classification_data.txt")

def printStats(lrModel, title):
    trainingSummary = lrModel.summary
    print(title)
    print(lrModel)
    print("Accuracy:", trainingSummary.accuracy)
    print("fMeasure:", trainingSummary.weightedFMeasure())
    print("")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(training)
printStats(lrModel, "elasticNetParam = 0.8, set through __init__")

lrModel1 = lr.fit(training, params={"elasticNetParam": 0.3})
printStats(lrModel1, """elasticNetParam still 0.8, after trying to set through 
fit(..., params={"elasticNetParam" : 0.3}""")

lr03 = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.3)
lrModel03 = lr03.fit(training)
printStats(lrModel03, """Correct results for elasticNetParam = 0.3, set through 
__init__""")
{code}
Output:

{noformat}
elasticNetParam = 0.8, set through __init__
LogisticRegressionModel: uid = LogisticRegression_2348ddb9c9f6, numClasses = 3, 
numFeatures = 4
Accuracy: 0.82
fMeasure: 0.8007300232766211

elasticNetParam still 0.8, after trying to set through fit(..., 
params={"elasticNetParam" : 0.3}
LogisticRegressionModel: uid = LogisticRegression_2348ddb9c9f6, numClasses = 3, 
numFeatures = 4
Accuracy: 0.82
fMeasure: 0.8007300232766211

Correct results for elasticNetParam = 0.3, set through __init__
LogisticRegressionModel: uid = LogisticRegression_cb376c90572e, numClasses = 3, 
numFeatures = 4
Accuracy: 0.8933
fMeasure: 0.8922558922558923
{noformat}



was (Author: johnhbauer):
I was using this in the context of an MLflow Hyperparameter search. None of the 
model outputs changed whatsoever when fit(params=...) was used.  Updated 
example:

{code:python}
from pyspark.ml.classification import LogisticRegression

# Load training data
training = spark \
.read \
.format("libsvm") \
.load("data/mllib/sample_multiclass_classification_data.txt")

def printStats(lrModel, title):
    trainingSummary = lrModel.summary
    print(title)
    print(lrModel)
    print("Accuracy:", trainingSummary.accuracy)
    print("fMeasure:", trainingSummary.weightedFMeasure())
    print("")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(training)
printStats(lrModel, "elasticNetParam = 0.8, set through __init__")

lrModel1 = lr.fit(training, params={"elasticNetParam": 0.3})
printStats(lrModel1, """elasticNetParam still 0.8, after trying to set through 
fit(..., params={"elasticNetParam" : 0.3}""")

lr03 = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.3)
lrModel03 = lr03.fit(training)
printStats(lrModel03, """Correct results for elasticNetParam = 0.3, set through 
__init__""")
{code}


> Estimator fit method fails to copy params (in PySpark)
> --
>
> Key: SPARK-29691
> URL: https://issues.apache.org/jira/browse/SPARK-29691
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: John Bauer
>Priority: Minor
>
> Estimator `fit` method is supposed to copy a dictionary of params, 
> overwriting the estimator's previous values, before fitting the model. 
> However, the parameter values are not updated.  This was observed in PySpark, 
> but may be present in the Java objects, as the PySpark code appears to be 
> functioning correctly.   (The copy method that interacts with Java is 
> actually implemented in Params.)
> For example, this prints
> Before: 0.8
> After: 0.8
> but After should be 0.75
> {code:python}
> from pyspark.ml.classification import LogisticRegression
> # Load training data
> training = spark \
> .read \
> .format("libsvm") \
> .load("data/mllib/sample_multiclass_classification_data.txt")
> lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
> print("Before:", lr.getOrDefault("elasticNetParam"))
> # Fit the model, but with an updated parameter setting:
> lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})
> print("After:", lr.getOrDefault("elasticNetParam"))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Comment Edited] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)

2019-11-04 Thread John Bauer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966768#comment-16966768
 ] 

John Bauer edited comment on SPARK-29691 at 11/4/19 4:57 PM:
-

I was using this in the context of an MLflow Hyperparameter search. None of the 
model outputs changed whatsoever when fit(params=...) was used.  Updated 
example:

{code:python}
from pyspark.ml.classification import LogisticRegression

# Load training data
training = spark \
.read \
.format("libsvm") \
.load("data/mllib/sample_multiclass_classification_data.txt")

def printStats(lrModel, title):
    trainingSummary = lrModel.summary
    print(title)
    print(lrModel)
    print("Accuracy:", trainingSummary.accuracy)
    print("fMeasure:", trainingSummary.weightedFMeasure())
    print("")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(training)
printStats(lrModel, "elasticNetParam = 0.8, set through __init__")

lrModel1 = lr.fit(training, params={"elasticNetParam": 0.3})
printStats(lrModel1, """elasticNetParam still 0.8, after trying to set through 
fit(..., params={"elasticNetParam" : 0.3}""")

lr03 = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.3)
lrModel03 = lr03.fit(training)
printStats(lrModel03, """Correct results for elasticNetParam = 0.3, set through 
__init__""")
{code}



was (Author: johnhbauer):
I was using this in the context of an MLflow Hyperparameter search. None of the 
model outputs changed whatsoever when fit(params={"elasticNetParam": 0.3} was 
used.  Updated example:

{code:python}
from pyspark.ml.classification import LogisticRegression

# Load training data
training = spark \
.read \
.format("libsvm") \
.load("data/mllib/sample_multiclass_classification_data.txt")

def printStats(lrModel, title):
    trainingSummary = lrModel.summary
    print(title)
    print(lrModel)
    print("Accuracy:", trainingSummary.accuracy)
    print("fMeasure:", trainingSummary.weightedFMeasure())
    print("")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(training)
printStats(lrModel, "elasticNetParam = 0.8, set through __init__")

lrModel1 = lr.fit(training, params={"elasticNetParam": 0.3})
printStats(lrModel1, """elasticNetParam still 0.8, after trying to set through 
fit(..., params={"elasticNetParam" : 0.3}""")

lr03 = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.3)
lrModel03 = lr03.fit(training)
printStats(lrModel03, """Correct results for elasticNetParam = 0.3, set through 
__init__""")
{code}


> Estimator fit method fails to copy params (in PySpark)
> --
>
> Key: SPARK-29691
> URL: https://issues.apache.org/jira/browse/SPARK-29691
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: John Bauer
>Priority: Minor
>
> Estimator `fit` method is supposed to copy a dictionary of params, 
> overwriting the estimator's previous values, before fitting the model. 
> However, the parameter values are not updated.  This was observed in PySpark, 
> but may be present in the Java objects, as the PySpark code appears to be 
> functioning correctly.   (The copy method that interacts with Java is 
> actually implemented in Params.)
> For example, this prints
> Before: 0.8
> After: 0.8
> but After should be 0.75
> {code:python}
> from pyspark.ml.classification import LogisticRegression
> # Load training data
> training = spark \
> .read \
> .format("libsvm") \
> .load("data/mllib/sample_multiclass_classification_data.txt")
> lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
> print("Before:", lr.getOrDefault("elasticNetParam"))
> # Fit the model, but with an updated parameter setting:
> lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})
> print("After:", lr.getOrDefault("elasticNetParam"))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)

2019-11-04 Thread John Bauer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966768#comment-16966768
 ] 

John Bauer edited comment on SPARK-29691 at 11/4/19 4:57 PM:
-

I was using this in the context of an MLflow Hyperparameter search. None of the 
model outputs changed whatsoever when fit(params={"elasticNetParam": 0.3} was 
used.  Updated example:

{code:python}
from pyspark.ml.classification import LogisticRegression

# Load training data
training = spark \
.read \
.format("libsvm") \
.load("data/mllib/sample_multiclass_classification_data.txt")

def printStats(lrModel, title):
    trainingSummary = lrModel.summary
    print(title)
    print(lrModel)
    print("Accuracy:", trainingSummary.accuracy)
    print("fMeasure:", trainingSummary.weightedFMeasure())
    print("")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(training)
printStats(lrModel, "elasticNetParam = 0.8, set through __init__")

lrModel1 = lr.fit(training, params={"elasticNetParam": 0.3})
printStats(lrModel1, """elasticNetParam still 0.8, after trying to set through 
fit(..., params={"elasticNetParam" : 0.3}""")

lr03 = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.3)
lrModel03 = lr03.fit(training)
printStats(lrModel03, """Correct results for elasticNetParam = 0.3, set through 
__init__""")
{code}



was (Author: johnhbauer):
I was using this in the context of an MLflow Hyperparameter search. None of the 
model outputs changed whatsoever when fit(... , params={ ... } was used.  
Updated example:

{code:python}
from pyspark.ml.classification import LogisticRegression

# Load training data
training = spark \
.read \
.format("libsvm") \
.load("data/mllib/sample_multiclass_classification_data.txt")

def printStats(lrModel, title):
    trainingSummary = lrModel.summary
    print(title)
    print(lrModel)
    print("Accuracy:", trainingSummary.accuracy)
    print("fMeasure:", trainingSummary.weightedFMeasure())
    print("")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(training)
printStats(lrModel, "elasticNetParam = 0.8, set through __init__")

lrModel1 = lr.fit(training, params={"elasticNetParam": 0.3})
printStats(lrModel1, """elasticNetParam still 0.8, after trying to set through 
fit(..., params={"elasticNetParam" : 0.3}""")

lr03 = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.3)
lrModel03 = lr03.fit(training)
printStats(lrModel03, """Correct results for elasticNetParam = 0.3, set through 
__init__""")
{code}


> Estimator fit method fails to copy params (in PySpark)
> --
>
> Key: SPARK-29691
> URL: https://issues.apache.org/jira/browse/SPARK-29691
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: John Bauer
>Priority: Minor
>
> Estimator `fit` method is supposed to copy a dictionary of params, 
> overwriting the estimator's previous values, before fitting the model. 
> However, the parameter values are not updated.  This was observed in PySpark, 
> but may be present in the Java objects, as the PySpark code appears to be 
> functioning correctly.   (The copy method that interacts with Java is 
> actually implemented in Params.)
> For example, this prints
> Before: 0.8
> After: 0.8
> but After should be 0.75
> {code:python}
> from pyspark.ml.classification import LogisticRegression
> # Load training data
> training = spark \
> .read \
> .format("libsvm") \
> .load("data/mllib/sample_multiclass_classification_data.txt")
> lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
> print("Before:", lr.getOrDefault("elasticNetParam"))
> # Fit the model, but with an updated parameter setting:
> lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})
> print("After:", lr.getOrDefault("elasticNetParam"))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)

2019-11-04 Thread John Bauer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966768#comment-16966768
 ] 

John Bauer edited comment on SPARK-29691 at 11/4/19 4:56 PM:
-

I was using this in the context of an MLflow Hyperparameter search. None of the 
model outputs changed whatsoever when fit(... , params={ ... } was used.  
Updated example:

{code:python}
from pyspark.ml.classification import LogisticRegression

# Load training data
training = spark \
.read \
.format("libsvm") \
.load("data/mllib/sample_multiclass_classification_data.txt")

def printStats(lrModel, title):
trainingSummary = lrModel.summary
print(title)
print(lrModel)
print("Accuracy:", trainingSummary.accuracy)
print("fMeasure:", trainingSummary.weightedFMeasure())
print("")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(training)
printStats(lrModel, "elasticNetParam = 0.8, set through __init__")

lrModel1 = lr.fit(training, params={"elasticNetParam": 0.3})
printStats(lrModel1, """elasticNetParam still 0.8, after trying to set through 
fit(..., params={"elasticNetParam" : 0.3}""")

lr03 = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.3)
lrModel03 = lr03.fit(training)
printStats(lrModel03, """Correct results for elasticNetParam = 0.3, set through 
__init__""")
{code}



was (Author: johnhbauer):
I was using this in the context of an MLflow Hyperparameter search. None of the 
model outputs changed whatsoever when fit(... , params={ ... } was used.  
Updated example:

{code:python}
def printStats(lrModel, title):
    trainingSummary = lrModel.summary
    print(title)
    print(lrModel)
    print("Accuracy:", trainingSummary.accuracy)
    print("fMeasure:", trainingSummary.weightedFMeasure())
    print("")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(training)
printStats(lrModel, "elasticNetParam = 0.8, set through __init__")

lrModel1 = lr.fit(training, params={"elasticNetParam": 0.3})
printStats(lrModel1, """elasticNetParam still 0.8, after trying to set through 
fit(..., params={"elasticNetParam" : 0.3}""")

lr03 = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.3)
lrModel03 = lr03.fit(training)
printStats(lrModel03, """Correct results for elasticNetParam = 0.3, set through 
__init__""")
{code}


> Estimator fit method fails to copy params (in PySpark)
> --
>
> Key: SPARK-29691
> URL: https://issues.apache.org/jira/browse/SPARK-29691
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: John Bauer
>Priority: Minor
>
> Estimator `fit` method is supposed to copy a dictionary of params, 
> overwriting the estimator's previous values, before fitting the model. 
> However, the parameter values are not updated.  This was observed in PySpark, 
> but may be present in the Java objects, as the PySpark code appears to be 
> functioning correctly.   (The copy method that interacts with Java is 
> actually implemented in Params.)
> For example, this prints
> Before: 0.8
> After: 0.8
> but After should be 0.75
> {code:python}
> from pyspark.ml.classification import LogisticRegression
> # Load training data
> training = spark \
> .read \
> .format("libsvm") \
> .load("data/mllib/sample_multiclass_classification_data.txt")
> lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
> print("Before:", lr.getOrDefault("elasticNetParam"))
> # Fit the model, but with an updated parameter setting:
> lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})
> print("After:", lr.getOrDefault("elasticNetParam"))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)

2019-11-04 Thread John Bauer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966768#comment-16966768
 ] 

John Bauer edited comment on SPARK-29691 at 11/4/19 4:55 PM:
-

I was using this in the context of an MLflow Hyperparameter search. None of the 
model outputs changed whatsoever when fit(... , params={ ... } was used.  
Updated example:

{code:python}
def printStats(lrModel, title):
    trainingSummary = lrModel.summary
    print(title)
    print(lrModel)
    print("Accuracy:", trainingSummary.accuracy)
    print("fMeasure:", trainingSummary.weightedFMeasure())
    print("")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(training)
printStats(lrModel, "elasticNetParam = 0.8, set through __init__")

lrModel1 = lr.fit(training, params={"elasticNetParam": 0.3})
printStats(lrModel1, """elasticNetParam still 0.8, after trying to set through 
fit(..., params={"elasticNetParam" : 0.3}""")

lr03 = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.3)
lrModel03 = lr03.fit(training)
printStats(lrModel03, """Correct results for elasticNetParam = 0.3, set through 
__init__""")
{code}



was (Author: johnhbauer):
I will update the example shortly - I was using this in the context of an 
MLflow Hyperparameter search, and none of the model outputs changed either.

> Estimator fit method fails to copy params (in PySpark)
> --
>
> Key: SPARK-29691
> URL: https://issues.apache.org/jira/browse/SPARK-29691
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: John Bauer
>Priority: Minor
>
> Estimator `fit` method is supposed to copy a dictionary of params, 
> overwriting the estimator's previous values, before fitting the model. 
> However, the parameter values are not updated.  This was observed in PySpark, 
> but may be present in the Java objects, as the PySpark code appears to be 
> functioning correctly.   (The copy method that interacts with Java is 
> actually implemented in Params.)
> For example, this prints
> Before: 0.8
> After: 0.8
> but After should be 0.75
> {code:python}
> from pyspark.ml.classification import LogisticRegression
> # Load training data
> training = spark \
> .read \
> .format("libsvm") \
> .load("data/mllib/sample_multiclass_classification_data.txt")
> lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
> print("Before:", lr.getOrDefault("elasticNetParam"))
> # Fit the model, but with an updated parameter setting:
> lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})
> print("After:", lr.getOrDefault("elasticNetParam"))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)

2019-11-04 Thread John Bauer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966768#comment-16966768
 ] 

John Bauer commented on SPARK-29691:


I will update the example shortly - I was using this in the context of an 
MLflow Hyperparameter search, and none of the model outputs changed either.

> Estimator fit method fails to copy params (in PySpark)
> --
>
> Key: SPARK-29691
> URL: https://issues.apache.org/jira/browse/SPARK-29691
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: John Bauer
>Priority: Minor
>
> Estimator `fit` method is supposed to copy a dictionary of params, 
> overwriting the estimator's previous values, before fitting the model. 
> However, the parameter values are not updated.  This was observed in PySpark, 
> but may be present in the Java objects, as the PySpark code appears to be 
> functioning correctly.   (The copy method that interacts with Java is 
> actually implemented in Params.)
> For example, this prints
> Before: 0.8
> After: 0.8
> but After should be 0.75
> {code:python}
> from pyspark.ml.classification import LogisticRegression
> # Load training data
> training = spark \
> .read \
> .format("libsvm") \
> .load("data/mllib/sample_multiclass_classification_data.txt")
> lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
> print("Before:", lr.getOrDefault("elasticNetParam"))
> # Fit the model, but with an updated parameter setting:
> lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})
> print("After:", lr.getOrDefault("elasticNetParam"))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12806) Support SQL expressions extracting values from VectorUDT

2019-10-31 Thread John Bauer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964481#comment-16964481
 ] 

John Bauer commented on SPARK-12806:


Also, when PyArrow is used to convert a Spark DataFrame for a pandas_udf, the 
conversion falls back to the non-optimized path as soon as a VectorUDT column is 
encountered, losing much of the advantage of using PyArrow.
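
A hedged illustration of that conversion penalty, shown here with toPandas(), which hits the same Arrow type limitation (assumes Spark 2.4 config names, a SparkSession `spark`, and a DataFrame `predictions` containing a VectorUDT "probability" column):

{code:python}
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.fallback.enabled", "true")

# Plain numeric columns convert through Arrow, but the VectorUDT column forces
# a fallback to the slower row-by-row conversion (a warning is logged).
pdf = predictions.select("prediction", "probability").toPandas()
{code}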

> Support SQL expressions extracting values from VectorUDT
> 
>
> Key: SPARK-12806
> URL: https://issues.apache.org/jira/browse/SPARK-12806
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, SQL
>Affects Versions: 1.6.0
>Reporter: Feynman Liang
>Priority: Major
>  Labels: bulk-closed
>
> Use cases exist where a specific index within a {{VectorUDT}} column of a 
> {{DataFrame}} is required. For example, we may be interested in extracting a 
> specific class probability from the {{probabilityCol}} of a 
> {{LogisticRegression}} to compute losses. However, if {{probability}} is a 
> column of {{df}} with type {{VectorUDT}}, the following code fails:
> {code}
> df.select("probability.0")
> AnalysisException: u"Can't extract value from probability"
> {code}
> thrown from 
> {{sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala}}.
> {{VectorUDT}} essentially wraps a {{StructType}}, hence one would expect it 
> to support value extraction Expressions in an analogous way.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12806) Support SQL expressions extracting values from VectorUDT

2019-10-31 Thread John Bauer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964475#comment-16964475
 ] 

John Bauer commented on SPARK-12806:


This is still a problem.  For example, classification models emit their class 
probabilities as a VectorUDT column, whose elements cannot be extracted in PySpark.  
This makes it problematic to build boosting/bagging algorithms, or even just to 
use those probabilities as additional features in a second model.
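
A common workaround in the meantime (a sketch; `predictions` is assumed to be the output of a fitted classifier with a VectorUDT "probability" column):

{code:python}
from pyspark.sql.functions import udf, lit
from pyspark.sql.types import DoubleType

# predictions.select("probability.0") raises AnalysisException, so pull the
# element out with a Python UDF instead (works, but bypasses Catalyst).
vector_element = udf(lambda v, i: float(v[i]), DoubleType())
predictions = predictions.withColumn("p_class_1", vector_element("probability", lit(1)))
{code}
(For what it's worth, pyspark.ml.functions.vector_to_array in Spark 3.0+ covers this case without a Python UDF.)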

> Support SQL expressions extracting values from VectorUDT
> 
>
> Key: SPARK-12806
> URL: https://issues.apache.org/jira/browse/SPARK-12806
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, SQL
>Affects Versions: 1.6.0
>Reporter: Feynman Liang
>Priority: Major
>  Labels: bulk-closed
>
> Use cases exist where a specific index within a {{VectorUDT}} column of a 
> {{DataFrame}} is required. For example, we may be interested in extracting a 
> specific class probability from the {{probabilityCol}} of a 
> {{LogisticRegression}} to compute losses. However, if {{probability}} is a 
> column of {{df}} with type {{VectorUDT}}, the following code fails:
> {code}
> df.select("probability.0")
> AnalysisException: u"Can't extract value from probability"
> {code}
> thrown from 
> {{sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala}}.
> {{VectorUDT}} essentially wraps a {{StructType}}, hence one would expect it 
> to support value extraction Expressions in an analogous way.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)

2019-10-31 Thread John Bauer (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Bauer updated SPARK-29691:
---
Description: 
Estimator `fit` method is supposed to copy a dictionary of params, overwriting 
the estimator's previous values, before fitting the model. However, the 
parameter values are not updated.  This was observed in PySpark, but may be 
present in the Java objects, as the PySpark code appears to be functioning 
correctly.   (The copy method that interacts with Java is actually implemented 
in Params.)

For example, this prints

Before: 0.8
After: 0.8

but After should be 0.75

{code:python}
from pyspark.ml.classification import LogisticRegression

# Load training data
training = spark \
.read \
.format("libsvm") \
.load("data/mllib/sample_multiclass_classification_data.txt")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
print("Before:", lr.getOrDefault("elasticNetParam"))

# Fit the model, but with an updated parameter setting:
lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})

print("After:", lr.getOrDefault("elasticNetParam"))
{code}

  was:
Estimator `fit` method (implemented in Params) is supposed to copy a dictionary 
of params, overwriting the estimator's previous values, before fitting the 
model.  However, the parameter values are not updated.  This was observed in 
PySpark, but may be present in the Java objects, as the PySpark code appears to 
be functioning correctly.

For example, this prints

Before: 0.8
After: 0.8

but After should be 0.75

{code:python}
from pyspark.ml.classification import LogisticRegression

# Load training data
training = spark \
.read \
.format("libsvm") \
.load("data/mllib/sample_multiclass_classification_data.txt")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
print("Before:", lr.getOrDefault("elasticNetParam"))

# Fit the model, but with an updated parameter setting:
lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})

print("After:", lr.getOrDefault("elasticNetParam"))
{code}


> Estimator fit method fails to copy params (in PySpark)
> --
>
> Key: SPARK-29691
> URL: https://issues.apache.org/jira/browse/SPARK-29691
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: John Bauer
>Priority: Minor
>
> Estimator `fit` method is supposed to copy a dictionary of params, 
> overwriting the estimator's previous values, before fitting the model. 
> However, the parameter values are not updated.  This was observed in PySpark, 
> but may be present in the Java objects, as the PySpark code appears to be 
> functioning correctly.   (The copy method that interacts with Java is 
> actually implemented in Params.)
> For example, this prints
> Before: 0.8
> After: 0.8
> but After should be 0.75
> {code:python}
> from pyspark.ml.classification import LogisticRegression
> # Load training data
> training = spark \
> .read \
> .format("libsvm") \
> .load("data/mllib/sample_multiclass_classification_data.txt")
> lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
> print("Before:", lr.getOrDefault("elasticNetParam"))
> # Fit the model, but with an updated parameter setting:
> lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})
> print("After:", lr.getOrDefault("elasticNetParam"))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)

2019-10-31 Thread John Bauer (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Bauer updated SPARK-29691:
---
Description: 
Estimator `fit` method (implemented in Params) is supposed to copy a dictionary 
of params, overwriting the estimator's previous values, before fitting the 
model.  However, the parameter values are not updated.  This was observed in 
PySpark, but may be present in the Java objects, as the PySpark code appears to 
be functioning correctly.

For example, this prints

Before: 0.8
After: 0.8

but After should be 0.75

{code:python}
from pyspark.ml.classification import LogisticRegression

# Load training data
training = spark \
.read \
.format("libsvm") \
.load("data/mllib/sample_multiclass_classification_data.txt")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
print("Before:", lr.getOrDefault("elasticNetParam"))

# Fit the model, but with an updated parameter setting:
lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})

print("After:", lr.getOrDefault("elasticNetParam"))
{code}

  was:
Estimator `fit` method (implemented in Params) is supposed to copy a dictionary 
of params, overwriting the estimator's previous values, before fitting the 
model.  However, the parameter values are not updated.  This was observed in 
PySpark, but may be present in the Java objects, as the PySpark code appears to 
be functioning correctly.

For example, this prints

{{Before: 0.8
After: 0.8}}

but After should be 0.75


{code:python}
from pyspark.ml.classification import LogisticRegression

# Load training data
training = spark \
.read \
.format("libsvm") \
.load("data/mllib/sample_multiclass_classification_data.txt")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
print("Before:", lr.getOrDefault("elasticNetParam"))

# Fit the model, but with an updated parameter setting:
lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})

print("After:", lr.getOrDefault("elasticNetParam"))
{code}


> Estimator fit method fails to copy params (in PySpark)
> --
>
> Key: SPARK-29691
> URL: https://issues.apache.org/jira/browse/SPARK-29691
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: John Bauer
>Priority: Minor
>
> Estimator `fit` method (implemented in Params) is supposed to copy a 
> dictionary of params, overwriting the estimator's previous values, before 
> fitting the model.  However, the parameter values are not updated.  This was 
> observed in PySpark, but may be present in the Java objects, as the PySpark 
> code appears to be functioning correctly.
> For example, this prints
> Before: 0.8
> After: 0.8
> but After should be 0.75
> {code:python}
> from pyspark.ml.classification import LogisticRegression
> # Load training data
> training = spark \
> .read \
> .format("libsvm") \
> .load("data/mllib/sample_multiclass_classification_data.txt")
> lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
> print("Before:", lr.getOrDefault("elasticNetParam"))
> # Fit the model, but with an updated parameter setting:
> lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})
> print("After:", lr.getOrDefault("elasticNetParam"))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)

2019-10-31 Thread John Bauer (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Bauer updated SPARK-29691:
---
Description: 
Estimator `fit` method (implemented in Params) is supposed to copy a dictionary 
of params, overwriting the estimator's previous values, before fitting the 
model.  However, the parameter values are not updated.  This was observed in 
PySpark, but may be present in the Java objects, as the PySpark code appears to 
be functioning correctly.

For example, this prints

{{Before: 0.8
After: 0.8}}

but After should be 0.75


{code:python}
from pyspark.ml.classification import LogisticRegression

# Load training data
training = spark \
.read \
.format("libsvm") \
.load("data/mllib/sample_multiclass_classification_data.txt")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
print("Before:", lr.getOrDefault("elasticNetParam"))

# Fit the model, but with an updated parameter setting:
lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})

print("After:", lr.getOrDefault("elasticNetParam"))
{code}

  was:
Estimator `fit` method (implemented in Params) is supposed to copy a dictionary 
of params, overwriting the estimator's previous values, before fitting the 
model.  However, the parameter values are not updated.  This was observed in 
PySpark, but may be present in the Java objects, as the PySpark code appears to 
be functioning correctly.

For example, this prints

{{Before: 0.8
After: 0.8}}

but After should be 0.75

{{from pyspark.ml.classification import LogisticRegression

# Load training data
training = spark \
.read \
.format("libsvm") \
.load("data/mllib/sample_multiclass_classification_data.txt")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
print("Before:", lr.getOrDefault("elasticNetParam"))

# Fit the model, but with an updated parameter setting:
lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})

print("After:", lr.getOrDefault("elasticNetParam"))}}


> Estimator fit method fails to copy params (in PySpark)
> --
>
> Key: SPARK-29691
> URL: https://issues.apache.org/jira/browse/SPARK-29691
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: John Bauer
>Priority: Minor
>
> Estimator `fit` method (implemented in Params) is supposed to copy a 
> dictionary of params, overwriting the estimator's previous values, before 
> fitting the model.  However, the parameter values are not updated.  This was 
> observed in PySpark, but may be present in the Java objects, as the PySpark 
> code appears to be functioning correctly.
> For example, this prints
> {{Before: 0.8
> After: 0.8}}
> but After should be 0.75
> {code:python}
> from pyspark.ml.classification import LogisticRegression
> # Load training data
> training = spark \
> .read \
> .format("libsvm") \
> .load("data/mllib/sample_multiclass_classification_data.txt")
> lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
> print("Before:", lr.getOrDefault("elasticNetParam"))
> # Fit the model, but with an updated parameter setting:
> lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})
> print("After:", lr.getOrDefault("elasticNetParam"))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)

2019-10-31 Thread John Bauer (Jira)
John Bauer created SPARK-29691:
--

 Summary: Estimator fit method fails to copy params (in PySpark)
 Key: SPARK-29691
 URL: https://issues.apache.org/jira/browse/SPARK-29691
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.4
Reporter: John Bauer


Estimator `fit` method (implemented in Params) is supposed to copy a dictionary 
of params, overwriting the estimator's previous values, before fitting the 
model.  However, the parameter values are not updated.  This was observed in 
PySpark, but may be present in the Java objects, as the PySpark code appears to 
be functioning correctly.

For example, this prints

{{Before: 0.8
After: 0.8}}

but After should be 0.75

{{from pyspark.ml.classification import LogisticRegression

# Load training data
training = spark \
.read \
.format("libsvm") \
.load("data/mllib/sample_multiclass_classification_data.txt")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
print("Before:", lr.getOrDefault("elasticNetParam"))

# Fit the model, but with an updated parameter setting:
lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})

print("After:", lr.getOrDefault("elasticNetParam"))}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2019-06-04 Thread John Bauer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855948#comment-16855948
 ] 

John Bauer edited comment on SPARK-17025 at 6/4/19 11:12 PM:
-

[~Hadar] [~yug95] [~ralucamaria.b...@gmail.com] I wrote a minimal example of a 
PySpark estimator/model pair which can be saved and loaded, at 
[ImputeNormal|https://github.com/JohnHBauer/ImputeNormal]. It imputes missing 
values from a normal distribution, using mean and standard deviation parameters 
estimated from the data, so it might be useful for that too.
Let me know if it helps you.
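
In case a usage sketch helps, something like the following is how it can be 
dropped into a pipeline and persisted (the column names, the path, and the df 
variable are placeholders, and it assumes ImputeNormal from the repository 
above is importable):

{code:python}
from pyspark.ml import Pipeline, PipelineModel

# Placeholder column names; df is assumed to have a numeric "age" column with nulls.
impute = ImputeNormal(inputCol="age", outputCol="age_imputed")
pipeline = Pipeline(stages=[impute])
model = pipeline.fit(df)

# Persist the fitted pipeline and read it back.
model.write().overwrite().save("impute_pipeline_model")
reloaded = PipelineModel.load("impute_pipeline_model")
imputed = reloaded.transform(df)
{code}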


was (Author: johnhbauer):
[~Hadar] [~yug95] [~ralucamaria.b...@gmail.com] I wrote a minimal example of a 
PySpark estimator/model pair which can be saved and loaded at 
[ImputeNormal|https://github.com/JohnHBauer/ImputeNormal] which imputes missing 
values from a normal distribution using parameters estimated from the data.  
Let me know if it helps you,

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Assignee: Ajay Saini
>Priority: Minor
> Fix For: 2.3.0
>
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2019-06-04 Thread John Bauer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855948#comment-16855948
 ] 

John Bauer edited comment on SPARK-17025 at 6/4/19 11:10 PM:
-

[~Hadar] [~yug95] [~ralucamaria.b...@gmail.com] I wrote a minimal example of a 
PySpark estimator/model pair which can be saved and loaded at 
[ImputeNormal|https://github.com/JohnHBauer/ImputeNormal] which imputes missing 
values from a normal distribution using parameters estimated from the data.  
Let me know if it helps you.


was (Author: johnhbauer):
[~Hadar] [~yug95] I wrote a minimal example of a PySpark estimator/model pair 
which can be saved and loaded at 
[ImputeNormal|https://github.com/JohnHBauer/ImputeNormal] which imputes missing 
values from a normal distribution using parameters estimated from the data.  
Let me know if it helps you,

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Assignee: Ajay Saini
>Priority: Minor
> Fix For: 2.3.0
>
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2019-06-04 Thread John Bauer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855948#comment-16855948
 ] 

John Bauer commented on SPARK-17025:


[~Hadar] [~yug95] I wrote a minimal example of a PySpark estimator/model pair 
which can be saved and loaded at 
[ImputeNormal|https://github.com/JohnHBauer/ImputeNormal] which imputes missing 
values from a normal distribution using parameters estimated from the data.  
Let me know if it helps you.

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Assignee: Ajay Saini
>Priority: Minor
> Fix For: 2.3.0
>
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21542) Helper functions for custom Python Persistence

2018-11-09 Thread John Bauer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681895#comment-16681895
 ] 

John Bauer edited comment on SPARK-21542 at 11/9/18 8:07 PM:
-

Compared to the previous, the above example is a) much more minimal, b) 
genuinely useful, and c) actually works with save and load, for example:
{code:java}
impute.write().save("impute")
imp = ImputeNormal.load("impute")
imp.explainParams()
impute_model.write().save("impute_model")
impm = ImputeNormalModel.load("impute_model")
impm.explainParams(){code}


was (Author: johnhbauer):
This is a) much more minimal, b) genuinely useful, and c) actually works with 
save and load, for example:
{code:java}
impute.write().save("impute")
imp = ImputeNormal.load("impute")
imp.explainParams()
impute_model.write().save("impute_model")
impm = ImputeNormalModel.load("impute_model")
impm.explainParams(){code}

> Helper functions for custom Python Persistence
> --
>
> Key: SPARK-21542
> URL: https://issues.apache.org/jira/browse/SPARK-21542
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>Assignee: Ajay Saini
>Priority: Major
> Fix For: 2.3.0
>
>
> Currently, there is no way to easily persist Json-serializable parameters in 
> Python only. All parameters in Python are persisted by converting them to 
> Java objects and using the Java persistence implementation. In order to 
> facilitate the creation of custom Python-only pipeline stages, it would be 
> good to have a Python-only persistence framework so that these stages do not 
> need to be implemented in Scala for persistence. 
> This task involves:
> - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, 
> DefaultParamsReader, and DefaultParamsWriter in pyspark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21542) Helper functions for custom Python Persistence

2018-11-09 Thread John Bauer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681895#comment-16681895
 ] 

John Bauer edited comment on SPARK-21542 at 11/9/18 7:56 PM:
-

This is a) much more minimal, b) genuinely useful, and c) actually works with 
save and load, for example:
{code:java}
impute.write().save("impute")
imp = ImputeNormal.load("impute")
imp.explainParams()
impute_model.write().save("impute_model")
impm = ImputeNormalModel.load("impute_model")
impm.explainParams(){code}


was (Author: johnhbauer):
This is a) much more minimal, b) genuinely useful, and c) actually works with 
save and load, for example:
{code:java}
impute.write().save("impute")
 imp = ImputeNormal.load("impute")
 imp.explainParams()
 impute_model.write().save("impute_model")
 impm = ImputeNormalModel.load("imputer_model")
 impm = ImputeNormalModel.load("impute_model")
 impm.getInputCol()
 impm.getOutputCol()
 impm.getMean()
 impm.getStddev(){code}

> Helper functions for custom Python Persistence
> --
>
> Key: SPARK-21542
> URL: https://issues.apache.org/jira/browse/SPARK-21542
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>Assignee: Ajay Saini
>Priority: Major
> Fix For: 2.3.0
>
>
> Currently, there is no way to easily persist Json-serializable parameters in 
> Python only. All parameters in Python are persisted by converting them to 
> Java objects and using the Java persistence implementation. In order to 
> facilitate the creation of custom Python-only pipeline stages, it would be 
> good to have a Python-only persistence framework so that these stages do not 
> need to be implemented in Scala for persistence. 
> This task involves:
> - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, 
> DefaultParamsReader, and DefaultParamsWriter in pyspark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21542) Helper functions for custom Python Persistence

2018-11-09 Thread John Bauer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681895#comment-16681895
 ] 

John Bauer commented on SPARK-21542:


This is a) much more minimal, b) genuinely useful, and c) actually works with 
save and load, for example:

impute.write().save("impute")
imp = ImputeNormal.load("impute")
imp.explainParams()
impute_model.write().save("impute_model")
impm = ImputeNormalModel.load("imputer_model")
impm = ImputeNormalModel.load("impute_model")
impm.getInputCol()
impm.getOutputCol()
impm.getMean()
impm.getStddev()

> Helper functions for custom Python Persistence
> --
>
> Key: SPARK-21542
> URL: https://issues.apache.org/jira/browse/SPARK-21542
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>Assignee: Ajay Saini
>Priority: Major
> Fix For: 2.3.0
>
>
> Currently, there is no way to easily persist Json-serializable parameters in 
> Python only. All parameters in Python are persisted by converting them to 
> Java objects and using the Java persistence implementation. In order to 
> facilitate the creation of custom Python-only pipeline stages, it would be 
> good to have a Python-only persistence framework so that these stages do not 
> need to be implemented in Scala for persistence. 
> This task involves:
> - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, 
> DefaultParamsReader, and DefaultParamsWriter in pyspark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21542) Helper functions for custom Python Persistence

2018-11-09 Thread John Bauer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681895#comment-16681895
 ] 

John Bauer edited comment on SPARK-21542 at 11/9/18 7:54 PM:
-

This is a) much more minimal, b) genuinely useful, and c) actually works with 
save and load, for example:
{code:java}
impute.write().save("impute")
 imp = ImputeNormal.load("impute")
 imp.explainParams()
 impute_model.write().save("impute_model")
 impm = ImputeNormalModel.load("imputer_model")
 impm = ImputeNormalModel.load("impute_model")
 impm.getInputCol()
 impm.getOutputCol()
 impm.getMean()
 impm.getStddev(){code}


was (Author: johnhbauer):
This is a) much more minimal, b) genuinely useful, and c) actually works with 
save and load, for example:

impute.write().save("impute")
imp = ImputeNormal.load("impute")
imp.explainParams()
impute_model.write().save("impute_model")
impm = ImputeNormalModel.load("imputer_model")
impm = ImputeNormalModel.load("impute_model")
impm.getInputCol()
impm.getOutputCol()
impm.getMean()
impm.getStddev()

> Helper functions for custom Python Persistence
> --
>
> Key: SPARK-21542
> URL: https://issues.apache.org/jira/browse/SPARK-21542
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>Assignee: Ajay Saini
>Priority: Major
> Fix For: 2.3.0
>
>
> Currently, there is no way to easily persist Json-serializable parameters in 
> Python only. All parameters in Python are persisted by converting them to 
> Java objects and using the Java persistence implementation. In order to 
> facilitate the creation of custom Python-only pipeline stages, it would be 
> good to have a Python-only persistence framework so that these stages do not 
> need to be implemented in Scala for persistence. 
> This task involves:
> - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, 
> DefaultParamsReader, and DefaultParamsWriter in pyspark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21542) Helper functions for custom Python Persistence

2018-11-09 Thread John Bauer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681891#comment-16681891
 ] 

John Bauer commented on SPARK-21542:


{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, randn

from pyspark import keyword_only
from pyspark.ml import Estimator, Model
#from pyspark.ml.feature import SQLTransformer
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.ml.param import Param, Params, TypeConverters
from pyspark.ml.param.shared import HasInputCol, HasOutputCol

spark = SparkSession\
    .builder\
    .appName("ImputeNormal")\
    .getOrCreate()

class ImputeNormal(Estimator,
                   HasInputCol,
                   HasOutputCol,
                   DefaultParamsReadable,
                   DefaultParamsWritable,
                   ):
    @keyword_only
    def __init__(self, inputCol="inputCol", outputCol="outputCol"):
        super(ImputeNormal, self).__init__()

        self._setDefault(inputCol="inputCol", outputCol="outputCol")
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol="inputCol", outputCol="outputCol"):
        """
        setParams(self, inputCol="inputCol", outputCol="outputCol")
        """
        kwargs = self._input_kwargs
        self._set(**kwargs)
        return self

    def _fit(self, data):
        inputCol = self.getInputCol()
        outputCol = self.getOutputCol()

        stats = data.select(inputCol).describe()
        mean = stats.where(col("summary") == "mean").take(1)[0][inputCol]
        stddev = stats.where(col("summary") == "stddev").take(1)[0][inputCol]

        return ImputeNormalModel(mean=float(mean),
                                 stddev=float(stddev),
                                 inputCol=inputCol,
                                 outputCol=outputCol,
                                 )
        # FOR A TRULY MINIMAL BUT LESS DIDACTICALLY EFFECTIVE DEMO, DO INSTEAD:
        #sql_text = "SELECT *, IF({inputCol} IS NULL, {stddev} * randn() + {mean}, {inputCol}) AS {outputCol} FROM __THIS__"
        #
        #return SQLTransformer(statement=sql_text.format(stddev=stddev, mean=mean, inputCol=inputCol, outputCol=outputCol))


class ImputeNormalModel(Model,
                        HasInputCol,
                        HasOutputCol,
                        DefaultParamsReadable,
                        DefaultParamsWritable,
                        ):

    mean = Param(Params._dummy(), "mean",
                 "Mean value of imputations. Calculated by fit method.",
                 typeConverter=TypeConverters.toFloat)

    stddev = Param(Params._dummy(), "stddev",
                   "Standard deviation of imputations. Calculated by fit method.",
                   typeConverter=TypeConverters.toFloat)

    @keyword_only
    def __init__(self, mean=0.0, stddev=1.0, inputCol="inputCol", outputCol="outputCol"):
        super(ImputeNormalModel, self).__init__()

        self._setDefault(mean=0.0, stddev=1.0, inputCol="inputCol", outputCol="outputCol")
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, mean=0.0, stddev=1.0, inputCol="inputCol", outputCol="outputCol"):
        """
        setParams(self, mean=0.0, stddev=1.0, inputCol="inputCol", outputCol="outputCol")
        """
        kwargs = self._input_kwargs
        self._set(**kwargs)
        return self

    def getMean(self):
        return self.getOrDefault(self.mean)

    def setMean(self, mean):
        self._set(mean=mean)

    def getStddev(self):
        return self.getOrDefault(self.stddev)

    def setStddev(self, stddev):
        self._set(stddev=stddev)

    def _transform(self, data):
        mean = self.getMean()
        stddev = self.getStddev()
        inputCol = self.getInputCol()
        outputCol = self.getOutputCol()

        df = data.withColumn(outputCol,
                             when(col(inputCol).isNull(),
                                  stddev * randn() + mean).\
                             otherwise(col(inputCol)))
        return df


if __name__ == "__main__":

    train = spark.createDataFrame([[0], [1], [2]] + [[None]] * 100, ['input'])
    impute = ImputeNormal(inputCol='input', outputCol='output')
    impute_model = impute.fit(train)
    print("Input column: {}".format(impute_model.getInputCol()))
    print("Output column: {}".format(impute_model.getOutputCol()))
    print("Mean: {}".format(impute_model.getMean()))
    print("Standard Deviation: {}".format(impute_model.getStddev()))
    test = impute_model.transform(train)
    test.show(10)
    test.describe().show()
    print("mean and stddev for outputCol should be close to those of inputCol")
{code}
 

> Helper functions for custom Python Persistence
> 

[jira] [Commented] (SPARK-21542) Helper functions for custom Python Persistence

2018-10-01 Thread John Bauer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634679#comment-16634679
 ] 

John Bauer commented on SPARK-21542:


The above is not as minimal as I would have liked.  It is based on the unit 
tests associated with the fix referenced for DefaultParamsReadable and 
DefaultParamsWritable, which I thought would test the desired behavior, i.e. 
saving and loading a pipeline after calling fit().  Unfortunately this was not 
tested, so I flailed at the code for a while until I got something that worked. 
A lot of the stuff left over from setting up unit tests could probably be 
removed, but at least this seems to work.
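
Stripped down to just the persistence round trip, a sketch of the pattern looks 
roughly like this (PlusOne is a hypothetical toy stage, the path is a 
placeholder, and it assumes Spark 2.3+, a SparkSession named spark, and that 
the class is importable when the pipeline is loaded back):

{code:python}
from pyspark import keyword_only
from pyspark.ml import Pipeline, PipelineModel, Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql.functions import col

class PlusOne(Transformer, HasInputCol, HasOutputCol,
              DefaultParamsReadable, DefaultParamsWritable):
    """Toy Python-only stage: writes inputCol + 1 into outputCol."""
    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super(PlusOne, self).__init__()
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None):
        kwargs = self._input_kwargs
        self._set(**kwargs)
        return self

    def _transform(self, dataset):
        return dataset.withColumn(self.getOutputCol(),
                                  col(self.getInputCol()) + 1)

df = spark.createDataFrame([(1.0,), (2.0,)], ["x"])
model = Pipeline(stages=[PlusOne(inputCol="x", outputCol="y")]).fit(df)

# Save the fitted pipeline and load it back without any Scala counterpart.
model.write().overwrite().save("plus_one_pipeline_model")
PipelineModel.load("plus_one_pipeline_model").transform(df).show()
{code}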

> Helper functions for custom Python Persistence
> --
>
> Key: SPARK-21542
> URL: https://issues.apache.org/jira/browse/SPARK-21542
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>Assignee: Ajay Saini
>Priority: Major
> Fix For: 2.3.0
>
>
> Currently, there is no way to easily persist Json-serializable parameters in 
> Python only. All parameters in Python are persisted by converting them to 
> Java objects and using the Java persistence implementation. In order to 
> facilitate the creation of custom Python-only pipeline stages, it would be 
> good to have a Python-only persistence framework so that these stages do not 
> need to be implemented in Scala for persistence. 
> This task involves:
> - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, 
> DefaultParamsReader, and DefaultParamsWriter in pyspark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21542) Helper functions for custom Python Persistence

2018-10-01 Thread John Bauer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634677#comment-16634677
 ] 

John Bauer commented on SPARK-21542:



{code:python}
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Sep 27 10:25:10 2018

@author: JohnBauer
"""
from pyspark.sql import DataFrame, Row
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.sql.functions import udf

from pyspark import keyword_only, SparkContext
from pyspark.ml import Estimator, Model, Pipeline, PipelineModel, Transformer, UnaryTransformer

from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
#from pyspark.ml.util import *
from pyspark.ml.param import Param, Params, TypeConverters
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.sql.types import FloatType, DoubleType
#, LongType, ArrayType, StringType, StructType, StructField

spark = SparkSession\
    .builder\
    .appName("Minimal_1")\
    .getOrCreate()

data_path = "/Users/JohnBauer/spark/data/mllib"
# Load training data
data = spark.read.format("libsvm").load("{}/sample_libsvm_data.txt".format(data_path))
train, test = data.randomSplit([0.7, 0.3])
train.show(5)


class MockDataset(DataFrame):

    def __init__(self):
        self.index = 0


class HasFake(Params):

    def __init__(self):
        super(HasFake, self).__init__()
        self.fake = Param(self, "fake", "fake param")

    def getFake(self):
        return self.getOrDefault(self.fake)


class MockTransformer(Transformer, DefaultParamsReadable, DefaultParamsWritable, HasFake):

    def __init__(self):
        super(MockTransformer, self).__init__()
        self.dataset_index = None

    def _transform(self, dataset):
        self.dataset_index = dataset.index
        dataset.index += 1
        return dataset


class MockUnaryTransformer(UnaryTransformer,
                           DefaultParamsReadable,
                           DefaultParamsWritable,):
                           #HasInputCol):

    shift = Param(Params._dummy(), "shift", "The amount by which to shift " +
                  "data in a DataFrame",
                  typeConverter=TypeConverters.toFloat)

    inputCol = Param(Params._dummy(), "inputCol", "column of DataFrame to transform",
                     typeConverter=TypeConverters.toString)

    outputCol = Param(Params._dummy(), "outputCol", "name of transformed column " +
                      "to be added to DataFrame",
                      typeConverter=TypeConverters.toString)

    @keyword_only
    def __init__(self, shiftVal=1, inputCol="features", outputCol="outputCol"):  #, inputCol='features'):
        super(MockUnaryTransformer, self).__init__()
        self._setDefault(shift=1)
        self._set(shift=shiftVal)
        self._setDefault(inputCol=inputCol)
        self._setDefault(outputCol=outputCol)

    def getShift(self):
        return self.getOrDefault(self.shift)

    def setShift(self, shift):
        self._set(shift=shift)

    def createTransformFunc(self):
        shiftVal = self.getShift()
        return lambda x: x + shiftVal

    def outputDataType(self):
        return DoubleType()

    def validateInputType(self, inputType):
        if inputType != DoubleType():
            print("input type: {}".format(inputType))
            return
            #raise TypeError("Bad input type: {}. ".format(inputType) +
            #                "Requires Double.")

    def _transform(self, dataset):
        shift = self.getOrDefault("shift")

        def f(v):
            return v + shift

        t = FloatType()
        out_col = self.getOutputCol()
        in_col = dataset[self.getInputCol()]
        return dataset.withColumn(out_col, udf(f, t)(in_col))


class MockEstimator(Estimator, DefaultParamsReadable, DefaultParamsWritable, HasFake):

    def __init__(self):
        super(MockEstimator, self).__init__()
        self.dataset_index = None

    def _fit(self, dataset):
        self.dataset_index = dataset.index
        model = MockModel()
        self._copyValues(model)
        return model


class MockModel(MockTransformer, Model, HasFake):
    pass


#class PipelineTests(PySparkTestCase):
class PipelineTests(object):

    def test_pipeline(self, data=None):
        #dataset = MockDataset()
        dataset = MockDataset() if data is None else data
        estimator0 = MockEstimator()
        transformer1 = MockTransformer()
        estimator2 = MockEstimator()
        transformer3 = MockTransformer()
        transformer4 = MockUnaryTransformer(inputCol="label",
                                            outputCol="shifted_label")
        pipeline = Pipeline(stages=[estimator0, transformer1, estimator2,
                                    transformer3, transformer4])
        pipeline_model = pipeline.fit(dataset,

[jira] [Commented] (SPARK-21542) Helper functions for custom Python Persistence

2018-09-11 Thread John Bauer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611085#comment-16611085
 ] 

John Bauer commented on SPARK-21542:


You don't show your code for __init__ or setParams.  I recall getting this 
error before I used the @keyword_only decorator; see, for example, 
https://stackoverflow.com/questions/32331848/create-a-custom-transformer-in-pyspark-ml

I hope to get my own custom transformer pipeline to persist sometime next week. 
If I succeed, I will try to provide an example if no one else has.
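
For reference, the __init__/setParams pattern I had in mind looks roughly like 
this (MyShift and its pass-through _transform are placeholders, not code from 
your pipeline):

{code:python}
from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol

class MyShift(Transformer, HasInputCol, HasOutputCol):
    # @keyword_only captures the keyword arguments actually passed
    # into self._input_kwargs, which setParams then applies via _set.
    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super(MyShift, self).__init__()
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None):
        kwargs = self._input_kwargs
        self._set(**kwargs)
        return self

    def _transform(self, dataset):
        # Placeholder logic: just copy the input column to the output column.
        return dataset.withColumn(self.getOutputCol(), dataset[self.getInputCol()])
{code}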

> Helper functions for custom Python Persistence
> --
>
> Key: SPARK-21542
> URL: https://issues.apache.org/jira/browse/SPARK-21542
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>Assignee: Ajay Saini
>Priority: Major
> Fix For: 2.3.0
>
>
> Currently, there is no way to easily persist Json-serializable parameters in 
> Python only. All parameters in Python are persisted by converting them to 
> Java objects and using the Java persistence implementation. In order to 
> facilitate the creation of custom Python-only pipeline stages, it would be 
> good to have a Python-only persistence framework so that these stages do not 
> need to be implemented in Scala for persistence. 
> This task involves:
> - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, 
> DefaultParamsReader, and DefaultParamsWriter in pyspark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23955) typo in parameter name 'rawPredicition'

2018-04-10 Thread John Bauer (JIRA)
John Bauer created SPARK-23955:
--

 Summary: typo in parameter name 'rawPredicition'
 Key: SPARK-23955
 URL: https://issues.apache.org/jira/browse/SPARK-23955
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.0
Reporter: John Bauer


The classifier.py MultilayerPerceptronClassifier.__init__ API call has the typo 
rawPredicition instead of rawPrediction.

The typo is also present in the documentation.
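
A quick illustration of how the typo surfaces, assuming the misspelling is in 
the __init__/setParams keyword arguments so the correctly spelled keyword is 
rejected (the layers value here is arbitrary):

{code:python}
from pyspark.ml.classification import MultilayerPerceptronClassifier

# On affected versions this raises
#   TypeError: __init__() got an unexpected keyword argument 'rawPredictionCol'
# because the keyword in the signature contains the misspelling 'rawPredicition'.
mlp = MultilayerPerceptronClassifier(layers=[4, 5, 3],
                                     rawPredictionCol="rawPrediction")
{code}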



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org