[jira] [Commented] (SPARK-32092) CrossvalidatorModel does not save all submodels (it saves only 3)

2020-08-14 Thread Zirui Xu (Jira)


[ https://issues.apache.org/jira/browse/SPARK-32092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178101#comment-17178101 ]

Zirui Xu commented on SPARK-32092:
--

I think CrossValidatorModel.copy() is affected by a similar issue: it loses 
numFolds as well. I will attempt a fix.
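
A minimal check of that suspicion (a sketch only, assuming Spark 3.x, where 
CrossValidatorModel exposes getNumFolds(); cvModel is assumed to be a model 
fitted with a non-default numFolds, e.g. 4 as in the tests below):

{code:python}
# Sketch: verify that copy() preserves numFolds (assumes Spark 3.x APIs).
def check_copy_keeps_num_folds(cvModel):
    copied = cvModel.copy()
    # If copy() drops numFolds, the copy silently falls back to the
    # default of 3 and this assertion fails for a model fit with numFolds=4.
    assert copied.getNumFolds() == cvModel.getNumFolds()
{code}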

> CrossvalidatorModel does not save all submodels (it saves only 3)
> -----------------------------------------------------------------
>
> Key: SPARK-32092
> URL: https://issues.apache.org/jira/browse/SPARK-32092
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0, 2.4.5
> Environment: Ran on two systems:
>  * Local pyspark installation (Windows): spark 2.4.5
>  * Spark 2.4.0 on a cluster
>Reporter: An De Rijdt
>Priority: Major
>
> When saving a CrossValidatorModel with more than 3 subModels and loading it 
> again, a different number of subModels is returned: it seems 3 subModels are 
> returned every time.
> With fewer subModels than the default of 3 (i.e. numFolds=2), writing fails outright.
> The issue seems to be (though I am not so familiar with the Scala/Java side):
>  * the Python object is converted to a Scala/Java one
>  * on the Scala side, subModels are saved for each fold up to numFolds:
>  
> {code:scala}
> val subModelsPath = new Path(path, "subModels")
> for (splitIndex <- 0 until instance.getNumFolds) {
>   val splitPath = new Path(subModelsPath, s"fold${splitIndex.toString}")
>   for (paramIndex <- 0 until instance.getEstimatorParamMaps.length) {
>     val modelPath = new Path(splitPath, paramIndex.toString).toString
>     instance.subModels(splitIndex)(paramIndex).asInstanceOf[MLWritable].save(modelPath)
>   }
> }
> {code}
>  * numFolds is not available on the CrossValidatorModel in PySpark
>  * the default numFolds is 3, so the conversion ends up telling the writer to save only 3 subModel folds (see the sketch below).
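> For illustration, a paraphrased sketch of the PySpark 2.4 conversion path 
> (simplified from the source, not verbatim): the estimator, evaluator, param 
> maps and subModels are transferred to the Java object, but numFolds never is, 
> so the Java-side model keeps its default of 3.
> {code:python}
> # Paraphrased sketch of CrossValidatorModel._to_java() in PySpark 2.4
> # (simplified; names are pyspark internals). Note that numFolds is never
> # transferred, so the Scala writer above loops over the default of 3.
> def _to_java(self):
>     sc = SparkContext._active_spark_context
>     _java_obj = JavaParams._new_java_obj(
>         "org.apache.spark.ml.tuning.CrossValidatorModel",
>         self.uid, self.bestModel._to_java(), _py2java(sc, self.avgMetrics))
>     estimator, epms, evaluator = super(CrossValidatorModel, self)._to_java_impl()
>     _java_obj.set("evaluator", evaluator)
>     _java_obj.set("estimator", estimator)
>     _java_obj.set("estimatorParamMaps", epms)
>     if self.subModels is not None:
>         _java_obj.setSubModels([[m._to_java() for m in fold]
>                                 for fold in self.subModels])
>     # numFolds is missing here; a fix would transfer it as well
>     return _java_obj
> {code}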
> The first issue can be reproduced by the following failing test, where spark is 
> a SparkSession and tmp_path is a (temporary) directory.
> {code:python}
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, CrossValidatorModel
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.linalg import Vectors
>
>
> def test_save_load_cross_validator(spark, tmp_path):
>     temp_path = str(tmp_path)
>     dataset = spark.createDataFrame(
>         [
>             (Vectors.dense([0.0]), 0.0),
>             (Vectors.dense([0.4]), 1.0),
>             (Vectors.dense([0.5]), 0.0),
>             (Vectors.dense([0.6]), 1.0),
>             (Vectors.dense([1.0]), 1.0),
>         ]
>         * 10,
>         ["features", "label"],
>     )
>     lr = LogisticRegression()
>     grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
>     evaluator = BinaryClassificationEvaluator()
>     cv = CrossValidator(
>         estimator=lr,
>         estimatorParamMaps=grid,
>         evaluator=evaluator,
>         collectSubModels=True,
>         numFolds=4,
>     )
>     cvModel = cv.fit(dataset)
>     # test save/load of CrossValidatorModel
>     cvModelPath = temp_path + "/cvModel"
>     cvModel.write().overwrite().save(cvModelPath)
>     loadedModel = CrossValidatorModel.load(cvModelPath)
>     assert len(loadedModel.subModels) == len(cvModel.subModels)
> {code}
>  
> The second issue can be reproduced as follows (writing will fail outright, presumably because the writer loops over the default 3 folds while subModels only holds 2):
> {code:python}
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, CrossValidatorModel
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.linalg import Vectors
>
>
> def test_save_load_cross_validator(spark, tmp_path):
>     temp_path = str(tmp_path)
>     dataset = spark.createDataFrame(
>         [
>             (Vectors.dense([0.0]), 0.0),
>             (Vectors.dense([0.4]), 1.0),
>             (Vectors.dense([0.5]), 0.0),
>             (Vectors.dense([0.6]), 1.0),
>             (Vectors.dense([1.0]), 1.0),
>         ]
>         * 10,
>         ["features", "label"],
>     )
>     lr = LogisticRegression()
>     grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
>     evaluator = BinaryClassificationEvaluator()
>     cv = CrossValidator(
>         estimator=lr,
>         estimatorParamMaps=grid,
>         evaluator=evaluator,
>         collectSubModels=True,
>         numFolds=2,
>     )
>     cvModel = cv.fit(dataset)
>     # test save/load of CrossValidatorModel
>     cvModelPath = temp_path + "/cvModel"
>     cvModel.write().overwrite().save(cvModelPath)
>     loadedModel = CrossValidatorModel.load(cvModelPath)
>     assert len(loadedModel.subModels) == len(cvModel.subModels)
> {code}
>  
>  




[jira] [Commented] (SPARK-31776) Literal lit() supports lists and numpy arrays

2020-08-13 Thread Zirui Xu (Jira)


[ https://issues.apache.org/jira/browse/SPARK-31776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177281#comment-17177281 ]

Zirui Xu commented on SPARK-31776:
--

Hi [~mengxr], is this ticket suggesting adding support for Python lists and 
numpy arrays in both lit() and fillna()? I understand that fillna() does not 
support array types, so adding Python list support to fillna() might require 
adding array support on the SQL side first.
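
For reference, the workaround I am aware of today builds the literal 
element-wise, which works for ArrayType columns but not for ML Vector columns 
(a sketch only; the column name "features" and the default value are 
hypothetical):

{code:python}
# Workaround sketch: fill null ArrayType columns without list support in
# lit()/fillna(). Column "features" and default [0.0, 0.0] are made up.
from pyspark.sql import functions as F

def fill_null_array(df, col_name, default_values):
    # build an array literal element by element, since lit() rejects lists
    default = F.array(*[F.lit(v) for v in default_values])
    return df.withColumn(col_name, F.coalesce(F.col(col_name), default))

# usage: df = fill_null_array(df, "features", [0.0, 0.0])
{code}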

> Literal lit() supports lists and numpy arrays
> ---------------------------------------------
>
> Key: SPARK-31776
> URL: https://issues.apache.org/jira/browse/SPARK-31776
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> In ML workloads, it is common to replace null feature vectors with some 
> default value. However, lit() does not support Python lists or numpy arrays 
> as input, so users cannot simply use fillna() to get the job done.
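> A minimal illustration of the gap (hedged sketch; behavior as observed on 
> Spark 3.0, exact error messages omitted):
> {code:python}
> # Both of these fail today:
> from pyspark.sql import functions as F
>
> # F.lit([0.0, 0.0])                    # unsupported literal type (Python list)
> # df.fillna({"features": [0.0, 0.0]})  # fillna() rejects non-scalar values
> {code}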


