[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177073#comment-17177073 ]

Kayal commented on SPARK-32053:
-------------------------------

Code to reproduce the issue in a Jupyter notebook on Windows:

from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext("local", "First App")
sess = SparkSession(sc)

training = sess.createDataFrame([
 ("0L", "a b c d e WML", 1.0),
 ("1L", "b d", 0.0),
 ("2L", "WML f g h", 1.0),
 ("3L", "hadoop mapreduce", 0.0)], ["id", "text", "label"])

evaluation = sess.createDataFrame([
 ("4L", "a b c WML", 1.0),
 ("5L", "l m n o p", 0.0),
 ("6L", "WML g h i k", 1.0),
 ("7L", "apache hadoop zuzu", 0.0)], ["id", "text", "label"])

testing = sess.createDataFrame([
 ("4L", "a b c z WML"),
 ("5L", "l m n"),
 ("6L", "WML g h i j k"),
 ("7L", "apache hadoop")], ["id", "text"])
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.pipeline import Pipeline

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(training)
test_result = model.transform(testing)

# This save call fails on Windows:
pipeline.write().overwrite().save("tempfile")

 

The write operation fails with the error I mentioned above, and this is 
blocking our product delivery, so please consider treating this as a 
high-priority blocker. Is there a workaround? Is Spark ML supported in 
PySpark on Windows?

I also see the same error with the pipeline.save() method.
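
A possible workaround sketch, under the assumption (not confirmed by this issue) that the failure is the common Windows case where Spark's save path goes through the Hadoop FileSystem layer and cannot find winutils.exe: point HADOOP_HOME at a directory containing bin\winutils.exe *before* creating the SparkContext. The C:\hadoop path below is a hypothetical install location.

```python
import os

# Assumed location of a winutils.exe install (hypothetical path --
# adjust to wherever winutils.exe actually lives on your machine).
hadoop_home = r"C:\hadoop"

# Spark reads these at JVM startup, so set them before SparkContext exists.
os.environ["HADOOP_HOME"] = hadoop_home
os.environ["PATH"] = os.environ.get("PATH", "") + os.pathsep + hadoop_home + r"\bin"
```

After setting these, restart the notebook kernel and re-run the reproduction script so the new environment is picked up by the JVM.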

 

 

> pyspark save of serialized model is failing for windows.
> --------------------------------------------------------
>
>                 Key: SPARK-32053
>                 URL: https://issues.apache.org/jira/browse/SPARK-32053
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0
>            Reporter: Kayal
>            Priority: Major
>         Attachments: image-2020-06-22-18-19-32-236.png
>
>
> Hi,
> We are using Spark functionality to save the serialized model to disk. On 
> the Windows platform we are seeing the save of the serialized model fail 
> with the error: o288.save() failed.
>
> !image-2020-06-22-18-19-32-236.png!
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
