[jira] [Commented] (SPARK-33373) A serialized ImputerModel fails to be serialized again

2020-11-07 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227867#comment-17227867
 ] 

L. C. Hsieh commented on SPARK-33373:
-

Overall, a write operation cannot overwrite an existing path if it still needs to 
read the original data. We prevent such operations, except for dynamic partition 
overwrite in SQL. I have not checked the query path of the operation in the 
example, but I suspect this is the root cause. I will try to look into it.
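The hazard described above can be sketched outside of Spark: if an overwrite truncates a path before a deferred read of that same path has run, the original data is lost. This is a minimal plain-Python simulation of the suspected failure mode, not Spark's actual code path; the file name and helper functions are hypothetical.

```python
import os
import tempfile

# Hypothetical stand-in for model data persisted at a save path.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "model_data.txt")
with open(path, "w") as f:
    f.write("surrogate values")

def lazy_read():
    # Deferred read, analogous to a lazily evaluated DataFrame scan.
    with open(path) as f:
        return f.read()

def overwrite_save(read_fn):
    # "Overwrite" truncates the destination first, then materializes the
    # data to write -- but the data source is the file just truncated.
    open(path, "w").close()      # destroys the original contents
    data = read_fn()             # reads back an empty file
    with open(path, "w") as f:
        f.write(data)

overwrite_save(lazy_read)
with open(path) as f:
    recovered = f.read()
print(recovered == "")  # the original data is gone
```

In Spark the symptom is typically harsher than silent data loss: the executors fail mid-scan because the files they planned to read no longer exist.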

> A serialized ImputerModel fails to be serialized again
> --
>
> Key: SPARK-33373
> URL: https://issues.apache.org/jira/browse/SPARK-33373
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.3
> Environment: * Python 3.7.3
>  * (Py)Spark 2.4.3
>Reporter: Andre Boechat
>Priority: Major
>
> After loading an {{ImputerModel}} from disk, the instance fails to save 
> itself again.
> h2. Code Sample
> {code:python}
> from pyspark.ml.feature import Imputer, ImputerModel
> df = sparksession.createDataFrame(
> [
> (2.0, 3.0),
> (2.0, 1.0),
> (2.0, None),
> (None, 0.0)
> ],
> ["x200", "x3"]
> ).repartition(1)
> i = Imputer(inputCols=["x200", "x3"], outputCols=["x200_i", "x3_i"]).fit(
> df
> )
> tdf = i.transform(df)
> fpath = "/tmp/bucketpath"
> i.write().overwrite().save(fpath)
> li = ImputerModel.load(fpath)
> t2df = li.transform(df)
> assert all(
> r1.asDict() == r2.asDict() for r1, r2 in zip(
> tdf.collect(), t2df.collect()
> )
> )
> # This line makes Spark crash.
> li.write().overwrite().save(fpath)
> {code}
> h2. Stacktrace
> {code:python}
> --> 480 li.write().overwrite().save(fpath)
>
> /usr/spark-2.4.3/python/pyspark/ml/util.py in save(self, path)
>     181         if not isinstance(path, basestring):
>     182             raise TypeError("path should be a basestring, got type %s" % type(path))
> --> 183         self._jwrite.save(path)
>     184
>     185     def overwrite(self):
>
> /usr/local/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
>    1284         answer = self.gateway_client.send_command(command)
>    1285         return_value = get_return_value(
> -> 1286             answer,

[jira] [Commented] (SPARK-33382) Unify v1 and v2 SHOW TABLES tests

2020-11-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227854#comment-17227854
 ] 

Apache Spark commented on SPARK-33382:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30287

> Unify v1 and v2 SHOW TABLES tests
> -
>
> Key: SPARK-33382
> URL: https://issues.apache.org/jira/browse/SPARK-33382
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Gather the common tests for the DSv1 and DSv2 SHOW TABLES command into a common 
> trait. Mix this trait into datasource-specific test suites.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33382) Unify v1 and v2 SHOW TABLES tests

2020-11-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33382:


Assignee: (was: Apache Spark)

> Unify v1 and v2 SHOW TABLES tests
> -
>
> Key: SPARK-33382
> URL: https://issues.apache.org/jira/browse/SPARK-33382
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Gather the common tests for the DSv1 and DSv2 SHOW TABLES command into a common 
> trait. Mix this trait into datasource-specific test suites.






[jira] [Commented] (SPARK-33382) Unify v1 and v2 SHOW TABLES tests

2020-11-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227853#comment-17227853
 ] 

Apache Spark commented on SPARK-33382:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30287

> Unify v1 and v2 SHOW TABLES tests
> -
>
> Key: SPARK-33382
> URL: https://issues.apache.org/jira/browse/SPARK-33382
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Gather the common tests for the DSv1 and DSv2 SHOW TABLES command into a common 
> trait. Mix this trait into datasource-specific test suites.






[jira] [Assigned] (SPARK-33382) Unify v1 and v2 SHOW TABLES tests

2020-11-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33382:


Assignee: Apache Spark

> Unify v1 and v2 SHOW TABLES tests
> -
>
> Key: SPARK-33382
> URL: https://issues.apache.org/jira/browse/SPARK-33382
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Gather the common tests for the DSv1 and DSv2 SHOW TABLES command into a common 
> trait. Mix this trait into datasource-specific test suites.






[jira] [Created] (SPARK-33382) Unify v1 and v2 SHOW TABLES tests

2020-11-07 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33382:
--

 Summary: Unify v1 and v2 SHOW TABLES tests
 Key: SPARK-33382
 URL: https://issues.apache.org/jira/browse/SPARK-33382
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


Gather the common tests for the DSv1 and DSv2 SHOW TABLES command into a common 
trait. Mix this trait into datasource-specific test suites.






[jira] [Created] (SPARK-33381) Unify dsv1 and dsv2 command tests

2020-11-07 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33381:
--

 Summary: Unify dsv1 and dsv2 command tests
 Key: SPARK-33381
 URL: https://issues.apache.org/jira/browse/SPARK-33381
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


Create unified test suites for DSv1 and DSv2 commands such as CREATE TABLE and 
SHOW TABLES. Put datasource-specific tests into separate test suites.






[jira] [Updated] (SPARK-33381) Unify DSv1 and DSv2 command tests

2020-11-07 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-33381:
---
Summary: Unify DSv1 and DSv2 command tests  (was: Unify dsv1 and dsv2 
command tests)

> Unify DSv1 and DSv2 command tests
> -
>
> Key: SPARK-33381
> URL: https://issues.apache.org/jira/browse/SPARK-33381
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Create unified test suites for DSv1 and DSv2 commands such as CREATE TABLE and 
> SHOW TABLES. Put datasource-specific tests into separate test suites.






[jira] [Comment Edited] (SPARK-33373) A serialized ImputerModel fails to be serialized again

2020-11-07 Thread Andre Boechat (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227829#comment-17227829
 ] 

Andre Boechat edited comment on SPARK-33373 at 11/7/20, 4:01 PM:
-

[~viirya], yes and no. Here are a few scenarios I've tested:

{code:python}
# This line works fine.
li.write().overwrite().save(fpath + "_new")

# But this also works fine ("i" is the original Imputer object).
i.write().overwrite().save(fpath)

# As it's done in my original example, this fails to execute.
li.write().overwrite().save(fpath)
{code}

Since I'm trying to "overwrite" possibly existing files, shouldn't I expect my 
example to work? Besides, I have pipelines with other kinds of feature 
transformers (like "StringIndexer") and they don't have any problem overwriting 
existing files.

The code below works fine:

{code:python}
s = StringIndexer(inputCol="x200", outputCol="x200_out").fit(df)
sipath = "/tmp/stringindexer"
s.write().overwrite().save(sipath)

ls = StringIndexerModel.load(sipath)
ls.write().overwrite().save(sipath)
{code}



was (Author: boechat107):
[~viirya], yes and no. Here are a few scenarios I've tested:

{code:python}
# This line works fine.
li.write().overwrite().save(fpath + "_new")

# But this also works fine ("i" is the original Imputer object).
i.write().overwrite().save(fpath)
{code}

Since I'm trying to "overwrite" possibly existing files, shouldn't I expect my 
example to work? Besides, I have pipelines with other kinds of feature 
transformers (like "StringIndexer") and they don't have any problem overwriting 
existing files.

The code below works fine:

{code:python}
s = StringIndexer(inputCol="x200", outputCol="x200_out").fit(df)
sipath = "/tmp/stringindexer"
s.write().overwrite().save(sipath)

ls = StringIndexerModel.load(sipath)
ls.write().overwrite().save(sipath)
{code}


> A serialized ImputerModel fails to be serialized again
> --
>
> Key: SPARK-33373
> URL: https://issues.apache.org/jira/browse/SPARK-33373
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.3
> Environment: * Python 3.7.3
>  * (Py)Spark 2.4.3
>Reporter: Andre Boechat
>Priority: Major
>
> After loading an {{ImputerModel}} from disk, the instance fails to save 
> itself again.
> h2. Code Sample
> {code:python}
> from pyspark.ml.feature import Imputer, ImputerModel
> df = sparksession.createDataFrame(
> [
> (2.0, 3.0),
> (2.0, 1.0),
> (2.0, None),
> (None, 0.0)
> ],
> ["x200", "x3"]
> ).repartition(1)
> i = Imputer(inputCols=["x200", "x3"], outputCols=["x200_i", "x3_i"]).fit(
> df
> )
> tdf = i.transform(df)
> fpath = "/tmp/bucketpath"
> i.write().overwrite().save(fpath)
> li = ImputerModel.load(fpath)
> t2df = li.transform(df)
> assert all(
> r1.asDict() == r2.asDict() for r1, r2 in zip(
> tdf.collect(), t2df.collect()
> )
> )
> # This line makes Spark crash.
> li.write().overwrite().save(fpath)
> {code}
> h2. Stacktrace
> {code:python}
> --> 480 li.write().overwrite().save(fpath)
>
> /usr/spark-2.4.3/python/pyspark/ml/util.py in save(self, path)
>     181         if not isinstance(path, basestring):
>     182             raise TypeError("path should be a basestring, got type %s" % type(path))
> --> 183         self._jwrite.save(path)
>     184

[jira] [Commented] (SPARK-33373) A serialized ImputerModel fails to be serialized again

2020-11-07 Thread Andre Boechat (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227829#comment-17227829
 ] 

Andre Boechat commented on SPARK-33373:
---

[~viirya], yes and no. Here are a few scenarios I've tested:

{code:python}
# This line works fine.
li.write().overwrite().save(fpath + "_new")

# But this also works fine ("i" is the original Imputer object).
i.write().overwrite().save(fpath)
{code}

Since I'm trying to "overwrite" possibly existing files, shouldn't I expect my 
example to work? Besides, I have pipelines with other kinds of feature 
transformers (like "StringIndexer") and they don't have any problem overwriting 
existing files.

The code below works fine:

{code:python}
s = StringIndexer(inputCol="x200", outputCol="x200_out").fit(df)
sipath = "/tmp/stringindexer"
s.write().overwrite().save(sipath)

ls = StringIndexerModel.load(sipath)
ls.write().overwrite().save(sipath)
{code}
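The contrast between the StringIndexerModel and ImputerModel cases suggests a general workaround while the bug stands: never truncate a path that the writer may still need to read; write the new copy to a fresh sibling directory and swap it into place afterwards. This is a hedged sketch in plain Python — the helper name `save_then_swap` and the file layout are illustrative, not a Spark API.

```python
import os
import shutil
import tempfile

def save_then_swap(path, write_fn):
    """Write a new copy next to `path`, then swap it into place, so the
    original stays readable until the replacement is fully written."""
    parent = os.path.dirname(os.path.abspath(path))
    tmp = tempfile.mkdtemp(dir=parent)
    write_fn(tmp)                      # writer may freely read `path` here
    backup = path + ".old"
    if os.path.exists(path):
        os.rename(path, backup)
    os.rename(tmp, path)               # atomic swap on the same filesystem
    shutil.rmtree(backup, ignore_errors=True)

# Tiny demonstration with plain files standing in for model data.
base = tempfile.mkdtemp()
model_path = os.path.join(base, "model")
os.mkdir(model_path)
with open(os.path.join(model_path, "data"), "w") as f:
    f.write("v1")

def writer(dest):
    # Reads the *old* copy while producing the new one -- exactly the
    # situation that breaks a truncate-first overwrite.
    with open(os.path.join(model_path, "data")) as f:
        old = f.read()
    with open(os.path.join(dest, "data"), "w") as f:
        f.write(old + "+v2")

save_then_swap(model_path, writer)
with open(os.path.join(model_path, "data")) as f:
    result = f.read()
print(result)  # v1+v2
```

Writing to a distinct path (as in the `fpath + "_new"` scenario above) is the same idea done manually; the swap step just keeps the final location stable.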


> A serialized ImputerModel fails to be serialized again
> --
>
> Key: SPARK-33373
> URL: https://issues.apache.org/jira/browse/SPARK-33373
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.3
> Environment: * Python 3.7.3
>  * (Py)Spark 2.4.3
>Reporter: Andre Boechat
>Priority: Major
>
> After loading an {{ImputerModel}} from disk, the instance fails to save 
> itself again.
> h2. Code Sample
> {code:python}
> from pyspark.ml.feature import Imputer, ImputerModel
> df = sparksession.createDataFrame(
> [
> (2.0, 3.0),
> (2.0, 1.0),
> (2.0, None),
> (None, 0.0)
> ],
> ["x200", "x3"]
> ).repartition(1)
> i = Imputer(inputCols=["x200", "x3"], outputCols=["x200_i", "x3_i"]).fit(
> df
> )
> tdf = i.transform(df)
> fpath = "/tmp/bucketpath"
> i.write().overwrite().save(fpath)
> li = ImputerModel.load(fpath)
> t2df = li.transform(df)
> assert all(
> r1.asDict() == r2.asDict() for r1, r2 in zip(
> tdf.collect(), t2df.collect()
> )
> )
> # This line makes Spark crash.
> li.write().overwrite().save(fpath)
> {code}
> h2. Stacktrace
> {code:python}
> --> 480 li.write().overwrite().save(fpath)
>
> /usr/spark-2.4.3/python/pyspark/ml/util.py in save(self, path)
>     181         if not isinstance(path, basestring):
>     182             raise TypeError("path should be a basestring, got type %s" % type(path))
> --> 183         self._jwrite.save(path)
>     184
>     185     def overwrite(self):
>
> /usr/local/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
>    1284         answer = self.gateway_client.send_command(command)