[jira] [Commented] (SPARK-33373) A serialized ImputerModel fails to be serialized again
[ https://issues.apache.org/jira/browse/SPARK-33373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227867#comment-17227867 ]

L. C. Hsieh commented on SPARK-33373:
-------------------------------------

Overall, a write operation cannot overwrite an existing path if it still needs to read the original data. We prevent such operations, except for dynamic overwrite in SQL. I have not checked the query path of the operation in the example, but I suspect this is the root cause. I will try to look at it.

> A serialized ImputerModel fails to be serialized again
> ------------------------------------------------------
>
>                 Key: SPARK-33373
>                 URL: https://issues.apache.org/jira/browse/SPARK-33373
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.4.3
>         Environment: * Python 3.7.3
> * (Py)Spark 2.4.3
>            Reporter: Andre Boechat
>            Priority: Major
>
> After loading an {{ImputerModel}} from disk, the instance fails to save itself again.
> h2. Code Sample
> {code:python}
> from pyspark.ml.feature import Imputer, ImputerModel
>
> df = sparksession.createDataFrame(
>     [
>         (2.0, 3.0),
>         (2.0, 1.0),
>         (2.0, None),
>         (None, 0.0)
>     ],
>     ["x200", "x3"]
> ).repartition(1)
> i = Imputer(inputCols=["x200", "x3"], outputCols=["x200_i", "x3_i"]).fit(df)
> tdf = i.transform(df)
> fpath = "/tmp/bucketpath"
> i.write().overwrite().save(fpath)
> li = ImputerModel.load(fpath)
> t2df = li.transform(df)
> assert all(
>     r1.asDict() == r2.asDict() for r1, r2 in zip(tdf.collect(), t2df.collect())
> )
> # This line makes Spark crash.
> li.write().overwrite().save(fpath)
> {code}
> h2. Stacktrace
> {code:python}
> --> 480 li.write().overwrite().save(fpath)
>
> /usr/spark-2.4.3/python/pyspark/ml/util.py in save(self, path)
>     181         if not isinstance(path, basestring):
>     182             raise TypeError("path should be a basestring, got type %s" % type(path))
> --> 183         self._jwrite.save(path)
>     184
>     185     def overwrite(self):
>
> /usr/local/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
>    1284         answer = self.gateway_client.send_command(command)
>    1285         return_value = get_return_value(
> -> 1286             answer,
> {code}
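A common workaround for this class of failure (a sketch, not something proposed in the ticket): stage the save into a fresh temporary directory and only then swap it over the target, so the write never reads from the path it is overwriting. The helper name `overwrite_via_temp` and the use of plain filesystem moves are illustrative assumptions; on a real cluster the model path may be on HDFS/S3, where this local-filesystem approach does not apply as-is.

```python
import os
import shutil
import tempfile

def overwrite_via_temp(save_fn, target_path):
    """Write via save_fn into a fresh temporary directory, then swap it
    into target_path, so nothing ever reads and writes the same path at
    once. save_fn is assumed to take a destination path, e.g.
    lambda p: li.write().save(p)."""
    tmp_dir = tempfile.mkdtemp(prefix="model_save_")
    staged = os.path.join(tmp_dir, "staged")
    try:
        save_fn(staged)
        if os.path.exists(target_path):
            # Drop the old copy only after the new one is fully written.
            shutil.rmtree(target_path)
        shutil.move(staged, target_path)
    finally:
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

Because the staged write targets a path that does not yet exist, it never triggers the "cannot overwrite a path that is being read" check.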
[jira] [Commented] (SPARK-33382) Unify v1 and v2 SHOW TABLES tests
[ https://issues.apache.org/jira/browse/SPARK-33382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227854#comment-17227854 ]

Apache Spark commented on SPARK-33382:
--------------------------------------

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30287

> Unify v1 and v2 SHOW TABLES tests
> ---------------------------------
>
>                 Key: SPARK-33382
>                 URL: https://issues.apache.org/jira/browse/SPARK-33382
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Maxim Gekk
>            Priority: Major
>
> Gather the common tests for the DSv1 and DSv2 SHOW TABLES command into a common trait, and mix this trait into the datasource-specific test suites.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33382) Unify v1 and v2 SHOW TABLES tests
[ https://issues.apache.org/jira/browse/SPARK-33382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33382:
------------------------------------

    Assignee:     (was: Apache Spark)

> Unify v1 and v2 SHOW TABLES tests
> ---------------------------------
>
>                 Key: SPARK-33382
>                 URL: https://issues.apache.org/jira/browse/SPARK-33382
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Maxim Gekk
>            Priority: Major
>
> Gather the common tests for the DSv1 and DSv2 SHOW TABLES command into a common trait, and mix this trait into the datasource-specific test suites.
[jira] [Commented] (SPARK-33382) Unify v1 and v2 SHOW TABLES tests
[ https://issues.apache.org/jira/browse/SPARK-33382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227853#comment-17227853 ]

Apache Spark commented on SPARK-33382:
--------------------------------------

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30287

> Unify v1 and v2 SHOW TABLES tests
> ---------------------------------
>
>                 Key: SPARK-33382
>                 URL: https://issues.apache.org/jira/browse/SPARK-33382
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Maxim Gekk
>            Priority: Major
>
> Gather the common tests for the DSv1 and DSv2 SHOW TABLES command into a common trait, and mix this trait into the datasource-specific test suites.
[jira] [Assigned] (SPARK-33382) Unify v1 and v2 SHOW TABLES tests
[ https://issues.apache.org/jira/browse/SPARK-33382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33382:
------------------------------------

    Assignee: Apache Spark

> Unify v1 and v2 SHOW TABLES tests
> ---------------------------------
>
>                 Key: SPARK-33382
>                 URL: https://issues.apache.org/jira/browse/SPARK-33382
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Maxim Gekk
>            Assignee: Apache Spark
>            Priority: Major
>
> Gather the common tests for the DSv1 and DSv2 SHOW TABLES command into a common trait, and mix this trait into the datasource-specific test suites.
[jira] [Created] (SPARK-33382) Unify v1 and v2 SHOW TABLES tests
Maxim Gekk created SPARK-33382:
-----------------------------------

             Summary: Unify v1 and v2 SHOW TABLES tests
                 Key: SPARK-33382
                 URL: https://issues.apache.org/jira/browse/SPARK-33382
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.1.0
            Reporter: Maxim Gekk


Gather the common tests for the DSv1 and DSv2 SHOW TABLES command into a common trait, and mix this trait into the datasource-specific test suites.
[jira] [Created] (SPARK-33381) Unify dsv1 and dsv2 command tests
Maxim Gekk created SPARK-33381:
-----------------------------------

             Summary: Unify dsv1 and dsv2 command tests
                 Key: SPARK-33381
                 URL: https://issues.apache.org/jira/browse/SPARK-33381
             Project: Spark
          Issue Type: Test
          Components: SQL
    Affects Versions: 3.1.0
            Reporter: Maxim Gekk


Create unified test suites for DSv1 and DSv2 commands such as CREATE TABLE and SHOW TABLES. Put the datasource-specific tests into separate test suites.
[jira] [Updated] (SPARK-33381) Unify DSv1 and DSv2 command tests
[ https://issues.apache.org/jira/browse/SPARK-33381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maxim Gekk updated SPARK-33381:
-------------------------------
    Summary: Unify DSv1 and DSv2 command tests  (was: Unify dsv1 and dsv2 command tests)

> Unify DSv1 and DSv2 command tests
> ---------------------------------
>
>                 Key: SPARK-33381
>                 URL: https://issues.apache.org/jira/browse/SPARK-33381
>             Project: Spark
>          Issue Type: Test
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Maxim Gekk
>            Priority: Major
>
> Create unified test suites for DSv1 and DSv2 commands such as CREATE TABLE and SHOW TABLES. Put the datasource-specific tests into separate test suites.
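The actual unification happens in Spark's Scala test suites, but the shared-trait pattern the ticket describes can be sketched in Python with a unittest mixin. All class names and the `list_tables` hook below are invented for illustration; they are not Spark APIs.

```python
import unittest

class ShowTablesSuiteBase:
    """Common SHOW TABLES tests; each datasource-specific suite mixes this
    in and supplies its own list_tables() implementation."""

    def list_tables(self):
        raise NotImplementedError

    def test_contains_created_table(self):
        # Shared assertion that runs against every datasource.
        self.assertIn("t1", self.list_tables())

class V1ShowTablesSuite(ShowTablesSuiteBase, unittest.TestCase):
    def list_tables(self):
        return ["t1"]          # stand-in for a DSv1 catalog lookup

class V2ShowTablesSuite(ShowTablesSuiteBase, unittest.TestCase):
    def list_tables(self):
        return ["t1", "t2"]    # stand-in for a DSv2 catalog lookup
```

Keeping the base class outside `unittest.TestCase` ensures the shared tests only run through the concrete suites, mirroring the trait-mixin approach on the Scala side.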
[jira] [Comment Edited] (SPARK-33373) A serialized ImputerModel fails to be serialized again
[ https://issues.apache.org/jira/browse/SPARK-33373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227829#comment-17227829 ]

Andre Boechat edited comment on SPARK-33373 at 11/7/20, 4:01 PM:
-----------------------------------------------------------------

[~viirya], yes and no. Here are a few scenarios I've tested:

{code:python}
# This line works fine.
li.write().overwrite().save(fpath + "_new")

# This also works fine ("i" is the original Imputer object).
i.write().overwrite().save(fpath)

# As in my original example, this fails to execute.
li.write().overwrite().save(fpath)
{code}

Since I'm trying to "overwrite" possibly existing files, shouldn't I expect my example to work? Besides, I have pipelines with other kinds of feature transformers (like "StringIndexer") and they have no problem overwriting existing files. The code below works fine:

{code:python}
s = StringIndexer(inputCol="x200", outputCol="x200_out").fit(df)
sipath = "/tmp/stringindexer"
s.write().overwrite().save(sipath)
ls = StringIndexerModel.load(sipath)
ls.write().overwrite().save(sipath)
{code}


was (Author: boechat107):
[~viirya], yes and no. Here are a few scenarios I've tested:

{code:python}
# This line works fine.
li.write().overwrite().save(fpath + "_new")

# This also works fine ("i" is the original Imputer object).
i.write().overwrite().save(fpath)
{code}

Since I'm trying to "overwrite" possibly existing files, shouldn't I expect my example to work? Besides, I have pipelines with other kinds of feature transformers (like "StringIndexer") and they have no problem overwriting existing files. The code below works fine:

{code:python}
s = StringIndexer(inputCol="x200", outputCol="x200_out").fit(df)
sipath = "/tmp/stringindexer"
s.write().overwrite().save(sipath)
ls = StringIndexerModel.load(sipath)
ls.write().overwrite().save(sipath)
{code}

> A serialized ImputerModel fails to be serialized again
> ------------------------------------------------------
>
>                 Key: SPARK-33373
>                 URL: https://issues.apache.org/jira/browse/SPARK-33373
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.4.3
>         Environment: * Python 3.7.3
> * (Py)Spark 2.4.3
>            Reporter: Andre Boechat
>            Priority: Major
>
> After loading an {{ImputerModel}} from disk, the instance fails to save itself again.
> h2. Code Sample
> {code:python}
> from pyspark.ml.feature import Imputer, ImputerModel
>
> df = sparksession.createDataFrame(
>     [
>         (2.0, 3.0),
>         (2.0, 1.0),
>         (2.0, None),
>         (None, 0.0)
>     ],
>     ["x200", "x3"]
> ).repartition(1)
> i = Imputer(inputCols=["x200", "x3"], outputCols=["x200_i", "x3_i"]).fit(df)
> tdf = i.transform(df)
> fpath = "/tmp/bucketpath"
> i.write().overwrite().save(fpath)
> li = ImputerModel.load(fpath)
> t2df = li.transform(df)
> assert all(
>     r1.asDict() == r2.asDict() for r1, r2 in zip(tdf.collect(), t2df.collect())
> )
> # This line makes Spark crash.
> li.write().overwrite().save(fpath)
> {code}
> h2. Stacktrace
> {code:python}
> --> 480 li.write().overwrite().save(fpath)
>
> /usr/spark-2.4.3/python/pyspark/ml/util.py in save(self, path)
>     181         if not isinstance(path, basestring):
>     182             raise TypeError("path should be a basestring, got type %s" % type(path))
> --> 183         self._jwrite.save(path)
>     184
> {code}
[jira] [Commented] (SPARK-33373) A serialized ImputerModel fails to be serialized again
[ https://issues.apache.org/jira/browse/SPARK-33373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227829#comment-17227829 ]

Andre Boechat commented on SPARK-33373:
---------------------------------------

[~viirya], yes and no. Here are a few scenarios I've tested:

{code:python}
# This line works fine.
li.write().overwrite().save(fpath + "_new")

# This also works fine ("i" is the original Imputer object).
i.write().overwrite().save(fpath)
{code}

Since I'm trying to "overwrite" possibly existing files, shouldn't I expect my example to work? Besides, I have pipelines with other kinds of feature transformers (like "StringIndexer") and they have no problem overwriting existing files. The code below works fine:

{code:python}
s = StringIndexer(inputCol="x200", outputCol="x200_out").fit(df)
sipath = "/tmp/stringindexer"
s.write().overwrite().save(sipath)
ls = StringIndexerModel.load(sipath)
ls.write().overwrite().save(sipath)
{code}

> A serialized ImputerModel fails to be serialized again
> ------------------------------------------------------
>
>                 Key: SPARK-33373
>                 URL: https://issues.apache.org/jira/browse/SPARK-33373
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.4.3
>         Environment: * Python 3.7.3
> * (Py)Spark 2.4.3
>            Reporter: Andre Boechat
>            Priority: Major
>
> After loading an {{ImputerModel}} from disk, the instance fails to save itself again.
> h2. Code Sample
> {code:python}
> from pyspark.ml.feature import Imputer, ImputerModel
>
> df = sparksession.createDataFrame(
>     [
>         (2.0, 3.0),
>         (2.0, 1.0),
>         (2.0, None),
>         (None, 0.0)
>     ],
>     ["x200", "x3"]
> ).repartition(1)
> i = Imputer(inputCols=["x200", "x3"], outputCols=["x200_i", "x3_i"]).fit(df)
> tdf = i.transform(df)
> fpath = "/tmp/bucketpath"
> i.write().overwrite().save(fpath)
> li = ImputerModel.load(fpath)
> t2df = li.transform(df)
> assert all(
>     r1.asDict() == r2.asDict() for r1, r2 in zip(tdf.collect(), t2df.collect())
> )
> # This line makes Spark crash.
> li.write().overwrite().save(fpath)
> {code}
> h2. Stacktrace
> {code:python}
> --> 480 li.write().overwrite().save(fpath)
>
> /usr/spark-2.4.3/python/pyspark/ml/util.py in save(self, path)
>     181         if not isinstance(path, basestring):
>     182             raise TypeError("path should be a basestring, got type %s" % type(path))
> --> 183         self._jwrite.save(path)
>     184
>     185     def overwrite(self):
>
> /usr/local/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
>    1284         answer = self.gateway_client.send_command(command)
>    1285         return_value = get_return_value(
> {code}