[jira] [Commented] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself
[ https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167276#comment-16167276 ]

Xiayun Sun commented on SPARK-21994:
------------------------------------

I'm unable to reproduce this for latest master build (commit a28728a, version 2.3.0-SNAPSHOT)

{{
scala> spark.sql("create database test")
res0: org.apache.spark.sql.DataFrame = []

scala> val df = spark.sql("show databases")
df: org.apache.spark.sql.DataFrame = [databaseName: string]

scala> df.show()
+------------+
|databaseName|
+------------+
|     default|
|        test|
+------------+

scala> df.write.format("parquet").saveAsTable("test.spark22_test")

scala> spark.sql("select * from test.spark22_test").show()
+------------+
|databaseName|
+------------+
|     default|
|        test|
+------------+
}}

> Spark 2.2 can not read Parquet table created by itself
> ------------------------------------------------------
>
>                 Key: SPARK-21994
>                 URL: https://issues.apache.org/jira/browse/SPARK-21994
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>         Environment: Spark 2.2 on Cloudera CDH 5.10.1, Hive 1.1
>            Reporter: Jurgis Pods
>
> This seems to be a new bug introduced in Spark 2.2, since it did not occur
> under Spark 2.1.
> When writing a dataframe to a table in Parquet format, Spark SQL does not
> write the 'path' of the table to the Hive metastore, unlike in previous
> versions.
> As a consequence, Spark 2.2 is not able to read the table it just created. It
> just outputs the table header without any row content.
> A parallel installation of Spark 1.6 at least produces an appropriate error
> trace:
> {code:java}
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Version information not found
> in metastore. hive.metastore.schema.verification is not enabled so recording
> the schema version 1.1.0
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Failed to get database default,
> returning NoSuchObjectException
> org.spark-project.guava.util.concurrent.UncheckedExecutionException:
> java.util.NoSuchElementException: key not found: path
> [...]
> {code}
> h3. Steps to reproduce:
> Run the following in spark2-shell:
> {code:java}
> scala> val df = spark.sql("show databases")
> scala> df.show()
> +------------+
> |databaseName|
> +------------+
> |       mydb1|
> |       mydb2|
> |     default|
> |        test|
> +------------+
> scala> df.write.format("parquet").saveAsTable("test.spark22_test")
> scala> spark.sql("select * from test.spark22_test").show()
> +------------+
> |databaseName|
> +------------+
> +------------+
> {code}
> When manually setting the path, it works:
> {code:java}
> scala> df.write.option("path",
> "/hadoop/eco/hive/warehouse/test.db/spark22_parquet_with_path").format("parquet").saveAsTable("test.spark22_parquet_with_path")
> scala> spark.sql("select * from test.spark22_parquet_with_path").show()
> +------------+
> |databaseName|
> +------------+
> |       mydb1|
> |       mydb2|
> |     default|
> |        test|
> +------------+
> {code}
> It is kind of a disaster that we are not able to read tables created by the
> very same Spark version and have to manually specify the path as an explicit
> option.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself
[ https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167280#comment-16167280 ]

Jia-Xuan Liu commented on SPARK-21994:
--------------------------------------

I also can't reproduce this in Spark 2.2 release.

{code:java}
Spark context available as 'sc' (master = local[*], app id = local-1505446512312).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = spark.sql("show databases")
df: org.apache.spark.sql.DataFrame = [databaseName: string]

scala> df.show()
+------------+
|databaseName|
+------------+
|     default|
|        test|
+------------+

scala> df.write.format("parquet").saveAsTable("test.spark22_test_2")

scala> spark.sql("select * from test.spark22_test_2").show()
+------------+
|databaseName|
+------------+
|     default|
|        test|
+------------+
{code}
[jira] [Commented] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself
[ https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167407#comment-16167407 ]

Jurgis Pods commented on SPARK-21994:
-------------------------------------

Thank you for testing. Which version of Hive are you using? It might be an incompatibility between Spark 2.2 and Hive 1.1 (or other components of Cloudera CDH 5.10.1).
I will upgrade to the latest CDH 5.12 and report back if the problem persists.
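One way to narrow this down is to look at what the metastore actually recorded for the table. The following is only a minimal sketch, not part of the original report: MetastoreCheck and pathEntries are hypothetical names, and the helper assumes the output of DESCRIBE FORMATTED has already been collected into (col_name, data_type) pairs. It keeps just the rows that mention a location or a path.

```scala
object MetastoreCheck {
  // Keep only the DESCRIBE FORMATTED rows that mention a location or a path.
  // Each row is a (col_name, data_type) pair; null cells are treated as empty.
  def pathEntries(rows: Seq[(String, String)]): Seq[(String, String)] =
    rows
      .map { case (k, v) =>
        (Option(k).getOrElse("").trim, Option(v).getOrElse("").trim)
      }
      .filter { case (k, v) =>
        val text = (k + " " + v).toLowerCase
        text.contains("location") || text.contains("path")
      }
}

// In spark-shell (assumes the table from the report exists):
//   val rows = spark.sql("DESCRIBE FORMATTED test.spark22_test")
//     .collect().map(r => (r.getString(0), r.getString(1))).toSeq
//   MetastoreCheck.pathEntries(rows).foreach(println)
//   // Hive metastore client version Spark is configured for, if set:
//   println(spark.conf.getOption("spark.sql.hive.metastore.version"))
```

If nothing is printed for a table that was just written, that would be consistent with the missing 'path' entry described in the report, though the exact rows DESCRIBE FORMATTED returns can differ between Spark and Hive versions.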
[jira] [Commented] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself
[ https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167788#comment-16167788 ]

Jurgis Pods commented on SPARK-21994:
-------------------------------------

I have updated to CDH 5.12.1 and the problem persists.
There is an existing topic on the Cloudera forums with exactly this problem: http://community.cloudera.com/t5/forums/replypage/board-id/Spark/message-id/2867

> Spark 2.2 can not read Parquet table created by itself
> ------------------------------------------------------
>
>                 Key: SPARK-21994
>                 URL: https://issues.apache.org/jira/browse/SPARK-21994
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>         Environment: Spark 2.2 on Cloudera CDH 5.10.1, Hive 1.1
>            Reporter: Jurgis Pods
>
> This seems to be a new bug introduced in Spark 2.2, since it did not occur
> under Spark 2.1.
> When writing a dataframe to a table in Parquet format, Spark SQL does not
> write the 'path' of the table to the Hive metastore, unlike in previous
> versions.
> As a consequence, Spark 2.2 is not able to read the table it just created. It
> just outputs the table header without any row content.
> A parallel installation of Spark 1.6 at least produces an appropriate error
> trace:
> {code:java}
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Version information not found
> in metastore. hive.metastore.schema.verification is not enabled so recording
> the schema version 1.1.0
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Failed to get database default,
> returning NoSuchObjectException
> org.spark-project.guava.util.concurrent.UncheckedExecutionException:
> java.util.NoSuchElementException: key not found: path
> [...]
> {code}
> h3. Steps to reproduce:
> Run the following in spark2-shell:
> {code:java}
> scala> val df = spark.sql("show databases")
> scala> df.show()
> +------------+
> |databaseName|
> +------------+
> |       mydb1|
> |       mydb2|
> |     default|
> |        test|
> +------------+
> scala> df.write.format("parquet").saveAsTable("test.spark22_test")
> scala> spark.sql("select * from test.spark22_test").show()
> +------------+
> |databaseName|
> +------------+
> +------------+
> {code}
> When manually setting the path (causing the data to be saved as external
> table), it works:
> {code:java}
> scala> df.write.option("path",
> "/hadoop/eco/hive/warehouse/test.db/spark22_parquet_with_path").format("parquet").saveAsTable("test.spark22_parquet_with_path")
> scala> spark.sql("select * from test.spark22_parquet_with_path").show()
> +------------+
> |databaseName|
> +------------+
> |       mydb1|
> |       mydb2|
> |     default|
> |        test|
> +------------+
> {code}
> A second workaround is to update the metadata of the managed table created by
> Spark 2.2:
> {code}
> spark.sql("alter table test.spark22_test set SERDEPROPERTIES
> ('path'='hdfs://my-cluster-name:8020/hadoop/eco/hive/warehouse/test.db/spark22_test')")
> spark.catalog.refreshTable("test.spark22_test")
> spark.sql("select * from test.spark22_test").show()
> +------------+
> |databaseName|
> +------------+
> |       mydb1|
> |       mydb2|
> |     default|
> |        test|
> +------------+
> {code}
> It is kind of a disaster that we are not able to read tables created by the
> very same Spark version and have to manually specify the path as an explicit
> option.
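The second workaround quoted above has to be repeated for every affected table, so it may help to build the statement from the table name and location. This is a rough sketch only: PathWorkaround and alterPathStatement are hypothetical names, not part of Spark or of the report, and the helper does nothing beyond assembling the same ALTER TABLE text that the description uses.

```scala
object PathWorkaround {
  // Build the ALTER TABLE statement used by the second workaround in the report.
  // Inputs are validated rather than escaped, to keep the sketch simple.
  def alterPathStatement(table: String, location: String): String = {
    require(
      table.matches("[A-Za-z0-9_]+(\\.[A-Za-z0-9_]+)?"),
      s"unexpected table name: $table"
    )
    require(!location.contains("'"), "location must not contain single quotes")
    s"alter table $table set SERDEPROPERTIES ('path'='$location')"
  }
}

// In spark-shell, mirroring the steps from the description:
//   spark.sql(PathWorkaround.alterPathStatement(
//     "test.spark22_test",
//     "hdfs://my-cluster-name:8020/hadoop/eco/hive/warehouse/test.db/spark22_test"))
//   spark.catalog.refreshTable("test.spark22_test")
//   spark.sql("select * from test.spark22_test").show()
```

The location still has to be supplied by hand for each table; the helper only removes the risk of mistyping the statement.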
[jira] [Commented] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself
[ https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246463#comment-16246463 ]

Srinivasa Reddy Vundela commented on SPARK-21994:
-------------------------------------------------

This issue is related to Cloudera Spark and was fixed recently. We can close this JIRA.
[jira] [Commented] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself
[ https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247403#comment-16247403 ]

Guillaume Van Delsen commented on SPARK-21994:
----------------------------------------------

[~vsr] Good news. Could you share the Cloudera changelog entry for this fix? I did not find it anywhere.
Thanks!
[jira] [Commented] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself
[ https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250032#comment-16250032 ]

Srinivasa Reddy Vundela commented on SPARK-21994:
-------------------------------------------------

commit d5e3ba3e970c7241298db2578f0d7965b6e16ae3
Author: Srinivasa Reddy Vundela
Date:   Mon Oct 9 14:25:01 2017 -0700

    CDH-60037. Not able to read hive table from Cloudera version of Spark 2.2
[jira] [Commented] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself
[ https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250039#comment-16250039 ]

Sean Owen commented on SPARK-21994:
-----------------------------------

(Don't think that would be meaningful outside Cloudera at the moment; the commit doesn't exist in the public release/repo yet)
[jira] [Commented] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself
[ https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250066#comment-16250066 ]

Srinivasa Reddy Vundela commented on SPARK-21994:
-------------------------------------------------

[~srowen] That's right, it is not available in a public release yet. I just posted it for reference.
[jira] [Commented] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself
[ https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299780#comment-16299780 ]

Jurgis Pods commented on SPARK-21994:
-------------------------------------

[~vsr], do you know when the mentioned fix in Cloudera will make it to the next release? The Cloudera documentation does not list a new version of Spark 2.2 as of yet:
https://www.cloudera.com/documentation/spark2/latest/topics/spark2_packaging.html
[jira] [Commented] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself
[ https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923450#comment-16923450 ]

Tomasz Belina commented on SPARK-21994:
---------------------------------------

I've experienced the same issue on Spark 2.4.3.

> Spark 2.2 can not read Parquet table created by itself
> ------------------------------------------------------
>
>                 Key: SPARK-21994
>                 URL: https://issues.apache.org/jira/browse/SPARK-21994
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>         Environment: Spark 2.2 on Cloudera CDH 5.10.1, Hive 1.1
>            Reporter: Jurgis Pods
>            Priority: Major