[jira] [Commented] (SPARK-15705) Spark won't read ORC schema from metastore for partitioned tables
[ https://issues.apache.org/jira/browse/SPARK-15705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281170#comment-16281170 ]

Dongjoon Hyun commented on SPARK-15705:
---------------------------------------

Since 2.2.1 is released, I'll update the result from 2.2.1, too.

{code}
scala> sql("set spark.sql.hive.convertMetastoreOrc=true")

scala> spark.table("default.test").printSchema
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- state: string (nullable = true)

scala> spark.version
res2: String = 2.2.1
{code}

> Spark won't read ORC schema from metastore for partitioned tables
> -----------------------------------------------------------------
>
>                 Key: SPARK-15705
>                 URL: https://issues.apache.org/jira/browse/SPARK-15705
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>        Environment: HDP 2.3.4 (Hive 1.2.1, Hadoop 2.7.1)
>            Reporter: Nic Eggert
>            Assignee: Yin Huai
>            Priority: Critical
>             Fix For: 2.0.0
>
> Spark does not seem to read the schema from the Hive metastore for
> partitioned tables stored as ORC files. It appears to read the schema from
> the files themselves, which, if they were created with Hive, does not match
> the metastore schema (at least not before Hive 2.0; see HIVE-4243). To
> reproduce:
> In Hive:
> {code}
> hive> create table default.test (id BIGINT, name STRING) partitioned by (state STRING) stored as orc;
> hive> insert into table default.test partition (state="CA") values (1, "mike"), (2, "steve"), (3, "bill");
> {code}
> In Spark:
> {code}
> scala> spark.table("default.test").printSchema
> {code}
> Expected result: Spark should preserve the column names that were defined in Hive.
> Actual result:
> {code}
> root
>  |-- _col0: long (nullable = true)
>  |-- _col1: string (nullable = true)
>  |-- state: string (nullable = true)
> {code}
> Possibly related to SPARK-14959?

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[ https://issues.apache.org/jira/browse/SPARK-15705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16163747#comment-16163747 ]

Dongjoon Hyun commented on SPARK-15705:
---------------------------------------

Hi, all. I'm tracking this bug. This seems to be fixed since 2.1.1.

{code}
scala> spark.table("default.test").printSchema
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- state: string (nullable = true)

scala> sql("set spark.sql.hive.convertMetastoreOrc=true")
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.table("default.test").printSchema
root
 |-- _col0: long (nullable = true)
 |-- _col1: string (nullable = true)
 |-- state: string (nullable = true)

scala> sc.version
res3: String = 2.0.2
{code}

{code}
scala> spark.table("default.test").printSchema
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- state: string (nullable = true)

scala> sql("set spark.sql.hive.convertMetastoreOrc=true")
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.table("default.test").printSchema
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- state: string (nullable = true)

scala> sc.version
res3: String = 2.1.1
{code}

{code}
scala> spark.table("default.test").printSchema
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- state: string (nullable = true)

scala> sql("set spark.sql.hive.convertMetastoreOrc=true")
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.table("default.test").printSchema
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- state: string (nullable = true)

scala> sc.version
res3: String = 2.2.0
{code}
[ https://issues.apache.org/jira/browse/SPARK-15705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15384648#comment-15384648 ]

Yin Huai commented on SPARK-15705:
----------------------------------

The proper fix will be tracked by https://issues.apache.org/jira/browse/SPARK-16628.
[ https://issues.apache.org/jira/browse/SPARK-15705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15384604#comment-15384604 ]

Apache Spark commented on SPARK-15705:
--------------------------------------

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/14267
[ https://issues.apache.org/jira/browse/SPARK-15705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15384593#comment-15384593 ]

Yin Huai commented on SPARK-15705:
----------------------------------

I will send a PR to change the default value of {{spark.sql.hive.convertMetastoreOrc}}. Then, I will start to look at the fix.
[ https://issues.apache.org/jira/browse/SPARK-15705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15384464#comment-15384464 ]

Michael Allman commented on SPARK-15705:
----------------------------------------

Yeah, I'm not too keen on that code path either. Inferring the schema by reading every file in a partitioned table is a relatively heavyweight and slow operation. We're not leveraging the fact that we have a metastore schema. This is a performance/efficiency issue I've been working on in our own codebase, and I believe we'll have something to contribute in the near future. However, that will definitely not make it into 2.0. As for your problem specifically, I can't really suggest a quick fix because I have no experience with the ORC file format.
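The trade-off described above (cheap metastore lookup versus scanning every file) can be sketched in plain Scala; `Field` and `resolveSchema` are illustrative names for this sketch, not Spark APIs:

```scala
// Illustrative sketch, not Spark code: prefer the metastore schema when one
// exists, and only fall back to the expensive per-file inference.
case class Field(name: String, dataType: String)

def resolveSchema(
    metastoreSchema: Option[Seq[Field]],
    inferFromFiles: () => Seq[Field]): Seq[Field] =
  // getOrElse is lazy in its argument, so no file I/O happens
  // when the metastore already has an answer.
  metastoreSchema.getOrElse(inferFromFiles())

val declared = Seq(Field("id", "long"), Field("name", "string"))
// The fallback thunk is never invoked here, so no "files" are touched.
println(resolveSchema(Some(declared), () => sys.error("unexpected file scan")))
```

The design point is that the fallback stays available for tables with no usable metastore schema, while the common case never pays the scan cost.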
[ https://issues.apache.org/jira/browse/SPARK-15705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15375543#comment-15375543 ]

Nic Eggert commented on SPARK-15705:
------------------------------------

I believe the problem is around here:
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L301

You can see that there's special handling for merging the Parquet and metastore schemas, but for anything else the schema is just taken from the files. I don't really have the expertise to fix this on my own, but I'm happy to help in any way I can.
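For ORC files written by Hive, the physical column names are positional (`_col0`, `_col1`, ...), so an ORC-aware merge step could, in principle, realign them with the metastore-declared names by position. A minimal sketch of that idea; `positionalRename` is a hypothetical helper for illustration, not anything in `HiveMetastoreCatalog`:

```scala
// Hypothetical sketch: map positional ORC names onto the names the metastore
// declares, position by position. Partition columns (here `state`) already
// carry their real names, so the mapping is the identity for them.
def positionalRename(fileCols: Seq[String], metastoreCols: Seq[String]): Map[String, String] = {
  require(fileCols.length == metastoreCols.length, "column counts must match")
  fileCols.zip(metastoreCols).toMap
}

val fromFile = Seq("_col0", "_col1", "state")
val fromMetastore = Seq("id", "name", "state")
// _col0 -> id, _col1 -> name, state -> state
println(positionalRename(fromFile, fromMetastore))
```

At the session level the same idea is reachable through the public `Dataset.toDF(colNames: _*)` API, e.g. `spark.table("default.test").toDF("id", "name", "state")`, though that is a per-query patch rather than a fix.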
[ https://issues.apache.org/jira/browse/SPARK-15705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373786#comment-15373786 ]

Nic Eggert commented on SPARK-15705:
------------------------------------

Found a workaround: set spark.sql.hive.convertMetastoreOrc=false.
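A sketch of applying the workaround in a spark-shell session, assuming a Hive-enabled `spark` session and the `default.test` table from the description:

```scala
// Fall back to Hive's own ORC serde, which takes column names from the
// metastore instead of from the files.
sql("set spark.sql.hive.convertMetastoreOrc=false")
spark.table("default.test").printSchema // expected: id, name, state
```

The flag can also be set at launch time with `--conf spark.sql.hive.convertMetastoreOrc=false`. Note the trade-off: it bypasses Spark's native ORC read path in favor of the Hive code path.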
[ https://issues.apache.org/jira/browse/SPARK-15705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15371513#comment-15371513 ]

Nic Eggert commented on SPARK-15705:
------------------------------------

Double-checked just for fun. The problem still exists in RC2.
[ https://issues.apache.org/jira/browse/SPARK-15705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15345533#comment-15345533 ]

Jeff Zhang commented on SPARK-15705:
------------------------------------

I will take a look at it.
[ https://issues.apache.org/jira/browse/SPARK-15705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15345036#comment-15345036 ]

Nic Eggert commented on SPARK-15705:
------------------------------------

Raised priority to critical. We have a lot of partitioned tables stored as ORC, so dataframes/datasets in 2.0 are effectively useless to us until this is fixed.
[ https://issues.apache.org/jira/browse/SPARK-15705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313257#comment-15313257 ]

Xin Wu commented on SPARK-15705:
--------------------------------

I can recreate it now and will look into it.