[ https://issues.apache.org/jira/browse/SPARK-27592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yuming Wang updated SPARK-27592:
--------------------------------
Description:

We give Hive an incorrect InputFormat (org.apache.hadoop.mapred.SequenceFileInputFormat) for reading Spark's Parquet datasource bucketed table:

{noformat}
spark-sql> CREATE TABLE t (c1 INT, c2 INT) USING parquet CLUSTERED BY (c1) SORTED BY (c1) INTO 2 BUCKETS;
2019-04-29 17:52:05 WARN HiveExternalCatalog:66 - Persisting bucketed data source table `default`.`t` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
spark-sql> DESC EXTENDED t;
c1	int	NULL
c2	int	NULL

# Detailed Table Information
Database	default
Table	t
Owner	yumwang
Created Time	Mon Apr 29 17:52:05 CST 2019
Last Access	Thu Jan 01 08:00:00 CST 1970
Created By	Spark 2.4.0
Type	MANAGED
Provider	parquet
Num Buckets	2
Bucket Columns	[`c1`]
Sort Columns	[`c1`]
Table Properties	[transient_lastDdlTime=1556531525]
Location	file:/user/hive/warehouse/t
Serde Library	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat	org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat	org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Storage Properties	[serialization.format=1]
{noformat}

We can see the incompatibility warning when creating the table:

{noformat}
WARN HiveExternalCatalog:66 - Persisting bucketed data source table `default`.`t` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
{noformat}

But downstream engines do not know about this incompatibility. I'd like to write this table's write information to the metadata so that each engine can decide compatibility for itself.
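One way the proposal could play out downstream: Spark's HiveExternalCatalog already persists its layout in table properties (keys such as spark.sql.sources.provider and spark.sql.sources.schema.numBuckets; treat the exact key names here as assumptions drawn from Spark 2.4 behavior, not a published contract). A minimal sketch of how another engine could inspect those properties and decide compatibility itself, rather than trusting the SequenceFileInputFormat recorded in the StorageDescriptor:

```python
from typing import Optional

# Sketch (hypothetical helper, not Spark/Hive API): decide whether a
# Hive metastore table is actually a Spark bucketed datasource table,
# which Hive cannot read via the InputFormat stored for it.
# The property keys below mirror what Spark's HiveExternalCatalog
# writes; they are assumptions, not a stable contract.

SPARK_PROVIDER = "spark.sql.sources.provider"
SPARK_NUM_BUCKETS = "spark.sql.sources.schema.numBuckets"
SPARK_BUCKET_COL_PREFIX = "spark.sql.sources.schema.bucketCol."


def spark_bucketing_info(table_properties: dict) -> Optional[dict]:
    """Return Spark's bucketing metadata if present, else None."""
    provider = table_properties.get(SPARK_PROVIDER)
    if provider is None or SPARK_NUM_BUCKETS not in table_properties:
        return None
    # Bucket columns are stored one per numbered key (bucketCol.0, ...).
    bucket_cols = [
        value for key, value in sorted(table_properties.items())
        if key.startswith(SPARK_BUCKET_COL_PREFIX)
    ]
    return {
        "provider": provider,
        "numBuckets": int(table_properties[SPARK_NUM_BUCKETS]),
        "bucketColumns": bucket_cols,
    }


def hive_compatible(table_properties: dict) -> bool:
    """A Hive reader should refuse (or at least ignore bucketing for)
    tables that Spark persisted in its own bucketed format."""
    return spark_bucketing_info(table_properties) is None
```

For the table `t` above, the properties would look roughly like `{"spark.sql.sources.provider": "parquet", "spark.sql.sources.schema.numBuckets": "2", "spark.sql.sources.schema.bucketCol.0": "c1"}`, and `hive_compatible` would return False, letting the engine fail fast instead of misreading Parquet files through SequenceFileInputFormat.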
> Write the data of table write information to metadata
> -----------------------------------------------------
>
>                 Key: SPARK-27592
>                 URL: https://issues.apache.org/jira/browse/SPARK-27592
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Yuming Wang
>            Priority: Major
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)