[ https://issues.apache.org/jira/browse/SPARK-36440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Xu resolved SPARK-36440.
------------------------------
    Fix Version/s: 3.2.0
       Resolution: Fixed

Already fixed by https://issues.apache.org/jira/browse/SPARK-36197 for future versions.

> Spark3 fails to read hive table with mixed format
> -------------------------------------------------
>
>                 Key: SPARK-36440
>                 URL: https://issues.apache.org/jira/browse/SPARK-36440
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0, 3.1.1, 3.1.2
>            Reporter: Jason Xu
>            Priority: Major
>             Fix For: 3.2.0
>
> Spark 3 fails to read a Hive table whose partitions are stored in mixed file formats when the Hive SerDe is used; this is a regression compared to Spark 2.4.
> Reproduction steps:
> 1. In a Spark 3 (3.0 or 3.1) spark-shell, create a table stored as RCFile and write one partition:
> {code:java}
> scala> spark.sql("create table tmp.test_table (id int, name string) partitioned by (pt int) stored as rcfile")
> scala> spark.sql("insert into tmp.test_table partition (pt = 1) values (1, 'Alice'), (2, 'Bob')")
> {code}
> 2. Run a Hive command to change the table-level file format (from RCFile to Parquet):
> {code:java}
> hive (default)> alter table tmp.test_table set fileformat Parquet;
> {code}
> 3. Try to read the existing partition (still in RCFile format) with the Hive SerDe from the Spark shell:
> {code:java}
> scala> spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
> scala> spark.sql("select * from tmp.test_table where pt=1").show
> {code}
> Exception (file path anonymized with <path>):
> {code:java}
> Caused by: java.lang.RuntimeException: s3a://<path>/data/part-00000-22112178-5dd7-4065-89d7-2ee550296909-c000 is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [5, 96, 1, -33]
>   at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:524)
>   at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:505)
>   at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:499)
>   at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:448)
>   at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:433)
>   at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:79)
>   at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:75)
>   at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
>   at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:286)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:285)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:243)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:96)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> <omitted more logs>
> {code}
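For context on the error above: Parquet readers validate the 4-byte magic "PAR1" ([80, 65, 82, 49] in ASCII) at the tail of each file, and because the table-level format now says Parquet, MapredParquetInputFormat is handed the partition file that was actually written as RCFile, whose tail bytes differ, so the read is rejected. The following is only a minimal standalone Scala sketch of that magic-number check for illustration; it is not the Spark or Parquet implementation, and ParquetMagicCheck/looksLikeParquet are hypothetical names.

{code:java}
object ParquetMagicCheck {
  // Parquet files must end with the ASCII bytes "PAR1" = [80, 65, 82, 49].
  val ParquetMagic: Array[Byte] = "PAR1".getBytes("US-ASCII")

  // True only if the last four bytes of a file match the Parquet magic.
  def looksLikeParquet(tailBytes: Array[Byte]): Boolean =
    tailBytes.length >= 4 && tailBytes.takeRight(4).sameElements(ParquetMagic)

  def main(args: Array[String]): Unit = {
    println(ParquetMagic.mkString("[", ", ", "]"))         // [80, 65, 82, 49]
    // Tail bytes reported for the RCFile partition in the stack trace above:
    println(looksLikeParquet(Array[Byte](5, 96, 1, -33)))  // false -> "is not a Parquet file"
  }
}
{code}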