[
https://issues.apache.org/jira/browse/HUDI-9344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lin Liu reassigned HUDI-9344:
-----------------------------
Assignee: Lin Liu
> Hudi Fails to Read Data in ORC File Format
> ------------------------------------------
>
> Key: HUDI-9344
> URL: https://issues.apache.org/jira/browse/HUDI-9344
> Project: Apache Hudi
> Issue Type: Bug
> Components: spark-sql
> Reporter: Ranga Reddy
> Assignee: Lin Liu
> Priority: Critical
> Fix For: 1.1.0
>
>
> Apache Hudi fails to read data when the table's base file format is ORC,
> i.e. when both *hoodie.base.file.format* and *hoodie.table.base.file.format*
> are set to ORC. As the stack trace below shows, the read path still goes
> through the Parquet footer reader, which rejects the ORC base file.
> *Exception:*
> {code:java}
> java.lang.RuntimeException: s3a://warehouse/hudi_orc_test_table1/age=7/6dc054f7-3cfb-4c84-bb93-df0faf4b262a-0_0-11-13_20250428085736017.orc is not a Parquet file. Expected magic number at tail, but found [79, 82, 67, 25]
>     at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:565)
>     at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:799)
>     at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:666)
>     at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:85)
>     at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:71)
>     at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:66)
>     at org.apache.spark.sql.execution.datasources.parquet.Spark35ParquetReader.doRead(Spark35ParquetReader.scala:101)
>     at org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:80)
>     at org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.readBaseFile(HoodieFileGroupReaderBasedParquetFileFormat.scala:286)
>     at org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$3(HoodieFileGroupReaderBasedParquetFileFormat.scala:204)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:217)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:279)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:129)
>     at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:593)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
>     at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
>     at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
>     at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
>     at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
>     at org.apache.spark.scheduler.Task.run(Task.scala:141)
>     at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
>     at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
>     at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base/java.lang.Thread.run(Thread.java:829){code}
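> Note that the leading bytes reported by the exception, [79, 82, 67], are the
> ASCII codes for the ORC file magic, not Parquet's "PAR1" tail magic. A minimal
> standalone Java check of that decoding (illustrative only; not Hudi code):
> {code:java}
> import java.nio.charset.StandardCharsets;
>
> public class MagicBytes {
>     public static void main(String[] args) {
>         // First three bytes reported by the exception: [79, 82, 67]
>         byte[] found = {79, 82, 67};
>         // Prints "ORC" -- the Parquet reader is being handed an ORC base file
>         System.out.println(new String(found, StandardCharsets.US_ASCII));
>     }
> }
> {code}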
> *Sample code to reproduce this issue:*
>
> {code:sql}
> CREATE TABLE hudi_orc_test_table1 (
>   `id` BIGINT,
>   `name` STRING,
>   `age` INT,
>   `salary` DOUBLE
> ) USING hudi
> PARTITIONED BY (age)
> LOCATION 's3a://warehouse/hudi_orc_test_table1'
> TBLPROPERTIES (
>   'hoodie.base.file.format' = 'ORC',
>   'hoodie.table.base.file.format' = 'ORC',
>   'type' = 'cow');
>
> INSERT INTO hudi_orc_test_table1 (id, name, age, salary)
> VALUES (1, 'ranga', 34, 10000), (2, 'nishanth', 7, 300000);
>
> SELECT * FROM hudi_orc_test_table1;
> {code}
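> For reference, a DataFrame-based variant of the same reproduction is sketched
> below. This is untested: it assumes the Hudi Spark bundle is on the classpath
> and that the write-time options mirror the table properties above.
> {code:java}
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SaveMode;
> import org.apache.spark.sql.SparkSession;
>
> public class HudiOrcRepro {
>     public static void main(String[] args) {
>         SparkSession spark = SparkSession.builder()
>             .appName("hudi-orc-repro").getOrCreate();
>
>         // Same two rows as the SQL INSERT above
>         Dataset<Row> rows = spark.sql(
>             "SELECT 1L AS id, 'ranga' AS name, 34 AS age, CAST(10000 AS DOUBLE) AS salary " +
>             "UNION ALL SELECT 2L, 'nishanth', 7, CAST(300000 AS DOUBLE)");
>
>         rows.write().format("hudi")
>             .option("hoodie.table.name", "hudi_orc_test_table1")
>             .option("hoodie.datasource.write.recordkey.field", "id")
>             .option("hoodie.datasource.write.partitionpath.field", "age")
>             .option("hoodie.base.file.format", "ORC")
>             .option("hoodie.table.base.file.format", "ORC")
>             .mode(SaveMode.Overwrite)
>             .save("s3a://warehouse/hudi_orc_test_table1");
>
>         // Expected to fail with the "is not a Parquet file" error above
>         spark.read().format("hudi")
>             .load("s3a://warehouse/hudi_orc_test_table1")
>             .show();
>     }
> }
> {code}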
>
> Refer to the following GitHub issue for more details:
> https://github.com/apache/hudi/issues/13221
--
This message was sent by Atlassian Jira
(v8.20.10#820010)