Ranga Reddy created HUDI-9344:
---------------------------------
Summary: Hudi Fails to Read Data in ORC File Format
Key: HUDI-9344
URL: https://issues.apache.org/jira/browse/HUDI-9344
Project: Apache Hudi
Issue Type: Bug
Components: spark-sql
Reporter: Ranga Reddy
Fix For: 1.1.0
Apache Hudi fails to read ORC base files when both *hoodie.base.file.format* and *hoodie.table.base.file.format* are set to ORC: the Spark query path still routes the read through the Parquet reader, which fails with the exception below.
*Exception:*
{code:java}
java.lang.RuntimeException: s3a://warehouse/hudi_orc_test_table1/age=7/6dc054f7-3cfb-4c84-bb93-df0faf4b262a-0_0-11-13_20250428085736017.orc is not a Parquet file. Expected magic number at tail, but found [79, 82, 67, 25]
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:565)
	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:799)
	at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:666)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:85)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:71)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:66)
	at org.apache.spark.sql.execution.datasources.parquet.Spark35ParquetReader.doRead(Spark35ParquetReader.scala:101)
	at org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:80)
	at org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.readBaseFile(HoodieFileGroupReaderBasedParquetFileFormat.scala:286)
	at org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$3(HoodieFileGroupReaderBasedParquetFileFormat.scala:204)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:217)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:279)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:129)
	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:593)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829){code}
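The bytes reported in the error, [79, 82, 67], are the ASCII characters "ORC", i.e. the ORC header magic, so the file itself is a valid ORC base file that was handed to the Parquet reader. A minimal sketch of that magic-byte check (the class name and file path here are illustrative, not part of Hudi):
{code:java}
import java.io.FileInputStream;
import java.io.IOException;

public class OrcMagicCheck {
    // ORC files begin with the ASCII bytes "ORC" (79, 82, 67);
    // Parquet files begin and end with the 4-byte magic "PAR1".
    public static boolean isOrc(String path) throws IOException {
        try (FileInputStream in = new FileInputStream(path)) {
            byte[] magic = new byte[3];
            int read = in.read(magic);
            return read == 3 && magic[0] == 'O' && magic[1] == 'R' && magic[2] == 'C';
        }
    }

    public static void main(String[] args) throws IOException {
        // e.g. a base file downloaded from the table location
        System.out.println(isOrc(args[0]));
    }
}
{code}
Running this against the base file named in the stack trace would print {{true}}, matching the bytes the Parquet reader complained about.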
*Sample code to reproduce this issue:*
{code:sql}
CREATE TABLE hudi_orc_test_table1 (
  `id` BIGINT,
  `name` STRING,
  `age` INT,
  `salary` DOUBLE
) USING hudi
PARTITIONED BY (age)
LOCATION 's3a://warehouse/hudi_orc_test_table1'
TBLPROPERTIES (
  'hoodie.base.file.format' = 'ORC',
  'hoodie.table.base.file.format' = 'ORC',
  'type' = 'cow');

INSERT INTO hudi_orc_test_table1 (id, name, age, salary) VALUES
  (1, 'ranga', 34, 10000),
  (2, 'nishanth', 7, 300000);

SELECT * FROM hudi_orc_test_table1;
{code}
Refer to the following GitHub issue for more details:
https://github.com/apache/hudi/issues/13221
--
This message was sent by Atlassian Jira
(v8.20.10#820010)