Ranga Reddy created HUDI-9344:
---------------------------------

             Summary: Hudi Fails to Read Data in ORC File Format
                 Key: HUDI-9344
                 URL: https://issues.apache.org/jira/browse/HUDI-9344
             Project: Apache Hudi
          Issue Type: Bug
          Components: spark-sql
            Reporter: Ranga Reddy
             Fix For: 1.1.0


Apache Hudi fails to read data written in the ORC base file format when both 
*hoodie.base.file.format* and *hoodie.table.base.file.format* are set to ORC: 
the read path still invokes the Parquet reader on the ORC base files.

*Exception:*

{code:java}
java.lang.RuntimeException: s3a://warehouse/hudi_orc_test_table1/age=7/6dc054f7-3cfb-4c84-bb93-df0faf4b262a-0_0-11-13_20250428085736017.orc is not a Parquet file. Expected magic number at tail, but found [79, 82, 67, 25]
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:565)
    at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:799)
    at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:666)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:85)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:71)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:66)
    at org.apache.spark.sql.execution.datasources.parquet.Spark35ParquetReader.doRead(Spark35ParquetReader.scala:101)
    at org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:80)
    at org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.readBaseFile(HoodieFileGroupReaderBasedParquetFileFormat.scala:286)
    at org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$3(HoodieFileGroupReaderBasedParquetFileFormat.scala:204)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:217)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:279)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:129)
    at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:593)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
    at org.apache.spark.scheduler.Task.run(Task.scala:141)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829){code}
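As a side note on the error message: the leading byte values it reports decode, as ASCII, to the ORC header magic, which confirms the Parquet read path was handed an ORC base file. A minimal illustrative check (not part of the original report):

```python
# The bytes reported in the exception are the start of the ORC file magic:
# the Parquet reader expects the ASCII magic "PAR1" at the file tail,
# but finds an ORC header instead.
orc_magic = bytes([79, 82, 67])        # first three bytes from the error message
parquet_magic = b"PAR1"                # magic Parquet expects at head and tail

print(orc_magic.decode("ascii"))       # -> ORC
print(orc_magic == parquet_magic[:3])  # -> False
```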
*Sample code to reproduce this issue:*

 
{code:java}
CREATE TABLE hudi_orc_test_table1 (
    `id` BIGINT,
    `name` STRING,
    `age` INT,
    `salary` DOUBLE
) USING hudi
PARTITIONED BY (age)
LOCATION 's3a://warehouse/hudi_orc_test_table1'
TBLPROPERTIES (
  'hoodie.base.file.format' = 'ORC',
  'hoodie.table.base.file.format' = 'ORC',
  'type' = 'cow');
INSERT INTO hudi_orc_test_table1 (id, name, age, salary) VALUES (1, 'ranga', 34, 10000), (2, 'nishanth', 7, 300000);
SELECT * FROM hudi_orc_test_table1;
{code}
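When diagnosing a mismatch like this, it can help to verify which format the base files under the table location actually use by inspecting their magic bytes. A minimal sketch, assuming the files are reachable as local paths; `detect_base_file_format` is a hypothetical helper written for illustration, not a Hudi API:

```python
def detect_base_file_format(path: str) -> str:
    """Classify a base file as ORC or Parquet from its header magic bytes."""
    with open(path, "rb") as f:
        head = f.read(4)
    if head.startswith(b"ORC"):    # ORC files begin with the ASCII magic "ORC"
        return "ORC"
    if head.startswith(b"PAR1"):   # Parquet files begin (and end) with "PAR1"
        return "PARQUET"
    return "UNKNOWN"
```

Running this over the files under `s3a://warehouse/hudi_orc_test_table1/` (after copying them locally) would show ORC base files even though the reader treats them as Parquet.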
 

Refer to the following GitHub issue for more details:

https://github.com/apache/hudi/issues/13221



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
