yihua opened a new issue, #18002:
URL: https://github.com/apache/hudi/issues/18002
### Bug Description
**What happened:**
In certain cases, an incremental query with full-table-scan fallback on a MOR table
fails on Databricks Runtime with the exception below. Such cases include:
(1) the start instant time is in the archived timeline; (2) the start instant time is
in the active timeline, but some files of the incremental commit range are no longer
available due to cleaning or compaction. If both the start and end instant times are
in the active timeline and all files in the commit range are available, the
incremental query succeeds.
```
spark.read.format("org.apache.hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20260120221843052")
  .option("hoodie.datasource.read.end.instanttime", "20260120221843053")
  .option("hoodie.metadata.enable", "false")
  .option("hoodie.datasource.read.incr.fallback.fulltablescan.enable", "true")
  .option("hoodie.datasource.read.incr.path.glob", "san_francisco/*")
  .load("s3a://dbr-test/hudi_mor_v6")
  .show(false)
```
```
NoSuchMethodError: 'org.apache.hadoop.fs.FileStatus org.apache.spark.sql.execution.datasources.FileStatusWithMetadata.fileStatus()'
    at org.apache.spark.sql.execution.datasources.HoodieSpark35PartitionedFileUtils$.$anonfun$toFileStatuses$2(HoodieSpark35PartitionedFileUtils.scala:48)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at scala.collection.TraversableLike.map(TraversableLike.scala:286)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
    at scala.collection.AbstractTraversable.map(Traversable.scala:108)
    at org.apache.spark.sql.execution.datasources.HoodieSpark35PartitionedFileUtils$.toFileStatuses(HoodieSpark35PartitionedFileUtils.scala:48)
    at org.apache.hudi.HoodieBaseRelation.listLatestFileSlices(HoodieBaseRelation.scala:428)
    at org.apache.hudi.MergeOnReadIncrementalRelationV1.listFileSplits(MergeOnReadIncrementalRelationV1.scala:131)
    at org.apache.hudi.HoodieIncrementalFileIndex.listFiles(HoodieIncrementalFileIndex.scala:47)
    at org.apache.spark.sql.execution.datasources.FileIndex.listPartitionDirectoriesAndFiles(FileIndex.scala:234)
    at org.apache.spark.sql.execution.SparkOrAetherFileSourceScanLike.listFiles(DataSourceScanExec.scala:972)
    at org.apache.spark.sql.execution.SparkOrAetherFileSourceScanLike.listFiles$(DataSourceScanExec.scala:954)
    at org.apache.spark.sql.execution.FileSourceScanExec.listFiles(DataSourceScanExec.scala:3187)
    at org.apache.spark.sql.execution.SparkOrAetherFileSourceScanLike.$anonfun$_selectedPartitions$2(DataSourceScanExec.scala:1046)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.execution.SparkOrAetherFileSourceScanLike._selectedPartitions(DataSourceScanExec.scala:1038)
    at org.apache.spark.sql.execution.SparkOrAetherFileSourceScanLike._selectedPartitions$(DataSourceScanExec.scala:1037)
    at org.apache.spark.sql.execution.FileSourceScanExec._selectedPartitions$lzycompute(DataSourceScanExec.scala:3187)
    at org.apache.spark.sql.execution.FileSourceScanExec._selectedPartitions(DataSourceScanExec.scala:3187)
    at org.apache.spark.sql.execution.SparkOrAetherFileSourceScanLike.setDriverMetricsForSelectedPartitions(DataSourceScanExec.scala:1064)
    at org.apache.spark.sql.execution.SparkOrAetherFileSourceScanLike.selectedPartitions(DataSourceScanExec.scala:1069)
    at org.apache.spark.sql.execution.SparkOrAetherFileSourceScanLike.selectedPartitions$(DataSourceScanExec.scala:1068)
    at org.apache.spark.sql.execution.FileSourceScanExec.selectedPartitions(DataSourceScanExec.scala:3187)
    at org.apache.spark.sql.execution.SparkOrAetherFileSourceScanLike.$anonfun$_dynamicallySelectedPartitions$1(DataSourceScanExec.scala:1157)
    at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
    at org.apache.spark.sql.execution.SparkOrAetherFileSourceScanLike._dynamicallySelectedPartitions(DataSourceScanExec.scala:1078)
    at org.apache.spark.sql.execution.SparkOrAetherFileSourceScanLike._dynamicallySelectedPartitions$(DataSourceScanExec.scala:1076)
    at org.apache.spark.sql.execution.FileSourceScanExec._dynamicallySelectedPartitions$lzycompute(DataSourceScanExec.scala:3187)
    at org.apache.spark.sql.execution.FileSourceScanExec._dynamicallySelectedPartitions(DataSourceScanExec.scala:3187)
    at org.apache.spark.sql.execution.SparkOrAetherFileSourceScanLike.$anonfun$dynamicallySelectedPartitions$2(DataSourceScanExec.scala:1192)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.execution.SparkOrAetherFileSourceScanLike.dynamicallySelectedPartitions(DataSourceScanExec.scala:1191)
    at org.apache.spark.sql.execution.SparkOrAetherFileSourceScanLike.dynamicallySelectedPartitions$(DataSourceScanExec.scala:1190)
    at org.apache.spark.sql.execution.FileSourceScanExec.dynamicallySelectedPartitions(DataSourceScanExec.scala:3187)
    at org.apache.spark.sql.execution.SparkOrAetherFileSourceScanLike.finalSelectedPartitions(DataSourceScanExec.scala:1230)
    at org.apache.spark.sql.execution.SparkOrAetherFileSourceScanLike.finalSelectedPartitions$(DataSourceScanExec.scala:1230)
    at org.apache.spark.sql.execution.FileSourceScanExec.finalSelectedPartitions(DataSourceScanExec.scala:3187)
    at org.apache.spark.sql.execution.SparkOrAetherFileSourceScanLike.totalFinalSelectedPartitionFileSize(DataSourceScanExec.scala:1219)
    at org.apache.spark.sql.execution.SparkOrAetherFileSourceScanLike.totalFinalSelectedPartitionFileSize$(DataSourceScanExec.scala:1219)
    at org.apache.spark.sql.execution.FileSourceScanExec.totalFinalSelectedPartitionFileSize$lzycompute(DataSourceScanExec.scala:3187)
    at org.apache.spark.sql.execution.FileSourceScanExec.totalFinalSelectedPartitionFileSize(DataSourceScanExec.scala:3187)
    at com.databricks.sql.transaction.tahoe.metering.DeltaMetering$.$anonfun$reportUsage$3(DeltaMetering.scala:656)
    at com.databricks.sql.transaction.tahoe.metering.DeltaMetering$.$anonfun$reportUsage$3$adapted(DeltaMetering.scala:251)
    at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:985)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:984)
    at com.databricks.sql.transaction.tahoe.metering.DeltaMetering$.reportUsage(DeltaMetering.scala:251)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$10(SQLExecution.scala:651)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:810)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$1(SQLExecution.scala:352)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:1481)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId0(SQLExecution.scala:217)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:747)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:5032)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:3772)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:4007)
    at org.apache.spark.sql.Dataset.getRows(Dataset.scala:460)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:496)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:1113)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:1090)
    at $lineb8d76de79b95437d99b85019c4eadcc525.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-7721031122083688:11)
    at $lineb8d76de79b95437d99b85019c4eadcc525.$read$$iw$$iw$$iw$$iw$$iw.<init>(command-7721031122083688:56)
    at $lineb8d76de79b95437d99b85019c4eadcc525.$read$$iw$$iw$$iw$$iw.<init>(command-7721031122083688:58)
    at $lineb8d76de79b95437d99b85019c4eadcc525.$read$$iw$$iw$$iw.<init>(command-7721031122083688:60)
    at $lineb8d76de79b95437d99b85019c4eadcc525.$read$$iw$$iw.<init>(command-7721031122083688:62)
    at $lineb8d76de79b95437d99b85019c4eadcc525.$read$$iw.<init>(command-7721031122083688:64)
    at $lineb8d76de79b95437d99b85019c4eadcc525.$read.<init>(command-7721031122083688:66)
    at $lineb8d76de79b95437d99b85019c4eadcc525.$read$.<init>(command-7721031122083688:70)
    at $lineb8d76de79b95437d99b85019c4eadcc525.$read$.<clinit>(command-7721031122083688)
    at $lineb8d76de79b95437d99b85019c4eadcc525.$eval$.$print$lzycompute(<notebook>:7)
    at $lineb8d76de79b95437d99b85019c4eadcc525.$eval$.$print(<notebook>:6)
    at $lineb8d76de79b95437d99b85019c4eadcc525.$eval.$print(<notebook>)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:569)
    at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:747)
    at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1020)
    at scala.tools.nsc.interpreter.IMain.$anonfun$interpret$1(IMain.scala:568)
    at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:36)
    at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:116)
    at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:41)
    at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:567)
    at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:594)
    at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:564)
    at com.databricks.backend.daemon.driver.DriverILoop.execute(DriverILoop.scala:201)
    at com.databricks.backend.daemon.driver.ScalaDriverLocal.$anonfun$repl$3(ScalaDriverLocal.scala:296)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at com.databricks.backend.daemon.driver.DriverLocal$TrapExitInternal$.threadSafeTrapExit(DriverLocal.scala:1811)
    at com.databricks.backend.daemon.driver.DriverLocal$TrapExitInternal$.trapExit(DriverLocal.scala:1769)
    at com.databricks.backend.daemon.driver.DriverLocal$TrapExit$.apply(DriverLocal.scala:1660)
    at com.databricks.backend.daemon.driver.ScalaDriverLocal.executeCommand$1(ScalaDriverLocal.scala:296)
    at com.databricks.backend.daemon.driver.ScalaDriverLocal.$anonfun$repl$2(ScalaDriverLocal.scala:265)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at scala.Console$.withErr(Console.scala:196)
    at com.databricks.backend.daemon.driver.ScalaDriverLocal.$anonfun$repl$1(ScalaDriverLocal.scala:262)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at scala.Console$.withOut(Console.scala:167)
    at com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:262)
    at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$36(DriverLocal.scala:1321)
    at com.databricks.unity.EmptyHandle$.runWith(UCSHandle.scala:133)
    at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$30(DriverLocal.scala:1312)
    at com.databricks.logging.AttributionContextTracing.$anonfun$withAttributionContext$1(AttributionContextTracing.scala:49)
    at com.databricks.logging.AttributionContext$.$anonfun$withValue$1(AttributionContext.scala:293)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:289)
    at com.databricks.logging.AttributionContextTracing.withAttributionContext(AttributionContextTracing.scala:47)
    at com.databricks.logging.AttributionContextTracing.withAttributionContext$(AttributionContextTracing.scala:44)
    at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:130)
    at com.databricks.logging.AttributionContextTracing.withAttributionTags(AttributionContextTracing.scala:96)
    at com.databricks.logging.AttributionContextTracing.withAttributionTags$(AttributionContextTracing.scala:77)
    at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:130)
    at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$1(DriverLocal.scala:1236)
    at com.databricks.backend.daemon.driver.DriverLocal$.$anonfun$maybeSynchronizeExecution$4(DriverLocal.scala:1721)
    at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:879)
    at com.databricks.backend.daemon.driver.DriverWrapper.$anonfun$tryExecutingCommand$2(DriverWrapper.scala:1054)
    at scala.util.Try$.apply(Try.scala:213)
    at com.databricks.backend.daemon.driver.DriverWrapper.$anonfun$tryExecutingCommand$1(DriverWrapper.scala:1043)
    at com.databricks.backend.daemon.driver.DriverWrapper.$anonfun$tryExecutingCommand$3(DriverWrapper.scala:1089)
    at com.databricks.logging.UsageLogging.executeThunkAndCaptureResultTags$1(UsageLogging.scala:616)
    at com.databricks.logging.UsageLogging.$anonfun$recordOperationWithResultTags$4(UsageLogging.scala:643)
    at com.databricks.logging.AttributionContextTracing.$anonfun$withAttributionContext$1(AttributionContextTracing.scala:49)
    at com.databricks.logging.AttributionContext$.$anonfun$withValue$1(AttributionContext.scala:293)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:289)
    at com.databricks.logging.AttributionContextTracing.withAttributionContext(AttributionContextTracing.scala:47)
    at com.databricks.logging.AttributionContextTracing.withAttributionContext$(AttributionContextTracing.scala:44)
    at com.databricks.backend.daemon.driver.DriverWrapper.withAttributionContext(DriverWrapper.scala:81)
    at com.databricks.logging.AttributionContextTracing.withAttributionTags(AttributionContextTracing.scala:96)
    at com.databricks.logging.AttributionContextTracing.withAttributionTags$(AttributionContextTracing.scala:77)
    at com.databricks.backend.daemon.driver.DriverWrapper.withAttributionTags(DriverWrapper.scala:81)
    at com.databricks.logging.UsageLogging.recordOperationWithResultTags(UsageLogging.scala:611)
    at com.databricks.logging.UsageLogging.recordOperationWithResultTags$(UsageLogging.scala:519)
    at com.databricks.backend.daemon.driver.DriverWrapper.recordOperationWithResultTags(DriverWrapper.scala:81)
    at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:1089)
    at com.databricks.backend.daemon.driver.DriverWrapper.executeCommandAndGetError(DriverWrapper.scala:766)
    at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:859)
    at com.databricks.backend.daemon.driver.DriverWrapper.$anonfun$runInnerLoop$1(DriverWrapper.scala:630)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at com.databricks.logging.AttributionContextTracing.$anonfun$withAttributionContext$1(AttributionContextTracing.scala:49)
    at com.databricks.logging.AttributionContext$.$anonfun$withValue$1(AttributionContext.scala:293)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:289)
    at com.databricks.logging.AttributionContextTracing.withAttributionContext(AttributionContextTracing.scala:47)
    at com.databricks.logging.AttributionContextTracing.withAttributionContext$(AttributionContextTracing.scala:44)
    at com.databricks.backend.daemon.driver.DriverWrapper.withAttributionContext(DriverWrapper.scala:81)
    at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:625)
    at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:548)
    at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:373)
    at java.base/java.lang.Thread.run(Thread.java:840)
```
**What you expected:**
The incremental query should succeed on Databricks Runtime.
**Steps to reproduce:**
1. Create a Databricks compute cluster with Databricks Runtime 16.4 LTS
(Spark 3.5, Scala 2.12), the Spark configs below, and
hudi_spark3_5_bundle_2_12_1_1_1.jar
2. Create a Hudi MOR table with a few delta commits
3. Run the incremental query above with a start instant time before the start
of the active timeline
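Step 2 above can be sketched roughly as follows. This is a minimal illustration, not the exact writes from the original report: the table name, schema, column names, and record values are hypothetical; only the table path, partition value (`san_francisco`), and MOR table type come from the report.

```scala
// Hypothetical sketch of step 2: write a few delta commits to a Hudi MOR table.
// Schema and values are illustrative only; each loop iteration produces one commit.
import org.apache.spark.sql.SaveMode

val tablePath = "s3a://dbr-test/hudi_mor_v6"
val batches = Seq(
  Seq((1, "a", "san_francisco")),
  Seq((2, "b", "san_francisco")),
  Seq((3, "c", "san_francisco"))
)
batches.foreach { batch =>
  spark.createDataFrame(batch).toDF("id", "name", "city")
    .write.format("org.apache.hudi")
    .option("hoodie.table.name", "hudi_mor_v6")
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.partitionpath.field", "city")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode(SaveMode.Append)
    .save(tablePath)
}
```

After enough commits have been cleaned or archived, rerunning the incremental query with a begin instant before the active timeline should reproduce the failure.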
### Environment
**Hudi version:** 1.1.1, master
**Query engine:** (Spark/Flink/Trino etc) Databricks Spark Runtime 16.4 LTS
(Spark 3.5, Scala 2.12/2.13), 17.3 LTS (Spark 4.0, Scala 2.13)
**Relevant configs:**
Spark configs for compute cluster:
```
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.extensions org.apache.spark.sql.hudi.HoodieSparkSessionExtension
spark.sql.catalog.spark_catalog org.apache.spark.sql.hudi.catalog.HoodieCatalog
spark.kryo.registrator org.apache.spark.HoodieSparkKryoRegistrar
spark.jars dbfs:/FileStore/jars/hudi_spark3_5_bundle_2_12_1_1_1.jar,dbfs:/FileStore/jars/aws_java_sdk_bundle_1_12_48.jar,dbfs:/FileStore/jars/hadoop_aws_3_3_1.jar
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.executor.userClassPathFirst true
spark.driver.userClassPathFirst true
```
### Logs and Stack Trace
_No response_