bhat-vinay commented on code in PR #10865:
URL: https://github.com/apache/hudi/pull/10865#discussion_r1532227150
##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java:
##########

@@ -112,10 +110,15 @@ public S3EventsHoodieIncrSource(
       QueryRunner queryRunner,
       CloudDataFetcher cloudDataFetcher) {
     super(props, sparkContext, sparkSession, schemaProvider);
+
+    if (getBooleanWithAltKeys(props, ENABLE_EXISTS_CHECK)) {
+      sparkSession.conf().set("spark.sql.files.ignoreMissingFiles", "true");
+      sparkSession.conf().set("spark.sql.files.ignoreCorruptFiles", "true");

Review Comment:
```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (192.168.0.101 executor driver): java.io.FileNotFoundException: File file:/var/folders/9g/815t_9ns1pg2h631kb3792zw0000gn/T/junit5612947518157164794/data1.json does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.readCurrentFileNotFoundError(QueryExecutionErrors.scala:661)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:212)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:270)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
```
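For context on the trace above, here is a minimal standalone sketch of the failure mode (hypothetical class, not from this PR or its tests; the local-mode setup and temp-file layout are illustrative): a file is deleted between query planning and task execution, which reproduces the `FileNotFoundException` unless `spark.sql.files.ignoreMissingFiles` is set to `true` before the action runs.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Illustrative sketch only; names and paths are not from the PR.
public class IgnoreMissingFilesDemo {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .master("local[1]")
        .appName("ignore-missing-files-demo")
        .getOrCreate();

    // Write a small JSON file and plan a read over its directory.
    // The file listing is captured when the DataFrame is created.
    Path dir = Files.createTempDirectory("demo");
    Path file = dir.resolve("data1.json");
    Files.write(file, "{\"id\": 1}\n".getBytes(StandardCharsets.UTF_8));
    Dataset<Row> df = spark.read().json(dir.toString());

    // Delete the file so it is gone by the time a task opens it.
    Files.delete(file);

    // With the flag unset (default false), df.count() aborts with the
    // java.io.FileNotFoundException shown in the stack trace above.
    spark.conf().set("spark.sql.files.ignoreMissingFiles", "true");
    System.out.println(df.count()); // 0 -- the missing file is skipped

    spark.stop();
  }
}
```

Setting the flag on the session before the action is triggered is enough to change the outcome, since it is consulted when the file scan executes rather than when the DataFrame is defined.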