ssilb4 opened a new issue, #11241: URL: https://github.com/apache/hudi/issues/11241
**_Tips before filing an issue_**
- Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
- Join the mailing list to engage in conversations and get faster support at dev-subscr...@hudi.apache.org.
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

I used Hudi on EMR 6.5 and it worked well, but after upgrading from EMR 6.5 to 6.14 the same upsert job became very slow, and it runs into memory issues on large tables. I suspect the shuffle partition count at the `countByKey` stage is too small (100).

```
upsertDF.write.format("org.apache.hudi")
    .option("hoodie.table.name", table)
    .option("hoodie.datasource.write.partitionpath.field", "")
    .option("hoodie.datasource.write.keygenerator.class", keyGeneratorClass)
    .option("hoodie.datasource.write.recordkey.field", recordkeyField)
    .option("hoodie.datasource.write.precombine.field", precombineField)
    .option("hoodie.datasource.hive_sync.database", database)
    .option("hoodie.datasource.hive_sync.table", table)
    .option("hoodie.datasource.hive_sync.enable", "true")
    .option("hoodie.datasource.hive_sync.support_timestamp", "true")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.storage.type", "COPY_ON_WRITE")
    .option("hoodie.upsert.shuffle.parallelism", "200")
    .option("hoodie.datasource.hive_sync.use_jdbc", "false")
    .option("hoodie.datasource.hive_sync.mode", "hms")
```

Can this be fixed, or can I keep using the old behavior with the new table version?

**To Reproduce**

Steps to reproduce the behavior:
1.
2.
3.
4.

**Expected behavior**

**Environment Description**

* Hudi version : 0.13.1-amzn-2
* Spark version : 3.4.1
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) :

**Additional context**
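On the assumption that the 100-task `countByKey` stage in the log is Hudi's index-lookup step, a sketch of the knobs that may widen it is below. The option keys are standard Hudi write configs, but the values are illustrative and their effect on EMR 6.14 is unverified:

```scala
// Sketch only: raise parallelism explicitly rather than relying on defaults.
// The value 1000 is illustrative; size it to your data volume.
upsertDF.write.format("org.apache.hudi")
  // ... the options from the snippet above ...
  .option("hoodie.upsert.shuffle.parallelism", "1000")
  // Parallelism of the bloom-index lookup (default 0 = auto-inferred),
  // which is typically where the countByKey stage runs:
  .option("hoodie.bloom.index.parallelism", "1000")
```

Spark-side defaults can also cap stage widths, so passing `--conf spark.default.parallelism=1000 --conf spark.sql.shuffle.partitions=1000` at submit time may be worth trying as well.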
**Stacktrace**

24/05/14 16:41:07 INFO TaskSetManager: Finished task 73.0 in stage 9.4 (TID 492) in 374782 ms on 10.2.52.241 (executor 9) (74/74) 24/05/14 16:41:07 INFO TaskSchedulerImpl: Removed TaskSet 9.4, whose tasks have all completed, from pool 24/05/14 16:41:07 INFO DAGScheduler: ShuffleMapStage 9 (mapToPair at HoodieJavaRDD.java:135) finished in 1891.277 s 24/05/14 16:41:07 INFO DAGScheduler: looking for newly runnable stages 24/05/14 16:41:07 INFO DAGScheduler: running: Set() 24/05/14 16:41:07 INFO DAGScheduler: waiting: Set(ShuffleMapStage 10, ResultStage 11) 24/05/14 16:41:07 INFO DAGScheduler: failed: Set() 24/05/14 16:41:07 INFO DAGScheduler: Submitting ShuffleMapStage 10 (MapPartitionsRDD[50] at countByKey at HoodieJavaPairRDD.java:105), which has no missing parents 24/05/14 16:41:07 INFO OutputCommitCoordinator: Reusing state from previous attempt of stage 10. 24/05/14 16:41:07 INFO MemoryStore: Block broadcast_25 stored as values in memory (estimated size 10.1 KiB, free 578.2 MiB) 24/05/14 16:41:07 INFO MemoryStore: Block broadcast_25_piece0 stored as bytes in memory (estimated size 4.1 KiB, free 578.2 MiB) 24/05/14 16:41:07 INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on spark-000000033rnbca7s840-d0ae418f775b35d9-driver-svc.de-emr.svc:7079 (size: 4.1 KiB, free: 579.1 MiB) 24/05/14 16:41:07 INFO SparkContext: Created broadcast 25 from broadcast at DAGScheduler.scala:1592 24/05/14 16:41:07 INFO DAGScheduler: Submitting 100 missing tasks from ShuffleMapStage 10 (MapPartitionsRDD[50] at countByKey at HoodieJavaPairRDD.java:105) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)) 24/05/14 16:41:07 INFO TaskSchedulerImpl: Adding task set 10.2 with 100 tasks resource profile 0 24/05/14 16:41:07 INFO TaskSetManager: Starting task 0.0 in stage 10.2 (TID 493) (10.2.51.128, executor 8, partition 0, PROCESS_LOCAL, 7239 bytes) 24/05/14 16:41:07 INFO TaskSetManager: 
Starting task 1.0 in stage 10.2 (TID 494) (10.2.53.82, executor 7, partition 1, PROCESS_LOCAL, 7239 bytes) 24/05/14 16:41:07 INFO TaskSetManager: Starting task 2.0 in stage 10.2 (TID 495) (10.2.52.241, executor 9, partition 2, PROCESS_LOCAL, 7239 bytes) 24/05/14 16:41:07 INFO TaskSetManager: Starting task 3.0 in stage 10.2 (TID 496) (10.2.51.128, executor 8, partition 3, PROCESS_LOCAL, 7239 bytes) 24/05/14 16:41:07 INFO TaskSetManager: Starting task 4.0 in stage 10.2 (TID 497) (10.2.53.82, executor 7, partition 4, PROCESS_LOCAL, 7239 bytes) 24/05/14 16:41:07 INFO TaskSetManager: Starting task 5.0 in stage 10.2 (TID 498) (10.2.52.241, executor 9, partition 5, PROCESS_LOCAL, 7239 bytes) 24/05/14 16:41:07 INFO TaskSetManager: Starting task 6.0 in stage 10.2 (TID 499) (10.2.51.128, executor 8, partition 6, PROCESS_LOCAL, 7239 bytes) 24/05/14 16:41:07 INFO TaskSetManager: Starting task 7.0 in stage 10.2 (TID 500) (10.2.53.82, executor 7, partition 7, PROCESS_LOCAL, 7239 bytes) 24/05/14 16:41:07 INFO TaskSetManager: Starting task 8.0 in stage 10.2 (TID 501) (10.2.52.241, executor 9, partition 8, PROCESS_LOCAL, 7239 bytes) 24/05/14 16:41:07 INFO TaskSetManager: Starting task 9.0 in stage 10.2 (TID 502) (10.2.51.128, executor 8, partition 9, PROCESS_LOCAL, 7239 bytes) 24/05/14 16:41:07 INFO TaskSetManager: Starting task 10.0 in stage 10.2 (TID 503) (10.2.53.82, executor 7, partition 10, PROCESS_LOCAL, 7239 bytes) 24/05/14 16:41:07 INFO TaskSetManager: Starting task 11.0 in stage 10.2 (TID 504) (10.2.52.241, executor 9, partition 11, PROCESS_LOCAL, 7239 bytes) 24/05/14 16:41:07 INFO TaskSetManager: Starting task 12.0 in stage 10.2 (TID 505) (10.2.51.128, executor 8, partition 12, PROCESS_LOCAL, 7239 bytes) 24/05/14 16:41:07 INFO TaskSetManager: Starting task 13.0 in stage 10.2 (TID 506) (10.2.53.82, executor 7, partition 13, PROCESS_LOCAL, 7239 bytes) 24/05/14 16:41:07 INFO TaskSetManager: Starting task 14.0 in stage 10.2 (TID 507) (10.2.52.241, executor 9, partition 14, 
PROCESS_LOCAL, 7239 bytes) 24/05/14 16:41:07 INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on 10.2.53.82:40889 (size: 4.1 KiB, free: 159.8 GiB) 24/05/14 16:41:07 INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on 10.2.51.128:41761 (size: 4.1 KiB, free: 159.8 GiB) 24/05/14 16:41:07 INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on 10.2.52.241:41959 (size: 4.1 KiB, free: 159.8 GiB) 24/05/14 16:41:07 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 4 to 10.2.53.82:56366 24/05/14 16:41:07 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 4 to 10.2.51.128:58048 24/05/14 16:41:07 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 4 to 10.2.52.241:60410 24/05/14 16:41:07 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 5 to 10.2.53.82:56366 24/05/14 16:41:07 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 5 to 10.2.52.241:60410 24/05/14 16:41:07 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 5 to 10.2.51.128:58048 24/05/14 16:41:07 INFO BlockManagerInfo: Removed broadcast_24_piece0 on spark-000000033rnbca7s840-d0ae418f775b35d9-driver-svc.de-emr.svc:7079 in memory (size: 41.0 KiB, free: 579.2 MiB) 24/05/14 16:41:07 INFO BlockManagerInfo: Removed broadcast_24_piece0 on 10.2.51.128:41761 in memory (size: 41.0 KiB, free: 159.8 GiB) 24/05/14 16:41:07 INFO BlockManagerInfo: Removed broadcast_24_piece0 on 10.2.52.241:41959 in memory (size: 41.0 KiB, free: 159.8 GiB) 24/05/14 16:41:07 INFO BlockManagerInfo: Removed broadcast_24_piece0 on 10.2.53.82:40889 in memory (size: 41.0 KiB, free: 159.8 GiB) 24/05/14 16:55:21 INFO BlockManagerInfo: Updated broadcast_1_piece0 on disk on 10.2.53.82:40889 (current size: 28.2 KiB, original size: 0.0 B) 24/05/14 16:55:21 INFO BlockManagerInfo: Updated rdd_6_4 on disk on 10.2.53.82:40889 (current size: 
71.6 KiB, original size: 0.0 B) 24/05/14 16:55:29 INFO BlockManagerInfo: Updated broadcast_1_piece0 on disk on 10.2.51.128:41761 (current size: 28.2 KiB, original size: 0.0 B) 24/05/14 16:55:29 INFO BlockManagerInfo: Updated rdd_6_6 on disk on 10.2.51.128:41761 (current size: 69.0 KiB, original size: 0.0 B) 24/05/14 16:56:26 INFO BlockManagerInfo: Added rdd_48_9 in memory on 10.2.51.128:41761 (size: 84.7 KiB, free: 159.8 GiB) 24/05/14 16:56:26 INFO TaskSetManager: Starting task 15.0 in stage 10.2 (TID 508) (10.2.51.128, executor 8, partition 15, PROCESS_LOCAL, 7239 bytes) 24/05/14 16:56:26 INFO TaskSetManager: Finished task 9.0 in stage 10.2 (TID 502) in 918772 ms on 10.2.51.128 (executor 8) (1/100) 24/05/14 16:56:54 INFO BlockManagerInfo: Added rdd_48_13 in memory on 10.2.53.82:40889 (size: 84.7 KiB, free: 159.8 GiB) 24/05/14 16:56:54 INFO TaskSetManager: Starting task 16.0 in stage 10.2 (TID 509) (10.2.53.82, executor 7, partition 16, PROCESS_LOCAL, 7239 bytes) 24/05/14 16:56:54 INFO TaskSetManager: Finished task 13.0 in stage 10.2 (TID 506) in 947531 ms on 10.2.53.82 (executor 7) (2/100) 24/05/14 16:57:18 INFO BlockManagerInfo: Added rdd_48_11 on disk on 10.2.52.241:41959 (size: 84.6 KiB) 24/05/14 16:57:18 INFO TaskSetManager: Starting task 17.0 in stage 10.2 (TID 510) (10.2.52.241, executor 9, partition 17, PROCESS_LOCAL, 7239 bytes) 24/05/14 16:57:18 INFO TaskSetManager: Finished task 11.0 in stage 10.2 (TID 504) in 971644 ms on 10.2.52.241 (executor 9) (3/100) 24/05/14 16:57:57 INFO BlockManagerInfo: Added rdd_48_8 on disk on 10.2.52.241:41959 (size: 83.5 KiB) 24/05/14 16:57:57 INFO TaskSetManager: Starting task 18.0 in stage 10.2 (TID 511) (10.2.52.241, executor 9, partition 18, PROCESS_LOCAL, 7239 bytes) 24/05/14 16:57:57 INFO TaskSetManager: Finished task 8.0 in stage 10.2 (TID 501) in 1010317 ms on 10.2.52.241 (executor 9) (4/100) 24/05/14 16:58:12 INFO BlockManagerInfo: Added rdd_48_14 on disk on 10.2.52.241:41959 (size: 84.4 KiB) 24/05/14 16:58:12 INFO 
TaskSetManager: Starting task 19.0 in stage 10.2 (TID 512) (10.2.52.241, executor 9, partition 19, PROCESS_LOCAL, 7239 bytes) 24/05/14 16:58:12 INFO TaskSetManager: Finished task 14.0 in stage 10.2 (TID 507) in 1025524 ms on 10.2.52.241 (executor 9) (5/100) 24/05/14 16:58:49 INFO BlockManagerInfo: Added rdd_48_7 in memory on 10.2.53.82:40889 (size: 83.8 KiB, free: 159.8 GiB) 24/05/14 16:58:49 INFO TaskSetManager: Starting task 20.0 in stage 10.2 (TID 513) (10.2.53.82, executor 7, partition 20, PROCESS_LOCAL, 7239 bytes) 24/05/14 16:58:49 INFO TaskSetManager: Finished task 7.0 in stage 10.2 (TID 500) in 1062262 ms on 10.2.53.82 (executor 7) (6/100) 24/05/14 16:59:05 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Disabling executor 9. 24/05/14 16:59:05 INFO DAGScheduler: Executor lost: 9 (epoch 28) 24/05/14 16:59:05 INFO BlockManagerMasterEndpoint: Trying to remove executor 9 from BlockManagerMaster. 24/05/14 16:59:05 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_48_11 ! 24/05/14 16:59:05 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_48_8 ! 24/05/14 16:59:05 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_48_14 ! 24/05/14 16:59:05 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(9, 10.2.52.241, 41959, None) 24/05/14 16:59:05 INFO BlockManagerMaster: Removed 9 successfully in removeExecutor 24/05/14 16:59:05 INFO DAGScheduler: Shuffle files lost for executor: 9 (epoch 28) 24/05/14 16:59:06 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes for ResourceProfile Id: 0, target: 3, known: 2, sharedSlotFromPendingPods: 2147483647. 
24/05/14 16:59:06 INFO ExecutorPodsAllocator: Found 0 reusable PVCs from 0 PVCs 24/05/14 16:59:06 INFO KubernetesClientUtils: Spark configuration files loaded from Some(/usr/lib/spark/conf) : spark-env.sh,hive-site.xml,log4j2.properties,metrics.properties 24/05/14 16:59:06 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script 24/05/14 16:59:06 ERROR TaskSchedulerImpl: Lost executor 9 on 10.2.52.241: The executor with id 9 exited with exit code 137(SIGKILL, possible container OOM). ... 24/05/14 16:59:06 INFO DAGScheduler: Resubmitted ShuffleMapTask(10, 11), so marking it as still running. 24/05/14 16:59:10 INFO TaskSetManager: Starting task 18.1 in stage 10.2 (TID 514) (10.2.51.128, executor 8, partition 18, PROCESS_LOCAL, 7239 bytes) 24/05/14 16:59:10 WARN TaskSetManager: Lost task 15.0 in stage 10.2 (TID 508) (10.2.51.128 executor 8): FetchFailed(BlockManagerId(9, 10.2.52.241, 41959, None), shuffleId=5, mapIndex=0, mapId=368, reduceId=15, message= org.apache.spark.shuffle.FetchFailedException at org.apache.spark.errors.SparkCoreErrors$.fetchFailedError(SparkCoreErrors.scala:437) at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:1232) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:971) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:86) at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:155) at org.apache.spark.rdd.CoGroupedRDD.$anonfun$compute$4(CoGroupedRDD.scala:155) at org.apache.spark.rdd.CoGroupedRDD.$anonfun$compute$4$adapted(CoGroupedRDD.scala:154) at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:985) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:984) at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:377) at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1552) at 
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1462) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1526) at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1349) at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:375) at org.apache.spark.rdd.RDD.iterator(RDD.scala:326) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:141) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: org.apache.spark.ExecutorDeadException: The relative remote executor(Id: 9), which maintains the block data to fetch is dead. 
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:140) at org.apache.spark.network.shuffle.RetryingBlockTransferor.transferAllOutstanding(RetryingBlockTransferor.java:173) at org.apache.spark.network.shuffle.RetryingBlockTransferor.lambda$initiateRetry$0(RetryingBlockTransferor.java:206) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ... 1 more ) 24/05/14 16:59:10 INFO TaskSetManager: task 15.0 in stage 10.2 (TID 508) failed, but the task will not be re-executed (either because the task failed with a shuffle data fetch failure, so the previous stage needs to be re-run, or because a different copy of the task has already succeeded). 
24/05/14 16:59:10 INFO DAGScheduler: Marking ShuffleMapStage 10 (countByKey at HoodieJavaPairRDD.java:105) as failed due to a fetch failure from ShuffleMapStage 9 (mapToPair at HoodieJavaRDD.java:135) 24/05/14 16:59:10 INFO DAGScheduler: ShuffleMapStage 10 (countByKey at HoodieJavaPairRDD.java:105) failed in 1083.173 s due to org.apache.spark.shuffle.FetchFailedException at org.apache.spark.errors.SparkCoreErrors$.fetchFailedError(SparkCoreErrors.scala:437) at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:1232) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:971) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:86) at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:155) at org.apache.spark.rdd.CoGroupedRDD.$anonfun$compute$4(CoGroupedRDD.scala:155) at org.apache.spark.rdd.CoGroupedRDD.$anonfun$compute$4$adapted(CoGroupedRDD.scala:154) at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:985) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:984) at 
org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:377) at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1552) at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1462) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1526) at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1349) at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:375) at org.apache.spark.rdd.RDD.iterator(RDD.scala:326) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:328) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:141) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: org.apache.spark.ExecutorDeadException: The relative remote executor(Id: 9), which maintains the block data to fetch is dead. at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:140) at org.apache.spark.network.shuffle.RetryingBlockTransferor.transferAllOutstanding(RetryingBlockTransferor.java:173) at org.apache.spark.network.shuffle.RetryingBlockTransferor.lambda$initiateRetry$0(RetryingBlockTransferor.java:206) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ... 
1 more 24/05/14 16:59:10 INFO DAGScheduler: Resubmitting ShuffleMapStage 9 (mapToPair at HoodieJavaRDD.java:135) and ShuffleMapStage 10 (countByKey at HoodieJavaPairRDD.java:105) due to fetch failure 24/05/14 16:59:10 INFO DAGScheduler: Resubmitting failed stages 24/05/14 16:59:10 INFO DAGScheduler: Submitting ShuffleMapStage 9 (MapPartitionsRDD[42] at mapToPair at HoodieJavaRDD.java:135), which has no missing parents 24/05/14 16:59:10 INFO MemoryStore: Block broadcast_26 stored as values in memory (estimated size 492.3 KiB, free 578.3 MiB) 24/05/14 16:59:10 INFO MemoryStore: Block broadcast_26_piece0 stored as bytes in memory (estimated size 41.0 KiB, free 578.2 MiB) 24/05/14 16:59:10 INFO BlockManagerInfo: Added broadcast_26_piece0 in memory on spark-000000033rnbca7s840-d0ae418f775b35d9-driver-svc.de-emr.svc:7079 (size: 41.0 KiB, free: 579.1 MiB) 24/05/14 16:59:10 INFO SparkContext: Created broadcast 26 from broadcast at DAGScheduler.scala:1592 24/05/14 16:59:10 INFO DAGScheduler: Submitting 33 missing tasks from ShuffleMapStage 9 (MapPartitionsRDD[42] at mapToPair at HoodieJavaRDD.java:135) (first 15 tasks are for partitions Vector(0, 4, 8, 11, 15, 19, 20, 26, 27, 28, 29, 31, 32, 33, 34)) 24/05/14 16:59:10 INFO TaskSchedulerImpl: Adding task set 9.5 with 33 tasks resource profile 0 24/05/14 16:59:28 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: No executor found for 10.2.54.49:38088 24/05/14 16:59:28 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.2.54.49:38094) with ID 10, ResourceProfileId 0 24/05/14 16:59:28 INFO BlockManagerMasterEndpoint: Registering block manager 10.2.54.49:45969 with 159.8 GiB RAM, BlockManagerId(10, 10.2.54.49, 45969, None) 24/05/14 16:59:34 INFO TaskSetManager: Starting task 0.0 in stage 9.5 (TID 515) (10.2.54.49, executor 10, partition 0, PROCESS_LOCAL, 8770 bytes) 24/05/14 16:59:34 INFO TaskSetManager: Starting task 1.0 in stage 
9.5 (TID 516) (10.2.54.49, executor 10, partition 4, PROCESS_LOCAL, 8782 bytes) 24/05/14 16:59:34 INFO TaskSetManager: Starting task 2.0 in stage 9.5 (TID 517) (10.2.54.49, executor 10, partition 8, PROCESS_LOCAL, 8776 bytes) 24/05/14 16:59:34 INFO TaskSetManager: Starting task 3.0 in stage 9.5 (TID 518) (10.2.54.49, executor 10, partition 11, PROCESS_LOCAL, 8774 bytes) 24/05/14 16:59:34 INFO TaskSetManager: Starting task 4.0 in stage 9.5 (TID 519) (10.2.54.49, executor 10, partition 15, PROCESS_LOCAL, 8778 bytes) 24/05/14 16:59:35 INFO BlockManagerInfo: Added broadcast_26_piece0 in memory on 10.2.54.49:45969 (size: 41.0 KiB, free: 159.8 GiB) 24/05/14 17:00:12 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Disabling executor 8. 24/05/14 17:00:12 INFO DAGScheduler: Executor lost: 8 (epoch 30) 24/05/14 17:00:12 INFO BlockManagerMasterEndpoint: Trying to remove executor 8 from BlockManagerMaster. 24/05/14 17:00:12 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_6_6 ! 24/05/14 17:00:12 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_6_1 ! 24/05/14 17:00:12 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_6_14 ! 24/05/14 17:00:12 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_6_5 ! 24/05/14 17:00:12 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_6_10 ! 24/05/14 17:00:12 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_48_9 ! 24/05/14 17:00:12 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_6_16 ! 24/05/14 17:00:12 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_6_15 ! 24/05/14 17:00:12 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_6_3 ! 24/05/14 17:00:12 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_6_2 ! 24/05/14 17:00:12 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_6_8 ! 
24/05/14 17:00:12 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_6_12 ! 24/05/14 17:00:12 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_6_11 ! 24/05/14 17:00:12 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_6_7 ! 24/05/14 17:00:12 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(8, 10.2.51.128, 41761, None) 24/05/14 17:00:12 INFO BlockManagerMaster: Removed 8 successfully in removeExecutor 24/05/14 17:00:12 INFO DAGScheduler: Shuffle files lost for executor: 8 (epoch 30) 24/05/14 17:00:13 ERROR TaskSchedulerImpl: Lost executor 8 on 10.2.51.128: The executor with id 8 exited with exit code 137(SIGKILL, possible container OOM).