vastian180 commented on PR #3341:
URL: https://github.com/apache/celeborn/pull/3341#issuecomment-2991276220

   Case 2: While deserializing the GetReducerFileGroupResponse broadcast, the executor fails to create a local directory, and that local error ends up being reported as a fetch failure.
   ```
   25/05/27 07:27:03 INFO Executor task launch worker for task 20399 SparkUtils: Deserializing GetReducerFileGroupResponse broadcast for shuffle: 1
   25/05/27 07:27:03 INFO Executor task launch worker for task 20399 TorrentBroadcast: Started reading broadcast variable 5 with 1 pieces (estimated total size 4.0 MiB)
   25/05/27 07:27:03 INFO Executor task launch worker for task 20399 TorrentBroadcast: Reading broadcast variable 5 took 0 ms
   25/05/27 07:27:03 INFO Executor task launch worker for task 20399 MemoryStore: Block broadcast_5 stored as values in memory (estimated size 980.4 KiB, free 6.3 GiB)
   25/05/27 07:27:03 WARN Executor task launch worker for task 20399 BlockManager: Putting block broadcast_5 failed due to exception java.io.IOException: Failed to create local dir in /data12/hadoop/yarn/nm-local-dir/usercache/......
   25/05/27 07:27:03 WARN Executor task launch worker for task 20399 BlockManager: Block broadcast_5 was not removed normally.
   25/05/27 07:27:03 ERROR Executor task launch worker for task 20399 Utils: Exception encountered
   java.io.IOException: Failed to create local dir in /data12/hadoop/yarn/nm-local-dir/usercache/......
        at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:93)
        at org.apache.spark.storage.DiskStore.remove(DiskStore.scala:114)
        at org.apache.spark.storage.BlockManager.removeBlockInternal(BlockManager.scala:2050)
        at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1574)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1611)
        at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:1467)
        at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:1936)
        at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$4(TorrentBroadcast.scala:262)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$2(TorrentBroadcast.scala:231)
        at org.apache.spark.util.KeyLock.withLock(KeyLock.scala:64)
        at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$1(TorrentBroadcast.scala:226)
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1380)
        at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:226)
        at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:103)
        at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
        at org.apache.spark.shuffle.celeborn.SparkUtils.lambda$deserializeGetReducerFileGroupResponse$4(SparkUtils.java:600)
        at org.apache.celeborn.common.util.KeyLock.withLock(KeyLock.scala:65)
        at org.apache.spark.shuffle.celeborn.SparkUtils.deserializeGetReducerFileGroupResponse(SparkUtils.java:585)
        at org.apache.spark.shuffle.celeborn.CelebornShuffleReader$$anon$5.apply(CelebornShuffleReader.scala:485)
        at org.apache.spark.shuffle.celeborn.CelebornShuffleReader$$anon$5.apply(CelebornShuffleReader.scala:480)
        at org.apache.celeborn.client.ShuffleClient.deserializeReducerFileGroupResponse(ShuffleClient.java:321)
        at org.apache.celeborn.client.ShuffleClientImpl.loadFileGroupInternal(ShuffleClientImpl.java:1876)
        at org.apache.celeborn.client.ShuffleClientImpl.lambda$updateFileGroup$9(ShuffleClientImpl.java:1935)
        at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1877)
        at org.apache.celeborn.client.ShuffleClientImpl.updateFileGroup(ShuffleClientImpl.java:1931)
        at org.apache.spark.shuffle.celeborn.CelebornShuffleReader.read(CelebornShuffleReader.scala:119)
        at org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:225)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:60)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
        at org.apache.spark.scheduler.Task.run(Task.scala:130)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:477)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1428)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:480)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
   
   25/05/27 07:27:03 ERROR Executor task launch worker for task 20399 ShuffleClientImpl: Exception raised while call GetReducerFileGroup for 1.
   org.apache.celeborn.common.exception.CelebornIOException: Failed to get GetReducerFileGroupResponse broadcast for shuffle: 1
   ......
   
   25/05/27 07:27:03 WARN Executor task launch worker for task 20399 CelebornShuffleReader: Handle fetch exceptions for 1-0
   org.apache.celeborn.common.exception.CelebornIOException: Failed to load file group of shuffle 1 partition 4001! Failed to get GetReducerFileGroupResponse broadcast for shuffle: 1
        at org.apache.celeborn.client.ShuffleClientImpl.updateFileGroup(ShuffleClientImpl.java:1943)
        at org.apache.spark.shuffle.celeborn.CelebornShuffleReader.read(CelebornShuffleReader.scala:119)
   ......
   Caused by: org.apache.celeborn.common.exception.CelebornIOException: Failed to get GetReducerFileGroupResponse broadcast for shuffle: 1
        at org.apache.celeborn.client.ShuffleClientImpl.loadFileGroupInternal(ShuffleClientImpl.java:1878)
   ......
   
   ```
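
   For readers skimming the trace, here is a rough Java sketch of how the local error in this case seems to turn into a fetch failure. It is only a simplified model inferred from the log order (Utils logs the IOException, ShuffleClientImpl raises a generic CelebornIOException, CelebornShuffleReader handles it as a fetch exception). The class `BroadcastFailureSketch`, all method names, and the assumption that the helper swallows the original exception are hypothetical and not taken from the Celeborn code.

   ```java
   import java.io.IOException;
   import java.util.Optional;

   public class BroadcastFailureSketch {
       // Hypothetical stand-in for the broadcast read; it always fails the way the log shows.
       static byte[] readBroadcast(int shuffleId) throws IOException {
           throw new IOException("Failed to create local dir");
       }

       // Hypothetical helper: logs the local error and signals failure with an empty result,
       // so the original cause is no longer attached to anything the caller sees.
       static Optional<byte[]> deserialize(int shuffleId) {
           try {
               return Optional.of(readBroadcast(shuffleId));
           } catch (IOException e) {
               System.err.println("Exception encountered: " + e.getMessage());
               return Optional.empty();
           }
       }

       // Hypothetical caller: converts the empty result into a generic failure.
       static byte[] loadFileGroup(int shuffleId) throws IOException {
           return deserialize(shuffleId).orElseThrow(() -> new IOException(
               "Failed to get GetReducerFileGroupResponse broadcast for shuffle: " + shuffleId));
       }

       // Hypothetical reader: every load failure is classified as a fetch failure,
       // including ones that started as a purely local disk problem.
       static String read(int shuffleId) {
           try {
               loadFileGroup(shuffleId);
               return "OK";
           } catch (IOException e) {
               return "FETCH_FAILURE: " + e.getMessage();
           }
       }

       public static void main(String[] args) {
           System.out.println(read(1));
       }
   }
   ```

   In this model the reader only ever sees the generic message, so it cannot tell an executor-local disk error apart from a real problem with the shuffle data.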


