vastian180 commented on PR #3341:
URL: https://github.com/apache/celeborn/pull/3341#issuecomment-2991276220
Case 2: during deserialization of the GetReducerFileGroupResponse broadcast, a failure to create the executor's local directory causes the broadcast read to fail, and the error is then reported as a fetch failure.
```
25/05/27 07:27:03 INFO Executor task launch worker for task 20399 SparkUtils: Deserializing GetReducerFileGroupResponse broadcast for shuffle: 1
25/05/27 07:27:03 INFO Executor task launch worker for task 20399 TorrentBroadcast: Started reading broadcast variable 5 with 1 pieces (estimated total size 4.0 MiB)
25/05/27 07:27:03 INFO Executor task launch worker for task 20399 TorrentBroadcast: Reading broadcast variable 5 took 0 ms
25/05/27 07:27:03 INFO Executor task launch worker for task 20399 MemoryStore: Block broadcast_5 stored as values in memory (estimated size 980.4 KiB, free 6.3 GiB)
25/05/27 07:27:03 WARN Executor task launch worker for task 20399 BlockManager: Putting block broadcast_5 failed due to exception
java.io.IOException: Failed to create local dir in /data12/hadoop/yarn/nm-local-dir/usercache/......
25/05/27 07:27:03 WARN Executor task launch worker for task 20399 BlockManager: Block broadcast_5 was not removed normally.
25/05/27 07:27:03 ERROR Executor task launch worker for task 20399 Utils: Exception encountered
java.io.IOException: Failed to create local dir in /data12/hadoop/yarn/nm-local-dir/usercache/......
    at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:93)
    at org.apache.spark.storage.DiskStore.remove(DiskStore.scala:114)
    at org.apache.spark.storage.BlockManager.removeBlockInternal(BlockManager.scala:2050)
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1574)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1611)
    at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:1467)
    at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:1936)
    at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$4(TorrentBroadcast.scala:262)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$2(TorrentBroadcast.scala:231)
    at org.apache.spark.util.KeyLock.withLock(KeyLock.scala:64)
    at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$1(TorrentBroadcast.scala:226)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1380)
    at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:226)
    at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:103)
    at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
    at org.apache.spark.shuffle.celeborn.SparkUtils.lambda$deserializeGetReducerFileGroupResponse$4(SparkUtils.java:600)
    at org.apache.celeborn.common.util.KeyLock.withLock(KeyLock.scala:65)
    at org.apache.spark.shuffle.celeborn.SparkUtils.deserializeGetReducerFileGroupResponse(SparkUtils.java:585)
    at org.apache.spark.shuffle.celeborn.CelebornShuffleReader$$anon$5.apply(CelebornShuffleReader.scala:485)
    at org.apache.spark.shuffle.celeborn.CelebornShuffleReader$$anon$5.apply(CelebornShuffleReader.scala:480)
    at org.apache.celeborn.client.ShuffleClient.deserializeReducerFileGroupResponse(ShuffleClient.java:321)
    at org.apache.celeborn.client.ShuffleClientImpl.loadFileGroupInternal(ShuffleClientImpl.java:1876)
    at org.apache.celeborn.client.ShuffleClientImpl.lambda$updateFileGroup$9(ShuffleClientImpl.java:1935)
    at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1877)
    at org.apache.celeborn.client.ShuffleClientImpl.updateFileGroup(ShuffleClientImpl.java:1931)
    at org.apache.spark.shuffle.celeborn.CelebornShuffleReader.read(CelebornShuffleReader.scala:119)
    at org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:225)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:60)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
    at org.apache.spark.scheduler.Task.run(Task.scala:130)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:477)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1428)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:480)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
25/05/27 07:27:03 ERROR Executor task launch worker for task 20399 ShuffleClientImpl: Exception raised while call GetReducerFileGroup for 1.
org.apache.celeborn.common.exception.CelebornIOException: Failed to get GetReducerFileGroupResponse broadcast for shuffle: 1
......
25/05/27 07:27:03 WARN Executor task launch worker for task 20399 CelebornShuffleReader: Handle fetch exceptions for 1-0
org.apache.celeborn.common.exception.CelebornIOException: Failed to load file group of shuffle 1 partition 4001! Failed to get GetReducerFileGroupResponse broadcast for shuffle: 1
    at org.apache.celeborn.client.ShuffleClientImpl.updateFileGroup(ShuffleClientImpl.java:1943)
    at org.apache.spark.shuffle.celeborn.CelebornShuffleReader.read(CelebornShuffleReader.scala:119)
......
Caused by: org.apache.celeborn.common.exception.CelebornIOException: Failed to get GetReducerFileGroupResponse broadcast for shuffle: 1
    at org.apache.celeborn.client.ShuffleClientImpl.loadFileGroupInternal(ShuffleClientImpl.java:1878)
......
```
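The propagation seen in the log (a local-disk IOException during the broadcast read being wrapped into a CelebornIOException that the shuffle reader then reports as a fetch failure) can be sketched as follows. This is a simplified, hypothetical illustration, not the actual Celeborn code: the class and method names below are stand-ins for `SparkUtils.deserializeGetReducerFileGroupResponse` and `org.apache.celeborn.common.exception.CelebornIOException`.

```java
import java.io.IOException;

public class BroadcastFailureSketch {

    // Stand-in for org.apache.celeborn.common.exception.CelebornIOException.
    static class CelebornIOException extends IOException {
        CelebornIOException(String msg, Throwable cause) {
            super(msg, cause);
        }
    }

    // Simulates Broadcast.value() failing because the executor cannot create
    // its local block-store directory (the root cause shown in the log).
    static Object readBroadcast() throws IOException {
        throw new IOException("Failed to create local dir");
    }

    // Mirrors the shape of the deserialization path: any failure while reading
    // the broadcast is wrapped into a CelebornIOException, which the shuffle
    // reader later surfaces as a fetch failure for the shuffle.
    static Object deserializeResponse(int shuffleId) throws CelebornIOException {
        try {
            return readBroadcast();
        } catch (IOException e) {
            throw new CelebornIOException(
                "Failed to get GetReducerFileGroupResponse broadcast for shuffle: " + shuffleId,
                e);
        }
    }

    public static void main(String[] args) {
        try {
            deserializeResponse(1);
        } catch (CelebornIOException e) {
            // The wrapped message matches the shape seen in the log, and the
            // original disk error is preserved as the cause.
            System.out.println(e.getMessage());
            System.out.println("cause: " + e.getCause().getMessage());
        }
    }
}
```

The point of the sketch is that the root cause is an executor-local disk problem, not a missing shuffle file, yet after wrapping it is handled on the fetch-failure path.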