[ https://issues.apache.org/jira/browse/SPARK-30849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
wuyi resolved SPARK-30849.
--------------------------
    Resolution: Duplicate

> Application failed due to failed to get MapStatuses broadcast
> -------------------------------------------------------------
>
>                 Key: SPARK-30849
>                 URL: https://issues.apache.org/jira/browse/SPARK-30849
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0
>            Reporter: liupengcheng
>            Priority: Major
>         Attachments: image-2020-02-16-11-13-18-195.png, image-2020-02-16-11-17-32-103.png
>
> We encountered this issue in Spark 2.1. The exception is as follows:
> {noformat}
> Job aborted due to stage failure: Task 18 in stage 2.0 failed 4 times, most recent failure: Lost task 18.3 in stage 2.0 (TID 13819, xxxx, executor 8): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_9_piece1 of broadcast_9
> java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_9_piece1 of broadcast_9
>     at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1287)
>     at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
>     at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
>     at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
>     at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
>     at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
>     at org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$1.apply(MapOutputTracker.scala:775)
>     at org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$1.apply(MapOutputTracker.scala:775)
>     at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
>     at org.apache.spark.MapOutputTracker$.logInfo(MapOutputTracker.scala:712)
>     at org.apache.spark.MapOutputTracker$.deserializeMapStatuses(MapOutputTracker.scala:774)
>     at org.apache.spark.MapOutputTrackerWorker.getStatuses(MapOutputTracker.scala:665)
>     at org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorId(MapOutputTracker.scala:603)
>     at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:57)
>     at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:109)
> {noformat}
> I looked into the code and the logs; it seems the failure happens because the mapStatuses broadcast id is sent to the executor but is invalidated by the driver before the executor actually fetches the broadcast.
> This can be described as follows:
> {noformat}
> Let's say we have an rdd1,
> rdd2 = rdd1.repartition(100) // stage 0
> rdd3 = rdd2.map(xxx)         // stage 1
> rdd4 = rdd2.map(xxx)         // stage 2
> // and then do some join and output the result
> rdd3.join(rdd4).save
> {noformat}
> When a FetchFailedException happens in stage 1, stage 0 and stage 1 are resubmitted and re-executed, but stage 2 is still running: its tasks will fetch mapStatuses from the driver, and the cached mapStatuses broadcast is invalidated when the tasks of stage attempt 0.1 complete and register their map output.
> I checked the master branch; it seems the correctness issues on `repartition` have been fixed, but I think this issue may still exist.
> Some screenshots:
> !https://issues.apache.org/jira/secure/attachment/12993652/image-2020-02-16-11-17-32-103.png!
> !https://issues.apache.org/jira/secure/attachment/12993651/image-2020-02-16-11-13-18-195.png!

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org