[ https://issues.apache.org/jira/browse/SPARK-24897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553907#comment-16553907 ]
liupengcheng commented on SPARK-24897:
--------------------------------------

Can someone verify this issue? I checked the 2.3.1 branch; the problem does not seem to be fixed there. A sketch of the kind of guard I would expect is at the end of this message.

> DAGScheduler should not unregisterMapOutput and increaseEpoch repeatedly for stage fetchFailed
> ----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24897
>                 URL: https://issues.apache.org/jira/browse/SPARK-24897
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 2.1.0, 2.3.1
>            Reporter: liupengcheng
>            Priority: Major
>
> In Spark 2.1, when a stage fails with a fetch failure, the DAGScheduler retries both that stage and its parent stage. However, after the parent stage has been resubmitted and has started running, its map statuses can still be invalidated by outstanding tasks of the failed child stage that report further fetch failures.
> Each such outstanding task may unregister the map statuses again under a new epoch, so the re-running parent stage hits repeated MetadataFetchFailed errors and the job finally fails.
>
> {code:java}
> 2018-07-23,01:52:33,012 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 174.0 in stage 71.0 (TID 154127, <host>, executor 96): FetchFailed(BlockManagerId(4945, <host>, 22409), shuffleId=24, mapId=667, reduceId=174, message= org.apache.spark.shuffle.FetchFailedException: Failed to connect to <host>/<ip>:22409
> 2018-07-23,01:52:33,013 INFO org.apache.spark.scheduler.DAGScheduler: Marking ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) as failed due to a fetch failure from ShuffleMapStage 69 ($plus$plus at DeviceLocateMain.scala:236)
> 2018-07-23,01:52:33,014 INFO org.apache.spark.scheduler.DAGScheduler: ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) failed in 246.856 s
> 2018-07-23,01:52:33,014 INFO org.apache.spark.scheduler.DAGScheduler: Resubmitting ShuffleMapStage 69 ($plus$plus at DeviceLocateMain.scala:236) and ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) due to fetch failure
> 2018-07-23,01:52:36,004 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 120.0 in stage 71.0 (TID 154073, <host>, executor 286): FetchFailed(BlockManagerId(4208, <host>, 22409), shuffleId=24, mapId=241, reduceId=120, message= org.apache.spark.shuffle.FetchFailedException: Failed to connect to <host>/<ip>:22409
> 2018-07-23,01:52:36,005 INFO org.apache.spark.scheduler.DAGScheduler: Resubmitting ShuffleMapStage 69 ($plus$plus at DeviceLocateMain.scala:236) and ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) due to fetch failure
> 2018-07-23,01:52:36,017 INFO org.apache.spark.storage.MemoryStore: Block broadcast_63_piece0 stored as bytes in memory (estimated size 4.0 MB, free 26.7 MB)
> 2018-07-23,01:52:36,025 INFO org.apache.spark.storage.BlockManagerInfo: Removed broadcast_59_piece1 on <host>:52349 in memory (size: 4.0 MB, free: 3.0 GB)
> 2018-07-23,01:52:36,029 INFO org.apache.spark.storage.BlockManagerInfo: Removed broadcast_61_piece6 on <host>:52349 in memory (size: 4.0 MB, free: 3.0 GB)
> 2018-07-23,01:52:36,079 INFO org.apache.spark.deploy.yarn.YarnAllocator: Canceling requests for 0 executor containers
> 2018-07-23,01:52:36,079 WARN org.apache.spark.deploy.yarn.YarnAllocator: Expected to find pending requests, but found none.
> 2018-07-23,01:52:36,094 INFO org.apache.spark.storage.BlockManagerInfo: Added broadcast_63_piece0 in memory on <host>:56780 (size: 4.0 MB, free: 3.7 GB)
> 2018-07-23,01:52:36,095 INFO org.apache.spark.storage.MemoryStore: Block broadcast_63_piece1 stored as bytes in memory (estimated size 4.0 MB, free 30.7 MB)
> 2018-07-23,01:52:36,107 INFO org.apache.spark.storage.BlockManagerInfo: Added broadcast_63_piece1 in memory on <host>:56780 (size: 4.0 MB, free: 3.7 GB)
> 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.MemoryStore: Block broadcast_63_piece2 stored as bytes in memory (estimated size 4.0 MB, free 34.7 MB)
> 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.BlockManagerInfo: Added broadcast_63_piece2 in memory on <host>:56780 (size: 4.0 MB, free: 3.7 GB)
> 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.MemoryStore: Block broadcast_63_piece3 stored as bytes in memory (estimated size 4.0 MB, free 38.7 MB)
> 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.BlockManagerInfo: Added broadcast_63_piece3 in memory on <host>:56780 (size: 4.0 MB, free: 3.7 GB)
> 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.MemoryStore: Block broadcast_63_piece4 stored as bytes in memory (estimated size 3.8 MB, free 42.5 MB)
> 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.BlockManagerInfo: Added broadcast_63_piece4 in memory on <host>:56780 (size: 3.8 MB, free: 3.7 GB)
> 2018-07-23,01:52:36,132 INFO org.apache.spark.MapOutputTracker: Broadcast mapstatuses size = 384, actual size = 20784475
> 2018-07-23,01:52:36,132 INFO org.apache.spark.MapOutputTrackerMaster: Size of output statuses for shuffle 17 is 384 bytes
> 2018-07-23,01:52:36,133 INFO org.apache.spark.MapOutputTrackerMaster: Epoch changed, not caching!
> 2018-07-23,01:52:36,185 INFO org.apache.spark.storage.BlockManagerInfo: Removed broadcast_61_piece3 on <host>:52349 in memory (size: 4.0 MB, free: 3.0 GB)
> 2018-07-23,01:52:36,185 INFO org.apache.spark.storage.BlockManagerInfo: Removed broadcast_63_piece4 on <host>:56780 in memory (size: 3.8 MB, free: 3.7 GB)
> 2018-07-23,01:52:36,186 INFO org.apache.spark.storage.BlockManagerInfo: Removed broadcast_63_piece2 on <host>:56780 in memory (size: 4.0 MB, free: 3.7 GB)
> 2018-07-23,01:52:36,186 INFO org.apache.spark.storage.BlockManagerInfo: Removed broadcast_63_piece1 on <host>:56780 in memory (size: 4.0 MB, free: 3.7 GB)
> 2018-07-23,01:52:36,186 INFO org.apache.spark.storage.BlockManagerInfo: Removed broadcast_63_piece0 on <host>:56780 in memory (size: 4.0 MB, free: 3.7 GB)
> 2018-07-23,01:52:36,186 INFO org.apache.spark.storage.BlockManagerInfo: Removed broadcast_63_piece3 on <host>:56780 in memory (size: 4.0 MB, free: 3.7 GB)
> 2018-07-23,01:52:36,192 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in stage 69.1 (TID 154955, <host>, executor 267): FetchFailed(null, shuffleId=17, mapId=-1, reduceId=-1, message= org.apache.spark.shuffle.MetadataFetchFailedException: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_63_piece4 of broadcast_63
> {code}
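For reference, below is a minimal sketch of the attempt-level guard I mean: only the first FetchFailed from a given stage attempt should unregister map output and increase the epoch, while late failures from an attempt that was already handled should be ignored. All names and types here (FetchFailedGuardSketch, handleFetchFailed, the printed messages) are simplified stand-ins for illustration, not the real DAGScheduler or MapOutputTrackerMaster code.

{code:scala}
import scala.collection.mutable

// Sketch of a per-attempt guard for FetchFailed handling (SPARK-24897).
// All types are simplified stand-ins, NOT org.apache.spark.scheduler classes.
object FetchFailedGuardSketch {
  // (stageId, stageAttemptId) pairs whose fetch failure was already handled.
  private val failedAttempts = mutable.Set.empty[(Int, Int)]
  private var epoch: Long = 0L

  // Stand-in for unregistering a map output on the map-output tracker.
  private def unregisterMapOutput(shuffleId: Int, mapId: Int): Unit =
    println(s"unregister shuffle=$shuffleId map=$mapId, epoch now ${epoch + 1}")

  // Called once per FetchFailed task. Only the FIRST failure of a given
  // stage attempt unregisters output and bumps the epoch; later failures
  // from the same (already aborted) attempt are ignored, so the re-running
  // parent stage does not see its map statuses invalidated again.
  def handleFetchFailed(stageId: Int, stageAttemptId: Int,
                        shuffleId: Int, mapId: Int): Unit = {
    if (failedAttempts.add((stageId, stageAttemptId))) {
      unregisterMapOutput(shuffleId, mapId)
      epoch += 1 // invalidate cached statuses exactly once per attempt
    } else {
      println(s"ignoring late FetchFailed from stage $stageId.$stageAttemptId")
    }
  }

  def main(args: Array[String]): Unit = {
    // Two FetchFailed reports from the same attempt, as in the log above:
    handleFetchFailed(stageId = 71, stageAttemptId = 0, shuffleId = 24, mapId = 667) // handled
    handleFetchFailed(stageId = 71, stageAttemptId = 0, shuffleId = 24, mapId = 241) // ignored
  }
}
{code}

With a guard like this, the second FetchFailed at 01:52:36,004 in the log would not unregister the map statuses a second time or change the epoch again, and the resubmitted parent stage could fetch its metadata normally.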