[ https://issues.apache.org/jira/browse/SPARK-20540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15990487#comment-15990487 ]
Apache Spark commented on SPARK-20540:
--------------------------------------

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/17813

> Dynamic allocation constantly requests and kills executors
> ----------------------------------------------------------
>
>                 Key: SPARK-20540
>                 URL: https://issues.apache.org/jira/browse/SPARK-20540
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, YARN
>    Affects Versions: 2.0.2, 2.1.0, 2.2.0
>            Reporter: Ryan Blue
>
> We are seeing some strange behavior with dynamic allocation, where in some
> cases the driver will get into a state where it constantly kills idle
> executors while requesting new executors. This happens at the end of a stage
> when all tasks are assigned and never stops even when there are no tasks to
> run.
> From the YarnAllocator logs, it looks like the allocator is getting lots of
> requests from the driver, even though the timeout between requests should be
> 5s:
> {code:title=Yarn allocator logs}
> 17/04/20 19:52:05 INFO dispatcher-event-loop-49 YarnAllocator: Driver requested a total number of 227 executor(s).
> 17/04/20 19:52:05 INFO dispatcher-event-loop-30 YarnAllocator: Driver requested a total number of 213 executor(s).
> 17/04/20 19:52:05 INFO Reporter YarnAllocator: Will request 1 executor containers, each with 2 cores and 7168 MB memory including 2048 MB overhead
> 17/04/20 19:52:05 INFO Reporter YarnAllocator: Canceled 0 container requests (locality no longer needed)
> 17/04/20 19:52:05 INFO Reporter YarnAllocator: Submitted container request (host: Any, capability: <memory:7168, vCores:2>)
> spark://CoarseGrainedScheduler@100.74.39.143:10895, executorHostname: ip-100-74-34-230.ec2.internal
> spark://CoarseGrainedScheduler@100.74.39.143:10895, executorHostname: ip-100-74-47-57.ec2.internal
> 17/04/20 19:52:05 INFO Reporter YarnAllocator: Received 2 containers from YARN, launching executors on 2 of them.
> 17/04/20 19:52:05 INFO dispatcher-event-loop-11 YarnAllocator: Driver requested a total number of 195 executor(s).
> 17/04/20 19:52:05 INFO dispatcher-event-loop-55 YarnAllocator: Driver requested a total number of 174 executor(s).
> 17/04/20 19:52:05 INFO Reporter YarnAllocator: Will request 2 executor containers, each with 2 cores and 7168 MB memory including 2048 MB overhead
> 17/04/20 19:52:05 INFO Reporter YarnAllocator: Canceled 0 container requests (locality no longer needed)
> 17/04/20 19:52:05 INFO Reporter YarnAllocator: Submitted container request (host: Any, capability: <memory:7168, vCores:2>)
> 17/04/20 19:52:05 INFO Reporter YarnAllocator: Submitted container request (host: Any, capability: <memory:7168, vCores:2>)
> 17/04/20 19:52:05 INFO Reporter YarnAllocator: Received 4 containers from YARN, launching executors on 4 of them.
> {code}
> I think the allocator cancels what requests it can, but is getting containers
> that have already been requested and the executors keep growing because of
> requests from the driver. Here are 5 seconds from the log:
> {code}
> 17/04/20 19:52:30 INFO dispatcher-event-loop-22 YarnAllocator: Driver requested a total number of 185 executor(s).
> 17/04/20 19:52:30 INFO dispatcher-event-loop-48 YarnAllocator: Driver requested a total number of 193 executor(s).
> 17/04/20 19:52:30 INFO dispatcher-event-loop-24 YarnAllocator: Driver requested a total number of 192 executor(s).
> 17/04/20 19:52:30 INFO dispatcher-event-loop-60 YarnAllocator: Driver requested a total number of 195 executor(s).
> 17/04/20 19:52:30 INFO dispatcher-event-loop-53 YarnAllocator: Driver requested a total number of 205 executor(s).
> 17/04/20 19:52:31 INFO dispatcher-event-loop-19 YarnAllocator: Driver requested a total number of 202 executor(s).
> 17/04/20 19:52:31 INFO dispatcher-event-loop-17 YarnAllocator: Driver requested a total number of 232 executor(s).
> 17/04/20 19:52:31 INFO dispatcher-event-loop-45 YarnAllocator: Driver requested a total number of 243 executor(s).
> 17/04/20 19:52:31 INFO dispatcher-event-loop-19 YarnAllocator: Driver requested a total number of 254 executor(s).
> 17/04/20 19:52:31 INFO dispatcher-event-loop-42 YarnAllocator: Driver requested a total number of 263 executor(s).
> 17/04/20 19:52:31 INFO dispatcher-event-loop-20 YarnAllocator: Driver requested a total number of 271 executor(s).
> 17/04/20 19:52:31 INFO dispatcher-event-loop-35 YarnAllocator: Driver requested a total number of 280 executor(s).
> 17/04/20 19:52:31 INFO dispatcher-event-loop-61 YarnAllocator: Driver requested a total number of 289 executor(s).
> 17/04/20 19:52:32 INFO dispatcher-event-loop-22 YarnAllocator: Driver requested a total number of 305 executor(s).
> 17/04/20 19:52:32 INFO dispatcher-event-loop-28 YarnAllocator: Driver requested a total number of 310 executor(s).
> 17/04/20 19:52:32 INFO dispatcher-event-loop-0 YarnAllocator: Driver requested a total number of 313 executor(s).
> 17/04/20 19:52:32 INFO dispatcher-event-loop-28 YarnAllocator: Driver requested a total number of 315 executor(s).
> 17/04/20 19:52:32 INFO dispatcher-event-loop-40 YarnAllocator: Driver requested a total number of 316 executor(s).
> 17/04/20 19:52:32 INFO dispatcher-event-loop-13 YarnAllocator: Driver requested a total number of 317 executor(s).
> 17/04/20 19:52:32 INFO dispatcher-event-loop-35 YarnAllocator: Driver requested a total number of 311 executor(s).
> 17/04/20 19:52:33 INFO dispatcher-event-loop-40 YarnAllocator: Driver requested a total number of 308 executor(s).
> 17/04/20 19:52:33 INFO dispatcher-event-loop-4 YarnAllocator: Driver requested a total number of 301 executor(s).
> 17/04/20 19:52:33 INFO dispatcher-event-loop-23 YarnAllocator: Driver requested a total number of 294 executor(s).
> 17/04/20 19:52:33 INFO dispatcher-event-loop-46 YarnAllocator: Driver requested a total number of 287 executor(s).
> 17/04/20 19:52:33 INFO dispatcher-event-loop-8 YarnAllocator: Driver requested a total number of 285 executor(s).
> 17/04/20 19:52:33 INFO dispatcher-event-loop-63 YarnAllocator: Driver requested a total number of 283 executor(s).
> 17/04/20 19:52:33 INFO dispatcher-event-loop-35 YarnAllocator: Driver requested a total number of 281 executor(s).
> 17/04/20 19:52:33 INFO dispatcher-event-loop-63 YarnAllocator: Driver requested a total number of 278 executor(s).
> 17/04/20 19:52:33 INFO dispatcher-event-loop-3 YarnAllocator: Driver requested a total number of 277 executor(s).
> 17/04/20 19:52:33 INFO dispatcher-event-loop-38 YarnAllocator: Driver requested a total number of 276 executor(s).
> 17/04/20 19:52:34 INFO dispatcher-event-loop-51 YarnAllocator: Driver requested a total number of 273 executor(s).
> 17/04/20 19:52:34 INFO dispatcher-event-loop-31 YarnAllocator: Driver requested a total number of 271 executor(s).
> 17/04/20 19:52:34 INFO dispatcher-event-loop-44 YarnAllocator: Driver requested a total number of 270 executor(s).
> {code}
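
The request rate described in the report can be checked against a saved copy of the YarnAllocator log by counting the "Driver requested a total number of N executor(s)" entries per second. What follows is only a rough sketch for that purpose: `REQUEST_RE` and `summarize_requests` are hypothetical names, not part of Spark, and the pattern assumes the log line format shown in the excerpts in this message.

```python
import re

# Assumed format, taken from the excerpts in this message:
#   17/04/20 19:52:30 INFO <thread> YarnAllocator: Driver requested a total
#   number of 185 executor(s).
REQUEST_RE = re.compile(
    r"(\d{2}/\d{2}/\d{2}\s+\d{2}:\d{2}:\d{2})\s+INFO\s+\S+\s+YarnAllocator:\s+"
    r"Driver\s+requested\s+a\s+total\s+number\s+of\s+(\d+)\s+executor\(s\)\."
)


def summarize_requests(log_text):
    """Map each one-second timestamp to (request count, min target, max target)."""
    # Quoted mail wraps entries with "> " markers; collapse them so that a
    # wrapped entry becomes a single run of text again.
    flat = re.sub(r"\s*>\s*", " ", log_text)
    targets = {}
    for stamp, total in REQUEST_RE.findall(flat):
        targets.setdefault(stamp, []).append(int(total))
    return {s: (len(v), min(v), max(v)) for s, v in targets.items()}


# Two entries copied from the log above, wrapped as they appear in quoted mail.
excerpt = """
> 17/04/20 19:52:05 INFO dispatcher-event-loop-49 YarnAllocator: Driver
> requested a total number of 227 executor(s).
> 17/04/20 19:52:05 INFO dispatcher-event-loop-30 YarnAllocator: Driver
> requested a total number of 213 executor(s).
"""
for stamp, (count, low, high) in sorted(summarize_requests(excerpt).items()):
    print(stamp, "requests:", count, "target range:", low, "-", high)
```

Seeing more than one request within the same second, as in the excerpts above, is the symptom the report describes: target updates are arriving much more often than the 5s interval the reporter expected.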