[ https://issues.apache.org/jira/browse/SPARK-17667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15527128#comment-15527128 ]
Apache Spark commented on SPARK-17667:
--------------------------------------

User 'ashwinshankar77' has created a pull request for this issue:
https://github.com/apache/spark/pull/15267

> Make locking fine grained in YarnAllocator#enqueueGetLossReasonRequest
> ----------------------------------------------------------------------
>
>                 Key: SPARK-17667
>                 URL: https://issues.apache.org/jira/browse/SPARK-17667
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.6.2, 2.0.0
>            Reporter: Ashwin Shankar
>
> Following up on the discussion in SPARK-15725, one of the reasons the AM
> hangs with dynamic allocation (DA) is the way locking is done in
> YarnAllocator. We noticed that when executors go down during the shrink phase
> of DA, the AM gets locked up. A thread dump shows threads trying to get the
> loss reason via YarnAllocator#enqueueGetLossReasonRequest, all BLOCKED
> waiting for the lock held by the allocate call. This gets worse when
> thousands of executors go down, and I've seen the AM hang for minutes at a
> time. This JIRA is created to make the locking a little more fine grained by
> remembering the executors that were killed via the AM, and then serving
> GetExecutorLossReason requests with that information.
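
The idea in the description (remember which executors the AM killed, then answer loss-reason requests from that record instead of waiting on the allocator lock) could be sketched roughly as below. This is a minimal Java illustration under stated assumptions, not the code from the pull request: the class LossReasonSketch, the methods markKilledByAm and pendingCount, the lock object, and the reason string are all hypothetical, and only the name enqueueGetLossReasonRequest is borrowed from the issue title for readability.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only; this is not the YarnAllocator implementation.
final class LossReasonSketch {
    // Coarse lock, standing in for the monitor held by the allocate call.
    private final Object allocatorLock = new Object();

    // Executors the AM itself asked to kill. A concurrent set, so lookups
    // never need allocatorLock.
    private final Set<String> killedByAm = ConcurrentHashMap.newKeySet();

    // Requests that still need the allocator; guarded by allocatorLock.
    private final List<String> pendingRequests = new ArrayList<>();

    // Record a kill issued by the AM (for example during the DA shrink phase).
    void markKilledByAm(String executorId) {
        killedByAm.add(executorId);
    }

    // Fast path: answer from killedByAm without locking.
    // Slow path: queue the request under the coarse lock and return empty.
    Optional<String> enqueueGetLossReasonRequest(String executorId) {
        if (killedByAm.contains(executorId)) {
            return Optional.of("Executor " + executorId + " was killed via the AM");
        }
        synchronized (allocatorLock) {
            pendingRequests.add(executorId);
        }
        return Optional.empty();
    }

    // Number of requests waiting for the next allocate pass.
    int pendingCount() {
        synchronized (allocatorLock) {
            return pendingRequests.size();
        }
    }
}
```

In this sketch, a request for an executor that the AM killed returns immediately even while another thread holds allocatorLock; only requests for executors lost for unknown reasons contend for the lock.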