[jira] [Updated] (YARN-4685) AM blacklisting result in application to get hanged
[ https://issues.apache.org/jira/browse/YARN-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohith Sharma K S updated YARN-4685:
    Attachment: YARN-4685.patch

Updated the patch with 2 configuration changes.
# Reduced the blacklisting threshold to 20%.
# Set the default value for blacklist-enabled to false.

> AM blacklisting result in application to get hanged
> ---
>
>                 Key: YARN-4685
>                 URL: https://issues.apache.org/jira/browse/YARN-4685
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Rohith Sharma K S
>            Assignee: Rohith Sharma K S
>            Priority: Critical
>         Attachments: YARN-4685-workaround.patch, YARN-4685.patch
>
>
> AM blacklist additions and removals are updated only when the RMAppAttempt is
> scheduled, i.e. in {{RMAppAttemptImpl#ScheduleTransition#transition}}. Once the
> attempt is scheduled, any removeNode/addNode in the cluster is not propagated
> to {{BlackListManager#refreshNodeHostCount}}. This leads BlackListManager to
> operate on a stale NM count. The application stays in the ACCEPTED state and
> waits forever, even if the blacklisted nodes reconnect after clearing disk
> space.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
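The stale-count problem described in the issue can be illustrated with a minimal sketch. It is not the ResourceManager code: `SketchBlacklistManager`, `effectiveBlacklist`, and `schedulableHosts` are hypothetical names, and the 0.8 threshold, the node counts, and the host names are illustrative values. The sketch assumes a manager that ignores its blacklist once the number of blacklisted hosts reaches a fraction of the known node count; only `refreshNodeHostCount` is a name taken from the description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class StaleNodeCountSketch {

    /** Hypothetical stand-in for a threshold-based AM blacklist manager. */
    static class SketchBlacklistManager {
        private final Set<String> blacklistedHosts = new LinkedHashSet<>();
        private final double disableThreshold; // fraction of hosts, e.g. 0.8
        private int nodeHostCount;             // node count as last reported

        SketchBlacklistManager(int nodeHostCount, double disableThreshold) {
            this.nodeHostCount = nodeHostCount;
            this.disableThreshold = disableThreshold;
        }

        void addNode(String host) {
            blacklistedHosts.add(host);
        }

        /** Must be called whenever nodes join or leave the cluster. */
        void refreshNodeHostCount(int count) {
            this.nodeHostCount = count;
        }

        /** Returns the blacklist, or an empty list once the threshold is reached. */
        List<String> effectiveBlacklist() {
            if (blacklistedHosts.size() < disableThreshold * nodeHostCount) {
                return new ArrayList<>(blacklistedHosts);
            }
            return Collections.emptyList();
        }
    }

    /** Hosts on which an AM container could still be placed. */
    static List<String> schedulableHosts(List<String> liveHosts,
                                         SketchBlacklistManager manager) {
        List<String> result = new ArrayList<>(liveHosts);
        result.removeAll(manager.effectiveBlacklist());
        return result;
    }

    public static void main(String[] args) {
        // The count (10) was captured when the attempt was scheduled.
        SketchBlacklistManager manager = new SketchBlacklistManager(10, 0.8);
        manager.addNode("host1");
        manager.addNode("host2");

        // Eight nodes have since left; only the two blacklisted hosts remain.
        List<String> liveHosts = Arrays.asList("host1", "host2");

        // Stale count: 2 < 0.8 * 10, so the blacklist still applies
        // and no host is left for the AM container.
        System.out.println("stale count:     " + schedulableHosts(liveHosts, manager));

        // Refreshed count: 2 >= 0.8 * 2, so the blacklist is ignored.
        manager.refreshNodeHostCount(liveHosts.size());
        System.out.println("refreshed count: " + schedulableHosts(liveHosts, manager));
    }
}
```

With the stale count the sketch finds no schedulable host, which corresponds to the application waiting in ACCEPTED; after the count is refreshed, both hosts become eligible again.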
[jira] [Updated] (YARN-4685) AM blacklisting result in application to get hanged
[ https://issues.apache.org/jira/browse/YARN-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohith Sharma K S updated YARN-4685:
    Attachment: YARN-4685-workaround.patch
[jira] [Updated] (YARN-4685) AM blacklisting result in application to get hanged
[ https://issues.apache.org/jira/browse/YARN-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohith Sharma K S updated YARN-4685:
    Description:
AM blacklist additions and removals are updated only when the RMAppAttempt is scheduled, i.e. in {{RMAppAttemptImpl#ScheduleTransition#transition}}. Once the attempt is scheduled, any removeNode/addNode in the cluster is not propagated to {{BlackListManager#refreshNodeHostCount}}. This leads BlackListManager to operate on a stale NM count. The application stays in the ACCEPTED state and waits forever, even if the blacklisted nodes reconnect after clearing disk space.

  was:
AM blacklist additions and removals are updated only when the RMAppAttempt is scheduled, i.e. in {{RMAppAttemptImpl#ScheduleTransition#transition}}. Once the attempt is scheduled, any removeNode/addNode in the cluster is not propagated to {{BlackListManager#refreshNodeHostCount}}. This leads BlackListManager to operate on a stale NM count. The application stays in the ACCEPTED state and waits forever, even if more nodes are added to the cluster.
The solution is to update BlacklistManager on every {{RMAppAttemptImpl#AMContainerAllocatedTransition#transition}} call. This ensures that any addition or removal of nodes is propagated to BlacklistManager.
[jira] [Updated] (YARN-4685) AM blacklisting result in application to get hanged
[ https://issues.apache.org/jira/browse/YARN-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sunil G updated YARN-4685:
    Priority: Critical  (was: Major)
[jira] [Updated] (YARN-4685) AM blacklisting result in application to get hanged
[ https://issues.apache.org/jira/browse/YARN-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohith Sharma K S updated YARN-4685:
    Summary: AM blacklisting result in application to get hanged  (was: AM blacklist addition/removal should get updated for every allocate call from RMAppAttemptImpl.)