[jira] [Commented] (YARN-4314) Adding container wait time as a metric at queue level and application level.
[ https://issues.apache.org/jira/browse/YARN-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982340#comment-14982340 ] Raju Bairishetti commented on YARN-4314: I feel adding a timestamp to each resource request will be costly, and all existing applications would need to migrate to use this metric. I had a discussion with [~sriksun] earlier about this approach. The resource request is prepared by the AM; if in future we want to use this timestamp as a priority for allocating resources, then there is a chance that a user/AM could misuse the system by claiming older timestamps. Thinking about this approach: AppSchedulingInfo has all the scheduling info about an application. When the RM receives the first resource request from an AM, the RM can note down the system time as the resource request time. Whenever a new request comes in (i.e. updateResourceRequest() in AppSchedulingInfo, or allocate()), we can measure how long containers have waited since the last request time. We can mostly listen on the container request & allocate events. I will put up a detailed doc with all my thoughts & approaches. > Adding container wait time as a metric at queue level and application level. > > > Key: YARN-4314 > URL: https://issues.apache.org/jira/browse/YARN-4314 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Lavkesh Lahngir >Assignee: Lavkesh Lahngir > > There is a need for adding the container wait-time which can be tracked at > the queue and application level. > An application can have two kinds of wait times. One is AM wait time after > submission and another is total container wait time between AM asking for > containers and getting them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
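The RM-side bookkeeping sketched in the comment (record the arrival time of each request, then on allocation measure how long the oldest outstanding request waited) could look roughly like the following standalone model. All names here (WaitTimeTracker, recordRequest, recordAllocation) are illustrative stand-ins, not actual AppSchedulingInfo APIs; the real change would hook into the request/allocate events mentioned above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified, standalone model of per-application container wait-time
// bookkeeping; the real implementation would live in AppSchedulingInfo.
public class WaitTimeTracker {
    // Timestamps of outstanding container requests, oldest first.
    private final Deque<Long> outstanding = new ArrayDeque<>();
    private long cumulativeWaitMs = 0;

    // RM-side timestamp taken when the request arrives, so an AM cannot
    // game allocation order by reporting older client-side timestamps.
    public void recordRequest(int numContainers, long nowMs) {
        for (int i = 0; i < numContainers; i++) {
            outstanding.addLast(nowMs);
        }
    }

    // On allocation, charge the wait time of the oldest pending requests.
    public void recordAllocation(int numContainers, long nowMs) {
        for (int i = 0; i < numContainers && !outstanding.isEmpty(); i++) {
            cumulativeWaitMs += nowMs - outstanding.removeFirst();
        }
    }

    public long getCumulativeWaitMs() {
        return cumulativeWaitMs;
    }
}
```

Using RM-observed times keeps the metric trustworthy, at the cost of only approximating when the AM first wanted the containers.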
[jira] [Commented] (YARN-473) Capacity Scheduler webpage and REST API not showing correct number of pending applications
[ https://issues.apache.org/jira/browse/YARN-473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14659695#comment-14659695 ] Raju Bairishetti commented on YARN-473: --- The REST API is showing the wrong number of pending applications because it populates the value from the *pendingApplications* set (i.e. the size of the pendingApplications set). IMO, it should get the value from queue metrics instead of from in-memory data structures. LeafQueue: {noformat} 1) *pendingApplications* : all submitted applications are added to this set first. 2) *activeApplications* : applications are moved to this set from the pendingApplications set if the number of active applications is less than the maximum number of active applications. {noformat} JMX metrics show the appsPending metric properly. AppsPending is incremented when an application is submitted to a queue and decremented only when the application is actually launched (i.e. some resources are allocated to it). IMO, the REST call for queue info should also use queue metrics instead of depending on other data structures. Applications are removed from the pendingApplications set before the application launches; the pendingApplications set contains the applications which are not schedulable at this moment. One more interesting fact: CapacitySchedulerPage (in the UI) has "Num Non-Schedulable Applications" and gets the value from the pendingApplications set, so the REST API call is correct in this case. I am thinking of a couple of approaches to fix this issue: {noformat} 1) Rename pendingApplications to nonSchedulableApplications in the CapacitySchedulerLeafQueueInfo class, introduce a new field (pendingApplications) in CapacitySchedulerLeafQueueInfo, and get its value from QueueMetrics. Refactor activeApplications to schedulableApplications and introduce a new field (activeApplications) which tells exactly how many AMs are there.
2) Refactor "Num Non-Schedulable Applications:" to "Num of Pending Applications" and refactor "Num Schedulable Applications" to activeApplications. Update them through QueueMetrics instead of depending on in-memory data structures in the LeafQueue to get the values. {noformat} > Capacity Scheduler webpage and REST API not showing correct number of pending > applications > -- > > Key: YARN-473 > URL: https://issues.apache.org/jira/browse/YARN-473 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 0.23.6 >Reporter: Kendall Thrapp >Assignee: Mit Desai > Labels: usability > > The Capacity Scheduler REST API > (http://hadoop.apache.org/docs/r0.23.6/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Scheduler_API) > is not returning the correct number of pending applications. > numPendingApplications is almost always zero, even if there are dozens of > pending apps. > In investigating this, I discovered that the Resource Manager's Scheduler > webpage is also showing an incorrect but different number of pending > applications. For example, the cluster I'm looking at right now currently > has 15 applications in the ACCEPTED state, but the Cluster Metrics table near > the top of the page says there are only 2 pending apps. The REST API says > there are zero pending apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
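A toy model of why the two sources disagree, under the lifecycle described in the comment (the set is drained as soon as an app becomes schedulable, while the metric is only decremented at launch). Class and field names below are illustrative, not the CapacityScheduler's actual code.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model showing why LeafQueue's in-memory pendingApplications set and
// the QueueMetrics appsPending counter report different "pending" counts.
public class PendingAppsModel {
    private final Set<String> pendingApplications = new HashSet<>();
    private final Set<String> activeApplications = new HashSet<>();
    private int appsPendingMetric = 0; // stands in for QueueMetrics

    public void submit(String appId) {
        pendingApplications.add(appId);
        appsPendingMetric++; // metric incremented on submission
    }

    // App moves to the active set once it may be scheduled...
    public void activate(String appId) {
        pendingApplications.remove(appId);
        activeApplications.add(appId);
        // ...but the metric is NOT decremented yet.
    }

    // Metric only drops when the app is actually launched.
    public void launch(String appId) {
        appsPendingMetric--;
    }

    public int pendingSetSize() { return pendingApplications.size(); }
    public int appsPending() { return appsPendingMetric; }
}
```

The REST API reads the equivalent of pendingSetSize(), which is why it under-reports whenever apps sit activated but not yet launched.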
[jira] [Assigned] (YARN-3972) Work Preserving AM Restart for MapReduce
[ https://issues.apache.org/jira/browse/YARN-3972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raju Bairishetti reassigned YARN-3972: -- Assignee: Raju Bairishetti > Work Preserving AM Restart for MapReduce > > > Key: YARN-3972 > URL: https://issues.apache.org/jira/browse/YARN-3972 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Srikanth Sampath >Assignee: Raju Bairishetti > > Providing a framework for work-preserving AM restart is achieved in > [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489]. We would like > to take advantage of this for MapReduce (MR) applications. There are some > challenges which have been described in the attached document and a few options > discussed. We solicit feedback from the community. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14623462#comment-14623462 ] Raju Bairishetti commented on YARN-3644: Thanks [~varun_saxena] for the review and comments. bq. The config name is yarn.nodemanager.shutdown.on.RM.connection.failures. All our config names are in lowercase, just for the sake of consistency, maybe RM can be in lowercase too. Thoughts? Agreed; will change it to lowercase. bq. The test doesnt really check for whether ConnectionException was thrown or NM Shutdown event was called or not. I ran the test in debugger mode as well; the test exercises all the source changes. *I agree, I will rewrite this test using Mockito to make it more generic.* > Node manager shuts down if unable to connect with RM > > > Key: YARN-3644 > URL: https://issues.apache.org/jira/browse/YARN-3644 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Srikanth Sundarrajan >Assignee: Raju Bairishetti > Attachments: YARN-3644.001.patch, YARN-3644.001.patch, > YARN-3644.002.patch, YARN-3644.003.patch, YARN-3644.patch > > > When NM is unable to connect to RM, NM shuts itself down. > {code} > } catch (ConnectException e) { > //catch and throw the exception if tried MAX wait time to connect > RM > dispatcher.getEventHandler().handle( > new NodeManagerEvent(NodeManagerEventType.SHUTDOWN)); > throw new YarnRuntimeException(e); > {code} > In large clusters, if RM is down for maintenance for longer period, all the > NMs shuts themselves down, requiring additional work to bring up the NMs. > Setting the yarn.resourcemanager.connect.wait-ms to -1 has other side > effects, where non connection failures are being retried infinitely by all > YarnClients (via RMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614591#comment-14614591 ] Raju Bairishetti commented on YARN-3644: [~Naganarasimha] Could you kindly review the latest patch? > Node manager shuts down if unable to connect with RM > > > Key: YARN-3644 > URL: https://issues.apache.org/jira/browse/YARN-3644 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Srikanth Sundarrajan >Assignee: Raju Bairishetti > Attachments: YARN-3644.001.patch, YARN-3644.001.patch, > YARN-3644.002.patch, YARN-3644.003.patch, YARN-3644.patch > > > When NM is unable to connect to RM, NM shuts itself down. > {code} > } catch (ConnectException e) { > //catch and throw the exception if tried MAX wait time to connect > RM > dispatcher.getEventHandler().handle( > new NodeManagerEvent(NodeManagerEventType.SHUTDOWN)); > throw new YarnRuntimeException(e); > {code} > In large clusters, if RM is down for maintenance for longer period, all the > NMs shuts themselves down, requiring additional work to bring up the NMs. > Setting the yarn.resourcemanager.connect.wait-ms to -1 has other side > effects, where non connection failures are being retried infinitely by all > YarnClients (via RMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3886) Add cumulative wait times of apps at Queue level
Raju Bairishetti created YARN-3886: -- Summary: Add cumulative wait times of apps at Queue level Key: YARN-3886 URL: https://issues.apache.org/jira/browse/YARN-3886 Project: Hadoop YARN Issue Type: Task Components: yarn Reporter: Raju Bairishetti Assignee: Raju Bairishetti Right now, we have counts of apps submitted/failed/killed/running at the queue level, but we don't have any way to find out on which queues apps are waiting longer. I hope adding cumulative wait times of apps at the queue level will be helpful in viewing the overall queue status. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
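The per-queue aggregation this issue asks for could be as simple as the following sketch (names are illustrative, not the QueueMetrics API): accumulate container wait time per queue so it can sit alongside the existing appsSubmitted/appsFailed/... counters.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of per-queue cumulative wait-time accounting;
// a real version would be wired into QueueMetrics, not a plain map.
public class QueueWaitMetrics {
    private final Map<String, Long> cumulativeWaitMs = new HashMap<>();

    // Called when a container is allocated; waitMs is the time the
    // request spent pending on this queue.
    public void addWait(String queue, long waitMs) {
        cumulativeWaitMs.merge(queue, waitMs, Long::sum);
    }

    public long getCumulativeWaitMs(String queue) {
        return cumulativeWaitMs.getOrDefault(queue, 0L);
    }
}
```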
[jira] [Updated] (YARN-3695) ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception.
[ https://issues.apache.org/jira/browse/YARN-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raju Bairishetti updated YARN-3695: --- Attachment: YARN-3695.01.patch [~jianhe] Thanks for the review. Moved the Precondition checks before creating the RetryPolicy, so that we avoid creating a policy if the connection timeout values are invalid. > ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception. > -- > > Key: YARN-3695 > URL: https://issues.apache.org/jira/browse/YARN-3695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junping Du >Assignee: Raju Bairishetti > Attachments: YARN-3695.01.patch, YARN-3695.patch > > > YARN-3646 fix the retry forever policy in RMProxy that it only applies on > limited exceptions rather than all exceptions. Here, we may need the same fix > for ServerProxy (NMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603363#comment-14603363 ] Raju Bairishetti commented on YARN-3644: Seems the checkstyle error was not introduced as part of this patch; the file already had more than 2,000 lines :). *Checkstyle error:* YarnConfiguration.java:1: File length is 2,036 lines (max allowed is 2,000). > Node manager shuts down if unable to connect with RM > > > Key: YARN-3644 > URL: https://issues.apache.org/jira/browse/YARN-3644 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Srikanth Sundarrajan >Assignee: Raju Bairishetti > Attachments: YARN-3644.001.patch, YARN-3644.001.patch, > YARN-3644.002.patch, YARN-3644.003.patch, YARN-3644.patch > > > When NM is unable to connect to RM, NM shuts itself down. > {code} > } catch (ConnectException e) { > //catch and throw the exception if tried MAX wait time to connect > RM > dispatcher.getEventHandler().handle( > new NodeManagerEvent(NodeManagerEventType.SHUTDOWN)); > throw new YarnRuntimeException(e); > {code} > In large clusters, if RM is down for maintenance for longer period, all the > NMs shuts themselves down, requiring additional work to bring up the NMs. > Setting the yarn.resourcemanager.connect.wait-ms to -1 has other side > effects, where non connection failures are being retried infinitely by all > YarnClients (via RMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3695) ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception.
[ https://issues.apache.org/jira/browse/YARN-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raju Bairishetti updated YARN-3695: --- Attachment: YARN-3695.patch > ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception. > -- > > Key: YARN-3695 > URL: https://issues.apache.org/jira/browse/YARN-3695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junping Du >Assignee: Raju Bairishetti > Attachments: YARN-3695.patch > > > YARN-3646 fix the retry forever policy in RMProxy that it only applies on > limited exceptions rather than all exceptions. Here, we may need the same fix > for ServerProxy (NMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raju Bairishetti updated YARN-3644: --- Attachment: YARN-3644.003.patch Fixed the test case against the newly added changes in trunk by overriding the unRegisterNodeManager(request) method in the MyResourceTracker8 class. > Node manager shuts down if unable to connect with RM > > > Key: YARN-3644 > URL: https://issues.apache.org/jira/browse/YARN-3644 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Srikanth Sundarrajan >Assignee: Raju Bairishetti > Attachments: YARN-3644.001.patch, YARN-3644.001.patch, > YARN-3644.002.patch, YARN-3644.003.patch, YARN-3644.patch > > > When NM is unable to connect to RM, NM shuts itself down. > {code} > } catch (ConnectException e) { > //catch and throw the exception if tried MAX wait time to connect > RM > dispatcher.getEventHandler().handle( > new NodeManagerEvent(NodeManagerEventType.SHUTDOWN)); > throw new YarnRuntimeException(e); > {code} > In large clusters, if RM is down for maintenance for longer period, all the > NMs shuts themselves down, requiring additional work to bring up the NMs. > Setting the yarn.resourcemanager.connect.wait-ms to -1 has other side > effects, where non connection failures are being retried infinitely by all > YarnClients (via RMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raju Bairishetti updated YARN-3644: --- Attachment: YARN-3644.002.patch > Node manager shuts down if unable to connect with RM > > > Key: YARN-3644 > URL: https://issues.apache.org/jira/browse/YARN-3644 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Srikanth Sundarrajan >Assignee: Raju Bairishetti > Attachments: YARN-3644.001.patch, YARN-3644.001.patch, > YARN-3644.002.patch, YARN-3644.patch > > > When NM is unable to connect to RM, NM shuts itself down. > {code} > } catch (ConnectException e) { > //catch and throw the exception if tried MAX wait time to connect > RM > dispatcher.getEventHandler().handle( > new NodeManagerEvent(NodeManagerEventType.SHUTDOWN)); > throw new YarnRuntimeException(e); > {code} > In large clusters, if RM is down for maintenance for longer period, all the > NMs shuts themselves down, requiring additional work to bring up the NMs. > Setting the yarn.resourcemanager.connect.wait-ms to -1 has other side > effects, where non connection failures are being retried infinitely by all > YarnClients (via RMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raju Bairishetti updated YARN-3644: --- Attachment: YARN-3644.001.patch Created a jira [YARN-3847|https://issues.apache.org/jira/browse/YARN-3847] for *refactoring the full test class*. [~Naganarasimha] Fixed a couple of the review comments. Is it fine to refactor the test as part of [YARN-3847|https://issues.apache.org/jira/browse/YARN-3847]? > Node manager shuts down if unable to connect with RM > > > Key: YARN-3644 > URL: https://issues.apache.org/jira/browse/YARN-3644 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Srikanth Sundarrajan >Assignee: Raju Bairishetti > Attachments: YARN-3644.001.patch, YARN-3644.001.patch, YARN-3644.patch > > > When NM is unable to connect to RM, NM shuts itself down. > {code} > } catch (ConnectException e) { > //catch and throw the exception if tried MAX wait time to connect > RM > dispatcher.getEventHandler().handle( > new NodeManagerEvent(NodeManagerEventType.SHUTDOWN)); > throw new YarnRuntimeException(e); > {code} > In large clusters, if RM is down for maintenance for longer period, all the > NMs shuts themselves down, requiring additional work to bring up the NMs. > Setting the yarn.resourcemanager.connect.wait-ms to -1 has other side > effects, where non connection failures are being retried infinitely by all > YarnClients (via RMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3847) Refactor TestNodeStatusUpdater
[ https://issues.apache.org/jira/browse/YARN-3847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raju Bairishetti updated YARN-3847: --- Labels: test (was: ) > Refactor TestNodeStatusUpdater > -- > > Key: YARN-3847 > URL: https://issues.apache.org/jira/browse/YARN-3847 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Raju Bairishetti >Assignee: Raju Bairishetti > Labels: test > > Seems there is a lot of duplicated/redundant code in > TestNodeStatusUpdater.java. > This jira is for removing the redundant code from TestNodeStatusUpdater. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3847) Refactor TestNodeStatusUpdater
[ https://issues.apache.org/jira/browse/YARN-3847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raju Bairishetti updated YARN-3847: --- Component/s: nodemanager > Refactor TestNodeStatusUpdater > -- > > Key: YARN-3847 > URL: https://issues.apache.org/jira/browse/YARN-3847 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Raju Bairishetti >Assignee: Raju Bairishetti > Labels: test > > Seems there is a lot of duplicated/redundant code in > TestNodeStatusUpdater.java. > This jira is for removing the redundant code from TestNodeStatusUpdater. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3847) Refactor TestNodeStatusUpdater
Raju Bairishetti created YARN-3847: -- Summary: Refactor TestNodeStatusUpdater Key: YARN-3847 URL: https://issues.apache.org/jira/browse/YARN-3847 Project: Hadoop YARN Issue Type: Bug Reporter: Raju Bairishetti Assignee: Raju Bairishetti Seems there is a lot of duplicated/redundant code in TestNodeStatusUpdater.java. This jira is for removing the redundant code from TestNodeStatusUpdater. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583189#comment-14583189 ] Raju Bairishetti commented on YARN-3644: [~amareshwari] [~Naganarasimha] Thanks for the review and comments. [~Naganarasimha] Yes, this jira is only to make NM wait for RM. > Node manager shuts down if unable to connect with RM > > > Key: YARN-3644 > URL: https://issues.apache.org/jira/browse/YARN-3644 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Srikanth Sundarrajan >Assignee: Raju Bairishetti > Attachments: YARN-3644.001.patch, YARN-3644.patch > > > When NM is unable to connect to RM, NM shuts itself down. > {code} > } catch (ConnectException e) { > //catch and throw the exception if tried MAX wait time to connect > RM > dispatcher.getEventHandler().handle( > new NodeManagerEvent(NodeManagerEventType.SHUTDOWN)); > throw new YarnRuntimeException(e); > {code} > In large clusters, if RM is down for maintenance for longer period, all the > NMs shuts themselves down, requiring additional work to bring up the NMs. > Setting the yarn.resourcemanager.connect.wait-ms to -1 has other side > effects, where non connection failures are being retried infinitely by all > YarnClients (via RMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582898#comment-14582898 ] Raju Bairishetti commented on YARN-3644: Could anyone please review the patch? > Node manager shuts down if unable to connect with RM > > > Key: YARN-3644 > URL: https://issues.apache.org/jira/browse/YARN-3644 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Srikanth Sundarrajan >Assignee: Raju Bairishetti > Attachments: YARN-3644.001.patch, YARN-3644.patch > > > When NM is unable to connect to RM, NM shuts itself down. > {code} > } catch (ConnectException e) { > //catch and throw the exception if tried MAX wait time to connect > RM > dispatcher.getEventHandler().handle( > new NodeManagerEvent(NodeManagerEventType.SHUTDOWN)); > throw new YarnRuntimeException(e); > {code} > In large clusters, if RM is down for maintenance for longer period, all the > NMs shuts themselves down, requiring additional work to bring up the NMs. > Setting the yarn.resourcemanager.connect.wait-ms to -1 has other side > effects, where non connection failures are being retried infinitely by all > YarnClients (via RMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562206#comment-14562206 ] Raju Bairishetti commented on YARN-3644: [~hex108] Is there any pending work on this jira for which you assigned it to yourself, or was it assigned to you by mistake? > Node manager shuts down if unable to connect with RM > > > Key: YARN-3644 > URL: https://issues.apache.org/jira/browse/YARN-3644 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Srikanth Sundarrajan >Assignee: Jun Gong > Attachments: YARN-3644.001.patch, YARN-3644.patch > > > When NM is unable to connect to RM, NM shuts itself down. > {code} > } catch (ConnectException e) { > //catch and throw the exception if tried MAX wait time to connect > RM > dispatcher.getEventHandler().handle( > new NodeManagerEvent(NodeManagerEventType.SHUTDOWN)); > throw new YarnRuntimeException(e); > {code} > In large clusters, if RM is down for maintenance for longer period, all the > NMs shuts themselves down, requiring additional work to bring up the NMs. > Setting the yarn.resourcemanager.connect.wait-ms to -1 has other side > effects, where non connection failures are being retried infinitely by all > YarnClients (via RMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raju Bairishetti updated YARN-3644: --- Attachment: YARN-3644.001.patch > Node manager shuts down if unable to connect with RM > > > Key: YARN-3644 > URL: https://issues.apache.org/jira/browse/YARN-3644 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Srikanth Sundarrajan >Assignee: Raju Bairishetti > Attachments: YARN-3644.001.patch, YARN-3644.patch > > > When NM is unable to connect to RM, NM shuts itself down. > {code} > } catch (ConnectException e) { > //catch and throw the exception if tried MAX wait time to connect > RM > dispatcher.getEventHandler().handle( > new NodeManagerEvent(NodeManagerEventType.SHUTDOWN)); > throw new YarnRuntimeException(e); > {code} > In large clusters, if RM is down for maintenance for longer period, all the > NMs shuts themselves down, requiring additional work to bring up the NMs. > Setting the yarn.resourcemanager.connect.wait-ms to -1 has other side > effects, where non connection failures are being retried infinitely by all > YarnClients (via RMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raju Bairishetti updated YARN-3644: --- Attachment: YARN-3644.patch Introduced a new config **NODEMANAGER_SHUTDOWN_ON_RM_CONNECTION_FAILURES** to allow users to decide whether the NM should shut down when it is unable to connect to the RM. Keeping the default value as true to honour the current behaviour. > Node manager shuts down if unable to connect with RM > > > Key: YARN-3644 > URL: https://issues.apache.org/jira/browse/YARN-3644 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Srikanth Sundarrajan >Assignee: Raju Bairishetti > Attachments: YARN-3644.patch > > > When NM is unable to connect to RM, NM shuts itself down. > {code} > } catch (ConnectException e) { > //catch and throw the exception if tried MAX wait time to connect > RM > dispatcher.getEventHandler().handle( > new NodeManagerEvent(NodeManagerEventType.SHUTDOWN)); > throw new YarnRuntimeException(e); > {code} > In large clusters, if RM is down for maintenance for longer period, all the > NMs shuts themselves down, requiring additional work to bring up the NMs. > Setting the yarn.resourcemanager.connect.wait-ms to -1 has other side > effects, where non connection failures are being retried infinitely by all > YarnClients (via RMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
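A minimal standalone sketch of the proposed decision point, assuming the semantics described in the comment (a boolean flag, defaulting to true upstream, that gates the SHUTDOWN dispatch from the quoted catch block). The dispatcher/event plumbing is reduced to an enum here; names are illustrative, not the patch's actual code.

```java
import java.net.ConnectException;

public class NmConnectLoop {
    public interface RmClient { void register() throws ConnectException; }

    // Stands in for either a normal return or the NodeManagerEventType.SHUTDOWN
    // dispatch in the snippet quoted in the issue description.
    public enum Outcome { CONNECTED, SHUTDOWN, KEEP_RETRYING }

    // Retry registration up to maxAttempts; on exhaustion, shut down only
    // when the new flag is true, otherwise stay up and let the caller
    // schedule further retries (sleep/backoff elided for brevity).
    public static Outcome registerWithRetries(RmClient rm, int maxAttempts,
                                              boolean shutdownOnFailure) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                rm.register();
                return Outcome.CONNECTED;
            } catch (ConnectException e) {
                // fall through and retry until attempts are exhausted
            }
        }
        return shutdownOnFailure ? Outcome.SHUTDOWN : Outcome.KEEP_RETRYING;
    }
}
```

With the flag off, a large cluster survives an extended RM outage without every NM dispatching SHUTDOWN, which is the scenario the issue description calls out.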
[jira] [Assigned] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raju Bairishetti reassigned YARN-3644: -- Assignee: Raju Bairishetti > Node manager shuts down if unable to connect with RM > > > Key: YARN-3644 > URL: https://issues.apache.org/jira/browse/YARN-3644 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Srikanth Sundarrajan >Assignee: Raju Bairishetti > > When NM is unable to connect to RM, NM shuts itself down. > {code} > } catch (ConnectException e) { > //catch and throw the exception if tried MAX wait time to connect > RM > dispatcher.getEventHandler().handle( > new NodeManagerEvent(NodeManagerEventType.SHUTDOWN)); > throw new YarnRuntimeException(e); > {code} > In large clusters, if RM is down for maintenance for longer period, all the > NMs shuts themselves down, requiring additional work to bring up the NMs. > Setting the yarn.resourcemanager.connect.wait-ms to -1 has other side > effects, where non connection failures are being retried infinitely by all > YarnClients (via RMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3695) ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception.
[ https://issues.apache.org/jira/browse/YARN-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raju Bairishetti reassigned YARN-3695: -- Assignee: Raju Bairishetti > ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception. > -- > > Key: YARN-3695 > URL: https://issues.apache.org/jira/browse/YARN-3695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junping Du >Assignee: Raju Bairishetti > > YARN-3646 fix the retry forever policy in RMProxy that it only applies on > limited exceptions rather than all exceptions. Here, we may need the same fix > for ServerProxy (NMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3695) EOFException shouldn't be retry forever in RMProxy
[ https://issues.apache.org/jira/browse/YARN-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554806#comment-14554806 ] Raju Bairishetti commented on YARN-3695: [~rohithsharma] [~djp] [~devraj.jaiman] Seems I forgot to fix the FOREVER retry policy in ServerProxy as part of [YARN-3646|https://issues.apache.org/jira/browse/YARN-3646]. ServerProxy.java: {code} if (maxWaitTime == -1) { // wait forever. return RetryPolicies.RETRY_FOREVER; } ... Map<Class<? extends Exception>, RetryPolicy> exceptionToPolicyMap = new HashMap<Class<? extends Exception>, RetryPolicy>(); exceptionToPolicyMap.put(EOFException.class, retryPolicy); exceptionToPolicyMap.put(ConnectException.class, retryPolicy); ... {code} > EOFException shouldn't be retry forever in RMProxy > -- > > Key: YARN-3695 > URL: https://issues.apache.org/jira/browse/YARN-3695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junping Du > > YARN-3646 fix the retry forever policy that it only applies on limited > exceptions rather than all exceptions. Here, we may want to review these > exceptions. At least, exception EOFException shouldn't retry forever. > {code} > exceptionToPolicyMap.put(EOFException.class, retryPolicy); > exceptionToPolicyMap.put(ConnectException.class, retryPolicy); > exceptionToPolicyMap.put(NoRouteToHostException.class, retryPolicy); > exceptionToPolicyMap.put(UnknownHostException.class, retryPolicy); > exceptionToPolicyMap.put(ConnectTimeoutException.class, retryPolicy); > exceptionToPolicyMap.put(RetriableException.class, retryPolicy); > exceptionToPolicyMap.put(SocketException.class, retryPolicy); > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
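The bug in the quoted snippet is that when maxWaitTime == -1 the method returns a blanket RETRY_FOREVER before the per-exception map is ever consulted, so even non-network exceptions retry forever. One way the fix could look, sketched with a simplified stand-in for Hadoop's RetryPolicies (the class below is illustrative, not the actual ServerProxy code): consult a per-exception map so that only network-style exceptions keep retrying and everything else fails fast.

```java
import java.io.EOFException;
import java.net.ConnectException;
import java.util.HashMap;
import java.util.Map;

// Simplified model of exception-scoped retry. Instead of a blanket
// "retry forever", only the exception classes registered in the map
// are retried; anything else (e.g. ApplicationNotFoundException-style
// application errors) fails fast.
public class RetryDecision {
    private static final Map<Class<? extends Exception>, Boolean> RETRYABLE =
        new HashMap<>();
    static {
        RETRYABLE.put(EOFException.class, true);
        RETRYABLE.put(ConnectException.class, true);
        // NoRouteToHostException, SocketException, ... would follow here.
    }

    // Exact-class lookup for simplicity; a real policy would also
    // consider exception subclasses and wrapped causes.
    public static boolean shouldRetry(Exception e) {
        return RETRYABLE.getOrDefault(e.getClass(), false);
    }
}
```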
[jira] [Updated] (YARN-3646) Applications are getting stuck some times in case of retry policy forever
[ https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raju Bairishetti updated YARN-3646: --- Attachment: YARN-3646.002.patch [~rohithsharma] Thanks for the review and comments. Attached a new patch. > Applications are getting stuck some times in case of retry policy forever > - > > Key: YARN-3646 > URL: https://issues.apache.org/jira/browse/YARN-3646 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Reporter: Raju Bairishetti > Attachments: YARN-3646.001.patch, YARN-3646.002.patch, YARN-3646.patch > > > We have set *yarn.resourcemanager.connect.wait-ms* to -1 to use the FOREVER > retry policy. > The Yarn client retries infinitely in case of exceptions from the RM, as it is > using the FOREVER retry policy. The problem is that it retries for all kinds > of exceptions (like ApplicationNotFoundException), even when the failure is not a > connection failure. Due to this, my application does not progress further. > *The Yarn client should not retry infinitely in case of non-connection failures.* > We have written a simple yarn-client which tries to get an application > report for an invalid or older appId. The ResourceManager throws an > ApplicationNotFoundException as this is an invalid or older appId. But > because of the FOREVER retry policy, the client keeps retrying to get the > application report and the ResourceManager keeps throwing > ApplicationNotFoundException. 
> {code} > private void testYarnClientRetryPolicy() throws Exception{ > YarnConfiguration conf = new YarnConfiguration(); > conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, > -1); > YarnClient yarnClient = YarnClient.createYarnClient(); > yarnClient.init(conf); > yarnClient.start(); > ApplicationId appId = ApplicationId.newInstance(1430126768987L, > 10645); > ApplicationReport report = yarnClient.getApplicationReport(appId); > } > {code} > *RM logs:* > {noformat} > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 10.14.120.231:61621 Call#875162 Retry#0 > org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application > with id 'application_1430126768987_10645' doesn't exist in RM. > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 
10.14.120.231:61621 Call#875163 Retry#0 > > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
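The behavior the report asks for (fail fast on application-level errors, retry only connection failures, and bound the retries) can be sketched as a stand-alone example. This is illustrative only, not the actual RMProxy fix; SelectiveRetry, callWithRetry, and the use of IllegalStateException as a stand-in for ApplicationNotFoundException are all invented for the sketch:

```java
import java.net.ConnectException;
import java.util.concurrent.Callable;

// Minimal sketch of the desired client behavior: retry only on connection
// failures, and rethrow application-level exceptions (such as
// ApplicationNotFoundException) immediately instead of looping forever.
public class SelectiveRetry {
    static <T> T callWithRetry(Callable<T> call, int maxAttempts) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (ConnectException e) {
                // Connectivity failure: retry up to maxAttempts, then give up.
                if (attempt >= maxAttempts) throw e;
            }
            // Any other exception is not caught here, so it propagates on
            // the very first occurrence -- no infinite retry loop.
        }
    }

    public static void main(String[] args) throws Exception {
        // A call that fails with a non-connection error is not retried.
        try {
            callWithRetry(() -> { throw new IllegalStateException("app not found"); }, 5);
        } catch (IllegalStateException e) {
            System.out.println("failed fast: " + e.getMessage());
        }
    }
}
```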
[jira] [Updated] (YARN-3646) Applications are getting stuck some times in case of retry policy forever
[ https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raju Bairishetti updated YARN-3646: --- Attachment: YARN-3646.001.patch Added a new unit test in hadoop-yarn-client. [~rohithsharma] Could you please review? Ran the test without starting the RM and the test timed out. Ran the test after starting the RM, and the client got an ApplicationNotFoundException for the older/invalid appId. {code} rm = new ResourceManager(); rm.init(conf); rm.start(); {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever
[ https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550288#comment-14550288 ] Raju Bairishetti commented on YARN-3646: Thanks [~rohithsharma] for the review. Looks like it is mainly an issue with retry policy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3646) Applications are getting stuck some times in case of retry policy forever
[ https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raju Bairishetti updated YARN-3646: --- Attachment: YARN-3646.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547544#comment-14547544 ] Raju Bairishetti commented on YARN-3644: We can have a new config like NODEMANAGER_ALIVE_ON_RM_CONNECTION_FAILURES. Based on this config value, the NM takes a decision on shutdown. In this way we can honour the existing behaviour as well. I will provide a patch shortly. I am not able to assign this to myself. Can anyone help with the assignment? > Node manager shuts down if unable to connect with RM > > > Key: YARN-3644 > URL: https://issues.apache.org/jira/browse/YARN-3644 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Srikanth Sundarrajan > > When the NM is unable to connect to the RM, the NM shuts itself down. > {code} > } catch (ConnectException e) { > //catch and throw the exception if tried MAX wait time to connect > RM > dispatcher.getEventHandler().handle( > new NodeManagerEvent(NodeManagerEventType.SHUTDOWN)); > throw new YarnRuntimeException(e); > {code} > In large clusters, if the RM is down for maintenance for a longer period, all the > NMs shut themselves down, requiring additional work to bring them back up. > Setting yarn.resourcemanager.connect.wait-ms to -1 has other side > effects, where non-connection failures are retried infinitely by all > YarnClients (via RMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever
[ https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547537#comment-14547537 ] Raju Bairishetti commented on YARN-3646: [~vinodkv] I will provide a patch shortly. I am not able to assign this to myself. Can anyone help with the assignment? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever
[ https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547529#comment-14547529 ] Raju Bairishetti commented on YARN-3646: bq. Setting RetryPolicies.RETRY_FOREVER for exceptionToPolicyMap as default policy is not sufficient, but also RetryPolicies.RetryForever.shouldRetry() should check for Connect exceptions and handle it. Otherwise shouldRetry always return RetryAction.RETRY action. Do we need to catch the exception in shouldRetry if we have a separate exceptionToPolicyMap which contains only a connection-exception entry (like exceptionToPolicyMap.put(ConnectException.class, FOREVER policy))? It seems we do not even require an exceptionToPolicyMap for the FOREVER policy if we check the exception in the shouldRetry method. Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
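The alternative floated in the comment above (moving the exception check into shouldRetry itself, so no exceptionToPolicyMap is needed for the FOREVER case) could look roughly like this. The sketch below is illustrative plain Java, not Hadoop's RetryPolicies implementation:

```java
import java.net.ConnectException;
import java.net.NoRouteToHostException;

// Sketch: a "forever" policy whose shouldRetry inspects the exception
// class itself instead of relying on a separate exception-to-policy map.
public class RetryForeverSketch {
    public enum RetryAction { RETRY, FAIL }

    public static RetryAction shouldRetry(Exception e) {
        // Retry forever only for genuine connectivity problems.
        if (e instanceof ConnectException || e instanceof NoRouteToHostException) {
            return RetryAction.RETRY;
        }
        // Application-level errors (e.g. ApplicationNotFoundException)
        // fail fast instead of looping forever.
        return RetryAction.FAIL;
    }
}
```

The trade-off versus the map-based approach is that the retryable exception list is hard-coded inside the policy rather than supplied by the caller.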
[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547040#comment-14547040 ] Raju Bairishetti commented on YARN-3644: [~sandflee] Yes, the NM should catch the exception and keep itself alive. Right now, the NM shuts itself down only in case of connection failures; it ignores all other kinds of exceptions and errors while sending heartbeats. {code} } catch (ConnectException e) { //catch and throw the exception if tried MAX wait time to connect RM dispatcher.getEventHandler().handle( new NodeManagerEvent(NodeManagerEventType.SHUTDOWN)); throw new YarnRuntimeException(e); } catch (Throwable e) { // TODO Better error handling. Thread can die with the rest of the // NM still running. LOG.error("Caught exception in status-updater", e); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
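The config-gated shutdown proposed in this thread could be sketched as below. The property key and class name are hypothetical, invented for illustration (the comment only suggests a NODEMANAGER_ALIVE_ON_RM_CONNECTION_FAILURES config); the actual patch may differ:

```java
import java.util.Properties;

// Sketch of the proposal: gate the NM's shutdown-on-RM-connection-failure
// behavior behind a configuration flag, so large clusters can keep NMs
// alive through long RM maintenance windows.
public class NodeManagerShutdownPolicy {
    // Hypothetical property name, not an actual YARN key.
    public static final String KEEP_ALIVE_ON_RM_CONN_FAILURES =
        "yarn.nodemanager.alive.on.rm.connection.failures";

    // Decide whether the NM should shut down after exhausting RM connection
    // retries. Defaulting to "false" (i.e. shut down) honours the existing
    // behavior, as the comment suggests.
    public static boolean shouldShutdown(Properties conf) {
        boolean keepAlive = Boolean.parseBoolean(
            conf.getProperty(KEEP_ALIVE_ON_RM_CONN_FAILURES, "false"));
        return !keepAlive;
    }
}
```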
[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever
[ https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544815#comment-14544815 ] Raju Bairishetti commented on YARN-3646: Thanks for the quick response. I have reproduced it with the apache 2.6.0 release (HDP 2.2.4 distribution). We are using the 2.5.0 version. We do not have an *exceptionToPolicyMap* for the FOREVER retry policy; the exceptionToPolicyMap is updated only for the other retry policies. *RetryPolicies.java* {code} static class RetryForever implements RetryPolicy { @Override public RetryAction shouldRetry(Exception e, int retries, int failovers, boolean isIdempotentOrAtMostOnce) throws Exception { return RetryAction.RETRY; } } {code} *RMProxy.java* {code} if (waitForEver) { return RetryPolicies.RETRY_FOREVER; } ... Map<Class<? extends Exception>, RetryPolicy> exceptionToPolicyMap = new HashMap<Class<? extends Exception>, RetryPolicy>(); {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3646) Applications are getting stuck some times in case of retry policy forever
Raju Bairishetti created YARN-3646: -- Summary: Applications are getting stuck some times in case of retry policy forever Key: YARN-3646 URL: https://issues.apache.org/jira/browse/YARN-3646 Project: Hadoop YARN Issue Type: Bug Components: client Reporter: Raju Bairishetti We have set *yarn.resourcemanager.connect.wait-ms* to -1 to use the FOREVER retry policy. The Yarn client retries infinitely in case of exceptions from the RM, as it is using the FOREVER retry policy. The problem is that it retries for all kinds of exceptions (like ApplicationNotFoundException), even when the failure is not a connection failure. Due to this, my application is not progressing further. *The Yarn client should not retry infinitely in case of non-connection failures.* We have written a simple yarn-client which tries to get an application report for an invalid or older appId. The ResourceManager throws an ApplicationNotFoundException as this is an invalid or older appId. But because of the FOREVER retry policy, the client keeps retrying to get the application report and the ResourceManager keeps throwing ApplicationNotFoundException. {code} private void testYarnClientRetryPolicy() throws Exception { YarnConfiguration conf = new YarnConfiguration(); conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, -1); YarnClient yarnClient = YarnClient.createYarnClient(); yarnClient.init(conf); yarnClient.start(); ApplicationId appId = ApplicationId.newInstance(1430126768987L, 10645); ApplicationReport report = yarnClient.getApplicationReport(appId); } {code} *RM logs:* {noformat} 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 10.14.120.231:61621 Call#875162 Retry#0 org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1430126768987_10645' doesn't exist in RM. 
at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145) at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 10.14.120.231:61621 Call#875163 Retry#0 {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)