[jira] [Commented] (YARN-4314) Adding container wait time as a metric at queue level and application level.

2015-10-30 Thread Raju Bairishetti (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982340#comment-14982340
 ] 

Raju Bairishetti commented on YARN-4314:


I feel adding a timestamp to each resource request will be costly, and all the 
existing applications will need to migrate to use this metric.  

Had a discussion with [~sriksun] earlier about this approach. The resource request 
is prepared by the AM. In future, if we want to use this timestamp as a priority for 
allocating resources, there is a chance that a user/AM can misuse the system 
by claiming older timestamps.

Thinking about this approach:
AppSchedulingInfo has all the scheduling info about an application. When the RM 
receives the first resource request from the AM, the RM can note down the system time 
as the resource request time. Whenever a new request comes in (i.e. 
UpdateResourceRequest() in AppSchedulingInfo or allocate()), we can measure 
how many containers have been waiting since the last request time. We 
can mostly listen on the container request & allocate events. 

I will put up a detailed doc with all my thoughts & approaches.
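
A minimal sketch of the RM-side bookkeeping described above, assuming a hypothetical ContainerWaitTracker class (not part of YARN) that is notified on container request and allocate events. The timestamp is taken from the RM's own clock rather than supplied by the AM, so an AM cannot claim an older time. All names here are illustrative only.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not a YARN class: records when container requests
// arrive at the RM and accumulates how long they waited before allocation.
class ContainerWaitTracker {
  // Arrival time (ms) of each container that is still waiting, oldest first.
  private final Deque<Long> pendingSince = new ArrayDeque<>();
  private long totalWaitMs = 0;
  private long allocatedCount = 0;

  // Called when the RM receives a request for numContainers containers.
  // nowMs is the RM's own clock, never a timestamp sent by the AM.
  void onRequest(int numContainers, long nowMs) {
    for (int i = 0; i < numContainers; i++) {
      pendingSince.addLast(nowMs);
    }
  }

  // Called when the RM allocates one container; the oldest pending
  // request is treated as the one being satisfied.
  void onAllocate(long nowMs) {
    Long requestedAt = pendingSince.pollFirst();
    if (requestedAt == null) {
      return; // nothing was pending, so there is no wait to record
    }
    totalWaitMs += nowMs - requestedAt;
    allocatedCount++;
  }

  int pendingContainers() {
    return pendingSince.size();
  }

  long totalWaitMs() {
    return totalWaitMs;
  }

  double averageWaitMs() {
    return allocatedCount == 0 ? 0.0 : (double) totalWaitMs / allocatedCount;
  }
}
```

Keeping one entry per waiting container is the simplest form; a real implementation would probably aggregate per priority or per queue to keep the cost low.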

> Adding container wait time as a metric at queue level and application level.
> 
>
> Key: YARN-4314
> URL: https://issues.apache.org/jira/browse/YARN-4314
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Lavkesh Lahngir
>Assignee: Lavkesh Lahngir
>
> There is a need for adding the container wait-time which can be tracked at 
> the queue and application level. 
> An application can have two kinds of wait times. One is AM wait time after 
> submission and another is total container wait time between AM asking for 
> containers and getting them. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-473) Capacity Scheduler webpage and REST API not showing correct number of pending applications

2015-08-06 Thread Raju Bairishetti (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14659695#comment-14659695
 ] 

Raju Bairishetti commented on YARN-473:
---

The REST API is showing the wrong number of pending applications because it populates 
the value from the *pendingApplications set* (i.e. the size of the pendingApplications 
set). IMO, it should get the value from queue metrics instead of from 
in-memory data structures.

LeafQueue : 
{noformat}
1) *pendingApplications* : all submitted applications are added to this set 
first. 

2) *activeApplications* : applications are moved to this set from the 
pendingApplications set if the number of active applications is less than 
the maximum number of active applications. 
{noformat}

The JMX metrics show the appsPending metric correctly. AppsPending is 
incremented when the application is submitted to the queue and decremented only 
when the application is actually launched (i.e. some resources have been allocated to it). 

IMO, the REST call for queue info should also use queue metrics instead of 
depending on other data structures. Applications are removed from the 
pendingApplications set before the application is launched.

The pendingApplications set contains the applications that are not yet 
schedulable at this moment.

One more interesting fact: CapacitySchedulerPage (in the UI) has "Num 
Non-Schedulable Applications" and gets the value from the pendingApplications 
set. The REST API call is correct in this case.

I am thinking of a couple of approaches to fix this issue:
{noformat}
1) Rename pendingApplications to nonSchedulableApplications in the 
CapacitySchedulerLeafQueueInfo class, introduce a new field 
(pendingApplications) in CapacitySchedulerLeafQueueInfo, and get its value 
from QueueMetrics.

   Refactor activeApplications to schedulableApplications and introduce a 
new field (activeApplications) which tells exactly how many AMs there 
are.

2) Refactor "Num Non-Schedulable Applications:" to "Num of Pending 
Applications" and refactor "Num Schedulable Applications" to 
activeApplications. Update them through QueueMetrics instead of depending 
on in-memory data structures in the LeafQueue to get the values.
{noformat}
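
To illustrate why the size of the pendingApplications set and the appsPending metric diverge, here is a small self-contained sketch. LeafQueueSketch and all of its members are hypothetical stand-ins, not the real LeafQueue or QueueMetrics code; it only models the behaviour described in this comment.

```java
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical model of the counters discussed above; not the real LeafQueue.
class LeafQueueSketch {
  private final int maxActiveApplications;
  private final Set<String> pendingApplications = new LinkedHashSet<>();
  private final Set<String> activeApplications = new LinkedHashSet<>();
  private final Set<String> launchedApplications = new LinkedHashSet<>();
  // Stands in for the appsPending counter kept by the queue metrics.
  private int appsPendingMetric = 0;

  LeafQueueSketch(int maxActiveApplications) {
    this.maxActiveApplications = maxActiveApplications;
  }

  // Every submitted app enters the pending set first and bumps the metric.
  void submit(String appId) {
    if (pendingApplications.contains(appId) || activeApplications.contains(appId)) {
      return; // already known to the queue
    }
    pendingApplications.add(appId);
    appsPendingMetric++;
    activate();
  }

  // Move apps from the pending set to the active set while there is room.
  private void activate() {
    Iterator<String> it = pendingApplications.iterator();
    while (it.hasNext() && activeApplications.size() < maxActiveApplications) {
      activeApplications.add(it.next());
      it.remove();
    }
  }

  // The metric drops only when the app has actually been launched.
  void launched(String appId) {
    if (activeApplications.contains(appId) && launchedApplications.add(appId)) {
      appsPendingMetric--;
    }
  }

  int pendingSetSize() {
    return pendingApplications.size();
  }

  int appsPending() {
    return appsPendingMetric;
  }
}
```

With room for ten active applications, submitting two apps leaves pendingSetSize() at 0 while appsPending() is 2, which matches the "almost always zero" value reported for the REST API.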


> Capacity Scheduler webpage and REST API not showing correct number of pending 
> applications
> --
>
> Key: YARN-473
> URL: https://issues.apache.org/jira/browse/YARN-473
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 0.23.6
>Reporter: Kendall Thrapp
>Assignee: Mit Desai
>  Labels: usability
>
> The Capacity Scheduler REST API 
> (http://hadoop.apache.org/docs/r0.23.6/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Scheduler_API)
>  is not returning the correct number of pending applications.  
> numPendingApplications is almost always zero, even if there are dozens of 
> pending apps.
> In investigating this, I discovered that the Resource Manager's Scheduler 
> webpage is also showing an incorrect but different number of pending 
> applications.  For example, the cluster I'm looking at right now currently 
> has 15 applications in the ACCEPTED state, but the Cluster Metrics table near 
> the top of the page says there are only 2 pending apps.  The REST API says 
> there are zero pending apps.





[jira] [Assigned] (YARN-3972) Work Preserving AM Restart for MapReduce

2015-07-24 Thread Raju Bairishetti (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raju Bairishetti reassigned YARN-3972:
--

Assignee: Raju Bairishetti

> Work Preserving AM Restart for MapReduce
> 
>
> Key: YARN-3972
> URL: https://issues.apache.org/jira/browse/YARN-3972
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Srikanth Sampath
>Assignee: Raju Bairishetti
>
> A framework for work-preserving AM restart was provided in 
> [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489].  We would like 
> to take advantage of this for MapReduce (MR) applications.  There are some 
> challenges, which are described in the attached document along with a few 
> options.  We solicit feedback from the community.





[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-07-11 Thread Raju Bairishetti (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14623462#comment-14623462
 ] 

Raju Bairishetti commented on YARN-3644:


Thanks [~varun_saxena] for the review and comments.

bq. The config name is yarn.nodemanager.shutdown.on.RM.connection.failures. All 
our config names are in lowercase, just for the sake of consistency, maybe RM 
can be in lowercase too. Thoughts?
  Agree. Will change it to lower case.

bq. The test doesn't really check whether ConnectionException was thrown or 
whether the NM shutdown event was called.
   I also ran the test in debugger mode. The test is hitting all the source 
changes. *I agree, I will rewrite this test using Mockito to make it more 
generic*

> Node manager shuts down if unable to connect with RM
> 
>
> Key: YARN-3644
> URL: https://issues.apache.org/jira/browse/YARN-3644
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Srikanth Sundarrajan
>Assignee: Raju Bairishetti
> Attachments: YARN-3644.001.patch, YARN-3644.001.patch, 
> YARN-3644.002.patch, YARN-3644.003.patch, YARN-3644.patch
>
>
> When NM is unable to connect to RM, NM shuts itself down.
> {code}
>   } catch (ConnectException e) {
>     // catch and throw the exception if tried MAX wait time to connect RM
>     dispatcher.getEventHandler().handle(
>         new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
>     throw new YarnRuntimeException(e);
> {code}
> In large clusters, if the RM is down for maintenance for a longer period, all the 
> NMs shut themselves down, requiring additional work to bring up the NMs.
> Setting yarn.resourcemanager.connect.wait-ms to -1 has other side 
> effects, where non-connection failures are retried infinitely by all 
> YarnClients (via RMProxy).





[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-07-05 Thread Raju Bairishetti (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614591#comment-14614591
 ] 

Raju Bairishetti commented on YARN-3644:


[~Naganarasimha] Could you kindly review the latest patch?

> Node manager shuts down if unable to connect with RM
> 
>
> Key: YARN-3644
> URL: https://issues.apache.org/jira/browse/YARN-3644
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Srikanth Sundarrajan
>Assignee: Raju Bairishetti
> Attachments: YARN-3644.001.patch, YARN-3644.001.patch, 
> YARN-3644.002.patch, YARN-3644.003.patch, YARN-3644.patch
>
>
> When NM is unable to connect to RM, NM shuts itself down.
> {code}
>   } catch (ConnectException e) {
>     // catch and throw the exception if tried MAX wait time to connect RM
>     dispatcher.getEventHandler().handle(
>         new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
>     throw new YarnRuntimeException(e);
> {code}
> In large clusters, if the RM is down for maintenance for a longer period, all the 
> NMs shut themselves down, requiring additional work to bring up the NMs.
> Setting yarn.resourcemanager.connect.wait-ms to -1 has other side 
> effects, where non-connection failures are retried infinitely by all 
> YarnClients (via RMProxy).





[jira] [Created] (YARN-3886) Add cumulative wait times of apps at Queue level

2015-07-05 Thread Raju Bairishetti (JIRA)
Raju Bairishetti created YARN-3886:
--

 Summary: Add cumulative wait times of apps at Queue level
 Key: YARN-3886
 URL: https://issues.apache.org/jira/browse/YARN-3886
 Project: Hadoop YARN
  Issue Type: Task
  Components: yarn
Reporter: Raju Bairishetti
Assignee: Raju Bairishetti


Right now, we have the number of apps submitted/failed/killed/running at the 
queue level. We don't have any way to find out in which queue apps are waiting 
longer. 

I hope adding wait times of apps at the queue level will be helpful in viewing the 
overall queue status.





[jira] [Updated] (YARN-3695) ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception.

2015-06-26 Thread Raju Bairishetti (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raju Bairishetti updated YARN-3695:
---
Attachment: YARN-3695.01.patch

[~jianhe] Thanks for the review.

Moved the Preconditions checks before creating the RetryPolicy, so that we can avoid 
creating the policy if the connection timeout values are invalid.
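
The ordering described above can be sketched as follows. This is only an illustration under stated assumptions: RetrySettings is a hypothetical placeholder for the policy object, the exact checks in the patch may differ, and plain IllegalArgumentException is used here instead of a preconditions library so the example stays self-contained.

```java
// Hypothetical sketch, not the ServerProxy code: check the timeout values
// first, and only then construct the (placeholder) retry settings.
final class RetrySettings {
  final long maxWaitMs;       // -1 means "wait forever", as in this thread
  final long retryIntervalMs;

  private RetrySettings(long maxWaitMs, long retryIntervalMs) {
    this.maxWaitMs = maxWaitMs;
    this.retryIntervalMs = retryIntervalMs;
  }

  static RetrySettings create(long maxWaitMs, long retryIntervalMs) {
    // Validation happens before any policy object is built.
    if (maxWaitMs != -1 && maxWaitMs <= 0) {
      throw new IllegalArgumentException("Invalid max wait time: " + maxWaitMs);
    }
    if (retryIntervalMs <= 0) {
      throw new IllegalArgumentException("Invalid retry interval: " + retryIntervalMs);
    }
    return new RetrySettings(maxWaitMs, retryIntervalMs);
  }

  boolean retriesForever() {
    return maxWaitMs == -1;
  }
}
```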

> ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception.
> --
>
> Key: YARN-3695
> URL: https://issues.apache.org/jira/browse/YARN-3695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Raju Bairishetti
> Attachments: YARN-3695.01.patch, YARN-3695.patch
>
>
> YARN-3646 fixed the retry-forever policy in RMProxy so that it only applies to a 
> limited set of exceptions rather than all exceptions. Here, we may need the same fix 
> for ServerProxy (NMProxy).





[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-06-26 Thread Raju Bairishetti (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603363#comment-14603363
 ] 

Raju Bairishetti commented on YARN-3644:


It seems the checkstyle error was not introduced by this patch. The file 
already had more than 2,000 lines :)
*Checkstyle error:*  YarnConfiguration.java:1: File length is 2,036 lines (max 
allowed is 2,000).

> Node manager shuts down if unable to connect with RM
> 
>
> Key: YARN-3644
> URL: https://issues.apache.org/jira/browse/YARN-3644
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Srikanth Sundarrajan
>Assignee: Raju Bairishetti
> Attachments: YARN-3644.001.patch, YARN-3644.001.patch, 
> YARN-3644.002.patch, YARN-3644.003.patch, YARN-3644.patch
>
>
> When NM is unable to connect to RM, NM shuts itself down.
> {code}
>   } catch (ConnectException e) {
>     // catch and throw the exception if tried MAX wait time to connect RM
>     dispatcher.getEventHandler().handle(
>         new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
>     throw new YarnRuntimeException(e);
> {code}
> In large clusters, if the RM is down for maintenance for a longer period, all the 
> NMs shut themselves down, requiring additional work to bring up the NMs.
> Setting yarn.resourcemanager.connect.wait-ms to -1 has other side 
> effects, where non-connection failures are retried infinitely by all 
> YarnClients (via RMProxy).





[jira] [Updated] (YARN-3695) ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception.

2015-06-26 Thread Raju Bairishetti (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raju Bairishetti updated YARN-3695:
---
Attachment: YARN-3695.patch

> ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception.
> --
>
> Key: YARN-3695
> URL: https://issues.apache.org/jira/browse/YARN-3695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Raju Bairishetti
> Attachments: YARN-3695.patch
>
>
> YARN-3646 fixed the retry-forever policy in RMProxy so that it only applies to a 
> limited set of exceptions rather than all exceptions. Here, we may need the same fix 
> for ServerProxy (NMProxy).





[jira] [Updated] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-06-26 Thread Raju Bairishetti (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raju Bairishetti updated YARN-3644:
---
Attachment: YARN-3644.003.patch

Fixed the test case to work with the newly added changes in trunk. Overrode the 
unRegisterNodeManager(request) method in the MyResourceTracker8 class.


> Node manager shuts down if unable to connect with RM
> 
>
> Key: YARN-3644
> URL: https://issues.apache.org/jira/browse/YARN-3644
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Srikanth Sundarrajan
>Assignee: Raju Bairishetti
> Attachments: YARN-3644.001.patch, YARN-3644.001.patch, 
> YARN-3644.002.patch, YARN-3644.003.patch, YARN-3644.patch
>
>
> When NM is unable to connect to RM, NM shuts itself down.
> {code}
>   } catch (ConnectException e) {
>     // catch and throw the exception if tried MAX wait time to connect RM
>     dispatcher.getEventHandler().handle(
>         new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
>     throw new YarnRuntimeException(e);
> {code}
> In large clusters, if the RM is down for maintenance for a longer period, all the 
> NMs shut themselves down, requiring additional work to bring up the NMs.
> Setting yarn.resourcemanager.connect.wait-ms to -1 has other side 
> effects, where non-connection failures are retried infinitely by all 
> YarnClients (via RMProxy).





[jira] [Updated] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-06-24 Thread Raju Bairishetti (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raju Bairishetti updated YARN-3644:
---
Attachment: YARN-3644.002.patch

> Node manager shuts down if unable to connect with RM
> 
>
> Key: YARN-3644
> URL: https://issues.apache.org/jira/browse/YARN-3644
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Srikanth Sundarrajan
>Assignee: Raju Bairishetti
> Attachments: YARN-3644.001.patch, YARN-3644.001.patch, 
> YARN-3644.002.patch, YARN-3644.patch
>
>
> When NM is unable to connect to RM, NM shuts itself down.
> {code}
>   } catch (ConnectException e) {
>     // catch and throw the exception if tried MAX wait time to connect RM
>     dispatcher.getEventHandler().handle(
>         new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
>     throw new YarnRuntimeException(e);
> {code}
> In large clusters, if the RM is down for maintenance for a longer period, all the 
> NMs shut themselves down, requiring additional work to bring up the NMs.
> Setting yarn.resourcemanager.connect.wait-ms to -1 has other side 
> effects, where non-connection failures are retried infinitely by all 
> YarnClients (via RMProxy).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-06-24 Thread Raju Bairishetti (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raju Bairishetti updated YARN-3644:
---
Attachment: YARN-3644.001.patch

Created a jira [YARN-3847|https://issues.apache.org/jira/browse/YARN-3847] for 
*refactoring the full test class*.

[~Naganarasimha] Fixed a couple of the review comments. Is it fine to refactor 
the test as part of [YARN-3847|https://issues.apache.org/jira/browse/YARN-3847]?




> Node manager shuts down if unable to connect with RM
> 
>
> Key: YARN-3644
> URL: https://issues.apache.org/jira/browse/YARN-3644
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Srikanth Sundarrajan
>Assignee: Raju Bairishetti
> Attachments: YARN-3644.001.patch, YARN-3644.001.patch, YARN-3644.patch
>
>
> When NM is unable to connect to RM, NM shuts itself down.
> {code}
>   } catch (ConnectException e) {
>     // catch and throw the exception if tried MAX wait time to connect RM
>     dispatcher.getEventHandler().handle(
>         new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
>     throw new YarnRuntimeException(e);
> {code}
> In large clusters, if the RM is down for maintenance for a longer period, all the 
> NMs shut themselves down, requiring additional work to bring up the NMs.
> Setting yarn.resourcemanager.connect.wait-ms to -1 has other side 
> effects, where non-connection failures are retried infinitely by all 
> YarnClients (via RMProxy).





[jira] [Updated] (YARN-3847) Refactor TestNodeStatusUpdater

2015-06-24 Thread Raju Bairishetti (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raju Bairishetti updated YARN-3847:
---
Labels: test  (was: )

> Refactor TestNodeStatusUpdater
> --
>
> Key: YARN-3847
> URL: https://issues.apache.org/jira/browse/YARN-3847
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Raju Bairishetti
>Assignee: Raju Bairishetti
>  Labels: test
>
> It seems there is a lot of duplicated/redundant code in 
> TestNodeStatusUpdater.java. 
> This jira is for removing the redundant code from TestNodeStatusUpdater.





[jira] [Updated] (YARN-3847) Refactor TestNodeStatusUpdater

2015-06-24 Thread Raju Bairishetti (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raju Bairishetti updated YARN-3847:
---
Component/s: nodemanager

> Refactor TestNodeStatusUpdater
> --
>
> Key: YARN-3847
> URL: https://issues.apache.org/jira/browse/YARN-3847
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Raju Bairishetti
>Assignee: Raju Bairishetti
>  Labels: test
>
> It seems there is a lot of duplicated/redundant code in 
> TestNodeStatusUpdater.java. 
> This jira is for removing the redundant code from TestNodeStatusUpdater.





[jira] [Created] (YARN-3847) Refactor TestNodeStatusUpdater

2015-06-24 Thread Raju Bairishetti (JIRA)
Raju Bairishetti created YARN-3847:
--

 Summary: Refactor TestNodeStatusUpdater
 Key: YARN-3847
 URL: https://issues.apache.org/jira/browse/YARN-3847
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Raju Bairishetti
Assignee: Raju Bairishetti


It seems there is a lot of duplicated/redundant code in TestNodeStatusUpdater.java. 

This jira is for removing the redundant code from TestNodeStatusUpdater.






[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-06-12 Thread Raju Bairishetti (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583189#comment-14583189
 ] 

Raju Bairishetti commented on YARN-3644:


[~amareshwari] [~Naganarasimha] Thanks for the review and comments.

[~Naganarasimha] Yes,  this jira is only to make NM wait for RM.

> Node manager shuts down if unable to connect with RM
> 
>
> Key: YARN-3644
> URL: https://issues.apache.org/jira/browse/YARN-3644
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Srikanth Sundarrajan
>Assignee: Raju Bairishetti
> Attachments: YARN-3644.001.patch, YARN-3644.patch
>
>
> When NM is unable to connect to RM, NM shuts itself down.
> {code}
>   } catch (ConnectException e) {
>     // catch and throw the exception if tried MAX wait time to connect RM
>     dispatcher.getEventHandler().handle(
>         new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
>     throw new YarnRuntimeException(e);
> {code}
> In large clusters, if the RM is down for maintenance for a longer period, all the 
> NMs shut themselves down, requiring additional work to bring up the NMs.
> Setting yarn.resourcemanager.connect.wait-ms to -1 has other side 
> effects, where non-connection failures are retried infinitely by all 
> YarnClients (via RMProxy).





[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-06-11 Thread Raju Bairishetti (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582898#comment-14582898
 ] 

Raju Bairishetti commented on YARN-3644:


Could anyone please review the patch?

> Node manager shuts down if unable to connect with RM
> 
>
> Key: YARN-3644
> URL: https://issues.apache.org/jira/browse/YARN-3644
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Srikanth Sundarrajan
>Assignee: Raju Bairishetti
> Attachments: YARN-3644.001.patch, YARN-3644.patch
>
>
> When NM is unable to connect to RM, NM shuts itself down.
> {code}
>   } catch (ConnectException e) {
>     // catch and throw the exception if tried MAX wait time to connect RM
>     dispatcher.getEventHandler().handle(
>         new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
>     throw new YarnRuntimeException(e);
> {code}
> In large clusters, if the RM is down for maintenance for a longer period, all the 
> NMs shut themselves down, requiring additional work to bring up the NMs.
> Setting yarn.resourcemanager.connect.wait-ms to -1 has other side 
> effects, where non-connection failures are retried infinitely by all 
> YarnClients (via RMProxy).





[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-27 Thread Raju Bairishetti (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562206#comment-14562206
 ] 

Raju Bairishetti commented on YARN-3644:


[~hex108] Is there any pending work on this jira that led you to assign it to 
yourself, or was it assigned to you by mistake?


> Node manager shuts down if unable to connect with RM
> 
>
> Key: YARN-3644
> URL: https://issues.apache.org/jira/browse/YARN-3644
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Srikanth Sundarrajan
>Assignee: Jun Gong
> Attachments: YARN-3644.001.patch, YARN-3644.patch
>
>
> When NM is unable to connect to RM, NM shuts itself down.
> {code}
>   } catch (ConnectException e) {
>     // catch and throw the exception if tried MAX wait time to connect RM
>     dispatcher.getEventHandler().handle(
>         new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
>     throw new YarnRuntimeException(e);
> {code}
> In large clusters, if the RM is down for maintenance for a longer period, all the 
> NMs shut themselves down, requiring additional work to bring up the NMs.
> Setting yarn.resourcemanager.connect.wait-ms to -1 has other side 
> effects, where non-connection failures are retried infinitely by all 
> YarnClients (via RMProxy).





[jira] [Updated] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-26 Thread Raju Bairishetti (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raju Bairishetti updated YARN-3644:
---
Attachment: YARN-3644.001.patch

> Node manager shuts down if unable to connect with RM
> 
>
> Key: YARN-3644
> URL: https://issues.apache.org/jira/browse/YARN-3644
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Srikanth Sundarrajan
>Assignee: Raju Bairishetti
> Attachments: YARN-3644.001.patch, YARN-3644.patch
>
>
> When NM is unable to connect to RM, NM shuts itself down.
> {code}
>   } catch (ConnectException e) {
>     // catch and throw the exception if tried MAX wait time to connect RM
>     dispatcher.getEventHandler().handle(
>         new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
>     throw new YarnRuntimeException(e);
> {code}
> In large clusters, if the RM is down for maintenance for a longer period, all the 
> NMs shut themselves down, requiring additional work to bring up the NMs.
> Setting yarn.resourcemanager.connect.wait-ms to -1 has other side 
> effects, where non-connection failures are retried infinitely by all 
> YarnClients (via RMProxy).





[jira] [Updated] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-26 Thread Raju Bairishetti (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raju Bairishetti updated YARN-3644:
---
Attachment: YARN-3644.patch

Introduced a new config **NODEMANAGER_SHUTSDWON_ON_RM_CONNECTION_FAILURES** to 
allow users to decide whether the NM should shut down when it is not able 
to connect to the RM.

Keeping the default value as true to honour the current behavior.
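
A rough sketch of how such a flag could gate the shutdown, not the actual patch. NodeStatusUpdaterSketch is a hypothetical class, java.util.Properties stands in for the YARN configuration object, and the key name is the one quoted in the later review comment on this jira (yarn.nodemanager.shutdown.on.RM.connection.failures); the default of true follows the comment above.

```java
import java.net.ConnectException;
import java.util.Properties;

// Hypothetical stand-in for the NM status updater; illustrative only.
class NodeStatusUpdaterSketch {
  // Key name as quoted in the review discussion of this jira.
  static final String SHUTDOWN_ON_RM_CONNECTION_FAILURES =
      "yarn.nodemanager.shutdown.on.RM.connection.failures";

  private final Properties conf;
  // Records whether a shutdown event would have been dispatched.
  boolean shutdownRequested = false;

  NodeStatusUpdaterSketch(Properties conf) {
    this.conf = conf;
  }

  // Called once the maximum wait time for connecting to the RM is exhausted.
  void onConnectFailure(ConnectException e) {
    // Default is true, which preserves the existing shutdown behaviour.
    boolean shutdown = Boolean.parseBoolean(
        conf.getProperty(SHUTDOWN_ON_RM_CONNECTION_FAILURES, "true"));
    if (shutdown) {
      shutdownRequested = true;
      throw new RuntimeException(e);
    }
    // Flag disabled: the NM stays up and can reconnect once the RM returns.
  }
}
```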

> Node manager shuts down if unable to connect with RM
> 
>
> Key: YARN-3644
> URL: https://issues.apache.org/jira/browse/YARN-3644
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Srikanth Sundarrajan
>Assignee: Raju Bairishetti
> Attachments: YARN-3644.patch
>
>
> When NM is unable to connect to RM, NM shuts itself down.
> {code}
>   } catch (ConnectException e) {
>     // catch and throw the exception if tried MAX wait time to connect RM
>     dispatcher.getEventHandler().handle(
>         new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
>     throw new YarnRuntimeException(e);
> {code}
> In large clusters, if the RM is down for maintenance for a longer period, all the 
> NMs shut themselves down, requiring additional work to bring up the NMs.
> Setting yarn.resourcemanager.connect.wait-ms to -1 has other side 
> effects, where non-connection failures are retried infinitely by all 
> YarnClients (via RMProxy).





[jira] [Assigned] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-22 Thread Raju Bairishetti (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raju Bairishetti reassigned YARN-3644:
--

Assignee: Raju Bairishetti

> Node manager shuts down if unable to connect with RM
> 
>
> Key: YARN-3644
> URL: https://issues.apache.org/jira/browse/YARN-3644
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Srikanth Sundarrajan
>Assignee: Raju Bairishetti
>
> When NM is unable to connect to RM, NM shuts itself down.
> {code}
>   } catch (ConnectException e) {
>     // catch and throw the exception if tried MAX wait time to connect RM
>     dispatcher.getEventHandler().handle(
>         new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
>     throw new YarnRuntimeException(e);
> {code}
> In large clusters, if the RM is down for maintenance for a longer period, all the 
> NMs shut themselves down, requiring additional work to bring up the NMs.
> Setting yarn.resourcemanager.connect.wait-ms to -1 has other side 
> effects, where non-connection failures are retried infinitely by all 
> YarnClients (via RMProxy).





[jira] [Assigned] (YARN-3695) ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception.

2015-05-21 Thread Raju Bairishetti (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raju Bairishetti reassigned YARN-3695:
--

Assignee: Raju Bairishetti

> ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception.
> --
>
> Key: YARN-3695
> URL: https://issues.apache.org/jira/browse/YARN-3695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Raju Bairishetti
>
> YARN-3646 fixed the retry-forever policy in RMProxy so that it only applies to a 
> limited set of exceptions rather than all exceptions. Here, we may need the same fix 
> for ServerProxy (NMProxy).





[jira] [Commented] (YARN-3695) EOFException shouldn't be retry forever in RMProxy

2015-05-21 Thread Raju Bairishetti (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554806#comment-14554806
 ] 

Raju Bairishetti commented on YARN-3695:


[~rohithsharma] [~djp] [~devraj.jaiman] It seems I forgot to fix the FOREVER 
retry policy in ServerProxy as part of 
[YARN-3646|https://issues.apache.org/jira/browse/YARN-3646]

ServerProxy.java
{code}
if (maxWaitTime == -1) {
  // wait forever.
  return RetryPolicies.RETRY_FOREVER;
}

   ...

Map<Class<? extends Exception>, RetryPolicy> exceptionToPolicyMap =
    new HashMap<Class<? extends Exception>, RetryPolicy>();
exceptionToPolicyMap.put(EOFException.class, retryPolicy);
exceptionToPolicyMap.put(ConnectException.class, retryPolicy);
...
{code}
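
To make the intended behaviour concrete, here is a self-contained sketch of choosing a policy per exception type, with everything that is not listed falling back to "fail". RetryByExceptionSketch and its Policy enum are hypothetical and do not use the Hadoop retry classes; the two map entries simply mirror the ones shown in the snippet above, and matching subclasses via isInstance is a choice made for this sketch.

```java
import java.io.EOFException;
import java.net.ConnectException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of per-exception retry selection.
class RetryByExceptionSketch {
  enum Policy { RETRY_FOREVER, FAIL }

  private final Map<Class<? extends Exception>, Policy> exceptionToPolicy =
      new LinkedHashMap<>();
  // Anything not listed in the map is not retried.
  private final Policy defaultPolicy = Policy.FAIL;

  RetryByExceptionSketch() {
    // Same two entries as in the snippet above.
    exceptionToPolicy.put(EOFException.class, Policy.RETRY_FOREVER);
    exceptionToPolicy.put(ConnectException.class, Policy.RETRY_FOREVER);
  }

  // Returns the policy of the first listed type that the exception is an
  // instance of, or the default when no entry matches.
  Policy policyFor(Exception e) {
    for (Map.Entry<Class<? extends Exception>, Policy> entry
        : exceptionToPolicy.entrySet()) {
      if (entry.getKey().isInstance(e)) {
        return entry.getValue();
      }
    }
    return defaultPolicy;
  }
}
```

With this shape, a connection failure keeps retrying while an application-level error surfaces immediately instead of looping forever.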

> EOFException shouldn't be retry forever in RMProxy
> --
>
> Key: YARN-3695
> URL: https://issues.apache.org/jira/browse/YARN-3695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Junping Du
>
> YARN-3646 fixed the retry-forever policy so that it only applies to a limited 
> set of exceptions rather than all exceptions. Here, we may want to review these 
> exceptions. At the least, EOFException shouldn't be retried forever.
> {code}
> exceptionToPolicyMap.put(EOFException.class, retryPolicy);
> exceptionToPolicyMap.put(ConnectException.class, retryPolicy);
> exceptionToPolicyMap.put(NoRouteToHostException.class, retryPolicy);
> exceptionToPolicyMap.put(UnknownHostException.class, retryPolicy);
> exceptionToPolicyMap.put(ConnectTimeoutException.class, retryPolicy);
> exceptionToPolicyMap.put(RetriableException.class, retryPolicy);
> exceptionToPolicyMap.put(SocketException.class, retryPolicy);
> {code}





[jira] [Updated] (YARN-3646) Applications are getting stuck some times in case of retry policy forever

2015-05-20 Thread Raju Bairishetti (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raju Bairishetti updated YARN-3646:
---
Attachment: YARN-3646.002.patch

[~rohithsharma] Thanks for the review and comments. Attached a new patch.

> Applications are getting stuck some times in case of retry policy forever
> -
>
> Key: YARN-3646
> URL: https://issues.apache.org/jira/browse/YARN-3646
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Reporter: Raju Bairishetti
> Attachments: YARN-3646.001.patch, YARN-3646.002.patch, YARN-3646.patch
>
>
> We have set *yarn.resourcemanager.connect.max-wait.ms* to -1 to use the 
> FOREVER retry policy.
> The YARN client retries infinitely on exceptions from the RM because it uses 
> the FOREVER retry policy. The problem is that it retries on all kinds of 
> exceptions (such as ApplicationNotFoundException), even when the failure is 
> not a connection failure. As a result, my application makes no further 
> progress.
> *The YARN client should not retry infinitely on non-connection failures.*
> We wrote a simple YARN client that tries to get an application report for an 
> invalid or old appId. The ResourceManager throws an 
> ApplicationNotFoundException because the appId is invalid or old. But because 
> of the FOREVER retry policy, the client keeps retrying the request and the 
> ResourceManager keeps throwing ApplicationNotFoundException.
> {code}
> private void testYarnClientRetryPolicy() throws Exception {
>   YarnConfiguration conf = new YarnConfiguration();
>   // -1 selects the FOREVER retry policy
>   conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, -1);
>   YarnClient yarnClient = YarnClient.createYarnClient();
>   yarnClient.init(conf);
>   yarnClient.start();
>   ApplicationId appId = ApplicationId.newInstance(1430126768987L, 10645);
>   ApplicationReport report = yarnClient.getApplicationReport(appId);
> }
> {code}
> *RM logs:*
> {noformat}
> 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 10.14.120.231:61621 Call#875162 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1430126768987_10645' doesn't exist in RM.
>   at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284)
>   at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
>   at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> 
> 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 10.14.120.231:61621 Call#875163 Retry#0
> 
> {noformat}





[jira] [Updated] (YARN-3646) Applications are getting stuck some times in case of retry policy forever

2015-05-19 Thread Raju Bairishetti (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raju Bairishetti updated YARN-3646:
---
Attachment: YARN-3646.001.patch

Added a new unit test in hadoop-yarn-client. [~rohithsharma] Could you please 
review?

Ran the test without starting the RM, and the test timed out.

Ran the test with the RM started, and the client gets an 
ApplicationNotFoundException for an older/invalid appId.
{code}
  rm = new ResourceManager();
  rm.init(conf);
  rm.start();
{code}



[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever

2015-05-19 Thread Raju Bairishetti (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550288#comment-14550288
 ] 

Raju Bairishetti commented on YARN-3646:


Thanks [~rohithsharma] for the review.

It looks like this is mainly an issue with the retry policy.





[jira] [Updated] (YARN-3646) Applications are getting stuck some times in case of retry policy forever

2015-05-19 Thread Raju Bairishetti (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raju Bairishetti updated YARN-3646:
---
Attachment: YARN-3646.patch



[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-17 Thread Raju Bairishetti (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547544#comment-14547544
 ] 

Raju Bairishetti commented on YARN-3644:


Can we add a new config such as NODEMANAGER_ALIVE_ON_RM_CONNECTION_FAILURES? 
Based on its value, the NM would decide whether to shut down. That way we can 
honour the existing behaviour as well.

I will provide a patch shortly. I am not able to assign the issue to myself. 
Can anyone help me with the assignment?
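
A rough sketch of how such a flag could gate the shutdown decision, assuming a boolean setting that defaults to the current behaviour. The property name and the NmShutdownDecisionSketch class are made up for this illustration and do not exist in YARN. java.util.Properties stands in for the real configuration object.

```java
import java.util.Properties;

class NmShutdownDecisionSketch {
  // Hypothetical property name, invented for this sketch only.
  static final String KEEP_ALIVE_KEY =
      "yarn.nodemanager.alive-on-rm-connection-failures";

  enum Action { SHUTDOWN, KEEP_RUNNING }

  /**
   * Decides what the NM does after exhausting its RM connection retries.
   * The flag defaults to false so the existing shutdown behaviour is kept.
   */
  static Action onRmConnectFailure(Properties conf) {
    boolean keepAlive =
        Boolean.parseBoolean(conf.getProperty(KEEP_ALIVE_KEY, "false"));
    return keepAlive ? Action.KEEP_RUNNING : Action.SHUTDOWN;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    System.out.println(onRmConnectFailure(conf));   // flag unset
    conf.setProperty(KEEP_ALIVE_KEY, "true");
    System.out.println(onRmConnectFailure(conf));   // flag enabled
  }
}
```

Defaulting the flag to false means clusters that do not set it see no change in behaviour.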

> Node manager shuts down if unable to connect with RM
> 
>
> Key: YARN-3644
> URL: https://issues.apache.org/jira/browse/YARN-3644
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Srikanth Sundarrajan
>
> When the NM is unable to connect to the RM, the NM shuts itself down.
> {code}
>   } catch (ConnectException e) {
>     // catch and throw the exception if tried MAX wait time to connect RM
>     dispatcher.getEventHandler().handle(
>         new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
>     throw new YarnRuntimeException(e);
> {code}
> In large clusters, if the RM is down for maintenance for a long period, all 
> the NMs shut themselves down, requiring additional work to bring them back up.
> Setting yarn.resourcemanager.connect.max-wait.ms to -1 has other side 
> effects: non-connection failures are retried infinitely by all YarnClients 
> (via RMProxy).





[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever

2015-05-17 Thread Raju Bairishetti (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547537#comment-14547537
 ] 

Raju Bairishetti commented on YARN-3646:


[~vinodkv] I will provide a patch shortly. 
I am not able to assign the issue to myself. Can anyone help me with the 
assignment?

 



[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever

2015-05-17 Thread Raju Bairishetti (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547529#comment-14547529
 ] 

Raju Bairishetti commented on YARN-3646:


bq. Setting RetryPolicies.RETRY_FOREVER for exceptionToPolicyMap as default 
policy is not sufficient, but also RetryPolicies.RetryForever.shouldRetry() 
should check for Connect exceptions and handle it. Otherwise shouldRetry always 
return RetryAction.RETRY action.

Do we need to catch the exception in shouldRetry if we have a separate 
exceptionToPolicyMap that contains only a connection-exception entry (like 
exceptionToPolicyMap.put(connectionException, FOREVER policy))?

It seems we do not even need exceptionToPolicyMap for the FOREVER policy if we 
check the exception in the shouldRetry method.

Thoughts?
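
To make the second option concrete, here is a small sketch of a retry check that looks at the exception type itself instead of consulting a map. ConnectOnlyRetrySketch and its shouldRetry method are hypothetical and simplified; the real RetryPolicy.shouldRetry takes more parameters and returns a RetryAction. Walking the cause chain is an extra assumption of this sketch, on the guess that a connection failure may arrive wrapped in another exception.

```java
import java.net.ConnectException;

class ConnectOnlyRetrySketch {
  // Upper bound on how far the cause chain is followed, to stay safe
  // against unusually long or cyclic chains.
  private static final int MAX_CAUSE_DEPTH = 32;

  /** Returns true only if e, or one of its causes, is a ConnectException. */
  static boolean shouldRetry(Throwable e) {
    Throwable current = e;
    for (int depth = 0; current != null && depth < MAX_CAUSE_DEPTH; depth++) {
      if (current instanceof ConnectException) {
        return true;
      }
      current = current.getCause();
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(shouldRetry(new ConnectException("connection refused")));
    System.out.println(shouldRetry(new RuntimeException(new ConnectException("wrapped"))));
    System.out.println(shouldRetry(new IllegalArgumentException("application not found")));
  }
}
```

The trade-off compared with a map is that the set of retriable exceptions is hard-coded in the policy instead of being supplied by the caller.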



[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-16 Thread Raju Bairishetti (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547040#comment-14547040
 ] 

Raju Bairishetti commented on YARN-3644:


[~sandflee] Yes, the NM should catch the exception and keep itself alive.

Right now, the NM shuts itself down only on connection failures. It ignores 
all other kinds of exceptions and errors while sending heartbeats.
{code}
} catch (ConnectException e) {
  // catch and throw the exception if tried MAX wait time to connect RM
  dispatcher.getEventHandler().handle(
      new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
  throw new YarnRuntimeException(e);
} catch (Throwable e) {
  // TODO Better error handling. Thread can die with the rest of the
  // NM still running.
  LOG.error("Caught exception in status-updater", e);
}
{code}
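
As a rough illustration of the "catch and stay alive" behaviour, the sketch below keeps sending heartbeats after a ConnectException instead of dispatching a shutdown. The Heartbeat interface, the runUntilConnected method and the attempt limit are all hypothetical. The limit exists only so that the example terminates, and none of these names come from the NodeManager code.

```java
import java.net.ConnectException;

class HeartbeatLoopSketch {
  /** Hypothetical stand-in for the NM-to-RM heartbeat call. */
  interface Heartbeat {
    void send() throws Exception;
  }

  /**
   * Sends heartbeats until one succeeds or maxAttempts is reached and
   * returns the number of attempts made. Failures are logged and the
   * loop continues; no shutdown is triggered.
   */
  static int runUntilConnected(Heartbeat heartbeat, int maxAttempts, long sleepMs) {
    int attempts = 0;
    while (attempts < maxAttempts) {
      attempts++;
      try {
        heartbeat.send();
        return attempts;
      } catch (ConnectException e) {
        System.err.println("RM unreachable, will retry: " + e.getMessage());
      } catch (Exception e) {
        System.err.println("Caught exception in status-updater: " + e);
      }
      try {
        Thread.sleep(sleepMs);
      } catch (InterruptedException ie) {
        // Preserve the interrupt and stop retrying.
        Thread.currentThread().interrupt();
        return attempts;
      }
    }
    return attempts;
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Fails twice with a connection error, then succeeds.
    Heartbeat flaky = () -> {
      if (++calls[0] < 3) {
        throw new ConnectException("connection refused");
      }
    };
    System.out.println("attempts: " + runUntilConnected(flaky, 10, 1));
  }
}
```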



[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever

2015-05-14 Thread Raju Bairishetti (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544815#comment-14544815
 ] 

Raju Bairishetti commented on YARN-3646:


Thanks for the quick response.

I have reproduced it with the Apache 2.6.0 release (HDP 2.2.4 distribution). 
We are using version 2.5.0.

There is no *exceptionToPolicyMap* for the FOREVER retry policy. The 
exceptionToPolicyMap is populated only for the other retry policies.

*RetryPolicies.java*
{code}
static class RetryForever implements RetryPolicy {
  @Override
  public RetryAction shouldRetry(Exception e, int retries, int failovers,
      boolean isIdempotentOrAtMostOnce) throws Exception {
    return RetryAction.RETRY;
  }
}
{code}

*RMProxy.java*
{code}
if (waitForEver) {
  return RetryPolicies.RETRY_FOREVER;
}

...

Map<Class<? extends Exception>, RetryPolicy> exceptionToPolicyMap =
    new HashMap<Class<? extends Exception>, RetryPolicy>();
{code}



[jira] [Created] (YARN-3646) Applications are getting stuck some times in case of retry policy forever

2015-05-14 Thread Raju Bairishetti (JIRA)
Raju Bairishetti created YARN-3646:
--

 Summary: Applications are getting stuck some times in case of 
retry policy forever
 Key: YARN-3646
 URL: https://issues.apache.org/jira/browse/YARN-3646
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Reporter: Raju Bairishetti


We have set  *yarn.resourcemanager.connect.wait-ms* to -1 to use  FOREVER retry 
policy.

The YARN client retries indefinitely on exceptions from the RM because it uses 
the FOREVER retry policy. The problem is that it retries on all kinds of 
exceptions (such as ApplicationNotFoundException), even when the failure is not 
a connection failure. As a result, my application makes no further progress.

*The YARN client should not retry infinitely on non-connection failures.*
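
A minimal sketch of the behaviour being asked for, in plain Java with no Hadoop dependency. The names SelectiveRetry and callRetryingOnConnectFailure are hypothetical and are not part of the YARN client or of Hadoop's retry classes; java.net.ConnectException is only assumed here as a stand-in for "connection failure". The idea is that the retry loop runs forever only for connection failures, while any other exception reaches the caller on the first attempt.

```java
import java.net.ConnectException;
import java.util.concurrent.Callable;

// Hypothetical helper for illustration only; not a YARN or Hadoop API.
final class SelectiveRetry {
    private SelectiveRetry() {}

    /**
     * Calls op until it succeeds. Only ConnectException is treated as
     * retriable; every other exception propagates on the first attempt.
     */
    static <T> T callRetryingOnConnectFailure(Callable<T> op, long sleepMs)
            throws Exception {
        while (true) {
            try {
                return op.call();
            } catch (ConnectException e) {
                // Connection failure: the server may come back, so wait
                // and retry without an upper bound.
                Thread.sleep(sleepMs);
            }
            // Any other exception (for example an application-not-found
            // error reported by the server) is not caught above and is
            // thrown to the caller immediately.
        }
    }
}
```

With a policy shaped like this, a lookup for an unknown appId would fail once and return control to the application instead of looping.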

We wrote a simple YARN client that tries to fetch an application report for an 
invalid or old appId. The ResourceManager throws an 
ApplicationNotFoundException because the appId is invalid or old. But because 
of the FOREVER retry policy, the client keeps retrying the 
getApplicationReport call, and the ResourceManager keeps throwing 
ApplicationNotFoundException.

{code}
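import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

private void testYarnClientRetryPolicy() throws Exception {
  YarnConfiguration conf = new YarnConfiguration();
  // -1 selects the FOREVER retry policy
  conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, -1);
  YarnClient yarnClient = YarnClient.createYarnClient();
  yarnClient.init(conf);
  yarnClient.start();
  // appId that does not exist in the RM (see the RM logs)
  ApplicationId appId = ApplicationId.newInstance(1430126768987L, 10645);
  // never returns: the client keeps retrying on ApplicationNotFoundException
  ApplicationReport report = yarnClient.getApplicationReport(appId);
}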
{code}


*RM logs:*

{noformat}

15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 10.14.120.231:61621 Call#875162 Retry#0
org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1430126768987_10645' doesn't exist in RM.
    at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
    at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 10.14.120.231:61621 Call#875163 Retry#0


{noformat}



