[jira] [Assigned] (YARN-261) Ability to kill AM attempts

2015-09-05 Thread Rohith Sharma K S (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S reassigned YARN-261:
--

Assignee: Rohith Sharma K S

> Ability to kill AM attempts
> ---
>
> Key: YARN-261
> URL: https://issues.apache.org/jira/browse/YARN-261
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: api
>Affects Versions: 2.0.3-alpha
>Reporter: Jason Lowe
>Assignee: Rohith Sharma K S
> Attachments: YARN-261--n2.patch, YARN-261--n3.patch, 
> YARN-261--n4.patch, YARN-261--n5.patch, YARN-261--n6.patch, 
> YARN-261--n7.patch, YARN-261.patch
>
>
> It would be nice if clients could ask for an AM attempt to be killed.  This 
> is analogous to the task attempt kill support provided by MapReduce.
> This feature would be useful in a scenario where AM retries are enabled, the 
> AM supports recovery, and a particular AM attempt is stuck.  Currently if 
> this occurs the user's only recourse is to kill the entire application, 
> requiring them to resubmit a new application and potentially breaking 
> downstream dependent jobs if it's part of a bigger workflow.  Killing the 
> attempt would allow a new attempt to be started by the RM without killing the 
> entire application, and if the AM supports recovery it could potentially save 
> a lot of work.  It could also be useful in workflow scenarios where the 
> failure of the entire application kills the workflow, but the ability to kill 
> an attempt can keep the workflow going if the subsequent attempt succeeds.
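As a rough sketch of what the client side could look like once such a capability lands (the failApplicationAttempt method name and signature below are assumptions for illustration, not a committed API):

{code:java}
// Minimal sketch, assuming a YarnClient-style call for failing a single AM
// attempt is what this feature adds; the method name failApplicationAttempt
// and its exact signature are assumptions, not the committed API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class FailAttemptExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new YarnConfiguration();
    YarnClient client = YarnClient.createYarnClient();
    client.init(conf);
    client.start();
    try {
      // Attempt id passed on the command line,
      // e.g. appattempt_1441440000000_0001_000001
      ApplicationAttemptId attemptId = ApplicationAttemptId.fromString(args[0]);
      // Ask the RM to kill only this attempt; the application stays alive and
      // the RM can launch a new attempt that may recover previous work.
      client.failApplicationAttempt(attemptId);
    } finally {
      client.stop();
    }
  }
}
{code}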



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-261) Ability to kill AM attempts

2015-09-05 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731837#comment-14731837
 ] 

Rohith Sharma K S commented on YARN-261:


Thanks [~aklochkov] for the information. :-)

> Ability to kill AM attempts
> ---
>
> Key: YARN-261
> URL: https://issues.apache.org/jira/browse/YARN-261
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: api
>Affects Versions: 2.0.3-alpha
>Reporter: Jason Lowe
>Assignee: Rohith Sharma K S
> Attachments: YARN-261--n2.patch, YARN-261--n3.patch, 
> YARN-261--n4.patch, YARN-261--n5.patch, YARN-261--n6.patch, 
> YARN-261--n7.patch, YARN-261.patch
>
>
> It would be nice if clients could ask for an AM attempt to be killed.  This 
> is analogous to the task attempt kill support provided by MapReduce.
> This feature would be useful in a scenario where AM retries are enabled, the 
> AM supports recovery, and a particular AM attempt is stuck.  Currently if 
> this occurs the user's only recourse is to kill the entire application, 
> requiring them to resubmit a new application and potentially breaking 
> downstream dependent jobs if it's part of a bigger workflow.  Killing the 
> attempt would allow a new attempt to be started by the RM without killing the 
> entire application, and if the AM supports recovery it could potentially save 
> a lot of work.  It could also be useful in workflow scenarios where the 
> failure of the entire application kills the workflow, but the ability to kill 
> an attempt can keep the workflow going if the subsequent attempt succeeds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity

2015-09-05 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731875#comment-14731875
 ] 

Sunil G commented on YARN-4091:
---

Thank you [~leftnoteasy] for the detailed information. Based on your input, and 
after syncing with [~rohithsharma] and [~nijel] offline, I am trying to 
summarize a viewpoint here. Only very raw information is shown in the example 
REST response for now; we will add more detail later.

*Adding more diagnostic and debug information to the scheduler gives the user 
two levels of knowledge. If we expose this information through two REST API 
calls, the specific reason for a potential scheduler problem can be identified 
and acted upon.*

*1*. What happened to an application recently in the scheduler (for example, 
status derived from node heartbeats)

*Example*:
- the application might not have received the containers it asked for.
  Reason: the user limit for the application has been reached
- the application might still be pending, not yet activated.
  Reason: the AM resource limit is exhausted, so the app cannot be activated

*Benefit for the user*:
   The user gets a clear problem area to look at, along with the likely 
reason for it.
*How the user can get this info*:
  Via a REST API, debug/diagnostic information can be fetched for a 
queue/application.
*Expected output*:
{noformat}
 queue - a:
  application : app1
  appState : RUNNING
  reasonPhrase : NA
  lastContainerAssignmentState : SKIPPED_ASSIGNMENT
  reasonPhrase : Userlimit quota is reached
  application : app2
  appState : ACCEPTED
  reasonPhrase : AM resource limit exhausted
{noformat}
   
*2*. Data/metrics information from the scheduler that is specific to the 
problem identified in 1.

*Example*:
- The user can fetch metrics via REST such as the current queue capacity, the 
configured user limit, the user limit calculated within the scheduler, etc.
- The user can fetch metrics via REST such as queue capacity, the configured 
AM resource percentage, the AM resource percentage calculated within the RM, 
the current demand, etc.

This two-level information will help the user take the right corrective 
measure in the cluster, such as increasing the priority of an app, changing 
the queue of an application, manually killing some containers on a node, or 
applying some auto-tuning from the AM.
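
For reference, the two REST calls could be exercised with something as simple as the sketch below. The endpoint paths and query parameters are placeholders for whatever this JIRA finally adds; only the two-step flow is the point.

{code:java}
// Minimal sketch of the two-step flow described above. The endpoint paths
// (/ws/v1/cluster/scheduler/app-diagnostics and /ws/v1/cluster/scheduler/metrics)
// are assumptions, not existing RM web-service paths.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class SchedulerDiagnosticsClient {

  static String get(String url) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    conn.setRequestProperty("Accept", "application/json");
    try (BufferedReader in =
        new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
      StringBuilder body = new StringBuilder();
      String line;
      while ((line = in.readLine()) != null) {
        body.append(line);
      }
      return body.toString();
    }
  }

  public static void main(String[] args) throws Exception {
    String rm = "http://rm-host:8088";
    // Step 1: what happened to applications in queue "a" recently.
    System.out.println(get(rm + "/ws/v1/cluster/scheduler/app-diagnostics?queue=a"));
    // Step 2: queue / user-limit / AM-resource metrics specific to the
    // problem found in step 1.
    System.out.println(get(rm + "/ws/v1/cluster/scheduler/metrics?queue=a"));
  }
}
{code}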

> Improvement: Introduce more debug/diagnostics information to detail out 
> scheduler activity
> --
>
> Key: YARN-4091
> URL: https://issues.apache.org/jira/browse/YARN-4091
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.7.0
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: Improvement on debugdiagnostic information - YARN.pdf
>
>
> As schedulers gain new capabilities, more of the configurations that tune 
> them start taking actions such as limiting container assignment to an 
> application or delaying container allocation. No clear information about 
> these decisions is passed from the scheduler to the outside world, which 
> makes debugging much harder.
> This ticket is an effort to introduce better-defined states at the various 
> points where the scheduler skips or rejects a container assignment, 
> activates an application, etc. Such information will help the user 
> understand what is happening inside the scheduler.
> Attaching a short proposal for initial discussion. We would like to improve 
> on it as we discuss.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3367) Replace starting a separate thread for post entity with event loop in TimelineClient

2015-09-05 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14732079#comment-14732079
 ] 

Hadoop QA commented on YARN-3367:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  16m  1s | Findbugs (version ) appears to 
be broken on YARN-2928. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 56s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 58s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 24s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 46s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 33s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 41s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   2m 49s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   1m 59s | Tests passed in 
hadoop-yarn-common. |
| {color:red}-1{color} | yarn tests |   6m 38s | Tests failed in 
hadoop-yarn-server-nodemanager. |
| | |  48m 50s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | 
hadoop.yarn.server.nodemanager.containermanager.TestContainerManagerRecovery |
| Timed out tests | 
org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerResync |
|   | org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerReboot |
|   | org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12754340/YARN-3367.YARN-2928.001.patch
 |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | YARN-2928 / e6afe26 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9016/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| hadoop-yarn-server-nodemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9016/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/9016/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/9016/console |


This message was automatically generated.

> Replace starting a separate thread for post entity with event loop in 
> TimelineClient
> 
>
> Key: YARN-3367
> URL: https://issues.apache.org/jira/browse/YARN-3367
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Junping Du
>Assignee: Naganarasimha G R
> Attachments: YARN-3367.YARN-2928.001.patch
>
>
> Since YARN-3039, TimelineClient loops waiting for collectorServiceAddress to 
> be ready before posting any entity. Consumers of TimelineClient (such as the 
> AM) start a new thread for each call to avoid a potential deadlock in the 
> main thread. This approach has at least three major defects:
> 1. The consumer needs additional code to wrap each putEntities() call in a 
> thread.
> 2. It consumes many thread resources unnecessarily.
> 3. The sequence of events can end up out of order because each posting 
> thread leaves the waiting loop at a random time.
> We should have something like an event loop on the TimelineClient side: 
> putEntities() only puts the entities into a queue, and a separate thread 
> delivers the queued entities to the collector via REST calls.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3367) Replace starting a separate thread for post entity with event loop in TimelineClient

2015-09-05 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-3367:

Attachment: YARN-3367.YARN-2928.001.patch

Uploading an initial patch (with no test case) for this jira.
Some open points which need more discussion:
# Are the TimelineClient async calls only meant to ensure that the client does 
not wait for the server response and returns immediately after requesting the 
entity post, or do we also need to ensure something on the server side? 
Currently we are passing the async parameter to the server.
# Per the earlier discussion we had to decide between a 2x2 matrix of 
sync/async on the client and flush/no-flush of the writer on the server side, 
but after YARN-4061 (fault-tolerant writer for timeline v2) I presume the 
client need not ensure much, since consistency will be handled on the server 
side; IMO it would be sufficient to just have a non-blocking call for async.
# Is it important to maintain the order of events sent through sync and async 
calls? That is, must all pending async events be pushed along with the current 
sync event, or is it ok to send only the sync one? (The current patch only 
ensures that async events stay in order.)
# Do we need to merge the entities of multiple async calls, since they belong 
to the same application?

Please review and share your thoughts on the above points.
cc [~sjlee0]: informing you here since you had asked to be included in the 
discussion of these points and you were not yet watching this jira.
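
To make the discussion on points 1 and 3 above concrete, a bare-bones version of the proposed queue-plus-dispatcher loop might look like the sketch below. This is only an illustration of the pattern, not the patch itself; the Entity class and postToCollector() are placeholders.

{code:java}
// Minimal sketch of the proposed event loop, assuming nothing about the real
// TimelineClient internals: putEntityAsync() only enqueues, and one dispatcher
// thread drains the queue and posts to the collector in order.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class TimelineEventLoopSketch {

  static class Entity {
    final String id;
    Entity(String id) { this.id = id; }
  }

  private final BlockingQueue<Entity> queue = new LinkedBlockingQueue<>();
  private final Thread dispatcher;

  public TimelineEventLoopSketch() {
    dispatcher = new Thread(() -> {
      List<Entity> batch = new ArrayList<>();
      while (!Thread.currentThread().isInterrupted()) {
        try {
          Entity first = queue.take();   // block until there is work
          batch.add(first);
          queue.drainTo(batch);          // merge whatever else is pending
          postToCollector(batch);        // single REST call, order preserved
          batch.clear();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
      }
    }, "timeline-dispatcher");
    dispatcher.start();
  }

  // Async path: callers never block on the collector address or the REST call.
  public void putEntityAsync(Entity e) {
    queue.add(e);
  }

  private void postToCollector(List<Entity> batch) {
    // placeholder: deliver the batch to the collector via REST here
  }

  public void stop() throws InterruptedException {
    dispatcher.interrupt();
    dispatcher.join();
  }
}
{code}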

> Replace starting a separate thread for post entity with event loop in 
> TimelineClient
> 
>
> Key: YARN-3367
> URL: https://issues.apache.org/jira/browse/YARN-3367
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Junping Du
>Assignee: Naganarasimha G R
> Attachments: YARN-3367.YARN-2928.001.patch
>
>
> Since YARN-3039, TimelineClient loops waiting for collectorServiceAddress to 
> be ready before posting any entity. Consumers of TimelineClient (such as the 
> AM) start a new thread for each call to avoid a potential deadlock in the 
> main thread. This approach has at least three major defects:
> 1. The consumer needs additional code to wrap each putEntities() call in a 
> thread.
> 2. It consumes many thread resources unnecessarily.
> 3. The sequence of events can end up out of order because each posting 
> thread leaves the waiting loop at a random time.
> We should have something like an event loop on the TimelineClient side: 
> putEntities() only puts the entities into a queue, and a separate thread 
> delivers the queued entities to the collector via REST calls.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4107) Both RM becomes Active if all zookeepers can not connect to active RM

2015-09-05 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14732143#comment-14732143
 ] 

Karthik Kambatla commented on YARN-4107:


If the RM fails to connect to ZK, the fencing thread should fail and lead to 
the RM transitioning to standby. If that is not happening, making it do so is 
likely a better fix for this issue. No?
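
In other words, something along the lines of the sketch below on the ZK watcher path. This is purely an illustration of the suggested behavior, not the RM's actual elector code; transitionToStandby() stands in for the RM's real HA hook.

{code:java}
// Illustrative sketch only: if the ZK session is lost (Disconnected/Expired),
// give up active state and transition this RM to standby instead of staying
// active without a verifiable fencing lock.
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

public class StandbyOnZkLossWatcher implements Watcher {

  interface HaControl {
    void transitionToStandby();
  }

  private final HaControl rm;

  public StandbyOnZkLossWatcher(HaControl rm) {
    this.rm = rm;
  }

  @Override
  public void process(WatchedEvent event) {
    switch (event.getState()) {
      case Disconnected:
      case Expired:
        // Cannot hold or verify the fencing lock without ZK, so stop being active.
        rm.transitionToStandby();
        break;
      default:
        // SyncConnected etc.: nothing to do here.
        break;
    }
  }
}
{code}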

> Both RM becomes Active if all zookeepers can not connect to active RM
> -
>
> Key: YARN-4107
> URL: https://issues.apache.org/jira/browse/YARN-4107
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-4107.1.patch
>
>
> Steps to reproduce:
> 1) Run small randomwriter applications in the background
> 2) rm1 is active and rm2 is standby
> 3) Disconnect all ZK nodes from the active RM
> 4) Check the status of both RMs: both of them are in the active state



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4113) RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER

2015-09-05 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14732144#comment-14732144
 ] 

Karthik Kambatla commented on YARN-4113:


Good catch. Do we have a common JIRA to track the updates to RETRY_FOREVER 
policy? 

> RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER
> --
>
> Key: YARN-4113
> URL: https://issues.apache.org/jira/browse/YARN-4113
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Sunil G
>Priority: Critical
>
> Found an issue in how RMProxy initializes its RetryPolicy 
> (RMProxy#createRetryPolicy): when rmConnectWaitMS is set to -1 (wait 
> forever), it uses RetryPolicies.RETRY_FOREVER, which does not respect the 
> {{yarn.resourcemanager.connect.retry-interval.ms}} setting.
> RetryPolicies.RETRY_FOREVER uses 0 as the interval. When I ran 
> {{TestYarnClient#testShouldNotRetryForeverForNonNetworkExceptions}} without 
> a properly set up localhost name, it wrote 14G of DEBUG exception messages 
> before dying. This would be very bad if the same thing happened in a 
> production cluster.
> We should fix two places:
> - Make RETRY_FOREVER able to take a retry interval as a constructor 
> parameter.
> - Respect the retry interval when we use the RETRY_FOREVER policy.
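
One way to picture the fix is the sketch below, which uses the existing fixed-sleep factories in RetryPolicies rather than a new RETRY_FOREVER constructor; it illustrates the idea only, not the committed change.

{code:java}
// Minimal sketch of the intended behavior in RMProxy#createRetryPolicy: when
// rmConnectWaitMS is -1, build a "retry forever" policy that still sleeps
// rmRetryIntervalMS between attempts instead of using the interval-less
// RetryPolicies.RETRY_FOREVER singleton.
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

public class RetryForeverWithInterval {

  static RetryPolicy createRetryPolicy(long rmConnectWaitMS, long rmRetryIntervalMS) {
    if (rmConnectWaitMS == -1) {
      // Effectively "forever", but honouring
      // yarn.resourcemanager.connect.retry-interval.ms between attempts.
      return RetryPolicies.retryUpToMaximumCountWithFixedSleep(
          Integer.MAX_VALUE, rmRetryIntervalMS, TimeUnit.MILLISECONDS);
    }
    // Bounded waiting: retry until rmConnectWaitMS is exhausted.
    return RetryPolicies.retryUpToMaximumTimeWithFixedSleep(
        rmConnectWaitMS, rmRetryIntervalMS, TimeUnit.MILLISECONDS);
  }
}
{code}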



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3591) Resource localization on a bad disk causes subsequent containers failure

2015-09-05 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-3591:
---
Summary: Resource localization on a bad disk causes subsequent containers 
failure   (was: Resource Localisation on a bad disk causes subsequent 
containers failure )

> Resource localization on a bad disk causes subsequent containers failure 
> -
>
> Key: YARN-3591
> URL: https://issues.apache.org/jira/browse/YARN-3591
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Lavkesh Lahngir
>Assignee: Lavkesh Lahngir
> Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, 
> YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch, YARN-3591.5.patch, 
> YARN-3591.6.patch, YARN-3591.7.patch, YARN-3591.8.patch, YARN-3591.9.patch
>
>
> This happens when a resource is localized on a disk and that disk 
> subsequently goes bad. The NM keeps the paths of localized resources in 
> memory. At resource-request time, isResourcePresent(rsrc) is called, which 
> calls file.exists() on the localized path.
> In some cases when the disk has gone bad, inodes are still cached and 
> file.exists() returns true, but when the file is actually read it will not 
> open.
> Note: file.exists() calls stat64 natively, which returns true because it was 
> able to find the inode information from the OS.
> The proposal is to call file.list() on the parent path of the resource, 
> which calls open() natively. If the disk is good, it should return an array 
> of paths with length at least 1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs

2015-09-05 Thread He Tianyi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14732176#comment-14732176
 ] 

He Tianyi commented on YARN-2005:
-

Hi,

I've seen AM failures caused by a Java dependency problem or a mistyped 
parameter; in such cases we certainly do not want to blacklist the node.
In a future version, will we be able to tell whether the failure was caused by 
the node itself or by the application (e.g. by enforcing exit-code 
conventions)?

> Blacklisting support for scheduling AMs
> ---
>
> Key: YARN-2005
> URL: https://issues.apache.org/jira/browse/YARN-2005
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 0.23.10, 2.4.0
>Reporter: Jason Lowe
>Assignee: Anubhav Dhoot
> Attachments: YARN-2005.001.patch, YARN-2005.002.patch, 
> YARN-2005.003.patch, YARN-2005.004.patch, YARN-2005.005.patch, 
> YARN-2005.006.patch, YARN-2005.006.patch, YARN-2005.007.patch, 
> YARN-2005.008.patch
>
>
> It would be nice if the RM supported blacklisting a node for an AM launch 
> after the same node fails a configurable number of AM attempts.  This would 
> be similar to the blacklisting support for scheduling task attempts in the 
> MapReduce AM but for scheduling AM attempts on the RM side.
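
A tiny sketch of the threshold idea, not the RM's actual blacklist manager: after a node hosts a configurable number of failed AM attempts, skip it for future AM placement. The configuration name in the comment is an assumption for illustration.

{code:java}
// Per-node AM failure counter with a configurable threshold; purely
// illustrative of the feature described above.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AmLaunchBlacklist {

  // e.g. a property like yarn.am.blacklisting.failure-threshold (assumed name)
  private final int failureThreshold;
  private final Map<String, Integer> failuresPerNode = new ConcurrentHashMap<>();

  public AmLaunchBlacklist(int failureThreshold) {
    this.failureThreshold = failureThreshold;
  }

  public void onAmAttemptFailed(String nodeId) {
    failuresPerNode.merge(nodeId, 1, Integer::sum);
  }

  public boolean isBlacklistedForAm(String nodeId) {
    return failuresPerNode.getOrDefault(nodeId, 0) >= failureThreshold;
  }
}
{code}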



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4119) Expose the NM bind address as an env, so that AM can make use of it for exposing tracking URL

2015-09-05 Thread Naganarasimha G R (JIRA)
Naganarasimha G R created YARN-4119:
---

 Summary:  Expose the NM bind address as an env, so that AM can 
make use of it for exposing tracking URL
 Key: YARN-4119
 URL: https://issues.apache.org/jira/browse/YARN-4119
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Naganarasimha G R
Assignee: Naganarasimha G R


As described in MAPREDUCE-5938, many security scanning tools flag binding on 
all network addresses; it is better to bind only on the desired address. Since 
AMs can run on any node, it would be better for the NM to share its bind 
address with the container as part of its environment variables.
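
For illustration, an AM could pick its tracking-URL bind address from such a variable roughly as follows; the NM_BIND_ADDRESS name is an assumption, not an agreed-upon variable, while NM_HOST is an existing variable shown only as a fallback.

{code:java}
// Sketch of how an AM might consume the proposed environment variable when
// choosing the interface to bind its tracking URL on.
import java.net.InetSocketAddress;

public class TrackingUrlBindAddress {

  static InetSocketAddress resolveBindAddress(int port) {
    // Proposed: the NM tells the container which address it is bound to.
    String bindAddr = System.getenv("NM_BIND_ADDRESS");
    if (bindAddr == null || bindAddr.isEmpty()) {
      // Fallback: NM_HOST is already exported to every container today.
      bindAddr = System.getenv("NM_HOST");
    }
    if (bindAddr == null || bindAddr.isEmpty()) {
      // Last resort: bind on all interfaces, which is what this JIRA wants to avoid.
      bindAddr = "0.0.0.0";
    }
    return new InetSocketAddress(bindAddr, port);
  }
}
{code}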



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)