[jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM

2014-09-01 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117073#comment-14117073
 ] 

Zhijie Shen commented on YARN-611:
--

[~xgong], thanks for working on this issue. I have a couple of comments on the 
latest solution.

1. *API Change*: I'm not sure whether it is really necessary to have completely 
standalone proto messages for ApplicationRetryPolicy's implementations. It 
sounds like overkill to me. In fact, MaxApplicationRetriesPolicy seems to be a 
special case of WindowedApplicationRetriesPolicy where the window size is 
infinitely large, such that the number of failures is never reset. Therefore, 
why not simply add one more field (i.e., resetTimeWindow) in 
ApplicationSubmissionContext? When resetTimeWindow = 0 or -1, the window size 
is unbounded and the failure count is never reset. On the other hand, when 
resetTimeWindow is set to > 0, the failure count does not take into account 
failures that happen outside the window.

Moreover, a minor issue here is that ApplicationRetryPolicy is not a real 
abstraction: it carries the flags of both implementations' contexts.
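
For illustration, here is a tiny sketch of the single-field semantics described 
above; the field name resetTimeWindow comes from this proposal, and the helper 
class below is hypothetical, not existing YARN code.

{code:java}
// Hypothetical sketch of the proposed single-field semantics (not existing YARN code).
// resetTimeWindowMs <= 0 : the window is unbounded and failures are never discarded.
// resetTimeWindowMs >  0 : only failures within the last resetTimeWindowMs count.
public final class ResetWindowSemantics {
  private ResetWindowSemantics() {}

  /** Returns true if a failure that happened at failureTime still counts at time 'now'. */
  public static boolean countsTowardMaxAttempts(long resetTimeWindowMs,
                                                long failureTime, long now) {
    if (resetTimeWindowMs <= 0) {
      return true;                                 // unbounded window: every failure counts
    }
    return now - failureTime <= resetTimeWindowMs; // only failures inside the window count
  }
}
{code}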

2. *Failure Window*: If I understand correctly, WindowedApplicationRetriesPolicy 
uses a jumping window instead of a *moving* window, which may be problematic. 
Here's an example. Let's say the window size is 2H and maxAttempts is 100. From 
0:00 to 1:00, 1 failure happened. From 1:00 to 2:00, 98 failures happened. At 
2:00 the reset logic is triggered, so all 99 failures are no longer taken into 
account. From 2:00 to 3:00, 2 failures happened. The total failure count at 
this time is 2, because the previous 99 failures have been reset. However, 
looking back over the 2H window from the point of view at 3:00, 101 failures 
have happened. In fact, the job should have run out of its retry quota at this 
point.

IMHO, the reasonable way is to use a suitable data structure (e.g., a fixed-size 
FIFO queue) to always keep track of the number of failures that happened within 
the past configured time window, and to update the data structure whenever a 
failure happens.
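
For illustration, a minimal sketch of such a structure, assuming failures are 
tracked by timestamp (the class name is illustrative, not an existing YARN 
class):

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch only: a moving-window failure counter that is updated
// whenever a failure happens, so no periodic reset is needed.
public class MovingWindowFailureCounter {
  private final long windowMs;                                  // configured window size
  private final Deque<Long> failureTimes = new ArrayDeque<>();  // failure timestamps, oldest first

  public MovingWindowFailureCounter(long windowMs) {
    this.windowMs = windowMs;
  }

  /** Record a failure and return the number of failures within the past window. */
  public synchronized int recordFailure(long now) {
    failureTimes.addLast(now);
    // Drop failures that have fallen out of the moving window.
    while (!failureTimes.isEmpty() && now - failureTimes.peekFirst() > windowMs) {
      failureTimes.removeFirst();
    }
    return failureTimes.size();
  }
}
{code}

On each attempt failure, the returned count would be compared against 
maxAttempts; no jumping-window reset (and hence no dedicated reset thread) is 
needed.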

3. *Multi-threading*: I'm not sure whether it is going to work, for a big 
cluster with hundreds or even thousands of concurrent applications, to have an 
individual thread per application to reset the failure count. Though 
WindowedApplicationRetriesPolicy is particularly designed for long-running 
services, I don't think we have restricted normal applications from using it, 
and it's not reasonable to make this restriction. Therefore, the RM is likely 
to end up with that many threads if all apps choose to use this policy. 
However, AFAIK, the number of threads in a process is limited. More 
importantly, the reset logic is not computation-intensive, so it wastes thread 
resources to have one thread per app.

Maybe we can make use of a thread pool, or even have a single thread (e.g., a 
service of the RM) take care of all the apps' reset windows. Moreover, IMHO, if 
the aforementioned data structure is defined properly, we may not need a 
separate thread to do the reset work at all, since the failure count within the 
configured window is updated every time a failure happens.
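
If a periodic reset is still preferred over the purely event-driven update 
above, a single shared scheduler could serve every application instead of one 
thread per app; a rough sketch under that assumption (illustrative names, not 
existing RM code):

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only: one scheduled thread resets all apps' failure counts,
// rather than one dedicated reset thread per application.
public class SharedResetService {

  private static final class AppWindow {
    final long windowMs;
    final AtomicInteger failures = new AtomicInteger();
    volatile long windowStart;
    AppWindow(long windowMs, long now) { this.windowMs = windowMs; this.windowStart = now; }
  }

  private final Map<String, AppWindow> apps = new ConcurrentHashMap<>();
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  public SharedResetService(long scanIntervalMs) {
    scheduler.scheduleAtFixedRate(this::scan, scanIntervalMs, scanIntervalMs,
        TimeUnit.MILLISECONDS);
  }

  public void register(String appId, long windowMs) {
    apps.put(appId, new AppWindow(windowMs, System.currentTimeMillis()));
  }

  /** Assumes register() was called for appId; returns the count since the last reset. */
  public int recordFailure(String appId) {
    return apps.get(appId).failures.incrementAndGet();
  }

  // One pass over all apps: reset any count whose window has elapsed.
  private void scan() {
    long now = System.currentTimeMillis();
    for (AppWindow w : apps.values()) {
      if (now - w.windowStart >= w.windowMs) {
        w.failures.set(0);
        w.windowStart = now;
      }
    }
  }
}
{code}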

4. *Affecting RMStateStore*: I'm not sure why it is necessary to persist the 
end time into RMStateStore; it seems not to be really used for resetting the 
window. One thing I can imagine about RM restarting is how to store the failure 
count within the configured window, if we want to make sure that after 
restarting, the RM is still able to trace back over the whole past time window 
for the failure count. But I think we can do that separately.

 Add an AM retry count reset window to YARN RM
 -

 Key: YARN-611
 URL: https://issues.apache.org/jira/browse/YARN-611
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.3-alpha
Reporter: Chris Riccomini
Assignee: Xuan Gong
 Attachments: YARN-611.1.patch, YARN-611.2.patch, YARN-611.3.patch, 
 YARN-611.4.patch, YARN-611.4.rebase.patch, YARN-611.5.patch


 YARN currently has the following config:
 yarn.resourcemanager.am.max-retries
 This config defaults to 2, and defines how many times to retry a failed AM 
 before failing the whole YARN job. YARN counts an AM as failed if the node 
 that it was running on dies (the NM will timeout, which counts as a failure 
 for the AM), or if the AM dies.
 This configuration is insufficient for long running (or infinitely running) 
 YARN jobs, since the machine (or NM) that the AM is running on will 
 eventually need to be restarted (or the machine/NM will fail). In such an 
 event, the AM has not done anything wrong, but this is counted as a failure 
 by the RM. Since 

[jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM

2014-09-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117080#comment-14117080
 ] 

Hadoop QA commented on YARN-611:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12663467/YARN-611.5.patch
  against trunk revision 258c7d0.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4795//console

This message is automatically generated.

 Add an AM retry count reset window to YARN RM
 -

 Key: YARN-611
 URL: https://issues.apache.org/jira/browse/YARN-611
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.3-alpha
Reporter: Chris Riccomini
Assignee: Xuan Gong
 Attachments: YARN-611.1.patch, YARN-611.2.patch, YARN-611.3.patch, 
 YARN-611.4.patch, YARN-611.4.rebase.patch, YARN-611.5.patch


 YARN currently has the following config:
 yarn.resourcemanager.am.max-retries
 This config defaults to 2, and defines how many times to retry a failed AM 
 before failing the whole YARN job. YARN counts an AM as failed if the node 
 that it was running on dies (the NM will timeout, which counts as a failure 
 for the AM), or if the AM dies.
 This configuration is insufficient for long running (or infinitely running) 
 YARN jobs, since the machine (or NM) that the AM is running on will 
 eventually need to be restarted (or the machine/NM will fail). In such an 
 event, the AM has not done anything wrong, but this is counted as a failure 
 by the RM. Since the retry count for the AM is never reset, eventually, at 
 some point, the number of machine/NM failures will result in the AM failure 
 count going above the configured value for 
 yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the 
 job as failed, and shut it down. This behavior is not ideal.
 I propose that we add a second configuration:
 yarn.resourcemanager.am.retry-count-window-ms
 This configuration would define a window of time that would define when an AM 
 is well behaved, and it's safe to reset its failure count back to zero. 
 Every time an AM fails, the RMAppImpl would check the last time that the AM 
 failed. If the last failure was less than retry-count-window-ms ago, and the 
 new failure count is > max-retries, then the job should fail. If the AM has 
 never failed, the retry count is < max-retries, or the last failure was 
 OUTSIDE the retry-count-window-ms, then the job should be restarted. 
 Additionally, if the last failure was outside the retry-count-window-ms, then 
 the failure count should be set back to 0.
 This would give developers a way to have well-behaved AMs run forever, while 
 still failing misbehaving AMs after a short period of time.
 I think the work to be done here is to change the RMAppImpl to actually look 
 at app.attempts and see if there have been more than max-retries failures in 
 the last retry-count-window-ms milliseconds. If there have, then the job 
 should fail; if not, then the job should go forward. Additionally, we might 
 also need to add an endTime in either RMAppAttemptImpl or 
 RMAppFailedAttemptEvent, so that the RMAppImpl can check the time of the 
 failure.
 Thoughts?
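
 For illustration, a rough sketch of the check described above (illustrative 
 names, not the actual RMAppImpl code; maxRetries and windowMs stand for 
 yarn.resourcemanager.am.max-retries and 
 yarn.resourcemanager.am.retry-count-window-ms):

{code:java}
// Illustrative sketch of the proposed check (not actual RMAppImpl code).
public class RetryWindowCheck {
  private int failureCount;
  private long lastFailureTime = -1;

  /** Returns true if the job should be failed, false if the attempt should be restarted. */
  public synchronized boolean onAttemptFailed(long now, int maxRetries, long windowMs) {
    if (lastFailureTime >= 0 && now - lastFailureTime > windowMs) {
      // Last failure was outside the window: the AM has been well behaved, reset the count.
      failureCount = 0;
    }
    failureCount++;
    lastFailureTime = now;
    // Fail the job only if the retries were exhausted inside the window.
    return failureCount > maxRetries;
  }
}
{code}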



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-09-01 Thread Remus Rusanu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117143#comment-14117143
 ] 

Remus Rusanu commented on YARN-2198:


1. nativeio.c: Should we return null here?
RR: Fixed

2. Nit: the nativeio code uses a different naming convention for local variables. 
Please try to be consistent with the rest of the file.
RR: Fixed

3. nativeio.c: Nit: I would move the throw_ioe if-check before the done: label; 
the code flow will be less error-prone.
RR: fixed

4. winutils_process_stub.c: Can {{env->NewGlobalRef()}} return null/throw? 
Should we handle this?
RR: Fixed

5. winutils_process_stub.c: You should properly handle the GetExitCodeProcess() 
failure case.
RR: fixed

6. winutils_process_stub.c: Init to INVALID_HANDLE_VALUE?
RR: Fixed

7. client.c: Are RPC_STATUS error codes compatible with winerror codes? 
(semantics around checking for errors)
RR: From my experiments they are compatible. FormatMessage gets the right 
message for RPC statuses.

8. config.cpp: Wondering if there is a way to get to config files without 
adding a dependency on env variables?
RR: The config location is now ../etc/hadoop/wsce-site.xml relative to the exe. 
It is defined in pom.xml.

9. config.cpp: This error check is unintuitive. Can you please be more explicit?
RR: fixed (no longer applies because only one file is checked)

10. config.cpp: Are SAL annotations correct? For strings one would usually use 
__out_ecount()?
RR: Fixed, and it was broken all over, thanks for catching it

11. config.cpp: SAL annotation __out_bcount? Also outLen -> len in the annotation.
RR: fixed

11. config.cpp: This should be before StringCbPrintf to guarantee that CoInit 
and CoUninit are balanced. 
RR: fixed

12. hdpwinutilsvc.idl: The name does not seem appropriate for Apache... possibly 
name it just winutilsvc.idl. Should we use spaces in this file for consistency? 
RR: Fixed; all names are now hadoopwinutilsvc.

13. winutils.h: __in_bcount(len) -> __in_ecount(len)
RR: fixed

14. libwinutils.c: I'm wondering if this is a good opportunity to introduce 
unit tests for our C code, as the complexity has started increasing beyond just 
Windows OS calls, where there is little value in unit testing. 
RR: Not fixed. I will come back later and add unit tests here, but the core work 
(LRPC, SCM, logon user and create process) is basically untestable from a C 
unit test.

15. libwinutils.c: Should we deallocate this when BuildSecurityDescriptor fails?
RR: It is alloca'd, so it doesn't need dealloc.

I don't think it is required to do this now, just wanted to bring it up: if our 
native codebase continues to grow at this pace we should consider introducing 
smart pointers. It is becoming impossibly hard to properly manage the memory in 
all success/failure cases. This becomes more important now that we have long 
running NM native client and winutils service. 
RR: The whole winutils/libwinutils code style is early-90s Petzold Windows 
code style. I'm not a fan of it, but I kept all new code consistent with that 
style. Moving to C++ RAII would be better, but I don't want to do it piecemeal. 
Some other time.

16. What is the behaviour of calling {{winutils service}}? Will this command 
install and start a winutils.exe service under the SYSTEM account, and exit?
RR: No. SCM installation/config is left to SCM tools (e.g. sc.exe). {{winutils 
service}} is the command line to start the service (it starts, registers its 
entry point with SCM, and waits for SCM commands).

 Remove the need to run NodeManager as privileged account for Windows Secure 
 Container Executor
 --

 Key: YARN-2198
 URL: https://issues.apache.org/jira/browse/YARN-2198
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Remus Rusanu
Assignee: Remus Rusanu
  Labels: security, windows
 Attachments: YARN-2198.1.patch, YARN-2198.2.patch, YARN-2198.3.patch, 
 YARN-2198.separation.patch


 YARN-1972 introduces a Secure Windows Container Executor. However, this 
 executor requires the process launching the container to be LocalSystem or 
 a member of the local Administrators group. Since the process in question 
 is the NodeManager, the requirement translates to the entire NM running as a 
 privileged account, a very large surface area to review and protect.
 This proposal is to move the privileged operations into a dedicated NT 
 service. The NM can run as a low-privilege account and communicate with the 
 privileged NT service when it needs to launch a container. This would reduce 
 the surface exposed to the high privileges. 
 There has to exist a secure, authenticated and authorized channel of 
 communication between the NM and the privileged NT service. Possible 
 alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
 be to use Windows LPC 

[jira] [Created] (YARN-2485) Fix WSCE folder/file/classpathJar permission/order when running as non-admin

2014-09-01 Thread Remus Rusanu (JIRA)
Remus Rusanu created YARN-2485:
--

 Summary: Fix WSCE folder/file/classpathJar permission/order when 
running as non-admin
 Key: YARN-2485
 URL: https://issues.apache.org/jira/browse/YARN-2485
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Remus Rusanu
Assignee: Remus Rusanu


The WSCE creates the local, usercache, filecache, and appcache dirs in the normal 
DefaultContainerExecutor way, and then assigns ownership to the user process. 
The WSCE-configured group is added, but the permission masks used (710) do not 
give write permissions on the appcache/filecache/usercache folders to the NM 
itself.

The creation of these folders, as well as the creation of the temporary 
classpath jar files, must succeed even after the file/dir ownership is 
relinquished to the task user and the NM does not run as a local Administrator. 

LCE handles all these dirs inside the container-executor app (root). The 
classpathJar issue does not exist on Linux.

The dirs can be handled by simply delaying the transfer (create all dirs and 
temp files, then assign ownership in bulk), but the task classpathJar is 
'special' and needs some refactoring of the NM launch sequence.
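
For illustration, a minimal sketch of the delayed-transfer idea using plain 
Java NIO; this is an assumption-level sketch, not the actual WSCE/NM code path 
(which goes through winutils/native calls):

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.UserPrincipal;
import java.util.List;

// Illustrative sketch only: create every directory first while the NM still owns
// them, then hand ownership to the task user in one bulk pass.
public class DelayedOwnershipTransfer {
  public static void createThenTransfer(List<Path> dirs, String taskUser) throws IOException {
    if (dirs.isEmpty()) {
      return;
    }
    for (Path dir : dirs) {
      Files.createDirectories(dir);   // the NM can still write at this point
    }
    UserPrincipal owner = dirs.get(0).getFileSystem()
        .getUserPrincipalLookupService().lookupPrincipalByName(taskUser);
    for (Path dir : dirs) {
      Files.setOwner(dir, owner);     // relinquish ownership only after everything exists
    }
  }
}
{code}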



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol

2014-09-01 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117327#comment-14117327
 ] 

Tsuyoshi OZAWA commented on YARN-1879:
--

The latest patch is ready for review. I also think we can separate the 
RetryCache support into a separate JIRA to meet the deadline of the 2.6 
release. What do you think? Please let me know if I should do so.

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
 ---

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.3.patch, 
 YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, 
 YARN-1879.8.patch, YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store

2014-09-01 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117458#comment-14117458
 ] 

Junping Du commented on YARN-2033:
--

+1. Latest patch LGTM. Will commit it tomorrow if no new comments from others.

 Investigate merging generic-history into the Timeline Store
 ---

 Key: YARN-2033
 URL: https://issues.apache.org/jira/browse/YARN-2033
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Zhijie Shen
 Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, 
 YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, 
 YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, 
 YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, 
 YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch


 Having two different stores isn't amenable to generic insights on what's 
 happening with applications. This is to investigate porting generic-history 
 into the Timeline Store.
 One goal is to try and retain most of the client side interfaces as close to 
 what we have today.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2448) RM should expose the name of the ResourceCalculator being used when AMs register

2014-09-01 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117506#comment-14117506
 ] 

Varun Vasudev commented on YARN-2448:
-

[~sandyr], [~kasha], thanks for your extremely helpful input. I think what 
[~sandyr] is suggesting should be ok. Is it ok to generalize it to return a 
representation of the resource types that the scheduler considers as part of 
its functioning, so that if we add support for more resource types in the 
future, we don't have to change much?
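
For illustration, a hypothetical sketch of what such a representation of the 
resource types the scheduler considers could look like; the class and enum 
below are illustrative, not the actual RegisterApplicationMasterResponse API:

{code:java}
import java.util.EnumSet;

// Hypothetical sketch: expose the set of resource types the scheduler considers,
// rather than the ResourceCalculator class name.
public class SchedulerResourceInfoSketch {
  public enum ResourceType { MEMORY, CPU }

  private final EnumSet<ResourceType> consideredTypes;

  public SchedulerResourceInfoSketch(EnumSet<ResourceType> consideredTypes) {
    this.consideredTypes = EnumSet.copyOf(consideredTypes);
  }

  /** An AM can branch on this, e.g. ignore CPU if only MEMORY is considered. */
  public EnumSet<ResourceType> getConsideredResourceTypes() {
    return EnumSet.copyOf(consideredTypes);
  }
}
{code}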

 RM should expose the name of the ResourceCalculator being used when AMs 
 register
 

 Key: YARN-2448
 URL: https://issues.apache.org/jira/browse/YARN-2448
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-2448.0.patch, apache-yarn-2448.1.patch


 The RM should expose the name of the ResourceCalculator being used when AMs 
 register, as part of the RegisterApplicationMasterResponse.
 This will allow applications to make better decisions when scheduling. 
 MapReduce, for example, only looks at memory when making its scheduling 
 decisions, even though the RM could potentially be using the 
 DominantResourceCalculator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2486) FileSystem counters can overflow for large number of readOps, largeReadOps, writeOps

2014-09-01 Thread Swapnil Daingade (JIRA)
Swapnil Daingade created YARN-2486:
--

 Summary: FileSystem counters can overflow for large number of 
readOps, largeReadOps, writeOps
 Key: YARN-2486
 URL: https://issues.apache.org/jira/browse/YARN-2486
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Swapnil Daingade
Priority: Minor


The org.apache.hadoop.fs.FileSystem.Statistics.StatisticsData class defines 
readOps, largeReadOps, and writeOps as int. Also, the 
org.apache.hadoop.fs.FileSystem.Statistics class has methods like getReadOps(), 
getLargeReadOps() and getWriteOps() that return int. These int values can 
overflow if they exceed 2^31-1, showing negative values. It would be nice if 
these could be changed to long.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2486) FileSystem counters can overflow for large number of readOps, largeReadOps, writeOps

2014-09-01 Thread Gary Steelman (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117741#comment-14117741
 ] 

Gary Steelman commented on YARN-2486:
-

I'd really like to see these as long instead of int; thanks for reporting! Are 
there other places where counters are int that we should change to long?

 FileSystem counters can overflow for large number of readOps, largeReadOps, 
 writeOps
 

 Key: YARN-2486
 URL: https://issues.apache.org/jira/browse/YARN-2486
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Swapnil Daingade
Priority: Minor

 The org.apache.hadoop.fs.FileSystem.Statistics.StatisticsData class defines 
 readOps, largeReadOps, and writeOps as int. Also, the 
 org.apache.hadoop.fs.FileSystem.Statistics class has methods like 
 getReadOps(), getLargeReadOps() and getWriteOps() that return int. These int 
 values can overflow if they exceed 2^31-1, showing negative values. It would 
 be nice if these could be changed to long.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2486) FileSystem counters can overflow for large number of readOps, largeReadOps, writeOps

2014-09-01 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117800#comment-14117800
 ] 

Sandy Ryza commented on YARN-2486:
--

Unfortunately these methods were made public in 2.5, so we can't change their 
signatures.  We can, however, add versions with new names that return longs.
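
For illustration, a minimal sketch of that compatible approach; this is not the 
actual org.apache.hadoop.fs.FileSystem.Statistics code, and the long-returning 
method name is hypothetical:

{code:java}
// Illustrative sketch only (simplified; the real Statistics aggregates per-thread data).
// Keep the existing int-returning getter for compatibility and add a long-returning variant.
public class OpCounters {
  private long readOps;

  public synchronized void incrementReadOps(int count) {
    readOps += count;
  }

  /** Existing-style accessor, kept for compatibility; truncates once past 2^31-1. */
  public synchronized int getReadOps() {
    return (int) readOps;
  }

  /** New accessor with a new name (hypothetical), returning long so it cannot overflow. */
  public synchronized long getReadOpsLong() {
    return readOps;
  }
}
{code}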

 FileSystem counters can overflow for large number of readOps, largeReadOps, 
 writeOps
 

 Key: YARN-2486
 URL: https://issues.apache.org/jira/browse/YARN-2486
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.5.0, 2.4.1
Reporter: Swapnil Daingade
Priority: Minor

 The org.apache.hadoop.fs.FileSystem.Statistics.StatisticsData class defines 
 readOps, largeReadOps, and writeOps as int. Also, the 
 org.apache.hadoop.fs.FileSystem.Statistics class has methods like 
 getReadOps(), getLargeReadOps() and getWriteOps() that return int. These int 
 values can overflow if they exceed 2^31-1, showing negative values. It would 
 be nice if these could be changed to long.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2487) Need to support timeout of AM When no containers are assigned to it for a defined period

2014-09-01 Thread Naganarasimha G R (JIRA)
Naganarasimha G R created YARN-2487:
---

 Summary: Need to support timeout of AM When no containers are 
assigned to it for a defined period
 Key: YARN-2487
 URL: https://issues.apache.org/jira/browse/YARN-2487
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Naganarasimha G R
Assignee: Naganarasimha G R


 There are some scenarios where the AM will not get containers and will wait 
indefinitely. We faced one such scenario which causes the applications to hang: 
Consider a cluster setup which has 2 NMs with 8GB of resource each,
and 2 applications are launched in the default queue, where each AM takes 
2GB.
Each AM is placed on a different NM. Now each AM is requesting a container of 
7GB of memory.
As only 6GB of resource is available on each NM, both applications hang 
forever.

To avoid such scenarios I would like to propose a 
generic timeout feature for all AMs on the YARN side, such that if no containers 
are assigned to an application for a defined period, then YARN can time out the 
application attempt.
The default can be set to 0, in which case the RM will not time out the app 
attempt; the user can set their own timeout when submitting the application.
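
For illustration, a rough sketch of how such a check might look; names are 
hypothetical, not existing YARN code:

{code:java}
// Hypothetical sketch (not existing YARN code): time out an application attempt
// that has received no containers within the configured period; 0 disables the check.
public final class AmAllocationTimeoutCheck {
  private AmAllocationTimeoutCheck() {}

  public static boolean shouldTimeOut(long attemptStartTime, int containersAllocated,
                                      long timeoutMs, long now) {
    if (timeoutMs <= 0) {
      return false;                 // default 0: the RM never times out the attempt
    }
    return containersAllocated == 0 && now - attemptStartTime >= timeoutMs;
  }
}
{code}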



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2487) Need to support timeout of AM When no containers are assigned to it for a defined period

2014-09-01 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-2487:

Description: 
 There are some scenarios where the AM will not get containers and will wait 
indefinitely. We faced one such scenario which causes the applications to hang: 
Consider a cluster setup which has 2 NMs with 8GB of resource each,
and 2 applications (MR2) are launched in the default queue, where each AM 
takes 2GB.
Each AM is placed on a different NM. Now each AM is requesting a container of 
7GB of memory.
As only 6GB of resource is available on each NM, both applications hang 
forever.

To avoid such scenarios I would like to propose a 
generic timeout feature for all AMs in YARN, such that if no containers are 
assigned to an application for a defined period, then YARN can time out the 
application attempt.
The default can be set to 0, in which case the RM will not time out the app 
attempt; the user can set their own timeout when submitting the application.

  was:
 There are some scenarios where the AM will not get containers and will wait 
indefinitely. We faced one such scenario which causes the applications to hang: 
Consider a cluster setup which has 2 NMs with 8GB of resource each,
and 2 applications are launched in the default queue, where each AM takes 
2GB.
Each AM is placed on a different NM. Now each AM is requesting a container of 
7GB of memory.
As only 6GB of resource is available on each NM, both applications hang 
forever.

To avoid such scenarios I would propose a 
generic timeout feature for all AMs @ the YARN side, such that if no containers 
are assigned to an application for a defined period, then YARN can time out the 
application attempt.
The default can be set to 0, in which case the RM will not time out the app 
attempt; the user can set their own timeout when submitting the application.


 Need to support timeout of AM When no containers are assigned to it for a 
 defined period
 

 Key: YARN-2487
 URL: https://issues.apache.org/jira/browse/YARN-2487
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Naganarasimha G R
Assignee: Naganarasimha G R

  There are some scenarios where the AM will not get containers and will wait 
 indefinitely. We faced one such scenario which causes the applications to 
 hang: 
 Consider a cluster setup which has 2 NMs with 8GB of resource each,
 and 2 applications (MR2) are launched in the default queue, where each AM 
 takes 2GB.
 Each AM is placed on a different NM. Now each AM is requesting a container 
 of 7GB of memory.
 As only 6GB of resource is available on each NM, both applications hang 
 forever.
 To avoid such scenarios I would like to propose a 
 generic timeout feature for all AMs in YARN, such that if no containers are 
 assigned to an application for a defined period, then YARN can time out the 
 application attempt.
 The default can be set to 0, in which case the RM will not time out the app 
 attempt; the user can set their own timeout when submitting the application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)