[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs
[ https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14732176#comment-14732176 ]

He Tianyi commented on YARN-2005:
---

Hi, I've seen AM failures caused by a Java dependency problem or a mistyped parameter; in such cases we certainly do not want to blacklist the node. In a future version, will we be able to tell whether a failure was caused by the node itself or by the application (i.e. by enforcing exit-code conventions)?

> Blacklisting support for scheduling AMs
> ---
>
> Key: YARN-2005
> URL: https://issues.apache.org/jira/browse/YARN-2005
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager
> Affects Versions: 0.23.10, 2.4.0
> Reporter: Jason Lowe
> Assignee: Anubhav Dhoot
> Attachments: YARN-2005.001.patch, YARN-2005.002.patch, YARN-2005.003.patch, YARN-2005.004.patch, YARN-2005.005.patch, YARN-2005.006.patch, YARN-2005.006.patch, YARN-2005.007.patch, YARN-2005.008.patch
>
>
> It would be nice if the RM supported blacklisting a node for an AM launch after the same node fails a configurable number of AM attempts. This would be similar to the blacklisting support for scheduling task attempts in the MapReduce AM, but for scheduling AM attempts on the RM side.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
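The distinction He Tianyi asks about could be sketched as a classifier over container exit codes. This is a hypothetical helper, not YARN's actual blacklisting logic; the constants mirror YARN's `ContainerExitStatus` conventions but should be treated as illustrative.

```java
// Hypothetical sketch: classify an AM container exit code so the RM could
// decide whether a failure should count toward blacklisting the node.
class AMFailureClassifier {
    // Values mirror YARN's ContainerExitStatus conventions (illustrative).
    static final int DISKS_FAILED = -101;
    static final int ABORTED = -100;

    /** Returns true if the failure should be blamed on the node itself. */
    static boolean isNodeFault(int exitCode) {
        switch (exitCode) {
            case DISKS_FAILED:
                return true;   // bad local disks: a blacklist candidate
            case ABORTED:
                return false;  // released/preempted: not the node's fault
            default:
                // Application errors (missing Java dependency, bad
                // parameters, ...) surface as ordinary non-zero user exit
                // codes and should not blacklist the node.
                return false;
        }
    }
}
```

Under this convention, only failures the NM itself reports (e.g. failed disks) would increment the node's AM-failure count.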
[jira] [Created] (YARN-4119) Expose the NM bind address as an env, so that AM can make use of it for exposing tracking URL
Naganarasimha G R created YARN-4119:
---

Summary: Expose the NM bind address as an env, so that AM can make use of it for exposing tracking URL
Key: YARN-4119
URL: https://issues.apache.org/jira/browse/YARN-4119
Project: Hadoop YARN
Issue Type: Improvement
Reporter: Naganarasimha G R
Assignee: Naganarasimha G R

As described in MAPREDUCE-5938, many security scanning tools advise against binding on all network addresses; it would be better to bind only on the desired address. Since AMs can run on any node, the NM should share its bind address with the container through an environment variable.
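A minimal sketch of how an AM could use such a variable when building its tracking URL. `NM_BIND_ADDRESS` is a hypothetical variable name for this proposal; `NM_HOST` is the environment variable NMs already expose and serves as the fallback here.

```java
import java.util.Map;

// Sketch: build the AM tracking URL from an NM-provided environment
// variable. "NM_BIND_ADDRESS" is hypothetical (the name proposed in this
// jira is not finalized); NM_HOST is the existing fallback.
class TrackingUrlBuilder {
    static String trackingUrl(Map<String, String> env, int port) {
        // Prefer the explicit bind address if the NM published one.
        String host = env.getOrDefault("NM_BIND_ADDRESS", env.get("NM_HOST"));
        return "http://" + host + ":" + port + "/";
    }
}
```

In a real AM the map would be `System.getenv()`; a map parameter keeps the sketch testable.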
[jira] [Commented] (YARN-4113) RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER
[ https://issues.apache.org/jira/browse/YARN-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14732144#comment-14732144 ]

Karthik Kambatla commented on YARN-4113:
---

Good catch. Do we have a common JIRA to track the updates to the RETRY_FOREVER policy?

> RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER
> ---
>
> Key: YARN-4113
> URL: https://issues.apache.org/jira/browse/YARN-4113
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Wangda Tan
> Assignee: Sunil G
> Priority: Critical
>
> Found an issue in how RMProxy initializes the RetryPolicy, in RMProxy#createRetryPolicy: when rmConnectWaitMS is set to -1 (wait forever), it uses RetryPolicies.RETRY_FOREVER, which doesn't respect the {{yarn.resourcemanager.connect.retry-interval.ms}} setting.
> RetryPolicies.RETRY_FOREVER uses 0 as the interval. When I ran the test {{TestYarnClient#testShouldNotRetryForeverForNonNetworkExceptions}} without a properly set up localhost name, it wrote 14G of DEBUG exception messages to the system before it died. This would be very bad if the same thing happened in a production cluster.
> We should fix two places:
> - Make RETRY_FOREVER take the retry interval as a constructor parameter.
> - Respect the retry interval when we use the RETRY_FOREVER policy.
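The first proposed fix can be sketched as follows. This is an illustration of the idea (a retry-forever policy parameterized by a sleep interval), not Hadoop's actual `RetryPolicies` code.

```java
// Sketch of the proposed fix: a retry-forever policy that reports a
// configurable delay between attempts instead of retrying in a tight loop
// with a 0 ms interval. Illustrative only, not Hadoop's RetryPolicies API.
class RetryForeverWithInterval {
    private final long retryIntervalMs;

    RetryForeverWithInterval(long retryIntervalMs) {
        this.retryIntervalMs = retryIntervalMs;
    }

    /** Always retry, but tell the caller how long to sleep first. */
    long nextRetryDelayMs(int failedAttempts) {
        return retryIntervalMs; // constant backoff, never gives up
    }
}
```

With the interval wired to `yarn.resourcemanager.connect.retry-interval.ms`, the 14G-of-DEBUG-logs failure mode disappears because attempts are spaced out.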
[jira] [Updated] (YARN-3591) Resource localization on a bad disk causes subsequent containers failure
[ https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karthik Kambatla updated YARN-3591:
---

Summary: Resource localization on a bad disk causes subsequent containers failure (was: Resource Localisation on a bad disk causes subsequent containers failure)

> Resource localization on a bad disk causes subsequent containers failure
> ---
>
> Key: YARN-3591
> URL: https://issues.apache.org/jira/browse/YARN-3591
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.0
> Reporter: Lavkesh Lahngir
> Assignee: Lavkesh Lahngir
> Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch, YARN-3591.5.patch, YARN-3591.6.patch, YARN-3591.7.patch, YARN-3591.8.patch, YARN-3591.9.patch
>
>
> It happens when a resource is localized on a disk and that disk goes bad afterwards. The NM keeps the paths of localized resources in memory. At resource-request time, isResourcePresent(rsrc) is called, which calls file.exists() on the localized path.
> In some cases, when the disk has gone bad, inodes are still cached and file.exists() returns true, but at read time the file will not open.
> Note: file.exists() actually calls stat64 natively, which returns true because it was able to find inode information from the OS.
> A proposal is to call file.list() on the parent path of the resource, which calls open() natively. If the disk is good, it should return an array of paths with length at least 1.
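The proposal above can be sketched as a standalone check. This is a simplified illustration of the idea, not the NM's actual `isResourcePresent` implementation.

```java
import java.io.File;

// Sketch of the proposed check: instead of File.exists() (a stat call that
// can succeed from cached inodes on a failed disk), list the parent
// directory, which performs an open() and fails on a bad disk.
class LocalizedResourceCheck {
    static boolean isResourcePresent(File rsrc) {
        File parent = rsrc.getParentFile();
        if (parent == null) {
            return rsrc.exists();
        }
        String[] entries = parent.list(); // forces open() on the directory
        if (entries == null) {
            return false; // I/O error: treat the disk as bad
        }
        for (String name : entries) {
            if (name.equals(rsrc.getName())) {
                return true;
            }
        }
        return false;
    }
}
```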
[jira] [Commented] (YARN-4107) Both RM becomes Active if all zookeepers can not connect to active RM
[ https://issues.apache.org/jira/browse/YARN-4107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14732143#comment-14732143 ]

Karthik Kambatla commented on YARN-4107:
---

If it fails to connect to ZK, the fencing thread should fail and lead to the RM switching to standby state. If it doesn't, making it do so is likely a better fix for this issue. No?

> Both RM becomes Active if all zookeepers can not connect to active RM
> ---
>
> Key: YARN-4107
> URL: https://issues.apache.org/jira/browse/YARN-4107
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Xuan Gong
> Assignee: Xuan Gong
> Attachments: YARN-4107.1.patch
>
>
> Steps to reproduce:
> 1) Run small randomwriter applications in the background
> 2) rm1 is active and rm2 is standby
> 3) Disconnect all ZKs from the active RM
> 4) Check the status of both RMs. Both of them are in active state
[jira] [Commented] (YARN-3367) Replace starting a separate thread for post entity with event loop in TimelineClient
[ https://issues.apache.org/jira/browse/YARN-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14732079#comment-14732079 ]

Hadoop QA commented on YARN-3367:
---

(x) *{color:red}-1 overall{color}*

|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch | 16m 1s | Findbugs (version ) appears to be broken on YARN-2928. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac | 7m 56s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 58s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 0m 46s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 33s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 41s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 2m 49s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests | 1m 59s | Tests passed in hadoop-yarn-common. |
| {color:red}-1{color} | yarn tests | 6m 38s | Tests failed in hadoop-yarn-server-nodemanager. |
| | | 48m 50s | |

|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.server.nodemanager.containermanager.TestContainerManagerRecovery |
| Timed out tests | org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerResync |
| | org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerReboot |
| | org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater |

|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12754340/YARN-3367.YARN-2928.001.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | YARN-2928 / e6afe26 |
| hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/9016/artifact/patchprocess/testrun_hadoop-yarn-common.txt |
| hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9016/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9016/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9016/console |

This message was automatically generated.

> Replace starting a separate thread for post entity with event loop in TimelineClient
> ---
>
> Key: YARN-3367
> URL: https://issues.apache.org/jira/browse/YARN-3367
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Affects Versions: YARN-2928
> Reporter: Junping Du
> Assignee: Naganarasimha G R
> Attachments: YARN-3367.YARN-2928.001.patch
>
>
> Since YARN-3039, we added a loop in TimelineClient to wait for collectorServiceAddress to be ready before posting any entity. In consumers of TimelineClient (like the AM), we start a new thread for each call to avoid a potential deadlock in the main thread. This approach has at least 3 major defects:
> 1. The consumer needs additional code to wrap a thread before calling putEntities() in TimelineClient.
> 2. It costs many thread resources, which is unnecessary.
> 3. The sequence of events can be out of order because each posting thread gets out of the waiting loop randomly.
> We should have something like an event loop on the TimelineClient side: putEntities() only puts the related entities into a queue, and a separate thread delivers the queued entities to the collector via REST calls.
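The queue-plus-dispatcher design described in the issue can be sketched as below. This is a minimal illustration of the pattern, not TimelineClient's actual code; the REST delivery is stubbed out as a callback.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Sketch of the proposed event loop: putEntitiesAsync() only enqueues, and
// a single dispatcher thread drains the queue and delivers entities to the
// collector (the REST call is represented by the `deliver` callback).
class TimelineEventLoop {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Thread dispatcher;
    private volatile boolean stopped;

    TimelineEventLoop(Consumer<String> deliver) {
        dispatcher = new Thread(() -> {
            try {
                while (!stopped) {
                    // A single consumer thread preserves FIFO order,
                    // addressing defect 3 above.
                    deliver.accept(queue.take());
                }
            } catch (InterruptedException ignored) {
                // stop() interrupts us; just exit.
            }
        });
        dispatcher.setDaemon(true);
        dispatcher.start();
    }

    /** Non-blocking: just enqueue and return (defects 1 and 2 avoided). */
    void putEntitiesAsync(String entity) {
        queue.add(entity);
    }

    void stop() {
        stopped = true;
        dispatcher.interrupt();
    }
}
```

One long-lived thread replaces the per-call threads, and ordering falls out of the single-consumer queue.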
[jira] [Updated] (YARN-3367) Replace starting a separate thread for post entity with event loop in TimelineClient
[ https://issues.apache.org/jira/browse/YARN-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Naganarasimha G R updated YARN-3367:
---

Attachment: YARN-3367.YARN-2928.001.patch

Uploading an initial patch (with no test case) for this jira. Some open points need more discussion:
# Are TimelineClient async calls only meant to ensure the client need not wait for the server response, returning immediately after requesting to post an entity, or do we also need to ensure something on the server side? Currently we send the async parameter to the server.
# Per the earlier discussion, we had to decide whether to have a 2x2 matrix of sync/async versus flush/no-flush on the server side; but after YARN-4061 (fault-tolerant writer for timeline v2), I presume the client need not ensure much, as consistency will be handled on the server side, and IMO it would be sufficient to just have a non-blocking call for async.
# Is it important to maintain the order of events sent via sync and async calls? I.e., is it required to push all pending async events along with the current sync event, or is it OK to send only the sync one? (The current patch just ensures async events stay in order.)
# Is it required to merge the entities of multiple async calls when they belong to the same application?
Please review and share your thoughts on the above points.

cc [~sjlee0]: informing you, as you had asked to be included in the discussion of these points and you were not watching this jira.

> Replace starting a separate thread for post entity with event loop in TimelineClient
> ---
>
> Key: YARN-3367
> URL: https://issues.apache.org/jira/browse/YARN-3367
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Affects Versions: YARN-2928
> Reporter: Junping Du
> Assignee: Naganarasimha G R
> Attachments: YARN-3367.YARN-2928.001.patch
>
>
> Since YARN-3039, we added a loop in TimelineClient to wait for collectorServiceAddress to be ready before posting any entity. In consumers of TimelineClient (like the AM), we start a new thread for each call to avoid a potential deadlock in the main thread. This approach has at least 3 major defects:
> 1. The consumer needs additional code to wrap a thread before calling putEntities() in TimelineClient.
> 2. It costs many thread resources, which is unnecessary.
> 3. The sequence of events can be out of order because each posting thread gets out of the waiting loop randomly.
> We should have something like an event loop on the TimelineClient side: putEntities() only puts the related entities into a queue, and a separate thread delivers the queued entities to the collector via REST calls.
[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity
[ https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731875#comment-14731875 ]

Sunil G commented on YARN-4091:
---

Thank you [~leftnoteasy] for the detailed information. Based on your input, and after syncing with [~rohithsharma] and [~nijel] offline, I am trying to summarize a viewpoint. Only very raw information is shown in the REST response example for now; we will add detailed information later.

*Adding more diagnostics and debug information to the scheduler gives the user two levels of knowledge. By fetching this information with 2 REST API calls, the specific reason for a potential problem in the scheduler can be identified and acted upon.*

*1*. What happened to an application recently in the scheduler (e.g. status from node heartbeats).
*Example*:
- The application might not have got the containers it asked for. Reason: the user limit for the application has been reached.
- The application might still be in pending state, yet to become active. Reason: the AM resource limit is exhausted, hence the app cannot be made active.

*Benefit for the user with this info*: the user gets a clear problem area to look at, along with a potential reason for it.
*How the user can get this info*: via a REST API, debug/diagnostic information can be fetched for a queue/application.
*Expected output*:
{noformat}
queue - a:
  application : app1
    appState : RUNNING
    reasonPhrase : NA
    lastContainerAssignmentState : SKIPPED_ASSIGNMENT
    reasonPhrase : Userlimit quota is reached
  application : app2
    appState : ACCEPTED
    reasonPhrase : AM resource limit exhausted
{noformat}

*2*. Data/metrics information from the scheduler specific to the problem identified in 1.
*Example*:
- The user can fetch metrics via REST such as the current queue capacity, the user limit configured, the user limit calculated within the scheduler, etc.
- The user can fetch metrics via REST such as queue capacity, the AM resource % configured, the AM resource % calculated within the RM, the current demand, etc.

This two-level information will help the user take the correct measure in the cluster to fix the problem, such as increasing the priority of an app, changing the queue of an application, killing some containers on a node manually, or some auto-tuning from the AM side.

> Improvement: Introduce more debug/diagnostics information to detail out scheduler activity
> ---
>
> Key: YARN-4091
> URL: https://issues.apache.org/jira/browse/YARN-4091
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: capacity scheduler, resourcemanager
> Affects Versions: 2.7.0
> Reporter: Sunil G
> Assignee: Sunil G
> Attachments: Improvement on debugdiagnostic information - YARN.pdf
>
>
> As schedulers are improved with various new capabilities, more of the configurations that tune them start to take actions such as limiting container assignment to an application or delaying container allocation. No clear information is passed from the scheduler to the outer world in these scenarios, which makes debugging much harder.
> This ticket is an effort to introduce more well-defined states at the various points where the scheduler skips/rejects container assignment, activates an application, etc. Such information will help users know what is happening in the scheduler.
> Attaching a short proposal for initial discussion. We would like to improve on it as we discuss.