[jira] [Created] (YARN-4332) UI timestamps are unconditionally rendered in browser timezone
Jason Lowe created YARN-4332: Summary: UI timestamps are unconditionally rendered in browser timezone Key: YARN-4332 URL: https://issues.apache.org/jira/browse/YARN-4332 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Jason Lowe

Timestamps are being rendered in the browser's local timezone, which makes it hard to line them up with events in task logfiles when the cluster isn't in the same timezone as the browser. This either needs to be restored to UTC time, or at least made configurable.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4032) Corrupted state from a previous version can still cause RM to fail with NPE due to same reasons as YARN-2834
[ https://issues.apache.org/jira/browse/YARN-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990399#comment-14990399 ] Karthik Kambatla commented on YARN-4032:

[~jianhe]'s suggestion makes sense to me. Maybe do the following:
{code}
if (app-recovery-fails) {
  if (previous attempt is FINISHED) {
    skip this application
  } else if (fail-fast is false) {
    fail application
  } else {
    crash RM
  }
}
{code}
> Corrupted state from a previous version can still cause RM to fail with NPE
> due to same reasons as YARN-2834
>
> Key: YARN-4032
> URL: https://issues.apache.org/jira/browse/YARN-4032
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.1
> Reporter: Anubhav Dhoot
> Priority: Critical
> Attachments: YARN-4032.prelim.patch
>
> YARN-2834 ensures in 2.6.0 there will not be any inconsistent state. But if
> someone is upgrading from a previous version, the state can still be
> inconsistent and then RM will still fail with NPE after upgrade to 2.6.0.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
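Rendering the branch logic above as a small Java sketch; the enum and method names are illustrative, not actual RM code:

```java
// Illustrative sketch of the recovery-failure handling proposed above.
// The real RM code paths differ; this only captures the decision tree.
public class AppRecoveryPolicy {
    public enum Action { NONE, SKIP_APP, FAIL_APP, CRASH_RM }

    public static Action onRecoveryFailure(boolean recoveryFailed,
                                           boolean previousAttemptFinished,
                                           boolean failFast) {
        if (!recoveryFailed) {
            return Action.NONE;
        }
        if (previousAttemptFinished) {
            // Nothing left to recover; skip the application.
            return Action.SKIP_APP;
        }
        // fail-fast false: fail just this application; fail-fast true: crash the RM.
        return failFast ? Action.CRASH_RM : Action.FAIL_APP;
    }
}
```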
[jira] [Assigned] (YARN-4330) MiniYARNCluster prints multiple Failed to instantiate default resource calculator warning messages
[ https://issues.apache.org/jira/browse/YARN-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena reassigned YARN-4330:

Assignee: Varun Saxena

> MiniYARNCluster prints multiple Failed to instantiate default resource
> calculator warning messages
>
> Key: YARN-4330
> URL: https://issues.apache.org/jira/browse/YARN-4330
> Project: Hadoop YARN
> Issue Type: Bug
> Components: test, yarn
> Affects Versions: 2.8.0
> Environment: OSX, JUnit
> Reporter: Steve Loughran
> Assignee: Varun Saxena
> Priority: Blocker
>
> Whenever I try to start a MiniYARNCluster on Branch-2 (commit #0b61cca), I
> see multiple stack traces warning me that a resource calculator plugin could
> not be created
> {code}
> (ResourceCalculatorPlugin.java:getResourceCalculatorPlugin(184)) -
> java.lang.UnsupportedOperationException: Could not determine OS: Failed to
> instantiate default resource calculator.
> java.lang.UnsupportedOperationException: Could not determine OS
> {code}
> This is a minicluster. It doesn't need resource calculation. It certainly
> doesn't need test logs being cluttered with even more stack traces which will
> only generate false alarms about tests failing.
> There needs to be a way to turn this off, and the minicluster should have it
> that way by default.
> Being ruthless and marking as a blocker, because it's a fairly major
> regression for anyone testing with the minicluster.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4314) Adding container wait time as a metric at queue level and application level.
[ https://issues.apache.org/jira/browse/YARN-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990346#comment-14990346 ] Karthik Kambatla commented on YARN-4314: bq. I feel adding timestamp to each resource request will be costly and all the existing applications will need to migrate to use this metric. I was not suggesting the AM set it. It might not be a bad idea to let the AMs set it optionally. I was thinking the RM could set this on receiving a ResourceRequest, and use it to determine duration. > Adding container wait time as a metric at queue level and application level. > > > Key: YARN-4314 > URL: https://issues.apache.org/jira/browse/YARN-4314 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Lavkesh Lahngir >Assignee: Lavkesh Lahngir > > There is a need for adding the container wait-time which can be tracked at > the queue and application level. > An application can have two kinds of wait times. One is AM wait time after > submission and another is total container wait time between AM asking for > containers and getting them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
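Karthik's suggestion above (the RM stamping each resource request on receipt and deriving the wait duration at allocation time) could be sketched roughly as follows. This is illustrative only, not YARN code; the RequestWaitTracker class and the requestId key are hypothetical stand-ins:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: the RM stamps each resource request when it arrives,
// then derives the container wait time when a matching container is allocated.
// That duration could then feed the proposed queue- and app-level metrics.
public class RequestWaitTracker {
    private final Map<Long, Long> receivedAtMillis = new HashMap<>();

    // Called when the RM receives a ResourceRequest; first receipt wins.
    public void onRequestReceived(long requestId, long nowMillis) {
        receivedAtMillis.putIfAbsent(requestId, nowMillis);
    }

    // Called when a container satisfying the request is allocated;
    // returns the wait time, or 0 if the request was never stamped.
    public long onContainerAllocated(long requestId, long nowMillis) {
        Long start = receivedAtMillis.remove(requestId);
        return start == null ? 0L : nowMillis - start;
    }
}
```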
[jira] [Commented] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list
[ https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990565#comment-14990565 ] Jason Lowe commented on YARN-4311:

Thanks for the patch, Kuhu! Test failures are related, please look into them.

In general the patch seems like a reasonable approach. There needs to be some way for admins to remove nodes that are no longer relevant to the cluster, and AFAIK there's no supported way to do this short of restarting the resourcemanager. As nodes churn in and out of the cluster, they will simply accumulate in the decommissioned or lost nodes buckets until the next resourcemanager restart.

My main concern is the behavior when someone botches the include list (e.g. accidentally truncates the includes list file and refreshes). At that point all of the cluster nodes will disappear from the resourcemanager with no indication of what happened (except potentially the shutdown metric will increment by the number of nodes lost). Today they will all go into the decommissioned bucket, but with this patch they'll simply disappear. This either needs to be "as designed" behavior, or we'd have to implement a separate mechanism outside of the include/exclude lists to direct the RM to "forget" a node. I believe HDFS was recently changed to behave this way as well wrt. the include/exclude lists and forgetting nodes (see HDFS-8950), so I'm inclined to be consistent with that and say it's "as designed."

Some comments on the patch itself:
# isInvalidAndAbsent doesn't have the same handling of IPs as isValidNode does. It might also be clearer if isInvalidAndAbsent were just named isUntracked or isUntrackedNode, indicating those are nodes we aren't tracking in any way.
# isInvalidAndAbsent doesn't lock hostsReader like isValidNode does.
# What about refreshNodesGracefully? That also refreshes the host include/exclude lists and arguably needs similar logic.
We need to discuss what it means to gracefully refresh the list when the node completely disappears from both the include and exclude list. Should it still gracefully decommission, and how do we make sure that node is properly tracked? If graceful, does it automatically disappear when the decommission completes since it's not in either list?

Nit: While looping over the nodes, if the node is valid then there's no reason to check if it's not valid and absent. So it could be simplified to the following:
{code}
for (NodeId nodeId : rmContext.getRMNodes().keySet()) {
  if (!isValidNode(nodeId.getHost())) {
    RMNodeEventType nodeEventType = isInvalidAndAbsent(nodeId.getHost())
        ? RMNodeEventType.SHUTDOWN : RMNodeEventType.DECOMMISSION;
    this.rmContext.getDispatcher().getEventHandler().handle(
        new RMNodeEvent(nodeId, nodeEventType));
  }
}
{code}
> Removing nodes from include and exclude lists will not remove them from
> decommissioned nodes list
>
> Key: YARN-4311
> URL: https://issues.apache.org/jira/browse/YARN-4311
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.6.1
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Attachments: YARN-4311-v1.patch
>
> In order to fully forget about a node, removing the node from include and
> exclude list is not sufficient. The RM lists it under Decomm-ed nodes. The
> tricky part that [~jlowe] pointed out was the case when include lists are not
> used, in that case we don't want the nodes to fall off if they are not active.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
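Following up on the review comments above (renaming isInvalidAndAbsent to isUntrackedNode, and taking the hosts-reader lock the way isValidNode does), a rough sketch of what that could look like. The class, lock object, and in-memory sets are stand-ins for the real NodesListManager/HostsFileReader plumbing, and the IP-resolution handling the review asks for is omitted:

```java
import java.util.Set;

// Sketch of the reviewer's suggestion: a node is "untracked" when it appears
// in neither the include nor the exclude list. The synchronized block mirrors
// the locking isValidNode does on the shared hosts reader.
public class NodeTracking {
    private final Object hostsReaderLock = new Object();
    private final Set<String> includes;
    private final Set<String> excludes;

    public NodeTracking(Set<String> includes, Set<String> excludes) {
        this.includes = includes;
        this.excludes = excludes;
    }

    public boolean isUntrackedNode(String hostName) {
        synchronized (hostsReaderLock) {
            // An empty include list means "all hosts allowed",
            // so in that case no node is considered untracked.
            return !includes.isEmpty()
                && !includes.contains(hostName)
                && !excludes.contains(hostName);
        }
    }
}
```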
[jira] [Commented] (YARN-570) Time strings are formated in different timezone
[ https://issues.apache.org/jira/browse/YARN-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990603#comment-14990603 ] Jason Lowe commented on YARN-570: - So this changed timestamps to be rendered unconditionally in local time? That's unfortunate. See [~aw]'s comment in https://issues.apache.org/jira/browse/YARN-2348?focusedCommentId=14073218=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14073218. Unfortunately local timezone isn't always the right thing to do because the timestamps of the log files for tasks will _not_ be the local timezone of the browser when running jobs in a distant colo. So this change makes it worse for users that are lining up events based on what they see in the UI with what they see in the logs. This minimally should have been configurable. Filed YARN-4332 to either revert this change or make it configurable. > Time strings are formated in different timezone > --- > > Key: YARN-570 > URL: https://issues.apache.org/jira/browse/YARN-570 > Project: Hadoop YARN > Issue Type: Bug > Components: webapp >Affects Versions: 2.2.0 >Reporter: Peng Zhang >Assignee: Akira AJISAKA > Fix For: 2.7.0 > > Attachments: MAPREDUCE-5141.patch, YARN-570.2.patch, > YARN-570.3.patch, YARN-570.4.patch, YARN-570.5.patch > > > Time strings on different page are displayed in different timezone. > If it is rendered by renderHadoopDate() in yarn.dt.plugins.js, it appears as > "Wed, 10 Apr 2013 08:29:56 GMT" > If it is formatted by format() in yarn.util.Times, it appears as "10-Apr-2013 > 16:29:56" > Same value, but different timezone. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
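For reference, the difference under discussion comes down to which TimeZone the formatter is pinned to. A minimal sketch (the class name is illustrative; the real code paths are yarn.util.Times on the server side and renderHadoopDate() in yarn.dt.plugins.js in the browser):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Same epoch value, formatted either in an explicit zone (e.g. UTC, matching
// the task logfiles) or in whatever the local default happens to be.
public class TimestampRendering {
    public static String format(long epochMillis, TimeZone tz) {
        SimpleDateFormat fmt =
            new SimpleDateFormat("dd-MMM-yyyy HH:mm:ss z", Locale.US);
        fmt.setTimeZone(tz);
        return fmt.format(new Date(epochMillis));
    }
}
```

Pinning the UI formatter to UTC (or a configured zone) rather than the browser default is what would keep the UI consistent with log timestamps.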
[jira] [Assigned] (YARN-2885) LocalRM: distributed scheduling decisions for queueable containers
[ https://issues.apache.org/jira/browse/YARN-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun Suresh reassigned YARN-2885: - Assignee: Arun Suresh > LocalRM: distributed scheduling decisions for queueable containers > -- > > Key: YARN-2885 > URL: https://issues.apache.org/jira/browse/YARN-2885 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Konstantinos Karanasos >Assignee: Arun Suresh > > We propose to add a Local ResourceManager (LocalRM) to the NM in order to > support distributed scheduling decisions. > Architecturally we leverage the RMProxy, introduced in YARN-2884. > The LocalRM makes distributed decisions for queueable container requests. > Guaranteed-start requests are still handled by the central RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2882) Introducing container types
[ https://issues.apache.org/jira/browse/YARN-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990881#comment-14990881 ] Arun Suresh commented on YARN-2882:

Thanks for the patch [~kkaranasos]. The patch looks mostly good. A few minor nits:
# I feel that instead of adding another *newInstance* method to the *ResourceRequest* class, maybe we could replace it with some sort of builder pattern, e.g. something like:
{noformat}
ResourceRequest req = new ResourceRequestBuilder()
    .setPriority(pri)
    .setHostName(hostname)
    .setContainerType(QUEUEABLE)
    ...
    .build();
{noformat}
(I understand this might impact other parts of the code, but I believe it would make it more extensible in the future.)
# In the *yarn_protos.proto* file, can we add *container_type* after the *node_label_expression* field? (I feel newer fields should come later.)

Also, it looks like the patch does not apply cleanly anymore; can you please rebase?

> Introducing container types
>
> Key: YARN-2882
> URL: https://issues.apache.org/jira/browse/YARN-2882
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager, resourcemanager
> Reporter: Konstantinos Karanasos
> Assignee: Konstantinos Karanasos
> Attachments: yarn-2882.patch
>
> This JIRA introduces the notion of container types.
> We propose two initial types of containers: guaranteed-start and queueable
> containers.
> Guaranteed-start are the existing containers, which are allocated by the
> central RM and are instantaneously started, once allocated.
> Queueable is a new type of container, which allows containers to be queued in
> the NM, thus their execution may be arbitrarily delayed.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
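A builder along the lines Arun sketches could look like the following. ResourceRequestBuilder is hypothetical, and the field set is trimmed to the ones mentioned in the review; the nested ResourceRequest is a minimal stand-in for the real class:

```java
// Hypothetical builder for the ResourceRequest construction discussed above.
// New fields (like containerType) can be added without growing a family of
// newInstance(...) overloads, which is the extensibility argument made here.
public class ResourceRequestBuilder {
    public enum ContainerType { GUARANTEED_START, QUEUEABLE }

    /** Minimal stand-in for the real ResourceRequest; fields trimmed for the sketch. */
    public static final class ResourceRequest {
        public final int priority;
        public final String hostName;
        public final ContainerType containerType;

        ResourceRequest(int priority, String hostName, ContainerType containerType) {
            this.priority = priority;
            this.hostName = hostName;
            this.containerType = containerType;
        }
    }

    private int priority;
    private String hostName;
    private ContainerType containerType = ContainerType.GUARANTEED_START;

    public ResourceRequestBuilder setPriority(int priority) {
        this.priority = priority;
        return this;
    }

    public ResourceRequestBuilder setHostName(String hostName) {
        this.hostName = hostName;
        return this;
    }

    public ResourceRequestBuilder setContainerType(ContainerType type) {
        this.containerType = type;
        return this;
    }

    public ResourceRequest build() {
        return new ResourceRequest(priority, hostName, containerType);
    }
}
```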
[jira] [Commented] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990702#comment-14990702 ] Hadoop QA commented on YARN-3223: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 6s {color} | {color:blue} docker + precommit patch detected. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 2 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 18s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 38s {color} | {color:green} trunk passed with JDK v1.8.0_60 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 24s {color} | {color:green} trunk passed with JDK v1.7.0_79 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 59s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 28s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 44s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 39s {color} | {color:green} trunk passed with JDK v1.8.0_60 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 41s {color} | {color:green} trunk passed with JDK v1.7.0_79 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 45s {color} | {color:green} the patch passed {color} | | 
{color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 27s {color} | {color:green} the patch passed with JDK v1.8.0_60 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 4m 27s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 25s {color} | {color:green} the patch passed with JDK v1.7.0_79 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 4m 25s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 56s {color} | {color:red} Patch generated 1 new checkstyle issues in root (total was 242, now 242). {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 28s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 56s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 34s {color} | {color:green} the patch passed with JDK v1.8.0_60 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 39s {color} | {color:green} the patch passed with JDK v1.7.0_79 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 58m 0s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_60. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 50s {color} | {color:green} hadoop-sls in the patch passed with JDK v1.8.0_60. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 59m 22s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_79. 
{color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 53s {color} | {color:green} hadoop-sls in the patch passed with JDK v1.7.0_79. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 24s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 151m 51s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_60 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | JDK v1.7.0_79 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | \\ \\ || Subsystem || Report/Notes || | Docker | Client=1.7.1 Server=1.7.1 Image:test-patch-base-hadoop-date2015-11-04 | | JIRA Patch URL |
[jira] [Created] (YARN-4333) Fair scheduler should support preemption within queue
Tao Jie created YARN-4333: - Summary: Fair scheduler should support preemption within queue Key: YARN-4333 URL: https://issues.apache.org/jira/browse/YARN-4333 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Tao Jie

Now each app in the fair scheduler is allocated its fair share; however, fair-share resources are not ensured even if fairSharePreemption is enabled. Consider:
1. When the cluster is idle, we submit app1 to queueA, which takes the maxResources of queueA.
2. Then the cluster becomes busy, but app1 does not release any resources, so queueA's resource usage is over its fair share.
3. Then we submit app2 (maybe with higher priority) to queueA. Now app2 has its own fair share, but cannot obtain any resources, since queueA is still over its fair share and resources will not be assigned to queueA anymore. Also, preemption is not triggered in this case.

So we should allow preemption within a queue when an app is starved for its fair share.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
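The starvation condition this asks for boils down to a check like the one below; the class name, threshold, and timeout parameters are illustrative, not FairScheduler code:

```java
// Illustrative starvation test for within-queue preemption: an app is starved
// for its fair share when its usage has stayed below its fair share (scaled by
// a preemption threshold) for longer than a configured timeout.
public class FairShareStarvation {
    public static boolean isStarvedForFairShare(
            double usage, double fairShare, double threshold,
            long belowShareSinceMillis, long nowMillis, long timeoutMillis) {
        return usage < fairShare * threshold
            && (nowMillis - belowShareSinceMillis) >= timeoutMillis;
    }
}
```

In the scenario above, app2 would satisfy this check (usage 0, positive fair share) once the timeout elapses, which is what would let the scheduler preempt app1's containers inside queueA.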
[jira] [Commented] (YARN-2047) RM should honor NM heartbeat expiry after RM restart
[ https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991069#comment-14991069 ] Bikas Saha commented on YARN-2047:

From the description it seems like the original scope was making sure that a lost NM's containers are marked expired by the RM even across RM restart. For that, won't it be enough to save dead/decommissioned NM info in the state store? Upon restart, repopulate the decommissioned/dead status from the state store. It can take appropriate action at that time - e.g. cancelling AM containers for those NMs when the AM re-registers, or asking those NMs to restart and re-register if they heartbeat again.

If this is a required action then it would also imply that saving such nodes would be a critical state-change operation. So, e.g., a decommission command from the admin should not complete until the store has been updated. Is that the case?

> RM should honor NM heartbeat expiry after RM restart
>
> Key: YARN-2047
> URL: https://issues.apache.org/jira/browse/YARN-2047
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Bikas Saha
>
> After the RM restarts, it forgets about existing NMs (and their potentially
> decommissioned status too). After restart, the RM cannot maintain the
> contract to the AMs that a lost NM's containers will be marked finished
> within the expiry time.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
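The ordering Bikas describes (the admin's decommission call returning only after the status is durably recorded, so a restarted RM can repopulate it) can be sketched as follows; the state-store API here is an in-memory stand-in, not the real RMStateStore:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of decommission as a critical state-change operation: the call only
// returns after the node's status is recorded, so an RM restart can recover
// dead/decommissioned nodes from the store instead of forgetting them.
public class DecommissionFlow {
    public enum NodeStatus { ACTIVE, DECOMMISSIONED }

    // Stand-in for a durable state store keyed by node id.
    private final Map<String, NodeStatus> stateStore = new HashMap<>();

    public void decommissionNode(String nodeId) {
        stateStore.put(nodeId, NodeStatus.DECOMMISSIONED); // durable write first
        // ... only then stop scheduling on the node, expire its containers, etc.
    }

    // What a restarted RM would see when repopulating from the store.
    public NodeStatus recoveredStatus(String nodeId) {
        return stateStore.getOrDefault(nodeId, NodeStatus.ACTIVE);
    }
}
```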
[jira] [Commented] (YARN-4292) ResourceUtilization should be a part of NodeInfo REST API
[ https://issues.apache.org/jira/browse/YARN-4292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991253#comment-14991253 ] Sunil G commented on YARN-4292: --- Test case failures seem unrelated. > ResourceUtilization should be a part of NodeInfo REST API > - > > Key: YARN-4292 > URL: https://issues.apache.org/jira/browse/YARN-4292 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Sunil G > Attachments: 0001-YARN-4292.patch, 0002-YARN-4292.patch, > 0003-YARN-4292.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2934) Improve handling of container's stderr
[ https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991260#comment-14991260 ] Naganarasimha G R commented on YARN-2934: - Can one of the watchers please take a look at the patch? > Improve handling of container's stderr > --- > > Key: YARN-2934 > URL: https://issues.apache.org/jira/browse/YARN-2934 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Gera Shegalov >Assignee: Naganarasimha G R >Priority: Critical > Attachments: YARN-2934.v1.001.patch, YARN-2934.v1.002.patch, > YARN-2934.v1.003.patch > > > Most YARN applications redirect stderr to some file. That's why when > container launch fails with {{ExitCodeException}} the message is empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3980) Plumb resource-utilization info in node heartbeat through to the scheduler
[ https://issues.apache.org/jira/browse/YARN-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Inigo Goiri updated YARN-3980: -- Attachment: YARN-3980-v4.patch Adding utilization to the FIFO and Fair schedulers. > Plumb resource-utilization info in node heartbeat through to the scheduler > -- > > Key: YARN-3980 > URL: https://issues.apache.org/jira/browse/YARN-3980 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, scheduler >Affects Versions: 2.7.1 >Reporter: Karthik Kambatla >Assignee: Inigo Goiri > Attachments: YARN-3980-v0.patch, YARN-3980-v1.patch, > YARN-3980-v2.patch, YARN-3980-v3.patch, YARN-3980-v4.patch > > > YARN-1012 and YARN-3534 collect resource utilization information for all > containers and the node respectively and send it to the RM on node heartbeat. > We should plumb it through to the scheduler so the scheduler can make use of > it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3980) Plumb resource-utilization info in node heartbeat through to the scheduler
[ https://issues.apache.org/jira/browse/YARN-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990284#comment-14990284 ] Inigo Goiri commented on YARN-3980: --- I added the info to the FIFO and Fair schedulers. I'm rerunning the checks because I couldn't figure out the errors in the Javadoc and the report is out. Regarding the test, we will add a unit test with the mini cluster. > Plumb resource-utilization info in node heartbeat through to the scheduler > -- > > Key: YARN-3980 > URL: https://issues.apache.org/jira/browse/YARN-3980 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, scheduler >Affects Versions: 2.7.1 >Reporter: Karthik Kambatla >Assignee: Inigo Goiri > Attachments: YARN-3980-v0.patch, YARN-3980-v1.patch, > YARN-3980-v2.patch, YARN-3980-v3.patch, YARN-3980-v4.patch > > > YARN-1012 and YARN-3534 collect resource utilization information for all > containers and the node respectively and send it to the RM on node heartbeat. > We should plumb it through to the scheduler so the scheduler can make use of > it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4292) ResourceUtilization should be a part of NodeInfo REST API
[ https://issues.apache.org/jira/browse/YARN-4292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990210#comment-14990210 ] Hadoop QA commented on YARN-4292: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 10s {color} | {color:blue} docker + precommit patch detected. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 2 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 27s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 5m 0s {color} | {color:green} trunk passed with JDK v1.8.0_60 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 35s {color} | {color:green} trunk passed with JDK v1.7.0_79 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 4s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 28s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 51s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 37s {color} | {color:green} trunk passed with JDK v1.8.0_60 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 43s {color} | {color:green} trunk passed with JDK v1.7.0_79 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 49s {color} | {color:green} the patch passed {color} | | 
{color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 40s {color} | {color:green} the patch passed with JDK v1.8.0_60 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 4m 40s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 34s {color} | {color:green} the patch passed with JDK v1.7.0_79 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 4m 34s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 4s {color} | {color:red} Patch generated 14 new checkstyle issues in root (total was 127, now 141). {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 32s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 10s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 39s {color} | {color:green} the patch passed with JDK v1.8.0_60 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 43s {color} | {color:green} the patch passed with JDK v1.7.0_79 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 58m 41s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_60. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 50s {color} | {color:green} hadoop-sls in the patch passed with JDK v1.8.0_60. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 59m 3s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_79. 
{color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 53s {color} | {color:green} hadoop-sls in the patch passed with JDK v1.7.0_79. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 24s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 154m 24s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_60 Failed junit tests | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | JDK v1.7.0_79 Failed junit tests | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | | hadoop.yarn.server.resourcemanager.TestClientRMTokens | \\ \\ || Subsystem || Report/Notes || | Docker | Client=1.7.1 Server=1.7.1 Image:test-patch-base-hadoop-date2015-11-04 | | JIRA Patch URL |
[jira] [Updated] (YARN-4331) Restarting NodeManager leaves orphaned containers
[ https://issues.apache.org/jira/browse/YARN-4331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-4331:

Summary: Restarting NodeManager leaves orphaned containers (was: Killing NodeManager leaves orphaned containers)

Note that killing the nodemanager itself with SIGKILL should not by itself cause the containers to be killed. Instead the problem seems to be that when the nodemanager restarts, it is either failing to reacquire the containers that were running, or it reacquires them and the RM fails to tell the NM to kill them when it re-registers. Updating the summary accordingly. Also, by "the AM and its container" I assume you mean the application master and some other container that the AM launched. Please correct me if I'm wrong.

Is work-preserving nodemanager restart enabled on this cluster? Without it, nodemanagers cannot track containers that were previously running, so they will not be able to reacquire and kill them. If the containers don't exit on their own then they will "leak" and continue running outside of YARN's knowledge. If that feature is not enabled on the nodemanager then this behavior is expected, since killing it with SIGKILL gave the nodemanager no chance to perform any container cleanup on its own.

If restart is enabled on the nodemanager then this behavior could be correct if the running application told the RM that containers should not be killed when AM attempts fail. In that case the container should be left running, and it's up to the AM to reacquire it via some means. (I believe the RM does provide a bit of help there in the AM-RM protocol.)

If the containers were supposed to be killed when the AM attempt failed then we need to figure out which of the two possibilities above is the problem. Could you look in the NM logs and see if it said it was able to reacquire the previously running containers before it was killed?
If it didn't then we need to figure out why, and log snippets around the restart/recovery would be a big help. If it did reacquire the containers and register to the RM with those containers then apparently the RM didn't tell the NM to kill the undesired containers. In that case the log from the RM side around the time the NM re-registered would be helpful. > Restarting NodeManager leaves orphaned containers > - > > Key: YARN-4331 > URL: https://issues.apache.org/jira/browse/YARN-4331 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, yarn >Affects Versions: 2.7.1 >Reporter: Joseph >Priority: Critical > > We are seeing a lot of orphaned containers running in our production clusters. > I tried to simulate this locally on my machine and can replicate the issue by > killing nodemanager. > I'm running Yarn 2.7.1 with RM state stored in zookeeper and deploying samza > jobs. > Steps: > {quote}1. Deploy a job > 2. Issue a kill -9 signal to nodemanager > 3. We should see the AM and its container running without nodemanager > 4. AM should die but the container still keeps running > 5. Restarting nodemanager brings up new AM and container but leaves the > orphaned container running in the background > {quote} > This is effectively causing double processing of data. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
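For reference, work-preserving NM restart is controlled by the recovery properties in yarn-site.xml; a minimal example (the local recovery directory path is illustrative):

```xml
<!-- Enable NM recovery so a restarted nodemanager can reacquire
     the containers that were running before it went down. -->
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<!-- Local directory where the NM persists its recovery state. -->
<property>
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/var/lib/hadoop-yarn/nm-recovery</value>
</property>
```

The NM restart documentation also recommends pinning yarn.nodemanager.address to a fixed port rather than an ephemeral one, so the restarted NM comes back on the same address.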
[jira] [Commented] (YARN-1510) Make NMClient support change container resources
[ https://issues.apache.org/jira/browse/YARN-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989637#comment-14989637 ] MENG DING commented on YARN-1510: - I just ran these tests locally with the latest trunk and YARN-1510 applied, and they all passed: {code} --- T E S T S --- Running org.apache.hadoop.yarn.client.TestGetGroups Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.886 sec - in org.apache.hadoop.yarn.client.TestGetGroups Running org.apache.hadoop.yarn.client.api.impl.TestYarnClient Tests run: 22, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 27.187 sec - in org.apache.hadoop.yarn.client.api.impl.TestYarnClient Results : Tests run: 28, Failures: 0, Errors: 0, Skipped: 0 {code} > Make NMClient support change container resources > > > Key: YARN-1510 > URL: https://issues.apache.org/jira/browse/YARN-1510 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Wangda Tan (No longer used) >Assignee: MENG DING > Attachments: YARN-1510-YARN-1197.1.patch, > YARN-1510-YARN-1197.2.patch, YARN-1510.3.patch, YARN-1510.4.patch, > YARN-1510.5.patch, YARN-1510.6.patch, YARN-1510.7.patch > > > As described in YARN-1197, YARN-1449, we need add API in NMClient to support > 1) sending request of increase/decrease container resource limits > 2) get succeeded/failed changed containers response from NM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1510) Make NMClient support change container resources
[ https://issues.apache.org/jira/browse/YARN-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989648#comment-14989648 ] MENG DING commented on YARN-1510: - Also ran the following tests; they all passed: {code} --- T E S T S --- Running org.apache.hadoop.yarn.client.api.impl.TestAMRMClient Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 51.842 sec - in org.apache.hadoop.yarn.client.api.impl.TestAMRMClient Running org.apache.hadoop.yarn.client.api.impl.TestNMClient Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 73.733 sec - in org.apache.hadoop.yarn.client.api.impl.TestNMClient Results : Tests run: 12, Failures: 0, Errors: 0, Skipped: 0 {code} > Make NMClient support change container resources > > > Key: YARN-1510 > URL: https://issues.apache.org/jira/browse/YARN-1510 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Wangda Tan (No longer used) >Assignee: MENG DING > Attachments: YARN-1510-YARN-1197.1.patch, > YARN-1510-YARN-1197.2.patch, YARN-1510.3.patch, YARN-1510.4.patch, > YARN-1510.5.patch, YARN-1510.6.patch, YARN-1510.7.patch > > > As described in YARN-1197, YARN-1449, we need add API in NMClient to support > 1) sending request of increase/decrease container resource limits > 2) get succeeded/failed changed containers response from NM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3980) Plumb resource-utilization info in node heartbeat through to the scheduler
[ https://issues.apache.org/jira/browse/YARN-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989664#comment-14989664 ] Karthik Kambatla commented on YARN-3980: Thanks Inigo. Looks generally good. Comments: # javadoc error in SchedulerNode # Can we make the necessary changes in FairScheduler and FifoScheduler as well? It should be similar and straightforward. Not sure if there is a simple way to test this. Can we leverage the SLS to verify that the node utilization passed in the heartbeat shows up in the scheduler? If not, I am comfortable with checking this in without a test. > Plumb resource-utilization info in node heartbeat through to the scheduler > -- > > Key: YARN-3980 > URL: https://issues.apache.org/jira/browse/YARN-3980 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, scheduler >Affects Versions: 2.7.1 >Reporter: Karthik Kambatla >Assignee: Inigo Goiri > Attachments: YARN-3980-v0.patch, YARN-3980-v1.patch, > YARN-3980-v2.patch, YARN-3980-v3.patch > > > YARN-1012 and YARN-3534 collect resource utilization information for all > containers and the node respectively and send it to the RM on node heartbeat. > We should plumb it through to the scheduler so the scheduler can make use of > it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list
[ https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989697#comment-14989697 ] Kuhu Shukla commented on YARN-4311: --- [~jlowe], [~leftnoteasy], request for comments. Thanks a lot. > Removing nodes from include and exclude lists will not remove them from > decommissioned nodes list > - > > Key: YARN-4311 > URL: https://issues.apache.org/jira/browse/YARN-4311 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.1 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: YARN-4311-v1.patch > > > In order to fully forget about a node, removing the node from include and > exclude list is not sufficient. The RM lists it under Decomm-ed nodes. The > tricky part that [~jlowe] pointed out was the case when include lists are not > used, in that case we don't want the nodes to fall off if they are not active. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2047) RM should honor NM heartbeat expiry after RM restart
[ https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989812#comment-14989812 ] Jun Gong commented on YARN-2047: For case 1, the RM could save dead NMs in the StateStore; when such an NM registers with containers, the RM could tell the NM to kill those containers. > RM should honor NM heartbeat expiry after RM restart > > > Key: YARN-2047 > URL: https://issues.apache.org/jira/browse/YARN-2047 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha > > After the RM restarts, it forgets about existing NM's (and their potentially > decommissioned status too). After restart, the RM cannot maintain the > contract to the AM's that a lost NM's containers will be marked finished > within the expiry time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4292) ResourceUtilization should be a part of NodeInfo REST API
[ https://issues.apache.org/jira/browse/YARN-4292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-4292: -- Attachment: 0003-YARN-4292.patch Thank you [~leftnoteasy]. Yes, it's better to have a separate class for the resourceUtilization details. Kindly help to check the updated patch. > ResourceUtilization should be a part of NodeInfo REST API > - > > Key: YARN-4292 > URL: https://issues.apache.org/jira/browse/YARN-4292 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Sunil G > Attachments: 0001-YARN-4292.patch, 0002-YARN-4292.patch, > 0003-YARN-4292.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2047) RM should honor NM heartbeat expiry after RM restart
[ https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989769#comment-14989769 ] Jun Gong commented on YARN-2047: I think we could list the cases which cause the problem in this issue: 1. When the RM restarts, the NM stops and cannot restart (e.g. the server is down forever). To deal with this case, the RM might need to save information about NMs and their containers, which might not be acceptable as discussed in YARN-3161. 2. The NM stops; after some time, RM1 regards it as dead and completes the containers on it; RM1 stops and RM2 becomes the active RM. Then the NM restarts, and those containers become live again when the NM registers them with RM2. This case is more common than the first one, and we need to solve it. How about solving the problem on the NM side? My proposal: add a timestamp to the NMStateStore and update it regularly. When the NM restarts, it compares the current time with the last updated timestamp, so it knows whether the RM has regarded it as dead, and kills its containers if so. If the proposal for case 2 is OK, I could attach a patch. > RM should honor NM heartbeat expiry after RM restart > > > Key: YARN-2047 > URL: https://issues.apache.org/jira/browse/YARN-2047 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha > > After the RM restarts, it forgets about existing NM's (and their potentially > decommissioned status too). After restart, the RM cannot maintain the > contract to the AM's that a lost NM's containers will be marked finished > within the expiry time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
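The NM-side check in the proposal above could look roughly like this. A sketch with hypothetical names (this is not the actual NMStateStore API); the 10-minute default corresponds to yarn.nm.liveness-monitor.expiry-interval-ms:

```java
// Sketch of the proposed NM-side dead check: on restart, compare the last
// timestamp persisted to the state store against the RM's NM expiry interval.
public class NmExpiryCheck {
    // Default of yarn.nm.liveness-monitor.expiry-interval-ms (10 minutes).
    static final long EXPIRY_INTERVAL_MS = 10 * 60 * 1000L;

    /** True if the RM would already have declared this NM dead. */
    static boolean regardedAsDead(long lastPersistedTimestampMs, long nowMs) {
        return nowMs - lastPersistedTimestampMs > EXPIRY_INTERVAL_MS;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // Short outage: the RM still considers the NM alive, keep containers.
        System.out.println(regardedAsDead(now - 60_000L, now));
        // Down longer than the expiry interval: kill recovered containers.
        System.out.println(regardedAsDead(now - 11 * 60 * 1000L, now));
    }
}
```

In a real implementation the expiry interval would have to be read from the same configuration the RM uses, otherwise the NM and RM could disagree on whether the node had expired.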
[jira] [Commented] (YARN-4330) MiniYARNCluster prints multiple Failed to instantiate default resource calculator warning messages
[ https://issues.apache.org/jira/browse/YARN-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990047#comment-14990047 ] Steve Loughran commented on YARN-4330: -- Looks like YARN-3534 triggered this. Full Stack: note the sheer number of repeated traces {code} Projects/slider/slider-core/target/teststandalonerest/teststandalonerest-logDir-nm-0_0 2015-11-04 17:49:31,322 [Thread-2] INFO server.MiniYARNCluster (MiniYARNCluster.java:serviceInit(540)) - Starting NM: 0 2015-11-04 17:49:31,383 [Thread-2] INFO nodemanager.NodeManager (NodeManager.java:getNodeHealthScriptRunner(255)) - Node Manager health check script is not available or doesn't have execute permission, so not starting the node health script runner. 2015-11-04 17:49:31,469 [Thread-2] WARN util.ResourceCalculatorPlugin (ResourceCalculatorPlugin.java:getResourceCalculatorPlugin(184)) - java.lang.UnsupportedOperationException: Could not determine OS: Failed to instantiate default resource calculator. java.lang.UnsupportedOperationException: Could not determine OS at org.apache.hadoop.util.SysInfo.newInstance(SysInfo.java:43) at org.apache.hadoop.yarn.util.ResourceCalculatorPlugin.(ResourceCalculatorPlugin.java:41) at org.apache.hadoop.yarn.util.ResourceCalculatorPlugin.getResourceCalculatorPlugin(ResourceCalculatorPlugin.java:182) at org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl.serviceInit(NodeResourceMonitorImpl.java:73) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:356) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceInit(MiniYARNCluster.java:541) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at 
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceInit(MiniYARNCluster.java:273) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.Service$init.call(Unknown Source) at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:45) at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:108) at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:120) at org.apache.slider.test.YarnMiniClusterTestBase.createMiniCluster(YarnMiniClusterTestBase.groovy:291) at org.apache.slider.test.YarnZKMiniClusterTestBase.createMiniCluster(YarnZKMiniClusterTestBase.groovy:110) at org.apache.slider.test.YarnZKMiniClusterTestBase.createMiniCluster(YarnZKMiniClusterTestBase.groovy:127) at org.apache.slider.agent.rest.TestStandaloneREST.testStandaloneREST(TestStandaloneREST.groovy:52) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74) 2015-11-04 17:49:31,472 [Thread-2] INFO nodemanager.NodeResourceMonitorImpl 
(NodeResourceMonitorImpl.java:serviceInit(76)) - Using ResourceCalculatorPlugin : null 2015-11-04 17:49:31,475 [Thread-2] INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:serviceInit(261)) - AMRMProxyService is disabled 2015-11-04 17:49:31,475 [Thread-2] INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:validateConf(224)) - per directory file limit = 8192 2015-11-04 17:49:31,549 [Thread-2] WARN util.ResourceCalculatorPlugin (ResourceCalculatorPlugin.java:getResourceCalculatorPlugin(184)) - java.lang.UnsupportedOperationException: Could not determine OS: Failed to instantiate default resource calculator. java.lang.UnsupportedOperationException: Could not determine OS
[jira] [Commented] (YARN-4330) MiniYARNCluster prints multiple Failed to instantiate default resource calculator warning messages
[ https://issues.apache.org/jira/browse/YARN-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990051#comment-14990051 ] Steve Loughran commented on YARN-4330: -- As well as having a way to turn this feature off for miniclusters, the code trying to instantiate the resource calculator should recognise the failure and fall back, rather than retry. Retrying isn't going to fix this. > MiniYARNCluster prints multiple Failed to instantiate default resource > calculator warning messages > --- > > Key: YARN-4330 > URL: https://issues.apache.org/jira/browse/YARN-4330 > Project: Hadoop YARN > Issue Type: Bug > Components: test, yarn >Affects Versions: 2.8.0 > Environment: OSX, JUnit >Reporter: Steve Loughran >Priority: Blocker > > Whenever I try to start a MiniYARNCluster on Branch-2 (commit #0b61cca), I > see multiple stack traces warning me that a resource calculator plugin could > not be created > {code} > (ResourceCalculatorPlugin.java:getResourceCalculatorPlugin(184)) - > java.lang.UnsupportedOperationException: Could not determine OS: Failed to > instantiate default resource calculator. > java.lang.UnsupportedOperationException: Could not determine OS > {code} > This is a minicluster. It doesn't need resource calculation. It certainly > doesn't need test logs being cluttered with even more stack traces which will > only generate false alarms about tests failing. > There needs to be a way to turn this off, and the minicluster should have it > that way by default. > Being ruthless and marking as a blocker, because its a fairly major > regression for anyone testing with the minicluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
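The "recognise the failure and fall back" idea could be sketched as follows (hypothetical names, not the actual ResourceCalculatorPlugin code): attempt the instantiation once, remember a failure, and hand back null thereafter instead of re-running and re-logging the same stack trace for every caller.

```java
// Sketch: cache the first instantiation attempt so an unsupported OS logs
// the warning once instead of once per NodeManager service that asks.
public class CalculatorHolder {
    static int instantiationAttempts = 0;        // exposed so the behavior is observable
    private static boolean attempted = false;
    private static Object plugin = null;         // stands in for ResourceCalculatorPlugin

    static synchronized Object getPluginOrNull() {
        if (!attempted) {
            attempted = true;                    // remember the attempt, success or not
            try {
                plugin = instantiate();
            } catch (UnsupportedOperationException e) {
                // Log the stack trace once here; later callers just get null.
                plugin = null;
            }
        }
        return plugin;
    }

    // Stand-in for the failing default-calculator instantiation on an
    // unrecognised OS.
    private static Object instantiate() {
        instantiationAttempts++;
        throw new UnsupportedOperationException("Could not determine OS");
    }
}
```

Callers already tolerate a null plugin (the log above shows "Using ResourceCalculatorPlugin : null"), so caching the failure changes only how many times the warning is emitted, not the behavior.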
[jira] [Created] (YARN-4331) Killing NodeManager leaves orphaned containers
Joseph created YARN-4331: Summary: Killing NodeManager leaves orphaned containers Key: YARN-4331 URL: https://issues.apache.org/jira/browse/YARN-4331 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, yarn Affects Versions: 2.7.1 Reporter: Joseph Priority: Critical We are seeing a lot of orphaned containers running in our production clusters. I tried to simulate this locally on my machine and can replicate the issue by killing nodemanager. I'm running Yarn 2.7.1 with RM state stored in zookeeper and deploying samza jobs. Steps: 1. Deploy a job 2. Issue a kill -9 signal to nodemanager 3. We should see the AM and its container running without nodemanager 4. AM should die but the container still keeps running 5. Restarting nodemanager brings up new AM and container but leaves the orphaned container running in the background This is effectively causing double processing of data. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4330) MiniYARNCluster prints multiple Failed to instantiate default resource calculator warning messages
Steve Loughran created YARN-4330: Summary: MiniYARNCluster prints multiple Failed to instantiate default resource calculator warning messages Key: YARN-4330 URL: https://issues.apache.org/jira/browse/YARN-4330 Project: Hadoop YARN Issue Type: Bug Components: test, yarn Affects Versions: 2.8.0 Environment: OSX, JUnit Reporter: Steve Loughran Priority: Blocker Whenever I try to start a MiniYARNCluster on Branch-2 (commit #0b61cca), I see multiple stack traces warning me that a resource calculator plugin could not be created {code} (ResourceCalculatorPlugin.java:getResourceCalculatorPlugin(184)) - java.lang.UnsupportedOperationException: Could not determine OS: Failed to instantiate default resource calculator. java.lang.UnsupportedOperationException: Could not determine OS {code} This is a minicluster. It doesn't need resource calculation. It certainly doesn't need test logs being cluttered with even more stack traces which will only generate false alarms about tests failing. There needs to be a way to turn this off, and the minicluster should have it that way by default. Being ruthless and marking as a blocker, because its a fairly major regression for anyone testing with the minicluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-3223: - Attachment: YARN-3223-v2.patch Updated patch based on feedback. Checkstyle errors about the CapacityScheduler.java file length are still there. > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, > YARN-3223-v2.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4331) Killing NodeManager leaves orphaned containers
[ https://issues.apache.org/jira/browse/YARN-4331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph updated YARN-4331: - Description: We are seeing a lot of orphaned containers running in our production clusters. I tried to simulate this locally on my machine and can replicate the issue by killing nodemanager. I'm running Yarn 2.7.1 with RM state stored in zookeeper and deploying samza jobs. Steps: {quote}1. Deploy a job 2. Issue a kill -9 signal to nodemanager 3. We should see the AM and its container running without nodemanager 4. AM should die but the container still keeps running 5. Restarting nodemanager brings up new AM and container but leaves the orphaned container running in the background {quote} This is effectively causing double processing of data. was: We are seeing a lot of orphaned containers running in our production clusters. I tried to simulate this locally on my machine and can replicate the issue by killing nodemanager. I'm running Yarn 2.7.1 with RM state stored in zookeeper and deploying samza jobs. Steps: 1. Deploy a job 2. Issue a kill -9 signal to nodemanager 3. We should see the AM and its container running without nodemanager 4. AM should die but the container still keeps running 5. Restarting nodemanager brings up new AM and container but leaves the orphaned container running in the background This is effectively causing double processing of data. > Killing NodeManager leaves orphaned containers > -- > > Key: YARN-4331 > URL: https://issues.apache.org/jira/browse/YARN-4331 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, yarn >Affects Versions: 2.7.1 >Reporter: Joseph >Priority: Critical > > We are seeing a lot of orphaned containers running in our production clusters. > I tried to simulate this locally on my machine and can replicate the issue by > killing nodemanager. > I'm running Yarn 2.7.1 with RM state stored in zookeeper and deploying samza > jobs. > Steps: > {quote}1. Deploy a job > 2. 
Issue a kill -9 signal to nodemanager > 3. We should see the AM and its container running without nodemanager > 4. AM should die but the container still keeps running > 5. Restarting nodemanager brings up new AM and container but leaves the > orphaned container running in the background > {quote} > This is effectively causing double processing of data. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3432) Cluster metrics have wrong Total Memory when there is reserved memory on CS
[ https://issues.apache.org/jira/browse/YARN-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989338#comment-14989338 ] Hadoop QA commented on YARN-3432: - (x) -1 overall
|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 10s | docker + precommit patch detected. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| +1 | mvninstall | 3m 15s | trunk passed |
| +1 | compile | 0m 23s | trunk passed with JDK v1.8.0_60 |
| +1 | compile | 0m 24s | trunk passed with JDK v1.7.0_79 |
| +1 | checkstyle | 0m 11s | trunk passed |
| +1 | mvneclipse | 0m 15s | trunk passed |
| +1 | findbugs | 1m 8s | trunk passed |
| +1 | javadoc | 0m 22s | trunk passed with JDK v1.8.0_60 |
| +1 | javadoc | 0m 26s | trunk passed with JDK v1.7.0_79 |
| +1 | mvninstall | 0m 27s | the patch passed |
| +1 | compile | 0m 20s | the patch passed with JDK v1.8.0_60 |
| +1 | javac | 0m 20s | the patch passed |
| +1 | compile | 0m 25s | the patch passed with JDK v1.7.0_79 |
| +1 | javac | 0m 25s | the patch passed |
| +1 | checkstyle | 0m 11s | the patch passed |
| +1 | mvneclipse | 0m 15s | the patch passed |
| +1 | whitespace | 0m 0s | Patch has no whitespace issues. |
| +1 | findbugs | 1m 18s | the patch passed |
| +1 | javadoc | 0m 20s | the patch passed with JDK v1.8.0_60 |
| +1 | javadoc | 0m 26s | the patch passed with JDK v1.7.0_79 |
| -1 | unit | 58m 1s | hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_60. |
| -1 | unit | 59m 1s | hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_79. |
| +1 | asflicense | 0m 22s | Patch does not generate ASF License warnings. |
| | | 128m 42s | |
|| Reason || Tests ||
| JDK v1.8.0_60 Failed junit tests | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
| JDK v1.7.0_79 Failed junit tests | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|| Subsystem || Report/Notes ||
| Docker | Client=1.7.1 Server=1.7.1 Image:test-patch-base-hadoop-date2015-11-04 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12737159/YARN-3432-002.patch |
| JIRA Issue | YARN-3432 |
| Optional Tests | asflicense javac javadoc mvninstall unit findbugs checkstyle compile |
| uname | Linux 75c368b9f110 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3