[jira] [Updated] (YARN-3999) RM hangs on draining events
[ https://issues.apache.org/jira/browse/YARN-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-3999:
------------------------------------------
    Attachment: YARN-3999-branch-2.6.1.txt

Attaching the patch that I committed to 2.6.1.

> RM hangs on draining events
> ---------------------------
>
>                 Key: YARN-3999
>                 URL: https://issues.apache.org/jira/browse/YARN-3999
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jian He
>            Assignee: Jian He
>              Labels: 2.6.1-candidate
>             Fix For: 2.6.1, 2.7.2
>
>         Attachments: YARN-3999-branch-2.6.1.txt, YARN-3999-branch-2.7.patch, YARN-3999.1.patch, YARN-3999.2.patch, YARN-3999.2.patch, YARN-3999.3.patch, YARN-3999.4.patch, YARN-3999.5.patch, YARN-3999.patch, YARN-3999.patch
>
> If external systems like ATS or ZK become very slow, draining all the events takes a long time. If this time exceeds 10 minutes, all applications will expire. Fixes include:
> 1. Add a timeout and stop the dispatcher even if not all events are drained.
> 2. Move the ATS service out of the RM active services so that the RM doesn't need to wait for ATS to flush events when transitioning to standby.
> 3. Stop client-facing services (ClientRMService etc.) first so that clients get fast notification that the RM is stopping/transitioning.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
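Fix (1) above, draining with a deadline instead of blocking forever, can be sketched roughly as follows. This is a minimal illustration, not the committed patch; the class and method names are invented for the example.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch: a dispatcher whose stop() drains remaining events but gives up
// after a deadline, so a slow downstream system (ATS, ZK) cannot hang the
// RM indefinitely. All names here are illustrative.
class BoundedDrainDispatcher {
  private final BlockingQueue<Runnable> eventQueue = new LinkedBlockingQueue<>();
  private volatile boolean stopped = false;

  void dispatch(Runnable event) {
    if (!stopped) {
      eventQueue.offer(event);
    }
  }

  /** Drain queued events until empty or the timeout elapses; returns count run. */
  int stop(long timeoutMs) {
    stopped = true;
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    int processed = 0;
    Runnable event;
    while (System.nanoTime() < deadline && (event = eventQueue.poll()) != null) {
      event.run();   // may block on a slow ATS/ZK write
      processed++;
    }
    return processed; // events still queued at the deadline are abandoned
  }
}
```

The key trade-off is the one the JIRA describes: abandoning undrained events is acceptable because the alternative, waiting past the 10-minute liveness window, expires every application.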
[jira] [Updated] (YARN-4047) ClientRMService getApplications has high scheduler lock contention
[ https://issues.apache.org/jira/browse/YARN-4047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-4047:
------------------------------------------
    Fix Version/s: 2.6.1

Pulled this into 2.6.1. The patch applies cleanly. Ran compilation before the push.

> ClientRMService getApplications has high scheduler lock contention
> -------------------------------------------------------------------
>
>                 Key: YARN-4047
>                 URL: https://issues.apache.org/jira/browse/YARN-4047
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>              Labels: 2.6.1-candidate
>             Fix For: 2.6.1, 2.7.2
>
>         Attachments: YARN-4047.001.patch
>
> The getApplications call can be particularly expensive because the code can call checkAccess on every application being tracked by the RM. checkAccess will often call scheduler.checkAccess, which grabs the big scheduler lock. This can cause a lot of contention with the scheduler thread, which is busy trying to process node heartbeats, app allocation requests, etc.
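One common way to cut this kind of contention, sketched below, is to resolve each distinct queue's ACL answer only once per getApplications call instead of once per application. This is illustrative only; the interface and method names (`AclChecker`, `canView`) are invented, and the actual committed patch may take a different approach.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: memoize per-queue ACL decisions for the duration of one
// getApplications call, so the expensive scheduler-locked check runs
// once per queue rather than once per application.
class CachedAclChecker {
  interface AclChecker {
    boolean canView(String user, String queue); // expensive: takes the scheduler lock
  }

  static boolean checkAccess(String user, String queue,
      AclChecker scheduler, Map<String, Boolean> perCallCache) {
    // computeIfAbsent consults the scheduler only on the first miss per queue
    return perCallCache.computeIfAbsent(queue, q -> scheduler.canView(user, q));
  }
}
```

With thousands of apps spread over a handful of queues, this reduces scheduler-lock acquisitions from O(apps) to O(queues) per listing request.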
[jira] [Updated] (YARN-3999) RM hangs on draining events
[ https://issues.apache.org/jira/browse/YARN-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-3999:
------------------------------------------
    Fix Version/s: 2.6.1

Pulled this into 2.6.1. Had to fix a couple of minor merge conflicts. Dropped changes to TestAsyncDispatcher.java and TestRMAppLogAggregationStatus.java, which don't exist in 2.6.1. Ran compilation and TestAppManager, TestResourceManager, TestRMAppTransitions, TestRMAppAttemptTransitions, TestUtils, TestFifoScheduler before the push.

> RM hangs on draining events
> ---------------------------
>
>                 Key: YARN-3999
>                 URL: https://issues.apache.org/jira/browse/YARN-3999
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jian He
>            Assignee: Jian He
>              Labels: 2.6.1-candidate
>             Fix For: 2.6.1, 2.7.2
>
>         Attachments: YARN-3999-branch-2.7.patch, YARN-3999.1.patch, YARN-3999.2.patch, YARN-3999.2.patch, YARN-3999.3.patch, YARN-3999.4.patch, YARN-3999.5.patch, YARN-3999.patch, YARN-3999.patch
>
> If external systems like ATS or ZK become very slow, draining all the events takes a long time. If this time exceeds 10 minutes, all applications will expire. Fixes include:
> 1. Add a timeout and stop the dispatcher even if not all events are drained.
> 2. Move the ATS service out of the RM active services so that the RM doesn't need to wait for ATS to flush events when transitioning to standby.
> 3. Stop client-facing services (ClientRMService etc.) first so that clients get fast notification that the RM is stopping/transitioning.
[jira] [Updated] (YARN-3978) Configurably turn off the saving of container info in Generic AHS
[ https://issues.apache.org/jira/browse/YARN-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-3978:
------------------------------------------
    Fix Version/s: 2.6.1

Pulled this into 2.6.1. The patch applies cleanly for the most part, except for a couple of minor merge conflicts in test-cases which I fixed. Ran compilation and TestClientRMService, TestRMContainerImpl, TestChildQueueOrder, TestLeafQueue, TestReservations, TestFifoScheduler before the push.

> Configurably turn off the saving of container info in Generic AHS
> ------------------------------------------------------------------
>
>                 Key: YARN-3978
>                 URL: https://issues.apache.org/jira/browse/YARN-3978
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: timelineserver, yarn
>    Affects Versions: 2.8.0, 2.7.1
>            Reporter: Eric Payne
>            Assignee: Eric Payne
>              Labels: 2.6.1-candidate
>             Fix For: 3.0.0, 2.6.1, 2.8.0, 2.7.2
>
>         Attachments: YARN-3978.001.patch, YARN-3978.002.patch, YARN-3978.003.patch, YARN-3978.004.patch
>
> Depending on how each application's metadata is stored, one week's worth of data stored in the Generic Application History Server's database can grow to be almost a terabyte of local disk space. In order to alleviate this, I suggest that there is a need for a configuration option to turn off saving of non-AM container metadata in the GAHS data store.
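In yarn-site.xml, the resulting switch would look like the fragment below. The property name shown is my reading of what this patch introduces; treat it as an assumption and verify it against the yarn-default.xml of your release before relying on it.

```xml
<property>
  <!-- Assumed property name from YARN-3978: when false, the Generic AHS
       stops persisting metadata for non-AM containers, which is the bulk
       of the per-application data volume described above. -->
  <name>yarn.timeline-service.generic-application-history.save-non-am-container-meta-info</name>
  <value>false</value>
</property>
```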
[jira] [Updated] (YARN-4022) queue not remove from webpage(/cluster/scheduler) when delete queue in xxx-scheduler.xml
[ https://issues.apache.org/jira/browse/YARN-4022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

forrestchen updated YARN-4022:
------------------------------
    Attachment: YARN-4022.003.patch

Fix checkstyle

> queue not remove from webpage(/cluster/scheduler) when delete queue in xxx-scheduler.xml
> -----------------------------------------------------------------------------------------
>
>                 Key: YARN-4022
>                 URL: https://issues.apache.org/jira/browse/YARN-4022
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.1
>            Reporter: forrestchen
>              Labels: scheduler
>         Attachments: YARN-4022.001.patch, YARN-4022.002.patch, YARN-4022.003.patch
>
> When I delete an existing queue by modifying the xxx-scheduler.xml, I can still see the queue's information block in the webpage (/cluster/scheduler), though the 'Min Resources' items all become zero and there is no 'Max Running Applications' item.
> I can still submit an application to the deleted queue, and the application will run in the 'root.default' queue instead, but submitting to a non-existent queue causes an exception.
> My expectation is that the deleted queue is not displayed in the webpage, and submitting an application to the deleted queue acts just as if the queue doesn't exist.
> PS: There's no application running in the queue I deleted.
> Some related config in yarn-site.xml:
> {code}
> <property>
>   <name>yarn.scheduler.fair.user-as-default-queue</name>
>   <value>false</value>
> </property>
> <property>
>   <name>yarn.scheduler.fair.allow-undeclared-pools</name>
>   <value>false</value>
> </property>
> {code}
> A related question is here: http://stackoverflow.com/questions/26488564/hadoop-yarn-why-the-queue-cannot-be-deleted-after-i-revise-my-fair-scheduler-xm
[jira] [Updated] (YARN-2301) Improve yarn container command
[ https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-2301:
------------------------------------------
    Attachment: YARN-2301-branch-2.6.1.txt

Attaching the patch that I committed to 2.6.1.

> Improve yarn container command
> ------------------------------
>
>                 Key: YARN-2301
>                 URL: https://issues.apache.org/jira/browse/YARN-2301
>             Project: Hadoop YARN
>          Issue Type: Improvement
>    Affects Versions: 2.6.0
>            Reporter: Jian He
>            Assignee: Naganarasimha G R
>              Labels: 2.6.1-candidate, usability
>             Fix For: 2.7.0, 2.6.1
>
>         Attachments: YARN-2301-branch-2.6.1.txt, YARN-2301.01.patch, YARN-2301.03.patch, YARN-2301.20141120-1.patch, YARN-2301.20141203-1.patch, YARN-2301.20141204-1.patch, YARN-2303.patch
>
> While running the yarn container -list <Application Attempt Id> command, some observations:
> 1) The scheme (e.g. http/https) before the LOG-URL is missing.
> 2) The start-time is printed as milliseconds (e.g. 1405540544844). Better to print it in a time format.
> 3) finish-time is 0 if the container is not yet finished. Maybe "N/A" instead.
> 4) May have an option to run as yarn container -list <Application Attempt Id> OR yarn application <Application Id> -list-containers also.
> As the attempt Id is not shown on the console, this makes it easier for the user to just copy the appId and run it; it may also be useful for container-preserving AM restart.
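Observations (2) and (3) above amount to a small formatting change, sketched below. The class name is invented for illustration; the actual patch lives in the YARN CLI code.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// Sketch of observations (2) and (3): render epoch millis such as
// 1405540544844 as a readable timestamp, and print "N/A" for an
// unfinished container (finish-time 0) instead of the raw zero.
class ContainerTimeFormatter {
  private static final SimpleDateFormat FMT =
      new SimpleDateFormat("EEE MMM dd HH:mm:ss Z yyyy");

  static String format(long epochMillis) {
    if (epochMillis == 0) {
      return "N/A"; // container has not finished yet
    }
    return FMT.format(new Date(epochMillis));
  }
}
```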
[jira] [Updated] (YARN-2301) Improve yarn container command
[ https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-2301:
------------------------------------------
           Labels: 2.6.1-candidate usability  (was: usability)
    Fix Version/s: 2.6.1

Pulled this into 2.6.1 as a dependency for YARN-3978. The patch applied cleanly; had to make a minor change to TestYarnCLI to make it work correctly on 2.6.1. Ran compilation and TestYarnCLI, TestClientRMService, TestRMContainerImpl before the push.

> Improve yarn container command
> ------------------------------
>
>                 Key: YARN-2301
>                 URL: https://issues.apache.org/jira/browse/YARN-2301
>             Project: Hadoop YARN
>          Issue Type: Improvement
>    Affects Versions: 2.6.0
>            Reporter: Jian He
>            Assignee: Naganarasimha G R
>              Labels: 2.6.1-candidate, usability
>             Fix For: 2.7.0, 2.6.1
>
>         Attachments: YARN-2301.01.patch, YARN-2301.03.patch, YARN-2301.20141120-1.patch, YARN-2301.20141203-1.patch, YARN-2301.20141204-1.patch, YARN-2303.patch
>
> While running the yarn container -list <Application Attempt Id> command, some observations:
> 1) The scheme (e.g. http/https) before the LOG-URL is missing.
> 2) The start-time is printed as milliseconds (e.g. 1405540544844). Better to print it in a time format.
> 3) finish-time is 0 if the container is not yet finished. Maybe "N/A" instead.
> 4) May have an option to run as yarn container -list <Application Attempt Id> OR yarn application <Application Id> -list-containers also.
> As the attempt Id is not shown on the console, this makes it easier for the user to just copy the appId and run it; it may also be useful for container-preserving AM restart.
[jira] [Commented] (YARN-4126) RM should not issue delegation tokens in unsecure mode
[ https://issues.apache.org/jira/browse/YARN-4126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736193#comment-14736193 ]

Hadoop QA commented on YARN-4126:
---------------------------------

(x) -1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | pre-patch | 18m 59s | Pre-patch trunk compilation is healthy. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | tests included | 0m 0s | The patch appears to include 3 new or modified test files. |
| +1 | javac | 8m 6s | There were no new javac warning messages. |
| +1 | javadoc | 10m 12s | There were no new javadoc warning messages. |
| +1 | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. |
| +1 | checkstyle | 1m 50s | There were no new checkstyle issues. |
| +1 | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| +1 | install | 1m 33s | mvn install still works. |
| +1 | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| +1 | findbugs | 3m 26s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| -1 | common tests | 23m 18s | Tests failed in hadoop-common. |
| -1 | yarn tests | 57m 0s | Tests failed in hadoop-yarn-server-resourcemanager. |
| | | 125m 24s | |

|| Reason || Tests ||
| Failed unit tests | hadoop.security.token.delegation.web.TestWebDelegationToken |
| | hadoop.yarn.server.resourcemanager.TestClientRMService |
| | hadoop.yarn.server.resourcemanager.rmcontainer.TestRMContainerImpl |
| | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps |
| | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens |
| | hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart |
| | hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart |

|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12754796/0004-YARN-4126.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / a153b96 |
| hadoop-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/9050/artifact/patchprocess/testrun_hadoop-common.txt |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9050/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9050/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9050/console |

This message was automatically generated.

> RM should not issue delegation tokens in unsecure mode
> ------------------------------------------------------
>
>                 Key: YARN-4126
>                 URL: https://issues.apache.org/jira/browse/YARN-4126
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jian He
>            Assignee: Bibin A Chundatt
>         Attachments: 0001-YARN-4126.patch, 0002-YARN-4126.patch, 0003-YARN-4126.patch, 0004-YARN-4126.patch
>
> ClientRMService#getDelegationToken currently returns a delegation token in insecure mode. We should not return the token if it's in insecure mode.
[jira] [Commented] (YARN-4133) Containers to be preempted leaks in FairScheduler preemption logic.
[ https://issues.apache.org/jira/browse/YARN-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736177#comment-14736177 ]

Hadoop QA commented on YARN-4133:
---------------------------------

(x) -1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | pre-patch | 16m 57s | Pre-patch trunk compilation is healthy. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| +1 | javac | 8m 7s | There were no new javac warning messages. |
| +1 | javadoc | 10m 20s | There were no new javadoc warning messages. |
| +1 | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| +1 | checkstyle | 0m 54s | There were no new checkstyle issues. |
| +1 | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| +1 | install | 1m 30s | mvn install still works. |
| +1 | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. |
| +1 | findbugs | 1m 32s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| +1 | yarn tests | 54m 34s | Tests passed in hadoop-yarn-server-resourcemanager. |
| | | 94m 53s | |

|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12754810/YARN-4133.000.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / a153b96 |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9052/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9052/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9052/console |

This message was automatically generated.

> Containers to be preempted leak in FairScheduler preemption logic.
> -------------------------------------------------------------------
>
>                 Key: YARN-4133
>                 URL: https://issues.apache.org/jira/browse/YARN-4133
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.7.1
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>         Attachments: YARN-4133.000.patch
>
> Containers to be preempted leak in the FairScheduler preemption logic. It may cause missed preemptions because containers in {{warnedContainers}} are wrongly removed. The problem is in {{preemptResources}}:
> There are two issues which can cause containers to be wrongly removed from {{warnedContainers}}:
> First, the container state {{RMContainerState.ACQUIRED}} is missing from the condition check:
> {code}
> (container.getState() == RMContainerState.RUNNING ||
>     container.getState() == RMContainerState.ALLOCATED)
> {code}
> Second, if {{isResourceGreaterThanNone(toPreempt)}} returns false, we shouldn't remove the container from {{warnedContainers}}. We should only remove a container from {{warnedContainers}} if the container is not in state {{RMContainerState.RUNNING}}, {{RMContainerState.ALLOCATED}} or {{RMContainerState.ACQUIRED}}.
> {code}
> if ((container.getState() == RMContainerState.RUNNING ||
>     container.getState() == RMContainerState.ALLOCATED) &&
>     isResourceGreaterThanNone(toPreempt)) {
>   warnOrKillContainer(container);
>   Resources.subtractFrom(toPreempt,
>       container.getContainer().getResource());
> } else {
>   warnedIter.remove();
> }
> {code}
> Also, once the containers in {{warnedContainers}} are wrongly removed, they will never be preempted, because these containers are already in {{FSAppAttempt#preemptionMap}} and {{FSAppAttempt#preemptContainer}} won't return the containers in {{FSAppAttempt#preemptionMap}}.
> {code}
> public RMContainer preemptContainer() {
>   if (LOG.isDebugEnabled()) {
>     LOG.debug("App " + getName() + " is going to preempt a running " +
>         "container");
>   }
>   RMContainer toBePreempted = null;
>   for (RMContainer container : getLiveContainers()) {
>     if (!getPreemptionContainers().contains(container) &&
>         (toBePreempted == null ||
>             comparator.compare(toBePreempted, container) > 0)) {
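The two corrections described in YARN-4133 can be condensed into the predicates below. This is a compact sketch with illustrative types, not the actual patch to FairScheduler#preemptResources: a container stays in warnedContainers while it is RUNNING, ALLOCATED, or ACQUIRED, and exhausting toPreempt is never a reason to drop it.

```java
// Container states, mirroring the subset of RMContainerState relevant here.
enum RMContainerState { NEW, ALLOCATED, ACQUIRED, RUNNING, COMPLETED }

class PreemptionCheck {
  /** True if the container is still live enough to be preempted (ACQUIRED was missing). */
  static boolean isPreemptable(RMContainerState state) {
    return state == RMContainerState.RUNNING
        || state == RMContainerState.ALLOCATED
        || state == RMContainerState.ACQUIRED;
  }

  /**
   * Remove from warnedContainers only when the container is no longer
   * preemptable -- never merely because toPreempt has reached zero.
   */
  static boolean shouldRemoveFromWarned(RMContainerState state) {
    return !isPreemptable(state);
  }

  /** Warn/kill only while there is still resource left to preempt. */
  static boolean shouldWarnOrKill(RMContainerState state, boolean resourceLeft) {
    return isPreemptable(state) && resourceLeft;
  }
}
```

The bug the JIRA describes is exactly the coupling of these two decisions: the original else-branch removed a container whenever the warn/kill condition failed, so an ACQUIRED container, or any container seen after toPreempt hit zero, leaked out of warnedContainers permanently.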
[jira] [Updated] (YARN-4074) [timeline reader] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sangjin Lee updated YARN-4074:
------------------------------
    Attachment: YARN-4074-YARN-2928.POC.004.patch

The v.4 POC patch posted.
- added the XmlElement notation for flow runs in the flow activity entity
- rebased against the v.5 patch for YARN-3901
- added more unit tests
- made sure the ids are set correctly on flow run entities and flow activity entities

> [timeline reader] implement support for querying for flows and flow runs
> -------------------------------------------------------------------------
>
>                 Key: YARN-4074
>                 URL: https://issues.apache.org/jira/browse/YARN-4074
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>    Affects Versions: YARN-2928
>            Reporter: Sangjin Lee
>            Assignee: Sangjin Lee
>         Attachments: YARN-4074-YARN-2928.POC.001.patch, YARN-4074-YARN-2928.POC.002.patch, YARN-4074-YARN-2928.POC.003.patch, YARN-4074-YARN-2928.POC.004.patch
>
> Implement support for querying for flows and flow runs.
> We should be able to query for the most recent N flows, etc.
> This includes changes to the {{TimelineReader}} API if necessary, as well as implementation of the API.
[jira] [Commented] (YARN-4106) NodeLabels for NM in distributed mode is not updated even after clusterNodelabel addition in RM
[ https://issues.apache.org/jira/browse/YARN-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736158#comment-14736158 ]

Bibin A Chundatt commented on YARN-4106:
----------------------------------------

The Findbugs report isn't showing anything; looks like a build-report problem.

> NodeLabels for NM in distributed mode is not updated even after clusterNodelabel addition in RM
> ------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4106
>                 URL: https://issues.apache.org/jira/browse/YARN-4106
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>         Attachments: 0001-YARN-4106.patch, 0002-YARN-4106.patch, 0003-YARN-4106.patch, 0004-YARN-4106.patch, 0005-YARN-4106.patch, 0006-YARN-4106.patch
>
> NodeLabels for NM in distributed mode are not updated even after clusterNodelabel addition in the RM.
> Steps to reproduce
> ==================
> # Configure nodelabels in distributed mode:
>   yarn.node-labels.configuration-type=distributed
>   provider = config
>   yarn.nodemanager.node-labels.provider.fetch-interval-ms=12ms
> # Start the RM and the NM
> # Once NM registration is done, add nodelabels in the RM
> Nodelabels are not getting updated on the RM side.
> *This jira also handles the below issue:*
> The timer task for label updates is not getting triggered in the NodeManager for distributed scheduling. The task is supposed to trigger every {{yarn.nodemanager.node-labels.provider.fetch-interval-ms}}.
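The periodic refresh that the timer task is supposed to perform can be sketched with a ScheduledExecutorService. The provider and client interfaces below are invented for the example; the actual NM code drives this from its node-labels provider, not this class.

```java
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.CountDownLatch;
import java.util.Collections;

// Sketch: re-fetch node labels from the configured provider every
// fetch-interval-ms and push them to the RM. If this task never fires
// (the bug described above), the RM never learns the NM's labels.
class LabelRefresher {
  interface LabelProvider { Set<String> fetchLabels(); }
  interface RmClient { void sendLabels(Set<String> labels); }

  private final ScheduledExecutorService timer =
      Executors.newSingleThreadScheduledExecutor();

  void start(LabelProvider provider, RmClient rm, long intervalMs) {
    // fixed-rate schedule == the fetch-interval-ms contract in the config
    timer.scheduleAtFixedRate(
        () -> rm.sendLabels(provider.fetchLabels()),
        0, intervalMs, TimeUnit.MILLISECONDS);
  }

  void stop() { timer.shutdownNow(); }
}
```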
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736140#comment-14736140 ]

Sangjin Lee commented on YARN-3901:
-----------------------------------

Somehow the jenkins info didn't make it to the JIRA: https://builds.apache.org/job/PreCommit-YARN-Build/9044/

-1 overall

|| Vote || Subsystem || Runtime || Comment ||
| -1 | pre-patch | 15m 55s | Findbugs (version ) appears to be broken on YARN-2928. |
| +1 | @author | 0m 1s | The patch does not contain any @author tags. |
| +1 | tests included | 0m 0s | The patch appears to include 2 new or modified test files. |
| +1 | javac | 8m 6s | There were no new javac warning messages. |
| +1 | javadoc | 10m 12s | There were no new javadoc warning messages. |
| +1 | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. |
| +1 | checkstyle | 0m 16s | There were no new checkstyle issues. |
| -1 | whitespace | 0m 31s | The patch has 7 line(s) that end in whitespace. Use git apply --whitespace=fix. |
| +1 | install | 1m 34s | mvn install still works. |
| +1 | eclipse:eclipse | 0m 40s | The patch built with eclipse:eclipse. |
| -1 | findbugs | 0m 56s | The patch appears to introduce 7 new Findbugs (version 3.0.0) warnings. |
| +1 | yarn tests | 1m 54s | Tests passed in hadoop-yarn-server-timelineservice. |
| | | 40m 33s | |

|| Reason || Tests ||
| FindBugs | module:hadoop-yarn-server-timelineservice |

|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12754731/YARN-3901-YARN-2928.5.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | YARN-2928 / e6afe26 |
| whitespace | /home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/patchprocess/whitespace.txt |
| Findbugs warnings | /home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-timelineservice.html |
| hadoop-yarn-server-timelineservice test log | /home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9044/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |

> Populate flow run data in the flow_run & flow activity tables
> --------------------------------------------------------------
>
>                 Key: YARN-3901
>                 URL: https://issues.apache.org/jira/browse/YARN-3901
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Vrushali C
>            Assignee: Vrushali C
>         Attachments: YARN-3901-YARN-2928.1.patch, YARN-3901-YARN-2928.2.patch, YARN-3901-YARN-2928.3.patch, YARN-3901-YARN-2928.4.patch, YARN-3901-YARN-2928.5.patch
>
> As per the schema proposed in YARN-3815 in https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf filing jira to track creation and population of data in the flow run table.
> Some points that are being considered:
> - Stores per-flow-run information aggregated across applications, flow version. RM's collector writes to it on app creation and app completion.
> - Per-app collector writes to it for metric updates at a slower frequency than the metric updates to the application table.
>   primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for the applicationId. A coprocessor will return the min value of all written values.
> - Upon flush and compactions, the min value between all the cells of this column will be written to the cell without any tag (empty tag) and all th
[jira] [Commented] (YARN-4131) Add API and CLI to kill container on given containerId
[ https://issues.apache.org/jira/browse/YARN-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736141#comment-14736141 ]

Hadoop QA commented on YARN-4131:
---------------------------------

(x) -1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | pre-patch | 21m 58s | Pre-patch trunk compilation is healthy. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | tests included | 0m 0s | The patch appears to include 4 new or modified test files. |
| +1 | javac | 7m 44s | There were no new javac warning messages. |
| +1 | javadoc | 9m 56s | There were no new javadoc warning messages. |
| +1 | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. |
| -1 | checkstyle | 2m 54s | The applied patch generated 4 new checkstyle issues (total was 32, now 36). |
| -1 | whitespace | 0m 13s | The patch has 4 line(s) that end in whitespace. Use git apply --whitespace=fix. |
| +1 | install | 1m 32s | mvn install still works. |
| +1 | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| +1 | findbugs | 7m 28s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| -1 | mapreduce tests | 118m 56s | Tests failed in hadoop-mapreduce-client-jobclient. |
| +1 | yarn tests | 0m 29s | Tests passed in hadoop-yarn-api. |
| -1 | yarn tests | 6m 55s | Tests failed in hadoop-yarn-client. |
| +1 | yarn tests | 2m 4s | Tests passed in hadoop-yarn-common. |
| +1 | yarn tests | 7m 36s | Tests passed in hadoop-yarn-server-nodemanager. |
| +1 | yarn tests | 54m 15s | Tests passed in hadoop-yarn-server-resourcemanager. |
| | | 243m 43s | |

|| Reason || Tests ||
| Failed unit tests | hadoop.mapred.TestMRIntermediateDataEncryption |
| | hadoop.yarn.client.api.impl.TestYarnClient |
| | hadoop.yarn.client.cli.TestYarnCLI |
| Timed out tests | org.apache.hadoop.mapreduce.lib.jobcontrol.TestMapReduceJobControl |

|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12754772/YARN-4131-v1.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / d9c1fab |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/9047/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt |
| whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/9047/artifact/patchprocess/whitespace.txt |
| hadoop-mapreduce-client-jobclient test log | https://builds.apache.org/job/PreCommit-YARN-Build/9047/artifact/patchprocess/testrun_hadoop-mapreduce-client-jobclient.txt |
| hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/9047/artifact/patchprocess/testrun_hadoop-yarn-api.txt |
| hadoop-yarn-client test log | https://builds.apache.org/job/PreCommit-YARN-Build/9047/artifact/patchprocess/testrun_hadoop-yarn-client.txt |
| hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/9047/artifact/patchprocess/testrun_hadoop-yarn-common.txt |
| hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9047/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9047/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9047/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9047/console |

This message was automatically generated.

> Add API and CLI to kill container on given containerId
> ------------------------------------------------------
>
>                 Key: YARN-4131
>                 URL: https://issues.apache.org/jira/browse/YARN-4131
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: applications, client
>            Reporter: Junping Du
>            Assignee: Junping Du
>         Attachments: YARN-4131-demo-2.patch, YARN-4131-demo.patch, YARN-4131-v1.patch
>
> Per YARN-3337, we need a handy tool to kill a container in some scenarios.
[jira] [Commented] (YARN-4120) FSAppAttempt.getResourceUsage() should not take preemptedResource into account
[ https://issues.apache.org/jira/browse/YARN-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736130#comment-14736130 ]

Xianyin Xin commented on YARN-4120:
-----------------------------------

Created YARN-4134 to track it.

> FSAppAttempt.getResourceUsage() should not take preemptedResource into account
> -------------------------------------------------------------------------------
>
>                 Key: YARN-4120
>                 URL: https://issues.apache.org/jira/browse/YARN-4120
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>            Reporter: Xianyin Xin
>
> When computing resource usage for Schedulables, the following code is involved, {{FSAppAttempt.getResourceUsage}}:
> {code}
> public Resource getResourceUsage() {
>   return Resources.subtract(getCurrentConsumption(), getPreemptedResources());
> }
> {code}
> and this value is aggregated into FSLeafQueues and FSParentQueues. In my opinion, taking {{preemptedResource}} into account here is not reasonable, for two main reasons:
> # It is something in the future, i.e., even though these resources are marked as preempted, they are currently used by the app, and they will be subtracted from {{currentConsumption}} once the preemption finishes. It's not reasonable to account for them ahead of time.
> # There's another problem here; consider the following case:
> {code}
> root
>    /      \
> queue1   queue2
>   /   \
> queue1.3, queue1.4
> {code}
> Suppose queue1.3 needs resources and it can preempt resources from queue1.4; the preemption happens in the interior of queue1. But when computing the resource usage of queue1, {{queue1.resourceUsage = its_current_resource_usage - preemption}} according to the current code, which is unfair to queue2 when allocating resources.
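The unfairness in reason (2) is easy to see with plain numbers standing in for Resource objects. The sketch below is illustrative only (it is not the FSAppAttempt code): queue1 and queue2 each hold 4 containers, and queue1 has 2 of them marked for preemption internally.

```java
// Sketch of the argument above: resources marked for preemption are still
// occupied until preemption completes, so subtracting them understates
// what the queue actually holds.
class QueueUsageDemo {
  // current behavior: getResourceUsage() subtracts getPreemptedResources()
  static long usageWithSubtraction(long currentConsumption, long markedPreempted) {
    return currentConsumption - markedPreempted;
  }

  // behavior argued for in the JIRA: report actual current consumption
  static long usageWithoutSubtraction(long currentConsumption, long markedPreempted) {
    return currentConsumption;
  }
}
```

With the subtraction, queue1 reports 2 while queue2 reports 4, so the scheduler favors queue1 even though both queues still physically occupy 4 containers.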
[jira] [Created] (YARN-4134) FairScheduler preemption stops at queue level that all child queues are not over their fairshare
Xianyin Xin created YARN-4134: - Summary: FairScheduler preemption stops at queue level that all child queues are not over their fairshare Key: YARN-4134 URL: https://issues.apache.org/jira/browse/YARN-4134 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Xianyin Xin Now FairScheduler uses a choose-a-candidate method to select a container to be preempted from the leaf queues; in {{FSParentQueue.preemptContainer()}}, {code} readLock.lock(); try { for (FSQueue queue : childQueues) { if (candidateQueue == null || comparator.compare(queue, candidateQueue) > 0) { candidateQueue = queue; } } } finally { readLock.unlock(); } // Let the selected queue choose which of its container to preempt if (candidateQueue != null) { toBePreempted = candidateQueue.preemptContainer(); } {code} a candidate child queue is selected. However, if the queue's usage isn't over its fairshare, preemption will not happen: {code} if (!preemptContainerPreCheck()) { return toBePreempted; } {code} A scenario: {code} root /\ queue1 queue2 /\ queue1.3, ( queue1.4 ) {code} suppose there're 8 containers, and queues at any level have the same weight. queue1.3 takes 4 and queue2 takes 4, so both queue1 and queue2 are at their fairshare. Now we submit an app in queue1.4 that needs 4 containers; it should preempt 2 from queue1.3, but the candidate-container selection procedure will stop at a level where none of the child queues is over its fairshare, and none of the containers will be preempted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
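A miniature of the selection walk described above, under the assumption that every queue applies the over-fair-share pre-check before descending. Class and method names are illustrative, not the actual FairScheduler types:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of the candidate-selection walk; illustrative names only.
public class PreemptWalk {
    final String name;
    final int usage, fairShare;
    final List<PreemptWalk> children = new ArrayList<>();

    PreemptWalk(String name, int usage, int fairShare) {
        this.name = name; this.usage = usage; this.fairShare = fairShare;
    }

    // Mirrors preemptContainerPreCheck(): refuse to give up containers at fair share or below.
    boolean overFairShare() { return usage > fairShare; }

    // Returns the leaf queue a container would be preempted from, or null if the walk stops.
    PreemptWalk selectVictim() {
        if (!overFairShare()) return null;   // the walk stops at this level
        if (children.isEmpty()) return this;
        PreemptWalk candidate = null;        // pick the child most over its fair share
        for (PreemptWalk c : children) {
            if (candidate == null
                    || c.usage - c.fairShare > candidate.usage - candidate.fairShare) {
                candidate = c;
            }
        }
        return candidate.selectVictim();
    }

    public static void main(String[] args) {
        // 8 containers, equal weights: queue1/queue2 get 4 each, queue1.3/queue1.4 get 2 each.
        PreemptWalk root = new PreemptWalk("root", 8, 8);
        PreemptWalk queue1 = new PreemptWalk("queue1", 4, 4);
        PreemptWalk queue13 = new PreemptWalk("queue1.3", 4, 2);
        root.children.add(queue1);
        root.children.add(new PreemptWalk("queue2", 4, 4));
        queue1.children.add(queue13);
        queue1.children.add(new PreemptWalk("queue1.4", 0, 2));

        // queue1.3 is 2 over its own fair share, yet no victim is found,
        // because neither root nor queue1 is over its fair share.
        System.out.println("queue1.3 over share: " + queue13.overFairShare());
        System.out.println("victim found: " + (root.selectVictim() != null));
    }
}
```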
[jira] [Commented] (YARN-4106) NodeLabels for NM in distributed mode is not updated even after clusterNodelabel addition in RM
[ https://issues.apache.org/jira/browse/YARN-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736112#comment-14736112 ] Hadoop QA commented on YARN-4106: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 37s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 55s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 4s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 36s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 30s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 35s | The patch built with eclipse:eclipse. | | {color:red}-1{color} | findbugs | 1m 15s | The patch appears to introduce 1 new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 7m 36s | Tests passed in hadoop-yarn-server-nodemanager. 
| | | | 46m 34s | | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-yarn-server-nodemanager | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12754800/0006-YARN-4106.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / a153b96 | | Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/9051/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9051/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9051/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9051/console | This message was automatically generated. 
> NodeLabels for NM in distributed mode is not updated even after > clusterNodelabel addition in RM > --- > > Key: YARN-4106 > URL: https://issues.apache.org/jira/browse/YARN-4106 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4106.patch, 0002-YARN-4106.patch, > 0003-YARN-4106.patch, 0004-YARN-4106.patch, 0005-YARN-4106.patch, > 0006-YARN-4106.patch > > > NodeLabels for NM in distributed mode is not updated even after > clusterNodelabel addition in RM > Steps to reproduce > === > # Configure nodelabel in distributed mode > yarn.node-labels.configuration-type=distributed > provider = config > yarn.nodemanager.node-labels.provider.fetch-interval-ms=12ms > # Start RM the NM > # Once NM is registration is done add nodelabels in RM > Nodelabels not getting updated in RM side > *This jira also handles the below issue too* > Timer Task not getting triggered in Nodemanager for Label update in > nodemanager for distributed scheduling > Task is supposed to trigger every > {{yarn.nodemanager.node-labels.provider.fetch-interval-ms}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4120) FSAppAttempt.getResourceUsage() should not take preemptedResource into account
[ https://issues.apache.org/jira/browse/YARN-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736100#comment-14736100 ] Karthik Kambatla commented on YARN-4120: That is also a valid concern. Can we track it in a separate JIRA? The preemption logic definitely needs revisiting. YARN-2154 is a starting point. [~asuresh] and I have been considering significant logic changes to better accommodate both preemption and future features like node-labeling, but haven't found the time to write it up and post here. > FSAppAttempt.getResourceUsage() should not take preemptedResource into account > -- > > Key: YARN-4120 > URL: https://issues.apache.org/jira/browse/YARN-4120 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: Xianyin Xin > > When computing resource usage for Schedulables, the following code is involved, > {{FSAppAttempt.getResourceUsage}}, > {code} > public Resource getResourceUsage() { > return Resources.subtract(getCurrentConsumption(), getPreemptedResources()); > } > {code} > and this value is aggregated into FSLeafQueues and FSParentQueues. In my > opinion, taking {{preemptedResource}} into account here is not reasonable, > for two main reasons: > # it is something in the future, i.e., even though these resources are marked as > preempted, they are currently used by the app, and they will be > subtracted from {{currentConsumption}} once the preemption is finished. It's > not reasonable to account for them ahead of time. > # there's another problem here; consider the following case, > {code} > root >/\ > queue1 queue2 > /\ > queue1.3, queue1.4 > {code} > suppose queue1.3 needs resources and can preempt resources from queue1.4; > the preemption happens in the interior of queue1. But when computing the resource > usage of queue1, {{queue1.resourceUsage = its_current_resource_usage - > preemption}} according to the current code, which is unfair to queue2 when > allocating resources. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4133) Containers to be preempted leaks in FairScheduler preemption logic.
[ https://issues.apache.org/jira/browse/YARN-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4133: Attachment: YARN-4133.000.patch > Containers to be preempted leaks in FairScheduler preemption logic. > --- > > Key: YARN-4133 > URL: https://issues.apache.org/jira/browse/YARN-4133 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.1 >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-4133.000.patch > > > Containers to be preempted leak in the FairScheduler preemption logic. This may > cause missed preemption because containers in {{warnedContainers}} are wrongly > removed. The problem is in {{preemptResources}}: > There are two issues which can cause containers to be wrongly removed from > {{warnedContainers}}: > First, the condition check misses the container state {{RMContainerState.ACQUIRED}}: > {code} > (container.getState() == RMContainerState.RUNNING || > container.getState() == RMContainerState.ALLOCATED) > {code} > Second, if {{isResourceGreaterThanNone(toPreempt)}} returns false, we > shouldn't remove the container from {{warnedContainers}}. We should only remove > a container from {{warnedContainers}} if it is not in state > {{RMContainerState.RUNNING}}, {{RMContainerState.ALLOCATED}}, or > {{RMContainerState.ACQUIRED}}. > {code} > if ((container.getState() == RMContainerState.RUNNING || > container.getState() == RMContainerState.ALLOCATED) && > isResourceGreaterThanNone(toPreempt)) { > warnOrKillContainer(container); > Resources.subtractFrom(toPreempt, > container.getContainer().getResource()); > } else { > warnedIter.remove(); > } > {code} > Also, once containers in {{warnedContainers}} are wrongly removed, they will > never be preempted, because these containers are already in > {{FSAppAttempt#preemptionMap}} and {{FSAppAttempt#preemptContainer}} won't > return the containers in {{FSAppAttempt#preemptionMap}}. 
> {code} > public RMContainer preemptContainer() { > if (LOG.isDebugEnabled()) { > LOG.debug("App " + getName() + " is going to preempt a running " + > "container"); > } > RMContainer toBePreempted = null; > for (RMContainer container : getLiveContainers()) { > if (!getPreemptionContainers().contains(container) && > (toBePreempted == null || > comparator.compare(toBePreempted, container) > 0)) { > toBePreempted = container; > } > } > return toBePreempted; > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
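The fix described in the report above — keep a container in {{warnedContainers}} while it is still RUNNING, ALLOCATED, or ACQUIRED, and never drop it merely because {{isResourceGreaterThanNone(toPreempt)}} returned false — can be sketched as a pair of predicates. The enum and names below are illustrative, not the actual RMContainerState type:

```java
// Hedged sketch of the corrected warnedContainers bookkeeping; illustrative names only.
public class WarnedContainers {
    enum State { NEW, ALLOCATED, ACQUIRED, RUNNING, COMPLETED, KILLED }

    // Only drop a container from warnedContainers once it has left all three live states.
    // Including ACQUIRED here is the first fix the report calls for.
    static boolean shouldRemove(State s) {
        return s != State.RUNNING && s != State.ALLOCATED && s != State.ACQUIRED;
    }

    // Warn/kill only while more resource is still needed AND the container is live.
    // A false "still needed" flag alone must not cause removal from warnedContainers
    // (the second fix): removal depends on state only, via shouldRemove above.
    static boolean shouldWarnOrKill(State s, boolean resourceStillNeeded) {
        return resourceStillNeeded && !shouldRemove(s);
    }
}
```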
[jira] [Updated] (YARN-4133) Containers to be preempted leaks in FairScheduler preemption logic.
[ https://issues.apache.org/jira/browse/YARN-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4133: Attachment: (was: YARN-4133.000.patch) > Containers to be preempted leaks in FairScheduler preemption logic. > --- > > Key: YARN-4133 > URL: https://issues.apache.org/jira/browse/YARN-4133 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.1 >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-4133.000.patch > > > Containers to be preempted leaks in FairScheduler preemption logic. It may > cause missing preemption due to containers in {{warnedContainers}} wrongly > removed. The problem is in {{preemptResources}}: > There are two issues which can cause containers wrongly removed from > {{warnedContainers}}: > Firstly missing the container state {{RMContainerState.ACQUIRED}} in the > condition check: > {code} > (container.getState() == RMContainerState.RUNNING || > container.getState() == RMContainerState.ALLOCATED) > {code} > Secondly if {{isResourceGreaterThanNone(toPreempt)}} return false, we > shouldn't remove container from {{warnedContainers}}. We should only remove > container from {{warnedContainers}}, if container is not in state > {{RMContainerState.RUNNING}}, {{RMContainerState.ALLOCATED}} and > {{RMContainerState.ACQUIRED}}. > {code} > if ((container.getState() == RMContainerState.RUNNING || > container.getState() == RMContainerState.ALLOCATED) && > isResourceGreaterThanNone(toPreempt)) { > warnOrKillContainer(container); > Resources.subtractFrom(toPreempt, > container.getContainer().getResource()); > } else { > warnedIter.remove(); > } > {code} > Also once the containers in {{warnedContainers}} are wrongly removed, it will > never be preempted. Because these containers are already in > {{FSAppAttempt#preemptionMap}} and {{FSAppAttempt#preemptContainer}} won't > return the containers in {{FSAppAttempt#preemptionMap}}. 
> {code} > public RMContainer preemptContainer() { > if (LOG.isDebugEnabled()) { > LOG.debug("App " + getName() + " is going to preempt a running " + > "container"); > } > RMContainer toBePreempted = null; > for (RMContainer container : getLiveContainers()) { > if (!getPreemptionContainers().contains(container) && > (toBePreempted == null || > comparator.compare(toBePreempted, container) > 0)) { > toBePreempted = container; > } > } > return toBePreempted; > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4090) Make Collections.sort() more efficient in FSParentQueue.java
[ https://issues.apache.org/jira/browse/YARN-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736068#comment-14736068 ] Xianyin Xin commented on YARN-4090: --- Hi [~leftnoteasy], [~kasha], would you please take a look? Since this change is related to preemption, I linked it with YARN-4120. > Make Collections.sort() more efficient in FSParentQueue.java > > > Key: YARN-4090 > URL: https://issues.apache.org/jira/browse/YARN-4090 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Reporter: Xianyin Xin >Assignee: Xianyin Xin > Attachments: YARN-4090-TestResult.pdf, YARN-4090-preview.patch, > sampling1.jpg, sampling2.jpg > > > Collections.sort() consumes too much time in a scheduling round. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4133) Containers to be preempted leaks in FairScheduler preemption logic.
[ https://issues.apache.org/jira/browse/YARN-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736052#comment-14736052 ] Hadoop QA commented on YARN-4133: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 15m 41s | Findbugs (version ) appears to be broken on trunk. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 7m 52s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 3s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 25s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 28s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 29s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 54m 10s | Tests failed in hadoop-yarn-server-resourcemanager. 
| | | | 92m 6s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.scheduler.fair.TestAllocationFileLoaderService | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12754780/YARN-4133.000.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / d9c1fab | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9049/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9049/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9049/console | This message was automatically generated. > Containers to be preempted leaks in FairScheduler preemption logic. > --- > > Key: YARN-4133 > URL: https://issues.apache.org/jira/browse/YARN-4133 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.1 >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-4133.000.patch > > > Containers to be preempted leaks in FairScheduler preemption logic. It may > cause missing preemption due to containers in {{warnedContainers}} wrongly > removed. The problem is in {{preemptResources}}: > There are two issues which can cause containers wrongly removed from > {{warnedContainers}}: > Firstly missing the container state {{RMContainerState.ACQUIRED}} in the > condition check: > {code} > (container.getState() == RMContainerState.RUNNING || > container.getState() == RMContainerState.ALLOCATED) > {code} > Secondly if {{isResourceGreaterThanNone(toPreempt)}} return false, we > shouldn't remove container from {{warnedContainers}}. 
We should only remove > container from {{warnedContainers}}, if container is not in state > {{RMContainerState.RUNNING}}, {{RMContainerState.ALLOCATED}} and > {{RMContainerState.ACQUIRED}}. > {code} > if ((container.getState() == RMContainerState.RUNNING || > container.getState() == RMContainerState.ALLOCATED) && > isResourceGreaterThanNone(toPreempt)) { > warnOrKillContainer(container); > Resources.subtractFrom(toPreempt, > container.getContainer().getResource()); > } else { > warnedIter.remove(); > } > {code} > Also once the containers in {{warnedContainers}} are wrongly removed, it will > never be preempted. Because these containers are already in > {{FSAppAttempt#preemptionMap}} and {{FSAppAttempt#preemptContainer}} won't > return the containers in {{FSAppAttempt#preemptionMap}}. > {code} > public RMContainer preemptContainer() { > if (LOG.isDebugEnabled()) { > LOG.debug("App " + getName() + " is going to preempt a running " + > "container"); > } > RMContainer toBePreempted = null; > for (RMContainer container : getLiveContainers()) { > if (!getPreemptio
[jira] [Commented] (YARN-4106) NodeLabels for NM in distributed mode is not updated even after clusterNodelabel addition in RM
[ https://issues.apache.org/jira/browse/YARN-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736025#comment-14736025 ] Bibin A Chundatt commented on YARN-4106: Hi [~leftnoteasy], thanks for the comments. Updated patch uploaded. > NodeLabels for NM in distributed mode is not updated even after > clusterNodelabel addition in RM > --- > > Key: YARN-4106 > URL: https://issues.apache.org/jira/browse/YARN-4106 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4106.patch, 0002-YARN-4106.patch, > 0003-YARN-4106.patch, 0004-YARN-4106.patch, 0005-YARN-4106.patch, > 0006-YARN-4106.patch > > > NodeLabels for the NM in distributed mode are not updated even after > clusterNodelabel addition in the RM > Steps to reproduce > === > # Configure nodelabels in distributed mode > yarn.node-labels.configuration-type=distributed > provider = config > yarn.nodemanager.node-labels.provider.fetch-interval-ms=12ms > # Start the RM and the NM > # Once NM registration is done, add nodelabels in the RM > Nodelabels are not getting updated on the RM side > *This jira also handles the below issue* > Timer task not getting triggered in the NodeManager for label update in > distributed scheduling > The task is supposed to trigger every > {{yarn.nodemanager.node-labels.provider.fetch-interval-ms}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4106) NodeLabels for NM in distributed mode is not updated even after clusterNodelabel addition in RM
[ https://issues.apache.org/jira/browse/YARN-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-4106: --- Attachment: 0006-YARN-4106.patch > NodeLabels for NM in distributed mode is not updated even after > clusterNodelabel addition in RM > --- > > Key: YARN-4106 > URL: https://issues.apache.org/jira/browse/YARN-4106 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4106.patch, 0002-YARN-4106.patch, > 0003-YARN-4106.patch, 0004-YARN-4106.patch, 0005-YARN-4106.patch, > 0006-YARN-4106.patch > > > NodeLabels for NM in distributed mode is not updated even after > clusterNodelabel addition in RM > Steps to reproduce > === > # Configure nodelabel in distributed mode > yarn.node-labels.configuration-type=distributed > provider = config > yarn.nodemanager.node-labels.provider.fetch-interval-ms=12ms > # Start RM the NM > # Once NM is registration is done add nodelabels in RM > Nodelabels not getting updated in RM side > *This jira also handles the below issue too* > Timer Task not getting triggered in Nodemanager for Label update in > nodemanager for distributed scheduling > Task is supposed to trigger every > {{yarn.nodemanager.node-labels.provider.fetch-interval-ms}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4126) RM should not issue delegation tokens in unsecure mode
[ https://issues.apache.org/jira/browse/YARN-4126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-4126: --- Attachment: 0004-YARN-4126.patch > RM should not issue delegation tokens in unsecure mode > -- > > Key: YARN-4126 > URL: https://issues.apache.org/jira/browse/YARN-4126 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4126.patch, 0002-YARN-4126.patch, > 0003-YARN-4126.patch, 0004-YARN-4126.patch > > > ClientRMService#getDelegationToken is currently returning a delegation token > in insecure mode. We should not return the token if it's in insecure mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4133) Containers to be preempted leaks in FairScheduler preemption logic.
[ https://issues.apache.org/jira/browse/YARN-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735987#comment-14735987 ] Xianyin Xin commented on YARN-4133: --- Of course, we can also address these problems one by one in different jiras. If you prefer that, just ignore the above comment. > Containers to be preempted leaks in FairScheduler preemption logic. > --- > > Key: YARN-4133 > URL: https://issues.apache.org/jira/browse/YARN-4133 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.1 >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-4133.000.patch > > > Containers to be preempted leak in the FairScheduler preemption logic. This may > cause missed preemption because containers in {{warnedContainers}} are wrongly > removed. The problem is in {{preemptResources}}: > There are two issues which can cause containers to be wrongly removed from > {{warnedContainers}}: > First, the condition check misses the container state {{RMContainerState.ACQUIRED}}: > {code} > (container.getState() == RMContainerState.RUNNING || > container.getState() == RMContainerState.ALLOCATED) > {code} > Second, if {{isResourceGreaterThanNone(toPreempt)}} returns false, we > shouldn't remove the container from {{warnedContainers}}. We should only remove > a container from {{warnedContainers}} if it is not in state > {{RMContainerState.RUNNING}}, {{RMContainerState.ALLOCATED}}, or > {{RMContainerState.ACQUIRED}}. > {code} > if ((container.getState() == RMContainerState.RUNNING || > container.getState() == RMContainerState.ALLOCATED) && > isResourceGreaterThanNone(toPreempt)) { > warnOrKillContainer(container); > Resources.subtractFrom(toPreempt, > container.getContainer().getResource()); > } else { > warnedIter.remove(); > } > {code} > Also, once containers in {{warnedContainers}} are wrongly removed, they will > never be preempted. 
Because these containers are already in > {{FSAppAttempt#preemptionMap}} and {{FSAppAttempt#preemptContainer}} won't > return the containers in {{FSAppAttempt#preemptionMap}}. > {code} > public RMContainer preemptContainer() { > if (LOG.isDebugEnabled()) { > LOG.debug("App " + getName() + " is going to preempt a running " + > "container"); > } > RMContainer toBePreempted = null; > for (RMContainer container : getLiveContainers()) { > if (!getPreemptionContainers().contains(container) && > (toBePreempted == null || > comparator.compare(toBePreempted, container) > 0)) { > toBePreempted = container; > } > } > return toBePreempted; > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4133) Containers to be preempted leaks in FairScheduler preemption logic.
[ https://issues.apache.org/jira/browse/YARN-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4133: Description: Containers to be preempted leaks in FairScheduler preemption logic. It may cause missing preemption due to containers in {{warnedContainers}} wrongly removed. The problem is in {{preemptResources}}: There are two issues which can cause containers wrongly removed from {{warnedContainers}}: Firstly missing the container state {{RMContainerState.ACQUIRED}} in the condition check: {code} (container.getState() == RMContainerState.RUNNING || container.getState() == RMContainerState.ALLOCATED) {code} Secondly if {{isResourceGreaterThanNone(toPreempt)}} return false, we shouldn't remove container from {{warnedContainers}}. We should only remove container from {{warnedContainers}}, if container is not in state {{RMContainerState.RUNNING}}, {{RMContainerState.ALLOCATED}} and {{RMContainerState.ACQUIRED}}. {code} if ((container.getState() == RMContainerState.RUNNING || container.getState() == RMContainerState.ALLOCATED) && isResourceGreaterThanNone(toPreempt)) { warnOrKillContainer(container); Resources.subtractFrom(toPreempt, container.getContainer().getResource()); } else { warnedIter.remove(); } {code} Also once the containers in {{warnedContainers}} are wrongly removed, it will never be preempted. Because these containers are already in {{FSAppAttempt#preemptionMap}} and {{FSAppAttempt#preemptContainer}} won't return the containers in {{FSAppAttempt#preemptionMap}}. 
{code} public RMContainer preemptContainer() { if (LOG.isDebugEnabled()) { LOG.debug("App " + getName() + " is going to preempt a running " + "container"); } RMContainer toBePreempted = null; for (RMContainer container : getLiveContainers()) { if (!getPreemptionContainers().contains(container) && (toBePreempted == null || comparator.compare(toBePreempted, container) > 0)) { toBePreempted = container; } } return toBePreempted; } {code} was: Containers to be preempted leaks in FairScheduler preemption logic. It may cause missing preemption due to containers in {{warnedContainers}} wrongly removed. The problem is in {{preemptResources}}: There are two issues which can cause containers wrongly removed from {{warnedContainers}}: Firstly missing the container state {{RMContainerState.ACQUIRED}} in the condition check: {code} (container.getState() == RMContainerState.RUNNING || container.getState() == RMContainerState.ALLOCATED) {code} Secondly if {{isResourceGreaterThanNone(toPreempt)}} return false, we shouldn't remove container from {{warnedContainers}}, We should only remove container from {{warnedContainers}}, if container is not in state {{RMContainerState.RUNNING}}, {{RMContainerState.ALLOCATED}} and {{RMContainerState.ACQUIRED}}. {code} if ((container.getState() == RMContainerState.RUNNING || container.getState() == RMContainerState.ALLOCATED) && isResourceGreaterThanNone(toPreempt)) { warnOrKillContainer(container); Resources.subtractFrom(toPreempt, container.getContainer().getResource()); } else { warnedIter.remove(); } {code} > Containers to be preempted leaks in FairScheduler preemption logic. > --- > > Key: YARN-4133 > URL: https://issues.apache.org/jira/browse/YARN-4133 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.1 >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-4133.000.patch > > > Containers to be preempted leaks in FairScheduler preemption logic. 
It may > cause missing preemption due to containers in {{warnedContainers}} wrongly > removed. The problem is in {{preemptResources}}: > There are two issues which can cause containers wrongly removed from > {{warnedContainers}}: > Firstly missing the container state {{RMContainerState.ACQUIRED}} in the > condition check: > {code} > (container.getState() == RMContainerState.RUNNING || > container.getState() == RMContainerState.ALLOCATED) > {code} > Secondly if {{isResourceGreaterThanNone(toPreempt)}} return false, we > shouldn't remove container from {{warnedContainers}}. We should only remove > container from {{warnedContainers}}, if container is not in state > {{RMContainerState.RUNNING}}, {{RMContainerState.ALLOCATED}} and > {{RMContainerState.ACQUIRED}}. > {code} > if ((container.getState() == RMContainerState.RUNNING || > container.getState() == RMContainerState.ALLOCATED) && > isResourceGreaterThanNone(toPreempt)) { > warnOrKill
[jira] [Commented] (YARN-4133) Containers to be preempted leaks in FairScheduler preemption logic.
[ https://issues.apache.org/jira/browse/YARN-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735959#comment-14735959 ] Xianyin Xin commented on YARN-4133: --- Hi [~zxu], it seems the current preemption logic has many problems. I just noted one in [https://issues.apache.org/jira/browse/YARN-4120?focusedCommentId=14735952&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14735952]. I think a logic refactor is needed; what do you think? > Containers to be preempted leaks in FairScheduler preemption logic. > --- > > Key: YARN-4133 > URL: https://issues.apache.org/jira/browse/YARN-4133 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.1 >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-4133.000.patch > > > Containers to be preempted leak in the FairScheduler preemption logic. This may > cause missed preemption because containers in {{warnedContainers}} are wrongly > removed. The problem is in {{preemptResources}}: > There are two issues which can cause containers to be wrongly removed from > {{warnedContainers}}: > First, the condition check misses the container state {{RMContainerState.ACQUIRED}}: > {code} > (container.getState() == RMContainerState.RUNNING || > container.getState() == RMContainerState.ALLOCATED) > {code} > Second, if {{isResourceGreaterThanNone(toPreempt)}} returns false, we > shouldn't remove the container from {{warnedContainers}}. We should only remove > a container from {{warnedContainers}} if it is not in state > {{RMContainerState.RUNNING}}, {{RMContainerState.ALLOCATED}}, or > {{RMContainerState.ACQUIRED}}. 
> {code} > if ((container.getState() == RMContainerState.RUNNING || > container.getState() == RMContainerState.ALLOCATED) && > isResourceGreaterThanNone(toPreempt)) { > warnOrKillContainer(container); > Resources.subtractFrom(toPreempt, > container.getContainer().getResource()); > } else { > warnedIter.remove(); > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4120) FSAppAttempt.getResourceUsage() should not take preemptedResource into account
[ https://issues.apache.org/jira/browse/YARN-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735952#comment-14735952 ] Xianyin Xin commented on YARN-4120: --- Hi [~kasha], there's another issue in the current preemption logic; it's in {{FSParentQueue.java}} and {{FSLeafQueue.java}}: {code} public RMContainer preemptContainer() { RMContainer toBePreempted = null; // Find the childQueue which is most over fair share FSQueue candidateQueue = null; Comparator comparator = policy.getComparator(); readLock.lock(); try { for (FSQueue queue : childQueues) { if (candidateQueue == null || comparator.compare(queue, candidateQueue) > 0) { candidateQueue = queue; } } } finally { readLock.unlock(); } // Let the selected queue choose which of its container to preempt if (candidateQueue != null) { toBePreempted = candidateQueue.preemptContainer(); } return toBePreempted; } {code} {code} public RMContainer preemptContainer() { RMContainer toBePreempted = null; // If this queue is not over its fair share, reject if (!preemptContainerPreCheck()) { return toBePreempted; } {code} If the queue hierarchy is like that in the *Description*, suppose queue1 and queue2 have the same weight, and the cluster has 8 containers, 4 occupied by queue1.1 and 4 occupied by queue2. If a new app is added in queue1.2, 2 containers should be preempted from queue1.1. However, according to the above code, queue1 and queue2 are both at their fairshare, so the preemption will not happen. So if all of the child queues at any level are at their fairshare, preemption will not happen even though there are resource deficits in some leaf queues. I think we have to drop this logic in this case. As an alternative, we can calculate an ideal preemption distribution by traversing the queues. Any thoughts? 
> FSAppAttempt.getResourceUsage() should not take preemptedResource into account > -- > > Key: YARN-4120 > URL: https://issues.apache.org/jira/browse/YARN-4120 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler > Reporter: Xianyin Xin > >
> When computing resource usage for Schedulables, the following code is involved, {{FSAppAttempt.getResourceUsage}},
> {code}
> public Resource getResourceUsage() {
>   return Resources.subtract(getCurrentConsumption(), getPreemptedResources());
> }
> {code}
> and this value is aggregated to FSLeafQueues and FSParentQueues. In my opinion, taking {{preemptedResource}} into account here is not reasonable, for two main reasons:
> # It is something in the future, i.e., even though these resources are marked as preempted, they are currently used by the app, and they will be subtracted from {{currentConsumption}} once the preemption is finished. It's not reasonable to account for them ahead of time.
> # There's another problem here; consider the following case,
> {code}
> root
>    /      \
> queue1   queue2
>   /   \
> queue1.3  queue1.4
> {code}
> suppose queue1.3 needs resources and it can preempt resources from queue1.4; the preemption happens in the interior of queue1. But when computing the resource usage of queue1, {{queue1.resourceUsage = its_current_resource_usage - preemption}} according to the current code, which is unfair to queue2 when doing resource allocation.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
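The dead end described in the comment above can be reproduced with a small standalone sketch. This is a hypothetical, simplified illustration with made-up numbers, not actual FairScheduler code: both top-level queues sit exactly at their fair share, so a per-queue pre-check like {{preemptContainerPreCheck()}} rejects preemption, even though a leaf queue inside queue1 has a deficit.

```java
// Hypothetical sketch (not FairScheduler code) of the fair-share pre-check
// short-circuit: resources are modeled as plain container counts.
public class PreCheckSketch {
    // The pre-check only allows preemption from queues over their fair share.
    static boolean overFairShare(int usage, int fairShare) {
        return usage > fairShare;
    }

    public static void main(String[] args) {
        // 8-container cluster; queue1 and queue2 have equal weights.
        int queue1Fair = 4, queue2Fair = 4;
        // queue1.1 holds all 4 of queue1's containers; queue2 holds 4.
        int queue1Usage = 4, queue2Usage = 4;
        // Neither top-level queue is over its fair share...
        boolean anyPreemptable = overFairShare(queue1Usage, queue1Fair)
            || overFairShare(queue2Usage, queue2Fair);
        System.out.println(anyPreemptable);  // false
        // ...yet inside queue1, queue1.2 (fair share 2) uses 0 while
        // queue1.1 uses 4, so 2 containers should move within queue1.
    }
}
```

This is why a top-down walk gated on per-queue fair share can never reach an intra-queue deficit, which is the motivation for computing an ideal preemption distribution across the whole hierarchy instead.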
[jira] [Updated] (YARN-4133) Containers to be preempted leaks in FairScheduler preemption logic.
[ https://issues.apache.org/jira/browse/YARN-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4133: Attachment: YARN-4133.000.patch
> Containers to be preempted leaks in FairScheduler preemption logic. > --- > > Key: YARN-4133 > URL: https://issues.apache.org/jira/browse/YARN-4133 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler > Affects Versions: 2.7.1 > Reporter: zhihai xu > Assignee: zhihai xu > Attachments: YARN-4133.000.patch > >
> Containers to be preempted leak in the FairScheduler preemption logic. This may cause preemption to be missed because containers are wrongly removed from {{warnedContainers}}. The problem is in {{preemptResources}}: there are two issues which can cause containers to be wrongly removed from {{warnedContainers}}.
> First, the container state {{RMContainerState.ACQUIRED}} is missing from the condition check:
> {code}
> (container.getState() == RMContainerState.RUNNING ||
>     container.getState() == RMContainerState.ALLOCATED)
> {code}
> Second, if {{isResourceGreaterThanNone(toPreempt)}} returns false, we shouldn't remove the container from {{warnedContainers}}. We should only remove a container from {{warnedContainers}} if it is not in state {{RMContainerState.RUNNING}}, {{RMContainerState.ALLOCATED}}, or {{RMContainerState.ACQUIRED}}.
> {code}
> if ((container.getState() == RMContainerState.RUNNING ||
>     container.getState() == RMContainerState.ALLOCATED) &&
>     isResourceGreaterThanNone(toPreempt)) {
>   warnOrKillContainer(container);
>   Resources.subtractFrom(toPreempt,
>       container.getContainer().getResource());
> } else {
>   warnedIter.remove();
> }
> {code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
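The corrected {{warnedContainers}} handling described in YARN-4133 can be sketched as a small standalone example. The names mirror the FairScheduler code, but this is an illustration of the fix's logic, not the actual patch: a container stays in the warned list while it is in any live state (RUNNING, ALLOCATED, or ACQUIRED), even when {{toPreempt}} is already exhausted, and only containers that have left those states are removed.

```java
// Hypothetical sketch of the corrected removal condition; container states
// are modeled directly as enum values instead of RMContainer objects.
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class PreemptionSketch {
    enum RMContainerState { NEW, ALLOCATED, ACQUIRED, RUNNING, COMPLETED }

    // A container is "live" (must stay in warnedContainers) when it is
    // RUNNING, ALLOCATED, or ACQUIRED -- ACQUIRED was the missing state.
    static boolean isLive(RMContainerState state) {
        return state == RMContainerState.RUNNING
            || state == RMContainerState.ALLOCATED
            || state == RMContainerState.ACQUIRED;
    }

    // One pass over the warned list; returns the states still tracked after.
    static List<RMContainerState> preemptPass(List<RMContainerState> warned,
                                              boolean resourceStillNeeded) {
        List<RMContainerState> result = new ArrayList<>(warned);
        Iterator<RMContainerState> it = result.iterator();
        while (it.hasNext()) {
            RMContainerState s = it.next();
            if (isLive(s)) {
                if (resourceStillNeeded) {
                    // warnOrKillContainer(container) would run here and
                    // toPreempt would be decremented.
                }
                // Live containers are kept even when toPreempt hits zero;
                // the old code dropped them here, leaking the preemption.
            } else {
                it.remove();  // only containers past their live states leave
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<RMContainerState> warned = new ArrayList<>();
        warned.add(RMContainerState.RUNNING);
        warned.add(RMContainerState.ACQUIRED);
        warned.add(RMContainerState.COMPLETED);
        // With toPreempt exhausted, only the COMPLETED container is dropped.
        System.out.println(preemptPass(warned, false).size());  // 2
    }
}
```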
[jira] [Commented] (YARN-4086) Allow Aggregated Log readers to handle HAR files
[ https://issues.apache.org/jira/browse/YARN-4086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735947#comment-14735947 ] Hadoop QA commented on YARN-4086: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 17m 39s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 6 new or modified test files. | | {color:green}+1{color} | javac | 7m 51s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 7s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 1m 24s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 2s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 34s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 2m 31s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 6m 55s | Tests passed in hadoop-yarn-client. | | {color:green}+1{color} | yarn tests | 2m 2s | Tests passed in hadoop-yarn-common. 
| | | | 51m 4s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12754773/YARN-4086.002.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / d9c1fab | | hadoop-yarn-client test log | https://builds.apache.org/job/PreCommit-YARN-Build/9048/artifact/patchprocess/testrun_hadoop-yarn-client.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/9048/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9048/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9048/console | This message was automatically generated. > Allow Aggregated Log readers to handle HAR files > > > Key: YARN-4086 > URL: https://issues.apache.org/jira/browse/YARN-4086 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.8.0 >Reporter: Robert Kanter >Assignee: Robert Kanter > Attachments: YARN-4086.001.patch, YARN-4086.002.patch > > > This is for the YARN changes for MAPREDUCE-6415. It allows the yarn CLI and > web UIs to read aggregated logs from HAR files. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735900#comment-14735900 ] Joep Rottinghuis commented on YARN-3901: The one remaining issue we have to tackle is when there are two app attempts. The previous app attempt ends up buffering some writes, and the new app attempt ends up writing a final_value. Now if the flush happens before the first attempt's write comes in, we no longer have the unaggregated value for that app_id to discard against (the timestamp should have taken care of this ordering). We can deal with this issue in three ways:
1) Ignore it (risky and very hard to debug if it ever happens).
2) Keep the final value around until it has aged a certain time. Upside is that the value is initially kept (for example, 1-2 days?) and then later discarded. Downside is that we won't collapse values as quickly on flush as we could. The collapse would probably happen when a compaction happens, possibly only when a major compaction happens. But previous unaggregated values may have been written to disk anyway, so I'm not sure how much of an issue this really is.
3) Keep a list of the last x app_ids (aggregation compaction dimension values) on the aggregated flow-level data. What we would then do in the aggregator is go through all the values as we currently do. We'd collapse all the values to keep only the latest per flow. Before we sum an item for the flow, we'd check whether the app_id is in the list of the most recent x (10) apps that were completed and collapsed. Pro is that with a lower app completion rate in a flow, we'd be guarded against stale writes for longer than a fixed time period. We'd still limit the size of extra storage in tags to a list of x (10?) items. Downside is that if apps complete in very rapid succession, we would potentially be protected from stale writes from an app for a shorter period of time.
Given that there is a correlation between an app completion and its previous run, this may not be a huge factor. It's not like random previous app attempts are launched. This is really to cover the case when a new app attempt is launched, but the previous writer had some buffered writes that somehow still got through. I'm sort of tempted towards 2, since that is the most similar to the existing TTL functionality, and probably the easiest to code and understand. Simply compact only after a certain time period has passed. > Populate flow run data in the flow_run & flow activity tables > - > > Key: YARN-3901 > URL: https://issues.apache.org/jira/browse/YARN-3901 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Vrushali C >Assignee: Vrushali C > Attachments: YARN-3901-YARN-2928.1.patch, > YARN-3901-YARN-2928.2.patch, YARN-3901-YARN-2928.3.patch, > YARN-3901-YARN-2928.4.patch, YARN-3901-YARN-2928.5.patch > > > As per the schema proposed in YARN-3815 in > https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf > filing jira to track creation and population of data in the flow run table. > Some points that are being considered: > - Stores per flow run information aggregated across applications, flow version > RM’s collector writes to on app creation and app completion > - Per App collector writes to it for metric updates at a slower frequency > than the metric updates to application table > primary key: cluster ! user ! flow ! flow run id > - Only the latest version of flow-level aggregated metrics will be kept, even > if the entity and application level keep a timeseries. > - The running_apps column will be incremented on app creation, and > decremented on app completion. > - For min_start_time the RM writer will simply write a value with the tag for > the applicationId. A coprocessor will return the min value of all written > values. 
- > - Upon flush and compactions, the min value between all the cells of this > column will be written to the cell without any tag (empty tag) and all the > other cells will be discarded. > - Ditto for the max_end_time, but then the max will be kept. > - Tags are represented as #type:value. The type can be not set (0), or can > indicate running (1) or complete (2). In those cases (for metrics) only > complete app metrics are collapsed on compaction. > - The m! values are aggregated (summed) upon read. Only when applications are > completed (indicated by tag type 2) can the values be collapsed. > - The application ids that have completed and been aggregated into the flow > numbers are retained in a separate column for historical tracking: we don’t > want to re-aggregate for those upon replay > -- This message was sent by Atlassian JIRA (v6.3.4#633
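Option 3 from the comment above can be sketched as a small bounded guard list. This is a hypothetical, simplified illustration; the real implementation would store the recently collapsed app_ids in HBase cell tags on the aggregated flow row, and the names here are invented for the sketch.

```java
// Sketch of a "last N collapsed app_ids" guard: a late (buffered) write
// from an app that was already folded into the flow aggregate is rejected.
import java.util.ArrayDeque;
import java.util.Deque;

public class StaleWriteGuard {
    private final int capacity;                       // the "x (10?)" bound
    private final Deque<String> recentAppIds = new ArrayDeque<>();
    private long aggregate = 0;

    StaleWriteGuard(int capacity) {
        this.capacity = capacity;
    }

    // Fold a completed app's value into the flow aggregate and remember it.
    void collapse(String appId, long value) {
        aggregate += value;
        recentAppIds.addLast(appId);
        if (recentAppIds.size() > capacity) {
            recentAppIds.removeFirst();  // oldest guard entry ages out
        }
    }

    // A late write is accepted only if its app was not already collapsed.
    boolean acceptLateWrite(String appId) {
        return !recentAppIds.contains(appId);
    }

    long getAggregate() {
        return aggregate;
    }

    public static void main(String[] args) {
        StaleWriteGuard guard = new StaleWriteGuard(10);
        guard.collapse("app_1", 5);
        guard.collapse("app_2", 7);
        // A stale buffered write from app_1's first attempt arrives late:
        System.out.println(guard.acceptLateWrite("app_1"));  // false
        System.out.println(guard.acceptLateWrite("app_3"));  // true
    }
}
```

As the comment notes, the trade-off is visible in the bound: a low app completion rate keeps guard entries around longer than a fixed TTL would, while rapid completions push them out sooner.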
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735887#comment-14735887 ] Joep Rottinghuis commented on YARN-3901: Thanks [~vrushalic]. I'm going to dig through the details on the latest patch. Separately [~sjlee0] and I further discussed the challenges of taking the timestamp on the coprocessor, buffering writes, app restarts, timestamp collisions, and ordering of various writes that come in.
1) Given that we have timestamps in millis, multiplying by 1,000 should suffice. It is unlikely that we'd have > 1M writes for one column in one region server for one flow. If we multiply by 1M we get close to the total date range that can fit in a long (still years to come, but still).
2) If we do any shifting of time, we should do the same everywhere to keep things consistent, and to keep the ability to ask what a particular row (roughly) looked like at any particular time (like last night at midnight, what was the state of this entire row).
3) We think that in the column helper, if the ATS client supplies a timestamp, we should multiply by 1,000. If we read any timestamp from HBase, we'll divide by 1,000.
4) If the ATS client doesn't supply the timestamp, we'll grab the timestamp in the ATS writer the moment the write arrives (and before it is batched / buffered in the buffered mutator, HBase client, or RS queue). We then take this time and multiply by 1,000. Reads again divide by 1,000 to get back to millis in epoch as before.
5) For Agg operations SUM, MIN, and MAX we take the least significant 3 digits of the app_id and add this to the (timestamp*1000), so that we create a unique timestamp per app in an active flow-run. This should avoid any collisions. This takes care of uniqueness (no collisions in a single ms), but also solves for older instances of a writer (in case of a second AM attempt, for example) or any other kind of ordering issue.
The writes are timestamped when they arrive at the writer.
6) If some piece of client code doesn't set any timestamp (this should be an error) then we cannot effectively order the writes as per the previous point. We still need to ensure that we don't have collisions. If the client-supplied timestamp is Long.MAX_VALUE, then we can generate the timestamp in the coprocessor on the server side, modulo the counter to ensure uniqueness. We should still multiply by 1K to leave the same amount of space for the unique counter.
> Populate flow run data in the flow_run & flow activity tables > - > > Key: YARN-3901 > URL: https://issues.apache.org/jira/browse/YARN-3901 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Vrushali C >Assignee: Vrushali C > Attachments: YARN-3901-YARN-2928.1.patch, > YARN-3901-YARN-2928.2.patch, YARN-3901-YARN-2928.3.patch, > YARN-3901-YARN-2928.4.patch, YARN-3901-YARN-2928.5.patch > > > As per the schema proposed in YARN-3815 in > https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf > filing jira to track creation and population of data in the flow run table. > Some points that are being considered: > - Stores per flow run information aggregated across applications, flow version > RM’s collector writes to on app creation and app completion > - Per App collector writes to it for metric updates at a slower frequency > than the metric updates to application table > primary key: cluster ! user ! flow ! flow run id > - Only the latest version of flow-level aggregated metrics will be kept, even > if the entity and application level keep a timeseries. > - The running_apps column will be incremented on app creation, and > decremented on app completion. > - For min_start_time the RM writer will simply write a value with the tag for > the applicationId. A coprocessor will return the min value of all written > values. 
- > - Upon flush and compactions, the min value between all the cells of this > column will be written to the cell without any tag (empty tag) and all the > other cells will be discarded. > - Ditto for the max_end_time, but then the max will be kept. > - Tags are represented as #type:value. The type can be not set (0), or can > indicate running (1) or complete (2). In those cases (for metrics) only > complete app metrics are collapsed on compaction. > - The m! values are aggregated (summed) upon read. Only when applications are > completed (indicated by tag type 2) can the values be collapsed. > - The application ids that have completed and been aggregated into the flow > numbers are retained in a separate column for historical tracking: we don’t > want to re-aggregate for those upon replay > -- This message was sent by Atlassian J
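The timestamp scheme in points 1-6 above can be sketched in a few lines. This is a simplified illustration with assumed names (the real code would live in the column helper and coprocessor): the millisecond timestamp is shifted left by three decimal digits, and the last three digits of the app id fill the freed space so concurrent writers in one flow run never collide on a cell timestamp.

```java
// Hypothetical sketch of the flow-run cell-timestamp encoding: write-side
// encode multiplies millis by 1,000 and mixes in app_id % 1000; read-side
// decode divides by 1,000 to recover epoch millis.
public class FlowRunTimestamp {
    static final long MULTIPLIER = 1000L;

    // Encode: reserve 3 decimal digits for a per-app uniqueness suffix.
    static long encode(long epochMillis, long appIdNumericPart) {
        return epochMillis * MULTIPLIER + (appIdNumericPart % MULTIPLIER);
    }

    // Decode on read: drop the uniqueness digits to get back epoch millis.
    static long decode(long cellTimestamp) {
        return cellTimestamp / MULTIPLIER;
    }

    public static void main(String[] args) {
        long millis = 1_441_900_000_000L;  // some epoch-millis instant
        long tsApp7 = encode(millis, 7);
        long tsApp8 = encode(millis, 8);
        // Same millisecond, different apps: distinct cell timestamps.
        System.out.println(tsApp7 != tsApp8);          // true
        System.out.println(decode(tsApp7) == millis);  // true
    }
}
```

The suffix also gives older writer instances (e.g. a stale first AM attempt) a stable, per-app position in the cell-version ordering, which is what makes the late-write discard in the coprocessor possible.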
[jira] [Updated] (YARN-4086) Allow Aggregated Log readers to handle HAR files
[ https://issues.apache.org/jira/browse/YARN-4086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter updated YARN-4086: Attachment: YARN-4086.002.patch The 002 patch makes that test less brittle. I also fixed the RAT and checkstyle warnings. The test failure was because test-patch couldn't handle the binary part of the patch. > Allow Aggregated Log readers to handle HAR files > > > Key: YARN-4086 > URL: https://issues.apache.org/jira/browse/YARN-4086 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.8.0 >Reporter: Robert Kanter >Assignee: Robert Kanter > Attachments: YARN-4086.001.patch, YARN-4086.002.patch > > > This is for the YARN changes for MAPREDUCE-6415. It allows the yarn CLI and > web UIs to read aggregated logs from HAR files. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4131) Add API and CLI to kill container on given containerId
[ https://issues.apache.org/jira/browse/YARN-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-4131: - Attachment: YARN-4131-v1.patch Updated the patch with the following changes:
1. Add ContainerKilledType to KillContainerRequest to indicate whether the container will be killed as preempted or expired (failed).
2. Add an async call in YarnClient per Steve's comments above.
3. Add more unit tests and fix build failures.
> Add API and CLI to kill container on given containerId > -- > > Key: YARN-4131 > URL: https://issues.apache.org/jira/browse/YARN-4131 > Project: Hadoop YARN > Issue Type: Sub-task > Components: applications, client > Reporter: Junping Du > Assignee: Junping Du > Attachments: YARN-4131-demo-2.patch, YARN-4131-demo.patch, > YARN-4131-v1.patch > > > Per YARN-3337, we need a handy tool to kill containers in some scenarios. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3999) RM hangs on draining events
[ https://issues.apache.org/jira/browse/YARN-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-3999: -- Labels: 2.6.1-candidate (was: ) Adding to 2.6.1 from Jian's comment in the mailing list that I missed before. > RM hangs on draining events > --- > > Key: YARN-3999 > URL: https://issues.apache.org/jira/browse/YARN-3999 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Jian He > Labels: 2.6.1-candidate > Fix For: 2.7.2 > > Attachments: YARN-3999-branch-2.7.patch, YARN-3999.1.patch, > YARN-3999.2.patch, YARN-3999.2.patch, YARN-3999.3.patch, YARN-3999.4.patch, > YARN-3999.5.patch, YARN-3999.patch, YARN-3999.patch > > > If external systems like ATS, or ZK becomes very slow, draining all the > events take a lot of time. If this time becomes larger than 10 mins, all > applications will expire. Fixes include: > 1. add a timeout and stop the dispatcher even if not all events are drained. > 2. Move ATS service out from RM active service so that RM doesn't need to > wait for ATS to flush the events when transitioning to standby. > 3. Stop client-facing services (ClientRMService etc.) first so that clients > get fast notification that RM is stopping/transitioning. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4133) Containers to be preempted leaks in FairScheduler preemption logic.
zhihai xu created YARN-4133: --- Summary: Containers to be preempted leaks in FairScheduler preemption logic. Key: YARN-4133 URL: https://issues.apache.org/jira/browse/YARN-4133 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.1 Reporter: zhihai xu Assignee: zhihai xu
Containers to be preempted leak in the FairScheduler preemption logic. This may cause preemption to be missed because containers are wrongly removed from {{warnedContainers}}. The problem is in {{preemptResources}}: there are two issues which can cause containers to be wrongly removed from {{warnedContainers}}.
First, the container state {{RMContainerState.ACQUIRED}} is missing from the condition check:
{code}
(container.getState() == RMContainerState.RUNNING ||
    container.getState() == RMContainerState.ALLOCATED)
{code}
Second, if {{isResourceGreaterThanNone(toPreempt)}} returns false, we shouldn't remove the container from {{warnedContainers}}. We should only remove a container from {{warnedContainers}} if it is not in state {{RMContainerState.RUNNING}}, {{RMContainerState.ALLOCATED}}, or {{RMContainerState.ACQUIRED}}.
{code}
if ((container.getState() == RMContainerState.RUNNING ||
    container.getState() == RMContainerState.ALLOCATED) &&
    isResourceGreaterThanNone(toPreempt)) {
  warnOrKillContainer(container);
  Resources.subtractFrom(toPreempt,
      container.getContainer().getResource());
} else {
  warnedIter.remove();
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1651) CapacityScheduler side changes to support increase/decrease container resource.
[ https://issues.apache.org/jira/browse/YARN-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735859#comment-14735859 ] Hadoop QA commented on YARN-1651: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 18m 2s | Findbugs (version ) appears to be broken on YARN-1197. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 20 new or modified test files. | | {color:red}-1{color} | javac | 8m 10s | The applied patch generated 1 additional warning messages. | | {color:green}+1{color} | javadoc | 10m 17s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 1m 55s | There were no new checkstyle issues. | | {color:red}-1{color} | whitespace | 31m 2s | The patch has 163 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 35s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:red}-1{color} | findbugs | 5m 29s | The patch appears to introduce 7 new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | mapreduce tests | 9m 26s | Tests passed in hadoop-mapreduce-client-app. | | {color:green}+1{color} | tools/hadoop tests | 0m 53s | Tests passed in hadoop-sls. | | {color:green}+1{color} | yarn tests | 6m 58s | Tests passed in hadoop-yarn-client. | | {color:green}+1{color} | yarn tests | 0m 26s | Tests passed in hadoop-yarn-server-common. | | {color:red}-1{color} | yarn tests | 59m 24s | Tests failed in hadoop-yarn-server-resourcemanager. 
| | | | 154m 43s | | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-yarn-server-common | | Failed unit tests | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations | | | hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart | | | hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12754736/YARN-1651-4.YARN-1197.patch | | Optional Tests | javac unit findbugs checkstyle javadoc | | git revision | YARN-1197 / f86eae1 | | javac | https://builds.apache.org/job/PreCommit-YARN-Build/9045/artifact/patchprocess/diffJavacWarnings.txt | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/9045/artifact/patchprocess/whitespace.txt | | Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/9045/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-common.html | | hadoop-mapreduce-client-app test log | https://builds.apache.org/job/PreCommit-YARN-Build/9045/artifact/patchprocess/testrun_hadoop-mapreduce-client-app.txt | | hadoop-sls test log | https://builds.apache.org/job/PreCommit-YARN-Build/9045/artifact/patchprocess/testrun_hadoop-sls.txt | | hadoop-yarn-client test log | https://builds.apache.org/job/PreCommit-YARN-Build/9045/artifact/patchprocess/testrun_hadoop-yarn-client.txt | | hadoop-yarn-server-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/9045/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9045/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9045/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT 
Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9045/console | This message was automatically generated. > CapacityScheduler side changes to support increase/decrease container > resource. > --- > > Key: YARN-1651 > URL: https://issues.apache.org/jira/browse/YARN-1651 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager, scheduler >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-1651-1.YARN-1197.patch, > YARN-1651-2.YARN-1197.patch, YARN-1651-3.YARN-1197.patch, > YARN-1651-4.YARN-1197.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4096) App local logs are leaked if log aggregation fails to initialize for the app
[ https://issues.apache.org/jira/browse/YARN-4096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735837#comment-14735837 ] Hudson commented on YARN-4096: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #345 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/345/]) YARN-4096. App local logs are leaked if log aggregation fails to initialize for the app. Contributed by Jason Lowe. (zxu: rev 16b9037dc1300b8bdbe54ba7cd47c53fe16e93d8) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregator.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java * hadoop-yarn-project/CHANGES.txt > App local logs are leaked if log aggregation fails to initialize for the app > > > Key: YARN-4096 > URL: https://issues.apache.org/jira/browse/YARN-4096 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation, nodemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Fix For: 2.7.2 > > Attachments: YARN-4096.001.patch > > > If log aggregation fails to initialize for an application then the local logs > will never be deleted. This is similar to YARN-3476 except this is a failure > when log aggregation tries to initialize the app-specific log aggregator > rather than a failure during a log upload. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Issue Comment Deleted] (YARN-4126) RM should not issue delegation tokens in unsecure mode
[ https://issues.apache.org/jira/browse/YARN-4126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-4126: -- Comment: was deleted (was: yes, oozie has fixed its own. This is just YARN side fix.) > RM should not issue delegation tokens in unsecure mode > -- > > Key: YARN-4126 > URL: https://issues.apache.org/jira/browse/YARN-4126 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4126.patch, 0002-YARN-4126.patch, > 0003-YARN-4126.patch > > > ClientRMService#getDelegationToken is currently returning a delegation token > in insecure mode. We should not return the token if it's in insecure mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4096) App local logs are leaked if log aggregation fails to initialize for the app
[ https://issues.apache.org/jira/browse/YARN-4096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735791#comment-14735791 ] Hudson commented on YARN-4096: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2284 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2284/]) YARN-4096. App local logs are leaked if log aggregation fails to initialize for the app. Contributed by Jason Lowe. (zxu: rev 16b9037dc1300b8bdbe54ba7cd47c53fe16e93d8) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregator.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java > App local logs are leaked if log aggregation fails to initialize for the app > > > Key: YARN-4096 > URL: https://issues.apache.org/jira/browse/YARN-4096 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation, nodemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Fix For: 2.7.2 > > Attachments: YARN-4096.001.patch > > > If log aggregation fails to initialize for an application then the local logs > will never be deleted. This is similar to YARN-3476 except this is a failure > when log aggregation tries to initialize the app-specific log aggregator > rather than a failure during a log upload. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4126) RM should not issue delegation tokens in unsecure mode
[ https://issues.apache.org/jira/browse/YARN-4126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735786#comment-14735786 ] Jian He commented on YARN-4126: --- yes, oozie has fixed its own. This is just YARN side fix. > RM should not issue delegation tokens in unsecure mode > -- > > Key: YARN-4126 > URL: https://issues.apache.org/jira/browse/YARN-4126 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4126.patch, 0002-YARN-4126.patch, > 0003-YARN-4126.patch > > > ClientRMService#getDelegationToken is currently returning a delegation token > in insecure mode. We should not return the token if it's in insecure mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors
[ https://issues.apache.org/jira/browse/YARN-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735766#comment-14735766 ] Hadoop QA commented on YARN-2410: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 15m 59s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 51s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 7s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 21s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 28s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 0m 44s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | mapreduce tests | 0m 19s | Tests passed in hadoop-mapreduce-client-shuffle. 
| | | | 37m 47s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12754746/YARN-2410-v7.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / d9c1fab | | hadoop-mapreduce-client-shuffle test log | https://builds.apache.org/job/PreCommit-YARN-Build/9046/artifact/patchprocess/testrun_hadoop-mapreduce-client-shuffle.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9046/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9046/console | This message was automatically generated. > Nodemanager ShuffleHandler can possible exhaust file descriptors > > > Key: YARN-2410 > URL: https://issues.apache.org/jira/browse/YARN-2410 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Nathan Roberts >Assignee: Kuhu Shukla > Attachments: YARN-2410-v1.patch, YARN-2410-v2.patch, > YARN-2410-v3.patch, YARN-2410-v4.patch, YARN-2410-v5.patch, > YARN-2410-v6.patch, YARN-2410-v7.patch > > > The async nature of the shufflehandler can cause it to open a huge number of > file descriptors, when it runs out it crashes. > Scenario: > Job with 6K reduces, slow start set to 0.95, about 40 map outputs per node. > Let's say all 6K reduces hit a node at about same time asking for their > outputs. Each reducer will ask for all 40 map outputs over a single socket in > a > single request (not necessarily all 40 at once, but with coalescing it is > likely to be a large number). > sendMapOutput() will open the file for random reading and then perform an > async transfer of the particular portion of this file(). This will > theoretically > happen 6000*40 = 240,000 times which will run the NM out of file descriptors and > cause it to crash. 
> The algorithm should be refactored a little to not open the fds until they're > actually needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
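The refactoring suggested in the description can be sketched with plain Java: queue a closure that knows how to open the file instead of an already-open descriptor, so queued transfers hold no fds. The structure and names below are illustrative only; the real fix lives in the ShuffleHandler's Netty pipeline.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;

public class LazyShuffle {
    // Stand-in for the process-wide open-descriptor count.
    public static final AtomicInteger OPEN_FDS = new AtomicInteger();

    private final Queue<IntSupplier> pending = new ArrayDeque<>();

    // At request time, remember only *how* to open the map output file.
    public void enqueue(IntSupplier openFile) {
        pending.add(openFile);
    }

    // The descriptor is opened just before the bytes go out and released
    // as soon as the (simulated) transfer completes, so even 6000 x 40
    // queued transfers never hold more than one fd at a time here.
    public void sendNext() {
        int fd = pending.remove().getAsInt();  // open happens here
        // ... transferTo(fd, ...) would run here ...
        OPEN_FDS.decrementAndGet();            // close after the send
    }

    public boolean hasPending() {
        return !pending.isEmpty();
    }
}
```

The key property is that enqueueing is free: descriptors exist only for the duration of an individual send, not for the lifetime of the queue.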
[jira] [Updated] (YARN-3985) Make ReservationSystem persist state using RMStateStore reservation APIs
[ https://issues.apache.org/jira/browse/YARN-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-3985: Component/s: (was: fairscheduler) (was: capacityscheduler) > Make ReservationSystem persist state using RMStateStore reservation APIs > - > > Key: YARN-3985 > URL: https://issues.apache.org/jira/browse/YARN-3985 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > > YARN-3736 adds the RMStateStore apis to store and load reservation state. > This jira adds the actual storing of state from ReservationSystem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4126) RM should not issue delegation tokens in unsecure mode
[ https://issues.apache.org/jira/browse/YARN-4126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735714#comment-14735714 ] Hadoop QA commented on YARN-4126: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 18m 25s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 3 new or modified test files. | | {color:green}+1{color} | javac | 7m 54s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 57s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 1m 50s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 31s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 35s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 3m 26s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | common tests | 23m 2s | Tests passed in hadoop-common. | | {color:red}-1{color} | yarn tests | 53m 35s | Tests failed in hadoop-yarn-server-resourcemanager. 
| | | | 120m 41s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens | | | hadoop.yarn.server.resourcemanager.TestClientRMService | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12754713/0003-YARN-4126.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 16b9037 | | hadoop-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/9042/artifact/patchprocess/testrun_hadoop-common.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9042/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9042/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9042/console | This message was automatically generated. > RM should not issue delegation tokens in unsecure mode > -- > > Key: YARN-4126 > URL: https://issues.apache.org/jira/browse/YARN-4126 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4126.patch, 0002-YARN-4126.patch, > 0003-YARN-4126.patch > > > ClientRMService#getDelegationToken is currently returning a delegation token > in insecure mode. We should not return the token if it's in insecure mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4096) App local logs are leaked if log aggregation fails to initialize for the app
[ https://issues.apache.org/jira/browse/YARN-4096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735704#comment-14735704 ] Hudson commented on YARN-4096: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2307 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2307/]) YARN-4096. App local logs are leaked if log aggregation fails to initialize for the app. Contributed by Jason Lowe. (zxu: rev 16b9037dc1300b8bdbe54ba7cd47c53fe16e93d8) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregator.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java > App local logs are leaked if log aggregation fails to initialize for the app > > > Key: YARN-4096 > URL: https://issues.apache.org/jira/browse/YARN-4096 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation, nodemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Fix For: 2.7.2 > > Attachments: YARN-4096.001.patch > > > If log aggregation fails to initialize for an application then the local logs > will never be deleted. This is similar to YARN-3476 except this is a failure > when log aggregation tries to initialize the app-specific log aggregator > rather than a failure during a log upload. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
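A hedged sketch of the failure mode and fix for YARN-4096: if creating the app-level aggregator throws, the local log directories must still be handed to the deletion service rather than leaked. All names here are illustrative, not the actual LogAggregationService code.

```java
import java.util.ArrayList;
import java.util.List;

public class AppLogInit {
    public final List<String> scheduledForDeletion = new ArrayList<>();

    public void initAppAggregator(String appId, boolean initFails) {
        try {
            if (initFails) {
                throw new RuntimeException("log aggregation init failed: " + appId);
            }
            // Normal path: the aggregator uploads the logs and deletes
            // them itself once the application finishes.
        } catch (RuntimeException e) {
            // The fix: on init failure, fall back to deleting the local
            // logs directly instead of leaving them on disk forever.
            scheduledForDeletion.add("/local/usercache/logs/" + appId);
        }
    }
}
```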
[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors
[ https://issues.apache.org/jira/browse/YARN-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-2410: -- Attachment: YARN-2410-v7.patch Modified ShuffleHandler to not use channel attachments. Moved MockNetty code to a helper method. > Nodemanager ShuffleHandler can possible exhaust file descriptors > > > Key: YARN-2410 > URL: https://issues.apache.org/jira/browse/YARN-2410 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Nathan Roberts >Assignee: Kuhu Shukla > Attachments: YARN-2410-v1.patch, YARN-2410-v2.patch, > YARN-2410-v3.patch, YARN-2410-v4.patch, YARN-2410-v5.patch, > YARN-2410-v6.patch, YARN-2410-v7.patch > > > The async nature of the shufflehandler can cause it to open a huge number of > file descriptors, when it runs out it crashes. > Scenario: > Job with 6K reduces, slow start set to 0.95, about 40 map outputs per node. > Let's say all 6K reduces hit a node at about same time asking for their > outputs. Each reducer will ask for all 40 map outputs over a single socket in > a > single request (not necessarily all 40 at once, but with coalescing it is > likely to be a large number). > sendMapOutput() will open the file for random reading and then perform an > async transfer of the particular portion of this file(). This will > theoretically > happen 6000*40 = 240,000 times which will run the NM out of file descriptors and > cause it to crash. > The algorithm should be refactored a little to not open the fds until they're > actually needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4132) Nodemanagers should try harder to connect to the RM
[ https://issues.apache.org/jira/browse/YARN-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735683#comment-14735683 ] Chang Li commented on YARN-4132: [~jlowe] please help review the latest patch. Thanks! > Nodemanagers should try harder to connect to the RM > --- > > Key: YARN-4132 > URL: https://issues.apache.org/jira/browse/YARN-4132 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4132.2.patch, YARN-4132.patch > > > Being part of the cluster, nodemanagers should try very hard (and possibly > never give up) to connect to a resourcemanager. Minimally we should have a > separate config to set how aggressively a nodemanager will connect to the RM > separate from what clients will do. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4132) Nodemanagers should try harder to connect to the RM
[ https://issues.apache.org/jira/browse/YARN-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735677#comment-14735677 ] Hadoop QA commented on YARN-4132: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 19m 15s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 52s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 1s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 52s | The applied patch generated 1 new checkstyle issues (total was 211, now 211). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 29s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 4m 20s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 23s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 1m 58s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 7m 55s | Tests passed in hadoop-yarn-server-nodemanager. 
| | | | 56m 39s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12754730/YARN-4132.2.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / d9c1fab | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/9043/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/9043/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/9043/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9043/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9043/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9043/console | This message was automatically generated. > Nodemanagers should try harder to connect to the RM > --- > > Key: YARN-4132 > URL: https://issues.apache.org/jira/browse/YARN-4132 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4132.2.patch, YARN-4132.patch > > > Being part of the cluster, nodemanagers should try very hard (and possibly > never give up) to connect to a resourcemanager. Minimally we should have a > separate config to set how aggressively a nodemanager will connect to the RM > separate from what clients will do. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1651) CapacityScheduler side changes to support increase/decrease container resource.
[ https://issues.apache.org/jira/browse/YARN-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735676#comment-14735676 ] MENG DING commented on YARN-1651: - Hi, [~leftnoteasy] bq. I agree with the general idea, and we should do something similar. However, I'm not sure caching in the RM is a good idea; a malicious AM could potentially send millions of unknown to-be-decreased containers to the RM when it restarts. Maybe it's better to cache on the AMRMClient side. I think we can do this in a separate JIRA? Could you file a new JIRA for this if you agree? Your proposal makes sense. I will file a JIRA for this. Thanks for addressing my comments. I don't have more comments for now. As per our discussion, I will come up with an end-to-end test based on distributedshell, and post it to this JIRA for review. > CapacityScheduler side changes to support increase/decrease container > resource. > --- > > Key: YARN-1651 > URL: https://issues.apache.org/jira/browse/YARN-1651 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager, scheduler >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-1651-1.YARN-1197.patch, > YARN-1651-2.YARN-1197.patch, YARN-1651-3.YARN-1197.patch, > YARN-1651-4.YARN-1197.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
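The client-side caching discussed above could look roughly like the following: the AMRMClient remembers outstanding decrease requests so it can resend them after an RM restart, instead of the RM caching containers it does not know about. This is a hypothetical sketch for the follow-up JIRA, not code from any attached patch.

```java
import java.util.HashMap;
import java.util.Map;

public class DecreaseRequestCache {
    // containerId -> requested target memory (MB); names are illustrative.
    private final Map<String, Integer> pendingDecrease = new HashMap<>();

    public void requestDecrease(String containerId, int targetMemoryMb) {
        pendingDecrease.put(containerId, targetMemoryMb);
    }

    // Called when the RM confirms the decrease in an allocate response.
    public void acknowledged(String containerId) {
        pendingDecrease.remove(containerId);
    }

    // On resync after RM failover, everything still pending gets resent.
    public Map<String, Integer> toResend() {
        return new HashMap<>(pendingDecrease);
    }
}
```

Keeping this state in the client sidesteps the malicious-AM concern: each AM only pays for its own pending requests.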
[jira] [Commented] (YARN-4096) App local logs are leaked if log aggregation fails to initialize for the app
[ https://issues.apache.org/jira/browse/YARN-4096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735672#comment-14735672 ] Hudson commented on YARN-4096: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #357 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/357/]) YARN-4096. App local logs are leaked if log aggregation fails to initialize for the app. Contributed by Jason Lowe. (zxu: rev 16b9037dc1300b8bdbe54ba7cd47c53fe16e93d8) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregator.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java * hadoop-yarn-project/CHANGES.txt > App local logs are leaked if log aggregation fails to initialize for the app > > > Key: YARN-4096 > URL: https://issues.apache.org/jira/browse/YARN-4096 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation, nodemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Fix For: 2.7.2 > > Attachments: YARN-4096.001.patch > > > If log aggregation fails to initialize for an application then the local logs > will never be deleted. This is similar to YARN-3476 except this is a failure > when log aggregation tries to initialize the app-specific log aggregator > rather than a failure during a log upload. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1651) CapacityScheduler side changes to support increase/decrease container resource.
[ https://issues.apache.org/jira/browse/YARN-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-1651: - Attachment: YARN-1651-4.YARN-1197.patch Thanks for the comments, [~mding]! bq. I only mention this because pullNewlyAllocatedContainers() has a check for null for the same logic, so I think we may want to make it consistent? Yes, you're correct; updated the code, thanks. bq. So, based on my understanding, if an application has reserved some resource for a container resource increase request on a node, that amount of resource should never be unreserved in order for the application to allocate a regular container on some other node. But that doesn't seem to be the case right now? Can you confirm? Done; added a check to {{getNodeIdToUnreserve}} that verifies whether a container is an increase reservation before cancelling it. bq. I think it will be desirable to implement a pendingDecrease set in SchedulerApplicationAttempt, and corresponding logic, just like SchedulerApplicationAttempt.pendingRelease. This is to guard against the situation when decrease requests are received while RM is in the middle of recovery, and has not received all container statuses from NM yet. I agree with the general idea, and we should do something similar. However, I'm not sure caching in the RM is a good idea; a malicious AM could potentially send millions of unknown to-be-decreased containers to the RM when it restarts. Maybe it's better to cache on the AMRMClient side. I think we can do this in a separate JIRA? Could you file a new JIRA for this if you agree? bq. Some nits... Addressed. Uploaded ver.4 patch. > CapacityScheduler side changes to support increase/decrease container > resource. 
> --- > > Key: YARN-1651 > URL: https://issues.apache.org/jira/browse/YARN-1651 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager, scheduler >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-1651-1.YARN-1197.patch, > YARN-1651-2.YARN-1197.patch, YARN-1651-3.YARN-1197.patch, > YARN-1651-4.YARN-1197.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vrushali C updated YARN-3901: - Attachment: YARN-3901-YARN-2928.5.patch Uploading patch v5 that incorporates Sangjin's review suggestions. > Populate flow run data in the flow_run & flow activity tables > - > > Key: YARN-3901 > URL: https://issues.apache.org/jira/browse/YARN-3901 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Vrushali C >Assignee: Vrushali C > Attachments: YARN-3901-YARN-2928.1.patch, > YARN-3901-YARN-2928.2.patch, YARN-3901-YARN-2928.3.patch, > YARN-3901-YARN-2928.4.patch, YARN-3901-YARN-2928.5.patch > > > As per the schema proposed in YARN-3815 in > https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf > filing jira to track creation and population of data in the flow run table. > Some points that are being considered: > - Stores per flow run information aggregated across applications, flow version > RM’s collector writes to on app creation and app completion > - Per App collector writes to it for metric updates at a slower frequency > than the metric updates to application table > primary key: cluster ! user ! flow ! flow run id > - Only the latest version of flow-level aggregated metrics will be kept, even > if the entity and application level keep a timeseries. > - The running_apps column will be incremented on app creation, and > decremented on app completion. > - For min_start_time the RM writer will simply write a value with the tag for > the applicationId. A coprocessor will return the min value of all written > values. - > - Upon flush and compactions, the min value between all the cells of this > column will be written to the cell without any tag (empty tag) and all the > other cells will be discarded. > - Ditto for the max_end_time, but then the max will be kept. > - Tags are represented as #type:value. 
The type can be not set (0), or can > indicate running (1) or complete (2). In those cases (for metrics) only > complete app metrics are collapsed on compaction. > - The m! values are aggregated (summed) upon read. Only when applications are > completed (indicated by tag type 2) can the values be collapsed. > - The application ids that have completed and been aggregated into the flow > numbers are retained in a separate column for historical tracking: we don’t > want to re-aggregate for those upon replay > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
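The collapse-on-compaction rule described above for min_start_time can be sketched as follows: cells tagged complete (2) fold into one untagged cell holding the minimum, while running cells (1) are preserved for the next pass. This is an illustrative stand-alone model; the real logic runs inside an HBase coprocessor on flush/compaction.

```java
import java.util.ArrayList;
import java.util.List;

public class FlowRunCompactor {
    public static final int TAG_NONE = 0;
    public static final int TAG_RUNNING = 1;
    public static final int TAG_COMPLETE = 2;

    public static class Cell {
        public final long value;
        public final int tag;
        public Cell(long value, int tag) { this.value = value; this.tag = tag; }
    }

    // Collapse min_start_time cells: all complete cells fold into a single
    // untagged cell with the minimum value; other cells survive untouched.
    public static List<Cell> collapseMin(List<Cell> cells) {
        long min = Long.MAX_VALUE;
        boolean anyComplete = false;
        List<Cell> kept = new ArrayList<>();
        for (Cell c : cells) {
            if (c.tag == TAG_COMPLETE) {
                min = Math.min(min, c.value);
                anyComplete = true;
            } else {
                kept.add(c);  // running/untagged cells are not collapsed
            }
        }
        if (anyComplete) {
            kept.add(0, new Cell(min, TAG_NONE));  // single collapsed cell
        }
        return kept;
    }
}
```

The max_end_time column would use the same shape with `Math.max`, and the m! metric columns would sum complete cells instead of taking an extreme.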
[jira] [Updated] (YARN-4132) Nodemanagers should try harder to connect to the RM
[ https://issues.apache.org/jira/browse/YARN-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4132: --- Attachment: YARN-4132.2.patch Fixed the broken test in TestYarnConfigurationFields. The other broken tests are not related to my changes (they appear to be caused by a network problem on the testing platform); they all pass with the .2 patch on my local machine. > Nodemanagers should try harder to connect to the RM > --- > > Key: YARN-4132 > URL: https://issues.apache.org/jira/browse/YARN-4132 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4132.2.patch, YARN-4132.patch > > > Being part of the cluster, nodemanagers should try very hard (and possibly > never give up) to connect to a resourcemanager. Minimally we should have a > separate config to set how aggressively a nodemanager will connect to the RM > separate from what clients will do. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
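The dedicated NM-to-RM retry policy this issue asks for can be sketched as a loop driven by its own max-wait setting, separate from the generic client retry config. Treating a negative max-wait as "retry forever" is an assumption made for this sketch, not necessarily the behavior of the attached patch.

```java
import java.util.function.IntSupplier;

public class RmConnector {
    // Returns true once tryConnect reports success (0), false if a bounded
    // policy exhausts its budget. maxWaitMs < 0 means never give up.
    public static boolean connectWithRetry(long maxWaitMs, long intervalMs,
                                           IntSupplier tryConnect) {
        long waited = 0;
        while (true) {
            if (tryConnect.getAsInt() == 0) {
                return true;                 // connected
            }
            if (maxWaitMs >= 0 && waited >= maxWaitMs) {
                return false;                // bounded policy gave up
            }
            waited += intervalMs;            // (a real loop would sleep here)
        }
    }
}
```

A nodemanager would be configured with an unbounded (or very large) budget, while ordinary clients keep their shorter default.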
[jira] [Commented] (YARN-4074) [timeline reader] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735545#comment-14735545 ] Sangjin Lee commented on YARN-4074: --- It'd be great if you could take a look at the latest patch and let me know your feedback. Thanks! > [timeline reader] implement support for querying for flows and flow runs > > > Key: YARN-4074 > URL: https://issues.apache.org/jira/browse/YARN-4074 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Sangjin Lee > Attachments: YARN-4074-YARN-2928.POC.001.patch, > YARN-4074-YARN-2928.POC.002.patch, YARN-4074-YARN-2928.POC.003.patch > > > Implement support for querying for flows and flow runs. > We should be able to query for the most recent N flows, etc. > This includes changes to the {{TimelineReader}} API if necessary, as well as > implementation of the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4075) [reader REST API] implement support for querying for flows and flow runs
[ https://issues.apache.org/jira/browse/YARN-4075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735541#comment-14735541 ] Sangjin Lee commented on YARN-4075: --- Sorry [~varun_saxena], it took me a while to review this. The patch looks good for the most part. FYI, I incorporated the XmlElement annotation for flow runs in {{FlowActivityEntity}} in YARN-4074. This change will be in the next patch (once I rebase with Vrushali's latest for YARN-3091). I also implemented the full {{compareTo()}} method already in the current patch for YARN-4074. > [reader REST API] implement support for querying for flows and flow runs > > > Key: YARN-4075 > URL: https://issues.apache.org/jira/browse/YARN-4075 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-4075-YARN-2928.POC.1.patch > > > We need to be able to query for flows and flow runs via REST. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4132) Nodemanagers should try harder to connect to the RM
[ https://issues.apache.org/jira/browse/YARN-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735533#comment-14735533 ] Hadoop QA commented on YARN-4132: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 17m 56s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 59s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 56s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 23s | The applied patch generated 3 new checkstyle issues (total was 211, now 213). | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 27s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 35s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 2m 48s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 0m 22s | Tests failed in hadoop-yarn-api. | | {color:red}-1{color} | yarn tests | 6m 52s | Tests failed in hadoop-yarn-server-nodemanager. 
| | | | 49m 56s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.conf.TestYarnConfigurationFields | | | hadoop.yarn.server.nodemanager.TestNodeStatusUpdater | | | hadoop.yarn.server.nodemanager.TestNodeManagerShutdown | | | hadoop.yarn.server.nodemanager.containermanager.TestNMProxy | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12754710/YARN-4132.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 970daaa | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/9041/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/9041/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9041/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9041/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9041/console | This message was automatically generated. > Nodemanagers should try harder to connect to the RM > --- > > Key: YARN-4132 > URL: https://issues.apache.org/jira/browse/YARN-4132 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4132.patch > > > Being part of the cluster, nodemanagers should try very hard (and possibly > never give up) to connect to a resourcemanager. Minimally we should have a > separate config to set how aggressively a nodemanager will connect to the RM > separate from what clients will do. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4096) App local logs are leaked if log aggregation fails to initialize for the app
[ https://issues.apache.org/jira/browse/YARN-4096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735537#comment-14735537 ] Hudson commented on YARN-4096: -- FAILURE: Integrated in Hadoop-Yarn-trunk #1095 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/1095/]) YARN-4096. App local logs are leaked if log aggregation fails to initialize for the app. Contributed by Jason Lowe. (zxu: rev 16b9037dc1300b8bdbe54ba7cd47c53fe16e93d8) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregator.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java > App local logs are leaked if log aggregation fails to initialize for the app > > > Key: YARN-4096 > URL: https://issues.apache.org/jira/browse/YARN-4096 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation, nodemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Fix For: 2.7.2 > > Attachments: YARN-4096.001.patch > > > If log aggregation fails to initialize for an application then the local logs > will never be deleted. This is similar to YARN-3476 except this is a failure > when log aggregation tries to initialize the app-specific log aggregator > rather than a failure during a log upload. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4096) App local logs are leaked if log aggregation fails to initialize for the app
[ https://issues.apache.org/jira/browse/YARN-4096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735497#comment-14735497 ] Hudson commented on YARN-4096: -- FAILURE: Integrated in Hadoop-trunk-Commit #8416 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8416/]) YARN-4096. App local logs are leaked if log aggregation fails to initialize for the app. Contributed by Jason Lowe. (zxu: rev 16b9037dc1300b8bdbe54ba7cd47c53fe16e93d8) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregator.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java > App local logs are leaked if log aggregation fails to initialize for the app > > > Key: YARN-4096 > URL: https://issues.apache.org/jira/browse/YARN-4096 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation, nodemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Fix For: 2.7.2 > > Attachments: YARN-4096.001.patch > > > If log aggregation fails to initialize for an application then the local logs > will never be deleted. This is similar to YARN-3476 except this is a failure > when log aggregation tries to initialize the app-specific log aggregator > rather than a failure during a log upload. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4096) App local logs are leaked if log aggregation fails to initialize for the app
[ https://issues.apache.org/jira/browse/YARN-4096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735490#comment-14735490 ] zhihai xu commented on YARN-4096: - thanks Jason for the contribution! Committed it to branch-2.7.2, branch-2 and trunk. > App local logs are leaked if log aggregation fails to initialize for the app > > > Key: YARN-4096 > URL: https://issues.apache.org/jira/browse/YARN-4096 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation, nodemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-4096.001.patch > > > If log aggregation fails to initialize for an application then the local logs > will never be deleted. This is similar to YARN-3476 except this is a failure > when log aggregation tries to initialize the app-specific log aggregator > rather than a failure during a log upload. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4096) App local logs are leaked if log aggregation fails to initialize for the app
[ https://issues.apache.org/jira/browse/YARN-4096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4096: Hadoop Flags: Reviewed > App local logs are leaked if log aggregation fails to initialize for the app > > > Key: YARN-4096 > URL: https://issues.apache.org/jira/browse/YARN-4096 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation, nodemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-4096.001.patch > > > If log aggregation fails to initialize for an application then the local logs > will never be deleted. This is similar to YARN-3476 except this is a failure > when log aggregation tries to initialize the app-specific log aggregator > rather than a failure during a log upload. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4096) App local logs are leaked if log aggregation fails to initialize for the app
[ https://issues.apache.org/jira/browse/YARN-4096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735489#comment-14735489 ] Hudson commented on YARN-4096: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #364 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/364/]) YARN-4096. App local logs are leaked if log aggregation fails to initialize for the app. Contributed by Jason Lowe. (zxu: rev 16b9037dc1300b8bdbe54ba7cd47c53fe16e93d8) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregator.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java > App local logs are leaked if log aggregation fails to initialize for the app > > > Key: YARN-4096 > URL: https://issues.apache.org/jira/browse/YARN-4096 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation, nodemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-4096.001.patch > > > If log aggregation fails to initialize for an application then the local logs > will never be deleted. This is similar to YARN-3476 except this is a failure > when log aggregation tries to initialize the app-specific log aggregator > rather than a failure during a log upload. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735479#comment-14735479 ] Sangjin Lee commented on YARN-3901: --- I think something like the following would work: {code} 210 long currentMinValue = ((Number) GenericObjectMapper.read(CellUtil 211 .cloneValue(currentMinCell))).longValue(); 212 long currentCellValue = ((Number) GenericObjectMapper.read(CellUtil 213 .cloneValue(cell))).longValue(); {code} bq. I am thinking I will need this when the flush/compaction scanner is added in. If you'd like, I can move it in as a non-public class for now and then move it out if needed. +1. bq. I actually needed this in the unit test while checking the FlowActivityTable contents, if you want I can take it out and you can add that test case in when you add in the RowKey changes? If it is to help your unit test, it's fine to include it here (as long as it's identical to what we have in YARN-4074; that would help my rebasing). bq. Yeah, I was thinking about that too. Right now, metrics will get their own timestamps. For other columns, we'd be using the nanoseconds. I am trying to see if we can just use milliseconds. We do need the timestamps that are generated here to be in nanoseconds as they are multiplied by the factor of 1 million in {{TimestampGenerator}}. They cannot be converted to milliseconds, or it would defeat the purpose of using {{TimestampGenerator}}. The comment was about the concern of always being able to distinguish these two types of "timestamps" without confusion. 
> Populate flow run data in the flow_run & flow activity tables > - > > Key: YARN-3901 > URL: https://issues.apache.org/jira/browse/YARN-3901 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Vrushali C >Assignee: Vrushali C > Attachments: YARN-3901-YARN-2928.1.patch, > YARN-3901-YARN-2928.2.patch, YARN-3901-YARN-2928.3.patch, > YARN-3901-YARN-2928.4.patch > > > As per the schema proposed in YARN-3815 in > https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf > filing jira to track creation and population of data in the flow run table. > Some points that are being considered: > - Stores per flow run information aggregated across applications, flow version > RM’s collector writes to on app creation and app completion > - Per App collector writes to it for metric updates at a slower frequency > than the metric updates to application table > primary key: cluster ! user ! flow ! flow run id > - Only the latest version of flow-level aggregated metrics will be kept, even > if the entity and application level keep a timeseries. > - The running_apps column will be incremented on app creation, and > decremented on app completion. > - For min_start_time the RM writer will simply write a value with the tag for > the applicationId. A coprocessor will return the min value of all written > values. - > - Upon flush and compactions, the min value between all the cells of this > column will be written to the cell without any tag (empty tag) and all the > other cells will be discarded. > - Ditto for the max_end_time, but then the max will be kept. > - Tags are represented as #type:value. The type can be not set (0), or can > indicate running (1) or complete (2). In those cases (for metrics) only > complete app metrics are collapsed on compaction. > - The m! values are aggregated (summed) upon read. Only when applications are > completed (indicated by tag type 2) can the values be collapsed. 
> - The application ids that have completed and been aggregated into the flow > numbers are retained in a separate column for historical tracking: we don’t > want to re-aggregate for those upon replay > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
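The min/max discussion above hinges on {{GenericObjectMapper}} possibly returning an {{Integer}} or a {{Long}} depending on magnitude, while {{java.lang.Number}} itself cannot be compared with {{<}}. A minimal self-contained sketch of the widening-to-long comparison (plain Java, no HBase types; names are illustrative, not the actual coprocessor code):

```java
public class MinMaxCells {
    // Sketch only: the Number arguments stand in for values deserialized
    // by GenericObjectMapper, which may be Integer or Long depending on
    // magnitude. Widening both sides to long makes them comparable.
    static Number min(Number a, Number b) {
        return a.longValue() <= b.longValue() ? a : b;
    }

    public static void main(String[] args) {
        Number currentMin = Integer.valueOf(42);          // fits in an int
        Number candidate = Long.valueOf(7_000_000_000L);  // needs a long
        System.out.println(min(currentMin, candidate));   // prints 42
    }
}
```

This keeps the stored value as a {{Number}} (as suggested in the review) while doing the comparison on the widened primitives.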
[jira] [Commented] (YARN-3635) Get-queue-mapping should be a common interface of YarnScheduler
[ https://issues.apache.org/jira/browse/YARN-3635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735471#comment-14735471 ] Hadoop QA commented on YARN-3635: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 40s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 3 new or modified test files. | | {color:green}+1{color} | javac | 7m 51s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 8s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 52s | The applied patch generated 14 new checkstyle issues (total was 236, now 242). | | {color:red}-1{color} | whitespace | 0m 3s | The patch has 15 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 31s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 29s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 54m 13s | Tests passed in hadoop-yarn-server-resourcemanager. 
| | | | 93m 49s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12754296/YARN-3635.7.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 970daaa | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/9040/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/9040/artifact/patchprocess/whitespace.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9040/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9040/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9040/console | This message was automatically generated. > Get-queue-mapping should be a common interface of YarnScheduler > --- > > Key: YARN-3635 > URL: https://issues.apache.org/jira/browse/YARN-3635 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Reporter: Wangda Tan >Assignee: Tan, Wangda > Attachments: YARN-3635.1.patch, YARN-3635.2.patch, YARN-3635.3.patch, > YARN-3635.4.patch, YARN-3635.5.patch, YARN-3635.6.patch, YARN-3635.7.patch > > > Currently, both the fair and capacity schedulers support queue mapping, which > lets the scheduler change the queue of an application after it is submitted. > One issue with doing this in a specific scheduler is: if the queue after mapping > has a different maximum_allocation/default-node-label-expression from the > original queue, {{validateAndCreateResourceRequest}} in RMAppManager checks > the wrong queue. 
> I propose to make queue mapping a common interface of the scheduler, and have > RMAppManager set the queue after mapping before doing validations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4126) RM should not issue delegation tokens in unsecure mode
[ https://issues.apache.org/jira/browse/YARN-4126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-4126: --- Attachment: 0003-YARN-4126.patch Hi [~jianhe] Attaching a patch after updating the test cases. {{TestRMWebServicesDelegationTokens}} hasn't been corrected yet. In non-secure mode, what should the behaviour of {{RMWebServicesDelegationTokens}} be? Currently it returns a {{500 Internal Error}}. > RM should not issue delegation tokens in unsecure mode > -- > > Key: YARN-4126 > URL: https://issues.apache.org/jira/browse/YARN-4126 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4126.patch, 0002-YARN-4126.patch, > 0003-YARN-4126.patch > > > ClientRMService#getDelegationToken is currently returning a delegation token > in insecure mode. We should not return the token if it's in insecure mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
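A hedged sketch of the guard under discussion: refuse to issue a delegation token when security is off. The boolean parameter stands in for {{UserGroupInformation.isSecurityEnabled()}}, and the exception type and message are assumptions for illustration, not the actual patch:

```java
public class TokenGuard {
    // Hypothetical guard: in non-secure mode the RM rejects the request
    // instead of minting a token. "securityEnabled" stands in for
    // UserGroupInformation.isSecurityEnabled().
    static String getDelegationToken(boolean securityEnabled) {
        if (!securityEnabled) {
            throw new IllegalStateException(
                "Delegation tokens can only be issued when security is enabled");
        }
        return "token"; // placeholder for a real delegation token
    }

    public static void main(String[] args) {
        try {
            getDelegationToken(false);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Whether the REST layer should surface this as a 4xx rather than a {{500 Internal Error}} is exactly the open question in the comment above.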
[jira] [Updated] (YARN-4132) Nodemanagers should try harder to connect to the RM
[ https://issues.apache.org/jira/browse/YARN-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated YARN-4132: --- Attachment: YARN-4132.patch > Nodemanagers should try harder to connect to the RM > --- > > Key: YARN-4132 > URL: https://issues.apache.org/jira/browse/YARN-4132 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4132.patch > > > Being part of the cluster, nodemanagers should try very hard (and possibly > never give up) to connect to a resourcemanager. Minimally we should have a > separate config to set how aggressively a nodemanager will connect to the RM > separate from what clients will do. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4132) Nodemanagers should try harder to connect to the RM
Chang Li created YARN-4132: -- Summary: Nodemanagers should try harder to connect to the RM Key: YARN-4132 URL: https://issues.apache.org/jira/browse/YARN-4132 Project: Hadoop YARN Issue Type: Bug Reporter: Chang Li Assignee: Chang Li Being part of the cluster, nodemanagers should try very hard (and possibly never give up) to connect to a resourcemanager. Minimally we should have a separate config to set how aggressively a nodemanager will connect to the RM separate from what clients will do. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
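One way for a nodemanager to "try harder" is a capped exponential backoff that never gives up, as opposed to the bounded retry suitable for clients. A rough sketch of that policy; the numbers and method names are illustrative placeholders, not actual YARN configuration or code:

```java
public class RmConnectRetry {
    // Illustrative backoff policy: double the wait on each failed
    // connection attempt, but never exceed the cap, and (for an NM)
    // never stop retrying. Values are placeholders, not YARN defaults.
    static long nextBackoffMs(long currentMs, long capMs) {
        return Math.min(currentMs * 2, capMs);
    }

    public static void main(String[] args) {
        long wait = 1_000L, cap = 30_000L;
        for (int attempt = 1; attempt <= 6; attempt++) {
            System.out.println("attempt " + attempt + ": wait " + wait + " ms");
            wait = nextBackoffMs(wait, cap);
        }
        // per-attempt waits: 1000, 2000, 4000, 8000, 16000, 30000 (capped)
    }
}
```

A separate NM-side config, as the issue proposes, would let the cap and give-up behaviour differ from whatever clients use.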
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735452#comment-14735452 ] Vrushali C commented on YARN-3901: -- Thanks [~sjlee0] for the review! I will correct the variable ordering for static and private members as well as make the variables final. bq. l.210: Strictly speaking, GenericObjectMapper will return an integer if the value fits within an integer; so it's not exactly a concern for min/max (timestamps) but for caution we might want to stay with Number instead of long Comparisons are not allowed on the Number datatype: {code} The operator < is undefined for the argument type(s) java.lang.Number, java.lang.Number {code} So I would have to do something like {code} Number d = a.longValue() + b.longValue(); {code} Do you think this is better? bq. l.52: Is the TimestampGenerator class going to be used outside FlowRunCoprocessor? If not, I would argue that we should make it an inner class of FlowRunCoprocessor. At least we should make it non-public to keep it within the package. If it would see general use outside this class, then it might be better to make it a true public class in the common package. I suspect a non-public class might be what we want here. I am thinking I will need this when the flush/compaction scanner is added in. If you'd like, I can move it in as a non-public class for now and then move it out if needed. bq. It's up to you, but you could leave the row key improvement to YARN-4074. That might be easier to manage the changes between yours and mine. I'm restructuring all *RowKey classes uniformly. I actually needed this in the unit test while checking the FlowActivityTable contents; if you want, I can take it out and you can add that test case when you add the RowKey changes. bq. l.144: This would mean that some cell timestamps would have the unit of the milliseconds and others would be in nanoseconds. 
I'm a little bit concerned if we ever interpret these timestamps incorrectly. Could there be a more explicit way of clearly differentiating them? I don't have good suggestions at the moment. Yeah, I was thinking about that too. Right now, metrics will get their own timestamps. For other columns, we'd be using nanoseconds. I am trying to see if we can just use milliseconds. bq. it might be good to have short comments on what each method is testing I did try to make the unit test names themselves descriptive, like testFlowActivityTable or testWriteFlowRunMinMaxToHBase or testWriteFlowRunMetricsOneFlow or testWriteFlowActivityOneFlow, but I agree that more comments in the unit tests will surely help. Will upload a new patch shortly, thanks! > Populate flow run data in the flow_run & flow activity tables > - > > Key: YARN-3901 > URL: https://issues.apache.org/jira/browse/YARN-3901 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Vrushali C >Assignee: Vrushali C > Attachments: YARN-3901-YARN-2928.1.patch, > YARN-3901-YARN-2928.2.patch, YARN-3901-YARN-2928.3.patch, > YARN-3901-YARN-2928.4.patch > > > As per the schema proposed in YARN-3815 in > https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf > filing jira to track creation and population of data in the flow run table. > Some points that are being considered: > - Stores per flow run information aggregated across applications, flow version > RM’s collector writes to on app creation and app completion > - Per App collector writes to it for metric updates at a slower frequency > than the metric updates to application table > primary key: cluster ! user ! flow ! flow run id > - Only the latest version of flow-level aggregated metrics will be kept, even > if the entity and application level keep a timeseries. > - The running_apps column will be incremented on app creation, and > decremented on app completion. > - For min_start_time the RM writer will simply write a value with the tag for > the applicationId. A coprocessor will return the min value of all written > values. 
> - For min_start_time the RM writer will simply write a value with the tag for > the applicationId. A coprocessor will return the min value of all written > values. - > - Upon flush and compactions, the min value between all the cells of this > column will be written to the cell without any tag (empty tag) and all the > other cells will be discarded. > - Ditto for the max_end_time, but then the max will be kept. > - Tags are represented as #type:value. The type can be not set (0), or can > indicate running (1) or complete (2). In those cases (for metrics) only > complete app metrics are collapsed on compaction. > - The m! values are aggregated (summed) upon read. Only when applications are > completed (indicated by tag type 2) can the values be collapsed. > - The application ids that have completed and been aggreg
[jira] [Commented] (YARN-4096) App local logs are leaked if log aggregation fails to initialize for the app
[ https://issues.apache.org/jira/browse/YARN-4096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735448#comment-14735448 ] zhihai xu commented on YARN-4096: - +1. Committing it in. > App local logs are leaked if log aggregation fails to initialize for the app > > > Key: YARN-4096 > URL: https://issues.apache.org/jira/browse/YARN-4096 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation, nodemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-4096.001.patch > > > If log aggregation fails to initialize for an application then the local logs > will never be deleted. This is similar to YARN-3476 except this is a failure > when log aggregation tries to initialize the app-specific log aggregator > rather than a failure during a log upload. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735384#comment-14735384 ] Sangjin Lee commented on YARN-3901: --- Thanks for the updated patch [~vrushalic]! I went over the new patch, and the following is the quick feedback. I'll also apply it with YARN-4074, and test it a little more. (HBaseTimelineWriterImpl.java) - l.141-155: the whole thing could be inside {{if (isApplication)...}} - l.264: this null check is not needed (FlowRunCoprocessor.java) - l.52: Is the {{TimestampGenerator}} class going to be used outside {{FlowRunCoprocessor}}? If not, I would argue that we should make it an inner class of {{FlowRunCoprocessor}}. At least we should make it non-public to keep it within the package. If it would see general use outside this class, then it might be better to make it a true public class in the common package. I suspect a non-public class might be what we want here. - l.52: let's make it final - l.54: style nit: I think the common style is to place the static variables before instance variables - Also, overall it seems we're using both the diamond operator (<>) and the old style generic declaration. It might be good to stick with one style (in which case the diamond operator might be better). - l.144: This would mean that some cell timestamps would have the unit of the milliseconds and others would be in nanoseconds. I'm a little bit concerned if we ever interpret these timestamps incorrectly. Could there be a more explicit way of clearly differentiating them? I don't have good suggestions at the moment. (FlowScanner.java) - variable ordering - l.210: Strictly speaking, {{GenericObjectMapper}} will return an integer if the value fits within an integer; so it's not exactly a concern for min/max (timestamps) but for caution we might want to stay with {{Number}} instead of long. 
(TimestampGenerator.java) - l.29: make it final - variable ordering - see above for the public/non-public comment (FlowActivityRowKey.java) - It's up to you, but you could leave the row key improvement to YARN-4074. That might be easier to manage the changes between yours and mine. I'm restructuring all *RowKey classes uniformly. (TestHBaseTimelineWriterImplFlowRun.java) - it might be good to have short comments on what each method is testing > Populate flow run data in the flow_run & flow activity tables > - > > Key: YARN-3901 > URL: https://issues.apache.org/jira/browse/YARN-3901 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Vrushali C >Assignee: Vrushali C > Attachments: YARN-3901-YARN-2928.1.patch, > YARN-3901-YARN-2928.2.patch, YARN-3901-YARN-2928.3.patch, > YARN-3901-YARN-2928.4.patch > > > As per the schema proposed in YARN-3815 in > https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf > filing jira to track creation and population of data in the flow run table. > Some points that are being considered: > - Stores per flow run information aggregated across applications, flow version > RM’s collector writes to on app creation and app completion > - Per App collector writes to it for metric updates at a slower frequency > than the metric updates to application table > primary key: cluster ! user ! flow ! flow run id > - Only the latest version of flow-level aggregated metrics will be kept, even > if the entity and application level keep a timeseries. > - The running_apps column will be incremented on app creation, and > decremented on app completion. > - For min_start_time the RM writer will simply write a value with the tag for > the applicationId. A coprocessor will return the min value of all written > values. 
- > - Upon flush and compactions, the min value between all the cells of this > column will be written to the cell without any tag (empty tag) and all the > other cells will be discarded. > - Ditto for the max_end_time, but then the max will be kept. > - Tags are represented as #type:value. The type can be not set (0), or can > indicate running (1) or complete (2). In those cases (for metrics) only > complete app metrics are collapsed on compaction. > - The m! values are aggregated (summed) upon read. Only when applications are > completed (indicated by tag type 2) can the values be collapsed. > - The application ids that have completed and been aggregated into the flow > numbers are retained in a separate column for historical tracking: we don’t > want to re-aggregate for those upon replay > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
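The nanosecond concern raised in the review above comes from {{TimestampGenerator}} multiplying wall-clock milliseconds by a factor of one million, leaving room to hand out up to a million distinct cell timestamps within one millisecond. A rough, self-contained sketch of that idea (an illustration of the technique, not the actual implementation):

```java
import java.util.concurrent.atomic.AtomicLong;

public class UniqueTimestamps {
    // Sketch of the idea behind TimestampGenerator: scale milliseconds
    // by 1,000,000 so many distinct, strictly increasing timestamps can
    // be issued within the same wall-clock millisecond.
    static final long TS_MULTIPLIER = 1_000_000L;
    private final AtomicLong last = new AtomicLong(0L);

    long next() {
        long base = System.currentTimeMillis() * TS_MULTIPLIER;
        // Monotonic: bump by one if the clock has not moved past the
        // last value handed out.
        return last.updateAndGet(prev -> Math.max(prev + 1, base));
    }

    public static void main(String[] args) {
        UniqueTimestamps gen = new UniqueTimestamps();
        long a = gen.next();
        long b = gen.next();
        System.out.println(b > a); // strictly increasing, even in one ms
    }
}
```

This also makes the reviewer's worry concrete: cells stamped this way cannot be converted back to milliseconds without dividing by the multiplier, so mixing them with plain-millisecond timestamps invites misinterpretation.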
[jira] [Commented] (YARN-4131) Add API and CLI to kill container on given containerId
[ https://issues.apache.org/jira/browse/YARN-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735349#comment-14735349 ] Junping Du commented on YARN-4131: -- bq. For exit codes, I'd like to be able to have the AM think that the container crashed, was pre-empted or went OOM, so we can test the different codepaths. Oh. I see. I think we can add an enum in KillContainerRequest and pass it to the RM. To keep it simple, perhaps the CLI should support only one option (preempted)? > Add API and CLI to kill container on given containerId > -- > > Key: YARN-4131 > URL: https://issues.apache.org/jira/browse/YARN-4131 > Project: Hadoop YARN > Issue Type: Sub-task > Components: applications, client >Reporter: Junping Du >Assignee: Junping Du > Attachments: YARN-4131-demo-2.patch, YARN-4131-demo.patch > > > Per YARN-3337, we need a handy tool to kill containers in some scenarios. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
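The enum being proposed could look roughly like the following. The constant names and exit codes here are illustrative assumptions, not the actual {{KillContainerRequest}} API:

```java
public class KillReason {
    // Hypothetical enum for KillContainerRequest; names and codes are
    // assumptions for illustration only.
    enum ContainerKillReason {
        PREEMPTED(-102),   // AM should treat the container as preempted
        CRASHED(1),        // simulate an abnormal process exit
        OOM(137);          // simulate a kill by the OOM killer (128 + SIGKILL)

        final int exitCode;
        ContainerKillReason(int exitCode) { this.exitCode = exitCode; }
    }

    public static void main(String[] args) {
        // A simple CLI might support only PREEMPTED to start with,
        // as suggested in the comment above.
        ContainerKillReason reason = ContainerKillReason.PREEMPTED;
        System.out.println(reason + " -> exit code " + reason.exitCode);
    }
}
```

The RM would map the chosen reason to the exit status it reports back to the AM, letting tests exercise the crash, preemption, and OOM codepaths separately.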
[jira] [Commented] (YARN-4086) Allow Aggregated Log readers to handle HAR files
[ https://issues.apache.org/jira/browse/YARN-4086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735347#comment-14735347 ] Arun Suresh commented on YARN-4086: --- Thanks for the patch, Robert. It looks good save for a minor nit: * In 'testFetchApplictionLogsHar', when asserting the contents of the sysoutStream, you may want to just check whether the output contains some important/relevant strings rather than matching the whole output; otherwise the test case would end up quite brittle and require constant changes (especially if the output format changes). +1 post Jenkins > Allow Aggregated Log readers to handle HAR files > > > Key: YARN-4086 > URL: https://issues.apache.org/jira/browse/YARN-4086 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.8.0 >Reporter: Robert Kanter >Assignee: Robert Kanter > Attachments: YARN-4086.001.patch > > > This is for the YARN changes for MAPREDUCE-6415. It allows the yarn CLI and > web UIs to read aggregated logs from HAR files. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4131) Add API and CLI to kill container on given containerId
[ https://issues.apache.org/jira/browse/YARN-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735301#comment-14735301 ] Steve Loughran commented on YARN-4131: -- I was thinking I need even less than what you'd done. I just want to kill a container and wait for the AM to react, or kill the AM and wait for it to restart: no need for synchronous operations. For exit codes, I'd like to be able to have the AM think that the container crashed, was pre-empted or went OOM, so we can test the different codepaths. > Add API and CLI to kill container on given containerId > -- > > Key: YARN-4131 > URL: https://issues.apache.org/jira/browse/YARN-4131 > Project: Hadoop YARN > Issue Type: Sub-task > Components: applications, client >Reporter: Junping Du >Assignee: Junping Du > Attachments: YARN-4131-demo-2.patch, YARN-4131-demo.patch > > > Per YARN-3337, we need a handy tool to kill containers in some scenarios. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4131) Add API and CLI to kill container on given containerId
[ https://issues.apache.org/jira/browse/YARN-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735278#comment-14735278 ] Hadoop QA commented on YARN-4131: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 21m 21s | Findbugs (version 3.0.0) appears to be broken on trunk. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 3 new or modified test files. | | {color:red}-1{color} | javac | 4m 51s | The patch appears to cause the build to fail. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12754675/YARN-4131-demo-2.patch | | Optional Tests | javac unit findbugs checkstyle javadoc | | git revision | trunk / 970daaa | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9039/console | This message was automatically generated. > Add API and CLI to kill container on given containerId > -- > > Key: YARN-4131 > URL: https://issues.apache.org/jira/browse/YARN-4131 > Project: Hadoop YARN > Issue Type: Sub-task > Components: applications, client >Reporter: Junping Du >Assignee: Junping Du > Attachments: YARN-4131-demo-2.patch, YARN-4131-demo.patch > > > Per YARN-3337, we need a handy tool to kill containers in some scenarios. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity
[ https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735266#comment-14735266 ] Wangda Tan commented on YARN-4091: -- Thanks [~sunilg]. I can understand why you have this proposal, but I'm not sure if your approach works in the following scenarios. I feel an over-all state of an app plus a last-container-assignment state may not work well for them: - App wants only a small proportion of a cluster (such as hard locality) - Similar to the above, app wants to run on a specific partition only - App's leaf queue or parent queue is beyond its limit - App asks for mappers in one partition (A) and reducers in another partition (B), when A has little available resource and B has more available resource. The user wants to see why mapper allocation is slow. Also, we cannot get the order of allocation with your approach, which is an important thing to look at when we enable fairness/priority scheduling for apps. > Improvement: Introduce more debug/diagnostics information to detail out > scheduler activity > -- > > Key: YARN-4091 > URL: https://issues.apache.org/jira/browse/YARN-4091 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler, resourcemanager >Affects Versions: 2.7.0 >Reporter: Sunil G >Assignee: Sunil G > Attachments: Improvement on debugdiagnostic information - YARN.pdf > > > As schedulers are improved with various new capabilities, more configurations > that tune the schedulers start to take actions such as limiting container > assignment to an application, or introducing delays to container allocation. > There is no clear information passed down from the scheduler to the outer world under > these various scenarios, which makes debugging much tougher. > This ticket is an effort to introduce more defined states at the various points in the > scheduler where it skips/rejects container assignment, activates an application, etc.
Such information will help users know what's happening in the scheduler. > Attaching a short proposal for initial discussion. We would like to improve > on this as we discuss. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4126) RM should not issue delegation tokens in unsecure mode
[ https://issues.apache.org/jira/browse/YARN-4126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735252#comment-14735252 ] Allen Wittenauer commented on YARN-4126: That sounds like a bigger set of bugs than not issuing delegation tokens > RM should not issue delegation tokens in unsecure mode > -- > > Key: YARN-4126 > URL: https://issues.apache.org/jira/browse/YARN-4126 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4126.patch, 0002-YARN-4126.patch > > > ClientRMService#getDelegationToken is currently returning a delegation token > in insecure mode. We should not return the token if it's in insecure mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4131) Add API and CLI to kill container on given containerId
[ https://issues.apache.org/jira/browse/YARN-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735238#comment-14735238 ] Junping Du commented on YARN-4131: -- bq. I'd actually leave out the waiting for the operation to complete: make it fully async and let caller wait if they want to. The logic in YarnClientImpl should demonstrate the async way of consuming this API. It basically calls killContainer() and loops to check the return code (true means the container is still active, so the kill event got sent). Is there anything else to do, like putting some explicit async tag on this API? bq. is there any way to set the exit code? I'd like to signal pre-emption and out of memory events at some point. Do you mean how we can know if the container got killed successfully? Basically two ways: one, as mentioned above, is that killContainer() returning false means the container is gone; the other is to call getContainerReport() or getContainers() in ApplicationBaseProtocol, which return active containers only. > Add API and CLI to kill container on given containerId > -- > > Key: YARN-4131 > URL: https://issues.apache.org/jira/browse/YARN-4131 > Project: Hadoop YARN > Issue Type: Sub-task > Components: applications, client >Reporter: Junping Du >Assignee: Junping Du > Attachments: YARN-4131-demo-2.patch, YARN-4131-demo.patch > > > Per YARN-3337, we need a handy tool to kill containers in some scenarios. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
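The kill-and-poll pattern Junping Du describes can be sketched as follows. This is an illustration only: `KillPoller` and its methods are hypothetical stand-ins for the proposed YarnClient API (which exists only in the demo patches), with the NM-side kill simulated by a countdown.

```java
// Sketch of the async kill-and-poll consumption pattern described above.
// killContainer() returning true means the container is still active (a kill
// event was sent); false means the container is already gone.
public class KillPoller {
    private int heartbeatsUntilDead; // simulates the NM processing the kill event

    public KillPoller(int heartbeatsUntilDead) {
        this.heartbeatsUntilDead = heartbeatsUntilDead;
    }

    // Hypothetical stand-in for the proposed YarnClient#killContainer(ContainerId).
    public boolean killContainer() {
        if (heartbeatsUntilDead > 0) {
            heartbeatsUntilDead--; // kill event sent, container still active
            return true;
        }
        return false;              // container already exited
    }

    // Caller-side wait: the API stays fully async, and callers who care poll
    // until the container is gone. Returns the number of polls it took.
    public int waitForKill(int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (!killContainer()) {
                return attempt;    // container confirmed gone
            }
            // a real caller would back off here, e.g. Thread.sleep(...)
        }
        throw new IllegalStateException(
            "container still alive after " + maxAttempts + " polls");
    }

    public static void main(String[] args) {
        // container dies after the NM has seen the kill event 3 times
        System.out.println(new KillPoller(3).waitForKill(10)); // 4
    }
}
```

The same shape would cover Steve's chaos-monkey use case: fire the kill, then poll `getContainerReport()`/`getContainers()` instead if a richer check is wanted.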
[jira] [Commented] (YARN-4113) RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER
[ https://issues.apache.org/jira/browse/YARN-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735215#comment-14735215 ] Wangda Tan commented on YARN-4113: -- Created HADOOP-12386 to track the RETRY_FOREVER changes. > RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER > -- > > Key: YARN-4113 > URL: https://issues.apache.org/jira/browse/YARN-4113 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wangda Tan >Assignee: Sunil G >Priority: Critical > > Found one issue in how RMProxy initializes its RetryPolicy, in > RMProxy#createRetryPolicy: when rmConnectWaitMS is set to -1 (wait forever), > it uses RetryPolicies.RETRY_FOREVER, which doesn't respect the > {{yarn.resourcemanager.connect.retry-interval.ms}} setting. > RetryPolicies.RETRY_FOREVER uses 0 as the interval; when I ran the test > {{TestYarnClient#testShouldNotRetryForeverForNonNetworkExceptions}} without a properly > set up localhost name, it wrote 14 GB of DEBUG exception messages before it died. This would be very bad > if the same thing happened in a production cluster. > We should fix two places: > - Make RETRY_FOREVER take the retry interval as a constructor parameter. > - Respect the retry interval when we use the RETRY_FOREVER policy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
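The proposed fix can be sketched as a retry-forever policy that carries the configured interval instead of hard-coding 0. This is a simplified model for illustration, not the actual Hadoop RetryPolicies code (the real change is tracked in HADOOP-12386); the class and method names here are hypothetical.

```java
// Simplified model: a retry-forever policy that respects a configured
// retry interval. The old RETRY_FOREVER effectively returned 0 here,
// producing a tight retry loop and the flood of DEBUG output described above.
public class RetryForeverPolicy {
    private final long retryIntervalMs;

    public RetryForeverPolicy(long retryIntervalMs) {
        // interval taken as a constructor parameter, per the proposed fix
        this.retryIntervalMs = retryIntervalMs;
    }

    // Mirrors the shape of a retry-policy decision: always retry, but tell
    // the caller how long to sleep before the next attempt.
    public long sleepBeforeNextRetryMs(int failedAttempts) {
        return retryIntervalMs;
    }

    public static void main(String[] args) {
        // e.g. yarn.resourcemanager.connect.retry-interval.ms = 30000
        RetryForeverPolicy policy = new RetryForeverPolicy(30000L);
        System.out.println(policy.sleepBeforeNextRetryMs(5)); // 30000, not 0
    }
}
```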
[jira] [Updated] (YARN-4131) Add API and CLI to kill container on given containerId
[ https://issues.apache.org/jira/browse/YARN-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-4131: - Attachment: YARN-4131-demo-2.patch Sounds like a new class "MockResourceManagerFacade" was recently introduced on trunk, which caused the previous patch's build failure. demo-2 should fix it. > Add API and CLI to kill container on given containerId > -- > > Key: YARN-4131 > URL: https://issues.apache.org/jira/browse/YARN-4131 > Project: Hadoop YARN > Issue Type: Sub-task > Components: applications, client >Reporter: Junping Du >Assignee: Junping Du > Attachments: YARN-4131-demo-2.patch, YARN-4131-demo.patch > > > Per YARN-3337, we need a handy tool to kill containers in some scenarios. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4106) NodeLabels for NM in distributed mode is not updated even after clusterNodelabel addition in RM
[ https://issues.apache.org/jira/browse/YARN-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735207#comment-14735207 ] Wangda Tan commented on YARN-4106: -- Thanks for the update [~bibinchundatt], and for the comments from [~Naganarasimha]. A few minor comments: 1) Make failLabelResendInterval final until we can configure it. 2) testConfigTimer's sleep time is too long. I don't know if a Clock can be used in the Timer; I think you can set NM_NODE_LABELS_PROVIDER_FETCH_INTERVAL_MS to a lower value, like 1000, and sleep 1500 ms. 3) With the changes in your patch, testNodeLabelsFromConfig doesn't need to sleep any more? > NodeLabels for NM in distributed mode is not updated even after > clusterNodelabel addition in RM > --- > > Key: YARN-4106 > URL: https://issues.apache.org/jira/browse/YARN-4106 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4106.patch, 0002-YARN-4106.patch, > 0003-YARN-4106.patch, 0004-YARN-4106.patch, 0005-YARN-4106.patch > > > NodeLabels for NM in distributed mode are not updated even after > clusterNodelabel addition in RM > Steps to reproduce > === > # Configure node labels in distributed mode: > yarn.node-labels.configuration-type=distributed > provider = config > yarn.nodemanager.node-labels.provider.fetch-interval-ms=12ms > # Start the RM and the NM > # Once NM registration is done, add node labels in the RM > Node labels are not getting updated on the RM side > *This jira also handles the below issue* > The timer task is not getting triggered in the Nodemanager for label updates in > distributed scheduling. The task is supposed to trigger every > {{yarn.nodemanager.node-labels.provider.fetch-interval-ms}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4131) Add API and CLI to kill container on given containerId
[ https://issues.apache.org/jira/browse/YARN-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735194#comment-14735194 ] Steve Loughran commented on YARN-4131: -- LGTM. # I'd actually leave out the waiting for the operation to complete: make it fully async and let callers wait if they want to. # Is there any way to set the exit code? I'd like to signal pre-emption and out-of-memory events at some point. > Add API and CLI to kill container on given containerId > -- > > Key: YARN-4131 > URL: https://issues.apache.org/jira/browse/YARN-4131 > Project: Hadoop YARN > Issue Type: Sub-task > Components: applications, client >Reporter: Junping Du >Assignee: Junping Du > Attachments: YARN-4131-demo.patch > > > Per YARN-3337, we need a handy tool to kill containers in some scenarios. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3337) Provide YARN chaos monkey
[ https://issues.apache.org/jira/browse/YARN-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735172#comment-14735172 ] Steve Loughran commented on YARN-3337: -- I'm happy with an operation that doesn't return whether or not the container gets killed... we could also declare that it's potentially async and that callers should poll for the container going away. > Provide YARN chaos monkey > - > > Key: YARN-3337 > URL: https://issues.apache.org/jira/browse/YARN-3337 > Project: Hadoop YARN > Issue Type: New Feature > Components: test >Affects Versions: 2.7.0 >Reporter: Steve Loughran > > To test failure resilience today you either need custom scripts or have to implement > Chaos Monkey-like logic in your application (SLIDER-202). > Killing AMs and containers on a schedule & probability is the core activity > here, one that could be handled by a CLI app/client lib. > # entry point to have a startup delay before acting > # frequency of chaos wakeup/polling > # probability of AM failure generation (0-100) > # probability of non-AM container kill > # future: other operations -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4126) RM should not issue delegation tokens in unsecure mode
[ https://issues.apache.org/jira/browse/YARN-4126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735170#comment-14735170 ] Jian He commented on YARN-4126: --- Yes, there is. For example, oozie grabs this token in insecure mode and passes it around, which actually breaks in some places. > RM should not issue delegation tokens in unsecure mode > -- > > Key: YARN-4126 > URL: https://issues.apache.org/jira/browse/YARN-4126 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4126.patch, 0002-YARN-4126.patch > > > ClientRMService#getDelegationToken is currently returning a delegation token > in insecure mode. We should not return the token if it's in insecure mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4106) NodeLabels for NM in distributed mode is not updated even after clusterNodelabel addition in RM
[ https://issues.apache.org/jira/browse/YARN-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-4106: - Priority: Major (was: Blocker) > NodeLabels for NM in distributed mode is not updated even after > clusterNodelabel addition in RM > --- > > Key: YARN-4106 > URL: https://issues.apache.org/jira/browse/YARN-4106 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4106.patch, 0002-YARN-4106.patch, > 0003-YARN-4106.patch, 0004-YARN-4106.patch, 0005-YARN-4106.patch > > > NodeLabels for NM in distributed mode is not updated even after > clusterNodelabel addition in RM > Steps to reproduce > === > # Configure nodelabel in distributed mode > yarn.node-labels.configuration-type=distributed > provider = config > yarn.nodemanager.node-labels.provider.fetch-interval-ms=12ms > # Start RM the NM > # Once NM is registration is done add nodelabels in RM > Nodelabels not getting updated in RM side > *This jira also handles the below issue too* > Timer Task not getting triggered in Nodemanager for Label update in > nodemanager for distributed scheduling > Task is supposed to trigger every > {{yarn.nodemanager.node-labels.provider.fetch-interval-ms}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4126) RM should not issue delegation tokens in unsecure mode
[ https://issues.apache.org/jira/browse/YARN-4126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735132#comment-14735132 ] Allen Wittenauer commented on YARN-4126: Is there any actual harm in returning a useless delegation token? I know on the HDFS side of the house, returning null tokens has been extremely beneficial in streamlining the code. > RM should not issue delegation tokens in unsecure mode > -- > > Key: YARN-4126 > URL: https://issues.apache.org/jira/browse/YARN-4126 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4126.patch, 0002-YARN-4126.patch > > > ClientRMService#getDelegationToken is currently returning a delegation token > in insecure mode. We should not return the token if it's in insecure mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.
[ https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735128#comment-14735128 ] Hadoop QA commented on YARN-3943: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 19m 3s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 3 new or modified test files. | | {color:green}+1{color} | javac | 7m 46s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 57s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 51s | The applied patch generated 1 new checkstyle issues (total was 211, now 211). | | {color:green}+1{color} | whitespace | 0m 2s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 29s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 4m 24s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 23s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 2m 0s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 7m 39s | Tests passed in hadoop-yarn-server-nodemanager. 
| | | | 56m 14s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12754660/YARN-3943.000.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 090d266 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/9037/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/9037/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/9037/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9037/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9037/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9037/console | This message was automatically generated. > Use separate threshold configurations for disk-full detection and > disk-not-full detection. > -- > > Key: YARN-3943 > URL: https://issues.apache.org/jira/browse/YARN-3943 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3943.000.patch > > > Use separate threshold configurations to check when disks become full and > when disks become good. Currently the configuration > "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage" > and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are > used to check both when disks become full and when disks become good. 
It will > be better to use two configurations: one is used when disks become full from > not-full and the other one is used when disks become not-full from full. So > we can avoid oscillating frequently. > For example: we can set the one for disk-full detection higher than the one > for disk-not-full detection. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
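The two-threshold idea amounts to a simple hysteresis check: a disk is marked full above the high watermark and only marked good again below the low watermark, so utilization hovering near a single threshold no longer flips the state back and forth. A minimal sketch, with illustrative threshold values rather than the actual YARN configuration keys:

```java
// Hysteresis for disk-full detection: separate thresholds for entering
// and leaving the "full" state, as proposed in YARN-3943.
public class DiskFullDetector {
    private final float fullThresholdPct;     // e.g. 90: mark full above this
    private final float notFullThresholdPct;  // e.g. 85: mark good below this
    private boolean full = false;

    public DiskFullDetector(float fullThresholdPct, float notFullThresholdPct) {
        this.fullThresholdPct = fullThresholdPct;
        this.notFullThresholdPct = notFullThresholdPct;
    }

    // Update state from the current utilization; returns whether the disk
    // is currently considered full.
    public boolean update(float utilizationPct) {
        if (!full && utilizationPct > fullThresholdPct) {
            full = true;   // crossed the high watermark: disk became full
        } else if (full && utilizationPct < notFullThresholdPct) {
            full = false;  // dropped below the low watermark: disk is good again
        }
        return full;
    }

    public static void main(String[] args) {
        DiskFullDetector d = new DiskFullDetector(90f, 85f);
        System.out.println(d.update(91f)); // true: disk became full
        System.out.println(d.update(88f)); // true: still full, no oscillation
        System.out.println(d.update(84f)); // false: disk is good again
    }
}
```

With a single 90% threshold, the 91 → 88 → 91 pattern would toggle the disk state on every check; with the gap between 90 and 85 it stays full until utilization genuinely recovers.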
[jira] [Commented] (YARN-4131) Add API and CLI to kill container on given containerId
[ https://issues.apache.org/jira/browse/YARN-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735127#comment-14735127 ] Hadoop QA commented on YARN-4131: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 20m 25s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 2 new or modified test files. | | {color:red}-1{color} | javac | 3m 5s | The patch appears to cause the build to fail. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12754664/YARN-4131-demo.patch | | Optional Tests | javac unit findbugs checkstyle javadoc | | git revision | trunk / 090d266 | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9038/console | This message was automatically generated. > Add API and CLI to kill container on given containerId > -- > > Key: YARN-4131 > URL: https://issues.apache.org/jira/browse/YARN-4131 > Project: Hadoop YARN > Issue Type: Sub-task > Components: applications, client >Reporter: Junping Du >Assignee: Junping Du > Attachments: YARN-4131-demo.patch > > > Per YARN-3337, we need a handy tools to kill container in some scenarios. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1651) CapacityScheduler side changes to support increase/decrease container resource.
[ https://issues.apache.org/jira/browse/YARN-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735070#comment-14735070 ] MENG DING commented on YARN-1651: - Hi, [~leftnoteasy] I am ok with most of the reply comments. Thanks. bq. It seems no need to do the null check here. When it becomes null? I prefer to keep it as-is and it will throw NPE if any fatal issue happens. The {{updateContainerAndNMToken}} may return null: {code} Container updatedContainer = updateContainerAndNMToken(rmContainer, false, increase); returnContainerList.add(updatedContainer); {code} I only mention this because {{pullNewlyAllocatedContainers()}} has a check for null for the same logic, so I think we may want to make it consistent? Some remaining comments: * As you mentioned in the code, currently reserved resource increase request does not participate in the continuous reservation looking logic. So, based on my understanding, if an application has reserved some resource for a container resource increase request on a node, that amount of resource should never be unreserved in order for the application to allocate a regular container on some other node. But that doesn't seem to be the case right now? Can you confirm? If so, I am thinking a simple solution would be to *exclude* resources reserved for increased containers when trying to find an unreserved container for regular container allocation. {code:title=RegularContainerAllocator.assignContainer} ... ... unreservedContainer = application.findNodeToUnreserve(clusterResource, node, priority, <= Don't consider resources reserved for container increase request resourceNeedToUnReserve); ... {code} * I think it will be desirable to implement a {{pendingDecrease}} set in {{SchedulerApplicationAttempt}}, and corresponding logic, just like {{SchedulerApplicationAttempt.pendingRelease}}. 
This is to guard against the situation *when decrease requests are received while the RM is in the middle of recovery and has not yet received all container statuses from the NM*. * Some nits ** The comments in {{NMReportedContainerChangeIsDoneTransition}} don't seem right. ** IncreaseContainerAllocator: {{LOG.debug(" Headroom is satisifed, skip..");}} --> {{LOG.debug(" Headroom is not satisfied, skip..");}} > CapacityScheduler side changes to support increase/decrease container > resource. > --- > > Key: YARN-1651 > URL: https://issues.apache.org/jira/browse/YARN-1651 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager, scheduler >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-1651-1.YARN-1197.patch, > YARN-1651-2.YARN-1197.patch, YARN-1651-3.YARN-1197.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
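The null-check consistency MENG DING asks for can be sketched as a guard before adding to the returned list. Everything here is a hypothetical stand-in: `updateContainer` plays the role of `updateContainerAndNMToken` (which can return null), and the guard mirrors the existing check in `pullNewlyAllocatedContainers()`.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: skip null results instead of adding them to the returned list,
// keeping the logic consistent with pullNewlyAllocatedContainers().
public class ContainerListBuilder {
    // Hypothetical stand-in for updateContainerAndNMToken(): returns null
    // when the container/token update could not be completed.
    static String updateContainer(String containerId, boolean ok) {
        return ok ? "updated-" + containerId : null;
    }

    public static List<String> pullUpdatedContainers() {
        List<String> result = new ArrayList<>();
        String a = updateContainer("c1", true);
        if (a != null) {       // guard before add, instead of risking an NPE later
            result.add(a);
        }
        String b = updateContainer("c2", false);
        if (b != null) {       // failed update: silently skipped, not added as null
            result.add(b);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(pullUpdatedContainers().size()); // 1
    }
}
```

Wangda's counter-position (let it NPE so a fatal bug is loud) is the other defensible choice; the point of the comment is only that the two call sites should pick one behavior.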
[jira] [Created] (YARN-4131) Add API and CLI to kill container on given containerId
Junping Du created YARN-4131: Summary: Add API and CLI to kill container on given containerId Key: YARN-4131 URL: https://issues.apache.org/jira/browse/YARN-4131 Project: Hadoop YARN Issue Type: Sub-task Components: applications, client Reporter: Junping Du Assignee: Junping Du Per YARN-3337, we need a handy tool to kill containers in some scenarios. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3337) Provide YARN chaos monkey
[ https://issues.apache.org/jira/browse/YARN-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735063#comment-14735063 ] Junping Du commented on YARN-3337: -- Put a demo patch on the sub-JIRA. [~ste...@apache.org], mind taking a quick look to see if this is what you have in mind? I can do more polish work on the patch later. > Provide YARN chaos monkey > - > > Key: YARN-3337 > URL: https://issues.apache.org/jira/browse/YARN-3337 > Project: Hadoop YARN > Issue Type: New Feature > Components: test >Affects Versions: 2.7.0 >Reporter: Steve Loughran > > To test failure resilience today you either need custom scripts or have to implement > Chaos Monkey-like logic in your application (SLIDER-202). > Killing AMs and containers on a schedule & probability is the core activity > here, one that could be handled by a CLI app/client lib. > # entry point to have a startup delay before acting > # frequency of chaos wakeup/polling > # probability of AM failure generation (0-100) > # probability of non-AM container kill > # future: other operations -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4131) Add API and CLI to kill container on given containerId
[ https://issues.apache.org/jira/browse/YARN-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-4131: - Attachment: YARN-4131-demo.patch Attaching a demo patch; more test work is still needed. > Add API and CLI to kill container on given containerId > -- > > Key: YARN-4131 > URL: https://issues.apache.org/jira/browse/YARN-4131 > Project: Hadoop YARN > Issue Type: Sub-task > Components: applications, client >Reporter: Junping Du >Assignee: Junping Du > Attachments: YARN-4131-demo.patch > > > Per YARN-3337, we need a handy tool to kill containers in some scenarios. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.
[ https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3943: Attachment: (was: YARN-3943.000.patch) > Use separate threshold configurations for disk-full detection and > disk-not-full detection. > -- > > Key: YARN-3943 > URL: https://issues.apache.org/jira/browse/YARN-3943 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3943.000.patch > > > Use separate threshold configurations to check when disks become full and > when disks become good. Currently the configuration > "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage" > and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are > used to check both when disks become full and when disks become good. It will > be better to use two configurations: one is used when disks become full from > not-full and the other one is used when disks become not-full from full. So > we can avoid oscillating frequently. > For example: we can set the one for disk-full detection higher than the one > for disk-not-full detection. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.
[ https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3943: Attachment: YARN-3943.000.patch > Use separate threshold configurations for disk-full detection and > disk-not-full detection. > -- > > Key: YARN-3943 > URL: https://issues.apache.org/jira/browse/YARN-3943 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3943.000.patch > > > Use separate threshold configurations to check when disks become full and > when disks become good. Currently the configuration > "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage" > and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are > used to check both when disks become full and when disks become good. It will > be better to use two configurations: one is used when disks become full from > not-full and the other one is used when disks become not-full from full. So > we can avoid oscillating frequently. > For example: we can set the one for disk-full detection higher than the one > for disk-not-full detection. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2884) Proxying all AM-RM communications
[ https://issues.apache.org/jira/browse/YARN-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734967#comment-14734967 ] Kishore Chaliparambil commented on YARN-2884: - Thanks [~subru] > Proxying all AM-RM communications > - > > Key: YARN-2884 > URL: https://issues.apache.org/jira/browse/YARN-2884 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Carlo Curino >Assignee: Kishore Chaliparambil > Fix For: 2.8.0 > > Attachments: YARN-2884-V1.patch, YARN-2884-V10.patch, > YARN-2884-V11.patch, YARN-2884-V12.patch, YARN-2884-V13.patch, > YARN-2884-V2.patch, YARN-2884-V3.patch, YARN-2884-V4.patch, > YARN-2884-V5.patch, YARN-2884-V6.patch, YARN-2884-V7.patch, > YARN-2884-V8.patch, YARN-2884-V9.patch > > > We introduce the notion of an RMProxy, running on each node (or once per > rack). Upon start, the AM is forced (via tokens and configuration) to direct > all its requests to a new service running on the NM that provides a proxy to > the central RM. > This gives us a place to: > 1) perform distributed scheduling decisions > 2) throttle mis-behaving AMs > 3) mask access to a federation of RMs -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2884) Proxying all AM-RM communications
[ https://issues.apache.org/jira/browse/YARN-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734966#comment-14734966 ] Kishore Chaliparambil commented on YARN-2884: - Thanks Jian! > Proxying all AM-RM communications > - > > Key: YARN-2884 > URL: https://issues.apache.org/jira/browse/YARN-2884 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Carlo Curino >Assignee: Kishore Chaliparambil > Fix For: 2.8.0 > > Attachments: YARN-2884-V1.patch, YARN-2884-V10.patch, > YARN-2884-V11.patch, YARN-2884-V12.patch, YARN-2884-V13.patch, > YARN-2884-V2.patch, YARN-2884-V3.patch, YARN-2884-V4.patch, > YARN-2884-V5.patch, YARN-2884-V6.patch, YARN-2884-V7.patch, > YARN-2884-V8.patch, YARN-2884-V9.patch > > > We introduce the notion of an RMProxy, running on each node (or once per > rack). Upon start, the AM is forced (via tokens and configuration) to direct > all its requests to a new service running on the NM that provides a proxy to > the central RM. > This gives us a place to: > 1) perform distributed scheduling decisions > 2) throttle mis-behaving AMs > 3) mask access to a federation of RMs -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors
[ https://issues.apache.org/jira/browse/YARN-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734931#comment-14734931 ] Jason Lowe commented on YARN-2410: -- Thanks for updating the patch! bq. The only reason was findbugs which does not allow more than 7 parameters in a function call Normally a builder pattern is used to make the code more readable in those situations. However I don't think we need more than 7. ReduceContext really only needs mapIds, reduceId, channelCtx, user, infoMap, and outputBasePathStr. The other two parameters are either known to be zero (should not be passed) and can be derived from another (size of mapIds). As such we don't need SendMapOutputParams. bq. The reduceContext is a variable holds the value set by the setAttachment() method and is used by the getAttachment() answer. If I declare it in the test method, it needs be final which cannot be done due to it being used by the setter. createMockChannel can simply have a ReduceContext parameter, marked final, and that should solve that problem. But I thought we were getting rid of the use of channel attachments and just associating the context with the listener directly? Related to the last comment, we're still using channel attachments. sendMap can just take a ReduceContext parameter, and the listener can provide its context when calling it. No need for channel attachments. This can NPE since we're checking for null after we already use it: {noformat} +nextMap = sendMapOutput( +reduceContext.getSendMapOutputParams().getCtx(), +reduceContext.getSendMapOutputParams().getCtx().getChannel(), +reduceContext.getSendMapOutputParams().getUser(), mapId, +reduceContext.getSendMapOutputParams().getReduceId(), info); +nextMap.addListener(new ReduceMapFileCount(reduceContext)); +if (null == nextMap) { {noformat} maxSendMapCount should be cached during serviceInit like the other conf-derived settings so we aren't doing conf lookups on every shuffle. 
The indentation in sendMap isn't correct: code after a conditional block is indented at the same level as the contents of the conditional block. There are other places that are over-indented.
MockShuffleHandler only needs to override one thing, getShuffle, but the mock that method returns has to override a bunch of stuff. It makes more sense to create a separate class for the mocked Shuffle than for the mocked ShuffleHandler.
Should the mock Future stuff be part of creating the mocked channel? We can simply pass the listener list to use as an argument to the method that mocks up the channel.
Nit: SHUFFLE_MAX_SEND_COUNT should probably be something like SHUFFLE_MAX_SESSION_OPEN_FILES to better match the property name. Similarly, maxSendMapCount could have a more appropriate name.
Nit: Format for 80 columns.
Nit: There are still instances where we have a class definition immediately after variable definitions, and a lack of whitespace between classes and methods or between methods. Whitespace would help readability in those places.
> Nodemanager ShuffleHandler can possible exhaust file descriptors
>
>
> Key: YARN-2410
> URL: https://issues.apache.org/jira/browse/YARN-2410
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: Nathan Roberts
>Assignee: Kuhu Shukla
> Attachments: YARN-2410-v1.patch, YARN-2410-v2.patch, YARN-2410-v3.patch, YARN-2410-v4.patch, YARN-2410-v5.patch, YARN-2410-v6.patch
>
>
> The async nature of the shufflehandler can cause it to open a huge number of file descriptors; when it runs out, it crashes.
> Scenario:
> Job with 6K reduces, slow start set to 0.95, about 40 map outputs per node. Let's say all 6K reduces hit a node at about the same time asking for their outputs. Each reducer will ask for all 40 map outputs over a single socket in a single request (not necessarily all 40 at once, but with coalescing it is likely to be a large number).
> sendMapOutput() will open the file for random reading and then perform an async transfer of the particular portion of the file. This will theoretically happen 6000*40=240,000 times, which will run the NM out of file descriptors and cause it to crash.
> The algorithm should be refactored a little to not open the fds until they're actually needed.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
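The refactoring suggested in the description, opening fds only when they're actually needed, can be sketched as a small bounded scheduler. This is an illustrative sketch only; LazyFdScheduler and its API are invented for the example and are not the actual ShuffleHandler code:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Supplier;

// Illustrative sketch: rather than opening one file descriptor per
// requested map output up front (6000 reducers * 40 outputs would
// exhaust the fd limit), keep at most maxOpen transfers in flight and
// open each fd only when its turn comes.
public class LazyFdScheduler {
    private final int maxOpen;
    private int open = 0;
    private final Queue<Supplier<AutoCloseable>> pending = new ArrayDeque<>();

    public LazyFdScheduler(int maxOpen) { this.maxOpen = maxOpen; }

    // Enqueue a deferred open; the Supplier opens the file when invoked.
    public void enqueue(Supplier<AutoCloseable> opener) {
        pending.add(opener);
        drain();
    }

    // Invoked by the transfer-complete listener to free a slot.
    public void transferDone() {
        open--;
        drain();
    }

    private void drain() {
        while (open < maxOpen && !pending.isEmpty()) {
            pending.poll().get();   // the fd is opened only at this point
            open++;
        }
    }

    public int openCount() { return open; }
}
```

With a cap of 2, enqueuing 5 transfers opens only 2 files; each completion lazily opens the next.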
[jira] [Commented] (YARN-3337) Provide YARN chaos monkey
[ https://issues.apache.org/jira/browse/YARN-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734800#comment-14734800 ] Junping Du commented on YARN-3337:
--
I think there is one difficulty here: it looks like we don't keep finished-container info in the RM scheduler, only live-container info (in SchedulerApplicationAttempt). If no dead-container info is preserved in the RM, the newly added API can only send a kill-container event; it has no way to know whether the container was actually killed (and no way to differentiate a wrong container ID from the ID of a finished container). The CLI could do better, since it can query the running-container list first, then kill the container and wait until it is no longer active.
If we want exactly the same semantics as the kill-apps API, we would have to make the RM track info for dead containers, which sounds like overkill to me, since it forces the RM to track all containers for all applications (the complexity becomes the same as MRv1's).
Maybe a better trade-off is this: the semantics of forceKillContainer() only mean that kill-container events are sent, not that the container was actually killed. A boolean response from forceKillContainer() would indicate whether we found a live container to kill. That means we would lose the idempotent property of this API, though.
> Provide YARN chaos monkey
> -
>
> Key: YARN-3337
> URL: https://issues.apache.org/jira/browse/YARN-3337
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: test
>Affects Versions: 2.7.0
>Reporter: Steve Loughran
>
> To test failure resilience today you either need custom scripts or have to implement Chaos Monkey-like logic in your application (SLIDER-202).
> Killing AMs and containers on a schedule & probability is the core activity here, one that could be handled by a CLI App/client lib that does this.
> # entry point to have a startup delay before acting
> # frequency of chaos wakeup/polling
> # probability of AM failure generation (0-100)
> # probability of non-AM container kill
> # future: other operations
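The boolean-response semantics proposed in the comment above can be sketched as follows. This is a hypothetical illustration: ForceKillSketch and its internals are invented for the example and are not YARN code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the discussed trade-off: the kill API reports
// only whether a *live* container was found and a kill event dispatched;
// it makes no promise that the container is dead afterwards, and a wrong
// ID is indistinguishable from an already-finished one.
public class ForceKillSketch {
    private final Set<String> liveContainers = new HashSet<>();

    public ForceKillSketch(Set<String> live) {
        liveContainers.addAll(live);
    }

    // Returns true iff the ID matched a live container. Calling it twice
    // with the same ID returns false the second time, so the call is
    // deliberately NOT idempotent.
    public boolean forceKillContainer(String containerId) {
        if (!liveContainers.remove(containerId)) {
            return false;   // unknown, finished, or already being killed
        }
        // ... dispatch the kill-container event here ...
        return true;
    }
}
```

Under these semantics a caller that wants confirmation would poll the running-container list afterwards, as the CLI approach in the comment suggests.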
[jira] [Commented] (YARN-3771) "final" behavior is not honored for YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH since it is a String[]
[ https://issues.apache.org/jira/browse/YARN-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734790#comment-14734790 ] Hadoop QA commented on YARN-3771:
-
| (x) *{color:red}-1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch | 19m 6s | Findbugs (version 3.0.0) appears to be broken on trunk. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | javac | 7m 58s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 10m 13s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 2m 15s | The applied patch generated 4 new checkstyle issues (total was 211, now 201). |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 27s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 4m 20s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | mapreduce tests | 0m 46s | Tests passed in hadoop-mapreduce-client-common. |
| {color:green}+1{color} | mapreduce tests | 107m 23s | Tests passed in hadoop-mapreduce-client-jobclient. |
| {color:green}+1{color} | yarn tests | 0m 29s | Tests passed in hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests | 7m 4s | Tests passed in hadoop-yarn-applications-distributedshell. |
| | | | 162m 8s | |
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12737924/0001-YARN-3771.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 435f935 |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/9033/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt |
| hadoop-mapreduce-client-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/9033/artifact/patchprocess/testrun_hadoop-mapreduce-client-common.txt |
| hadoop-mapreduce-client-jobclient test log | https://builds.apache.org/job/PreCommit-YARN-Build/9033/artifact/patchprocess/testrun_hadoop-mapreduce-client-jobclient.txt |
| hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/9033/artifact/patchprocess/testrun_hadoop-yarn-api.txt |
| hadoop-yarn-applications-distributedshell test log | https://builds.apache.org/job/PreCommit-YARN-Build/9033/artifact/patchprocess/testrun_hadoop-yarn-applications-distributedshell.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9033/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9033/console |
This message was automatically generated.
> "final" behavior is not honored for YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH since it is a String[]
>
>
> Key: YARN-3771
> URL: https://issues.apache.org/jira/browse/YARN-3771
> Project: Hadoop YARN
> Issue Type: Bug
>Reporter: nijel
>Assignee: nijel
> Attachments: 0001-YARN-3771.patch
>
>
> I was going through some findbugs rules.
> One issue reported in that is
> public static final String[] DEFAULT_YARN_APPLICATION_CLASSPATH = {
> and
> public static final String[] DEFAULT_YARN_CROSS_PLATFORM_APPLICATION_CLASSPATH =
> is not honoring the final qualifier: the string array contents can be re-assigned!
> Simple test:
> {code}
> public class TestClass {
>   static final String[] t = { "1", "2" };
>   public static void main(String[] args) {
>     System.out.println(12 < 10);
>     String[] t1 = { "u" };
>     // t = t1; // this will show a compilation error
>     t[1] = t1[0]; // But this works
>   }
> }
> {code}
> One option is to use Collections.unmodifiableList.
> Any thoughts?
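The failure mode described above, plus the List-based workaround, can be shown in a self-contained snippet (the field names here are illustrative, not the real YarnConfiguration constants):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class FinalArrayDemo {
    // 'final' only prevents re-pointing the reference;
    // the array *elements* remain writable.
    static final String[] CLASSPATH_ARRAY = { "dir1", "dir2" };

    // An unmodifiable List rejects element writes at runtime.
    static final List<String> CLASSPATH_LIST =
        Collections.unmodifiableList(Arrays.asList("dir1", "dir2"));

    public static void main(String[] args) {
        CLASSPATH_ARRAY[0] = "evil";              // compiles and succeeds
        System.out.println(CLASSPATH_ARRAY[0]);

        try {
            CLASSPATH_LIST.set(0, "evil");        // throws at runtime
        } catch (UnsupportedOperationException e) {
            System.out.println("list is protected");
        }
    }
}
```

Note the trade-off: the unmodifiable List catches the mutation only at runtime (with UnsupportedOperationException), not at compile time, and changing a public field's type is an API-compatibility question in its own right.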
[jira] [Commented] (YARN-4130) Duplicate declaration of ApplicationId in RMAppManager
[ https://issues.apache.org/jira/browse/YARN-4130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734776#comment-14734776 ] Hadoop QA commented on YARN-4130:
-
| (x) *{color:red}-1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 16m 36s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac | 7m 41s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 51s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 0m 49s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 28s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 28s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests | 54m 31s | Tests failed in hadoop-yarn-server-resourcemanager. |
| | | | 93m 23s | |
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.server.resourcemanager.TestRMHA |
| | hadoop.yarn.server.resourcemanager.scheduler.fair.TestAllocationFileLoaderService |
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12754621/YARN-4130.00.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 435f935 |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9036/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9036/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9036/console |
This message was automatically generated.
> Duplicate declaration of ApplicationId in RMAppManager
> --
>
> Key: YARN-4130
> URL: https://issues.apache.org/jira/browse/YARN-4130
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Kai Sasaki
>Assignee: Kai Sasaki
>Priority: Trivial
> Labels: resourcemanager
> Attachments: YARN-4130.00.patch
>
>
> ApplicationId is declared twice in {{RMAppManager}}.
[jira] [Commented] (YARN-4022) queue not remove from webpage(/cluster/scheduler) when delete queue in xxx-scheduler.xml
[ https://issues.apache.org/jira/browse/YARN-4022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734751#comment-14734751 ] Hadoop QA commented on YARN-4022:
-
| (x) *{color:red}-1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 16m 20s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac | 7m 38s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 59s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 0m 50s | The applied patch generated 11 new checkstyle issues (total was 85, now 94). |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 29s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 33s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests | 54m 7s | Tests failed in hadoop-yarn-server-resourcemanager. |
| | | | 92m 55s | |
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions |
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12754618/YARN-4022.002.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 435f935 |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/9035/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9035/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9035/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9035/console |
This message was automatically generated.
> queue not remove from webpage(/cluster/scheduler) when delete queue in xxx-scheduler.xml
>
>
> Key: YARN-4022
> URL: https://issues.apache.org/jira/browse/YARN-4022
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: forrestchen
> Labels: scheduler
> Attachments: YARN-4022.001.patch, YARN-4022.002.patch
>
>
> When I delete an existing queue by modifying the xxx-scheduler.xml, I can still see the queue's information block in the webpage (/cluster/scheduler), though the 'Min Resources' items all become zero and there is no 'Max Running Applications' item.
> I can still submit an application to the deleted queue, and the application will run in the 'root.default' queue instead, but submitting to a non-existent queue causes an exception.
> My expectation is that the deleted queue will no longer be displayed in the webpage, and that submitting an application to the deleted queue will act just as if the queue doesn't exist.
> PS: There's no application running in the queue I delete.
> Some related config in yarn-site.xml:
> {code}
> <property>
>   <name>yarn.scheduler.fair.user-as-default-queue</name>
>   <value>false</value>
> </property>
> <property>
>   <name>yarn.scheduler.fair.allow-undeclared-pools</name>
>   <value>false</value>
> </property>
> {code}
> A related question is here:
> http://stackoverflow.com/questions/26488564/hadoop-yarn-why-the-queue-cannot-be-deleted-after-i-revise-my-fair-scheduler-xm