[jira] [Commented] (YARN-3443) Create a 'ResourceHandler' subsystem to ease addition of support for new resource types on the NM
[ https://issues.apache.org/jira/browse/YARN-3443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482688#comment-14482688 ] Hadoop QA commented on YARN-3443: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12723549/YARN-3443.004.patch against trunk revision 3fb5abf. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7234//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7234//console This message is automatically generated. > Create a 'ResourceHandler' subsystem to ease addition of support for new > resource types on the NM > - > > Key: YARN-3443 > URL: https://issues.apache.org/jira/browse/YARN-3443 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Sidharta Seethana >Assignee: Sidharta Seethana > Attachments: YARN-3443.001.patch, YARN-3443.002.patch, > YARN-3443.003.patch, YARN-3443.004.patch > > > The current cgroups implementation is closely tied to supporting CPU as a > resource . We need to separate out CGroups support as well a provide a simple > ResourceHandler subsystem that will enable us to add support for new resource > types on the NM - e.g Network, Disk etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
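For readers following along, a minimal sketch of what a pluggable per-resource handler abstraction could look like on the NM side — the interface and method names below are illustrative, not the ones used in the YARN-3443 patches:
{code}
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch of a pluggable per-resource handler. Method and type
 * names are hypothetical; the actual patch may differ.
 */
interface ResourceHandler {
  /** One-time setup, e.g. locating or mounting a cgroup controller. */
  void bootstrap() throws Exception;

  /** Apply limits before a container is launched (e.g. write cgroup params). */
  void preStart(String containerId) throws Exception;

  /** Clean up after a container completes (e.g. delete its cgroup). */
  void postComplete(String containerId) throws Exception;

  /** One-time teardown on NM shutdown. */
  void teardown() throws Exception;
}

/** Runs several handlers (CPU, network, disk, ...) as a single chain. */
class ResourceHandlerChain implements ResourceHandler {
  private final List<ResourceHandler> handlers = new ArrayList<>();

  void add(ResourceHandler h) { handlers.add(h); }

  @Override public void bootstrap() throws Exception {
    for (ResourceHandler h : handlers) { h.bootstrap(); }
  }
  @Override public void preStart(String containerId) throws Exception {
    for (ResourceHandler h : handlers) { h.preStart(containerId); }
  }
  @Override public void postComplete(String containerId) throws Exception {
    for (ResourceHandler h : handlers) { h.postComplete(containerId); }
  }
  @Override public void teardown() throws Exception {
    for (ResourceHandler h : handlers) { h.teardown(); }
  }
}
{code}
Under this kind of scheme a CPU/cgroups handler, a network handler and a disk handler each implement the same lifecycle and are chained together, so adding a new resource type means adding one handler rather than touching the container launch path.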
[jira] [Updated] (YARN-3366) Outbound network bandwidth : classify/shape traffic originating from YARN containers
[ https://issues.apache.org/jira/browse/YARN-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sidharta Seethana updated YARN-3366: Attachment: YARN-3366.002.patch Uploading a patch that includes changes to YarnConfiguration.java > Outbound network bandwidth : classify/shape traffic originating from YARN > containers > > > Key: YARN-3366 > URL: https://issues.apache.org/jira/browse/YARN-3366 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Sidharta Seethana >Assignee: Sidharta Seethana > Attachments: YARN-3366.001.patch, YARN-3366.002.patch > > > In order to be able to isolate based on/enforce outbound traffic bandwidth > limits, we need a mechanism to classify/shape network traffic in the > nodemanager. For more information on the design, please see the attached > design document in the parent JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
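As background on the classify/shape mechanism referred to above, the sketch below builds the kind of tc(8) commands an NM-side handler could issue to cap a container's outbound bandwidth. It is a rough illustration under assumed values — the device name, class ids and rate are made up, and the actual patch drives this through the container-executor rather than plain Java:
{code}
import java.util.Arrays;
import java.util.List;

/**
 * Rough illustration only: the kind of tc(8) commands a traffic-shaping
 * handler could issue for one container. Device, class ids and rate are
 * made-up values, not what the YARN-3366 patch configures.
 */
public class TrafficShapingSketch {
  public static List<List<String>> commandsFor(String device,
      String containerClassId, String rateMbit) {
    return Arrays.asList(
        // Root qdisc for the device (one-time setup).
        Arrays.asList("tc", "qdisc", "add", "dev", device, "root", "handle", "1:", "htb"),
        // Per-container HTB class carrying the outbound bandwidth cap.
        Arrays.asList("tc", "class", "add", "dev", device, "parent", "1:",
            "classid", containerClassId, "htb", "rate", rateMbit + "mbit"),
        // Classify packets by the net_cls cgroup classid of the sending process.
        Arrays.asList("tc", "filter", "add", "dev", device, "parent", "1:",
            "protocol", "ip", "prio", "10", "handle", "1:", "cgroup"));
  }

  public static void main(String[] args) {
    for (List<String> cmd : commandsFor("eth0", "1:10", "50")) {
      System.out.println(String.join(" ", cmd));
    }
  }
}
{code}
On Linux, packets from a container's processes are steered to its HTB class via the net_cls cgroup classid, which is what makes per-container shaping possible without tagging individual sockets.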
[jira] [Updated] (YARN-3443) Create a 'ResourceHandler' subsystem to ease addition of support for new resource types on the NM
[ https://issues.apache.org/jira/browse/YARN-3443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sidharta Seethana updated YARN-3443: Attachment: YARN-3443.004.patch Patch with documentation fixes. > Create a 'ResourceHandler' subsystem to ease addition of support for new > resource types on the NM > - > > Key: YARN-3443 > URL: https://issues.apache.org/jira/browse/YARN-3443 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Sidharta Seethana >Assignee: Sidharta Seethana > Attachments: YARN-3443.001.patch, YARN-3443.002.patch, > YARN-3443.003.patch, YARN-3443.004.patch > > > The current cgroups implementation is closely tied to supporting CPU as a > resource . We need to separate out CGroups support as well a provide a simple > ResourceHandler subsystem that will enable us to add support for new resource > types on the NM - e.g Network, Disk etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp
[ https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482635#comment-14482635 ] Yongjun Zhang commented on YARN-3021: - Hi [~jianhe], I uploaded rev 007 to address your latest comment. I agree that the token renewer won't be empty in that case, and if we need to modify the definition of {{skipTokenRenewal}} in the future, we can add back the check at that time. Would you please take a look? Thanks. > YARN's delegation-token handling disallows certain trust setups to operate > properly over DistCp > --- > > Key: YARN-3021 > URL: https://issues.apache.org/jira/browse/YARN-3021 > Project: Hadoop YARN > Issue Type: Bug > Components: security >Affects Versions: 2.3.0 >Reporter: Harsh J >Assignee: Yongjun Zhang > Attachments: YARN-3021.001.patch, YARN-3021.002.patch, > YARN-3021.003.patch, YARN-3021.004.patch, YARN-3021.005.patch, > YARN-3021.006.patch, YARN-3021.007.patch, YARN-3021.patch > > > Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, > and B trusts COMMON (one way trusts both), and both A and B run HDFS + YARN > clusters. > Now if one logs in with a COMMON credential, and runs a job on A's YARN that > needs to access B's HDFS (such as a DistCp), the operation fails in the RM, > as it attempts a renewDelegationToken(…) synchronously during application > submission (to validate the managed token before it adds it to a scheduler > for automatic renewal). The call obviously fails cause B realm will not trust > A's credentials (here, the RM's principal is the renewer). > In the 1.x JobTracker the same call is present, but it is done asynchronously > and once the renewal attempt failed we simply ceased to schedule any further > attempts of renewals, rather than fail the job immediately. > We should change the logic such that we attempt the renewal but go easy on > the failure and skip the scheduling alone, rather than bubble back an error > to the client, failing the app submission. This way the old behaviour is > retained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
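For context on the proposed fix, the gist of the behaviour change is sketched below in plain Java — a simplified stand-in, not the DelegationTokenRenewer code from the patch: attempt the renewal at submission time, and on failure skip scheduling automatic renewals instead of failing the application.
{code}
/**
 * Simplified stand-in for the proposed RM behaviour: try to renew a token at
 * submission, but on failure skip scheduling renewals instead of failing the
 * application. Names are illustrative, not the actual DelegationTokenRenewer API.
 */
class TokenRenewalSketch {
  interface Renewer { void renew(String token) throws Exception; }

  /** Returns true if the token should be scheduled for automatic renewal. */
  static boolean tryInitialRenewal(Renewer renewer, String token,
      boolean skipTokenRenewal) {
    if (skipTokenRenewal) {
      return false;           // caller asked us not to renew this token at all
    }
    try {
      renewer.renew(token);
      return true;            // renewable: keep it on the renewal schedule
    } catch (Exception e) {
      // Cross-realm trust (A -> COMMON <- B) can make renewal fail even though
      // the token itself is usable; log and skip scheduling, don't fail the app.
      System.err.println("Renewal failed, skipping automatic renewal: " + e);
      return false;
    }
  }
}
{code}
This preserves the 1.x JobTracker behaviour described in the issue: a token from a realm that does not trust the RM's principal no longer aborts app submission.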
[jira] [Updated] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp
[ https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjun Zhang updated YARN-3021: Attachment: YARN-3021.007.patch > YARN's delegation-token handling disallows certain trust setups to operate > properly over DistCp > --- > > Key: YARN-3021 > URL: https://issues.apache.org/jira/browse/YARN-3021 > Project: Hadoop YARN > Issue Type: Bug > Components: security >Affects Versions: 2.3.0 >Reporter: Harsh J >Assignee: Yongjun Zhang > Attachments: YARN-3021.001.patch, YARN-3021.002.patch, > YARN-3021.003.patch, YARN-3021.004.patch, YARN-3021.005.patch, > YARN-3021.006.patch, YARN-3021.007.patch, YARN-3021.patch > > > Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, > and B trusts COMMON (one way trusts both), and both A and B run HDFS + YARN > clusters. > Now if one logs in with a COMMON credential, and runs a job on A's YARN that > needs to access B's HDFS (such as a DistCp), the operation fails in the RM, > as it attempts a renewDelegationToken(…) synchronously during application > submission (to validate the managed token before it adds it to a scheduler > for automatic renewal). The call obviously fails cause B realm will not trust > A's credentials (here, the RM's principal is the renewer). > In the 1.x JobTracker the same call is present, but it is done asynchronously > and once the renewal attempt failed we simply ceased to schedule any further > attempts of renewals, rather than fail the job immediately. > We should change the logic such that we attempt the renewal but go easy on > the failure and skip the scheduling alone, rather than bubble back an error > to the client, failing the app submission. This way the old behaviour is > retained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3404) View the queue name to YARN Application page
[ https://issues.apache.org/jira/browse/YARN-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryu Kobayashi updated YARN-3404: Attachment: YARN-3404.2.patch [~jianhe] Okay, I added each link. > View the queue name to YARN Application page > > > Key: YARN-3404 > URL: https://issues.apache.org/jira/browse/YARN-3404 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Ryu Kobayashi >Assignee: Ryu Kobayashi >Priority: Minor > Attachments: YARN-3404.1.patch, YARN-3404.2.patch, screenshot.png > > > We want to display the name of the queue that is used on the YARN Application > page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3443) Create a 'ResourceHandler' subsystem to ease addition of support for new resource types on the NM
[ https://issues.apache.org/jira/browse/YARN-3443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482591#comment-14482591 ] Varun Vasudev commented on YARN-3443: - Minor documentation fixes, everything else looks good. # In PrivilegedOperationExecutor.java, getPrivilegedOperationExecutionCommand is documented as throwing ExitCodeException but it doesn't throw it. # In CGroupsHandler.java, the documentation for createCGroup is missing descriptions for controller and path. > Create a 'ResourceHandler' subsystem to ease addition of support for new > resource types on the NM > - > > Key: YARN-3443 > URL: https://issues.apache.org/jira/browse/YARN-3443 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Sidharta Seethana >Assignee: Sidharta Seethana > Attachments: YARN-3443.001.patch, YARN-3443.002.patch, > YARN-3443.003.patch > > > The current cgroups implementation is closely tied to supporting CPU as a > resource . We need to separate out CGroups support as well a provide a simple > ResourceHandler subsystem that will enable us to add support for new resource > types on the NM - e.g Network, Disk etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
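To make the second point concrete, the fragment below shows one way the createCGroup javadoc could read once the parameter descriptions are added. The signature and names are assumed from the review comment, not copied from the patch:
{code}
/** Illustrative fragment only; names are assumed from the review comment. */
interface CGroupsHandlerDocSketch {
  /** Stand-in for the real controller enum. */
  enum CGroupController { CPU, NET_CLS, BLKIO }

  /**
   * Creates a cgroup under the given controller hierarchy.
   *
   * @param controller the cgroup controller (subsystem) to create the group under
   * @param cGroupId   the path of the cgroup, relative to the controller hierarchy
   * @return the full path to the created cgroup
   * @throws Exception if the cgroup could not be created
   */
  String createCGroup(CGroupController controller, String cGroupId) throws Exception;
}
{code}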
[jira] [Commented] (YARN-3455) Document CGroup support
[ https://issues.apache.org/jira/browse/YARN-3455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482564#comment-14482564 ] Rohith commented on YARN-3455: -- On further checking, I found that CGroup support is documented in YARN-2949. I will go through the document and then close this. If any further improvements can be done, I will add a comment. > Document CGroup support > > > Key: YARN-3455 > URL: https://issues.apache.org/jira/browse/YARN-3455 > Project: Hadoop YARN > Issue Type: Task > Components: documentation >Reporter: Rohith > > It would be very useful if CGroup support were documented with sections like > the below > # Introduction > # Configuring CGroups > # Any specific configuration that controls CPU scheduling > # How/when to use CGroups, with some use case explanations -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3455) Document CGroup support
Rohith created YARN-3455: Summary: Document CGroup support Key: YARN-3455 URL: https://issues.apache.org/jira/browse/YARN-3455 Project: Hadoop YARN Issue Type: Task Components: documentation Reporter: Rohith It would be very useful if CGroup support were documented with sections like the below: # Introduction # Configuring CGroups # Any specific configuration that controls CPU scheduling # How/when to use CGroups, with some use case explanations -- This message was sent by Atlassian JIRA (v6.3.4#6332)
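For the proposed "Configuring CGroups" section, the sketch below shows the kind of NM settings such a document would cover, expressed as Configuration calls. The property names are believed to match Hadoop 2.x, but should be double-checked against the documentation added in YARN-2949:
{code}
import org.apache.hadoop.conf.Configuration;

/**
 * Sketch of the NM settings a "Configuring CGroups" section would cover.
 * Property names are believed current for Hadoop 2.x; verify against the
 * documentation added in YARN-2949.
 */
public class CGroupsConfigSketch {
  public static Configuration cgroupsEnabledConf() {
    Configuration conf = new Configuration();
    // CGroups require the LinuxContainerExecutor with the cgroups resources handler.
    conf.set("yarn.nodemanager.container-executor.class",
        "org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor");
    conf.set("yarn.nodemanager.linux-container-executor.resources-handler.class",
        "org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler");
    conf.set("yarn.nodemanager.linux-container-executor.cgroups.hierarchy", "/hadoop-yarn");
    conf.setBoolean("yarn.nodemanager.linux-container-executor.cgroups.mount", false);
    // CPU scheduling knobs: cap total CPU usable by containers, optionally enforce hard limits.
    conf.setInt("yarn.nodemanager.resource.percentage-physical-cpu-limit", 80);
    conf.setBoolean(
        "yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage", false);
    return conf;
  }
}
{code}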
[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode
[ https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482505#comment-14482505 ] Harsh J commented on YARN-2424: --- [~sidharta-s] - Yes, it appears the warning was skipped in the branch-2 patch, likely by accident. Thanks for spotting this! Could you file a new YARN JIRA to port the warning back into branch-2? > LCE should support non-cgroups, non-secure mode > --- > > Key: YARN-2424 > URL: https://issues.apache.org/jira/browse/YARN-2424 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1 >Reporter: Allen Wittenauer >Assignee: Allen Wittenauer >Priority: Blocker > Fix For: 2.6.0 > > Attachments: Y2424-1.patch, YARN-2424.patch > > > After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios. > This is a fairly serious regression, as turning on LCE prior to turning on > full-blown security is a fairly standard procedure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2890) MiniMRYarnCluster should turn on timeline service if configured to do so
[ https://issues.apache.org/jira/browse/YARN-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482488#comment-14482488 ] Mit Desai commented on YARN-2890: - These test failures are not related to the patch. These were also seen in MAPREDUCE-6293 which was not due to the patch. > MiniMRYarnCluster should turn on timeline service if configured to do so > > > Key: YARN-2890 > URL: https://issues.apache.org/jira/browse/YARN-2890 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Mit Desai >Assignee: Mit Desai > Attachments: YARN-2890.1.patch, YARN-2890.2.patch, YARN-2890.3.patch, > YARN-2890.4.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch, > YARN-2890.patch, YARN-2890.patch > > > Currently the MiniMRYarnCluster does not consider the configuration value for > enabling timeline service before starting. The MiniYarnCluster should only > start the timeline service if it is configured to do so. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
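The fix being tested boils down to a small guard; a sketch of the intended check (not the actual MiniMRYarnCluster diff) looks like this:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

/**
 * Sketch of the intended guard: only start the timeline service in the mini
 * cluster when the configuration asks for it. Not the actual MiniMRYarnCluster code.
 */
public class TimelineServiceGuardSketch {
  static boolean shouldStartTimelineService(Configuration conf) {
    return conf.getBoolean(YarnConfiguration.TIMELINE_SERVICE_ENABLED,
        YarnConfiguration.DEFAULT_TIMELINE_SERVICE_ENABLED);
  }

  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    conf.setBoolean(YarnConfiguration.TIMELINE_SERVICE_ENABLED, true);
    if (shouldStartTimelineService(conf)) {
      System.out.println("Mini cluster would start the timeline service");
    }
  }
}
{code}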
[jira] [Commented] (YARN-2618) Avoid over-allocation of disk resources
[ https://issues.apache.org/jira/browse/YARN-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482472#comment-14482472 ] Hadoop QA commented on YARN-2618: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12723515/YARN-2618-7.patch against trunk revision 3fb5abf. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 22 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7231//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7231//console This message is automatically generated. > Avoid over-allocation of disk resources > --- > > Key: YARN-2618 > URL: https://issues.apache.org/jira/browse/YARN-2618 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wei Yan >Assignee: Wei Yan > Attachments: YARN-2618-1.patch, YARN-2618-2.patch, YARN-2618-3.patch, > YARN-2618-4.patch, YARN-2618-5.patch, YARN-2618-6.patch, YARN-2618-7.patch > > > Subtask of YARN-2139. > This should include > - Add API support for introducing disk I/O as the 3rd type resource. > - NM should report this information to the RM > - RM should consider this to avoid over-allocation -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2618) Avoid over-allocation of disk resources
[ https://issues.apache.org/jira/browse/YARN-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2618: -- Attachment: YARN-2618-7.patch Fix the testing errors. > Avoid over-allocation of disk resources > --- > > Key: YARN-2618 > URL: https://issues.apache.org/jira/browse/YARN-2618 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wei Yan >Assignee: Wei Yan > Attachments: YARN-2618-1.patch, YARN-2618-2.patch, YARN-2618-3.patch, > YARN-2618-4.patch, YARN-2618-5.patch, YARN-2618-6.patch, YARN-2618-7.patch > > > Subtask of YARN-2139. > This should include > - Add API support for introducing disk I/O as the 3rd type resource. > - NM should report this information to the RM > - RM should consider this to avoid over-allocation -- This message was sent by Atlassian JIRA (v6.3.4#6332)
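Purely as an illustration of the first bullet in the description ("disk I/O as the 3rd type resource"), a hypothetical resource shape with a disk dimension might look like the following — this is not the API proposed in the attached patches:
{code}
/**
 * Purely illustrative: what a third, disk-I/O dimension next to memory and
 * vcores might look like. This is not the API proposed in the YARN-2618 patches.
 */
public class DiskResourceSketch {
  public final int memoryMb;
  public final int vcores;
  public final int vdisks;   // hypothetical "virtual disk" units, analogous to vcores

  public DiskResourceSketch(int memoryMb, int vcores, int vdisks) {
    this.memoryMb = memoryMb;
    this.vcores = vcores;
    this.vdisks = vdisks;
  }

  /** Scheduler-side check: a request fits only if no dimension is over-allocated. */
  public boolean fits(DiskResourceSketch available) {
    return memoryMb <= available.memoryMb
        && vcores <= available.vcores
        && vdisks <= available.vdisks;
  }
}
{code}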
[jira] [Commented] (YARN-2890) MiniMRYarnCluster should turn on timeline service if configured to do so
[ https://issues.apache.org/jira/browse/YARN-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482401#comment-14482401 ] Hadoop QA commented on YARN-2890: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12723428/YARN-2890.4.patch against trunk revision 28bebc8. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.mapred.pipes.TestPipeApplication org.apache.hadoop.mapred.TestMRTimelineEventHandling org.apache.hadoop.mapreduce.v2.TestMRJobsWithProfiler org.apache.hadoop.mapreduce.v2.TestMRJobsWithHistoryService org.apache.hadoop.mapred.TestClusterMRNotification The following test timeouts occurred in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.mapred.TestJobCleanup org.apache.hadoop.mapred.TestMiniMRWithDFSWithDistinctUsers org.apache.hadoop.mapred.TestLazyOutput org.apache.hadoop.mapred.TestMiniMRChildTask org.apache.hadoop.mapreduce.v2.TestMRJobs org.apache.hadoop.mapreduce.v2.TestSpeculativeExecution org.apache.hadoop.mapreduce.v2.TestUberAM org.apache.hadoop.mapreduce.TestMRJobClient org.apache.hadoop.mapreduce.TestMapReduceLazyOutput org.apache.hadoop.mapreduce.TestLargeSort The test build failed in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7226//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7226//console This message is automatically generated. > MiniMRYarnCluster should turn on timeline service if configured to do so > > > Key: YARN-2890 > URL: https://issues.apache.org/jira/browse/YARN-2890 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Mit Desai >Assignee: Mit Desai > Attachments: YARN-2890.1.patch, YARN-2890.2.patch, YARN-2890.3.patch, > YARN-2890.4.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch, > YARN-2890.patch, YARN-2890.patch > > > Currently the MiniMRYarnCluster does not consider the configuration value for > enabling timeline service before starting. 
The MiniYarnCluster should only > start the timeline service if it is configured to do so. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3454) RLESparseResourceAllocation does not handle removal of partial intervals
[ https://issues.apache.org/jira/browse/YARN-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482370#comment-14482370 ] Carlo Curino commented on YARN-3454: Adding a capacity reservation (e.g., 10 containers) between time 10 and 20, and then removing the same containers from an interval between 12 and 18 does not work correctly. Only exact interval matches work. This is normally not exercised in the Reservation sub-system, but it is needed for further enhancements we are working on. More generally, this is a bug we should get rid of. > RLESparseResourceAllocation does not handle removal of partial intervals > > > Key: YARN-3454 > URL: https://issues.apache.org/jira/browse/YARN-3454 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Carlo Curino > > The RLESparseResourceAllocation.removeInterval(...) method handles exact-match > interval removals well, but does not handle partial overlaps correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
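The failing scenario above can be written down as a small self-contained sketch. The run-length map below is a simplified stand-in for RLESparseResourceAllocation, used only to show the expected semantics of a partial-overlap removal:
{code}
import java.util.TreeMap;

/**
 * The scenario from the comment, using a simplified run-length map instead of
 * the real RLESparseResourceAllocation API: add 10 containers over [10, 20),
 * then remove 10 over [12, 18). The expected result is 10 containers left on
 * [10, 12) and [18, 20).
 */
public class PartialRemovalSketch {
  // time -> container count starting at that time (run-length encoded)
  private final TreeMap<Long, Integer> steps = new TreeMap<>();

  void add(long start, long end, int containers) { apply(start, end, containers); }
  void remove(long start, long end, int containers) { apply(start, end, -containers); }

  private void apply(long start, long end, int delta) {
    // Materialize breakpoints at the interval edges, then shift everything between them.
    steps.putIfAbsent(start, valueAt(start));
    steps.putIfAbsent(end, valueAt(end));
    for (Long t : steps.subMap(start, true, end, false).keySet()) {
      steps.put(t, steps.get(t) + delta);
    }
  }

  private int valueAt(long t) {
    return steps.floorEntry(t) == null ? 0 : steps.floorEntry(t).getValue();
  }

  public static void main(String[] args) {
    PartialRemovalSketch alloc = new PartialRemovalSketch();
    alloc.add(10, 20, 10);
    alloc.remove(12, 18, 10);
    System.out.println(alloc.valueAt(11)); // 10
    System.out.println(alloc.valueAt(15)); // 0
    System.out.println(alloc.valueAt(19)); // 10
  }
}
{code}
Exact-match removal alone cannot produce the [10, 12) and [18, 20) remainders, which is the gap removeInterval(...) needs to close.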
[jira] [Created] (YARN-3454) RLESparseResourceAllocation does not handle removal of partial intervals
Carlo Curino created YARN-3454: -- Summary: RLESparseResourceAllocation does not handle removal of partial intervals Key: YARN-3454 URL: https://issues.apache.org/jira/browse/YARN-3454 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.6.0 Reporter: Carlo Curino The RLESparseResourceAllocation.removeInterval(...) method handles exact-match interval removals well, but does not handle partial overlaps correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3431) Sub resources of timeline entity needs to be passed to a separate endpoint.
[ https://issues.apache.org/jira/browse/YARN-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482328#comment-14482328 ] Zhijie Shen commented on YARN-3431: --- bq. Are those endpoints just placeholders so that we can specialize each of them? Or else, I'm not sure about the motivation behind this (currently no description for this JIRA...). Could you please elaborate a little bit more on this? The problem is that we have TimelineEntity and all of its sub classes. On the other side, we have a single endpoint, which consumes TimelineEntity. Therefore, this endpoint expects the incoming request body to contain exactly a TimelineEntity object. The JSON data serialized from a sub-class object doesn't seem to be treated as a TimelineEntity object, and won't be deserialized into the corresponding sub-class object. I tried to figure out whether JAX-RS has a general approach for this, but didn't find the answer (please let me know if anyone has an idea). Alternatively, I chose to treat the predefined sub classes as sub-resources, and put them on separate endpoints. Once they are deserialized at the server side, Java can identify the TimelineEntity objects' concrete classes and treat them accordingly, so we don't need separate Java APIs in the collector. > Sub resources of timeline entity needs to be passed to a separate endpoint. > --- > > Key: YARN-3431 > URL: https://issues.apache.org/jira/browse/YARN-3431 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-3431.1.patch, YARN-3431.2.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
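The approach described above has roughly this JAX-RS shape — the paths and entity types below are illustrative stand-ins, not the endpoints in the patch: one sub-path per predefined subclass so the right Java type is deserialized, all funneling into the same put logic.
{code}
import javax.ws.rs.Consumes;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

/**
 * Illustrative JAX-RS shape of the approach described above. Paths and entity
 * types are stand-ins; the patch's actual resource classes differ.
 */
@Path("/ws/v2/timeline")
public class TimelineSubResourceSketch {
  static class TimelineEntity { /* base fields elided */ }
  static class ApplicationEntity extends TimelineEntity { /* app-specific fields */ }
  static class ContainerEntity extends TimelineEntity { /* container-specific fields */ }

  @POST @Path("/entities") @Consumes(MediaType.APPLICATION_JSON)
  public Response putEntity(TimelineEntity entity) { return store(entity); }

  @POST @Path("/apps") @Consumes(MediaType.APPLICATION_JSON)
  public Response putApplication(ApplicationEntity entity) { return store(entity); }

  @POST @Path("/containers") @Consumes(MediaType.APPLICATION_JSON)
  public Response putContainer(ContainerEntity entity) { return store(entity); }

  private Response store(TimelineEntity entity) {
    // Single collector-side put path; the concrete subclass is preserved here.
    return Response.ok().build();
  }
}
{code}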
[jira] [Commented] (YARN-2901) Add errors and warning metrics page to RM, NM web UI
[ https://issues.apache.org/jira/browse/YARN-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482303#comment-14482303 ] Hadoop QA commented on YARN-2901: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12723490/YARN-2901.addendem.1.patch against trunk revision 3fb5abf. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7230//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7230//console This message is automatically generated. > Add errors and warning metrics page to RM, NM web UI > > > Key: YARN-2901 > URL: https://issues.apache.org/jira/browse/YARN-2901 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Fix For: 2.8.0 > > Attachments: Exception collapsed.png, Exception expanded.jpg, Screen > Shot 2015-03-19 at 7.40.02 PM.png, YARN-2901.addendem.1.patch, > apache-yarn-2901.0.patch, apache-yarn-2901.1.patch, apache-yarn-2901.2.patch, > apache-yarn-2901.3.patch, apache-yarn-2901.4.patch, apache-yarn-2901.5.patch > > > It would be really useful to have statistics on the number of errors and > warnings in the RM and NM web UI. I'm thinking about - > 1. The number of errors and warnings in the past 5 min/1 hour/12 hours/day > 2. The top 'n'(20?) most common exceptions in the past 5 min/1 hour/12 > hours/day > By errors and warnings I'm referring to the log level. > I suspect we can probably achieve this by writing a custom appender?(I'm open > to suggestions on alternate mechanisms for implementing this). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
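As background for the "custom appender" idea in the description, a minimal counting appender against the log4j 1.x API is sketched below; the real Log4jWarningErrorMetricsAppender additionally buckets events by time window and message, which is what the UI tables need:
{code}
import java.util.concurrent.atomic.AtomicLong;
import org.apache.log4j.AppenderSkeleton;
import org.apache.log4j.Level;
import org.apache.log4j.spi.LoggingEvent;

/**
 * Minimal sketch of the "custom appender" idea: count WARN/ERROR events so a
 * web page can display them. Not the actual Log4jWarningErrorMetricsAppender.
 */
public class ErrorWarningCountingAppender extends AppenderSkeleton {
  private final AtomicLong errors = new AtomicLong();
  private final AtomicLong warnings = new AtomicLong();

  @Override
  protected void append(LoggingEvent event) {
    if (event.getLevel().equals(Level.ERROR)) {
      errors.incrementAndGet();
    } else if (event.getLevel().equals(Level.WARN)) {
      warnings.incrementAndGet();
    }
  }

  public long getErrorCount() { return errors.get(); }
  public long getWarningCount() { return warnings.get(); }

  @Override public void close() { }
  @Override public boolean requiresLayout() { return false; }
}
{code}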
[jira] [Commented] (YARN-3225) New parameter or CLI for decommissioning node gracefully in RMAdmin CLI
[ https://issues.apache.org/jira/browse/YARN-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482298#comment-14482298 ] Junping Du commented on YARN-3225: -- Thanks [~devaraj.k] for the patch! The latest patch looks good overall. Some minor comments:
{code}
+  private long validateTimeout(String strTimeout) {
+    long timeout;
+    try {
+      timeout = Long.parseLong(strTimeout);
+    } catch (NumberFormatException ex) {
+      throw new IllegalArgumentException(INVALID_TIMEOUT_ERR_MSG + strTimeout);
+    }
+    if (timeout < 0) {
+      throw new IllegalArgumentException(INVALID_TIMEOUT_ERR_MSG + timeout);
+    }
+    return timeout;
+  }
{code}
I think we should support a case where the admin wants nodes to be decommissioned whenever all apps on those nodes have finished. If so, shall we support a negative value (any one, or some special one like -1) to specify this case?
javadoc in DecommissionType.java:
{code}
+  /** Decomissioning nodes **/
+  NORMAL,
+
+  /** Graceful decommissioning of nodes **/
+  GRACEFUL,
+
+  /** Forceful decommissioning of nodes **/
+  FORCEFUL
{code}
For NORMAL, shall we use "Decommission nodes in normal (old) way" instead, or something simpler - "Decommission nodes"?
{code}
+@Private
+@Unstable
+public abstract class CheckForDecommissioningNodesRequest {
+  @Public
+  @Unstable
+  public static CheckForDecommissioningNodesRequest newInstance() {
+    CheckForDecommissioningNodesRequest request = Records
+        .newRecord(CheckForDecommissioningNodesRequest.class);
+    return request;
+  }
+}
{code}
IMO, the methods inside a class shouldn't be more public than the class itself. If we don't expect other projects to use the class, we also don't expect some of its methods to be used. The same problem exists in an old API, RefreshNodesRequest.java. I think we may need to fix both?
{code}
   @Test
   public void testRefreshNodes() throws Exception {
     resourceManager.getClientRMService();
-    RefreshNodesRequest request = recordFactory
-        .newRecordInstance(RefreshNodesRequest.class);
+    RefreshNodesRequest request = RefreshNodesRequest
+        .newInstance(DecommissionType.NORMAL);
     RefreshNodesResponse response = client.refreshNodes(request);
     assertNotNull(response);
   }
{code}
Why do we need this change? recordFactory.newRecordInstance(RefreshNodesRequest.class) will return something with DecommissionType.NORMAL as the default. No?
> New parameter or CLI for decommissioning node gracefully in RMAdmin CLI > --- > > Key: YARN-3225 > URL: https://issues.apache.org/jira/browse/YARN-3225 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Junping Du >Assignee: Devaraj K > Attachments: YARN-3225-1.patch, YARN-3225-2.patch, YARN-3225-3.patch, > YARN-3225.patch, YARN-914.patch > > > New CLI (or existing CLI with parameters) should put each node on > decommission list to decommissioning status and track timeout to terminate > the nodes that haven't get finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
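One way the first review comment could be addressed is sketched below — treating -1 as "wait until running applications finish" is the reviewer's suggestion, not necessarily what the patch ends up doing:
{code}
/**
 * Sketch of the reviewer's suggestion: accept -1 to mean "wait until running
 * applications finish" and reject other negatives. Illustrative only.
 */
final class TimeoutValidationSketch {
  static final String INVALID_TIMEOUT_ERR_MSG = "Invalid timeout specified: ";
  static final long WAIT_FOR_APPS_TO_FINISH = -1L;

  static long validateTimeout(String strTimeout) {
    long timeout;
    try {
      timeout = Long.parseLong(strTimeout);
    } catch (NumberFormatException ex) {
      throw new IllegalArgumentException(INVALID_TIMEOUT_ERR_MSG + strTimeout);
    }
    if (timeout < WAIT_FOR_APPS_TO_FINISH) {
      throw new IllegalArgumentException(INVALID_TIMEOUT_ERR_MSG + timeout);
    }
    return timeout;
  }
}
{code}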
[jira] [Commented] (YARN-3426) Add jdiff support to YARN
[ https://issues.apache.org/jira/browse/YARN-3426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482281#comment-14482281 ] Hadoop QA commented on YARN-3426: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12723488/YARN-3426-040615-1.patch against trunk revision 3fb5abf. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:red}-1 release audit{color}. The applied patch generated 4 release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common: org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7229//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7229//artifact/patchprocess/patchReleaseAuditProblems.txt Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7229//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7229//console This message is automatically generated. > Add jdiff support to YARN > - > > Key: YARN-3426 > URL: https://issues.apache.org/jira/browse/YARN-3426 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu >Priority: Blocker > Attachments: YARN-3426-040615-1.patch, YARN-3426-040615.patch > > > Maybe we'd like to extend our current jdiff tool for hadoop-common and hdfs > to YARN as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3431) Sub resources of timeline entity needs to be passed to a separate endpoint.
[ https://issues.apache.org/jira/browse/YARN-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482266#comment-14482266 ] Li Lu commented on YARN-3431: - Hi [~zjshen], thanks for working on this! I reviewed your v2 patch, and the code LGTM. However, I'm a little bit confused about the big picture of this patch. In this patch you're setting up separate REST endpoints to post different types of timeline entities. However, all different REST endpoints have exactly the same internal logic, redirecting the incoming entity to the collector's putEntity. Are those endpoints just placeholders so that we can specialize each of them? Or else, I'm not sure about the motivation behind this (currently no description for this JIRA...). Could you please elaborate a little bit more on this? BTW, I agree we need to specialize for different types of timeline entities, but maybe we need to do this on the collector/storage side? For storage layer design we need to write down the detailed timeline entities so specialization would be helpful. > Sub resources of timeline entity needs to be passed to a separate endpoint. > --- > > Key: YARN-3431 > URL: https://issues.apache.org/jira/browse/YARN-3431 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-3431.1.patch, YARN-3431.2.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1376) NM need to notify the log aggregation status to RM through Node heartbeat
[ https://issues.apache.org/jira/browse/YARN-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482248#comment-14482248 ] Junping Du commented on YARN-1376: -- Hi [~xgong], thanks for the patch! Some major comments after going through the patch: - Shall we put LogAggregationStatus and LogAggregationReport (and the related pb impls) in server-api instead of the yarn api? We will not expose them to applications, so it is better to put them on the server side. - I didn't see where we remove elements from logAggregationReportForApps. I think we need to remove them when log aggregation has finished, or they will keep occupying (and may gradually eat up) the NM's memory. > NM need to notify the log aggregation status to RM through Node heartbeat > - > > Key: YARN-1376 > URL: https://issues.apache.org/jira/browse/YARN-1376 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Xuan Gong >Assignee: Xuan Gong > Attachments: YARN-1376.1.patch, YARN-1376.2.patch, YARN-1376.2.patch, > YARN-1376.2015-04-04.patch, YARN-1376.2015-04-06.patch, YARN-1376.3.patch, > YARN-1376.4.patch > > > Expose a client API to allow clients to figure if log aggregation is > complete. The ticket is used to track the changes on NM side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
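The second comment is about bounding NM memory; a simplified sketch of that cleanup (names are stand-ins, not the patch's data structures) is:
{code}
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Simplified sketch of the review point above: once an application's log
 * aggregation reaches a terminal state and has been reported, drop it from the
 * NM-side map so the map does not grow without bound. Names are stand-ins.
 */
class LogAggregationReportCleanupSketch {
  enum Status { RUNNING, SUCCEEDED, FAILED }

  private final Map<String, Status> logAggregationReportForApps = new ConcurrentHashMap<>();

  void update(String appId, Status status) {
    logAggregationReportForApps.put(appId, status);
  }

  /** Called after the reports have been sent to the RM in a heartbeat. */
  void removeFinishedReports() {
    Iterator<Map.Entry<String, Status>> it =
        logAggregationReportForApps.entrySet().iterator();
    while (it.hasNext()) {
      if (it.next().getValue() != Status.RUNNING) {
        it.remove();
      }
    }
  }
}
{code}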
[jira] [Commented] (YARN-2901) Add errors and warning metrics page to RM, NM web UI
[ https://issues.apache.org/jira/browse/YARN-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482218#comment-14482218 ] Li Lu commented on YARN-2901: - +1 pending Jenkins. > Add errors and warning metrics page to RM, NM web UI > > > Key: YARN-2901 > URL: https://issues.apache.org/jira/browse/YARN-2901 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Fix For: 2.8.0 > > Attachments: Exception collapsed.png, Exception expanded.jpg, Screen > Shot 2015-03-19 at 7.40.02 PM.png, YARN-2901.addendem.1.patch, > apache-yarn-2901.0.patch, apache-yarn-2901.1.patch, apache-yarn-2901.2.patch, > apache-yarn-2901.3.patch, apache-yarn-2901.4.patch, apache-yarn-2901.5.patch > > > It would be really useful to have statistics on the number of errors and > warnings in the RM and NM web UI. I'm thinking about - > 1. The number of errors and warnings in the past 5 min/1 hour/12 hours/day > 2. The top 'n'(20?) most common exceptions in the past 5 min/1 hour/12 > hours/day > By errors and warnings I'm referring to the log level. > I suspect we can probably achieve this by writing a custom appender?(I'm open > to suggestions on alternate mechanisms for implementing this). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2901) Add errors and warning metrics page to RM, NM web UI
[ https://issues.apache.org/jira/browse/YARN-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2901: - Attachment: YARN-2901.addendem.1.patch Thanks for reporting, [~gtCarrera9]. This is a false alarm from findbugs. The fields of org.apache.hadoop.yarn.util.Log4jWarningErrorMetricsAppender$Element are used by ErrorsAndWarningsBlock, so we can simply exclude such warnings. Uploaded an addendum patch; pending Jenkins. > Add errors and warning metrics page to RM, NM web UI > > > Key: YARN-2901 > URL: https://issues.apache.org/jira/browse/YARN-2901 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Fix For: 2.8.0 > > Attachments: Exception collapsed.png, Exception expanded.jpg, Screen > Shot 2015-03-19 at 7.40.02 PM.png, YARN-2901.addendem.1.patch, > apache-yarn-2901.0.patch, apache-yarn-2901.1.patch, apache-yarn-2901.2.patch, > apache-yarn-2901.3.patch, apache-yarn-2901.4.patch, apache-yarn-2901.5.patch > > > It would be really useful to have statistics on the number of errors and > warnings in the RM and NM web UI. I'm thinking about - > 1. The number of errors and warnings in the past 5 min/1 hour/12 hours/day > 2. The top 'n'(20?) most common exceptions in the past 5 min/1 hour/12 > hours/day > By errors and warnings I'm referring to the log level. > I suspect we can probably achieve this by writing a custom appender?(I'm open > to suggestions on alternate mechanisms for implementing this). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3426) Add jdiff support to YARN
[ https://issues.apache.org/jira/browse/YARN-3426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-3426: Attachment: YARN-3426-040615-1.patch renamed one property for dev-support directory location. > Add jdiff support to YARN > - > > Key: YARN-3426 > URL: https://issues.apache.org/jira/browse/YARN-3426 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu >Priority: Blocker > Attachments: YARN-3426-040615-1.patch, YARN-3426-040615.patch > > > Maybe we'd like to extend our current jdiff tool for hadoop-common and hdfs > to YARN as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3110) Few issues in ApplicationHistory web ui
[ https://issues.apache.org/jira/browse/YARN-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482203#comment-14482203 ] Naganarasimha G R commented on YARN-3110: - Hi [~xgong], I have rebased and manually tested with the trunk code and able to see the modifications in the web ui . As the modifications are related to the web ui i have not written the testcode. Can you please check now ? > Few issues in ApplicationHistory web ui > --- > > Key: YARN-3110 > URL: https://issues.apache.org/jira/browse/YARN-3110 > Project: Hadoop YARN > Issue Type: Sub-task > Components: applications, timelineserver >Affects Versions: 2.6.0 >Reporter: Bibin A Chundatt >Assignee: Naganarasimha G R >Priority: Minor > Attachments: YARN-3110.20150209-1.patch, YARN-3110.20150315-1.patch, > YARN-3110.20150406-1.patch > > > Application state and History link wrong when Application is in unassigned > state > > 1.Configure capacity schedular with queue size as 1 also max Absolute Max > Capacity: 10.0% > (Current application state is Accepted and Unassigned from resource manager > side) > 2.Submit application to queue and check the state and link in Application > history > State= null and History link shown as N/A in applicationhistory page > Kill the same application . In timeline server logs the below is show when > selecting application link. > {quote} > 2015-01-29 15:39:50,956 ERROR org.apache.hadoop.yarn.webapp.View: Failed to > read the AM container of the application attempt > appattempt_1422467063659_0007_01. > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getContainer(ApplicationHistoryManagerOnTimelineStore.java:162) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getAMContainer(ApplicationHistoryManagerOnTimelineStore.java:184) > at > org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:160) > at > org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:157) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at > org.apache.hadoop.yarn.server.webapp.AppBlock.render(AppBlock.java:156) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:67) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:77) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > at > org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) > at > org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845) > at > org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56) > at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82) > at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.AHSController.app(AHSController.java:38) > at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) > at > 
com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263) > at > com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178) > at > com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) > at > com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) > at > com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) > at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) > at
[jira] [Updated] (YARN-3431) Sub resources of timeline entity needs to be passed to a separate endpoint.
[ https://issues.apache.org/jira/browse/YARN-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-3431: -- Attachment: YARN-3431.2.patch Rebase the patch after YARN-3334. > Sub resources of timeline entity needs to be passed to a separate endpoint. > --- > > Key: YARN-3431 > URL: https://issues.apache.org/jira/browse/YARN-3431 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-3431.1.patch, YARN-3431.2.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3391) Clearly define flow ID/ flow run / flow version in API and storage
[ https://issues.apache.org/jira/browse/YARN-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-3391: -- Attachment: YARN-3391.2.patch Rebase the patch after YARN-3334. Comments are welcome > Clearly define flow ID/ flow run / flow version in API and storage > -- > > Key: YARN-3391 > URL: https://issues.apache.org/jira/browse/YARN-3391 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-3391.1.patch, YARN-3391.2.patch > > > To continue the discussion in YARN-3040, let's figure out the best way to > describe the flow. > Some key issues that we need to conclude on: > - How do we include the flow version in the context so that it gets passed > into the collector and to the storage eventually? > - Flow run id should be a number as opposed to a generic string? > - Default behavior for the flow run id if it is missing (i.e. client did not > set it) > - How do we handle flow attributes in case of nested levels of flows? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3426) Add jdiff support to YARN
[ https://issues.apache.org/jira/browse/YARN-3426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482053#comment-14482053 ] Li Lu commented on YARN-3426: - The findbugs warnings are unrelated here. Reopened YARN-2901 to trace it. > Add jdiff support to YARN > - > > Key: YARN-3426 > URL: https://issues.apache.org/jira/browse/YARN-3426 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu >Priority: Blocker > Attachments: YARN-3426-040615.patch > > > Maybe we'd like to extend our current jdiff tool for hadoop-common and hdfs > to YARN as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2901) Add errors and warning metrics page to RM, NM web UI
[ https://issues.apache.org/jira/browse/YARN-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482051#comment-14482051 ] Li Lu commented on YARN-2901: - BTW, Jenkins didn't report those two warnings in this JIRA because probably it ran against another patch in the run 4 days ago. > Add errors and warning metrics page to RM, NM web UI > > > Key: YARN-2901 > URL: https://issues.apache.org/jira/browse/YARN-2901 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Fix For: 2.8.0 > > Attachments: Exception collapsed.png, Exception expanded.jpg, Screen > Shot 2015-03-19 at 7.40.02 PM.png, apache-yarn-2901.0.patch, > apache-yarn-2901.1.patch, apache-yarn-2901.2.patch, apache-yarn-2901.3.patch, > apache-yarn-2901.4.patch, apache-yarn-2901.5.patch > > > It would be really useful to have statistics on the number of errors and > warnings in the RM and NM web UI. I'm thinking about - > 1. The number of errors and warnings in the past 5 min/1 hour/12 hours/day > 2. The top 'n'(20?) most common exceptions in the past 5 min/1 hour/12 > hours/day > By errors and warnings I'm referring to the log level. > I suspect we can probably achieve this by writing a custom appender?(I'm open > to suggestions on alternate mechanisms for implementing this). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (YARN-2901) Add errors and warning metrics page to RM, NM web UI
[ https://issues.apache.org/jira/browse/YARN-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu reopened YARN-2901: - The recent commit added two findbugs warnings on my local machine and in the build of YARN-3426:
{code}
Code  Warning
UrF   Unread public/protected field: org.apache.hadoop.yarn.util.Log4jWarningErrorMetricsAppender$Element.count
UrF   Unread public/protected field: org.apache.hadoop.yarn.util.Log4jWarningErrorMetricsAppender$Element.timestampSeconds
{code}
From git blame, this JIRA is the one that performed the most recent change. Reopening this JIRA to fix them. > Add errors and warning metrics page to RM, NM web UI > > > Key: YARN-2901 > URL: https://issues.apache.org/jira/browse/YARN-2901 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Fix For: 2.8.0 > > Attachments: Exception collapsed.png, Exception expanded.jpg, Screen > Shot 2015-03-19 at 7.40.02 PM.png, apache-yarn-2901.0.patch, > apache-yarn-2901.1.patch, apache-yarn-2901.2.patch, apache-yarn-2901.3.patch, > apache-yarn-2901.4.patch, apache-yarn-2901.5.patch > > > It would be really useful to have statistics on the number of errors and > warnings in the RM and NM web UI. I'm thinking about - > 1. The number of errors and warnings in the past 5 min/1 hour/12 hours/day > 2. The top 'n'(20?) most common exceptions in the past 5 min/1 hour/12 > hours/day > By errors and warnings I'm referring to the log level. > I suspect we can probably achieve this by writing a custom appender?(I'm open > to suggestions on alternate mechanisms for implementing this). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3429) TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken
[ https://issues.apache.org/jira/browse/YARN-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482034#comment-14482034 ] zhihai xu commented on YARN-3429: - thanks [~rkanter] for reviewing and committing the patch. > TestAMRMTokens.testTokenExpiry fails Intermittently with error > message:Invalid AMRMToken > > > Key: YARN-3429 > URL: https://issues.apache.org/jira/browse/YARN-3429 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: zhihai xu >Assignee: zhihai xu > Fix For: 2.8.0 > > Attachments: YARN-3429.000.patch > > > TestAMRMTokens.testTokenExpiry fails Intermittently with error > message:Invalid AMRMToken from appattempt_1427804754787_0001_01 > The error logs is at > https://builds.apache.org/job/PreCommit-YARN-Build/7172//testReport/org.apache.hadoop.yarn.server.resourcemanager.security/TestAMRMTokens/testTokenExpiry_1_/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2942) Aggregated Log Files should be combined
[ https://issues.apache.org/jira/browse/YARN-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482030#comment-14482030 ] Robert Kanter commented on YARN-2942: - [~vinodkv], I was discussing this with some of our HDFS people, and they think using concat would do less (potentially much less) to actually result in NN metadata savings than the original design of using append and rereading the files. I agree that it would be best if HDFS supported atomic append (with concurrent writers), and that rereading the files isn't ideal, but it seems like the original design is the best solution for the issue at hand for now. Thoughts? > Aggregated Log Files should be combined > --- > > Key: YARN-2942 > URL: https://issues.apache.org/jira/browse/YARN-2942 > Project: Hadoop YARN > Issue Type: New Feature >Affects Versions: 2.6.0 >Reporter: Robert Kanter >Assignee: Robert Kanter > Attachments: CombinedAggregatedLogsProposal_v3.pdf, > CompactedAggregatedLogsProposal_v1.pdf, > CompactedAggregatedLogsProposal_v2.pdf, > ConcatableAggregatedLogsProposal_v4.pdf, > ConcatableAggregatedLogsProposal_v5.pdf, YARN-2942-preliminary.001.patch, > YARN-2942-preliminary.002.patch, YARN-2942.001.patch, YARN-2942.002.patch, > YARN-2942.003.patch > > > Turning on log aggregation allows users to easily store container logs in > HDFS and subsequently view them in the YARN web UIs from a central place. > Currently, there is a separate log file for each Node Manager. This can be a > problem for HDFS if you have a cluster with many nodes as you’ll slowly start > accumulating many (possibly small) files per YARN application. The current > “solution” for this problem is to configure YARN (actually the JHS) to > automatically delete these files after some amount of time. > We should improve this by compacting the per-node aggregated log files into > one log file per application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
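For reference, the HDFS concat primitive being weighed against append is shown below in a minimal form; the paths are made up and the real proposal involves more bookkeeping (ordering, indexing, cleanup):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

/**
 * Minimal illustration of the HDFS concat primitive discussed above: merge
 * per-NM aggregated log files into one target file. Paths are made up, and
 * the real proposal involves more bookkeeping than a single call.
 */
public class ConcatAggregatedLogsSketch {
  public static void combine(Configuration conf, Path target, Path[] perNodeLogs)
      throws Exception {
    FileSystem fs = target.getFileSystem(conf);
    if (!(fs instanceof DistributedFileSystem)) {
      throw new IllegalStateException("concat is an HDFS-only operation");
    }
    // concat moves the blocks of perNodeLogs onto target and removes the source
    // files, subject to HDFS's preconditions on file locations and block sizes.
    ((DistributedFileSystem) fs).concat(target, perNodeLogs);
  }
}
{code}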
[jira] [Commented] (YARN-3443) Create a 'ResourceHandler' subsystem to ease addition of support for new resource types on the NM
[ https://issues.apache.org/jira/browse/YARN-3443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482026#comment-14482026 ] Hadoop QA commented on YARN-3443: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12723436/YARN-3443.003.patch against trunk revision 28bebc8. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7228//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7228//console This message is automatically generated. > Create a 'ResourceHandler' subsystem to ease addition of support for new > resource types on the NM > - > > Key: YARN-3443 > URL: https://issues.apache.org/jira/browse/YARN-3443 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Sidharta Seethana >Assignee: Sidharta Seethana > Attachments: YARN-3443.001.patch, YARN-3443.002.patch, > YARN-3443.003.patch > > > The current cgroups implementation is closely tied to supporting CPU as a > resource . We need to separate out CGroups support as well a provide a simple > ResourceHandler subsystem that will enable us to add support for new resource > types on the NM - e.g Network, Disk etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3426) Add jdiff support to YARN
[ https://issues.apache.org/jira/browse/YARN-3426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482024#comment-14482024 ] Hadoop QA commented on YARN-3426: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12723432/YARN-3426-040615.patch against trunk revision 28bebc8. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:red}-1 release audit{color}. The applied patch generated 4 release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7227//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7227//artifact/patchprocess/patchReleaseAuditProblems.txt Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7227//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7227//console This message is automatically generated. > Add jdiff support to YARN > - > > Key: YARN-3426 > URL: https://issues.apache.org/jira/browse/YARN-3426 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu >Priority: Blocker > Attachments: YARN-3426-040615.patch > > > Maybe we'd like to extend our current jdiff tool for hadoop-common and hdfs > to YARN as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3273) Improve web UI to facilitate scheduling analysis and debugging
[ https://issues.apache.org/jira/browse/YARN-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481974#comment-14481974 ] Hudson commented on YARN-3273: -- FAILURE: Integrated in Hadoop-trunk-Commit #7516 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7516/]) Move YARN-3273 from 2.8 to 2.7. (zjshen: rev 3fb5abfc87953377f86e06578518801a181d7697) * hadoop-yarn-project/CHANGES.txt > Improve web UI to facilitate scheduling analysis and debugging > -- > > Key: YARN-3273 > URL: https://issues.apache.org/jira/browse/YARN-3273 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Rohith > Fix For: 2.7.0 > > Attachments: 0001-YARN-3273-v1.patch, 0001-YARN-3273-v2.patch, > 0002-YARN-3273.patch, 0003-YARN-3273.patch, 0003-YARN-3273.patch, > 0004-YARN-3273.patch, YARN-3273-am-resource-used-AND-User-limit-v2.PNG, > YARN-3273-am-resource-used-AND-User-limit.PNG, > YARN-3273-application-headroom-v2.PNG, YARN-3273-application-headroom.PNG > > > Job may be stuck for reasons such as: > - hitting queue capacity > - hitting user-limit, > - hitting AM-resource-percentage > The first queueCapacity is already shown on the UI. > We may surface things like: > - what is user's current usage and user-limit; > - what is the AM resource usage and limit; > - what is the application's current HeadRoom; > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2429) LCE should blacklist based upon group
[ https://issues.apache.org/jira/browse/YARN-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481975#comment-14481975 ] Hudson commented on YARN-2429: -- FAILURE: Integrated in Hadoop-trunk-Commit #7516 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7516/]) YARN-2429. TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken (zxu via rkanter) (rkanter: rev 99b08a748e7b00a58b63330b353902a6da6aae27) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestAMRMTokens.java * hadoop-yarn-project/CHANGES.txt > LCE should blacklist based upon group > - > > Key: YARN-2429 > URL: https://issues.apache.org/jira/browse/YARN-2429 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Allen Wittenauer > > It should be possible to list a group to ban, not just individual users. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3430) RMAppAttempt headroom data is missing in RM Web UI
[ https://issues.apache.org/jira/browse/YARN-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen resolved YARN-3430. --- Resolution: Fixed After pulling YARN-3273 into branch-2.7, committed this patch again into branch-2.7. > RMAppAttempt headroom data is missing in RM Web UI > -- > > Key: YARN-3430 > URL: https://issues.apache.org/jira/browse/YARN-3430 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager, webapp, yarn >Reporter: Xuan Gong >Assignee: Xuan Gong >Priority: Blocker > Fix For: 2.7.0 > > Attachments: YARN-3430.1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3273) Improve web UI to facilitate scheduling analysis and debugging
[ https://issues.apache.org/jira/browse/YARN-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-3273: -- Fix Version/s: (was: 2.8.0) 2.7.0 Merged the commit to branch-2.7. > Improve web UI to facilitate scheduling analysis and debugging > -- > > Key: YARN-3273 > URL: https://issues.apache.org/jira/browse/YARN-3273 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Rohith > Fix For: 2.7.0 > > Attachments: 0001-YARN-3273-v1.patch, 0001-YARN-3273-v2.patch, > 0002-YARN-3273.patch, 0003-YARN-3273.patch, 0003-YARN-3273.patch, > 0004-YARN-3273.patch, YARN-3273-am-resource-used-AND-User-limit-v2.PNG, > YARN-3273-am-resource-used-AND-User-limit.PNG, > YARN-3273-application-headroom-v2.PNG, YARN-3273-application-headroom.PNG > > > Job may be stuck for reasons such as: > - hitting queue capacity > - hitting user-limit, > - hitting AM-resource-percentage > The first queueCapacity is already shown on the UI. > We may surface things like: > - what is user's current usage and user-limit; > - what is the AM resource usage and limit; > - what is the application's current HeadRoom; > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3429) TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken
[ https://issues.apache.org/jira/browse/YARN-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481963#comment-14481963 ] Robert Kanter commented on YARN-3429: - Thanks Zhihai. Committed to trunk and branch-2! > TestAMRMTokens.testTokenExpiry fails Intermittently with error > message:Invalid AMRMToken > > > Key: YARN-3429 > URL: https://issues.apache.org/jira/browse/YARN-3429 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: zhihai xu >Assignee: zhihai xu > Fix For: 2.8.0 > > Attachments: YARN-3429.000.patch > > > TestAMRMTokens.testTokenExpiry fails Intermittently with error > message:Invalid AMRMToken from appattempt_1427804754787_0001_01 > The error logs is at > https://builds.apache.org/job/PreCommit-YARN-Build/7172//testReport/org.apache.hadoop.yarn.server.resourcemanager.security/TestAMRMTokens/testTokenExpiry_1_/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3429) TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken
[ https://issues.apache.org/jira/browse/YARN-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481934#comment-14481934 ] Robert Kanter commented on YARN-3429: - +1 > TestAMRMTokens.testTokenExpiry fails Intermittently with error > message:Invalid AMRMToken > > > Key: YARN-3429 > URL: https://issues.apache.org/jira/browse/YARN-3429 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-3429.000.patch > > > TestAMRMTokens.testTokenExpiry fails Intermittently with error > message:Invalid AMRMToken from appattempt_1427804754787_0001_01 > The error logs is at > https://builds.apache.org/job/PreCommit-YARN-Build/7172//testReport/org.apache.hadoop.yarn.server.resourcemanager.security/TestAMRMTokens/testTokenExpiry_1_/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3443) Create a 'ResourceHandler' subsystem to ease addition of support for new resource types on the NM
[ https://issues.apache.org/jira/browse/YARN-3443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sidharta Seethana updated YARN-3443: Attachment: YARN-3443.003.patch Patch that incorporates code review feedback from [~vvasudev] > Create a 'ResourceHandler' subsystem to ease addition of support for new > resource types on the NM > - > > Key: YARN-3443 > URL: https://issues.apache.org/jira/browse/YARN-3443 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Sidharta Seethana >Assignee: Sidharta Seethana > Attachments: YARN-3443.001.patch, YARN-3443.002.patch, > YARN-3443.003.patch > > > The current cgroups implementation is closely tied to supporting CPU as a > resource . We need to separate out CGroups support as well a provide a simple > ResourceHandler subsystem that will enable us to add support for new resource > types on the NM - e.g Network, Disk etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3426) Add jdiff support to YARN
[ https://issues.apache.org/jira/browse/YARN-3426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-3426: Attachment: YARN-3426-040615.patch In this patch I added jdiff support to YARN maven file. We're checking API compatibility for yarn-api, yarn-common, yarn-client and yarn-server-common now. I'm also attaching standard API file for those four components. jdiff result is generated via {{mvn package -Pdocs}}, same as hadoop-common and hadoop-hdfs. > Add jdiff support to YARN > - > > Key: YARN-3426 > URL: https://issues.apache.org/jira/browse/YARN-3426 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu >Priority: Blocker > Attachments: YARN-3426-040615.patch > > > Maybe we'd like to extend our current jdiff tool for hadoop-common and hdfs > to YARN as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3453) Fair Scheduler : Parts of preemption logic uses DefaultResourceCalculator even in DRF mode causing thrashing
Ashwin Shankar created YARN-3453: Summary: Fair Scheduler : Parts of preemption logic uses DefaultResourceCalculator even in DRF mode causing thrashing Key: YARN-3453 URL: https://issues.apache.org/jira/browse/YARN-3453 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.6.0 Reporter: Ashwin Shankar There are two places in the preemption code flow where DefaultResourceCalculator is used, even in DRF mode, which basically results in more resources getting preempted than needed, and those extra preempted containers aren’t even getting to the “starved” queue since the scheduling logic is based on DRF's calculator. Following are the two places: 1. {code:title=FSLeafQueue.java|borderStyle=solid} private boolean isStarved(Resource share) {code} A queue shouldn’t be marked as “starved” if the dominant resource usage is >= fair/minshare. 2. {code:title=FairScheduler.java|borderStyle=solid} protected Resource resToPreempt(FSLeafQueue sched, long curTime) {code} -- One more thing that I believe needs to change in DRF mode is: during a preemption round, if preempting a few containers results in satisfying the needs of a resource type, then we should exit that preemption round, since the containers that we just preempted should bring the dominant resource usage to min/fair share. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
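For illustration, here is a minimal, self-contained sketch of the kind of dominant-resource comparison being asked for above. It is not taken from any patch on this jira; the class, method, and variable names are made up, and only the YARN resource-calculator utilities (DominantResourceCalculator, Resources) are real.

{code:title=DrfStarvationCheck.java (illustrative sketch)|borderStyle=solid}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.DominantResourceCalculator;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;
import org.apache.hadoop.yarn.util.resource.Resources;

public class DrfStarvationCheck {

  // A queue should only be considered starved if its usage is below the
  // share according to the configured calculator (dominant share under DRF),
  // not according to a memory-only comparison.
  static boolean isStarved(ResourceCalculator calc, Resource cluster,
      Resource usage, Resource share) {
    return Resources.lessThan(calc, cluster, usage, share);
  }

  public static void main(String[] args) {
    ResourceCalculator drf = new DominantResourceCalculator();
    Resource cluster = Resources.createResource(100 * 1024, 100);
    Resource share = Resources.createResource(10 * 1024, 10);
    // Memory usage is below the share, but vcores (the dominant resource)
    // are above it, so under DRF this queue is not starved.
    Resource usage = Resources.createResource(8 * 1024, 12);
    System.out.println(isStarved(drf, cluster, usage, share)); // prints false
  }
}
{code}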
[jira] [Created] (YARN-3452) Bogus token usernames cause many invalid group lookups
Jason Lowe created YARN-3452: Summary: Bogus token usernames cause many invalid group lookups Key: YARN-3452 URL: https://issues.apache.org/jira/browse/YARN-3452 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Jason Lowe YARN uses a number of bogus usernames for tokens, like application attempt IDs for NM tokens or even the hardcoded "testing" for the container localizer token. These tokens cause the RPC layer to do group lookups on these bogus usernames which will never succeed but can take a long time to perform. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp
[ https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481881#comment-14481881 ] Yongjun Zhang commented on YARN-3021: - Hi [~jianhe], Thanks for taking a further look. No worries about the delay, I guessed you were out. About your comment, the code
{code}
private void collectDelegationTokens(final String renewer,
    final Credentials credentials,
    final List<Token<?>> tokens) throws IOException {
  final String serviceName = getCanonicalServiceName();
  // Collect token of the this filesystem and then of its embedded children
  if (serviceName != null) { // fs has token, grab it
    final Text service = new Text(serviceName);
    Token<?> token = credentials.getToken(service);   <===
    if (token == null) {
      token = getDelegationToken(renewer);
      if (token != null) {
        tokens.add(token);
        credentials.addToken(service, token);
      }
    }
  }
{code}
The line highlighted with "<===" indicates that a token could be retrieved from the token map. In this case, are we sure that they always have a non-empty renewer? In addition, it's possible that we might change the {{skipTokenRenewer}} method in the future to do some additional checking. It seems safer to have this check. Do you think we should just keep this check? Thanks. > YARN's delegation-token handling disallows certain trust setups to operate > properly over DistCp > --- > > Key: YARN-3021 > URL: https://issues.apache.org/jira/browse/YARN-3021 > Project: Hadoop YARN > Issue Type: Bug > Components: security >Affects Versions: 2.3.0 >Reporter: Harsh J >Assignee: Yongjun Zhang > Attachments: YARN-3021.001.patch, YARN-3021.002.patch, > YARN-3021.003.patch, YARN-3021.004.patch, YARN-3021.005.patch, > YARN-3021.006.patch, YARN-3021.patch > > > Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, > and B trusts COMMON (one way trusts both), and both A and B run HDFS + YARN > clusters. > Now if one logs in with a COMMON credential, and runs a job on A's YARN that > needs to access B's HDFS (such as a DistCp), the operation fails in the RM, > as it attempts a renewDelegationToken(…) synchronously during application > submission (to validate the managed token before it adds it to a scheduler > for automatic renewal). The call obviously fails cause B realm will not trust > A's credentials (here, the RM's principal is the renewer). > In the 1.x JobTracker the same call is present, but it is done asynchronously > and once the renewal attempt failed we simply ceased to schedule any further > attempts of renewals, rather than fail the job immediately. > We should change the logic such that we attempt the renewal but go easy on > the failure and skip the scheduling alone, rather than bubble back an error > to the client, failing the app submission. This way the old behaviour is > retained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2890) MiniMRYarnCluster should turn on timeline service if configured to do so
[ https://issues.apache.org/jira/browse/YARN-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai updated YARN-2890: Attachment: YARN-2890.4.patch [~hitesh], Thanks for the comments. Attached updated patch. Created a new test file TestMiniYarnCluster that tests the starting of the timeline server based on the configuration and the enableAHS flag. > MiniMRYarnCluster should turn on timeline service if configured to do so > > > Key: YARN-2890 > URL: https://issues.apache.org/jira/browse/YARN-2890 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Mit Desai >Assignee: Mit Desai > Attachments: YARN-2890.1.patch, YARN-2890.2.patch, YARN-2890.3.patch, > YARN-2890.4.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch, > YARN-2890.patch, YARN-2890.patch > > > Currently the MiniMRYarnCluster does not consider the configuration value for > enabling timeline service before starting. The MiniYarnCluster should only > start the timeline service if it is configured to do so. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
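As a point of reference, the following is a minimal sketch (not taken from the attached patch) of how a test might opt in to the timeline service before starting the mini cluster; the test name and structure are illustrative, while YarnConfiguration.TIMELINE_SERVICE_ENABLED and MiniYARNCluster are the real YARN classes and keys.

{code:title=TimelineEnabledMiniCluster.java (illustrative sketch)|borderStyle=solid}
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.MiniYARNCluster;

public class TimelineEnabledMiniCluster {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();
    // Opt in to the timeline service; with the fix discussed above, the mini
    // cluster should only start the timeline server when this is set (or the
    // enableAHS flag is passed).
    conf.setBoolean(YarnConfiguration.TIMELINE_SERVICE_ENABLED, true);

    MiniYARNCluster cluster = new MiniYARNCluster("timeline-test", 1, 1, 1);
    cluster.init(conf);
    cluster.start();
    // ... exercise the cluster ...
    cluster.stop();
  }
}
{code}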
[jira] [Commented] (YARN-3273) Improve web UI to facilitate scheduling analysis and debugging
[ https://issues.apache.org/jira/browse/YARN-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481673#comment-14481673 ] Zhijie Shen commented on YARN-3273: --- Thanks for your confirmation, Jian! Will do it. > Improve web UI to facilitate scheduling analysis and debugging > -- > > Key: YARN-3273 > URL: https://issues.apache.org/jira/browse/YARN-3273 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Rohith > Fix For: 2.8.0 > > Attachments: 0001-YARN-3273-v1.patch, 0001-YARN-3273-v2.patch, > 0002-YARN-3273.patch, 0003-YARN-3273.patch, 0003-YARN-3273.patch, > 0004-YARN-3273.patch, YARN-3273-am-resource-used-AND-User-limit-v2.PNG, > YARN-3273-am-resource-used-AND-User-limit.PNG, > YARN-3273-application-headroom-v2.PNG, YARN-3273-application-headroom.PNG > > > Job may be stuck for reasons such as: > - hitting queue capacity > - hitting user-limit, > - hitting AM-resource-percentage > The first queueCapacity is already shown on the UI. > We may surface things like: > - what is user's current usage and user-limit; > - what is the AM resource usage and limit; > - what is the application's current HeadRoom; > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3449) Recover appTokenKeepAliveMap upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-3449. -- Resolution: Invalid Assignee: (was: Junping Du) > Recover appTokenKeepAliveMap upon nodemanager restart > - > > Key: YARN-3449 > URL: https://issues.apache.org/jira/browse/YARN-3449 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.6.0, 2.7.0 >Reporter: Junping Du > > appTokenKeepAliveMap in NodeStatusUpdaterImpl is used to keep application > alive after application is finished but NM still need app token to do log > aggregation (when enable security and log aggregation). > The applications are only inserted into this map when receiving > getApplicationsToCleanup() from RM heartbeat response. And RM only send this > info one time in RMNodeImpl.updateNodeHeartbeatResponseForCleanup(). NM > restart work preserving should put appTokenKeepAliveMap into NMStateStore and > get recovered after restart. Without doing this, RM could terminate > application earlier, so log aggregation could be failed if security is > enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3449) Recover appTokenKeepAliveMap upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481606#comment-14481606 ] Junping Du commented on YARN-3449: -- bq. Again when the NM re-registers it will report all active applications, and the RM will attempt to correct this on the next heartbeat. You are right, [~jlowe]. I think I missed that CLEANUP_APP would be resent on node reconnection (I had totally forgotten it for some reason). So that shouldn't be a problem. BTW, I didn't see any actual failure from this, so I will resolve it as invalid. > Recover appTokenKeepAliveMap upon nodemanager restart > - > > Key: YARN-3449 > URL: https://issues.apache.org/jira/browse/YARN-3449 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.6.0, 2.7.0 >Reporter: Junping Du >Assignee: Junping Du > > appTokenKeepAliveMap in NodeStatusUpdaterImpl is used to keep application > alive after application is finished but NM still need app token to do log > aggregation (when enable security and log aggregation). > The applications are only inserted into this map when receiving > getApplicationsToCleanup() from RM heartbeat response. And RM only send this > info one time in RMNodeImpl.updateNodeHeartbeatResponseForCleanup(). NM > restart work preserving should put appTokenKeepAliveMap into NMStateStore and > get recovered after restart. Without doing this, RM could terminate > application earlier, so log aggregation could be failed if security is > enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3273) Improve web UI to facilitate scheduling analysis and debugging
[ https://issues.apache.org/jira/browse/YARN-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481557#comment-14481557 ] Jian He commented on YARN-3273: --- sure, please go ahead. > Improve web UI to facilitate scheduling analysis and debugging > -- > > Key: YARN-3273 > URL: https://issues.apache.org/jira/browse/YARN-3273 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Rohith > Fix For: 2.8.0 > > Attachments: 0001-YARN-3273-v1.patch, 0001-YARN-3273-v2.patch, > 0002-YARN-3273.patch, 0003-YARN-3273.patch, 0003-YARN-3273.patch, > 0004-YARN-3273.patch, YARN-3273-am-resource-used-AND-User-limit-v2.PNG, > YARN-3273-am-resource-used-AND-User-limit.PNG, > YARN-3273-application-headroom-v2.PNG, YARN-3273-application-headroom.PNG > > > Job may be stuck for reasons such as: > - hitting queue capacity > - hitting user-limit, > - hitting AM-resource-percentage > The first queueCapacity is already shown on the UI. > We may surface things like: > - what is user's current usage and user-limit; > - what is the AM resource usage and limit; > - what is the application's current HeadRoom; > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-197) Add a separate log server
[ https://issues.apache.org/jira/browse/YARN-197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481525#comment-14481525 ] Siddharth Seth commented on YARN-197: - Yes, as long as the logs are being served out by a sub-system other than the MapReduce history server. > Add a separate log server > - > > Key: YARN-197 > URL: https://issues.apache.org/jira/browse/YARN-197 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Siddharth Seth > > Currently, the job history server is being used for log serving. A separate > log server can be added which can deal with serving logs, along with other > functionality like log retention, merging, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3404) View the queue name to YARN Application page
[ https://issues.apache.org/jira/browse/YARN-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481511#comment-14481511 ] Jian He commented on YARN-3404: --- [~ryu_kobayashi], thanks for the patch! Could you also add a link for the queue name to the actual scheduler queue page? Similarly, the existing user name can also be a link to the "Active Users Info" on the scheduler page. > View the queue name to YARN Application page > > > Key: YARN-3404 > URL: https://issues.apache.org/jira/browse/YARN-3404 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Ryu Kobayashi >Assignee: Ryu Kobayashi >Priority: Minor > Attachments: YARN-3404.1.patch, screenshot.png > > > It would be useful to display the name of the queue that is used on the YARN Application > page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp
[ https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481496#comment-14481496 ] Jian He commented on YARN-3021: --- [~yzhangal], I was out last couple weeks. sorry for the late response. Patch looks good overall, one comment: the {{skipTokenRenewal(token)}} check in {{requestNewHdfsDelegationToken}} may be not needed because it's explicitly passing {{UserGroupInformation.getLoginUser().getUserName()}} as the renewer, and so the token "renewer" won't be empty. > YARN's delegation-token handling disallows certain trust setups to operate > properly over DistCp > --- > > Key: YARN-3021 > URL: https://issues.apache.org/jira/browse/YARN-3021 > Project: Hadoop YARN > Issue Type: Bug > Components: security >Affects Versions: 2.3.0 >Reporter: Harsh J >Assignee: Yongjun Zhang > Attachments: YARN-3021.001.patch, YARN-3021.002.patch, > YARN-3021.003.patch, YARN-3021.004.patch, YARN-3021.005.patch, > YARN-3021.006.patch, YARN-3021.patch > > > Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, > and B trusts COMMON (one way trusts both), and both A and B run HDFS + YARN > clusters. > Now if one logs in with a COMMON credential, and runs a job on A's YARN that > needs to access B's HDFS (such as a DistCp), the operation fails in the RM, > as it attempts a renewDelegationToken(…) synchronously during application > submission (to validate the managed token before it adds it to a scheduler > for automatic renewal). The call obviously fails cause B realm will not trust > A's credentials (here, the RM's principal is the renewer). > In the 1.x JobTracker the same call is present, but it is done asynchronously > and once the renewal attempt failed we simply ceased to schedule any further > attempts of renewals, rather than fail the job immediately. > We should change the logic such that we attempt the renewal but go easy on > the failure and skip the scheduling alone, rather than bubble back an error > to the client, failing the app submission. This way the old behaviour is > retained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
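To make the renewer point above concrete, here is a small, self-contained sketch (not code from the patch) of the pattern being described: fetching HDFS delegation tokens with the login user explicitly named as the renewer, so the resulting tokens never carry an empty renewer. The class name is made up; FileSystem.addDelegationTokens and UserGroupInformation are the real Hadoop APIs.

{code:title=ExplicitRenewerExample.java (illustrative sketch)|borderStyle=solid}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;

public class ExplicitRenewerExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Credentials credentials = new Credentials();
    // Pass the login user as the renewer, mirroring what
    // requestNewHdfsDelegationToken is described as doing above.
    String renewer = UserGroupInformation.getLoginUser().getUserName();
    FileSystem fs = FileSystem.get(conf);
    Token<?>[] tokens = fs.addDelegationTokens(renewer, credentials);
    if (tokens != null) {
      for (Token<?> t : tokens) {
        System.out.println("Obtained token for service: " + t.getService());
      }
    }
  }
}
{code}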
[jira] [Commented] (YARN-3443) Create a 'ResourceHandler' subsystem to ease addition of support for new resource types on the NM
[ https://issues.apache.org/jira/browse/YARN-3443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481471#comment-14481471 ] Sidharta Seethana commented on YARN-3443: - Thanks for the review, [~vvasudev]. Responses below: 1. I could change such log lines to use StringBuffer everywhere. However, I think a metric based on calls to LOG.warn()/LOG.error() does not accurately reflect the warn/error 'event' count. 2. I'll add an info log line to the if block. I had added this block for better readability since the behavior here is a little different from the current implementation in CgroupsLCEResourcesHandler 3. - 4. Same comment as 1. 5. Sure, I can make this change. 6. This is by design. Otherwise, every resource handler implementation that uses cgroups will have to check if cgroup mounting is enabled or not (which is error-prone). It seemed better to instead ignore a mount request when cgroup mounting is disabled. 7. I'll fix it. 8. I'll fix it. 9. I'll fix it. 10. Yikes. Yes, I'll fix it. Not sure how this one got through. 11. I had added it for clarity, but maybe it isn't necessary. I'll remove it. 12. I'll fix it. 13. I applied a formatter to all files before creating the patch - but I'll verify. > Create a 'ResourceHandler' subsystem to ease addition of support for new > resource types on the NM > - > > Key: YARN-3443 > URL: https://issues.apache.org/jira/browse/YARN-3443 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Sidharta Seethana >Assignee: Sidharta Seethana > Attachments: YARN-3443.001.patch, YARN-3443.002.patch > > > The current cgroups implementation is closely tied to supporting CPU as a > resource . We need to separate out CGroups support as well a provide a simple > ResourceHandler subsystem that will enable us to add support for new resource > types on the NM - e.g Network, Disk etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
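As an aside on item 1 above, the following is a small illustrative sketch (not code from the patch) of the guarded, StringBuffer-based log construction being discussed; the class and method names are made up, and only the commons-logging API is real.

{code:title=GuardedWarnLogging.java (illustrative sketch)|borderStyle=solid}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class GuardedWarnLogging {
  private static final Log LOG = LogFactory.getLog(GuardedWarnLogging.class);

  // Build the message with a StringBuffer and guard the call, so the string
  // is only assembled when WARN logging is actually enabled.
  static void warnIgnoredMount(String controller, String path) {
    if (LOG.isWarnEnabled()) {
      StringBuffer msg = new StringBuffer("cgroup mount request ignored: ")
          .append("controller=").append(controller)
          .append(", path=").append(path);
      LOG.warn(msg.toString());
    }
  }

  public static void main(String[] args) {
    warnIgnoredMount("net_cls", "/sys/fs/cgroup/net_cls");
  }
}
{code}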
[jira] [Commented] (YARN-3110) Few issues in ApplicationHistory web ui
[ https://issues.apache.org/jira/browse/YARN-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481450#comment-14481450 ] Hadoop QA commented on YARN-3110: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12723370/YARN-3110.20150406-1.patch against trunk revision 28bebc8. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7225//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7225//console This message is automatically generated. > Few issues in ApplicationHistory web ui > --- > > Key: YARN-3110 > URL: https://issues.apache.org/jira/browse/YARN-3110 > Project: Hadoop YARN > Issue Type: Sub-task > Components: applications, timelineserver >Affects Versions: 2.6.0 >Reporter: Bibin A Chundatt >Assignee: Naganarasimha G R >Priority: Minor > Attachments: YARN-3110.20150209-1.patch, YARN-3110.20150315-1.patch, > YARN-3110.20150406-1.patch > > > Application state and History link wrong when Application is in unassigned > state > > 1.Configure capacity schedular with queue size as 1 also max Absolute Max > Capacity: 10.0% > (Current application state is Accepted and Unassigned from resource manager > side) > 2.Submit application to queue and check the state and link in Application > history > State= null and History link shown as N/A in applicationhistory page > Kill the same application . In timeline server logs the below is show when > selecting application link. > {quote} > 2015-01-29 15:39:50,956 ERROR org.apache.hadoop.yarn.webapp.View: Failed to > read the AM container of the application attempt > appattempt_1422467063659_0007_01. 
> java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getContainer(ApplicationHistoryManagerOnTimelineStore.java:162) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getAMContainer(ApplicationHistoryManagerOnTimelineStore.java:184) > at > org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:160) > at > org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:157) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at > org.apache.hadoop.yarn.server.webapp.AppBlock.render(AppBlock.java:156) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:67) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:77) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > at > org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) > at > org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845) > at > org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56) > at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82) > at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.AHSController.app(AHSController.java:38) > at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.jav
[jira] [Commented] (YARN-2666) TestFairScheduler.testContinuousScheduling fails Intermittently
[ https://issues.apache.org/jira/browse/YARN-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481433#comment-14481433 ] zhihai xu commented on YARN-2666: - thanks [~ywskycn] for the review and thanks [~ozawa] for reviewing and committing the patch! > TestFairScheduler.testContinuousScheduling fails Intermittently > --- > > Key: YARN-2666 > URL: https://issues.apache.org/jira/browse/YARN-2666 > Project: Hadoop YARN > Issue Type: Test > Components: scheduler >Reporter: Tsuyoshi Ozawa >Assignee: zhihai xu > Fix For: 2.8.0 > > Attachments: YARN-2666.000.patch > > > The test fails on trunk. > {code} > Tests run: 79, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.698 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler > testContinuousScheduling(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler) > Time elapsed: 0.582 sec <<< FAILURE! > java.lang.AssertionError: expected:<2> but was:<1> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler.testContinuousScheduling(TestFairScheduler.java:3372) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3045) [Event producers] Implement NM writing container lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481425#comment-14481425 ] Naganarasimha G R commented on YARN-3045: - :) parallely had commented ... Will start working on this! > [Event producers] Implement NM writing container lifecycle events to ATS > > > Key: YARN-3045 > URL: https://issues.apache.org/jira/browse/YARN-3045 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Sangjin Lee >Assignee: Naganarasimha G R > > Per design in YARN-2928, implement NM writing container lifecycle events and > container system metrics to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3045) [Event producers] Implement NM writing container lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481412#comment-14481412 ] Naganarasimha G R commented on YARN-3045: - [~djp], As YARN-3334 is in, can I start with this jira ? > [Event producers] Implement NM writing container lifecycle events to ATS > > > Key: YARN-3045 > URL: https://issues.apache.org/jira/browse/YARN-3045 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Sangjin Lee >Assignee: Naganarasimha G R > > Per design in YARN-2928, implement NM writing container lifecycle events and > container system metrics to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3045) [Event producers] Implement NM writing container lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481410#comment-14481410 ] Zhijie Shen commented on YARN-3045: --- Container metrics publishing has been completed in YARN-3334, please continue the work around NM lifecycle events here. Change the title accordingly. > [Event producers] Implement NM writing container lifecycle events to ATS > > > Key: YARN-3045 > URL: https://issues.apache.org/jira/browse/YARN-3045 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Sangjin Lee >Assignee: Naganarasimha G R > > Per design in YARN-2928, implement NM writing container lifecycle events and > container system metrics to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3334) [Event Producers] NM TimelineClient container metrics posting to new timeline service.
[ https://issues.apache.org/jira/browse/YARN-3334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen resolved YARN-3334. --- Resolution: Fixed Fix Version/s: YARN-2928 Hadoop Flags: Reviewed Committed the patch to branch YARN-2928. Thanks for the patch, Junping! Thanks for review, Sangjin and Li! > [Event Producers] NM TimelineClient container metrics posting to new timeline > service. > -- > > Key: YARN-3334 > URL: https://issues.apache.org/jira/browse/YARN-3334 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: YARN-2928 >Reporter: Junping Du >Assignee: Junping Du > Fix For: YARN-2928 > > Attachments: YARN-3334-demo.patch, YARN-3334-v1.patch, > YARN-3334-v2.patch, YARN-3334-v3.patch, YARN-3334-v4.patch, > YARN-3334-v5.patch, YARN-3334-v6.patch, YARN-3334-v8.patch, YARN-3334.7.patch > > > After YARN-3039, we have service discovery mechanism to pass app-collector > service address among collectors, NMs and RM. In this JIRA, we will handle > service address setting for TimelineClients in NodeManager, and put container > metrics to the backend storage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3045) [Event producers] Implement NM writing container lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-3045: -- Summary: [Event producers] Implement NM writing container lifecycle events to ATS (was: [Event producers] Implement NM writing container lifecycle events and container system metrics to ATS) > [Event producers] Implement NM writing container lifecycle events to ATS > > > Key: YARN-3045 > URL: https://issues.apache.org/jira/browse/YARN-3045 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Sangjin Lee >Assignee: Naganarasimha G R > > Per design in YARN-2928, implement NM writing container lifecycle events and > container system metrics to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3334) [Event Producers] NM TimelineClient container metrics posting to new timeline service.
[ https://issues.apache.org/jira/browse/YARN-3334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-3334: -- Summary: [Event Producers] NM TimelineClient container metrics posting to new timeline service. (was: [Event Producers] NM TimelineClient life cycle handling and container metrics posting to new timeline service.) > [Event Producers] NM TimelineClient container metrics posting to new timeline > service. > -- > > Key: YARN-3334 > URL: https://issues.apache.org/jira/browse/YARN-3334 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: YARN-2928 >Reporter: Junping Du >Assignee: Junping Du > Attachments: YARN-3334-demo.patch, YARN-3334-v1.patch, > YARN-3334-v2.patch, YARN-3334-v3.patch, YARN-3334-v4.patch, > YARN-3334-v5.patch, YARN-3334-v6.patch, YARN-3334-v8.patch, YARN-3334.7.patch > > > After YARN-3039, we have service discovery mechanism to pass app-collector > service address among collectors, NMs and RM. In this JIRA, we will handle > service address setting for TimelineClients in NodeManager, and put container > metrics to the backend storage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3110) Few issues in ApplicationHistory web ui
[ https://issues.apache.org/jira/browse/YARN-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-3110: Attachment: YARN-3110.20150406-1.patch attaching patch with rebased code > Few issues in ApplicationHistory web ui > --- > > Key: YARN-3110 > URL: https://issues.apache.org/jira/browse/YARN-3110 > Project: Hadoop YARN > Issue Type: Sub-task > Components: applications, timelineserver >Affects Versions: 2.6.0 >Reporter: Bibin A Chundatt >Assignee: Naganarasimha G R >Priority: Minor > Attachments: YARN-3110.20150209-1.patch, YARN-3110.20150315-1.patch, > YARN-3110.20150406-1.patch > > > Application state and History link wrong when Application is in unassigned > state > > 1.Configure capacity schedular with queue size as 1 also max Absolute Max > Capacity: 10.0% > (Current application state is Accepted and Unassigned from resource manager > side) > 2.Submit application to queue and check the state and link in Application > history > State= null and History link shown as N/A in applicationhistory page > Kill the same application . In timeline server logs the below is show when > selecting application link. > {quote} > 2015-01-29 15:39:50,956 ERROR org.apache.hadoop.yarn.webapp.View: Failed to > read the AM container of the application attempt > appattempt_1422467063659_0007_01. > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getContainer(ApplicationHistoryManagerOnTimelineStore.java:162) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getAMContainer(ApplicationHistoryManagerOnTimelineStore.java:184) > at > org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:160) > at > org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:157) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at > org.apache.hadoop.yarn.server.webapp.AppBlock.render(AppBlock.java:156) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:67) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:77) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > at > org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) > at > org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845) > at > org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56) > at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82) > at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.AHSController.app(AHSController.java:38) > at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) > at > com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263) > at > com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178) > at > 
com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) > at > com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) > at > com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) > at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java
[jira] [Commented] (YARN-3449) Recover appTokenKeepAliveMap upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481393#comment-14481393 ] Jason Lowe commented on YARN-3449: -- I believe the apps in the appTokenKeepAliveMap will be recovered per my first comment, but yes the relative delays stored in that map will not match what was there before. However I'm not sure it matters that we have the exact times in there. Again when the NM re-registers it will report all active applications, and the RM will attempt to correct this on the next heartbeat. The NM will then add all apps that are still aggregating to the appTokenKeepAliveMap and report that to the RM, and the RM will delay the token removal accordingly. I don't think this changes when the token is renewed on the RM, just when the token may be cancelled. Is this JIRA tracking an actual failure that occurred or a theoretical occurrence? > Recover appTokenKeepAliveMap upon nodemanager restart > - > > Key: YARN-3449 > URL: https://issues.apache.org/jira/browse/YARN-3449 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.6.0, 2.7.0 >Reporter: Junping Du >Assignee: Junping Du > > appTokenKeepAliveMap in NodeStatusUpdaterImpl is used to keep application > alive after application is finished but NM still need app token to do log > aggregation (when enable security and log aggregation). > The applications are only inserted into this map when receiving > getApplicationsToCleanup() from RM heartbeat response. And RM only send this > info one time in RMNodeImpl.updateNodeHeartbeatResponseForCleanup(). NM > restart work preserving should put appTokenKeepAliveMap into NMStateStore and > get recovered after restart. Without doing this, RM could terminate > application earlier, so log aggregation could be failed if security is > enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3334) [Event Producers] NM TimelineClient life cycle handling and container metrics posting to new timeline service.
[ https://issues.apache.org/jira/browse/YARN-3334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481371#comment-14481371 ] Zhijie Shen commented on YARN-3334: --- bq. Just filed YARN-3445 to track this issue. Yup, I think we can separate that issue. For this patch, the code comment is good for now. Will commit this patch. > [Event Producers] NM TimelineClient life cycle handling and container metrics > posting to new timeline service. > -- > > Key: YARN-3334 > URL: https://issues.apache.org/jira/browse/YARN-3334 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: YARN-2928 >Reporter: Junping Du >Assignee: Junping Du > Attachments: YARN-3334-demo.patch, YARN-3334-v1.patch, > YARN-3334-v2.patch, YARN-3334-v3.patch, YARN-3334-v4.patch, > YARN-3334-v5.patch, YARN-3334-v6.patch, YARN-3334-v8.patch, YARN-3334.7.patch > > > After YARN-3039, we have service discovery mechanism to pass app-collector > service address among collectors, NMs and RM. In this JIRA, we will handle > service address setting for TimelineClients in NodeManager, and put container > metrics to the backend storage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3448) Add Rolling Time To Lives Level DB Plugin Capabilities
[ https://issues.apache.org/jira/browse/YARN-3448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481362#comment-14481362 ] Jason Lowe commented on YARN-3448: -- Thanks for the patch, Jonathan. Interesting approach, and this should drastically improve performance for retention processing. Some comments on the patch so far: I think the code would be easier to follow if we didn't abuse Map.Entry as a pair class to associate a WriteBatch to the corresponding DB. Creating a custom utility class to associate these would make the code a lot more readable than always needing to deduce that getKey() is a database and getValue() is a WriteBatch. The underlying database throws a runtime exception, and the existing leveldb store translates these to IOExceptions. I think we want to do the same here. For example, put has a try..finally block with no catch clauses yet the method says it does not throw exceptions like IOException. Arguably it should throw IOException when the database has an error. The original leveldb code had locking around entities but I don't see it here. Since updating entities often involves a read/modify/write operation on the database, are we sure it's OK to remove that synchronization? computeCheckMillis says it needs to be called synchronously, but it looks like it can be called without a lock via a number of routes, e.g.: put -> putIndex -> computeCurrentCheckMillis -> computeCheckMillis put -> putEntities -> computeCurrentCheckMillis -> computeCheckMillis These should probably be debug statements, otherwise I think they could be quite spammy in the server log. Also the latter one will always be followed by the former because of the loop and may not be that useful in practice, even at the debug level. {code} +LOG.info("Trying the db" + db); ... +LOG.info("Trying the previous db" + db); {code} This will NPE on entityUpdate if db is null, and the code explicity checks for that possibility: {code} + Map.Entry entityUpdate = entityUpdates.get(roundedStartTime); + if (entityUpdate == null) { +DB db = entitydb.getDBForStartTime(startAndInsertTime.startTime); +if (db != null) { + WriteBatch writeBatch = db.createWriteBatch(); + entityUpdate = new AbstractMap.SimpleImmutableEntry(db, writeBatch); + entityUpdates.put(roundedStartTime, entityUpdate); +}; + } + WriteBatch writeBatch = entityUpdate.getValue(); {code} In the following code we lookup relatedEntityUpdate but then after checking if it's null never use it again. I think we're supposed to be setting up relatedEntityUpdate in the block if it's null rather than re-assigning entityUpdate. Then after the null check we should be using relatedEntityUpdate rather than entityUpdate to get the proper write batch. {code} +Map.Entry relatedEntityUpdate = entityUpdates.get(relatedRoundedStartTime); +if (relatedEntityUpdate == null) { + DB db = entitydb.getDBForStartTime(relatedStartTimeLong); + if (db != null) { +WriteBatch relatedWriteBatch = db.createWriteBatch(); +entityUpdate = new AbstractMap.SimpleImmutableEntry( +db, relatedWriteBatch); +entityUpdates.put(relatedRoundedStartTime, entityUpdate); + } + ; +} +WriteBatch relatedWriteBatch = entityUpdate.getValue(); {code} This code is commented out. Should have been deleted or is there something left to do here with respect to related entitites? 
{code} +/* +for (EntityIdentifier relatedEntity : relatedEntitiesWithoutStartTimes) { + try { +StartAndInsertTime relatedEntityStartAndInsertTime = +getAndSetStartTime(relatedEntity.getId(), relatedEntity.getType(), +readReverseOrderedLong(revStartTime, 0), null); +if (relatedEntityStartAndInsertTime == null) { + throw new IOException("Error setting start time for related entity"); +} +byte[] relatedEntityStartTime = writeReverseOrderedLong( +relatedEntityStartAndInsertTime.startTime); + // This is the new entity, the domain should be the same +byte[] key = createDomainIdKey(relatedEntity.getId(), +relatedEntity.getType(), relatedEntityStartTime); +writeBatch.put(key, entity.getDomainId().getBytes()); +++putCount; +writeBatch.put(createRelatedEntityKey(relatedEntity.getId(), +relatedEntity.getType(), relatedEntityStartTime, +entity.getEntityId(), entity.getEntityType()), EMPTY_BYTES); +++putCount; +writeBatch.put(createEntityMarkerKey(relatedEntity.getId(), +relatedEntity.getType(), relatedEntityStartTime), +writeReverseOrderedLong(relatedEntityStartAndInsertTime +.ins
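As a concrete illustration of the first review point above (replacing the Map.Entry pair with a dedicated class), here is a minimal sketch; the class name and accessors are made up, only the org.iq80.leveldb types are real, and this is not code from the patch.

{code:title=DBUpdate.java (illustrative sketch)|borderStyle=solid}
import org.iq80.leveldb.DB;
import org.iq80.leveldb.WriteBatch;

/**
 * Associates a rolling-period leveldb instance with the WriteBatch being
 * accumulated against it, so call sites can read update.getDB() and
 * update.getWriteBatch() instead of deducing that Map.Entry's getKey() is a
 * database and getValue() is a write batch.
 */
class DBUpdate {
  private final DB db;
  private final WriteBatch writeBatch;

  DBUpdate(DB db) {
    this.db = db;
    this.writeBatch = db.createWriteBatch();
  }

  DB getDB() {
    return db;
  }

  WriteBatch getWriteBatch() {
    return writeBatch;
  }
}
{code}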
[jira] [Commented] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481357#comment-14481357 ] Hadoop QA commented on YARN-3044: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12723366/YARN-3044.20150406-1.patch against trunk revision 28bebc8. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7224//console This message is automatically generated. > [Event producers] Implement RM writing app lifecycle events to ATS > -- > > Key: YARN-3044 > URL: https://issues.apache.org/jira/browse/YARN-3044 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Sangjin Lee >Assignee: Naganarasimha G R > Attachments: YARN-3044.20150325-1.patch, YARN-3044.20150406-1.patch > > > Per design in YARN-2928, implement RM writing app lifecycle events to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-3044: Attachment: YARN-3044.20150406-1.patch On further thought, I have changed the approach for this jira (similar to the one mentioned by Zhijie), because creating a separate stack was not only inducing too many changes, but I was also skeptical that, in the future, removing the dependency on the timelineservice project from the RM project would again induce changes in the separate stack for the V2 approach. So the approach I have taken is: 1> SMP will exist and SMP creates RMTimelineCollector, and the RM will not be aware of RMTimelineCollector. 2> Segregate the V1 event handler code from SMP as TimelineV1Handler (simpler to remove V1 support in the future). 3> Modify SMP so that TimelineV1Handler or RMTimelineCollector (V2) will be created based on configuration; the appropriate Dispatchers are also selected accordingly (see the sketch after this comment). I have also incorporated the changes to support RMContainer metrics based on configuration (Junping's comments). Pending tasks: * Test cases for RMTimelineCollector are not complete, as I didn't want to take the approach of TestDistributedShell, whose tests mostly check whether the required files are created; for TestRMTimelineCollector, we need to check whether the entities are properly populated, which requires the Reader API, and that seems not to be finalized yet. Also, even using FileSystemTimelineWriterImpl for testing requires TimelineCollectorContext, hence the dependency on YARN-3390. * As mentioned earlier, AppConfig information is not completely available on the RM side, so I have currently populated the environment config available in RMApp. Shall I raise a new jira to support a method in the TimelineClient interface to load the app config? > [Event producers] Implement RM writing app lifecycle events to ATS > -- > > Key: YARN-3044 > URL: https://issues.apache.org/jira/browse/YARN-3044 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Sangjin Lee >Assignee: Naganarasimha G R > Attachments: YARN-3044.20150325-1.patch, YARN-3044.20150406-1.patch > > > Per design in YARN-2928, implement RM writing app lifecycle events to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
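The configuration-driven selection in point 3 above might look roughly like the following. This is only a sketch: the config key is hypothetical (the real switch and key would live in YarnConfiguration), and the handler classes are stubbed locally rather than taken from the patch.

{code:title=TimelinePublisherSelector.java (illustrative sketch)|borderStyle=solid}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class TimelinePublisherSelector {

  // Hypothetical key used only for this illustration.
  static final String TIMELINE_VERSION = "yarn.timeline-service.version";

  interface SystemMetricsHandler { }
  static class TimelineV1Handler implements SystemMetricsHandler { }
  static class RMTimelineCollector implements SystemMetricsHandler { }

  // SystemMetricsPublisher (SMP) would create the V1 handler or the V2
  // collector, plus the matching dispatcher, based on configuration.
  static SystemMetricsHandler createHandler(Configuration conf) {
    float version = conf.getFloat(TIMELINE_VERSION, 1.0f);
    return version >= 2.0f ? new RMTimelineCollector() : new TimelineV1Handler();
  }

  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    conf.setFloat(TIMELINE_VERSION, 2.0f);
    System.out.println(createHandler(conf).getClass().getSimpleName());
  }
}
{code}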
[jira] [Commented] (YARN-3449) Recover appTokenKeepAliveMap upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481330#comment-14481330 ] Junping Du commented on YARN-3449: -- Thanks [~jlowe] for replying with comments! I wasn't quite sure about this. However, from what I learnt from the code, it looks like we renew the delegation tokens on the RM side for finishing apps while the NM still needs them to do log aggregation. The way the NM keeps a token alive for log aggregation is to send appTokenKeepAliveMap in its heartbeat to the RM and keep the time value updated (currentTime + 0.7~0.9 * tokenRemovalDelayMs) in every heartbeat request/response. If appTokenKeepAliveMap doesn't get recovered after the NM is restarted, then the NM will never add these apps to the keep-alive list (appsToCleanup is only sent once by the RM) and the RM won't renew the token after the time expires (based on the last heartbeat request before the NM restart) because it won't receive any new messages from the NM about these apps. In practice, this issue doesn't show up often because tokenRemovalDelayMs is usually very large (10 minutes by default), and there are very few cases where the NM cannot finish log aggregation within this time (even counting NM restart time). However, we should still fix it because it makes the delegation-token renewal behavior inconsistent before and after NM restart (and could cause a bug, at least theoretically). Shouldn't we? > Recover appTokenKeepAliveMap upon nodemanager restart > - > > Key: YARN-3449 > URL: https://issues.apache.org/jira/browse/YARN-3449 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.6.0, 2.7.0 >Reporter: Junping Du >Assignee: Junping Du > > appTokenKeepAliveMap in NodeStatusUpdaterImpl is used to keep application > alive after application is finished but NM still need app token to do log > aggregation (when enable security and log aggregation). > The applications are only inserted into this map when receiving > getApplicationsToCleanup() from RM heartbeat response. And RM only send this > info one time in RMNodeImpl.updateNodeHeartbeatResponseForCleanup(). NM > restart work preserving should put appTokenKeepAliveMap into NMStateStore and > get recovered after restart. Without doing this, RM could terminate > application earlier, so log aggregation could be failed if security is > enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
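For reference, the keep-alive expiry described above (currentTime + 0.7~0.9 * tokenRemovalDelayMs) amounts to something like the following sketch; the class, method, and variable names are illustrative, not taken from NodeStatusUpdaterImpl.

{code:title=KeepAliveDeadline.java (illustrative sketch)|borderStyle=solid}
import java.util.Random;

public class KeepAliveDeadline {

  // Keep-alive expiry as described above: now + (0.7..0.9) * tokenRemovalDelayMs.
  static long nextKeepAliveTime(long nowMs, long tokenRemovalDelayMs, Random r) {
    double factor = 0.7 + 0.2 * r.nextDouble();
    return nowMs + (long) (factor * tokenRemovalDelayMs);
  }

  public static void main(String[] args) {
    long tokenRemovalDelayMs = 10 * 60 * 1000L; // the 10-minute default cited above
    long deadline = nextKeepAliveTime(System.currentTimeMillis(),
        tokenRemovalDelayMs, new Random());
    System.out.println("Keep the app token alive until ~" + deadline);
  }
}
{code}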
[jira] [Commented] (YARN-1376) NM need to notify the log aggregation status to RM through Node heartbeat
[ https://issues.apache.org/jira/browse/YARN-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481327#comment-14481327 ] Hadoop QA commented on YARN-1376: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12723354/YARN-1376.2015-04-06.patch against trunk revision 53959e6. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7223//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7223//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7223//console This message is automatically generated. > NM need to notify the log aggregation status to RM through Node heartbeat > - > > Key: YARN-1376 > URL: https://issues.apache.org/jira/browse/YARN-1376 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Xuan Gong >Assignee: Xuan Gong > Attachments: YARN-1376.1.patch, YARN-1376.2.patch, YARN-1376.2.patch, > YARN-1376.2015-04-04.patch, YARN-1376.2015-04-06.patch, YARN-1376.3.patch, > YARN-1376.4.patch > > > Expose a client API to allow clients to figure out whether log aggregation is > complete. This ticket tracks the changes on the NM side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3449) Recover appTokenKeepAliveMap upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481249#comment-14481249 ] Jason Lowe commented on YARN-3449: -- While the NM is aggregating logs the application is still present in the state store, and the application should be recovered as still active after an NM restart. The NM will then register with those applications listed as still active. When the RM later tells the NM that those applications should be cleaned up, the applications should be added to the keep alive list as normal. Thus I think the appTokenKeepAliveMap state should already be recovered properly without explicitly persisting it -- or am I missing something? > Recover appTokenKeepAliveMap upon nodemanager restart > - > > Key: YARN-3449 > URL: https://issues.apache.org/jira/browse/YARN-3449 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.6.0, 2.7.0 >Reporter: Junping Du >Assignee: Junping Du > > appTokenKeepAliveMap in NodeStatusUpdaterImpl is used to keep an application > alive after the application is finished but the NM still needs the app token to do log > aggregation (when security and log aggregation are enabled). > Applications are only inserted into this map when getApplicationsToCleanup() > is received in the RM heartbeat response, and the RM only sends this > info once, in RMNodeImpl.updateNodeHeartbeatResponseForCleanup(). NM > restart work preserving should put appTokenKeepAliveMap into NMStateStore and > recover it after restart. Without doing this, the RM could terminate the > application earlier, so log aggregation could fail if security is > enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1376) NM need to notify the log aggregation status to RM through Node heartbeat
[ https://issues.apache.org/jira/browse/YARN-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-1376: Attachment: YARN-1376.2015-04-06.patch > NM need to notify the log aggregation status to RM through Node heartbeat > - > > Key: YARN-1376 > URL: https://issues.apache.org/jira/browse/YARN-1376 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Xuan Gong >Assignee: Xuan Gong > Attachments: YARN-1376.1.patch, YARN-1376.2.patch, YARN-1376.2.patch, > YARN-1376.2015-04-04.patch, YARN-1376.2015-04-06.patch, YARN-1376.3.patch, > YARN-1376.4.patch > > > Expose a client API to allow clients to figure out whether log aggregation is > complete. This ticket tracks the changes on the NM side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1376) NM need to notify the log aggregation status to RM through Node heartbeat
[ https://issues.apache.org/jira/browse/YARN-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481239#comment-14481239 ] Xuan Gong commented on YARN-1376: - Fixed the -1 on release audit. The -1 on findbugs is not related. > NM need to notify the log aggregation status to RM through Node heartbeat > - > > Key: YARN-1376 > URL: https://issues.apache.org/jira/browse/YARN-1376 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Xuan Gong >Assignee: Xuan Gong > Attachments: YARN-1376.1.patch, YARN-1376.2.patch, YARN-1376.2.patch, > YARN-1376.2015-04-04.patch, YARN-1376.2015-04-06.patch, YARN-1376.3.patch, > YARN-1376.4.patch > > > Expose a client API to allow clients to figure out whether log aggregation is > complete. This ticket tracks the changes on the NM side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
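To make the intent of YARN-1376 concrete, here is a rough sketch of what reporting log aggregation status in the node heartbeat could look like. The enum values and field name are assumptions for illustration only; the actual shape is whatever the attached patches define.
{code:java}
import java.util.HashMap;
import java.util.Map;

// Sketch only: illustrative types, not the YARN-1376 patch.
class LogAggregationHeartbeatSketch {

  enum LogAggregationStatus { NOT_STARTED, RUNNING, SUCCEEDED, FAILED, TIME_OUT }

  // appId -> current log aggregation status, piggy-backed on each NM->RM heartbeat so the
  // RM (and, through a client API, users) can tell whether aggregation for an app is done.
  private final Map<String, LogAggregationStatus> logAggregationReports = new HashMap<>();

  void report(String appId, LogAggregationStatus status) {
    logAggregationReports.put(appId, status);
  }

  Map<String, LogAggregationStatus> getReportsForHeartbeat() {
    return new HashMap<>(logAggregationReports);
  }
}
{code}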
[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481225#comment-14481225 ] Daryn Sharp commented on YARN-3055: --- Correctly handling the "don't cancel" setting for jobs that submit other jobs has been a recurring issue. We're internally testing a small patch to continue renewing until all jobs using the token(s) have finished. Handling the auto-fetch of proxy tokens proved a bit more difficult, so I need to complete the internal patch. I can take this over or post a partial patch if [~hitliuyi] would like to finish it. > The token is not renewed properly if it's shared by jobs (oozie) in > DelegationTokenRenewer > -- > > Key: YARN-3055 > URL: https://issues.apache.org/jira/browse/YARN-3055 > Project: Hadoop YARN > Issue Type: Bug > Components: security >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Blocker > Attachments: YARN-3055.001.patch, YARN-3055.002.patch > > > After YARN-2964, there is only one timer to renew the token if it's shared by > jobs. > In {{removeApplicationFromRenewal}}, when we go to remove a token and the > token is shared by other jobs, we will not cancel the token. > Meanwhile, we should not cancel the _timerTask_, and we should not remove it > from {{allTokens}} either. Otherwise the token will no longer be renewed for the > existing submitted applications which share it, and for newly submitted > applications which share this token, the token will be renewed immediately. > For example, we have 3 applications: app1, app2, app3, and they share token1. > See the following scenario: > *1).* app1 is submitted first, then app2, and then app3. In this case, > there is only one token renewal timer for token1, and it is scheduled when app1 > is submitted. > *2).* app1 finishes, and the renewal timer is cancelled. token1 will not > be renewed any more, but app2 and app3 still use it, so there is a problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
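A small self-contained sketch of the reference-counting idea implied by the description above: a shared token keeps being renewed until the last application using it has finished. This is only an illustration of the concept, not the DelegationTokenRenewer code nor the patch Daryn mentions.
{code:java}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Concept sketch: track which applications still reference a shared token so the
// renewal timer is only stopped when the last one finishes.
class SharedTokenTracker {
  private final Map<String, Set<String>> tokenToApps = new HashMap<>();

  synchronized void register(String tokenId, String appId) {
    tokenToApps.computeIfAbsent(tokenId, t -> new HashSet<>()).add(appId);
  }

  /** Returns true only when no application references the token any more. */
  synchronized boolean unregister(String tokenId, String appId) {
    Set<String> apps = tokenToApps.get(tokenId);
    if (apps == null) {
      return true;
    }
    apps.remove(appId);
    if (apps.isEmpty()) {
      tokenToApps.remove(tokenId);
      return true;   // last user gone: safe to stop the renewal timer (and cancel, if allowed)
    }
    return false;    // other apps still depend on it: keep the renewal timer running
  }
}
{code}
In the scenario from the description, unregister(token1, app1) would return false, so the timer for token1 would keep running for app2 and app3.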
[jira] [Commented] (YARN-3110) Few issues in ApplicationHistory web ui
[ https://issues.apache.org/jira/browse/YARN-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481163#comment-14481163 ] Xuan Gong commented on YARN-3110: - [~Naganarasimha] The patch looks good, but it cannot be applied to trunk/branch-2 any more. Could you rebase it, please? > Few issues in ApplicationHistory web ui > --- > > Key: YARN-3110 > URL: https://issues.apache.org/jira/browse/YARN-3110 > Project: Hadoop YARN > Issue Type: Sub-task > Components: applications, timelineserver >Affects Versions: 2.6.0 >Reporter: Bibin A Chundatt >Assignee: Naganarasimha G R >Priority: Minor > Attachments: YARN-3110.20150209-1.patch, YARN-3110.20150315-1.patch > > > Application state and History link are wrong when the application is in the unassigned > state > > 1. Configure the capacity scheduler with queue size 1 and Absolute Max > Capacity: 10.0% > (the current application state is Accepted and Unassigned on the resource manager > side) > 2. Submit an application to the queue and check the state and link in Application > History: > State = null and the History link is shown as N/A on the applicationhistory page. > Kill the same application. In the timeline server logs, the following is shown when > selecting the application link. > {quote} > 2015-01-29 15:39:50,956 ERROR org.apache.hadoop.yarn.webapp.View: Failed to > read the AM container of the application attempt > appattempt_1422467063659_0007_01. > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getContainer(ApplicationHistoryManagerOnTimelineStore.java:162) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getAMContainer(ApplicationHistoryManagerOnTimelineStore.java:184) > at > org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:160) > at > org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:157) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at > org.apache.hadoop.yarn.server.webapp.AppBlock.render(AppBlock.java:156) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:67) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:77) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > at > org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) > at > org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845) > at > org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56) > at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82) > at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.AHSController.app(AHSController.java:38) > at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) > at > com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263) > at > 
com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178) > at > com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) > at > com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) > at > com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) > at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.lib.StaticUserWebFilter$St
[jira] [Created] (YARN-3451) Add start time and end time in ApplicationAttemptReport and display the same in RMAttemptBlock WebUI
Rohith created YARN-3451: Summary: Add start time and end time in ApplicationAttemptReport and display the same in RMAttemptBlock WebUI Key: YARN-3451 URL: https://issues.apache.org/jira/browse/YARN-3451 Project: Hadoop YARN Issue Type: Improvement Components: api, webapp Reporter: Rohith Assignee: Rohith Just as ApplicationReport and ApplicationBlock have *Started:* and *Elapsed:* time, it would be useful if the start time and elapsed time were included in ApplicationAttemptReport and displayed in ApplicationAttemptBlock. This gives granular debugging ability when analyzing issues with multiple attempt failures, e.g. an attempt that timed out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
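A hypothetical sketch of the two fields this improvement asks for and the elapsed-time derivation the attempt page would show; the actual API added by the eventual patch may look different.
{code:java}
// Hypothetical shape only: illustrates the requested start/finish times and the
// derived elapsed time; not the actual ApplicationAttemptReport API.
abstract class ApplicationAttemptTimesSketch {
  abstract long getStartTime();    // attempt start time, millis since the epoch
  abstract long getFinishTime();   // attempt finish time, or 0 if the attempt is still running

  /** Elapsed time as the attempt page would display it. */
  long getElapsedTime() {
    long finish = getFinishTime() == 0 ? System.currentTimeMillis() : getFinishTime();
    return finish - getStartTime();
  }
}
{code}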
[jira] [Commented] (YARN-3450) Application Master killed RPC Port AM Host not shown in CLI
[ https://issues.apache.org/jira/browse/YARN-3450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481040#comment-14481040 ] Bibin A Chundatt commented on YARN-3450: /applicationhistory/appattempt/appattempt_1428321793042_0005_01 In the Web UI the same is shown properly: Application Attempt Overview - State: FAILED, Master Container: container_1428321793042_0005_01_01, Node: host-10-19-92-143:49820, Tracking URL: History > Application Master killed RPC Port AM Host not shown in CLI > --- > > Key: YARN-3450 > URL: https://issues.apache.org/jira/browse/YARN-3450 > Project: Hadoop YARN > Issue Type: Bug > Components: client, resourcemanager >Reporter: Bibin A Chundatt >Assignee: Naganarasimha G R >Priority: Minor > > Start a Sleep job > Kill the Application Master process > Check the status of the application attempt > When the Application Master is killed, RPC Port and AM Host are not shown in the CLI > {quote} > dsperf@host-10-19-92-117:~/HADOPV1R2/install/hadoop/nodemanager/bin> {color:red} ./yarn applicationattempt -status > appattempt_1428321793042_0005_01 {color} > 15/04/06 13:40:52 INFO impl.TimelineClientImpl: Timeline service address: > http://10.19.92.127:8188/ws/v1/timeline/ > 15/04/06 13:40:52 INFO client.AHSProxy: Connecting to Application History > server at /10.19.92.127:45034 > 15/04/06 13:40:53 INFO client.ConfiguredRMFailoverProxyProvider: Failing over > to rm2 > Application Attempt Report : > ApplicationAttempt-Id : appattempt_1428321793042_0005_01 > State : FAILED > AMContainer : container_1428321793042_0005_01_01 > Tracking-URL : > http://host-10-19-92-127:45020/cluster/app/application_1428321793042_0005 > {color:red} > RPC Port : -1 > AM Host : N/A > {color} > Diagnostics : AM Container for appattempt_1428321793042_0005_01 > exited with exitCode: 137 > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3450) Application Master killed RPC Port AM Host not shown in CLI
Bibin A Chundatt created YARN-3450: -- Summary: Application Master killed RPC Port AM Host not shown in CLI Key: YARN-3450 URL: https://issues.apache.org/jira/browse/YARN-3450 Project: Hadoop YARN Issue Type: Bug Components: client, resourcemanager Reporter: Bibin A Chundatt Assignee: Naganarasimha G R Priority: Minor Start a Sleep job Kill the Application Master process Check the status of the application attempt When the Application Master is killed, RPC Port and AM Host are not shown in the CLI {quote} dsperf@host-10-19-92-117:~/HADOPV1R2/install/hadoop/nodemanager/bin> {color:red} ./yarn applicationattempt -status appattempt_1428321793042_0005_01 {color} 15/04/06 13:40:52 INFO impl.TimelineClientImpl: Timeline service address: http://10.19.92.127:8188/ws/v1/timeline/ 15/04/06 13:40:52 INFO client.AHSProxy: Connecting to Application History server at /10.19.92.127:45034 15/04/06 13:40:53 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2 Application Attempt Report : ApplicationAttempt-Id : appattempt_1428321793042_0005_01 State : FAILED AMContainer : container_1428321793042_0005_01_01 Tracking-URL : http://host-10-19-92-127:45020/cluster/app/application_1428321793042_0005 {color:red} RPC Port : -1 AM Host : N/A {color} Diagnostics : AM Container for appattempt_1428321793042_0005_01 exited with exitCode: 137 {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2666) TestFairScheduler.testContinuousScheduling fails Intermittently
[ https://issues.apache.org/jira/browse/YARN-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481037#comment-14481037 ] Tsuyoshi Ozawa commented on YARN-2666: -- +1. It's better to call scheduler.continuousSchedulingAttempt() instead of waiting for scheduling. Committing this shortly. > TestFairScheduler.testContinuousScheduling fails Intermittently > --- > > Key: YARN-2666 > URL: https://issues.apache.org/jira/browse/YARN-2666 > Project: Hadoop YARN > Issue Type: Test > Components: scheduler >Reporter: Tsuyoshi Ozawa >Assignee: zhihai xu > Attachments: YARN-2666.000.patch > > > The test fails on trunk. > {code} > Tests run: 79, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.698 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler > testContinuousScheduling(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler) > Time elapsed: 0.582 sec <<< FAILURE! > java.lang.AssertionError: expected:<2> but was:<1> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler.testContinuousScheduling(TestFairScheduler.java:3372) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
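A self-contained illustration of the pattern endorsed in the comment above: rather than sleeping and racing against a background continuous-scheduling thread, the test drives the scheduling pass itself. FakeScheduler below is a stand-in; the real test would call FairScheduler#continuousSchedulingAttempt() (the method named in the comment) the same way.
{code:java}
import static org.junit.Assert.assertEquals;

import org.junit.Test;

// Sketch of the deterministic-test pattern; not the actual TestFairScheduler change.
public class ContinuousSchedulingPatternTest {

  static class FakeScheduler {
    private int allocated;
    // Stand-in for FairScheduler.continuousSchedulingAttempt(): one scheduling pass.
    void continuousSchedulingAttempt() { allocated++; }
    int getAllocatedCount() { return allocated; }
  }

  @Test
  public void schedulingIsDrivenByTheTestNotByATimer() {
    FakeScheduler scheduler = new FakeScheduler();
    // Two explicit passes replace "sleep and hope the background thread ran twice",
    // so the assertion below no longer fails intermittently.
    scheduler.continuousSchedulingAttempt();
    scheduler.continuousSchedulingAttempt();
    assertEquals(2, scheduler.getAllocatedCount());
  }
}
{code}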