[jira] [Updated] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] q79969786 updated YARN-2198: Description: YARN-1972 introduces a Secure Windows Container Executor. However this executor requires a process launching the container to be LocalSystem or a member of the local Administrators group. Since the process in question is the NodeManager, the requirement translates to the entire NM to run as a privileged account, a very large surface area to review and protect. This proposal is to move the privileged operations into a dedicated NT service. The NM can run as a low privilege account and communicate with the privileged NT service when it needs to launch a container. This would reduce the surface exposed to the high privileges. There has to exist a secure, authenticated and authorized channel of communication between the NM and the privileged NT service. Possible alternatives are a new TCP endpoint, Java RPC etc. My proposal though would be to use Windows LPC (Local Procedure Calls), which is a Windows platform specific inter-process communication channel that satisfies all requirements and is easy to deploy. The privileged NT service would register and listen on an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with libwinutils which would host the LPC client code. The client would connect to the LPC port (NtConnectPort) and send a message requesting a container launch (NtRequestWaitReplyPort). LPC provides authentication and the privileged NT service can use authorization API (AuthZ) to validate the caller. was: YARN-1972 introduces a Secure Windows Container Executor. However this executor requires a the process launching the container to be LocalSystem or a member of the local Administrators group. Since the process in question is the NodeManager, the requirement translates to the entire NM to run as a privileged account, a very large surface area to review and protect. 
This proposal is to move the privileged operations into a dedicated NT service. The NM can run as a low privilege account and communicate with the privileged NT service when it needs to launch a container. This would reduce the surface exposed to the high privileges. There has to exist a secure, authenticated and authorized channel of communication between the NM and the privileged NT service. Possible alternatives are a new TCP endpoint, Java RPC etc. My proposal though would be to use Windows LPC (Local Procedure Calls), which is a Windows platform specific inter-process communication channel that satisfies all requirements and is easy to deploy. The privileged NT service would register and listen on an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with libwinutils which would host the LPC client code. The client would connect to the LPC port (NtConnectPort) and send a message requesting a container launch (NtRequestWaitReplyPort). LPC provides authentication and the privileged NT service can use authorization API (AuthZ) to validate the caller. > Remove the need to run NodeManager as privileged account for Windows Secure > Container Executor > -- > > Key: YARN-2198 > URL: https://issues.apache.org/jira/browse/YARN-2198 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Remus Rusanu >Assignee: Remus Rusanu > Labels: security, windows > Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, > YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, > YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, > YARN-2198.separation.patch, YARN-2198.trunk.10.patch, > YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, > YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch > > > YARN-1972 introduces a Secure Windows Container Executor. However this > executor requires a process launching the container to be LocalSystem or a > member of the local Administrators group. 
Since the process in question is > the NodeManager, the requirement translates to the entire NM to run as a > privileged account, a very large surface area to review and protect. > This proposal is to move the privileged operations into a dedicated NT > service. The NM can run as a low privilege account and communicate with the > privileged NT service when it needs to launch a container. This would reduce > the surface exposed to the high privileges. > There has to exist a secure, authenticated and authorized channel of > communication between the NM and the privileged NT service. Possible > alternatives are a new TCP endpoint, Java RPC etc. My proposal though would > be to use Windows LPC (Local Procedure Calls), which is a Wi
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151466#comment-14151466 ] Jun Gong commented on YARN-2617: [~jianhe], thank you for the review! {quote} I think we should explicitly check if apps are at FINISHING_CONTAINERS_WAIT/APPLICATION_RESOURCES_CLEANINGUP/FINISHED state. {quote} My concern is that we will need to modify the code whenever we add a new state to ApplicationImpl. If that is not a problem, it is OK with me. BTW: is there any case where an APP has containers but is not in the RUNNING state? {quote} The code needs to be moved inside the following check {{ if (containerStatus.getState().equals(ContainerState.COMPLETE))}} ... {quote} OK. I will change it. And I will add a unit test. > NM does not need to send finished container whose APP is not running to RM > -- > > Key: YARN-2617 > URL: https://issues.apache.org/jira/browse/YARN-2617 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: Jun Gong >Assignee: Jun Gong > Fix For: 2.6.0 > > Attachments: YARN-2617.patch > > > We ([~chenchun]) are testing RM work-preserving restart and found the > following logs when we ran a simple MapReduce task "PI". NM continuously > reported completed containers whose Application had already finished while AM > had finished. > {code} > 2014-09-26 17:00:42,228 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: > Null container completed... > 2014-09-26 17:00:42,228 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: > Null container completed... > 2014-09-26 17:00:43,230 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: > Null container completed... > 2014-09-26 17:00:43,230 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: > Null container completed... 
> 2014-09-26 17:00:44,233 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: > Null container completed... > 2014-09-26 17:00:44,233 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: > Null container completed... > {code} > In the patch for YARN-1372, ApplicationImpl on the NM is supposed to guarantee that > already completed applications are cleaned up. But it only removes the appId from > 'app.context.getApplications()' when ApplicationImpl receives the event > 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED'; however, the NM might > not receive this event for a long time, or might never receive it. > * NonAggregatingLogHandler waits for > YarnConfiguration.NM_LOG_RETAIN_SECONDS, which is 3 * 60 * 60 sec by default, > before it is scheduled to delete the application logs and send the event. > * LogAggregationService might fail (e.g. if the user does not have HDFS > write permission), in which case it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
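For illustration, the filtering discussed in this thread could look roughly like the sketch below. It is only a self-contained model under stated assumptions: `CompletedContainerFilter`, `Status` and `statusesToReport` are hypothetical names, not NodeManager classes, and treating an unknown application like a stopped one is an assumption. The application state names are the ones quoted in the review comment, and the application check sits inside the COMPLETE check as the reviewer asked.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, self-contained model; none of these types are the real NM classes.
class CompletedContainerFilter {
  enum ContainerState { RUNNING, COMPLETE }

  // State names taken from the review comment; RUNNING is added for contrast.
  enum AppState {
    RUNNING, FINISHING_CONTAINERS_WAIT, APPLICATION_RESOURCES_CLEANINGUP, FINISHED
  }

  // Explicit list of states in which the application is treated as stopped.
  static final Set<AppState> STOPPED = EnumSet.of(
      AppState.FINISHING_CONTAINERS_WAIT,
      AppState.APPLICATION_RESOURCES_CLEANINGUP,
      AppState.FINISHED);

  static final class Status {
    final String containerId;
    final String appId;
    final ContainerState state;

    Status(String containerId, String appId, ContainerState state) {
      this.containerId = containerId;
      this.appId = appId;
      this.state = state;
    }
  }

  /** Returns the container statuses that are still worth reporting to the RM. */
  static List<Status> statusesToReport(List<Status> all, Map<String, AppState> apps) {
    List<Status> report = new ArrayList<>();
    for (Status s : all) {
      // The application check only applies to completed containers.
      if (s.state == ContainerState.COMPLETE) {
        AppState app = apps.get(s.appId);
        // Assumption: an unknown application is handled like a stopped one.
        if (app == null || STOPPED.contains(app)) {
          continue;
        }
      }
      report.add(s);
    }
    return report;
  }
}
```

The explicit `STOPPED` set is the trade-off raised in the comment: it has to be revisited whenever a new application state is added.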
[jira] [Updated] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-2198: --- Description: YARN-1972 introduces a Secure Windows Container Executor. However this executor requires the process launching the container to be LocalSystem or a member of the local Administrators group. Since the process in question is the NodeManager, the requirement translates to the entire NM to run as a privileged account, a very large surface area to review and protect. This proposal is to move the privileged operations into a dedicated NT service. The NM can run as a low privilege account and communicate with the privileged NT service when it needs to launch a container. This would reduce the surface exposed to the high privileges. There has to exist a secure, authenticated and authorized channel of communication between the NM and the privileged NT service. Possible alternatives are a new TCP endpoint, Java RPC etc. My proposal though would be to use Windows LPC (Local Procedure Calls), which is a Windows platform specific inter-process communication channel that satisfies all requirements and is easy to deploy. The privileged NT service would register and listen on an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with libwinutils which would host the LPC client code. The client would connect to the LPC port (NtConnectPort) and send a message requesting a container launch (NtRequestWaitReplyPort). LPC provides authentication and the privileged NT service can use authorization API (AuthZ) to validate the caller. was: YARN-1972 introduces a Secure Windows Container Executor. However this executor requires a process launching the container to be LocalSystem or a member of the local Administrators group. Since the process in question is the NodeManager, the requirement translates to the entire NM to run as a privileged account, a very large surface area to review and protect. 
This proposal is to move the privileged operations into a dedicated NT service. The NM can run as a low privilege account and communicate with the privileged NT service when it needs to launch a container. This would reduce the surface exposed to the high privileges. There has to exist a secure, authenticated and authorized channel of communication between the NM and the privileged NT service. Possible alternatives are a new TCP endpoint, Java RPC etc. My proposal though would be to use Windows LPC (Local Procedure Calls), which is a Windows platform specific inter-process communication channel that satisfies all requirements and is easy to deploy. The privileged NT service would register and listen on an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with libwinutils which would host the LPC client code. The client would connect to the LPC port (NtConnectPort) and send a message requesting a container launch (NtRequestWaitReplyPort). LPC provides authentication and the privileged NT service can use authorization API (AuthZ) to validate the caller. > Remove the need to run NodeManager as privileged account for Windows Secure > Container Executor > -- > > Key: YARN-2198 > URL: https://issues.apache.org/jira/browse/YARN-2198 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Remus Rusanu >Assignee: Remus Rusanu > Labels: security, windows > Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, > YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, > YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, > YARN-2198.separation.patch, YARN-2198.trunk.10.patch, > YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, > YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch > > > YARN-1972 introduces a Secure Windows Container Executor. However this > executor requires the process launching the container to be LocalSystem or a > member of the local Administrators group. 
Since the process in question is > the NodeManager, the requirement translates to the entire NM to run as a > privileged account, a very large surface area to review and protect. > This proposal is to move the privileged operations into a dedicated NT > service. The NM can run as a low privilege account and communicate with the > privileged NT service when it needs to launch a container. This would reduce > the surface exposed to the high privileges. > There has to exist a secure, authenticated and authorized channel of > communication between the NM and the privileged NT service. Possible > alternatives are a new TCP endpoint, Java RPC etc. My proposal though would > be to use Windows LPC (Local Procedure Calls), which i
[jira] [Updated] (YARN-2493) [YARN-796] API changes for users
[ https://issues.apache.org/jira/browse/YARN-2493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2493: - Attachment: YARN-2493.patch Hi [~vinodkv], Thanks for your careful review; all of the comments make sense to me. I have attached a new patch according to your suggestions. Wangda > [YARN-796] API changes for users > > > Key: YARN-2493 > URL: https://issues.apache.org/jira/browse/YARN-2493 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-2493.patch, YARN-2493.patch, YARN-2493.patch, > YARN-2493.patch > > > This JIRA includes API changes for users of YARN-796, like changes in > {{ResourceRequest}}, {{ApplicationSubmissionContext}}, etc. This is a common > part of YARN-796. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2493) [YARN-796] API changes for users
[ https://issues.apache.org/jira/browse/YARN-2493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2493: - Attachment: (was: YARN-2493.patch) > [YARN-796] API changes for users > > > Key: YARN-2493 > URL: https://issues.apache.org/jira/browse/YARN-2493 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-2493.patch, YARN-2493.patch, YARN-2493.patch, > YARN-2493.patch > > > This JIRA includes API changes for users of YARN-796, like changes in > {{ResourceRequest}}, {{ApplicationSubmissionContext}}, etc. This is a common > part of YARN-796. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2493) [YARN-796] API changes for users
[ https://issues.apache.org/jira/browse/YARN-2493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2493: - Attachment: YARN-2493.patch > [YARN-796] API changes for users > > > Key: YARN-2493 > URL: https://issues.apache.org/jira/browse/YARN-2493 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-2493.patch, YARN-2493.patch, YARN-2493.patch, > YARN-2493.patch > > > This JIRA includes API changes for users of YARN-796, like changes in > {{ResourceRequest}}, {{ApplicationSubmissionContext}}, etc. This is a common > part of YARN-796. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2493) [YARN-796] API changes for users
[ https://issues.apache.org/jira/browse/YARN-2493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151528#comment-14151528 ] Hadoop QA commented on YARN-2493: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671760/YARN-2493.patch against trunk revision b38e52b. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:red}-1 javac{color}. The applied patch generated 1281 javac compiler warnings (more than the trunk's current 1265 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5169//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5169//artifact/PreCommit-HADOOP-Build-patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5169//console This message is automatically generated. 
> [YARN-796] API changes for users > > > Key: YARN-2493 > URL: https://issues.apache.org/jira/browse/YARN-2493 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-2493.patch, YARN-2493.patch, YARN-2493.patch, > YARN-2493.patch > > > This JIRA includes API changes for users of YARN-796, like changes in > {{ResourceRequest}}, {{ApplicationSubmissionContext}}, etc. This is a common > part of YARN-796. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2312) Marking ContainerId#getId as deprecated
[ https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2312: - Attachment: YARN-2312.2-2.patch Let me attach the same patch again. > Marking ContainerId#getId as deprecated > --- > > Key: YARN-2312 > URL: https://issues.apache.org/jira/browse/YARN-2312 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2312-wip.patch, YARN-2312.1.patch, > YARN-2312.2-2.patch, YARN-2312.2.patch > > > After YARN-2229, {{ContainerId#getId}} only returns a partial value of the containerId: > the sequence number of the container id, without the epoch. We should > mark {{ContainerId#getId}} as deprecated and use > {{ContainerId#getContainerId}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2494) [YARN-796] Node label manager API and storage implementations
[ https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151658#comment-14151658 ] Wangda Tan commented on YARN-2494: -- Hi [~vinodkv] and [~cwelch], Thanks for the reply! I am still working on your last comments and will upload a patch soon. Regarding the method names of NodeLabelManager, the following suggestion makes sense to me: bq. What I really want to convey is that these are just system recognized node-labels as opposed to node-labels that are actually mapped against a node. How about addToNodeLabelsCollection(), removeFromNodeLabelsCollection(), addLabelsToNode() and removeLabelsFromNode(). The point about addToNodeLabelsCollection() is that it clearly conveys that there is a NodeLabelsCollection - a set of node-labels known by the system. And regarding bq. Once you have the store abstraction, this will be less of a problem? Clearly NodeLabelsManager is not something that the client needs access to? I think there is still a problem: even with a store abstraction, we still need some logic to guarantee that the labels being added are valid (e.g. we need to check that a label exists in the collection, and that a label exists on the node when we try to remove labels from a node). That would force us to put a larger chunk of logic into the store abstraction -- it isn't a simple store abstraction if we do this. I suggest keeping it in common so that the major node label logic lives together. 
Thanks, Wangda > [YARN-796] Node label manager API and storage implementations > - > > Key: YARN-2494 > URL: https://issues.apache.org/jira/browse/YARN-2494 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, > YARN-2494.patch, YARN-2494.patch, YARN-2494.patch > > > This JIRA includes APIs and storage implementations of node label manager, > NodeLabelManager is an abstract class used to manage labels of nodes in the > cluster, it has APIs to query/modify > - Nodes according to given label > - Labels according to given hostname > - Add/remove labels > - Set labels of nodes in the cluster > - Persist/recover changes of labels/labels-on-nodes to/from storage > And it has two implementations to store modifications > - Memory based storage: It will not persist changes, so all labels will be > lost when RM restart > - FileSystem based storage: It will persist/recover to/from FileSystem (like > HDFS), and all labels and labels-on-nodes will be recovered upon RM restart -- This message was sent by Atlassian JIRA (v6.3.4#6332)
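The validation concern in the comment above can be made concrete with a small in-memory sketch. It reuses the four method names proposed in the thread, but everything else is hypothetical: `NodeLabelsSketch` is not the NodeLabelManager from the patch, `getLabelsOnNode` is an invented helper, and throwing `IllegalArgumentException` on invalid input is an assumption.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory sketch; not the NodeLabelManager from the patch.
class NodeLabelsSketch {
  // Labels known by the system (the "node labels collection").
  private final Set<String> labelsCollection = new HashSet<>();
  // Labels actually mapped against each host.
  private final Map<String, Set<String>> labelsOnNode = new HashMap<>();

  void addToNodeLabelsCollection(Set<String> labels) {
    labelsCollection.addAll(labels);
  }

  void removeFromNodeLabelsCollection(Set<String> labels) {
    requireKnown(labels);
    labelsCollection.removeAll(labels);
    // A label that leaves the collection also leaves every node.
    for (Set<String> onNode : labelsOnNode.values()) {
      onNode.removeAll(labels);
    }
  }

  void addLabelsToNode(String host, Set<String> labels) {
    // Only labels already in the collection may be mapped to a node.
    requireKnown(labels);
    labelsOnNode.computeIfAbsent(host, h -> new HashSet<>()).addAll(labels);
  }

  void removeLabelsFromNode(String host, Set<String> labels) {
    Set<String> current = labelsOnNode.get(host);
    if (current == null || !current.containsAll(labels)) {
      throw new IllegalArgumentException("labels not on node " + host + ": " + labels);
    }
    current.removeAll(labels);
  }

  // Invented helper: returns a copy of the labels currently on the host.
  Set<String> getLabelsOnNode(String host) {
    return new HashSet<>(labelsOnNode.getOrDefault(host, new HashSet<>()));
  }

  private void requireKnown(Set<String> labels) {
    if (!labelsCollection.containsAll(labels)) {
      throw new IllegalArgumentException("unknown label in " + labels);
    }
  }
}
```

All of the checks live in the manager rather than in a store, which is the arrangement the comment argues for; a store behind it would then only need to persist already validated changes.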
[jira] [Commented] (YARN-2312) Marking ContainerId#getId as deprecated
[ https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151668#comment-14151668 ] Hadoop QA commented on YARN-2312: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671773/YARN-2312.2-2.patch against trunk revision b38e52b. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 16 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. 
The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.mapred.pipes.TestPipeApplication org.apache.hadoop.mapreduce.lib.input.TestMRCJCFileInputFormat org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5170//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5170//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-core.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5170//console This message is automatically generated. > Marking ContainerId#getId as deprecated > --- > > Key: YARN-2312 > URL: https://issues.apache.org/jira/browse/YARN-2312 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2312-wip.patch, YARN-2312.1.patch, > YARN-2312.2-2.patch, YARN-2312.2.patch > > > {{ContainerId#getId}} will only return partial value of containerId, only > sequence number of container id without epoch, after YARN-2229. 
We should > mark {{ContainerId#getId}} as deprecated and use > {{ContainerId#getContainerId}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
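To show why the partial value is a problem, here is a minimal sketch. `ContainerIdSketch` is a hypothetical stand-in, not the Hadoop `ContainerId` class, and the 40-bit split between epoch and sequence number is an assumption made only for the example.

```java
// Hypothetical stand-in for a container id; NOT the Hadoop ContainerId class.
// Assumed layout: sequence number in the low 40 bits, epoch in the bits above.
class ContainerIdSketch {
  static final int SEQUENCE_BITS = 40; // assumption for this example
  static final long SEQUENCE_MASK = (1L << SEQUENCE_BITS) - 1;

  private final long containerId;

  ContainerIdSketch(long epoch, long sequence) {
    this.containerId = (epoch << SEQUENCE_BITS) | (sequence & SEQUENCE_MASK);
  }

  /** Partial value: the sequence number only, the epoch is dropped. */
  @Deprecated
  int getId() {
    return (int) (containerId & SEQUENCE_MASK);
  }

  /** Full value, including the epoch. */
  long getContainerId() {
    return containerId;
  }

  public static void main(String[] args) {
    ContainerIdSketch beforeRestart = new ContainerIdSketch(0, 7);
    ContainerIdSketch afterRestart = new ContainerIdSketch(1, 7);
    // The deprecated accessor cannot tell the two ids apart.
    System.out.println(beforeRestart.getId() == afterRestart.getId());
    // The full accessor can.
    System.out.println(beforeRestart.getContainerId() == afterRestart.getContainerId());
  }
}
```

Two containers that share a sequence number but come from different epochs collide under the deprecated accessor, which is the motivation for moving callers to the full value.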
[jira] [Updated] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong updated YARN-2617: --- Attachment: YARN-2617.2.patch > NM does not need to send finished container whose APP is not running to RM > -- > > Key: YARN-2617 > URL: https://issues.apache.org/jira/browse/YARN-2617 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: Jun Gong >Assignee: Jun Gong > Fix For: 2.6.0 > > Attachments: YARN-2617.2.patch, YARN-2617.patch > > > We ([~chenchun]) are testing RM work-preserving restart and found the > following logs when we ran a simple MapReduce task "PI". NM continuously > reported completed containers whose Application had already finished while AM > had finished. > {code} > 2014-09-26 17:00:42,228 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: > Null container completed... > 2014-09-26 17:00:42,228 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: > Null container completed... > 2014-09-26 17:00:43,230 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: > Null container completed... > 2014-09-26 17:00:43,230 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: > Null container completed... > 2014-09-26 17:00:44,233 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: > Null container completed... > 2014-09-26 17:00:44,233 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: > Null container completed... > {code} > In the patch for YARN-1372, ApplicationImpl on the NM is supposed to guarantee that > already completed applications are cleaned up. But it only removes the appId from > 'app.context.getApplications()' when ApplicationImpl receives the event > 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED'; however, the NM might > not receive this event for a long time, or might never receive it. 
> * NonAggregatingLogHandler waits for > YarnConfiguration.NM_LOG_RETAIN_SECONDS, which is 3 * 60 * 60 sec by default, > before it is scheduled to delete the application logs and send the event. > * LogAggregationService might fail (e.g. if the user does not have HDFS > write permission), in which case it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151706#comment-14151706 ] Jason Lowe commented on YARN-1769: -- +1 lgtm. Committing this. > CapacityScheduler: Improve reservations > > > Key: YARN-1769 > URL: https://issues.apache.org/jira/browse/YARN-1769 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch > > > Currently the CapacityScheduler uses reservations in order to handle requests > for large containers and the fact there might not currently be enough space > available on a single host. > The current algorithm for reservations is to reserve as many containers as > currently required and then it will start to reserve more above that after a > certain number of re-reservations (currently biased against larger > containers). Anytime it hits the limit on the number reserved it stops looking > at any other nodes. This results in potentially missing nodes that have > enough space to fulfill the request. > The other place for improvement is that reservations currently count against your > queue capacity. If you have reservations you could hit the various limits > which would then stop you from looking further at that node. > The above 2 cases can cause an application requesting a larger container to > take a long time to get its resources. > We could improve upon both of those by simply continuing to look at incoming > nodes to see if we could potentially swap out a reservation for an actual > allocation. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
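The last idea in the description, continuing to look at incoming nodes and swapping a reservation for an actual allocation, can be sketched as follows. This is a toy model with one pending request and invented names (`ReservationSwapSketch`, `onNodeUpdate`); it is not the CapacityScheduler code from the patch and ignores queue limits entirely.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model with one pending request; not the CapacityScheduler implementation.
class ReservationSwapSketch {
  private final Map<String, Integer> freeMb = new HashMap<>();
  private String reservedOn;  // node currently holding the reservation, or null
  private String allocatedOn; // node where the container was allocated, or null

  void setFree(String node, int mb) {
    freeMb.put(node, mb);
  }

  /** Called for each incoming node; returns true once the request is allocated. */
  boolean onNodeUpdate(String node, int requestMb) {
    if (allocatedOn != null) {
      return true;
    }
    int free = freeMb.getOrDefault(node, 0);
    if (free >= requestMb) {
      // Enough space here: drop any reservation held elsewhere and allocate.
      reservedOn = null;
      allocatedOn = node;
      freeMb.put(node, free - requestMb);
      return true;
    }
    // Not enough space: reserve once, but keep looking at later nodes.
    if (reservedOn == null) {
      reservedOn = node;
    }
    return false;
  }

  String getReservedOn() {
    return reservedOn;
  }

  String getAllocatedOn() {
    return allocatedOn;
  }
}
```

The key difference from the behaviour the description criticises is that a held reservation does not stop the search: a later node with enough free space still wins.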
[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151726#comment-14151726 ] Hudson commented on YARN-1769: -- FAILURE: Integrated in Hadoop-trunk-Commit #6135 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6135/]) YARN-1769. CapacityScheduler: Improve reservations. Contributed by Thomas Graves (jlowe: rev 9c22065109a77681bc2534063eabe8692fbcb3cd) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestChildQueueOrder.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestReservations.java * hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestApplicationLimits.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java * 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerContext.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestParentQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp.java > CapacityScheduler: Improve reservations > > > Key: YARN-1769 > URL: https://issues.apache.org/jira/browse/YARN-1769 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Fix For: 2.6.0 > > Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch > > > Currently the CapacityScheduler uses reservations in order to handle requests > for large containers and the fact there might not currently be enough space > available on a single host. 
> The current algorithm for reservations is to reserve as many containers as > currently required and then it will start to reserve more above that after a > certain number of re-reservations (currently biased against larger > containers). Any time it hits the limit on the number reserved it stops looking > at any other nodes. This results in potentially missing nodes that have > enough space to fulfill the request. > The other place for improvement is that reservations currently count against your > queue capacity. If you have reservations you could hit the various limits > which would then stop you from looking further at that node. > The above 2 cases can cause an application requesting a larger container to > take a long time to get its resources. > We could improve upon both of those by simply continuing to look at incoming > nodes to see if we could potentially swap out a reservation for an actual > allocation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2613) NMClient doesn't have retries for supporting rolling-upgrades
[ https://issues.apache.org/jira/browse/YARN-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151796#comment-14151796 ] Junping Du commented on YARN-2613: -- Thanks [~jianhe] for the patch. I am reviewing your patch; some initial comments are below. More comments may come later. {code} - public static final int DEFAULT_RESOURCEMANAGER_CONNECT_MAX_WAIT_MS = + public static final long DEFAULT_RESOURCEMANAGER_CONNECT_MAX_WAIT_MS = 15 * 60 * 1000; + public static final int DEFAULT_CLIENT_NM_CONNECT_MAX_WAIT_MS = + 15 * 60 * 1000; + public static final long DEFAULT_CLIENT_NM_CONNECT_RETRY_INTERVAL_MS + = 10 * 1000; {code} I think it is better to keep consistent to use int or long for time intervals or wait. IMO, int should be fine enough as it supports up to (2 ^ 31) milliseconds ~ 24.8 days. {code} -//TO DO: after HADOOP-9576, IOException can be changed to EOFException -exceptionToPolicyMap.put(IOException.class, retryPolicy); {code} Do we have plan to get HADOOP-9576 in? If yes, shall we keep the todo comments here? > NMClient doesn't have retries for supporting rolling-upgrades > - > > Key: YARN-2613 > URL: https://issues.apache.org/jira/browse/YARN-2613 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2613.1.patch, YARN-2613.2.patch > > > While NM is rolling upgrade, client should retry NM until it comes up. This > jira is to add a NMProxy (similar to RMProxy) with retry implementation to > support rolling upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
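The int-versus-long question in this review comment comes down to simple arithmetic, so a quick check may help. The sketch below is only an illustration: the class and method names are made up here and are not part of the YARN-2613 patch. It computes how many whole days of milliseconds fit in a Java int, and the value of the 15-minute default from the quoted diff.

```java
final class WaitMsRange {
  // Largest millisecond count an int can hold, expressed in whole days.
  // Integer.MAX_VALUE is 2^31 - 1; one day is 24 * 60 * 60 * 1000 ms.
  static long maxIntMillisInDays() {
    return Integer.MAX_VALUE / (24L * 60 * 60 * 1000);
  }

  // The 15-minute default from the quoted diff, evaluated in int arithmetic.
  static int fifteenMinutesMs() {
    return 15 * 60 * 1000;
  }

  public static void main(String[] args) {
    System.out.println("whole days of ms that fit in an int: " + maxIntMillisInDays());
    System.out.println("15 minutes in ms: " + fifteenMinutesMs());
  }
}
```

Under these numbers an int is comfortably large enough for a 15-minute wait, so the choice between int and long here is mostly about consistency across the constants.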
[jira] [Updated] (YARN-2606) Application History Server tries to access hdfs before doing secure login
[ https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai updated YARN-2606: Attachment: YARN-2606.patch Refining the patch to remove the unwanted serviceInit() as all the work is done in serviceStart() > Application History Server tries to access hdfs before doing secure login > - > > Key: YARN-2606 > URL: https://issues.apache.org/jira/browse/YARN-2606 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.6.0 >Reporter: Mit Desai >Assignee: Mit Desai > Attachments: YARN-2606.patch, YARN-2606.patch, YARN-2606.patch > > > While testing the Application Timeline Server, the server would not come up > in a secure cluster, as it would keep trying to access hdfs without having > done the secure login. It would repeatedly try authenticating and finally hit > stack overflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2606) Application History Server tries to access hdfs before doing secure login
[ https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai updated YARN-2606: Attachment: YARN-2606.patch Yet some more refining. Attached updated patch. > Application History Server tries to access hdfs before doing secure login > - > > Key: YARN-2606 > URL: https://issues.apache.org/jira/browse/YARN-2606 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.6.0 >Reporter: Mit Desai >Assignee: Mit Desai > Attachments: YARN-2606.patch, YARN-2606.patch, YARN-2606.patch, > YARN-2606.patch > > > While testing the Application Timeline Server, the server would not come up > in a secure cluster, as it would keep trying to access hdfs without having > done the secure login. It would repeatedly try authenticating and finally hit > stack overflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login
[ https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151851#comment-14151851 ] Hadoop QA commented on YARN-2606: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671803/YARN-2606.patch against trunk revision 4666440. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5171//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5171//console This message is automatically generated. 
> Application History Server tries to access hdfs before doing secure login > - > > Key: YARN-2606 > URL: https://issues.apache.org/jira/browse/YARN-2606 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.6.0 >Reporter: Mit Desai >Assignee: Mit Desai > Attachments: YARN-2606.patch, YARN-2606.patch, YARN-2606.patch, > YARN-2606.patch > > > While testing the Application Timeline Server, the server would not come up > in a secure cluster, as it would keep trying to access hdfs without having > done the secure login. It would repeatedly try authenticating and finally hit > stack overflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2613) NMClient doesn't have retries for supporting rolling-upgrades
[ https://issues.apache.org/jira/browse/YARN-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151914#comment-14151914 ] Jian He commented on YARN-2613: --- bq. I think it is better to keep consistent to use int or long good catch, I changed the one for RMProxy, but missed this. bq. Do we have plan to get HADOOP-9576 in? If yes, shall we keep the todo comments here? I forgot my initial intent to add this comment. As now I followed FailoverOnNetworkExceptionRetry for the exception-retry policy, I found maybe we don't need to do this for now. > NMClient doesn't have retries for supporting rolling-upgrades > - > > Key: YARN-2613 > URL: https://issues.apache.org/jira/browse/YARN-2613 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2613.1.patch, YARN-2613.2.patch > > > While NM is rolling upgrade, client should retry NM until it comes up. This > jira is to add a NMProxy (similar to RMProxy) with retry implementation to > support rolling upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2179) Initial cache manager structure and context
[ https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151977#comment-14151977 ] Karthik Kambatla commented on YARN-2179: [~vinodkv] - do you have any further comments on this? > Initial cache manager structure and context > --- > > Key: YARN-2179 > URL: https://issues.apache.org/jira/browse/YARN-2179 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chris Trezzo >Assignee: Chris Trezzo > Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, > YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, > YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, YARN-2179-trunk-v8.patch, > YARN-2179-trunk-v9.patch > > > Implement the initial shared cache manager structure and context. The > SCMContext will be used by a number of manager services (i.e. the backing > store and the cleaner service). The AppChecker is used to gather the > currently running applications on SCM startup (necessary for an scm that is > backed by an in-memory store). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2618) Add API support for disk I/O resources
Wei Yan created YARN-2618: - Summary: Add API support for disk I/O resources Key: YARN-2618 URL: https://issues.apache.org/jira/browse/YARN-2618 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wei Yan Assignee: Wei Yan Subtask of YARN-2139. Add API support for introducing disk I/O as the third resource type. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2619) NodeManager: Add cgroups support for disk I/O isolation
Wei Yan created YARN-2619: - Summary: NodeManager: Add cgroups support for disk I/O isolation Key: YARN-2619 URL: https://issues.apache.org/jira/browse/YARN-2619 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wei Yan Assignee: Wei Yan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2620) FairScheduler: Add disk I/O resource to the DRF implementation
Wei Yan created YARN-2620: - Summary: FairScheduler: Add disk I/O resource to the DRF implementation Key: YARN-2620 URL: https://issues.apache.org/jira/browse/YARN-2620 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wei Yan Assignee: Wei Yan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2610) Hamlet should close table tags
[ https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2610: --- Summary: Hamlet should close table tags (was: Hamlet doesn't close table tags) > Hamlet should close table tags > -- > > Key: YARN-2610 > URL: https://issues.apache.org/jira/browse/YARN-2610 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Ray Chiang >Assignee: Ray Chiang > Labels: supportability > Attachments: YARN-2610-01.patch, YARN-2610-02.patch > > > Revisiting a subset of MAPREDUCE-2993. > The , , , , tags are not configured to close > properly in Hamlet. While this is allowed in HTML 4.01, missing closing > table tags tends to wreak havoc with a lot of HTML processors (although not > usually browsers). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152042#comment-14152042 ] Remus Rusanu commented on YARN-2198: The last QA -1 is for delta.10.patch, which is not a trunk diff. > Remove the need to run NodeManager as privileged account for Windows Secure > Container Executor > -- > > Key: YARN-2198 > URL: https://issues.apache.org/jira/browse/YARN-2198 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Remus Rusanu >Assignee: Remus Rusanu > Labels: security, windows > Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, > YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, > YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, > YARN-2198.separation.patch, YARN-2198.trunk.10.patch, > YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, > YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch > > > YARN-1972 introduces a Secure Windows Container Executor. However this > executor requires the process launching the container to be LocalSystem or a > member of the local Administrators group. Since the process in question is > the NodeManager, the requirement means the entire NM must run as a > privileged account, a very large surface area to review and protect. > This proposal is to move the privileged operations into a dedicated NT > service. The NM can run as a low privilege account and communicate with the > privileged NT service when it needs to launch a container. This would reduce > the surface exposed to the high privileges. > There has to exist a secure, authenticated and authorized channel of > communication between the NM and the privileged NT service. Possible > alternatives are a new TCP endpoint, Java RPC etc. 
My proposal though would > be to use Windows LPC (Local Procedure Calls), which is a Windows platform > specific inter-process communication channel that satisfies all requirements > and is easy to deploy. The privileged NT service would register and listen on > an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop > with libwinutils which would host the LPC client code. The client would > connect to the LPC port (NtConnectPort) and send a message requesting a > container launch (NtRequestWaitReplyPort). LPC provides authentication and > the privileged NT service can use authorization API (AuthZ) to validate the > caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login
[ https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152075#comment-14152075 ] Hadoop QA commented on YARN-2606: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671811/YARN-2606.patch against trunk revision b3d5d26. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5172//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5172//console This message is automatically generated. 
> Application History Server tries to access hdfs before doing secure login > - > > Key: YARN-2606 > URL: https://issues.apache.org/jira/browse/YARN-2606 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.6.0 >Reporter: Mit Desai >Assignee: Mit Desai > Attachments: YARN-2606.patch, YARN-2606.patch, YARN-2606.patch, > YARN-2606.patch > > > While testing the Application Timeline Server, the server would not come up > in a secure cluster, as it would keep trying to access hdfs without having > done the secure login. It would repeatedly try authenticating and finally hit > stack overflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2494) [YARN-796] Node label manager API and storage implementations
[ https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152078#comment-14152078 ] Craig Welch commented on YARN-2494: --- Not to dither about names - but "Collection" is still not terribly clear to me (overly generic), I was thinking previously about "Cluster" as the differentiator, so: addToClusterNodeLabels(), removeFromClusterNodeLabels(), addLabelsToNode() and removeLabelsFromNode(). I think this conveys the different notions of what the operations are applying to in a pretty clear way. Thoughts? > [YARN-796] Node label manager API and storage implementations > - > > Key: YARN-2494 > URL: https://issues.apache.org/jira/browse/YARN-2494 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, > YARN-2494.patch, YARN-2494.patch, YARN-2494.patch > > > This JIRA includes APIs and storage implementations of node label manager, > NodeLabelManager is an abstract class used to manage labels of nodes in the > cluster, it has APIs to query/modify > - Nodes according to given label > - Labels according to given hostname > - Add/remove labels > - Set labels of nodes in the cluster > - Persist/recover changes of labels/labels-on-nodes to/from storage > And it has two implementations to store modifications > - Memory based storage: It will not persist changes, so all labels will be > lost when RM restart > - FileSystem based storage: It will persist/recover to/from FileSystem (like > HDFS), and all labels and labels-on-nodes will be recovered upon RM restart -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login
[ https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152077#comment-14152077 ] Jonathan Eagles commented on YARN-2606: --- +1. Will commit at the end of the day in case any one else has comments. > Application History Server tries to access hdfs before doing secure login > - > > Key: YARN-2606 > URL: https://issues.apache.org/jira/browse/YARN-2606 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.6.0 >Reporter: Mit Desai >Assignee: Mit Desai > Attachments: YARN-2606.patch, YARN-2606.patch, YARN-2606.patch, > YARN-2606.patch > > > While testing the Application Timeline Server, the server would not come up > in a secure cluster, as it would keep trying to access hdfs without having > done the secure login. It would repeatedly try authenticating and finally hit > stack overflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2179) Initial cache manager structure and context
[ https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152084#comment-14152084 ] Vinod Kumar Vavilapalli commented on YARN-2179: --- Looking now.. > Initial cache manager structure and context > --- > > Key: YARN-2179 > URL: https://issues.apache.org/jira/browse/YARN-2179 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chris Trezzo >Assignee: Chris Trezzo > Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, > YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, > YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, YARN-2179-trunk-v8.patch, > YARN-2179-trunk-v9.patch > > > Implement the initial shared cache manager structure and context. The > SCMContext will be used by a number of manager services (i.e. the backing > store and the cleaner service). The AppChecker is used to gather the > currently running applications on SCM startup (necessary for an scm that is > backed by an in-memory store). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2179) Initial cache manager structure and context
[ https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152103#comment-14152103 ] Vinod Kumar Vavilapalli commented on YARN-2179: --- Looks so much better now. One minor suggestion - in the test, instead of overriding all of YarnClient, you could simply mock it to override behaviour of only those methods that you are interested in. +1 otherwise. > Initial cache manager structure and context > --- > > Key: YARN-2179 > URL: https://issues.apache.org/jira/browse/YARN-2179 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chris Trezzo >Assignee: Chris Trezzo > Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, > YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, > YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, YARN-2179-trunk-v8.patch, > YARN-2179-trunk-v9.patch > > > Implement the initial shared cache manager structure and context. The > SCMContext will be used by a number of manager services (i.e. the backing > store and the cleaner service). The AppChecker is used to gather the > currently running applications on SCM startup (necessary for an scm that is > backed by an in-memory store). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152137#comment-14152137 ] Jian Fang commented on YARN-1680: - Hi, any update on the fix? We saw quite a few jobs fail due to this issue. > availableResources sent to applicationMaster in heartbeat should exclude > blacklistedNodes free memory. > -- > > Key: YARN-1680 > URL: https://issues.apache.org/jira/browse/YARN-1680 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.2.0, 2.3.0 > Environment: SuSE 11 SP2 + Hadoop-2.3 >Reporter: Rohith >Assignee: Chen He > Attachments: YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch > > > There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster > slow start is set to 1. > A job is running and its reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) > became unstable (3 maps got killed), and MRAppMaster blacklisted the unstable > NodeManager (NM-4). All reducer tasks are running in the cluster now. > MRAppMaster does not preempt the reducers because, for the reducer preemption > calculation, headRoom includes the memory of blacklisted nodes. This makes > jobs hang forever (ResourceManager does not assign any new containers on > blacklisted nodes but returns an availableResources value that counts cluster free > memory). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152153#comment-14152153 ] Eric Payne commented on YARN-2056: -- [~leftnoteasy]. Thanks again for helping to review this patch. Have you had a chance to look over the updated changes? > Disable preemption at Queue level > - > > Key: YARN-2056 > URL: https://issues.apache.org/jira/browse/YARN-2056 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Mayank Bansal >Assignee: Eric Payne > Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, > YARN-2056.201408310117.txt, YARN-2056.201409022208.txt, > YARN-2056.201409181916.txt, YARN-2056.201409210049.txt, > YARN-2056.201409232329.txt, YARN-2056.201409242210.txt > > > We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152154#comment-14152154 ] Chen He commented on YARN-1680: --- Thank you for reminding me, [~john.jian.fang]. I will post the updated patch before the end of tomorrow. > availableResources sent to applicationMaster in heartbeat should exclude > blacklistedNodes free memory. > -- > > Key: YARN-1680 > URL: https://issues.apache.org/jira/browse/YARN-1680 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.2.0, 2.3.0 > Environment: SuSE 11 SP2 + Hadoop-2.3 >Reporter: Rohith >Assignee: Chen He > Attachments: YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch > > > There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster > slow start is set to 1. > A job is running and its reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) > became unstable (3 maps got killed), and MRAppMaster blacklisted the unstable > NodeManager (NM-4). All reducer tasks are running in the cluster now. > MRAppMaster does not preempt the reducers because, for the reducer preemption > calculation, headRoom includes the memory of blacklisted nodes. This makes > jobs hang forever (ResourceManager does not assign any new containers on > blacklisted nodes but returns an availableResources value that counts cluster free > memory). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152156#comment-14152156 ] Jian Fang commented on YARN-1680: - Thanks. Looking forward to your patch. > availableResources sent to applicationMaster in heartbeat should exclude > blacklistedNodes free memory. > -- > > Key: YARN-1680 > URL: https://issues.apache.org/jira/browse/YARN-1680 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.2.0, 2.3.0 > Environment: SuSE 11 SP2 + Hadoop-2.3 >Reporter: Rohith >Assignee: Chen He > Attachments: YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch > > > There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster > slow start is set to 1. > A job is running and its reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) > became unstable (3 maps got killed), and MRAppMaster blacklisted the unstable > NodeManager (NM-4). All reducer tasks are running in the cluster now. > MRAppMaster does not preempt the reducers because, for the reducer preemption > calculation, headRoom includes the memory of blacklisted nodes. This makes > jobs hang forever (ResourceManager does not assign any new containers on > blacklisted nodes but returns an availableResources value that counts cluster free > memory). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
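The headroom problem described in YARN-1680 can be illustrated with a small calculation. This is a rough sketch, not the patch attached to the issue: the class and method names are hypothetical, and the per-node free memory is an assumed split. Only the four 8 GB nodes, the 29 GB in use, and the blacklisting of NM-4 come from the report.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class HeadroomSketch {
  // Free memory per node in GB. Assumption for illustration: the 3 GB that
  // remain out of 32 GB all sit on the blacklisted node NM-4.
  static Map<String, Integer> exampleFreeGbByNode() {
    Map<String, Integer> free = new LinkedHashMap<>();
    free.put("NM-1", 0);
    free.put("NM-2", 0);
    free.put("NM-3", 0);
    free.put("NM-4", 3);
    return free;
  }

  // Sums free memory over every node the application has not blacklisted.
  static int headroomGb(Map<String, Integer> freeGbByNode, Set<String> blacklisted) {
    int total = 0;
    for (Map.Entry<String, Integer> e : freeGbByNode.entrySet()) {
      if (!blacklisted.contains(e.getKey())) {
        total += e.getValue();
      }
    }
    return total;
  }

  public static void main(String[] args) {
    Map<String, Integer> free = exampleFreeGbByNode();
    // Ignoring the blacklist reports memory the application can never use.
    System.out.println("naive headroom (GB): "
        + headroomGb(free, Collections.<String>emptySet()));
    // Excluding NM-4 shows there is no usable headroom left.
    System.out.println("blacklist-aware headroom (GB): "
        + headroomGb(free, Collections.singleton("NM-4")));
  }
}
```

With the assumed split, the naive figure suggests free capacity while the blacklist-aware figure does not, which is the situation in which the application master would need to preempt reducers.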
[jira] [Commented] (YARN-2610) Hamlet should close table tags
[ https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152160#comment-14152160 ] Mit Desai commented on YARN-2610: - Why is the change specific to some tags and not the others? > Hamlet should close table tags > -- > > Key: YARN-2610 > URL: https://issues.apache.org/jira/browse/YARN-2610 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Ray Chiang >Assignee: Ray Chiang > Labels: supportability > Attachments: YARN-2610-01.patch, YARN-2610-02.patch > > > Revisiting a subset of MAPREDUCE-2993. > The , , , , tags are not configured to close > properly in Hamlet. While this is allowed in HTML 4.01, missing closing > table tags tends to wreak havoc with a lot of HTML processors (although not > usually browsers). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2610) Hamlet should close table tags
[ https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152176#comment-14152176 ] Ray Chiang commented on YARN-2610: -- I would have been fine with changing all the tags to close cleanly, except for the feedback from MAPREDUCE-2993. So, I limited these changes to just the table rendering ones--which tends to cause the most problems anyhow. Or is there some table related tag that I missed? > Hamlet should close table tags > -- > > Key: YARN-2610 > URL: https://issues.apache.org/jira/browse/YARN-2610 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Ray Chiang >Assignee: Ray Chiang > Labels: supportability > Attachments: YARN-2610-01.patch, YARN-2610-02.patch > > > Revisiting a subset of MAPREDUCE-2993. > The , , , , tags are not configured to close > properly in Hamlet. While this is allowed in HTML 4.01, missing closing > table tags tends to wreak havoc with a lot of HTML processors (although not > usually browsers). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2610) Hamlet should close table tags
[ https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152206#comment-14152206 ] Karthik Kambatla commented on YARN-2610: I just ran all YARN tests with the latest patch to be safe. None of the test failures are related. +1. I'll commit this later today if no one objects. > Hamlet should close table tags > -- > > Key: YARN-2610 > URL: https://issues.apache.org/jira/browse/YARN-2610 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Ray Chiang >Assignee: Ray Chiang > Labels: supportability > Attachments: YARN-2610-01.patch, YARN-2610-02.patch > > > Revisiting a subset of MAPREDUCE-2993. > The , , , , tags are not configured to close > properly in Hamlet. While this is allowed in HTML 4.01, missing closing > table tags tends to wreak havoc with a lot of HTML processors (although not > usually browsers). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2610) Hamlet should close table tags
[ https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152219#comment-14152219 ] Mit Desai commented on YARN-2610: - [~rchiang], I did not see the comments on that MAPREDUCE-2993 before. Just wanted to know the reason behind leaving some tags open. The patch looks good to me. +1 (non-binding) > Hamlet should close table tags > -- > > Key: YARN-2610 > URL: https://issues.apache.org/jira/browse/YARN-2610 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Ray Chiang >Assignee: Ray Chiang > Labels: supportability > Attachments: YARN-2610-01.patch, YARN-2610-02.patch > > > Revisiting a subset of MAPREDUCE-2993. > The , , , , tags are not configured to close > properly in Hamlet. While this is allowed in HTML 4.01, missing closing > table tags tends to wreak havoc with a lot of HTML processors (although not > usually browsers). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2494) [YARN-796] Node label manager API and storage implementations
[ https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152270#comment-14152270 ] Vinod Kumar Vavilapalli commented on YARN-2494: --- bq. I think it still has problem: Even if we have store abstraction, we still need some logic to guarantee labels being added are valid (e.g. we need check if a label existed in collection, and label existed in node when we trying to remove some labels from a node). Then that validation code needs to get pulled out into a common layer. My goal is not to put the entire NodelabelsManager in yarn-common - it just doesn't belong there. bq. How about addToNodeLabelsCollection(), removeFromNodeLabelsCollection(), addLabelsToNode() and removeLabelsFromNode() bq. addToClusterNodeLabels(), removeFromClusterNodeLabels(), addLabelsToNode() and removeLabelsFromNode(). [~leftnoteasy], [~cwelch], I'm okay with either of the above. Or should we call it {{ClusterNodeLabelsCollection}}? :) > [YARN-796] Node label manager API and storage implementations > - > > Key: YARN-2494 > URL: https://issues.apache.org/jira/browse/YARN-2494 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, > YARN-2494.patch, YARN-2494.patch, YARN-2494.patch > > > This JIRA includes APIs and storage implementations of node label manager, > NodeLabelManager is an abstract class used to manage labels of nodes in the > cluster, it has APIs to query/modify > - Nodes according to given label > - Labels according to given hostname > - Add/remove labels > - Set labels of nodes in the cluster > - Persist/recover changes of labels/labels-on-nodes to/from storage > And it has two implementations to store modifications > - Memory based storage: It will not persist changes, so all labels will be > lost when RM restart > - FileSystem based storage: It will persist/recover to/from 
FileSystem (like > HDFS), and all labels and labels-on-nodes will be recovered upon RM restart -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152301#comment-14152301 ] Anubhav Dhoot commented on YARN-1879: - The patch needs to be updated > Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol > --- > > Key: YARN-1879 > URL: https://issues.apache.org/jira/browse/YARN-1879 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Tsuyoshi OZAWA >Priority: Critical > Attachments: YARN-1879.1.patch, YARN-1879.1.patch, > YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, > YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.2-wip.patch, > YARN-1879.2.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, > YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152307#comment-14152307 ] Jason Lowe commented on YARN-90: Thanks for updating the patch, Varun. bq. I've changed it to "Disk(s) health report: ". My only concern with this is that there might be scripts looking for the "Disk(s) failed" log line for monitoring. What do you think? If that's true then the code should bother to do a diff between the old disk list and the new one, logging which disks turned bad using the "Disk(s) failed" line and which disks became healthy with some other log message. bq. Directories are only cleaned up during startup. The code tests for existence of the directories and the correct permissions. This does mean that container directories left behind for any reason won't get cleaned up until the NodeManager is restarted. Is that ok? This could still be problematic for the NM work-preserving restart case, as we could try to delete an entire disk tree with active containers on it due to a hiccup when the NM restarts. I think a better approach is a periodic cleanup scan that looks for directories under yarn-local and yarn-logs that shouldn't be there. This could be part of the health check scan or done separately. That way we don't have to wait for a disk to turn good or bad to catch leaked entities on the disk due to some hiccup. Sorta like an fsck for the NM state on disk. That is best done as a separate JIRA, as I think this functionality is still an incremental improvement without it. Other comments: checkDirs unnecessarily calls union(errorDirs, fullDirs) twice. isDiskFreeSpaceOverLimt is now named backwards, as the code returns true if the free space is under the limit. getLocalDirsForCleanup and getLogDirsForCleanup should have javadoc comments like the other methods.
Nit: The union utility function doesn't technically perform a union but rather a concatenation, and it'd be a little clearer if the name reflected that. Also the function should leverage the fact that it knows how big the ArrayList will be after the operations and give it the appropriate hint to its constructor to avoid reallocations. > NodeManager should identify failed disks becoming good back again > - > > Key: YARN-90 > URL: https://issues.apache.org/jira/browse/YARN-90 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Ravi Gummadi >Assignee: Varun Vasudev > Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, > YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, > apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, > apache-yarn-90.5.patch, apache-yarn-90.6.patch > > > MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes > down, it is marked as failed forever. To reuse that disk (after it becomes > good), NodeManager needs restart. This JIRA is to improve NodeManager to > reuse good disks(which could be bad some time back). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
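Two of the review points above (telling newly failed disks apart from recovered ones, and the "union" helper that really concatenates) can be sketched as follows. This is a minimal illustration only, not code from the YARN-90 patch: the class name DiskListSketch and the methods concat and addedSince are hypothetical, and directories are represented as plain strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class; none of these names come from the patch.
final class DiskListSketch {
  private DiskListSketch() {}

  /** Concatenates two lists, keeping order and duplicates (not a set union). */
  static List<String> concat(List<String> first, List<String> second) {
    // The final size is known up front, so the backing array is sized once.
    List<String> result = new ArrayList<>(first.size() + second.size());
    result.addAll(first);
    result.addAll(second);
    return result;
  }

  /** Returns the entries of current that are absent from previous, in order. */
  static Set<String> addedSince(List<String> previous, List<String> current) {
    Set<String> added = new LinkedHashSet<>(current);
    added.removeAll(previous);
    return added;
  }
}
```

With the old and new failed-disk lists in hand, addedSince(oldFailed, newFailed) would give the disks that just turned bad (candidates for the existing "Disk(s) failed" line), and addedSince(newFailed, oldFailed) the disks that became healthy again.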
[jira] [Commented] (YARN-2446) Using TimelineNamespace to shield the entities of a user
[ https://issues.apache.org/jira/browse/YARN-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152313#comment-14152313 ] Zhijie Shen commented on YARN-2446: --- bq. Get domains API: "If callerUGI is not the owner or the admin of the domain, we need to hide the details from him, and only allow him to see the ID": Why is that, I think we should just not allow non-owners to see anything. Is there a use case for this? bq. Based on the above decision, TestTimelineWebServices.testGetDomainsYarnACLsEnabled() should be changed to either validate that only IDs are visible or nothing is visible. The earlier rationale was to let users check whether a namespace ID is already occupied before putting one. I talked to Vinod offline: since this cannot prevent the race condition among multiple put requests anyway, let's simplify the behavior as suggested above. It's not related to the code in this patch. Let me file a separate Jira for it. bq. Shouldn't the server completely own DEFAULT_DOMAIN_ID, instead of letting anyone create it with potentially arbitrary permission? Yes, DEFAULT_DOMAIN_ID is owned by the timeline server. When TimelineDataManager is constructed, if the default domain has not been created before, the timeline server is going to create one. Users cannot create or modify the domain with DEFAULT_DOMAIN_ID. bq. testGetEntitiesWithYarnACLsEnabled() The test cases seem to be problematic. I've updated these test cases and added validation of cross-domain entity relationships. One more issue I've noticed is that after this patch, we should make the RM put the application metrics into a secured domain instead of the default one. Will file a Jira for it as well.
> Using TimelineNamespace to shield the entities of a user > > > Key: YARN-2446 > URL: https://issues.apache.org/jira/browse/YARN-2446 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-2446.1.patch, YARN-2446.2.patch, YARN-2446.3.patch > > > Given YARN-2102 adds TimelineNamespace, we can make use of it to shield the > entities, preventing them from being accessed or affected by other users' > operations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2446) Using TimelineNamespace to shield the entities of a user
[ https://issues.apache.org/jira/browse/YARN-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2446: -- Attachment: YARN-2446.3.patch > Using TimelineNamespace to shield the entities of a user > > > Key: YARN-2446 > URL: https://issues.apache.org/jira/browse/YARN-2446 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-2446.1.patch, YARN-2446.2.patch, YARN-2446.3.patch > > > Given YARN-2102 adds TimelineNamespace, we can make use of it to shield the > entities, preventing them from being accessed or affected by other users' > operations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2621) Simplify the output when the user doesn't have the access for getDomain(s)
Zhijie Shen created YARN-2621: - Summary: Simplify the output when the user doesn't have the access for getDomain(s) Key: YARN-2621 URL: https://issues.apache.org/jira/browse/YARN-2621 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Per discussion in [YARN-2446|https://issues.apache.org/jira/browse/YARN-2446?focusedCommentId=14151272&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14151272], we should simply reject the user if they don't have access to the domain(s), instead of returning the entity without detailed information. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
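The simplified behavior proposed in YARN-2621 can be pictured with a small sketch. Everything here is an assumption made for illustration: DomainAccessSketch, Domain, putDomain and getDomain are hypothetical names, a single admin user stands in for the real ACL checks, and SecurityException stands in for whatever error the timeline server would actually return.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a domain store; not the timeline server code.
final class DomainAccessSketch {
  static final class Domain {
    final String id;
    final String owner;

    Domain(String id, String owner) {
      this.id = id;
      this.owner = owner;
    }
  }

  private final Map<String, Domain> domains = new HashMap<>();
  private final String admin;

  DomainAccessSketch(String admin) {
    this.admin = admin;
  }

  void putDomain(String id, String owner) {
    domains.put(id, new Domain(id, owner));
  }

  /**
   * Returns the domain for its owner or the admin and rejects everyone else
   * outright, instead of returning a copy with the details stripped.
   * Returns null when no such domain exists.
   */
  Domain getDomain(String id, String caller) {
    Domain domain = domains.get(id);
    if (domain == null) {
      return null;
    }
    if (!caller.equals(domain.owner) && !caller.equals(admin)) {
      throw new SecurityException(caller + " may not read domain " + id);
    }
    return domain;
  }
}
```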
[jira] [Updated] (YARN-2622) RM should put the application related timeline data into a secured domain
[ https://issues.apache.org/jira/browse/YARN-2622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2622: -- Component/s: timelineserver > RM should put the application related timeline data into a secured domain > - > > Key: YARN-2622 > URL: https://issues.apache.org/jira/browse/YARN-2622 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: 2.6.0 >Reporter: Zhijie Shen >Assignee: Zhijie Shen > > After YARN-2446, SystemMetricsPublisher doesn't specify any domain, and the > application related timeline data is put into the default domain. It is not > secured. We should let the RM choose a secured domain for the system > metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2622) RM should put the application related timeline data into a secured domain
Zhijie Shen created YARN-2622: - Summary: RM should put the application related timeline data into a secured domain Key: YARN-2622 URL: https://issues.apache.org/jira/browse/YARN-2622 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen After YARN-2446, SystemMetricsPublisher doesn't specify any domain, and the application related timeline data is put into the default domain. It is not secured. We should let the RM choose a secured domain for the system metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2622) RM should put the application related timeline data into a secured domain
[ https://issues.apache.org/jira/browse/YARN-2622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2622: -- Affects Version/s: 2.6.0 > RM should put the application related timeline data into a secured domain > - > > Key: YARN-2622 > URL: https://issues.apache.org/jira/browse/YARN-2622 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: 2.6.0 >Reporter: Zhijie Shen >Assignee: Zhijie Shen > > After YARN-2446, SystemMetricsPublisher doesn't specify any domain, and the > application related timeline data is put into the default domain. It is not > secured. We should let the RM choose a secured domain for the system > metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2622) RM should put the application related timeline data into a secured domain
[ https://issues.apache.org/jira/browse/YARN-2622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2622: -- Target Version/s: 2.6.0 > RM should put the application related timeline data into a secured domain > - > > Key: YARN-2622 > URL: https://issues.apache.org/jira/browse/YARN-2622 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: 2.6.0 >Reporter: Zhijie Shen >Assignee: Zhijie Shen > > After YARN-2446, SystemMetricsPublisher doesn't specify any domain, and the > application related timeline data is put into the default domain. It is not > secured. We should let the RM choose a secured domain for the system > metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login
[ https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152348#comment-14152348 ] Jonathan Eagles commented on YARN-2606: --- Committed to trunk and branch-2 > Application History Server tries to access hdfs before doing secure login > - > > Key: YARN-2606 > URL: https://issues.apache.org/jira/browse/YARN-2606 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.6.0 >Reporter: Mit Desai >Assignee: Mit Desai > Fix For: 2.6.0 > > Attachments: YARN-2606.patch, YARN-2606.patch, YARN-2606.patch, > YARN-2606.patch > > > While testing the Application Timeline Server, the server would not come up > in a secure cluster, as it would keep trying to access hdfs without having > done the secure login. It would repeatedly try authenticating and finally hit > stack overflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login
[ https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152352#comment-14152352 ] Hudson commented on YARN-2606: -- SUCCESS: Integrated in Hadoop-trunk-Commit #6146 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6146/]) YARN-2606. Application History Server tries to access hdfs before doing secure login (Mit Desai via jeagles) (jeagles: rev e10eeaabce2a21840cfd5899493c9d2d4fe2e322) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestFileSystemApplicationHistoryStore.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/FileSystemApplicationHistoryStore.java > Application History Server tries to access hdfs before doing secure login > - > > Key: YARN-2606 > URL: https://issues.apache.org/jira/browse/YARN-2606 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.6.0 >Reporter: Mit Desai >Assignee: Mit Desai > Fix For: 2.6.0 > > Attachments: YARN-2606.patch, YARN-2606.patch, YARN-2606.patch, > YARN-2606.patch > > > While testing the Application Timeline Server, the server would not come up > in a secure cluster, as it would keep trying to access hdfs without having > done the secure login. It would repeatedly try authenticating and finally hit > stack overflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2179) Initial cache manager structure and context
[ https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152355#comment-14152355 ] Chris Trezzo commented on YARN-2179: [~vinodkv] Mocking YarnClient seems to be tricky due to it being an AbstractService. Would extending YarnClientImpl and only overriding methods I need to stub be a more reasonable approach? For this approach I would need to make the serviceStart and serviceStop methods in YarnClientImpl publicly visible for testing. It is still a little tricky due to the serviceStart and serviceStop methods of YarnClientImpl using ClientRMProxy. That is originally why I decided to just create a different dummy YarnClient implementation. Any thoughts on these alternative approaches, or am I just missing an easy way to mock YarnClient (which is highly possible)? > Initial cache manager structure and context > --- > > Key: YARN-2179 > URL: https://issues.apache.org/jira/browse/YARN-2179 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chris Trezzo >Assignee: Chris Trezzo > Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, > YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, > YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, YARN-2179-trunk-v8.patch, > YARN-2179-trunk-v9.patch > > > Implement the initial shared cache manager structure and context. The > SCMContext will be used by a number of manager services (i.e. the backing > store and the cleaner service). The AppChecker is used to gather the > currently running applications on SCM startup (necessary for an scm that is > backed by an in-memory store). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2446) Using TimelineNamespace to shield the entities of a user
[ https://issues.apache.org/jira/browse/YARN-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152362#comment-14152362 ] Hadoop QA commented on YARN-2446: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671870/YARN-2446.3.patch against trunk revision 7f0efe9. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5173//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5173//console This message is automatically generated. 
> Using TimelineNamespace to shield the entities of a user > > > Key: YARN-2446 > URL: https://issues.apache.org/jira/browse/YARN-2446 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-2446.1.patch, YARN-2446.2.patch, YARN-2446.3.patch > > > Given YARN-2102 adds TimelineNamespace, we can make use of it to shield the > entities, preventing them from being accessed or affected by other users' > operations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152360#comment-14152360 ] Karthik Kambatla commented on YARN-2566: We should probably have the same mechanism of picking directories in both the default and linux container-executors. It appears LCE picks these at random. Can we do the same here? I understand picking directories at random might result in a skew due to not-so-random randomness or different applications localizing different sizes of data. May be, in the future, we could pick the directory with most available space? > IOException happen in startLocalizer of DefaultContainerExecutor due to not > enough disk space for the first localDir. > - > > Key: YARN-2566 > URL: https://issues.apache.org/jira/browse/YARN-2566 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-2566.000.patch, YARN-2566.001.patch > > > startLocalizer in DefaultContainerExecutor will only use the first localDir > to copy the token file, if the copy is failed for first localDir due to not > enough disk space in the first localDir, the localization will be failed even > there are plenty of disk space in other localDirs. 
We see the following error > for this case: > {code} > 2014-09-13 23:33:25,171 WARN > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to > create app directory > /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 > java.io.IOException: mkdir of > /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed > at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) > at > org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) > at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) > at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) > at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) > at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) > at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) > 2014-09-13 23:33:25,185 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > Localizer failed > java.io.FileNotFoundException: File > file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 > does not exist > at > org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) > at > 
org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) > at > org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) > at > org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.(ChecksumFs.java:344) > at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) > at > org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) > at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) > at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) > at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) > at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) > at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021) > at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) > 2014-09-13 23:33:25,186 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container
[jira] [Updated] (YARN-2387) Resource Manager crashes with NPE due to lack of synchronization
[ https://issues.apache.org/jira/browse/YARN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai updated YARN-2387: Attachment: YARN-2387.patch Updated the patch > Resource Manager crashes with NPE due to lack of synchronization > > > Key: YARN-2387 > URL: https://issues.apache.org/jira/browse/YARN-2387 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0, 2.5.0 >Reporter: Mit Desai >Assignee: Mit Desai >Priority: Blocker > Attachments: YARN-2387.patch, YARN-2387.patch > > > We recently came across a 0.23 RM crashing with an NPE. Here is the > stacktrace for it. > {noformat} > 2014-08-06 05:56:52,165 [ResourceManager Event Processor] FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type NODE_UPDATE to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToBuilder(ContainerStatusPBImpl.java:61) > at > org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToProto(ContainerStatusPBImpl.java:68) > at > org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:53) > at > org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:34) > at > org.apache.hadoop.yarn.api.records.ProtoBase.toString(ProtoBase.java:55) > at java.lang.String.valueOf(String.java:2854) > at java.lang.StringBuilder.append(StringBuilder.java:128) > at > org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl.toString(ContainerPBImpl.java:353) > at java.lang.String.valueOf(String.java:2854) > at java.lang.StringBuilder.append(StringBuilder.java:128) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1405) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:790) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:602) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:688) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:82) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:339) > at java.lang.Thread.run(Thread.java:722) > 2014-08-06 05:56:52,166 [ResourceManager Event Processor] INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. > {noformat} > On investigating the issue we found that ContainerStatusPBImpl has > methods that are called by different threads and are not synchronized. The > 2.X code looks the same. > We need to make these methods synchronized so that we do not encounter this > problem in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
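The kind of race described in YARN-2387, and the proposed fix, can be shown with a minimal sketch. RecordSketch is a hypothetical stand-in and not ContainerStatusPBImpl: a StringBuilder plays the role of the protobuf builder and a String plays the role of the built message. If the synchronized keyword were removed, one thread could pass the null check in setDiagnostics while another thread, inside getProto, sets the builder to null, and the first thread would then hit a NullPointerException.

```java
// Hypothetical stand-in for a PBImpl-style record; names are illustrative.
final class RecordSketch {
  private StringBuilder builder;  // stand-in for the builder; null once built
  private String built = "";      // stand-in for the immutable built message

  synchronized void setDiagnostics(String value) {
    if (builder == null) {
      builder = new StringBuilder();
    }
    // Without synchronization, builder could become null between the check
    // above and the calls below.
    builder.setLength(0);
    builder.append(value);
  }

  synchronized String getProto() {
    if (builder != null) {
      built = builder.toString();
      builder = null;  // merged; the next setter call creates a new builder
    }
    return built;
  }

  @Override
  public synchronized String toString() {
    // Mirrors the stack trace above, where toString() ends up in getProto().
    return getProto();
  }
}
```

Declaring every method that touches the shared fields synchronized makes each check-then-act sequence atomic with respect to the others, at the cost of serializing callers on the record's monitor.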
[jira] [Created] (YARN-2623) Linux container executor only use the first local directory to copy token file in container-executor.c.
zhihai xu created YARN-2623: --- Summary: Linux container executor only use the first local directory to copy token file in container-executor.c. Key: YARN-2623 URL: https://issues.apache.org/jira/browse/YARN-2623 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Environment: Linux container executor only use the first local directory to copy token file in container-executor.c. Reporter: zhihai xu Assignee: zhihai xu The Linux container executor uses only the first local directory to copy the token file in container-executor.c. If it fails to copy the token file to the first local directory, a localization failure event occurs, even though it could copy the token file to another local directory successfully. The correct behavior would be to copy the token file to the next local directory if the first one fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
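The fallback described above (try the next local directory when the copy to the current one fails) can be sketched as follows. This is an illustration in Java using java.nio.file, not the container-executor.c change itself; TokenCopySketch and copyToFirstUsableDir are hypothetical names, and the handling of permissions and ownership that the real executor needs is left out.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Hypothetical helper; shows only the try-next-directory control flow.
final class TokenCopySketch {
  private TokenCopySketch() {}

  /**
   * Copies tokenFile into the first directory of localDirs that accepts it
   * and returns the path of the copy. Fails only if every directory fails.
   */
  static Path copyToFirstUsableDir(Path tokenFile, List<Path> localDirs)
      throws IOException {
    IOException lastFailure = null;
    for (Path dir : localDirs) {
      try {
        Files.createDirectories(dir);
        Path target = dir.resolve(tokenFile.getFileName());
        return Files.copy(tokenFile, target, StandardCopyOption.REPLACE_EXISTING);
      } catch (IOException e) {
        // Remember the failure (for example a full disk) and try the next one.
        lastFailure = e;
      }
    }
    throw new IOException(
        "Could not copy " + tokenFile + " to any local directory", lastFailure);
  }
}
```

The last failure is kept as the cause of the final exception so that the log still shows why the directories were rejected.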
[jira] [Commented] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152428#comment-14152428 ] zhihai xu commented on YARN-2566: - For the Linux container executor, this is done in the C file container-executor.c. It also picks the first directory to copy the token file; see the following code in container-executor.c:
{code}
char *primary_app_dir = NULL;
for(nm_root=local_dirs; *nm_root != NULL; ++nm_root) {
  char *app_dir = get_app_directory(*nm_root, user, app_id);
  if (app_dir == NULL) {
    // try the next one
  } else if (mkdirs(app_dir, permissions) != 0) {
    free(app_dir);
  } else if (primary_app_dir == NULL) {
    primary_app_dir = app_dir;
  } else {
    free(app_dir);
  }
}
char *cred_file_name = concatenate("%s/%s", "cred file", 2, primary_app_dir, basename(nmPrivate_credentials_file_copy));
if (copy_file(cred_file, nmPrivate_credentials_file, cred_file_name, S_IRUSR|S_IWUSR) != 0){
  free(nmPrivate_credentials_file_copy);
  return -1;
}
{code}
I created a new jira YARN-2623 for LCE. > IOException happen in startLocalizer of DefaultContainerExecutor due to not > enough disk space for the first localDir. > - > > Key: YARN-2566 > URL: https://issues.apache.org/jira/browse/YARN-2566 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-2566.000.patch, YARN-2566.001.patch > > > startLocalizer in DefaultContainerExecutor will only use the first localDir > to copy the token file, if the copy is failed for first localDir due to not > enough disk space in the first localDir, the localization will be failed even > there are plenty of disk space in other localDirs.
We see the following error > for this case: > {code} > 2014-09-13 23:33:25,171 WARN > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to > create app directory > /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 > java.io.IOException: mkdir of > /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed > at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) > at > org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) > at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) > at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) > at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) > at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) > at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) > 2014-09-13 23:33:25,185 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > Localizer failed > java.io.FileNotFoundException: File > file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 > does not exist > at > org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) > at > 
org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) > at > org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) > at > org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.(ChecksumFs.java:344) > at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) > at > org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) > at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) > at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) > at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) > at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) > at org.apache.hadoop.fs.FileContext$Util.copy(FileC
[jira] [Commented] (YARN-2387) Resource Manager crashes with NPE due to lack of synchronization
[ https://issues.apache.org/jira/browse/YARN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152446#comment-14152446 ] Hadoop QA commented on YARN-2387: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671880/YARN-2387.patch against trunk revision c88c6c5. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5174//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5174//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5174//console This message is automatically generated. 
> Resource Manager crashes with NPE due to lack of synchronization > > > Key: YARN-2387 > URL: https://issues.apache.org/jira/browse/YARN-2387 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0, 2.5.0 >Reporter: Mit Desai >Assignee: Mit Desai >Priority: Blocker > Attachments: YARN-2387.patch, YARN-2387.patch > > > We recently came across a 0.23 RM crashing with an NPE. Here is the > stacktrace for it. > {noformat} > 2014-08-06 05:56:52,165 [ResourceManager Event Processor] FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type NODE_UPDATE to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToBuilder(ContainerStatusPBImpl.java:61) > at > org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToProto(ContainerStatusPBImpl.java:68) > at > org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:53) > at > org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:34) > at > org.apache.hadoop.yarn.api.records.ProtoBase.toString(ProtoBase.java:55) > at java.lang.String.valueOf(String.java:2854) > at java.lang.StringBuilder.append(StringBuilder.java:128) > at > org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl.toString(ContainerPBImpl.java:353) > at java.lang.String.valueOf(String.java:2854) > at java.lang.StringBuilder.append(StringBuilder.java:128) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1405) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:790) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:602) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:688) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:82) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:339) > at java.lang.Thread.run(Thread.java:722) > 2014-08-06 05:56:52,166 [ResourceManager Event Processor] INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. > {noformat} > On investigating the issue we found that ContainerStatusPBImpl has > methods that are called by different threads and are not synchronized. The > 2.X code looks the same. > We need to make these methods synchronized so that we do not encounter this > problem in the future. -- This messag
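To illustrate the fix proposed in the description above, here is a minimal sketch of a record whose getter merges local state into a lazily created builder, with every state-touching method marked synchronized. The class SyncedStatusRecord and its fields are hypothetical stand-ins invented for illustration; this is not the real ContainerStatusPBImpl and not the attached patch.

```java
// Hypothetical stand-in for a protobuf-backed record (NOT ContainerStatusPBImpl).
// Without "synchronized", one thread could reset the builder while another
// thread is in the middle of mergeLocalToProto(), which is the kind of race
// the issue describes.
class SyncedStatusRecord {
    private StringBuilder builder = null;   // stands in for the protobuf builder
    private String localDiagnostics = null; // value set locally, not yet merged
    private String proto = "";              // stands in for the built proto
    private boolean viaProto = true;        // true when proto is up to date

    synchronized void setDiagnostics(String diagnostics) {
        maybeInitBuilder();
        localDiagnostics = diagnostics;
    }

    synchronized String getProto() {
        if (!viaProto) {
            mergeLocalToProto();
        }
        return proto;
    }

    // Only called while holding the lock on this object.
    private void maybeInitBuilder() {
        if (viaProto || builder == null) {
            builder = new StringBuilder(proto);
        }
        viaProto = false;
    }

    // Only called while holding the lock on this object.
    private void mergeLocalToProto() {
        if (viaProto) {
            maybeInitBuilder();
        }
        if (localDiagnostics != null) {
            builder.setLength(0);
            builder.append(localDiagnostics);
        }
        proto = builder.toString();
        viaProto = true;
    }

    public static void main(String[] args) {
        SyncedStatusRecord record = new SyncedStatusRecord();
        record.setDiagnostics("container completed");
        System.out.println(record.getProto());
    }
}
```

Synchronizing on the record itself is the simplest option; whether it is the right granularity for the real class is a question for the patch review.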
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152483#comment-14152483 ] Anubhav Dhoot commented on YARN-1879: - Nit in ProtocolHATestBase > method will be re-entry method will be re-entered >the entire logic test. the entire logic of the test? >APIs that added trigger flag. APIs that added Idempotent/AtOnce annotation? Looks good otherwise > Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol > --- > > Key: YARN-1879 > URL: https://issues.apache.org/jira/browse/YARN-1879 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Tsuyoshi OZAWA >Priority: Critical > Attachments: YARN-1879.1.patch, YARN-1879.1.patch, > YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, > YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.2-wip.patch, > YARN-1879.2.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, > YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2527) NPE in ApplicationACLsManager
[ https://issues.apache.org/jira/browse/YARN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152486#comment-14152486 ] Zhijie Shen commented on YARN-2527: --- The patch almost looks good to me, in particular the additional test cases for ApplicationACLsManager. Just one nit: 1. The logic here is a bit counter-intuitive. Can we just assign acls.get(applicationAccessType) to applicationACL only when it is not null? {code} applicationACL = acls.get(applicationAccessType); if (applicationACL == null) { if (LOG.isDebugEnabled()) { LOG.debug("ACL not found for access-type " + applicationAccessType + " for application " + applicationId + " owned by " + applicationOwner + ". Using default [" + YarnConfiguration.DEFAULT_YARN_APP_ACL + "]"); } applicationACL = DEFAULT_YARN_APP_ACL; {code} > NPE in ApplicationACLsManager > - > > Key: YARN-2527 > URL: https://issues.apache.org/jira/browse/YARN-2527 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.5.0 >Reporter: Benoy Antony >Assignee: Benoy Antony > Attachments: YARN-2527.patch, YARN-2527.patch > > > NPE in _ApplicationACLsManager_ can result in 500 Internal Server Error. > The relevant stacktrace snippet from the ResourceManager logs is as below > {code} > Caused by: java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > {code} > This issue was reported by [~miguenther]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
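The nit in the review comment above (only assign the looked-up ACL when it is not null) could look roughly like the sketch below. It uses a plain Map of strings instead of the real ACL types, and the class name AclLookupSketch, the method resolveAcl, and the placeholder value of DEFAULT_ACL are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class AclLookupSketch {
    // Placeholder only; the real default is defined in YarnConfiguration.
    static final String DEFAULT_ACL = "default-acl";

    // Start from the default and overwrite it only when a non-null ACL
    // is configured for the requested access type.
    static String resolveAcl(Map<String, String> acls, String accessType) {
        String applicationAcl = DEFAULT_ACL;
        String configured = (acls == null) ? null : acls.get(accessType);
        if (configured != null) {
            applicationAcl = configured;
        }
        return applicationAcl;
    }

    public static void main(String[] args) {
        Map<String, String> acls = new HashMap<>();
        acls.put("VIEW_APP", "alice,bob");
        System.out.println(resolveAcl(acls, "VIEW_APP"));
        System.out.println(resolveAcl(acls, "MODIFY_APP"));
    }
}
```

With this shape the variable never holds null, so the debug logging for the fallback case can be added in an else branch without changing the control flow.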
[jira] [Commented] (YARN-2301) Improve yarn container command
[ https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152503#comment-14152503 ] Naganarasimha G R commented on YARN-2301: - Attaching patch with corrected test cases. > Improve yarn container command > -- > > Key: YARN-2301 > URL: https://issues.apache.org/jira/browse/YARN-2301 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Naganarasimha G R > Labels: usability > Attachments: YARN-2301.01.patch > > > While running yarn container -list command, some > observations: > 1) the scheme (e.g. http/https ) before LOG-URL is missing > 2) the start-time is printed as milli seconds (e.g. 1405540544844). Better to > print as time format. > 3) finish-time is 0 if container is not yet finished. May be "N/A" > 4) May have an option to run as yarn container -list OR yarn > application -list-containers also. > As attempt Id is not shown on console, this is easier for user to just copy > the appId and run it, may also be useful for container-preserving AM > restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
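Observations 2) and 3) in the description above can be sketched as a small formatting helper: print the start-time as a readable timestamp and print "N/A" when the finish-time is 0. The class and method names are hypothetical, and the UTC zone and the date pattern are assumptions chosen to keep the example deterministic; the actual patch may format differently.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

class ContainerTimeFormat {
    // Assumed pattern and zone; not taken from the patch.
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneOffset.UTC);

    // Render epoch milliseconds as a human-readable timestamp.
    static String formatStart(long millis) {
        return FMT.format(Instant.ofEpochMilli(millis));
    }

    // A finish time of 0 means the container has not finished yet.
    static String formatFinish(long millis) {
        return millis == 0 ? "N/A" : formatStart(millis);
    }

    public static void main(String[] args) {
        System.out.println("Start-Time : " + formatStart(1405540544844L));
        System.out.println("Finish-Time : " + formatFinish(0L));
    }
}
```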
[jira] [Commented] (YARN-1063) Winutils needs ability to create task as domain user
[ https://issues.apache.org/jira/browse/YARN-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152504#comment-14152504 ] Craig Welch commented on YARN-1063: --- When looking this over to pick up context for 2198, I noticed a couple of things: libwinutils.c CreateLogonForUser - confusing name, makes me think a new account is being created - CreateLogonTokenForUser? LogonUser? TestWinUtils - can we add testing specific to security? > Winutils needs ability to create task as domain user > > > Key: YARN-1063 > URL: https://issues.apache.org/jira/browse/YARN-1063 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager > Environment: Windows >Reporter: Kyle Leckie >Assignee: Remus Rusanu > Labels: security, windows > Attachments: YARN-1063.2.patch, YARN-1063.3.patch, YARN-1063.4.patch, > YARN-1063.5.patch, YARN-1063.6.patch, YARN-1063.patch > > > h1. Summary: > Securing a Hadoop cluster requires constructing some form of security > boundary around the processes executed in YARN containers. Isolation based on > Windows user isolation seems most feasible. This approach is similar to the > approach taken by the existing LinuxContainerExecutor. The current patch to > winutils.exe adds the ability to create a process as a domain user. > h1. Alternative Methods considered: > h2. Process rights limited by security token restriction: > On Windows access decisions are made by examining the security token of a > process. It is possible to spawn a process with a restricted security token. > Any of the rights granted by SIDs of the default token may be restricted. It > is possible to see this in action by examining the security token of a > sandboxed process launched by a web browser. Typically the launched process > will have a fully restricted token and need to access machine resources > through a dedicated broker process that enforces a custom security policy.
> This broker process mechanism would break compatibility with the typical > Hadoop container process. The Container process must be able to utilize > standard function calls for disk and network IO. I performed some work > looking at ways to ACL the local files to the specific launched without > granting rights to other processes launched on the same machine but found > this to be an overly complex solution. > h2. Relying on APP containers: > Recent versions of windows have the ability to launch processes within an > isolated container. Application containers are supported for execution of > WinRT based executables. This method was ruled out due to the lack of > official support for standard windows APIs. At some point in the future > windows may support functionality similar to BSD jails or Linux containers, > at that point support for containers should be added. > h1. Create As User Feature Description: > h2. Usage: > A new sub command was added to the set of task commands. Here is the syntax: > winutils task createAsUser [TASKNAME] [USERNAME] [COMMAND_LINE] > Some notes: > * The username specified is in the format of "user@domain" > * The machine executing this command must be joined to the domain of the user > specified > * The domain controller must allow the account executing the command access > to the user information. For this join the account to the predefined group > labeled "Pre-Windows 2000 Compatible Access" > * The account running the command must have several rights on the local > machine. These can be managed manually using secpol.msc: > ** "Act as part of the operating system" - SE_TCB_NAME > ** "Replace a process-level token" - SE_ASSIGNPRIMARYTOKEN_NAME > ** "Adjust memory quotas for a process" - SE_INCREASE_QUOTA_NAME > * The launched process will not have rights to the desktop so will not be > able to display any information or create UI. > * The launched process will have no network credentials. 
Any access of > network resources that requires domain authentication will fail. > h2. Implementation: > Winutils performs the following steps: > # Enable the required privileges for the current process. > # Register as a trusted process with the Local Security Authority (LSA). > # Create a new logon for the user passed on the command line. > # Load/Create a profile on the local machine for the new logon. > # Create a new environment for the new logon. > # Launch the new process in a job with the task name specified and using the > created logon. > # Wait for the JOB to exit. > h2. Future work: > The following work was scoped out of this check in: > * Support for non-domain users or machine that are not domain joined. > * Support for privilege isolation by running the task launcher in a high > privilege service with access over an ACLed named pipe. -- T
[jira] [Commented] (YARN-1972) Implement secure Windows Container Executor
[ https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152507#comment-14152507 ] Craig Welch commented on YARN-1972: --- ContainerLaunch launchContainer - nit, why "userName" here, it's user everywhere else getLocalWrapperScriptBuilder - why not an override instead of conditional (see below wrt WindowsContainerExecutor) WindowsSecureContainerExecutor - I really think there should be a "WindowsContainerExecutor" and that we should go ahead and have differences move generally to inheritance rather than conditional (as far as reasonable/related to the change, and incrementally as we go forward, no need to boil the ocean, but it would be good to set a good foundation here) Windows specific logic, secure or not, should be based in this class. If the differences required for security specific logic are significant enough, by all means also have a WindowsSecureContainerExecutor which inherits from WindowsContainerExecutor. I think, as much as possible, the logic should be the same for both - with only the security specific functionality as a delta (right now, it looks like non-secure windows uses default for implementation, and may differ more from the "windows secure" than it should) > Implement secure Windows Container Executor > --- > > Key: YARN-1972 > URL: https://issues.apache.org/jira/browse/YARN-1972 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: Remus Rusanu >Assignee: Remus Rusanu > Labels: security, windows > Attachments: YARN-1972.1.patch, YARN-1972.2.patch, YARN-1972.3.patch, > YARN-1972.delta.4.patch, YARN-1972.delta.5.patch, YARN-1972.trunk.4.patch, > YARN-1972.trunk.5.patch > > > h1. 
Windows Secure Container Executor (WCE) > YARN-1063 adds the necessary infrastructure to launch a process as a domain > user as a solution for the problem of having a security boundary between > processes executed in YARN containers and the Hadoop services. The WCE is a > container executor that leverages the winutils capabilities introduced in > YARN-1063 and launches containers as an OS process running as the job > submitter user. A description of the S4U infrastructure used by YARN-1063 > and the alternatives considered can be read on that JIRA. > The WCE is based on the DefaultContainerExecutor. It relies on the DCE to > drive the flow of execution, but it overrides some methods to the effect of: > * changes the DCE created user cache directories to be owned by the job user > and by the nodemanager group. > * changes the actual container run command to use the 'createAsUser' command > of winutils task instead of 'create' > * runs the localization as a standalone process instead of an in-process Java > method call. This in turn relies on the winutils createAsUser feature to run > the localization as the job user. > > When compared to LinuxContainerExecutor (LCE), the WCE has some minor > differences: > * it does not delegate the creation of the user cache directories to the > native implementation. > * it does not require special handling to be able to delete user files > The approach on the WCE came from practical trial and error. I had > to iron out some issues around the Windows script shell limitations (command > line length) to get it to work, the biggest issue being the huge CLASSPATH > that is commonplace in Hadoop environment container executions. The job > container itself is already dealing with this via a so-called 'classpath > jar', see HADOOP-8899 and YARN-316 for details. For the WCE localizer launch > as a separate container the same issue had to be resolved and I used the same > 'classpath jar' approach. > h2.
Deployment Requirements > To use the WCE one needs to set the > `yarn.nodemanager.container-executor.class` to > `org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor` > and set the `yarn.nodemanager.windows-secure-container-executor.group` to a > Windows security group name that the nodemanager service principal is a > member of (equivalent of LCE > `yarn.nodemanager.linux-container-executor.group`). Unlike the LCE the WCE > does not require any configuration outside of Hadoop's own yarn-site.xml. > For WCE to work the nodemanager must run as a service principal that is a > member of the local Administrators group or LocalSystem. This is derived from > the need to invoke the LoadUserProfile API, which mentions these requirements in > its specification. This is in addition to the SE_TCB privilege mentioned in > YARN-1063, but this requirement will automatically imply that the SE_TCB > privilege is held by the nodemanager. For the Linux speakers in the audience, > the requirement is basically to run NM as root. > h2.
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152523#comment-14152523 ] Craig Welch commented on YARN-2198: --- pom.xml - don’t see a /etc/hadoop or a wsce-site.xml, missed? RawLocalFileSystem Is someone from HDFS looking at this? protected boolean mkOneDir(File p2f) throws IOException - nit, generalize arg name pls return (parent == null || parent2f.exists() || mkdirs(parent)) && + (mkOneDir(p2f) || p2f.isDirectory()); so, I don't get this logic, & believe it will fail if the path exists and is not a directory. Why not just do if p2f doesn't exist mkdirs(p2f)? seems much simpler, and drops the need for mkOneDir NativeIO Elevated class - I believe this is Windows specific, "WindowsElevated" or "ElevatedWindows"? Why doesn't it extend "Windows" - I don't think secure and insecure windows should become "wholly dissimilar" createTaskAsUser, killTask, ProcessStub: These aren't really "io", I think they should be factored out to their own process-specific class > Remove the need to run NodeManager as privileged account for Windows Secure > Container Executor > -- > > Key: YARN-2198 > URL: https://issues.apache.org/jira/browse/YARN-2198 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Remus Rusanu >Assignee: Remus Rusanu > Labels: security, windows > Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, > YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, > YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, > YARN-2198.separation.patch, YARN-2198.trunk.10.patch, > YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, > YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch > > > YARN-1972 introduces a Secure Windows Container Executor. However this > executor requires the process launching the container to be LocalSystem or a > member of the a local Administrators group. 
Since the process in question is > the NodeManager, the requirement translates to the entire NM to run as a > privileged account, a very large surface area to review and protect. > This proposal is to move the privileged operations into a dedicated NT > service. The NM can run as a low privilege account and communicate with the > privileged NT service when it needs to launch a container. This would reduce > the surface exposed to the high privileges. > There has to exist a secure, authenticated and authorized channel of > communication between the NM and the privileged NT service. Possible > alternatives are a new TCP endpoint, Java RPC etc. My proposal though would > be to use Windows LPC (Local Procedure Calls), which is a Windows platform > specific inter-process communication channel that satisfies all requirements > and is easy to deploy. The privileged NT service would register and listen on > an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop > with libwinutils which would host the LPC client code. The client would > connect to the LPC port (NtConnectPort) and send a message requesting a > container launch (NtRequestWaitReplyPort). LPC provides authentication and > the privileged NT service can use authorization API (AuthZ) to validate the > caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
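The simplification suggested in the RawLocalFileSystem review comment above (create the directory only if the path does not exist, and treat an existing non-directory as failure) can be sketched with plain java.io.File. The class MkdirsSketch and the method ensureDirectory are hypothetical, and the sketch deliberately ignores whatever reason the patch had for creating one directory at a time through mkOneDir.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

class MkdirsSketch {
    // Returns true when dir is a directory after the call.
    static boolean ensureDirectory(File dir) {
        if (dir.exists()) {
            // An existing regular file at this path is a failure.
            return dir.isDirectory();
        }
        // mkdirs() also creates missing parents; the second check covers the
        // case where something else created the directory in the meantime.
        return dir.mkdirs() || dir.isDirectory();
    }

    // Small self-check against a temporary directory.
    static boolean demo() {
        try {
            File tmp = Files.createTempDirectory("mkdirs-sketch").toFile();
            File nested = new File(tmp, "a" + File.separator + "b");
            boolean created = ensureDirectory(nested);   // creates a and a/b
            boolean again = ensureDirectory(nested);     // already a directory
            File plain = new File(tmp, "plain.txt");
            boolean fileMade = plain.createNewFile();
            boolean rejected = !ensureDirectory(plain);  // exists but is a file
            return created && again && fileMade && rejected;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("demo passed: " + demo());
    }
}
```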
[jira] [Commented] (YARN-2583) Modify the LogDeletionService to support Log aggregation for LRS
[ https://issues.apache.org/jira/browse/YARN-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152532#comment-14152532 ] Zhijie Shen commented on YARN-2583: --- Some thoughts about the log deletion service of LRS: 1. I'm not sure if it's good to do normal log deletion in AggregatedLogDeletionService, while deleting rolling logs in AppLogAggregatorImpl. AggregatedLogDeletionService (inside JHS) will still try to delete the whole log dir while the LRS is still running. 2. Usually we do retention by time instead of by size, and it's inconsistent between AggregatedLogDeletionService and AppLogAggregatorImpl. While AggregatedLogDeletionService keeps all the logs newer than T1, AppLogAggregatorImpl may have already deleted logs newer than T1 to limit the number of logs of the LRS. It's going to be unpredictable after what time the logs should be still available for access. 3. Another problem w.r.t. NM_LOG_AGGREGATION_RETAIN_RETENTION_SIZE_PER_APP is that the config is favor of the longer rollingIntervalSeconds. For example, NM_LOG_AGGREGATION_RETAIN_RETENTION_SIZE_PER_APP = 10. If a LRS sets rollingIntervalSeconds = 1D, after 10D, it's still going to keep all the logs. However, If the LRS sets rollingIntervalSeconds = 0.5D, after 10D, it can only keep the last 5D's logs, even though the amount of generated logs is the same. 4. Assume we want to do deletion in AppLogAggregatorImpl, should we do deletion first and uploading next to avoid that the number of logs can go beyond the cap temporally? > Modify the LogDeletionService to support Log aggregation for LRS > > > Key: YARN-2583 > URL: https://issues.apache.org/jira/browse/YARN-2583 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Xuan Gong >Assignee: Xuan Gong > Attachments: YARN-2583.1.patch > > > Currently, AggregatedLogDeletionService will delete old logs from HDFS. 
It > will check the cut-off-time: if all logs for this application are older than > this cut-off-time, the app-log-dir in HDFS will be deleted. This will not > work for LRS. We expect an LRS application to keep running for a long time. > Two different scenarios: > 1) If we configured the rollingIntervalSeconds, new log files will > always be uploaded to HDFS. The number of log files for this application will > become larger and larger, and no log files will ever be deleted. > 2) If we did not configure the rollingIntervalSeconds, the log file can only > be uploaded to HDFS after the application is finished. It is very possible > that the logs are uploaded after the cut-off-time. This will cause a problem > because at that time the app-log-dir for this application in HDFS has been > deleted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
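Point 3 of the review comment above is simple arithmetic: with a cap on the number of retained rolled logs, the time window that stays available is the cap multiplied by the rolling interval. The sketch below uses hypothetical names (RollingRetention, retainedDays) and expresses the interval in days to match the example in the comment.

```java
class RollingRetention {
    // Days of history that remain available when at most maxRetainedLogs
    // rolled logs are kept and a new log is rolled every rollingIntervalDays.
    static double retainedDays(int maxRetainedLogs, double rollingIntervalDays) {
        return maxRetainedLogs * rollingIntervalDays;
    }

    public static void main(String[] args) {
        // Cap of 10 logs, as in the comment's example.
        System.out.println("interval 1D   -> " + retainedDays(10, 1.0) + " days kept");
        System.out.println("interval 0.5D -> " + retainedDays(10, 0.5) + " days kept");
    }
}
```

With a cap of 10, a 1-day interval keeps 10 days of logs while a 0.5-day interval keeps only 5 days, even though the same volume of logs is produced, which is why a count-based cap favors longer rolling intervals.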
[jira] [Commented] (YARN-2598) GHS should show N/A instead of null for the inaccessible information
[ https://issues.apache.org/jira/browse/YARN-2598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152534#comment-14152534 ] Mayank Bansal commented on YARN-2598: - +1 LGTM. Will run tests and if succeeds will commit Thanks, Mayank > GHS should show N/A instead of null for the inaccessible information > > > Key: YARN-2598 > URL: https://issues.apache.org/jira/browse/YARN-2598 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: 2.6.0 >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-2598.1.patch > > > When the user doesn't have the access to an application, the app attempt > information is not visible to the user. ClientRMService will output N/A, but > GHS is showing null, which is not user-friendly. > {code} > 14/09/24 22:07:20 INFO impl.TimelineClientImpl: Timeline service address: > http://nn.example.com:8188/ws/v1/timeline/ > 14/09/24 22:07:20 INFO client.RMProxy: Connecting to ResourceManager at > nn.example.com/240.0.0.11:8050 > 14/09/24 22:07:21 INFO client.AHSProxy: Connecting to Application History > server at nn.example.com/240.0.0.11:10200 > Application Report : > Application-Id : application_1411586934799_0001 > Application-Name : Sleep job > Application-Type : MAPREDUCE > User : hrt_qa > Queue : default > Start-Time : 1411586956012 > Finish-Time : 1411586989169 > Progress : 100% > State : FINISHED > Final-State : SUCCEEDED > Tracking-URL : null > RPC Port : -1 > AM Host : null > Aggregate Resource Allocation : N/A > Diagnostics : null > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2301) Improve yarn container command
[ https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-2301: Attachment: YARN-2303.patch Attaching patch for the unit test failures. > Improve yarn container command > -- > > Key: YARN-2301 > URL: https://issues.apache.org/jira/browse/YARN-2301 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Naganarasimha G R > Labels: usability > Attachments: YARN-2301.01.patch, YARN-2303.patch > > > While running yarn container -list command, some > observations: > 1) the scheme (e.g. http/https ) before LOG-URL is missing > 2) the start-time is printed as milli seconds (e.g. 1405540544844). Better to > print as time format. > 3) finish-time is 0 if container is not yet finished. May be "N/A" > 4) May have an option to run as yarn container -list OR yarn > application -list-containers also. > As attempt Id is not shown on console, this is easier for user to just copy > the appId and run it, may also be useful for container-preserving AM > restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2320) Removing old application history store after we store the history data to timeline store
[ https://issues.apache.org/jira/browse/YARN-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152559#comment-14152559 ] Mayank Bansal commented on YARN-2320: - I think overall it looks OK, however I have to run it. Some small comments: shouldn't we use N/A in convertToApplicationAttemptReport instead of null? Similarly for convertToApplicationReport? Similarly for convertToContainerReport? > Removing old application history store after we store the history data to > timeline store > > > Key: YARN-2320 > URL: https://issues.apache.org/jira/browse/YARN-2320 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-2320.1.patch, YARN-2320.2.patch > > > After YARN-2033, we should deprecate application history store set. There's > no need to maintain two sets of store interfaces. In addition, we should > conclude the outstanding JIRAs under YARN-321 about the application history > store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2301) Improve yarn container command
[ https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152579#comment-14152579 ] Hadoop QA commented on YARN-2301: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671912/YARN-2303.patch against trunk revision c88c6c5. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:red}-1 javac{color}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5175//console This message is automatically generated. > Improve yarn container command > -- > > Key: YARN-2301 > URL: https://issues.apache.org/jira/browse/YARN-2301 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Naganarasimha G R > Labels: usability > Attachments: YARN-2301.01.patch, YARN-2303.patch > > > While running yarn container -list command, some > observations: > 1) the scheme (e.g. http/https ) before LOG-URL is missing > 2) the start-time is printed as milli seconds (e.g. 1405540544844). Better to > print as time format. > 3) finish-time is 0 if container is not yet finished. May be "N/A" > 4) May have an option to run as yarn container -list OR yarn > application -list-containers also. > As attempt Id is not shown on console, this is easier for user to just copy > the appId and run it, may also be useful for container-preserving AM > restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2468: Attachment: YARN-2468.9.patch > Log handling for LRS > > > Key: YARN-2468 > URL: https://issues.apache.org/jira/browse/YARN-2468 > Project: Hadoop YARN > Issue Type: Sub-task > Components: log-aggregation, nodemanager, resourcemanager >Reporter: Xuan Gong >Assignee: Xuan Gong > Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, > YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, > YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, > YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, > YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, > YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch, YARN-2468.9.patch > > > Currently, when application is finished, NM will start to do the log > aggregation. But for Long running service applications, this is not ideal. > The problems we have are: > 1) LRS applications are expected to run for a long time (weeks, months). > 2) Currently, all the container logs (from one NM) will be written into a > single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2179) Initial cache manager structure and context
[ https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Trezzo updated YARN-2179: --- Attachment: YARN-2179-trunk-v10.patch [~vinodkv] [~kasha] Attached is v10. Here is a new approach where I extend YarnClientImpl, stub out the service init/start/stop methods and mock the relevant methods to test. Does this seem like a cleaner approach to you guys? I tried to do a straight mocking without extending the abstract class, but continually ran into the issue that AbstractService.stateModel is initialized in the constructor. This creates a problem when trying to stub AbstractService.getServiceState(), which is required for the AbstractService to work with a CompositeService. Let me know if you don't like this approach or you know of an easier method and I can readjust the patch. Thanks! > Initial cache manager structure and context > --- > > Key: YARN-2179 > URL: https://issues.apache.org/jira/browse/YARN-2179 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chris Trezzo >Assignee: Chris Trezzo > Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v10.patch, > YARN-2179-trunk-v2.patch, YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, > YARN-2179-trunk-v5.patch, YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, > YARN-2179-trunk-v8.patch, YARN-2179-trunk-v9.patch > > > Implement the initial shared cache manager structure and context. The > SCMContext will be used by a number of manager services (i.e. the backing > store and the cleaner service). The AppChecker is used to gather the > currently running applications on SCM startup (necessary for an scm that is > backed by an in-memory store). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
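The approach described in the comment above (extend the concrete class, turn the lifecycle hooks into no-ops, and replace only the methods under test) can be illustrated without Hadoop on the classpath. This is a rough sketch under stated assumptions: BaseService, RealClient and StubbedClient are hypothetical stand-ins, not AbstractService or YarnClientImpl, and the hand-written override stands in for whatever mocking library the patch uses.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/** Simplified stand-in for a service base class; NOT Hadoop's AbstractService. */
abstract class BaseService {
    enum State { NOTINITED, INITED, STARTED, STOPPED }

    // Created in the constructor. A mock that bypasses the constructor would leave
    // this field null, and the final getter below could not be stubbed around it.
    private final AtomicReference<State> stateModel;

    protected BaseService() {
        stateModel = new AtomicReference<>(State.NOTINITED);
    }

    public final State getServiceState() { return stateModel.get(); }

    public final void init()  { serviceInit();  stateModel.set(State.INITED); }
    public final void start() { serviceStart(); stateModel.set(State.STARTED); }
    public final void stop()  { serviceStop();  stateModel.set(State.STOPPED); }

    protected abstract void serviceInit();
    protected abstract void serviceStart();
    protected abstract void serviceStop();
}

/** Stand-in for a real client whose lifecycle hooks would need a running cluster. */
class RealClient extends BaseService {
    @Override protected void serviceInit()  { throw new IllegalStateException("needs a cluster"); }
    @Override protected void serviceStart() { throw new IllegalStateException("needs a cluster"); }
    @Override protected void serviceStop()  { throw new IllegalStateException("needs a cluster"); }

    public List<String> getRunningApps() { throw new IllegalStateException("needs a cluster"); }
}

/** Test double: the real constructor still runs, hooks are no-ops, one method is canned. */
class StubbedClient extends RealClient {
    @Override protected void serviceInit()  { }
    @Override protected void serviceStart() { }
    @Override protected void serviceStop()  { }

    @Override public List<String> getRunningApps() { return Arrays.asList("app_1", "app_2"); }
}
```

Because StubbedClient goes through the normal constructor, getServiceState() keeps working, so the double could still take part in a composite lifecycle in this simplified model.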
[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152599#comment-14152599 ]

Xuan Gong commented on YARN-2468:
-

bq. Why is the test in TestAggregatedLogsBlock ignored?
We will have YARN-2583 for web UI related changes. This test fails right now, so I added @Ignore.

bq. pendingUploadFiles is really not needed to be a class field. Rename getNumOfLogFilesToUpload() to be getPendingLogFilesToUploadForThisContainer() and return the set of pending files. LogValue.write() can then take Set pendingLogFilesToUpload as one of the arguments.
I would like to check how many log files we can upload in this cycle. If the number is 0, we can skip this cycle. This check also has to happen before LogKey.write(); otherwise we would write the key without a value.

bq. If deletion of previously uploaded file takes a while and the file remains by the time of the next cycle, we will upload it again? It seems to be, let's validate this via a test-case.
No, it will not. That is why I save several pieces of information, such as allExistingFiles and alreadyUploadedFiles. We will use those to check whether the logs have been uploaded before.

bq. testLogAggregationServiceWithInterval: doLogAggregationOutOfBand + Thread.sleep() is unreliable. Use a clock and refactor AppLogAggregatorImpl to have the cyclic aggregation directly callable via a method.
Thread.sleep() is not used to trigger the log aggregation; it is used to make sure the logs have been uploaded to the remote directory. In any case, I have deleted those Thread.sleep() calls from the test cases.
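The bookkeeping discussed in the comment above (skip a cycle when nothing is pending, and never upload a file twice even if its local copy has not been deleted yet) can be sketched as a set difference. This is an illustration only, not the patch's code: the names allExistingFiles and alreadyUploadedFiles are taken from the comment, while the PendingLogFiles class, its methods, and the use of plain file names are assumptions.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration of per-cycle upload bookkeeping; not taken from the patch. */
final class PendingLogFiles {
    private PendingLogFiles() {}

    /**
     * Files that still need uploading: everything currently on disk minus
     * everything uploaded in an earlier cycle. A file whose deletion is still
     * in flight stays in alreadyUploadedFiles, so it is not picked up again.
     */
    static Set<String> pending(Set<String> allExistingFiles, Set<String> alreadyUploadedFiles) {
        Set<String> result = new TreeSet<>(allExistingFiles);
        result.removeAll(alreadyUploadedFiles);
        return result;
    }

    /** The key should only be written when at least one value will follow it. */
    static boolean shouldWriteThisCycle(Set<String> pendingFiles) {
        return !pendingFiles.isEmpty();
    }
}
```

Checking shouldWriteThisCycle before writing the key is what prevents a key without a value in this simplified model.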
[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152600#comment-14152600 ] Xuan Gong commented on YARN-2468: - New patch addressed all other comments > Log handling for LRS > > > Key: YARN-2468 > URL: https://issues.apache.org/jira/browse/YARN-2468 > Project: Hadoop YARN > Issue Type: Sub-task > Components: log-aggregation, nodemanager, resourcemanager >Reporter: Xuan Gong >Assignee: Xuan Gong > Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, > YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, > YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, > YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, > YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, > YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch, YARN-2468.9.patch > > > Currently, when application is finished, NM will start to do the log > aggregation. But for Long running service applications, this is not ideal. > The problems we have are: > 1) LRS applications are expected to run for a long time (weeks, months). > 2) Currently, all the container logs (from one NM) will be written into a > single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2624) Resource Localization fails on a secure cluster until nm are restarted
Anubhav Dhoot created YARN-2624: --- Summary: Resource Localization fails on a secure cluster until nm are restarted Key: YARN-2624 URL: https://issues.apache.org/jira/browse/YARN-2624 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot We have found resource localization fails on a secure cluster with following error in certain cases. This happens at some indeterminate point after which it will keep failing until NM is restarted. {noformat} INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs://:8020/tmp/hive-hive/hive_2014-09-29_14-55-45_184_6531377394813896912-12/-mr-10004/95a07b90-2448-48fc-bcda-cdb7400b4975/map.xml, 1412027745352, FILE, null },pending,[(container_1411670948067_0009_02_01)],443533288192637,DOWNLOADING} java.io.IOException: Rename cannot overwrite non empty destination directory /data/yarn/nm/filecache/27 at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716) at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:228) at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659) at org.apache.hadoop.fs.FileContext.rename(FileContext.java:906) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2624) Resource Localization fails on a secure cluster until nm are restarted
[ https://issues.apache.org/jira/browse/YARN-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-2624: Component/s: nodemanager > Resource Localization fails on a secure cluster until nm are restarted > -- > > Key: YARN-2624 > URL: https://issues.apache.org/jira/browse/YARN-2624 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > > We have found resource localization fails on a secure cluster with following > error in certain cases. This happens at some indeterminate point after which > it will keep failing until NM is restarted. > {noformat} > INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > Failed to download rsrc { { > hdfs://:8020/tmp/hive-hive/hive_2014-09-29_14-55-45_184_6531377394813896912-12/-mr-10004/95a07b90-2448-48fc-bcda-cdb7400b4975/map.xml, > 1412027745352, FILE, null > },pending,[(container_1411670948067_0009_02_01)],443533288192637,DOWNLOADING} > java.io.IOException: Rename cannot overwrite non empty destination directory > /data/yarn/nm/filecache/27 > at > org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716) > at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:228) > at > org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659) > at org.apache.hadoop.fs.FileContext.rename(FileContext.java:906) > at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366) > at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2179) Initial cache manager structure and context
[ https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152637#comment-14152637 ] Hadoop QA commented on YARN-2179: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671924/YARN-2179-trunk-v10.patch against trunk revision c88c6c5. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5176//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5176//console This message is automatically generated. 
[jira] [Updated] (YARN-2621) Simplify the output when the user doesn't have the access for getDomain(s)
[ https://issues.apache.org/jira/browse/YARN-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2621: -- Attachment: YARN-2621.1.patch Created a patch to fix the problem. > Simplify the output when the user doesn't have the access for getDomain(s) > --- > > Key: YARN-2621 > URL: https://issues.apache.org/jira/browse/YARN-2621 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: 2.6.0 >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-2621.1.patch > > > Per discussion in > [YARN-2446|https://issues.apache.org/jira/browse/YARN-2446?focusedCommentId=14151272&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14151272], > we should simply reject the user if it doesn't have access to the domain(s), > instead of returning the entity without detailed information.
[jira] [Commented] (YARN-2621) Simplify the output when the user doesn't have the access for getDomain(s)
[ https://issues.apache.org/jira/browse/YARN-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152695#comment-14152695 ] Hadoop QA commented on YARN-2621: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671931/YARN-2621.1.patch against trunk revision 0577eb3. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5177//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5177//console This message is automatically generated. 
[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152717#comment-14152717 ] Hadoop QA commented on YARN-2468: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671923/YARN-2468.9.patch against trunk revision 0577eb3. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5178//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5178//console This message is automatically generated. 
[jira] [Updated] (YARN-2387) Resource Manager crashes with NPE due to lack of synchronization
[ https://issues.apache.org/jira/browse/YARN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai updated YARN-2387: Attachment: YARN-2387.patch > Resource Manager crashes with NPE due to lack of synchronization > > > Key: YARN-2387 > URL: https://issues.apache.org/jira/browse/YARN-2387 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0, 2.5.0 >Reporter: Mit Desai >Assignee: Mit Desai >Priority: Blocker > Attachments: YARN-2387.patch, YARN-2387.patch, YARN-2387.patch > > > We recently came across a 0.23 RM crashing with an NPE. Here is the > stacktrace for it. > {noformat} > 2014-08-06 05:56:52,165 [ResourceManager Event Processor] FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type NODE_UPDATE to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToBuilder(ContainerStatusPBImpl.java:61) > at > org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToProto(ContainerStatusPBImpl.java:68) > at > org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:53) > at > org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:34) > at > org.apache.hadoop.yarn.api.records.ProtoBase.toString(ProtoBase.java:55) > at java.lang.String.valueOf(String.java:2854) > at java.lang.StringBuilder.append(StringBuilder.java:128) > at > org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl.toString(ContainerPBImpl.java:353) > at java.lang.String.valueOf(String.java:2854) > at java.lang.StringBuilder.append(StringBuilder.java:128) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1405) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:790) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:602) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:688) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:82) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:339) > at java.lang.Thread.run(Thread.java:722) > 2014-08-06 05:56:52,166 [ResourceManager Event Processor] INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. > {noformat} > On investigating the issue, we found that ContainerStatusPBImpl has > methods that are called by different threads and are not synchronized. Even > the 2.X code looks the same. > We need to make these methods synchronized so that we do not encounter this > problem in the future.
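To make the failure mode described above concrete, here is a minimal sketch of a record with a lazily created builder whose accessors are synchronized. It is an assumption-laden illustration, not the attached patch: StatusRecord is a hypothetical stand-in for ContainerStatusPBImpl, and a StringBuilder plays the role of the protobuf builder.

```java
/** Simplified stand-in for a builder-backed record; NOT the real ContainerStatusPBImpl. */
final class StatusRecord {
    private String proto = "";      // last merged, immutable snapshot
    private StringBuilder builder;  // non-null only while local changes are unmerged

    /** Records a local change; the builder is created on demand. */
    synchronized void setDiagnostics(String diagnostics) {
        if (builder == null) {
            builder = new StringBuilder();
        }
        builder.setLength(0);
        builder.append(diagnostics);
    }

    /**
     * Merges pending local changes into the snapshot and returns it.
     * Without synchronization, another thread could null out the builder
     * between the null check and the toString() call, which is the kind of
     * NullPointerException the stack trace in the issue shows.
     */
    synchronized String getProto() {
        if (builder != null) {
            proto = builder.toString();
            builder = null;
        }
        return proto;
    }

    /** toString() goes through getProto(), so it must hold the same lock. */
    @Override
    public synchronized String toString() {
        return getProto();
    }
}
```

Since toString() is reached from logging code on other threads, as in the trace, it is locked along with the mutators in this sketch.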
[jira] [Updated] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2468: Attachment: YARN-2468.9.1.patch > Log handling for LRS > > > Key: YARN-2468 > URL: https://issues.apache.org/jira/browse/YARN-2468 > Project: Hadoop YARN > Issue Type: Sub-task > Components: log-aggregation, nodemanager, resourcemanager >Reporter: Xuan Gong >Assignee: Xuan Gong > Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, > YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, > YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, > YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, > YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, > YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch, > YARN-2468.9.1.patch, YARN-2468.9.patch > > > Currently, when application is finished, NM will start to do the log > aggregation. But for Long running service applications, this is not ideal. > The problems we have are: > 1) LRS applications are expected to run for a long time (weeks, months). > 2) Currently, all the container logs (from one NM) will be written into a > single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2387) Resource Manager crashes with NPE due to lack of synchronization
[ https://issues.apache.org/jira/browse/YARN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152752#comment-14152752 ] Hadoop QA commented on YARN-2387: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671946/YARN-2387.patch against trunk revision 0577eb3. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5179//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5179//console This message is automatically generated. > Resource Manager crashes with NPE due to lack of synchronization > > > Key: YARN-2387 > URL: https://issues.apache.org/jira/browse/YARN-2387 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0, 2.5.0 >Reporter: Mit Desai >Assignee: Mit Desai >Priority: Blocker > Attachments: YARN-2387.patch, YARN-2387.patch, YARN-2387.patch > > > We recently came across a 0.23 RM crashing with an NPE. 
Here is the > stacktrace for it. > {noformat} > 2014-08-06 05:56:52,165 [ResourceManager Event Processor] FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type NODE_UPDATE to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToBuilder(ContainerStatusPBImpl.java:61) > at > org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToProto(ContainerStatusPBImpl.java:68) > at > org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:53) > at > org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:34) > at > org.apache.hadoop.yarn.api.records.ProtoBase.toString(ProtoBase.java:55) > at java.lang.String.valueOf(String.java:2854) > at java.lang.StringBuilder.append(StringBuilder.java:128) > at > org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl.toString(ContainerPBImpl.java:353) > at java.lang.String.valueOf(String.java:2854) > at java.lang.StringBuilder.append(StringBuilder.java:128) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1405) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:790) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:602) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:688) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:82) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:339) > at java.lang.Thread.run(Thread.java:722) > 2014-08-06 05:56:52,166 [ResourceManager 
Event Processor] INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. > {noformat} > On investigating the issue, we found that ContainerStatusPBImpl has > methods that are called by different threads and are not synchronized. Even > the 2.X code looks the same. > We need to make these methods synchronized so that we do not encounter this > problem in the future.
[jira] [Commented] (YARN-2545) RMApp should transit to FAILED when AM calls finishApplicationMaster with FAILED
[ https://issues.apache.org/jira/browse/YARN-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152804#comment-14152804 ] Hong Zhiguo commented on YARN-2545: --- [~leftnoteasy], [~jianhe], [~ozawa], please have a look: should we set the state of app/appAttempt to FAILED instead of FINISHED, or just count it as "Apps Failed" instead of "Apps Completed"? > RMApp should transit to FAILED when AM calls finishApplicationMaster with > FAILED > > > Key: YARN-2545 > URL: https://issues.apache.org/jira/browse/YARN-2545 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Hong Zhiguo >Assignee: Hong Zhiguo >Priority: Minor > > If the AM calls finishApplicationMaster with getFinalApplicationStatus()==FAILED, > and then exits, the corresponding RMApp and RMAppAttempt transition to state > FINISHED. > I think this is wrong and confusing. On the RM WebUI, this application is > displayed as "State=FINISHED, FinalStatus=FAILED", and is counted as "Apps > Completed", not as "Apps Failed".
[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152835#comment-14152835 ] Hadoop QA commented on YARN-2468: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671951/YARN-2468.9.1.patch against trunk revision 0577eb3. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5181//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5181//console This message is automatically generated. 
[jira] [Commented] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152842#comment-14152842 ]

zhihai xu commented on YARN-2566:
-

Picking the directory with the most available space is a good suggestion. I will implement it in my new patch. Thanks.

> IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
> -
>
> Key: YARN-2566
> URL: https://issues.apache.org/jira/browse/YARN-2566
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.5.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Attachments: YARN-2566.000.patch, YARN-2566.001.patch
>
>
> startLocalizer in DefaultContainerExecutor uses only the first localDir to copy the token file. If the copy fails because the first localDir does not have enough disk space, localization fails even if there is plenty of disk space in the other localDirs. We see the following error in this case:
> {code}
> 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004
> java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed
> at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062)
> at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157)
> at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
> at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721)
> at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717)
> at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
> at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717)
> at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426)
> at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522)
> at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
> 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed
> java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist
> at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
> at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
> at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
> at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111)
> at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76)
> at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:344)
> at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
> at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577)
> at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677)
> at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673)
> at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
> at org.apache.hadoop.fs.FileContext.create(FileContext.java:673)
> at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021)
> at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963)
> at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
> 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED
> 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: LOCA