[jira] [Commented] (YARN-1021) Yarn Scheduler Load Simulator
[ https://issues.apache.org/jira/browse/YARN-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084438#comment-14084438 ] JiankunLiu commented on YARN-1021:
--
Hi Wei Yan, I hit the following exception when I ran this command:
bin/slsrun.sh --input-rumen=sample-data/2jobs2min-rumen-jh.json --output-dir=sample_output
Exception in thread "pool-3-thread-1" java.lang.NullPointerException
	at org.apache.hadoop.yarn.sls.appmaster.AMSimulator.registerAM(AMSimulator.java:300)
	at org.apache.hadoop.yarn.sls.appmaster.AMSimulator.firstStep(AMSimulator.java:141)
	at org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator.firstStep(MRAMSimulator.java:146)
	at org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:84)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)
Can you do me a favor? Thank you very much.
Yarn Scheduler Load Simulator - Key: YARN-1021 URL: https://issues.apache.org/jira/browse/YARN-1021 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Reporter: Wei Yan Assignee: Wei Yan Fix For: 2.3.0 Attachments: YARN-1021-demo.tar.gz, YARN-1021-images.tar.gz, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.pdf
The Yarn Scheduler is a fertile area of interest with different implementations, e.g., Fifo, Capacity and Fair schedulers. Meanwhile, several optimizations are also made to improve scheduler performance for different scenarios and workloads. Each scheduler algorithm has its own set of features, and drives scheduling decisions by many factors, such as fairness, capacity guarantee, resource availability, etc. It is very important to evaluate a scheduler algorithm well before we deploy it in a production cluster. Unfortunately, it is currently non-trivial to evaluate a scheduling algorithm. Evaluating in a real cluster is always time- and cost-consuming, and it is also very hard to find a large-enough cluster. Hence, a simulator which can predict how well a scheduler algorithm works for some specific workload would be quite useful. We want to build a Scheduler Load Simulator to simulate large-scale Yarn clusters and application loads on a single machine. This would be invaluable in furthering Yarn by providing a tool for researchers and developers to prototype new scheduler features and predict their behavior and performance with a reasonable amount of confidence, thereby aiding rapid innovation. The simulator will exercise the real Yarn ResourceManager, removing the network factor by simulating NodeManagers and ApplicationMasters via handling and dispatching NM/AM heartbeat events from within the same JVM. To keep track of scheduler behavior and performance, a scheduler wrapper will wrap the real scheduler. The simulator will produce real-time metrics while executing, including:
* Resource usage for the whole cluster and each queue, which can be utilized to configure the cluster and each queue's capacity.
* The detailed application execution trace (recorded in relation to simulated time), which can be analyzed to understand/validate the scheduler behavior (individual jobs' turnaround time, throughput, fairness, capacity guarantee, etc.).
* Several key metrics of the scheduler algorithm, such as the time cost of each scheduler operation (allocate, handle, etc.), which can be utilized by Hadoop developers to find code hot spots and scalability limits.
The simulator will provide real-time charts showing the behavior of the scheduler and its performance. A short demo is available at http://www.youtube.com/watch?v=6thLi8q0qLE, showing how to use the simulator to simulate the Fair Scheduler and the Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2381) CLONE - Yarn Scheduler Load Simulator
JiankunLiu created YARN-2381: Summary: CLONE - Yarn Scheduler Load Simulator Key: YARN-2381 URL: https://issues.apache.org/jira/browse/YARN-2381 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Reporter: JiankunLiu Assignee: Wei Yan Fix For: 2.3.0
The Yarn Scheduler is a fertile area of interest with different implementations, e.g., Fifo, Capacity and Fair schedulers. Meanwhile, several optimizations are also made to improve scheduler performance for different scenarios and workloads. Each scheduler algorithm has its own set of features, and drives scheduling decisions by many factors, such as fairness, capacity guarantee, resource availability, etc. It is very important to evaluate a scheduler algorithm well before we deploy it in a production cluster. Unfortunately, it is currently non-trivial to evaluate a scheduling algorithm. Evaluating in a real cluster is always time- and cost-consuming, and it is also very hard to find a large-enough cluster. Hence, a simulator which can predict how well a scheduler algorithm works for some specific workload would be quite useful. We want to build a Scheduler Load Simulator to simulate large-scale Yarn clusters and application loads on a single machine. This would be invaluable in furthering Yarn by providing a tool for researchers and developers to prototype new scheduler features and predict their behavior and performance with a reasonable amount of confidence, thereby aiding rapid innovation. The simulator will exercise the real Yarn ResourceManager, removing the network factor by simulating NodeManagers and ApplicationMasters via handling and dispatching NM/AM heartbeat events from within the same JVM. To keep track of scheduler behavior and performance, a scheduler wrapper will wrap the real scheduler. The simulator will produce real-time metrics while executing, including:
* Resource usage for the whole cluster and each queue, which can be utilized to configure the cluster and each queue's capacity.
* The detailed application execution trace (recorded in relation to simulated time), which can be analyzed to understand/validate the scheduler behavior (individual jobs' turnaround time, throughput, fairness, capacity guarantee, etc.).
* Several key metrics of the scheduler algorithm, such as the time cost of each scheduler operation (allocate, handle, etc.), which can be utilized by Hadoop developers to find code hot spots and scalability limits.
The simulator will provide real-time charts showing the behavior of the scheduler and its performance. A short demo is available at http://www.youtube.com/watch?v=6thLi8q0qLE, showing how to use the simulator to simulate the Fair Scheduler and the Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2136) RMStateStore can explicitly handle store/update events when fenced
[ https://issues.apache.org/jira/browse/YARN-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084473#comment-14084473 ] Varun Saxena commented on YARN-2136: Thanks [~jianhe]. I will work on it. What I will do is as follows:
1. Add a FENCED state to the state machine.
2. When fencing happens, switch the state to FENCED. RM will switch to standby as before.
3. Do nothing for Store and Update events if we are in the FENCED state.
4. Reset the state to DEFAULT when this RM switches back to Active.
RMStateStore can explicitly handle store/update events when fenced -- Key: YARN-2136 URL: https://issues.apache.org/jira/browse/YARN-2136 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He RMStateStore can choose to handle/ignore store/update events upfront instead of invoking more ZK operations if the state store is in the fenced state. -- This message was sent by Atlassian JIRA (v6.2#6252)
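To make the four-step plan above concrete, here is a minimal sketch of the FENCED handling it describes. All names below (StoreState, handleStoreEvent, onFenced, onBecomeActive) are hypothetical illustrations, not the actual RMStateStore code:
{code}
// Hypothetical sketch of the four steps above; the real RMStateStore
// state machine and event types may differ.
enum StoreState { DEFAULT, FENCED }

private volatile StoreState storeState = StoreState.DEFAULT;

void handleStoreEvent(RMStateStoreEvent event) {
  if (storeState == StoreState.FENCED) {
    // Step 3: ignore store/update events instead of issuing more ZK operations
    return;
  }
  // ... normal store/update processing ...
}

void onFenced() {
  storeState = StoreState.FENCED;   // Step 2: RM switches to standby as before
}

void onBecomeActive() {
  storeState = StoreState.DEFAULT;  // Step 4: reset when this RM is active again
}
{code}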
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084540#comment-14084540 ] Remus Rusanu commented on YARN-2198:
bq. The NM never explicitly reads the stdout/stderr from the container; the streams are redirected today to their own log files as the user's code dictates (e.g., on Linux, bash -c 'user-command.sh 1>stdout 2>stderr'). Do we need to do this in the WinutilsProcessStubExecutor?
The existing org.apache.hadoop.util.Shell.runCommand() does read the stdout and stderr of the launched process. The WSCE should preserve this. For one, the container redirection occurs not in the launched process per se, but in the shell wrapper script, so any stdout/stderr output that occurs in the wrapper script *before* the redirect would be lost. But more importantly, the NM has another use for launchers: the localizer. This runs without a shell wrapper script and has no stdout/stderr redirect. So all in all I think there are valid reasons to capture the stdout/stderr. It is one of my goals to make WinutilsProcessStubExecutor reuse more of the existing Shell.ShellCommandExecutor implementation, rather than implement its own execute().
bq. (1) restricting users who can launch the special service and (2) restricting callers who can invoke the RPCs. So, this is done by the combination of the OS doing the authentication and the authorization being explicitly done by the service using the allowed list. Right?
That is correct. There are 3 access checks that occur:
- task.c: ValidateImpersonateAccessCheck checks that the desired runas account is allowed to be impersonated (this includes a check against the groups/accounts explicitly denied). This covers restricting the users who can launch the special service.
- service.c: RpcAuthorizeCallback checks that the caller (NM) has permission to invoke the RPC service. This covers restricting the callers who can invoke the RPCs.
- client.c: RpcCall_TaskCreateAsUser has an implicit access check in which the NM validates that the RPC call is serviced by an authorized service (by specifying RPC_C_QOS_CAPABILITIES_MUTUAL_AUTH and explicitly requesting the desired SID via pLocalSystemSid). This covers the 'mutual trust' part of the RPC call boundary (i.e., it is the complement of restricting the callers who can invoke the RPCs).
I will address the other comments with a newer patch, ETA Wednesday.
Remove the need to run NodeManager as privileged account for Windows Secure Container Executor -- Key: YARN-2198 URL: https://issues.apache.org/jira/browse/YARN-2198 Project: Hadoop YARN Issue Type: Improvement Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: YARN-2198.1.patch, YARN-2198.2.patch
YARN-1972 introduces a Secure Windows Container Executor. However, this executor requires the process launching the container to be LocalSystem or a member of the local Administrators group. Since the process in question is the NodeManager, the requirement translates to the entire NM running as a privileged account, a very large surface area to review and protect. This proposal is to move the privileged operations into a dedicated NT service. The NM can run as a low-privilege account and communicate with the privileged NT service when it needs to launch a container. This would reduce the surface exposed to the high privileges. There has to exist a secure, authenticated and authorized channel of communication between the NM and the privileged NT service. Possible alternatives are a new TCP endpoint, Java RPC, etc. My proposal, though, would be to use Windows LPC (Local Procedure Calls), which is a Windows platform-specific inter-process communication channel that satisfies all requirements and is easy to deploy. The privileged NT service would register and listen on an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with libwinutils, which would host the LPC client code. The client would connect to the LPC port (NtConnectPort) and send a message requesting a container launch (NtRequestWaitReplyPort). LPC provides authentication, and the privileged NT service can use the authorization API (AuthZ) to validate the caller. -- This message was sent by Atlassian JIRA (v6.2#6252)
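As background for the Shell.runCommand() point in the comment above, here is a minimal sketch of how the existing org.apache.hadoop.util.Shell.ShellCommandExecutor captures a launched process's stdout; the "hostname" command line is only an illustration:
{code}
import org.apache.hadoop.util.Shell;

// ShellCommandExecutor runs the command and buffers its stdout, which is
// the behavior the comment suggests WinutilsProcessStubExecutor should
// reuse rather than reimplement. The command below is just an example.
Shell.ShellCommandExecutor shexec =
    new Shell.ShellCommandExecutor(new String[] { "hostname" });
shexec.execute();                   // throws an exception on a non-zero exit
String output = shexec.getOutput(); // the captured stdout of the process
{code}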
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084547#comment-14084547 ] Remus Rusanu commented on YARN-2198:
bq. How the service's port is configured
The LRPC protocol has no port in the sense of TCP/UDP ports. It uses shared memory as a transport. The equivalent is the LRPC endpoint name, which in the patch is declared in the IDL file as {code}endpoint(ncalrpc:[hdpwinutilsvc]){code}. Using hardcoded (i.e., known by both client and server) endpoint names for local LRPC is common practice.
Remove the need to run NodeManager as privileged account for Windows Secure Container Executor -- Key: YARN-2198 URL: https://issues.apache.org/jira/browse/YARN-2198 Project: Hadoop YARN Issue Type: Improvement Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: YARN-2198.1.patch, YARN-2198.2.patch
YARN-1972 introduces a Secure Windows Container Executor. However, this executor requires the process launching the container to be LocalSystem or a member of the local Administrators group. Since the process in question is the NodeManager, the requirement translates to the entire NM running as a privileged account, a very large surface area to review and protect. This proposal is to move the privileged operations into a dedicated NT service. The NM can run as a low-privilege account and communicate with the privileged NT service when it needs to launch a container. This would reduce the surface exposed to the high privileges. There has to exist a secure, authenticated and authorized channel of communication between the NM and the privileged NT service. Possible alternatives are a new TCP endpoint, Java RPC, etc. My proposal, though, would be to use Windows LPC (Local Procedure Calls), which is a Windows platform-specific inter-process communication channel that satisfies all requirements and is easy to deploy. The privileged NT service would register and listen on an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with libwinutils, which would host the LPC client code. The client would connect to the LPC port (NtConnectPort) and send a message requesting a container launch (NtRequestWaitReplyPort). LPC provides authentication, and the privileged NT service can use the authorization API (AuthZ) to validate the caller. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2370) Fix comment in o.a.h.y.server.resourcemanager.schedulerAppSchedulingInfo
[ https://issues.apache.org/jira/browse/YARN-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084550#comment-14084550 ] Hudson commented on YARN-2370:
--
SUCCESS: Integrated in Hadoop-Yarn-trunk #633 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/633/])
YARN-2370. Fix comment in o.a.h.y.server.resourcemanager.schedulerAppSchedulingInfo (Contributed by Wenwu Peng) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1615469)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java
Fix comment in o.a.h.y.server.resourcemanager.schedulerAppSchedulingInfo Key: YARN-2370 URL: https://issues.apache.org/jira/browse/YARN-2370 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Priority: Trivial Labels: newbie Fix For: 2.6.0 Attachments: YARN-2370.0.patch
In method allocateOffSwitch of AppSchedulingInfo.java, only the OffRack request is updated, so the comment should be "Update cloned OffRack requests for recovery", not "Update cloned RackLocal and OffRack requests for recovery":
{code}
// Update cloned RackLocal and OffRack requests for recovery
resourceRequests.add(cloneResourceRequest(offSwitchRequest));
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
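For clarity, the corrected code from the description above would read:
{code}
// Update cloned OffRack requests for recovery
resourceRequests.add(cloneResourceRequest(offSwitchRequest));
{code}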
[jira] [Commented] (YARN-1354) Recover applications upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084554#comment-14084554 ] Junping Du commented on YARN-1354:
--
bq. It's possible if they explicitly handle it themselves (e.g.: write out schema versions and switch on it during load), however they don't typically do that. I agree that YARN-668 is the proper place to discuss how best to migrate Credentials and Tokens to a more manageable infrastructure for supporting upgrades.
Agree. We should have a built-in mechanism to handle serialization/deserialization across versions. Let's discuss more on YARN-668.
bq. Therefore I updated the reconnected node path to forward the applications running on the node and have the RM inform the NM to finish the application if it is no longer active.
Sounds good. The newly added code and tests make sense to me. +1. Will commit it shortly.
Recover applications upon nodemanager restart - Key: YARN-1354 URL: https://issues.apache.org/jira/browse/YARN-1354 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1354-v1.patch, YARN-1354-v2-and-YARN-1987-and-YARN-1362.patch, YARN-1354-v3.patch, YARN-1354-v4.patch, YARN-1354-v5.patch, YARN-1354-v6.patch The set of active applications in the nodemanager context needs to be recovered for a work-preserving nodemanager restart. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1354) Recover applications upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084647#comment-14084647 ] Hudson commented on YARN-1354:
--
FAILURE: Integrated in Hadoop-trunk-Commit #6007 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6007/])
YARN-1354. Recover applications upon nodemanager restart. (Contributed by Jason Lowe) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1615550)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationImpl.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMNullStateStoreService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMStateStoreService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/proto/yarn_server_nodemanager_recovery.proto
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestContainerManagerRecovery.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/TestContainer.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMMemoryStateStoreService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeReconnectEvent.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
Recover applications upon nodemanager restart - Key: YARN-1354 URL: https://issues.apache.org/jira/browse/YARN-1354 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1354-v1.patch, YARN-1354-v2-and-YARN-1987-and-YARN-1362.patch, YARN-1354-v3.patch, YARN-1354-v4.patch, YARN-1354-v5.patch, YARN-1354-v6.patch The set of active applications in the nodemanager context needs to be recovered for a work-preserving nodemanager restart. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2370) Fix comment in o.a.h.y.server.resourcemanager.schedulerAppSchedulingInfo
[ https://issues.apache.org/jira/browse/YARN-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084666#comment-14084666 ] Hudson commented on YARN-2370:
--
FAILURE: Integrated in Hadoop-Hdfs-trunk #1827 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1827/])
YARN-2370. Fix comment in o.a.h.y.server.resourcemanager.schedulerAppSchedulingInfo (Contributed by Wenwu Peng) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1615469)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java
Fix comment in o.a.h.y.server.resourcemanager.schedulerAppSchedulingInfo Key: YARN-2370 URL: https://issues.apache.org/jira/browse/YARN-2370 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Priority: Trivial Labels: newbie Fix For: 2.6.0 Attachments: YARN-2370.0.patch
In method allocateOffSwitch of AppSchedulingInfo.java, only the OffRack request is updated, so the comment should be "Update cloned OffRack requests for recovery", not "Update cloned RackLocal and OffRack requests for recovery":
{code}
// Update cloned RackLocal and OffRack requests for recovery
resourceRequests.add(cloneResourceRequest(offSwitchRequest));
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084670#comment-14084670 ] Junping Du commented on YARN-2033: -- Thanks [~zjshen] for uploading a proposal and keeping the patches updated! The proposal sounds like a good direction. I just took a glance at the latest patch, which looks pretty big to me and mixes code refactoring (moving the location of TimelineClient, etc.) with side logic, like adding originalURL to ApplicationAttemptReport. Can we break this patch into several pieces and create a sub-JIRA for each one? That would be much easier to review. Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch Having two different stores isn't amenable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try to retain most of the client-side interfaces as close to what we have today. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1021) Yarn Scheduler Load Simulator
[ https://issues.apache.org/jira/browse/YARN-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084689#comment-14084689 ] JiankunLiu commented on YARN-1021:
--
I found that there is no exception on hadoop-2.3.0, but I hit the following exception on hadoop-2.4.0:
Exception in thread "pool-3-thread-1" java.lang.NullPointerException
	at org.apache.hadoop.yarn.sls.appmaster.AMSimulator.registerAM(AMSimulator.java:300)
	at org.apache.hadoop.yarn.sls.appmaster.AMSimulator.firstStep(AMSimulator.java:141)
	at org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator.firstStep(MRAMSimulator.java:146)
	at org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:84)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)
What should I do?
Yarn Scheduler Load Simulator - Key: YARN-1021 URL: https://issues.apache.org/jira/browse/YARN-1021 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Reporter: Wei Yan Assignee: Wei Yan Fix For: 2.3.0 Attachments: YARN-1021-demo.tar.gz, YARN-1021-images.tar.gz, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.pdf
The Yarn Scheduler is a fertile area of interest with different implementations, e.g., Fifo, Capacity and Fair schedulers. Meanwhile, several optimizations are also made to improve scheduler performance for different scenarios and workloads. Each scheduler algorithm has its own set of features, and drives scheduling decisions by many factors, such as fairness, capacity guarantee, resource availability, etc. It is very important to evaluate a scheduler algorithm well before we deploy it in a production cluster. Unfortunately, it is currently non-trivial to evaluate a scheduling algorithm. Evaluating in a real cluster is always time- and cost-consuming, and it is also very hard to find a large-enough cluster. Hence, a simulator which can predict how well a scheduler algorithm works for some specific workload would be quite useful. We want to build a Scheduler Load Simulator to simulate large-scale Yarn clusters and application loads on a single machine. This would be invaluable in furthering Yarn by providing a tool for researchers and developers to prototype new scheduler features and predict their behavior and performance with a reasonable amount of confidence, thereby aiding rapid innovation. The simulator will exercise the real Yarn ResourceManager, removing the network factor by simulating NodeManagers and ApplicationMasters via handling and dispatching NM/AM heartbeat events from within the same JVM. To keep track of scheduler behavior and performance, a scheduler wrapper will wrap the real scheduler. The simulator will produce real-time metrics while executing, including:
* Resource usage for the whole cluster and each queue, which can be utilized to configure the cluster and each queue's capacity.
* The detailed application execution trace (recorded in relation to simulated time), which can be analyzed to understand/validate the scheduler behavior (individual jobs' turnaround time, throughput, fairness, capacity guarantee, etc.).
* Several key metrics of the scheduler algorithm, such as the time cost of each scheduler operation (allocate, handle, etc.), which can be utilized by Hadoop developers to find code hot spots and scalability limits.
The simulator will provide real-time charts showing the behavior of the scheduler and its performance. A short demo is available at http://www.youtube.com/watch?v=6thLi8q0qLE, showing how to use the simulator to simulate the Fair Scheduler and the Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2370) Fix comment in o.a.h.y.server.resourcemanager.schedulerAppSchedulingInfo
[ https://issues.apache.org/jira/browse/YARN-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084740#comment-14084740 ] Hudson commented on YARN-2370:
--
SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1852 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1852/])
YARN-2370. Fix comment in o.a.h.y.server.resourcemanager.schedulerAppSchedulingInfo (Contributed by Wenwu Peng) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1615469)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java
Fix comment in o.a.h.y.server.resourcemanager.schedulerAppSchedulingInfo Key: YARN-2370 URL: https://issues.apache.org/jira/browse/YARN-2370 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Priority: Trivial Labels: newbie Fix For: 2.6.0 Attachments: YARN-2370.0.patch
In method allocateOffSwitch of AppSchedulingInfo.java, only the OffRack request is updated, so the comment should be "Update cloned OffRack requests for recovery", not "Update cloned RackLocal and OffRack requests for recovery":
{code}
// Update cloned RackLocal and OffRack requests for recovery
resourceRequests.add(cloneResourceRequest(offSwitchRequest));
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1021) Yarn Scheduler Load Simulator
[ https://issues.apache.org/jira/browse/YARN-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084797#comment-14084797 ] Wei Yan commented on YARN-1021: --- [~Jackliu91], yes, the simulator hits that exception in hadoop-2.4.0; the fix is provided in YARN-1726. The upcoming hadoop-2.5.0 release will include that patch, and the simulator should run well. Yarn Scheduler Load Simulator - Key: YARN-1021 URL: https://issues.apache.org/jira/browse/YARN-1021 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Reporter: Wei Yan Assignee: Wei Yan Fix For: 2.3.0 Attachments: YARN-1021-demo.tar.gz, YARN-1021-images.tar.gz, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.pdf
The Yarn Scheduler is a fertile area of interest with different implementations, e.g., Fifo, Capacity and Fair schedulers. Meanwhile, several optimizations are also made to improve scheduler performance for different scenarios and workloads. Each scheduler algorithm has its own set of features, and drives scheduling decisions by many factors, such as fairness, capacity guarantee, resource availability, etc. It is very important to evaluate a scheduler algorithm well before we deploy it in a production cluster. Unfortunately, it is currently non-trivial to evaluate a scheduling algorithm. Evaluating in a real cluster is always time- and cost-consuming, and it is also very hard to find a large-enough cluster. Hence, a simulator which can predict how well a scheduler algorithm works for some specific workload would be quite useful. We want to build a Scheduler Load Simulator to simulate large-scale Yarn clusters and application loads on a single machine. This would be invaluable in furthering Yarn by providing a tool for researchers and developers to prototype new scheduler features and predict their behavior and performance with a reasonable amount of confidence, thereby aiding rapid innovation. The simulator will exercise the real Yarn ResourceManager, removing the network factor by simulating NodeManagers and ApplicationMasters via handling and dispatching NM/AM heartbeat events from within the same JVM. To keep track of scheduler behavior and performance, a scheduler wrapper will wrap the real scheduler. The simulator will produce real-time metrics while executing, including:
* Resource usage for the whole cluster and each queue, which can be utilized to configure the cluster and each queue's capacity.
* The detailed application execution trace (recorded in relation to simulated time), which can be analyzed to understand/validate the scheduler behavior (individual jobs' turnaround time, throughput, fairness, capacity guarantee, etc.).
* Several key metrics of the scheduler algorithm, such as the time cost of each scheduler operation (allocate, handle, etc.), which can be utilized by Hadoop developers to find code hot spots and scalability limits.
The simulator will provide real-time charts showing the behavior of the scheduler and its performance. A short demo is available at http://www.youtube.com/watch?v=6thLi8q0qLE, showing how to use the simulator to simulate the Fair Scheduler and the Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2381) CLONE - Yarn Scheduler Load Simulator
[ https://issues.apache.org/jira/browse/YARN-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2381: -- Assignee: (was: Wei Yan) CLONE - Yarn Scheduler Load Simulator - Key: YARN-2381 URL: https://issues.apache.org/jira/browse/YARN-2381 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Reporter: JiankunLiu Fix For: 2.3.0
The Yarn Scheduler is a fertile area of interest with different implementations, e.g., Fifo, Capacity and Fair schedulers. Meanwhile, several optimizations are also made to improve scheduler performance for different scenarios and workloads. Each scheduler algorithm has its own set of features, and drives scheduling decisions by many factors, such as fairness, capacity guarantee, resource availability, etc. It is very important to evaluate a scheduler algorithm well before we deploy it in a production cluster. Unfortunately, it is currently non-trivial to evaluate a scheduling algorithm. Evaluating in a real cluster is always time- and cost-consuming, and it is also very hard to find a large-enough cluster. Hence, a simulator which can predict how well a scheduler algorithm works for some specific workload would be quite useful. We want to build a Scheduler Load Simulator to simulate large-scale Yarn clusters and application loads on a single machine. This would be invaluable in furthering Yarn by providing a tool for researchers and developers to prototype new scheduler features and predict their behavior and performance with a reasonable amount of confidence, thereby aiding rapid innovation. The simulator will exercise the real Yarn ResourceManager, removing the network factor by simulating NodeManagers and ApplicationMasters via handling and dispatching NM/AM heartbeat events from within the same JVM. To keep track of scheduler behavior and performance, a scheduler wrapper will wrap the real scheduler. The simulator will produce real-time metrics while executing, including:
* Resource usage for the whole cluster and each queue, which can be utilized to configure the cluster and each queue's capacity.
* The detailed application execution trace (recorded in relation to simulated time), which can be analyzed to understand/validate the scheduler behavior (individual jobs' turnaround time, throughput, fairness, capacity guarantee, etc.).
* Several key metrics of the scheduler algorithm, such as the time cost of each scheduler operation (allocate, handle, etc.), which can be utilized by Hadoop developers to find code hot spots and scalability limits.
The simulator will provide real-time charts showing the behavior of the scheduler and its performance. A short demo is available at http://www.youtube.com/watch?v=6thLi8q0qLE, showing how to use the simulator to simulate the Fair Scheduler and the Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2381) CLONE - Yarn Scheduler Load Simulator
[ https://issues.apache.org/jira/browse/YARN-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084804#comment-14084804 ] Wei Yan commented on YARN-2381: --- [~Jackliu91], what is the purpose of this jira? CLONE - Yarn Scheduler Load Simulator - Key: YARN-2381 URL: https://issues.apache.org/jira/browse/YARN-2381 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Reporter: JiankunLiu Fix For: 2.3.0
The Yarn Scheduler is a fertile area of interest with different implementations, e.g., Fifo, Capacity and Fair schedulers. Meanwhile, several optimizations are also made to improve scheduler performance for different scenarios and workloads. Each scheduler algorithm has its own set of features, and drives scheduling decisions by many factors, such as fairness, capacity guarantee, resource availability, etc. It is very important to evaluate a scheduler algorithm well before we deploy it in a production cluster. Unfortunately, it is currently non-trivial to evaluate a scheduling algorithm. Evaluating in a real cluster is always time- and cost-consuming, and it is also very hard to find a large-enough cluster. Hence, a simulator which can predict how well a scheduler algorithm works for some specific workload would be quite useful. We want to build a Scheduler Load Simulator to simulate large-scale Yarn clusters and application loads on a single machine. This would be invaluable in furthering Yarn by providing a tool for researchers and developers to prototype new scheduler features and predict their behavior and performance with a reasonable amount of confidence, thereby aiding rapid innovation. The simulator will exercise the real Yarn ResourceManager, removing the network factor by simulating NodeManagers and ApplicationMasters via handling and dispatching NM/AM heartbeat events from within the same JVM. To keep track of scheduler behavior and performance, a scheduler wrapper will wrap the real scheduler. The simulator will produce real-time metrics while executing, including:
* Resource usage for the whole cluster and each queue, which can be utilized to configure the cluster and each queue's capacity.
* The detailed application execution trace (recorded in relation to simulated time), which can be analyzed to understand/validate the scheduler behavior (individual jobs' turnaround time, throughput, fairness, capacity guarantee, etc.).
* Several key metrics of the scheduler algorithm, such as the time cost of each scheduler operation (allocate, handle, etc.), which can be utilized by Hadoop developers to find code hot spots and scalability limits.
The simulator will provide real-time charts showing the behavior of the scheduler and its performance. A short demo is available at http://www.youtube.com/watch?v=6thLi8q0qLE, showing how to use the simulator to simulate the Fair Scheduler and the Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084869#comment-14084869 ] Zhijie Shen commented on YARN-2033: --- [~djp], thanks for volunteering to review this big patch.
bq. I just took a glance at the latest patch, which looks pretty big to me and mixes code refactoring (moving the location of TimelineClient, etc.)
As for the big refactoring work, I already have two jiras for it: YARN-2298 and YARN-2302. Please note that on this jira I posted both the patch for generic-history in the timeline store only, and the uber patch including all the code changes (with the _ALL suffix).
bq. side logic, like adding originalURL to ApplicationAttemptReport. Can we break this patch into several pieces and create a sub-JIRA for each one?
I prefer not to do it right now, because moving the minor change out is not going to shrink this patch significantly. Instead, I'd have to maintain a longer dependency chain of outstanding patches. Do you agree?
Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch Having two different stores isn't amenable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try to retain most of the client-side interfaces as close to what we have today. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2136) RMStateStore can explicitly handle store/update events when fenced
[ https://issues.apache.org/jira/browse/YARN-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084971#comment-14084971 ] Jian He commented on YARN-2136: --- On a second check, the state-store is actually already stopped/closed when the RM gets fenced, which means no more store operations can be performed once the RM is fenced. [~kkambatl], is this the case? RMStateStore can explicitly handle store/update events when fenced -- Key: YARN-2136 URL: https://issues.apache.org/jira/browse/YARN-2136 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He RMStateStore can choose to handle/ignore store/update events upfront instead of invoking more ZK operations if the state store is in the fenced state. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1954) Add waitFor to AMRMClient(Async)
[ https://issues.apache.org/jira/browse/YARN-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084981#comment-14084981 ] Zhijie Shen commented on YARN-1954: --- My concern is that if users want an even larger checkEveryMillis, they may not get it, and the log is actually recorded at 6ms silently. If 6ms is the interval cap, we should somehow let users know. Add waitFor to AMRMClient(Async) Key: YARN-1954 URL: https://issues.apache.org/jira/browse/YARN-1954 Project: Hadoop YARN Issue Type: New Feature Components: client Affects Versions: 3.0.0, 2.4.0 Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-1954.1.patch, YARN-1954.2.patch, YARN-1954.3.patch, YARN-1954.4.patch, YARN-1954.4.patch Recently, I saw some use cases of AMRMClient(Async). The painful thing is that the main non-daemon thread has to sit in a dummy loop to prevent the AM process from exiting before all the tasks are done, while unregistration is triggered on a separate daemon thread by callback methods (in particular when using AMRMClientAsync). IMHO, it would be beneficial to add a waitFor method to AMRMClient(Async) to block the AM until unregistration or a user-supplied check point, such that users don't need to write the loop themselves. -- This message was sent by Atlassian JIRA (v6.2#6252)
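For reference, a sketch of how an AM might use the waitFor API under discussion instead of a dummy loop. The shape shown here (a Guava Supplier check plus a checkEveryMillis interval) follows the discussion above; the exact signature in the committed patch may differ, and amrmClientAsync is an assumed, already-started client instance:
{code}
import java.util.concurrent.atomic.AtomicBoolean;
import com.google.common.base.Supplier;

final AtomicBoolean allTasksDone = new AtomicBoolean(false);
// ... AMRMClientAsync callbacks set allTasksDone when the work finishes ...

int checkEveryMillis = 1000; // how often the user-supplied check is evaluated
amrmClientAsync.waitFor(new Supplier<Boolean>() {
  @Override
  public Boolean get() {
    return allTasksDone.get(); // the user-supplied check point
  }
}, checkEveryMillis);
// waitFor returns once the check passes, so the main thread can unregister
{code}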
[jira] [Commented] (YARN-2374) YARN trunk build failing TestDistributedShell.testDSShell
[ https://issues.apache.org/jira/browse/YARN-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084986#comment-14084986 ] Jian He commented on YARN-2374: --- Looks good. [~vvasudev], can you add some comments to the checkHostname method to explain what it's doing? thx
YARN trunk build failing TestDistributedShell.testDSShell - Key: YARN-2374 URL: https://issues.apache.org/jira/browse/YARN-2374 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2374.0.patch, apache-yarn-2374.1.patch, apache-yarn-2374.2.patch, apache-yarn-2374.3.patch The YARN trunk build has been failing for the last few days in the distributed shell module.
{noformat}
testDSShell(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell) Time elapsed: 27.269 sec FAILURE!
java.lang.AssertionError: null
	at org.junit.Assert.fail(Assert.java:86)
	at org.junit.Assert.assertTrue(Assert.java:41)
	at org.junit.Assert.assertTrue(Assert.java:52)
	at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShell(TestDistributedShell.java:188)
{noformat}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-857) Errors when localizing end up with the localization failure not being seen by the NM
[ https://issues.apache.org/jira/browse/YARN-857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-857: - Assignee: Vinod Kumar Vavilapalli (was: Hitesh Shah)
Errors when localizing end up with the localization failure not being seen by the NM Key: YARN-857 URL: https://issues.apache.org/jira/browse/YARN-857 Project: Hadoop YARN Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Vinod Kumar Vavilapalli Attachments: YARN-857.1.patch, YARN-857.2.patch
	at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:62)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:235)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:106)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:978)
Traced this down to DefaultExecutor which does not look at the exit code for the localizer.
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2301) Improve yarn container command
[ https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14085040#comment-14085040 ] Zhijie Shen commented on YARN-2301: --- [~Naganarasimha], do you mean you want to move
bq. appId with -all: containers of all app attempts in RM/Timeline server
or
bq. May have an option to run as yarn container -list <appId> OR yarn application -list-containers <appId> also.
later?
Improve yarn container command -- Key: YARN-2301 URL: https://issues.apache.org/jira/browse/YARN-2301 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Naganarasimha G R Labels: usability
While running the yarn container -list <Application Attempt ID> command, some observations:
1) the scheme (e.g. http/https) before LOG-URL is missing
2) the start-time is printed as milliseconds (e.g. 1405540544844). Better to print it in a time format.
3) finish-time is 0 if the container is not yet finished. Maybe show N/A instead.
4) May have an option to run as yarn container -list <appId> OR yarn application -list-containers <appId> also. As the attempt Id is not shown on the console, this makes it easier for the user to just copy the appId and run it; it may also be useful for container-preserving AM restart.
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2375) Allow enabling/disabling timeline server per framework
[ https://issues.apache.org/jira/browse/YARN-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2375: -- Issue Type: Sub-task (was: Improvement) Parent: YARN-1530 Allow enabling/disabling timeline server per framework -- Key: YARN-2375 URL: https://issues.apache.org/jira/browse/YARN-2375 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2138) Cleanup notifyDone* methods in RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena reassigned YARN-2138: -- Assignee: Varun Saxena Cleanup notifyDone* methods in RMStateStore --- Key: YARN-2138 URL: https://issues.apache.org/jira/browse/YARN-2138 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Varun Saxena Fix For: 2.5.0 The storedException passed into notifyDoneStoringApplication is always null. Similarly for other notifyDone* methods. We can clean up these methods as this control flow path is not used anymore. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-72) NM should handle cleaning up containers when it shuts down
[ https://issues.apache.org/jira/browse/YARN-72?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla reassigned YARN-72: Assignee: Sandy Ryza (was: Vinod Kumar Vavilapalli)
NM should handle cleaning up containers when it shuts down -- Key: YARN-72 URL: https://issues.apache.org/jira/browse/YARN-72 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Hitesh Shah Assignee: Sandy Ryza Fix For: 2.0.3-alpha, 0.23.6 Attachments: YARN-72-1.patch, YARN-72-2.patch, YARN-72-2.patch, YARN-72-2.patch, YARN-72.patch
Ideally, when it gets a shutdown signal, the NM should wait a limited amount of time for existing containers to complete, and kill the containers after this time interval (if we pick an aggressive approach), as sketched below. For NMs which come up after an unclean shutdown, the NM should look through its directories for existing container.pids and try to kill any existing containers matching the pids found.
-- This message was sent by Atlassian JIRA (v6.2#6252)
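A minimal sketch of the bounded-wait behavior described in the issue above, with entirely hypothetical names (shutdownGracePeriodMs, containers, pidsOfRemainingContainers, killProcess), just to illustrate the shape of the approach rather than the actual patch:
{code}
// Hypothetical sketch: wait a limited time for containers to finish,
// then take the aggressive path and kill whatever is still running.
long deadline = System.currentTimeMillis() + shutdownGracePeriodMs;
while (!containers.isEmpty() && System.currentTimeMillis() < deadline) {
  Thread.sleep(100); // poll until containers drain or the window expires
}
for (String pid : pidsOfRemainingContainers()) {
  killProcess(pid); // e.g., deliver SIGKILL to each leftover container
}
{code}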
[jira] [Commented] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically
[ https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14085556#comment-14085556 ] Xuan Gong commented on YARN-2212: - The new patch addresses all the latest comments.
bq. Please try on secure cluster too.
Tested in secure HA/non-HA clusters.
ApplicationMaster needs to find a way to update the AMRMToken periodically -- Key: YARN-2212 URL: https://issues.apache.org/jira/browse/YARN-2212 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2212.1.patch, YARN-2212.2.patch, YARN-2212.3.1.patch, YARN-2212.3.patch, YARN-2212.4.patch, YARN-2212.5.patch, YARN-2212.5.patch, YARN-2212.5.rebase.patch, YARN-2212.6.patch, YARN-2212.6.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically
[ https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2212: Attachment: YARN-2212.6.patch ApplicationMaster needs to find a way to update the AMRMToken periodically -- Key: YARN-2212 URL: https://issues.apache.org/jira/browse/YARN-2212 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2212.1.patch, YARN-2212.2.patch, YARN-2212.3.1.patch, YARN-2212.3.patch, YARN-2212.4.patch, YARN-2212.5.patch, YARN-2212.5.patch, YARN-2212.5.rebase.patch, YARN-2212.6.patch, YARN-2212.6.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2301) Improve yarn container command
[ https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14085628#comment-14085628 ] Naganarasimha G R commented on YARN-2301: - Hi [~zjshen], I believe the existing behavior of
bq. yarn container -list <appAttemptID>
is itself not proper (wrt YARN-1794), and once all the data is taken from the timeline server it would make proper sense to the end user. The end user should not have to care whether it comes from AHS/Timeline/RM; the result should be the list of all containers for the given attempt or app ID. I have the code for the command
bq. yarn container -list <appAttemptID/appID>
but I feel it would not be complete until the timeline integration is done, and users of this command would again get confused if given only half the feature. So my pick would be your second option as a new jira, i.e.
bq. yarn container -list <appAttemptID/appID>
but if you expect the timeline server integration to be delayed, then we can discuss what part of the command could be taken up and what its behavior should be. Also, I would like to fix YARN-1794: ??show all containers when application is finished from AHS instead of no containers from RM?? (here too, I would like to fix the bug after the timeline integration, so that the list of all containers is returned for the given attempt or app ID irrespective of the app state). Please provide your opinion.
Improve yarn container command -- Key: YARN-2301 URL: https://issues.apache.org/jira/browse/YARN-2301 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Naganarasimha G R Labels: usability
While running the yarn container -list <Application Attempt ID> command, some observations:
1) the scheme (e.g. http/https) before LOG-URL is missing
2) the start-time is printed as milliseconds (e.g. 1405540544844). Better to print it in a time format.
3) finish-time is 0 if the container is not yet finished. Maybe show N/A instead.
4) May have an option to run as yarn container -list <appId> OR yarn application -list-containers <appId> also. As the attempt Id is not shown on the console, this makes it easier for the user to just copy the appId and run it; it may also be useful for container-preserving AM restart.
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2352) FairScheduler: Collect metrics on duration of critical methods that affect performance
[ https://issues.apache.org/jira/browse/YARN-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2352: --- Attachment: fs-perf-metrics.png Just got a working version of this, and here is a screenshot from a FileSink that I hooked up. Will post the patch momentarily. FairScheduler: Collect metrics on duration of critical methods that affect performance -- Key: YARN-2352 URL: https://issues.apache.org/jira/browse/YARN-2352 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: fs-perf-metrics.png We need more metrics for better visibility into FairScheduler performance. At the least, we need to do this for (1) handle node events, (2) update, (3) compute fairshares, (4) preemption. -- This message was sent by Atlassian JIRA (v6.2#6252)
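As a sketch of what such duration metrics might look like with Hadoop's metrics2 library: the class and metric names below are hypothetical, not necessarily what the attached patch uses.
{code}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.MutableRate;

// A MutableRate tracks the number of operations and their average duration,
// which a sink such as the FileSink mentioned above can then report.
@Metrics(context = "yarn")
class FSDurationsSketch {
  @Metric("Duration of handling node update events")
  MutableRate nodeUpdateDuration;

  void nodeUpdateFinished(long startTimeMs) {
    nodeUpdateDuration.add(System.currentTimeMillis() - startTimeMs);
  }
}
{code}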
[jira] [Updated] (YARN-2352) FairScheduler: Collect metrics on duration of critical methods that affect performance
[ https://issues.apache.org/jira/browse/YARN-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2352: --- Attachment: yarn-2352-1.patch FairScheduler: Collect metrics on duration of critical methods that affect performance -- Key: YARN-2352 URL: https://issues.apache.org/jira/browse/YARN-2352 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: fs-perf-metrics.png, yarn-2352-1.patch We need more metrics for better visibility into FairScheduler performance. At the least, we need to do this for (1) handle node events, (2) update, (3) compute fairshares, (4) preemption. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2229) ContainerId can overflow with RM restart
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2229: - Attachment: YARN-2229.11.patch ContainerId can overflow with RM restart Key: YARN-2229 URL: https://issues.apache.org/jira/browse/YARN-2229 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2229.1.patch, YARN-2229.10.patch, YARN-2229.10.patch, YARN-2229.11.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, YARN-2229.4.patch, YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, YARN-2229.8.patch, YARN-2229.9.patch In YARN-2052, we changed the containerId format: the upper 10 bits are for the epoch, and the lower 22 bits are for the sequence number of the Ids. This preserves the semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and {{ConverterUtils#toContainerId}}. One concern is that the epoch can overflow after the RM restarts 1024 times. To avoid the problem, it's better to make containerId a long. On this JIRA we need to define the new container Id format while preserving backward compatibility. -- This message was sent by Atlassian JIRA (v6.2#6252)
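To make the overflow concrete, here is a toy illustration (not YARN's actual ContainerId implementation) of the 10/22-bit layout described above:
{code}
// Packs an epoch and a sequence number into a 32-bit id: the upper 10 bits
// hold the RM restart epoch, the lower 22 bits hold the container sequence.
public class ContainerIdBits {
  static final int SEQ_BITS = 22;
  static final int SEQ_MASK = (1 << SEQ_BITS) - 1; // 0x3FFFFF

  static int pack(int epoch, int sequence) {
    return (epoch << SEQ_BITS) | (sequence & SEQ_MASK);
  }

  public static void main(String[] args) {
    System.out.println(pack(1, 5));                  // 4194309: epoch 1, sequence 5
    // 10 bits hold only 2^10 = 1024 epochs, so after 1024 RM restarts the
    // epoch field wraps around and ids collide with epoch 0:
    System.out.println(pack(1024, 5) == pack(0, 5)); // true
  }
}
{code}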
[jira] [Commented] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically
[ https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14085722#comment-14085722 ] Hadoop QA commented on YARN-2212: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12659777/YARN-2212.6.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4519//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4519//console This message is automatically generated. ApplicationMaster needs to find a way to update the AMRMToken periodically -- Key: YARN-2212 URL: https://issues.apache.org/jira/browse/YARN-2212 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2212.1.patch, YARN-2212.2.patch, YARN-2212.3.1.patch, YARN-2212.3.patch, YARN-2212.4.patch, YARN-2212.5.patch, YARN-2212.5.patch, YARN-2212.5.rebase.patch, YARN-2212.6.patch, YARN-2212.6.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
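To sketch the AM-side handling this JIRA is driving at, assuming the allocate response carries the renewed token: the class name, the null-check protocol, and the rmAddress parameter are illustrative assumptions, not the exact patch.
{code}
import java.io.IOException;
import java.net.InetSocketAddress;

import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.util.ConverterUtils;

// If an allocate response carries a rolled-over AMRMToken, install it into
// the current UGI so subsequent RPCs to the RM authenticate with the new key.
public class AmrmTokenRefresher {
  static void maybeUpdateToken(AllocateResponse response, InetSocketAddress rmAddress)
      throws IOException {
    if (response.getAMRMToken() != null) {
      Token<? extends TokenIdentifier> amrmToken =
          ConverterUtils.convertFromYarn(response.getAMRMToken(), rmAddress);
      UserGroupInformation.getCurrentUser().addToken(amrmToken);
    }
  }
}
{code}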
[jira] [Commented] (YARN-2352) FairScheduler: Collect metrics on duration of critical methods that affect performance
[ https://issues.apache.org/jira/browse/YARN-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14085725#comment-14085725 ] Hadoop QA commented on YARN-2352: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12659794/yarn-2352-1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.ha.TestZKFailoverControllerStress org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4520//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4520//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4520//console This message is automatically generated. FairScheduler: Collect metrics on duration of critical methods that affect performance -- Key: YARN-2352 URL: https://issues.apache.org/jira/browse/YARN-2352 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: fs-perf-metrics.png, yarn-2352-1.patch We need more metrics for better visibility into FairScheduler performance. At a minimum, we need to do this for (1) handling node events, (2) update, (3) computing fair shares, and (4) preemption. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2382) Resource Manager throws InvalidStateTransitonException
Nishan Shetty created YARN-2382: --- Summary: Resource Manager throws InvalidStateTransitonException Key: YARN-2382 URL: https://issues.apache.org/jira/browse/YARN-2382 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0 Reporter: Nishan Shetty {code} 2014-08-05 03:44:47,882 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.18.40.26/10.18.40.26:11578, initiating session 2014-08-05 03:44:47,888 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server 10.18.40.26/10.18.40.26:11578, sessionid = 0x347a051fda60035, negotiated timeout = 1 2014-08-05 03:44:47,889 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: CONTAINER_ALLOCATED at LAUNCHED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:664) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:104) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:764) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:745) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:662) 2014-08-05 03:44:47,890 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: STATUS_UPDATE at LAUNCHED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:664) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:104) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:764) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:745) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:662) 2014-08-05 03:44:47,890 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: STATUS_UPDATE at LAUNCHED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:664) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:104) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:764) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:745) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:662) 2014-08-05 03:44:47,890 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x147a051fd93002e {code} -- This message was sent by Atlassian JIRA
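For readers unfamiliar with the mechanism in the stack traces above, here is a toy reproduction of the failure mode: Hadoop's StateMachineFactory throws InvalidStateTransitonException (the class name really is spelled that way) when an event arrives that has no registered transition from the current state. The states and events below are hypothetical stand-ins for the RMAppAttempt ones, not the real topology.
{code}
import org.apache.hadoop.yarn.state.StateMachine;
import org.apache.hadoop.yarn.state.StateMachineFactory;

public class InvalidTransitionDemo {
  enum State { NEW, LAUNCHED }
  enum Event { LAUNCH, STATUS_UPDATE }

  // Only NEW --LAUNCH--> LAUNCHED is registered; nothing handles
  // STATUS_UPDATE at LAUNCHED, mirroring the log above.
  static final StateMachineFactory<Object, State, Event, Object> FACTORY =
      new StateMachineFactory<Object, State, Event, Object>(State.NEW)
          .addTransition(State.NEW, State.LAUNCHED, Event.LAUNCH)
          .installTopology();

  public static void main(String[] args) {
    StateMachine<State, Event, Object> sm = FACTORY.make(new Object());
    sm.doTransition(Event.LAUNCH, null);        // NEW -> LAUNCHED
    sm.doTransition(Event.STATUS_UPDATE, null); // throws InvalidStateTransitonException:
                                                // "Invalid event: STATUS_UPDATE at LAUNCHED"
  }
}
{code}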
[jira] [Commented] (YARN-2136) RMStateStore can explicitly handle store/update events when fenced
[ https://issues.apache.org/jira/browse/YARN-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14085826#comment-14085826 ] Varun Saxena commented on YARN-2136: Yes, no more store operations can be performed once the RM switches to standby. But there might still be some threads processing store/update events while that switch is going on. I thought this JIRA was meant to handle that case. RMStateStore can explicitly handle store/update events when fenced -- Key: YARN-2136 URL: https://issues.apache.org/jira/browse/YARN-2136 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He RMStateStore can choose to handle/ignore store/update events upfront, instead of invoking more ZK operations, when the state store is in the fenced state. -- This message was sent by Atlassian JIRA (v6.2#6252)
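In rough terms, the guard being discussed might look like the following minimal sketch; the enum, field, and method names are illustrative, not the actual RMStateStore code:
{code}
// Drops store/update events up front when the store has been fenced,
// instead of issuing ZK operations that are doomed to fail because
// another RM now holds the lock.
public class FencedStoreGuard {
  enum StoreState { ACTIVE, FENCED }

  private volatile StoreState state = StoreState.ACTIVE;

  void handleStoreEvent(Runnable zkOperation) {
    if (state == StoreState.FENCED) {
      System.err.println("State store is fenced; ignoring store/update event");
      return;
    }
    zkOperation.run();
  }

  void fence() {
    state = StoreState.FENCED;
  }
}
{code}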