[jira] [Updated] (YARN-1172) Convert *SecretManagers in the RM to services
[ https://issues.apache.org/jira/browse/YARN-1172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1172: - Attachment: YARN-1172.7.patch Convert *SecretManagers in the RM to services - Key: YARN-1172 URL: https://issues.apache.org/jira/browse/YARN-1172 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Karthik Kambatla Assignee: Tsuyoshi OZAWA Attachments: YARN-1172.1.patch, YARN-1172.2.patch, YARN-1172.3.patch, YARN-1172.4.patch, YARN-1172.5.patch, YARN-1172.6.patch, YARN-1172.7.patch -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1172) Convert *SecretManagers in the RM to services
[ https://issues.apache.org/jira/browse/YARN-1172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13798875#comment-13798875 ] Hadoop QA commented on YARN-1172: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12609100/YARN-1172.7.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 21 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2218//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2218//console This message is automatically generated. Convert *SecretManagers in the RM to services - Key: YARN-1172 URL: https://issues.apache.org/jira/browse/YARN-1172 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Karthik Kambatla Assignee: Tsuyoshi OZAWA Attachments: YARN-1172.1.patch, YARN-1172.2.patch, YARN-1172.3.patch, YARN-1172.4.patch, YARN-1172.5.patch, YARN-1172.6.patch, YARN-1172.7.patch -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1319) Documentation has wrong entry
[ https://issues.apache.org/jira/browse/YARN-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13798961#comment-13798961 ] Siddharth Tiwari commented on YARN-1319: The installation documentation for Hadoop YARN at this link http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html has an error in the yarn-site snippet for the property yarn.nodemanager.aux-services. It should be mapreduce_shuffle rather than mapreduce.shuffle. Documentation has wrong entry -- Key: YARN-1319 URL: https://issues.apache.org/jira/browse/YARN-1319 Project: Hadoop YARN Issue Type: Bug Components: documentation Affects Versions: 2.2.0 Environment: Linux Reporter: Siddharth Tiwari Priority: Minor Fix For: 2.2.0 -- This message was sent by Atlassian JIRA (v6.1#6144)
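For reference, the corrected property from the comment above, shown through the Hadoop Configuration API (an illustrative sketch only; in yarn-site.xml the property simply takes the value mapreduce_shuffle):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AuxServicesConfigExample {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    // Correct value for 2.2.0: underscores, not dots, in the aux-service name.
    conf.set("yarn.nodemanager.aux-services", "mapreduce_shuffle");
    // The old value "mapreduce.shuffle" from the documentation is no longer accepted.
    System.out.println(conf.get("yarn.nodemanager.aux-services"));
  }
}
{code}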
[jira] [Created] (YARN-1319) Documentation has wrong entry
Siddharth Tiwari created YARN-1319: -- Summary: Documentation has wrong entry Key: YARN-1319 URL: https://issues.apache.org/jira/browse/YARN-1319 Project: Hadoop YARN Issue Type: Bug Components: documentation Affects Versions: 2.2.0 Environment: Linux Reporter: Siddharth Tiwari Priority: Minor Fix For: 2.2.0 -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1319) Documentation has wrong entry
[ https://issues.apache.org/jira/browse/YARN-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Tiwari updated YARN-1319: --- Description: The installation documentation for Hadoop yarn at this link http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html has error in the yarn-site for property yarn.nodemanager.aux-services. it should be mapreduce_shuffle rather than mapreduce.shuffle. Documentation has wrong entry -- Key: YARN-1319 URL: https://issues.apache.org/jira/browse/YARN-1319 Project: Hadoop YARN Issue Type: Bug Components: documentation Affects Versions: 2.2.0 Environment: Linux Reporter: Siddharth Tiwari Priority: Minor Fix For: 2.2.0 The installation documentation for Hadoop yarn at this link http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html has error in the yarn-site for property yarn.nodemanager.aux-services. it should be mapreduce_shuffle rather than mapreduce.shuffle. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Resolved] (YARN-1319) Documentation has wrong entry
[ https://issues.apache.org/jira/browse/YARN-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arpit Gupta resolved YARN-1319. --- Resolution: Duplicate Documentation has wrong entry -- Key: YARN-1319 URL: https://issues.apache.org/jira/browse/YARN-1319 Project: Hadoop YARN Issue Type: Bug Components: documentation Affects Versions: 2.2.0 Environment: Linux Reporter: Siddharth Tiwari Priority: Minor Fix For: 2.2.0 The installation documentation for Hadoop yarn at this link http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html has error in the yarn-site for property yarn.nodemanager.aux-services. it should be mapreduce_shuffle rather than mapreduce.shuffle. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799170#comment-13799170 ] Steve Loughran commented on YARN-913: - # your design reminds me a lot of Bonjour, just a thought. # my use case is ensuring that no other instance of my application name exists, e.g. {{steve/hoya/cluster4}}, so avoiding race conditions. I'd have the server attempt to register on startup -and if it could not, fail. Implication: atomic registration by name. # Hadoop now ships with the ZK JAR for the HA NN, and soon the RM will use it too. This will let us assume that ZK is a live service, and make use of it. Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api Affects Versions: 3.0.0 Reporter: Steve Loughran Assignee: Robert Joseph Evans Attachments: RegistrationServiceDetails.txt In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.1#6144)
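A minimal sketch of the atomic-registration-by-name idea in point 2, using plain ZooKeeper create semantics (the path and method names are illustrative assumptions, not a proposed implementation):
{code}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class ServiceNameRegistrationSketch {
  // Try to claim a unique name such as /services/steve/hoya/cluster4.
  // ZooKeeper's create() is atomic: if another instance already holds the name,
  // NodeExistsException is thrown and the server can fail fast on startup.
  public static void registerOrFail(ZooKeeper zk, String path, byte[] serviceRecord)
      throws KeeperException, InterruptedException {
    try {
      zk.create(path, serviceRecord, Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    } catch (KeeperException.NodeExistsException e) {
      throw new IllegalStateException("Another instance is already registered at " + path, e);
    }
  }
}
{code}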
[jira] [Commented] (YARN-614) Retry attempts automatically for hardware failures or YARN issues and set default app retries to 1
[ https://issues.apache.org/jira/browse/YARN-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799180#comment-13799180 ] Steve Loughran commented on YARN-614: - Chris -are you using this? For long lived services we'd need that sliding window of failures Retry attempts automatically for hardware failures or YARN issues and set default app retries to 1 -- Key: YARN-614 URL: https://issues.apache.org/jira/browse/YARN-614 Project: Hadoop YARN Issue Type: Improvement Reporter: Bikas Saha Assignee: Chris Riccomini Fix For: 2.3.0 Attachments: YARN-614-0.patch, YARN-614-1.patch, YARN-614-2.patch, YARN-614-3.patch, YARN-614-4.patch, YARN-614-5.patch, YARN-614-6.patch Attempts can fail due to a large number of user errors and they should not be retried unnecessarily. The only reason YARN should retry an attempt is when the hardware fails or YARN has an error. NM failing, lost NM and NM disk errors are the hardware errors that come to mind. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799189#comment-13799189 ] Steve Loughran commented on YARN-796: - I'd like to be able to allocate different labels to different queues, so that analytics workloads could go to one set of machines, network ingress/egress applications to another pool. You don't want to add label awareness to these applications, whereas queue-level would seem more appropriate, as it puts the cluster admins in charge Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun C Murthy Assignee: Arun C Murthy It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-896) Roll up for long-lived services in YARN
[ https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799199#comment-13799199 ] Steve Loughran commented on YARN-896: - Link to YARN-810, CGroup limits for CPU Roll up for long-lived services in YARN --- Key: YARN-896 URL: https://issues.apache.org/jira/browse/YARN-896 Project: Hadoop YARN Issue Type: New Feature Reporter: Robert Joseph Evans YARN is intended to be general purpose, but it is missing some features to be able to truly support long lived applications and long lived containers. This ticket is intended to # discuss what is needed to support long lived processes # track the resulting JIRA. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1305) RMHAProtocolService#serviceInit should handle HAUtil's IllegalArgumentException
[ https://issues.apache.org/jira/browse/YARN-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1305: - Attachment: YARN-1305.4.patch Thank you, Bikas. Updated the patch based on your review. * Updated getRMHAId()/setConfValue() to show the invalid value. * Updated getRMHAIds() to handle the case where RM_HA_IDS is empty. Added a test to testGetRMServiceId for this. * Updated getRMHAId() to raise an error when HA is enabled but RM_HA_IDS is not set to multiple RM ids. Additionally, I noticed that HAUtil cannot handle configs with spaces/tabs/carriage returns, because HAUtil uses Configuration#get(), not getTrimmed(). This patch fixes it. The log messages with this patch are as follows. The case where RM_HA_ID is empty: {code} org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Invalid configuration! yarn.resourcemanager.ha.id needs to be set in a HA configuration {code} The case where RM_HA_ID is invalid: {code} org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Invalid configuration! Invalid value of yarn.resourcemanager.ha.id. Current value is .rm1 {code} The case where RM_HA_IDS is empty or invalid: {code} org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Invalid configuration! yarn.resourcemanager.ha.rm-ids is invalid. Current value is null {code} The case where RM_HA_IDS doesn't contain RM_HA_ID: {code} org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Invalid configuration! yarn.resourcemanager.ha.rm-ids([rm2, rm3]) need to contain yarn.resourcemanager.ha.id(rm1) in a HA configuration. {code} The case where the HAUtil.RPC_ADDRESS_CONF_KEYS related configuration is not set: {code} org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Invalid configuration! yarn.resourcemanager.address.rm1 needs to be set in a HA configuration. {code} RMHAProtocolService#serviceInit should handle HAUtil's IllegalArgumentException --- Key: YARN-1305 URL: https://issues.apache.org/jira/browse/YARN-1305 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.1 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Labels: ha Attachments: YARN-1305.1.patch, YARN-1305.2.patch, YARN-1305.3.patch, YARN-1305.4.patch When yarn.resourcemanager.ha.enabled is true, RMHAProtocolService#serviceInit calls HAUtil.setAllRpcAddresses. If the configuration values are null, it just throws IllegalArgumentException. It's messy to analyse which keys are null, so we should handle it and log the name of keys which are null. 
A current log dump is as follows: {code} 2013-10-15 06:24:53,431 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: registered UNIX signal handlers for [TERM, HUP, INT] 2013-10-15 06:24:54,203 INFO org.apache.hadoop.service.AbstractService: Service RMHAProtocolService failed in state INITED; cause: java.lang.IllegalArgumentException: Property value must not be null java.lang.IllegalArgumentException: Property value must not be null at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) at org.apache.hadoop.conf.Configuration.set(Configuration.java:816) at org.apache.hadoop.conf.Configuration.set(Configuration.java:798) at org.apache.hadoop.yarn.conf.HAUtil.setConfValue(HAUtil.java:100) at org.apache.hadoop.yarn.conf.HAUtil.setAllRpcAddresses(HAUtil.java:105) at org.apache.hadoop.yarn.server.resourcemanager.RMHAProtocolService.serviceInit(RMHAProtocolService.java:60) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:187) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:940) {code} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-614) Retry attempts automatically for hardware failures or YARN issues and set default app retries to 1
[ https://issues.apache.org/jira/browse/YARN-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799262#comment-13799262 ] Chris Riccomini commented on YARN-614: -- Hey Steve, Sadly, no. I haven't had time to rebase/make the tests work. Sorry :/ Retry attempts automatically for hardware failures or YARN issues and set default app retries to 1 -- Key: YARN-614 URL: https://issues.apache.org/jira/browse/YARN-614 Project: Hadoop YARN Issue Type: Improvement Reporter: Bikas Saha Assignee: Chris Riccomini Fix For: 2.3.0 Attachments: YARN-614-0.patch, YARN-614-1.patch, YARN-614-2.patch, YARN-614-3.patch, YARN-614-4.patch, YARN-614-5.patch, YARN-614-6.patch Attempts can fail due to a large number of user errors and they should not be retried unnecessarily. The only reason YARN should retry an attempt is when the hardware fails or YARN has an error. NM failing, lost NM and NM disk errors are the hardware errors that come to mind. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1305) RMHAProtocolService#serviceInit should handle HAUtil's IllegalArgumentException
[ https://issues.apache.org/jira/browse/YARN-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799281#comment-13799281 ] Hadoop QA commented on YARN-1305: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12609167/YARN-1305.4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2219//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2219//console This message is automatically generated. RMHAProtocolService#serviceInit should handle HAUtil's IllegalArgumentException --- Key: YARN-1305 URL: https://issues.apache.org/jira/browse/YARN-1305 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.1 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Labels: ha Attachments: YARN-1305.1.patch, YARN-1305.2.patch, YARN-1305.3.patch, YARN-1305.4.patch When yarn.resourcemanager.ha.enabled is true, RMHAProtocolService#serviceInit calls HAUtil.setAllRpcAddresses. If the configuration values are null, it just throws IllegalArgumentException. It's messy to analyse which keys are null, so we should handle it and log the name of keys which are null. 
A current log dump is as follows: {code} 2013-10-15 06:24:53,431 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: registered UNIX signal handlers for [TERM, HUP, INT] 2013-10-15 06:24:54,203 INFO org.apache.hadoop.service.AbstractService: Service RMHAProtocolService failed in state INITED; cause: java.lang.IllegalArgumentException: Property value must not be null java.lang.IllegalArgumentException: Property value must not be null at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) at org.apache.hadoop.conf.Configuration.set(Configuration.java:816) at org.apache.hadoop.conf.Configuration.set(Configuration.java:798) at org.apache.hadoop.yarn.conf.HAUtil.setConfValue(HAUtil.java:100) at org.apache.hadoop.yarn.conf.HAUtil.setAllRpcAddresses(HAUtil.java:105) at org.apache.hadoop.yarn.server.resourcemanager.RMHAProtocolService.serviceInit(RMHAProtocolService.java:60) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:187) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:940) {code} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (YARN-1320) Custom log4j properties does not work properly.
Tassapol Athiapinya created YARN-1320: - Summary: Custom log4j properties does not work properly. Key: YARN-1320 URL: https://issues.apache.org/jira/browse/YARN-1320 Project: Hadoop YARN Issue Type: Bug Components: applications/distributed-shell Reporter: Tassapol Athiapinya Fix For: 2.2.1 Distributed shell cannot pick up custom log4j properties (specified with -log_properties). It always uses default log4j properties. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1052) Enforce submit application queue ACLs outside the scheduler
[ https://issues.apache.org/jira/browse/YARN-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1052: -- Issue Type: Sub-task (was: Improvement) Parent: YARN-1317 Enforce submit application queue ACLs outside the scheduler --- Key: YARN-1052 URL: https://issues.apache.org/jira/browse/YARN-1052 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, scheduler Affects Versions: 2.1.0-beta Reporter: Sandy Ryza Assignee: Xuan Gong Per discussion in YARN-899, schedulers should not need to enforce queue ACLs on their own. Currently schedulers do this for application submission, and this should be done in the RM code instead. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1317) Make Queue, QueueACLs and QueueMetrics first class citizens in YARN
[ https://issues.apache.org/jira/browse/YARN-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799296#comment-13799296 ] Vinod Kumar Vavilapalli commented on YARN-1317: --- Thanks Sandy, didn't see that before. Made it a sub-task. Make Queue, QueueACLs and QueueMetrics first class citizens in YARN --- Key: YARN-1317 URL: https://issues.apache.org/jira/browse/YARN-1317 Project: Hadoop YARN Issue Type: Improvement Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Today, we are duplicating the exact same code in all the schedulers. Queue is a top class concept - clientService, web-services etc already recognize queue as a top level concept. We need to move Queue, QueueMetrics and QueueACLs to be top level. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799331#comment-13799331 ] Arun C Murthy commented on YARN-415: I'm sorry to come in late, I just did a cursory look. One question: Do we really need to track ResourceUsage for each Container? Can't we just add it up when a container finishes? Maybe I'm missing something? But, I'd like to not have a lot of per-container state if possible. Thanks. Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Assigned] (YARN-1320) Custom log4j properties does not work properly.
[ https://issues.apache.org/jira/browse/YARN-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong reassigned YARN-1320: --- Assignee: Xuan Gong Custom log4j properties does not work properly. --- Key: YARN-1320 URL: https://issues.apache.org/jira/browse/YARN-1320 Project: Hadoop YARN Issue Type: Bug Components: applications/distributed-shell Reporter: Tassapol Athiapinya Assignee: Xuan Gong Fix For: 2.2.1 Distributed shell cannot pick up custom log4j properties (specified with -log_properties). It always uses default log4j properties. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1305) RMHAProtocolService#serviceInit should handle HAUtil's IllegalArgumentException
[ https://issues.apache.org/jira/browse/YARN-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799336#comment-13799336 ] Bikas Saha commented on YARN-1305: -- Nice update! We are almost there. After seeing the patch I am feeling that we should consolidate all these verifications into a single method that we call in HAService.serviceInit(). That way the get* methods will be simple and will not be performing checks all the time (its unnecessary after the first time). After the verification method has passed then we can confidently proceed in the remaining code. We can add more verifications of conf in the same method and ensure that we give a clean and user friendly YARN HA setup experience to users. What do you think? I did not see a test that verifies that more than 1 RM id must be specified in RM-HA-IDs? RMHAProtocolService#serviceInit should handle HAUtil's IllegalArgumentException --- Key: YARN-1305 URL: https://issues.apache.org/jira/browse/YARN-1305 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.1 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Labels: ha Attachments: YARN-1305.1.patch, YARN-1305.2.patch, YARN-1305.3.patch, YARN-1305.4.patch When yarn.resourcemanager.ha.enabled is true, RMHAProtocolService#serviceInit calls HAUtil.setAllRpcAddresses. If the configuration values are null, it just throws IllegalArgumentException. It's messy to analyse which keys are null, so we should handle it and log the name of keys which are null. A current log dump is as follows: {code} 2013-10-15 06:24:53,431 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: registered UNIX signal handlers for [TERM, HUP, INT] 2013-10-15 06:24:54,203 INFO org.apache.hadoop.service.AbstractService: Service RMHAProtocolService failed in state INITED; cause: java.lang.IllegalArgumentException: Property value must not be null java.lang.IllegalArgumentException: Property value must not be null at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) at org.apache.hadoop.conf.Configuration.set(Configuration.java:816) at org.apache.hadoop.conf.Configuration.set(Configuration.java:798) at org.apache.hadoop.yarn.conf.HAUtil.setConfValue(HAUtil.java:100) at org.apache.hadoop.yarn.conf.HAUtil.setAllRpcAddresses(HAUtil.java:105) at org.apache.hadoop.yarn.server.resourcemanager.RMHAProtocolService.serviceInit(RMHAProtocolService.java:60) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:187) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:940) {code} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1320) Custom log4j properties does not work properly.
[ https://issues.apache.org/jira/browse/YARN-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799354#comment-13799354 ] Hadoop QA commented on YARN-1320: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12609175/YARN-1320.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 3 new Findbugs (version 1.3.9) warnings. {color:red}-1 release audit{color}. The applied patch generated 1 release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2220//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-YARN-Build/2220//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/2220//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2220//console This message is automatically generated. Custom log4j properties does not work properly. --- Key: YARN-1320 URL: https://issues.apache.org/jira/browse/YARN-1320 Project: Hadoop YARN Issue Type: Bug Components: applications/distributed-shell Reporter: Tassapol Athiapinya Assignee: Xuan Gong Fix For: 2.2.1 Attachments: YARN-1320.1.patch Distributed shell cannot pick up custom log4j properties (specified with -log_properties). It always uses default log4j properties. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799363#comment-13799363 ] Andrey Klochkov commented on YARN-415: -- Arun, the idea is to have the stats being updated in real time while the app is running. Is there a way to get a list of running containers assigned to the app, with their start times, without tracking it explicitly? Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1320) Custom log4j properties does not work properly.
[ https://issues.apache.org/jira/browse/YARN-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799372#comment-13799372 ] Xuan Gong commented on YARN-1320: - Fix the -1 findbugs warning and the -1 release audit warning. Custom log4j properties does not work properly. --- Key: YARN-1320 URL: https://issues.apache.org/jira/browse/YARN-1320 Project: Hadoop YARN Issue Type: Bug Components: applications/distributed-shell Reporter: Tassapol Athiapinya Assignee: Xuan Gong Fix For: 2.2.1 Attachments: YARN-1320.1.patch, YARN-1320.2.patch Distributed shell cannot pick up custom log4j properties (specified with -log_properties). It always uses default log4j properties. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1320) Custom log4j properties does not work properly.
[ https://issues.apache.org/jira/browse/YARN-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-1320: Attachment: YARN-1320.2.patch Custom log4j properties does not work properly. --- Key: YARN-1320 URL: https://issues.apache.org/jira/browse/YARN-1320 Project: Hadoop YARN Issue Type: Bug Components: applications/distributed-shell Reporter: Tassapol Athiapinya Assignee: Xuan Gong Fix For: 2.2.1 Attachments: YARN-1320.1.patch, YARN-1320.2.patch Distributed shell cannot pick up custom log4j properties (specified with -log_properties). It always uses default log4j properties. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1319) Documentation has wrong entry
[ https://issues.apache.org/jira/browse/YARN-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799383#comment-13799383 ] Tsuyoshi OZAWA commented on YARN-1319: -- Thank you for reporting, Siddharth. The point you mentioned is now being fixed on HADOOP-10050. Please watch it and have discussion there if you have any comments. Again, thanks! Documentation has wrong entry -- Key: YARN-1319 URL: https://issues.apache.org/jira/browse/YARN-1319 Project: Hadoop YARN Issue Type: Bug Components: documentation Affects Versions: 2.2.0 Environment: Linux Reporter: Siddharth Tiwari Priority: Minor Fix For: 2.2.0 The installation documentation for Hadoop yarn at this link http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html has error in the yarn-site for property yarn.nodemanager.aux-services. it should be mapreduce_shuffle rather than mapreduce.shuffle. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1320) Custom log4j properties does not work properly.
[ https://issues.apache.org/jira/browse/YARN-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799391#comment-13799391 ] Hadoop QA commented on YARN-1320: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12609181/YARN-1320.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2221//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/2221//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2221//console This message is automatically generated. Custom log4j properties does not work properly. --- Key: YARN-1320 URL: https://issues.apache.org/jira/browse/YARN-1320 Project: Hadoop YARN Issue Type: Bug Components: applications/distributed-shell Reporter: Tassapol Athiapinya Assignee: Xuan Gong Fix For: 2.2.1 Attachments: YARN-1320.1.patch, YARN-1320.2.patch Distributed shell cannot pick up custom log4j properties (specified with -log_properties). It always uses default log4j properties. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Reopened] (YARN-925) HistoryStorage Reader Interface for Application History Server
[ https://issues.apache.org/jira/browse/YARN-925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen reopened YARN-925: -- The getAllApplications method will not work if we have tons of applications stored. Users should be allowed to add some filters. [~mayank_bansal], would you mind improving the reader interface? HistoryStorage Reader Interface for Application History Server -- Key: YARN-925 URL: https://issues.apache.org/jira/browse/YARN-925 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Fix For: YARN-321 Attachments: YARN-925-1.patch, YARN-925-2.patch, YARN-925-3.patch, YARN-925-4.patch -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1305) RMHAProtocolService#serviceInit should handle HAUtil's IllegalArgumentException
[ https://issues.apache.org/jira/browse/YARN-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799417#comment-13799417 ] Tsuyoshi OZAWA commented on YARN-1305: -- Validation in RMHAProtocolService#serviceInit is a better idea. I also believe your proposal makes the get* methods much simpler. I'll add HAUtil#validateConfiguration() and remove the runtime verifications. bq. I did not see a test that verifies that more than 1 RM id must be specified in RM-HA-IDs? I missed that point. I'll reflect this comment in the next update. Thank you for your good suggestions! RMHAProtocolService#serviceInit should handle HAUtil's IllegalArgumentException --- Key: YARN-1305 URL: https://issues.apache.org/jira/browse/YARN-1305 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.1 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Labels: ha Attachments: YARN-1305.1.patch, YARN-1305.2.patch, YARN-1305.3.patch, YARN-1305.4.patch When yarn.resourcemanager.ha.enabled is true, RMHAProtocolService#serviceInit calls HAUtil.setAllRpcAddresses. If the configuration values are null, it just throws IllegalArgumentException. It's messy to analyse which keys are null, so we should handle it and log the name of keys which are null. A current log dump is as follows: {code} 2013-10-15 06:24:53,431 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: registered UNIX signal handlers for [TERM, HUP, INT] 2013-10-15 06:24:54,203 INFO org.apache.hadoop.service.AbstractService: Service RMHAProtocolService failed in state INITED; cause: java.lang.IllegalArgumentException: Property value must not be null java.lang.IllegalArgumentException: Property value must not be null at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) at org.apache.hadoop.conf.Configuration.set(Configuration.java:816) at org.apache.hadoop.conf.Configuration.set(Configuration.java:798) at org.apache.hadoop.yarn.conf.HAUtil.setConfValue(HAUtil.java:100) at org.apache.hadoop.yarn.conf.HAUtil.setAllRpcAddresses(HAUtil.java:105) at org.apache.hadoop.yarn.server.resourcemanager.RMHAProtocolService.serviceInit(RMHAProtocolService.java:60) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:187) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:940) {code} -- This message was sent by Atlassian JIRA (v6.1#6144)
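For illustration, a rough sketch of such a consolidated check (key names and messages are copied from the comments above; this is not the actual patch):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.exceptions.YarnRuntimeException;

public final class HAConfigValidationSketch {
  // Called once from RMHAProtocolService#serviceInit so that the get* helpers
  // no longer have to repeat these checks on every call.
  public static void validateConfiguration(Configuration conf) {
    String rmIds = conf.getTrimmed("yarn.resourcemanager.ha.rm-ids");
    if (rmIds == null || rmIds.isEmpty()) {
      throw new YarnRuntimeException("Invalid configuration! "
          + "yarn.resourcemanager.ha.rm-ids is invalid. Current value is " + rmIds);
    }
    String rmId = conf.getTrimmed("yarn.resourcemanager.ha.id");
    if (rmId == null || rmId.isEmpty()) {
      throw new YarnRuntimeException("Invalid configuration! "
          + "yarn.resourcemanager.ha.id needs to be set in a HA configuration");
    }
    // Further checks would follow here: rm-ids contains the local id, more than one
    // id is configured, and every per-id RPC address key is set.
  }
}
{code}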
[jira] [Created] (YARN-1321) NMTokenCache should not be a singleton
Alejandro Abdelnur created YARN-1321: Summary: NMTokenCache should not be a singleton Key: YARN-1321 URL: https://issues.apache.org/jira/browse/YARN-1321 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Fix For: 2.2.1 NMTokenCache is a singleton. Because of this, if running multiple AMs in a single JVM NMTokens for the same node from different AMs step on each other and starting containers fail due to mismatch tokens. The error observed in the client side is something like: {code} ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:llama (auth:PROXY) via llama (auth:SIMPLE) cause:org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container. NMToken for application attempt : appattempt_1382038445650_0002_01 was used for starting container with container token issued for application attempt : appattempt_1382038445650_0001_01 {code} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1321) NMTokenCache should not be a singleton
[ https://issues.apache.org/jira/browse/YARN-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799423#comment-13799423 ] Alejandro Abdelnur commented on YARN-1321: -- NMTokens are set in the YARN AMRMClientImpl and the MR RMContainerAllocator, and retrieved in the YARNContainerManagementProtocolProxy via the NMTokenCache. We need to make the NMTokenCache instantiable and make sure each AM uses its own instance of it. In the case of the YARN API, the AMRMClientImpl and the NMClientImpl should share the same instance. NMTokenCache should not be a singleton -- Key: YARN-1321 URL: https://issues.apache.org/jira/browse/YARN-1321 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Fix For: 2.2.1 NMTokenCache is a singleton. Because of this, if running multiple AMs in a single JVM NMTokens for the same node from different AMs step on each other and starting containers fail due to mismatch tokens. The error observed in the client side is something like: {code} ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:llama (auth:PROXY) via llama (auth:SIMPLE) cause:org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container. NMToken for application attempt : appattempt_1382038445650_0002_01 was used for starting container with container token issued for application attempt : appattempt_1382038445650_0001_01 {code} -- This message was sent by Atlassian JIRA (v6.1#6144)
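A hypothetical usage sketch of the proposed change (the public constructor and the setNMTokenCache setters below do not exist yet; they are assumptions about what the API could look like):
{code}
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.client.api.NMTokenCache;

public class PerAmTokenCacheSketch {
  public static void wireUpOneAm() {
    // One cache per AM instead of the JVM-wide singleton (assumed public constructor).
    NMTokenCache cacheForThisAM = new NMTokenCache();

    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    NMClient nmClient = NMClient.createNMClient();

    // Assumed setters: both clients of the same AM share the same cache, so NMTokens
    // from different AMs in the same JVM no longer overwrite each other.
    rmClient.setNMTokenCache(cacheForThisAM);
    nmClient.setNMTokenCache(cacheForThisAM);
  }
}
{code}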
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799448#comment-13799448 ] Jason Lowe commented on YARN-415: - It's not just a real-time issue, it's also a correctness issue. When a container finishes we need to know the time it was allocated. So regardless of whether we want to compute the usage in real-time, the start time of a container and its resource sizes need to be tracked somewhere in the RM. ResourceUsage is just a Resource plus a start time, and the Resource should be referencing the same object already referenced by the Container inside RMContainerImpl. To implement this feature we need to track the containers that are allocated/running (already being done by RMContainerImpl) and what time they started (which we are not currently doing and why ResourceUsage was created). There is the issue of the HashMap to map a container ID to its resource and start time. We could remove the need for this if we stored the container start time in RMContainerImpl and had a safe way to lookup containers for an application attempt. We can get the containers for an application via scheduler.getSchedulerAppInfo, and RMAppAttemptImpl already does this when generating an app report. However since RMAppAttemptImpl and the scheduler are running in separate threads, I could see the scheduler already removing the container before RMAppAttemptImpl received the container completion event and tried to lookup the container for usage calculation. Given the race, along with the fact that getSchedulerAppInfo is not necessarily cheap, it seems reasonable to have RMAppAttemptImpl track what it needs for running containers directly. Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.1#6144)
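As a rough sketch of the accumulation being discussed (names are illustrative, not the patch): remember each running container's reserved memory and allocation time, and fold the product into a per-attempt MB-seconds total when the container completes.
{code}
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.yarn.api.records.ContainerId;

public class AppMemorySecondsSketch {
  // containerId -> {reserved memory in MB, allocation time in ms}
  private final Map<ContainerId, long[]> running = new HashMap<ContainerId, long[]>();
  private long memorySeconds = 0;

  public synchronized void containerAllocated(ContainerId id, long memoryMB, long startMs) {
    running.put(id, new long[] {memoryMB, startMs});
  }

  public synchronized void containerFinished(ContainerId id, long finishMs) {
    long[] info = running.remove(id);
    if (info != null) {
      // reserved MB * lifetime in seconds, matching the formula in the issue description
      memorySeconds += info[0] * ((finishMs - info[1]) / 1000);
    }
  }

  public synchronized long getMemorySeconds() {
    return memorySeconds;
  }
}
{code}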
[jira] [Updated] (YARN-1322) AHS History Store Cache Implementation
[ https://issues.apache.org/jira/browse/YARN-1322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-1322: Description: AHS History Store Cache Implementation (was: Maybe we should include AHS classes as well (for developer usage) in yarn and yarn.cmd) AHS History Store Cache Implementation -- Key: YARN-1322 URL: https://issues.apache.org/jira/browse/YARN-1322 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal AHS History Store Cache Implementation -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (YARN-1322) AHS History Store Cache Implementation
Mayank Bansal created YARN-1322: --- Summary: AHS History Store Cache Implementation Key: YARN-1322 URL: https://issues.apache.org/jira/browse/YARN-1322 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Maybe we should include AHS classes as well (for developer usage) in yarn and yarn.cmd -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Resolved] (YARN-884) AM expiry interval should be set to smaller of {am, nm}.liveness-monitor.expiry-interval-ms
[ https://issues.apache.org/jira/browse/YARN-884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla resolved YARN-884. --- Resolution: Won't Fix Target Version/s: (was: ) AM expiry interval should be set to smaller of {am, nm}.liveness-monitor.expiry-interval-ms --- Key: YARN-884 URL: https://issues.apache.org/jira/browse/YARN-884 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.0.4-alpha Reporter: Karthik Kambatla Assignee: Karthik Kambatla Labels: configuration Attachments: yarn-884-1.patch As the AM can't outlive the NM on which it is running, it is a good idea to disallow setting the am.liveness-monitor.expiry-interval-ms to a value higher than nm.liveness-monitor.expiry-interval-ms -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1320) Custom log4j properties does not work properly.
[ https://issues.apache.org/jira/browse/YARN-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-1320: Attachment: YARN-1320.3.patch Fix the -1 on findbugs Custom log4j properties does not work properly. --- Key: YARN-1320 URL: https://issues.apache.org/jira/browse/YARN-1320 Project: Hadoop YARN Issue Type: Bug Components: applications/distributed-shell Reporter: Tassapol Athiapinya Assignee: Xuan Gong Fix For: 2.2.1 Attachments: YARN-1320.1.patch, YARN-1320.2.patch, YARN-1320.3.patch Distributed shell cannot pick up custom log4j properties (specified with -log_properties). It always uses default log4j properties. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1320) Custom log4j properties does not work properly.
[ https://issues.apache.org/jira/browse/YARN-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799501#comment-13799501 ] Hadoop QA commented on YARN-1320: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12609201/YARN-1320.3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build///testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build///console This message is automatically generated. Custom log4j properties does not work properly. --- Key: YARN-1320 URL: https://issues.apache.org/jira/browse/YARN-1320 Project: Hadoop YARN Issue Type: Bug Components: applications/distributed-shell Reporter: Tassapol Athiapinya Assignee: Xuan Gong Fix For: 2.2.1 Attachments: YARN-1320.1.patch, YARN-1320.2.patch, YARN-1320.3.patch Distributed shell cannot pick up custom log4j properties (specified with -log_properties). It always uses default log4j properties. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1222) Make improvements in ZKRMStateStore for fencing
[ https://issues.apache.org/jira/browse/YARN-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799504#comment-13799504 ] Sandy Ryza commented on YARN-1222: -- {code} LOG.error("Error in storing master key with KeyID: " + newKey.getKeyId()); + LOG.error("Exception stack trace", e); {code} Why not put the exception in the first LOG.error? Make improvements in ZKRMStateStore for fencing --- Key: YARN-1222 URL: https://issues.apache.org/jira/browse/YARN-1222 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Karthik Kambatla Attachments: yarn-1222-1.patch, yarn-1222-2.patch Using multi-operations for every ZK interaction. In every operation, automatically creating/deleting a lock znode that is the child of the root znode. This is to achieve fencing by modifying the create/delete permissions on the root znode. -- This message was sent by Atlassian JIRA (v6.1#6144)
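That is, something along the lines of the following (a sketch; the variable names follow the snippet quoted above):
{code}
LOG.error("Error in storing master key with KeyID: " + newKey.getKeyId(), e);
{code}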
[jira] [Updated] (YARN-1121) RMStateStore should flush all pending store events before closing
[ https://issues.apache.org/jira/browse/YARN-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-1121: - Fix Version/s: (was: 2.2.0) 2.2.1 RMStateStore should flush all pending store events before closing - Key: YARN-1121 URL: https://issues.apache.org/jira/browse/YARN-1121 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Bikas Saha Fix For: 2.2.1 on serviceStop it should wait for all internal pending events to drain before stopping. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-740) Document the YARN service lifecycle development
[ https://issues.apache.org/jira/browse/YARN-740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-740: Affects Version/s: (was: 2.0.4-alpha) 2.2.0 Document the YARN service lifecycle development - Key: YARN-740 URL: https://issues.apache.org/jira/browse/YARN-740 Project: Hadoop YARN Issue Type: Sub-task Components: documentation Affects Versions: 2.2.0 Reporter: Steve Loughran Assignee: Steve Loughran Original Estimate: 4h Remaining Estimate: 4h Once the API is stable, document how to write YARN services. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1053) Diagnostic message from ContainerExitEvent is ignored in ContainerImpl
[ https://issues.apache.org/jira/browse/YARN-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-1053: - Fix Version/s: (was: 2.2.0) 2.2.1 Diagnostic message from ContainerExitEvent is ignored in ContainerImpl -- Key: YARN-1053 URL: https://issues.apache.org/jira/browse/YARN-1053 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Labels: newbie Fix For: 2.3.0, 2.2.1 Attachments: YARN-1053.20130809.patch If the container launch fails then we send ContainerExitEvent. This event contains exitCode and diagnostic message. Today we are ignoring diagnostic message while handling this event inside ContainerImpl. Fixing it as it is useful in diagnosing the failure. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1158) ResourceManager UI has application stdout missing if application stdout is not in the same directory as AppMaster stdout
[ https://issues.apache.org/jira/browse/YARN-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-1158: - Fix Version/s: (was: 2.2.0) 2.2.1 ResourceManager UI has application stdout missing if application stdout is not in the same directory as AppMaster stdout Key: YARN-1158 URL: https://issues.apache.org/jira/browse/YARN-1158 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Tassapol Athiapinya Fix For: 2.2.1 Configure yarn-site.xml's yarn.nodemanager.local-dirs to multiple directories. Turn on log aggregation. Run distributed shell application. If an application writes AppMaster.stdout in one directory and stdout in another directory. Goto ResourceManager web UI. Open up container logs. Only AppMaster.stdout would appear. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1022) Unnecessary INFO logs in AMRMClientAsync
[ https://issues.apache.org/jira/browse/YARN-1022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-1022: - Fix Version/s: (was: 2.2.0) 2.2.1 Unnecessary INFO logs in AMRMClientAsync Key: YARN-1022 URL: https://issues.apache.org/jira/browse/YARN-1022 Project: Hadoop YARN Issue Type: Bug Reporter: Bikas Saha Priority: Minor Labels: newbie Fix For: 2.2.1 Logs like the following should be debug or else every legitimate stop causes unnecessary exception traces in the logs. 464 2013-08-03 20:01:34,459 INFO [AMRM Heartbeater thread] org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl: Heartbeater interrupted 465 java.lang.InterruptedException: sleep interrupted 466 at java.lang.Thread.sleep(Native Method) 467 at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:249) 468 2013-08-03 20:01:34,460 INFO [AMRM Callback Handler Thread] org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl: Interrupted while waiting for queue 469 java.lang.InterruptedException 470 at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer. java:1961) 471 at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1996) 472 at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399) 473 at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:275) -- This message was sent by Atlassian JIRA (v6.1#6144)
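The change being asked for is simply to log these expected interruptions at debug level; a minimal sketch (LOG and the sleep interval are placeholders, not the actual AMRMClientAsyncImpl code):
{code}
try {
  Thread.sleep(heartbeatIntervalMs);
} catch (InterruptedException e) {
  // Expected during a legitimate stop; DEBUG avoids a noisy stack trace at INFO.
  if (LOG.isDebugEnabled()) {
    LOG.debug("Heartbeater interrupted", e);
  }
}
{code}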
[jira] [Updated] (YARN-1142) MiniYARNCluster web ui does not work properly
[ https://issues.apache.org/jira/browse/YARN-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-1142: - Fix Version/s: (was: 2.2.0) 2.2.1 MiniYARNCluster web ui does not work properly - Key: YARN-1142 URL: https://issues.apache.org/jira/browse/YARN-1142 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.0-beta Reporter: Alejandro Abdelnur Fix For: 2.2.1 When going to the RM http port, the NM web ui is displayed. It seems there is a singleton somewhere that breaks things when the RM and NMs run in the same process. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1234) Container localizer logs are not created in secured cluster
[ https://issues.apache.org/jira/browse/YARN-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-1234: - Fix Version/s: (was: 2.2.0) 2.2.1 Container localizer logs are not created in secured cluster Key: YARN-1234 URL: https://issues.apache.org/jira/browse/YARN-1234 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Fix For: 2.2.1 When we are running ContainerLocalizer in secured cluster we potentially are not creating any log file to track log messages. This will be helpful in potentially identifying ContainerLocalization issues in secured cluster. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1053) Diagnostic message from ContainerExitEvent is ignored in ContainerImpl
[ https://issues.apache.org/jira/browse/YARN-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1053: Priority: Blocker (was: Major) Diagnostic message from ContainerExitEvent is ignored in ContainerImpl -- Key: YARN-1053 URL: https://issues.apache.org/jira/browse/YARN-1053 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Priority: Blocker Labels: newbie Fix For: 2.3.0, 2.2.1 Attachments: YARN-1053.20130809.patch If the container launch fails then we send ContainerExitEvent. This event contains exitCode and diagnostic message. Today we are ignoring diagnostic message while handling this event inside ContainerImpl. Fixing it as it is useful in diagnosing the failure. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-891) Store completed application information in RM state store
[ https://issues.apache.org/jira/browse/YARN-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-891: - Attachment: YARN-891.1.patch New patch created a new RMAppRecoveredTransition for recover flow and get rid of the isFinalSavingRequestSent flag Store completed application information in RM state store - Key: YARN-891 URL: https://issues.apache.org/jira/browse/YARN-891 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Jian He Attachments: YARN-891.1.patch, YARN-891.patch, YARN-891.patch, YARN-891.patch, YARN-891.patch, YARN-891.patch, YARN-891.patch Add information like exit status etc for the completed attempt. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1305) RMHAProtocolService#serviceInit should handle HAUtil's IllegalArgumentException
[ https://issues.apache.org/jira/browse/YARN-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799549#comment-13799549 ] Karthik Kambatla commented on YARN-1305: bq. I did not see a test that verifies that more than 1 RM id must be specified in RM-HA-IDs? I don't think this needs to be a requirement. An empty value for RM-HA-IDs is a problem but having 1 RM id is not. We can may be warn the user, but continue to run. RMHAProtocolService#serviceInit should handle HAUtil's IllegalArgumentException --- Key: YARN-1305 URL: https://issues.apache.org/jira/browse/YARN-1305 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.1 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Labels: ha Attachments: YARN-1305.1.patch, YARN-1305.2.patch, YARN-1305.3.patch, YARN-1305.4.patch When yarn.resourcemanager.ha.enabled is true, RMHAProtocolService#serviceInit calls HAUtil.setAllRpcAddresses. If the configuration values are null, it just throws IllegalArgumentException. It's messy to analyse which keys are null, so we should handle it and log the name of keys which are null. A current log dump is as follows: {code} 2013-10-15 06:24:53,431 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: registered UNIX signal handlers for [TERM, HUP, INT] 2013-10-15 06:24:54,203 INFO org.apache.hadoop.service.AbstractService: Service RMHAProtocolService failed in state INITED; cause: java.lang.IllegalArgumentException: Property value must not be null java.lang.IllegalArgumentException: Property value must not be null at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) at org.apache.hadoop.conf.Configuration.set(Configuration.java:816) at org.apache.hadoop.conf.Configuration.set(Configuration.java:798) at org.apache.hadoop.yarn.conf.HAUtil.setConfValue(HAUtil.java:100) at org.apache.hadoop.yarn.conf.HAUtil.setAllRpcAddresses(HAUtil.java:105) at org.apache.hadoop.yarn.server.resourcemanager.RMHAProtocolService.serviceInit(RMHAProtocolService.java:60) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:187) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:940) {code} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-891) Store completed application information in RM state store
[ https://issues.apache.org/jira/browse/YARN-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799554#comment-13799554 ] Hadoop QA commented on YARN-891: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12609212/YARN-891.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2223//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2223//console This message is automatically generated. Store completed application information in RM state store - Key: YARN-891 URL: https://issues.apache.org/jira/browse/YARN-891 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Jian He Attachments: YARN-891.1.patch, YARN-891.patch, YARN-891.patch, YARN-891.patch, YARN-891.patch, YARN-891.patch, YARN-891.patch Add information like exit status etc for the completed attempt. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1139) [Umbrella] Convert all RM components to Services
[ https://issues.apache.org/jira/browse/YARN-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799553#comment-13799553 ] Tsuyoshi OZAWA commented on YARN-1139: -- [~ste...@apache.org], could you also check YARN-1305 and review the patch? That JIRA is a subtask of this one. [Umbrella] Convert all RM components to Services Key: YARN-1139 URL: https://issues.apache.org/jira/browse/YARN-1139 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Karthik Kambatla Assignee: Tsuyoshi OZAWA Some of the RM components - state store, scheduler etc. are not services. Converting them to services goes well with the Always On and Active service separation proposed in YARN-1098. Given that some of them already have start(), stop() methods, it should not be too hard to convert them to services. That would also be a cleaner way of addressing YARN-1125. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (YARN-1323) Set HTTPS webapp address along with other RPC addresses
Karthik Kambatla created YARN-1323: -- Summary: Set HTTPS webapp address along with other RPC addresses Key: YARN-1323 URL: https://issues.apache.org/jira/browse/YARN-1323 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla YARN-1232 adds the ability to configure multiple RMs, but missed the HTTPS webapp address. We need to add that in. -- This message was sent by Atlassian JIRA (v6.1#6144)
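For illustration, a sketch of the per-RM-id suffixing that this sub-task would extend to the HTTPS webapp address. The key name, hosts, port, and helper method here are assumptions for the example; the real logic lives in HAUtil/YarnConfiguration.
{code}
import org.apache.hadoop.conf.Configuration;

// Illustrative only: apply the same per-RM-id suffixing used for the other RM
// addresses to the HTTPS webapp address. Values below are made up for the example.
public class HttpsWebAppAddressSketch {
  private static final String RM_WEBAPP_HTTPS_ADDRESS =
      "yarn.resourcemanager.webapp.https.address";

  static String addSuffix(String key, String rmId) {
    return key + "." + rmId;  // e.g. yarn.resourcemanager.webapp.https.address.rm1
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set(addSuffix(RM_WEBAPP_HTTPS_ADDRESS, "rm1"), "rm1.example.com:8090");
    conf.set(addSuffix(RM_WEBAPP_HTTPS_ADDRESS, "rm2"), "rm2.example.com:8090");
    System.out.println(conf.get(addSuffix(RM_WEBAPP_HTTPS_ADDRESS, "rm1")));
  }
}
{code}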
[jira] [Commented] (YARN-1172) Convert *SecretManagers in the RM to services
[ https://issues.apache.org/jira/browse/YARN-1172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799563#comment-13799563 ] Tsuyoshi OZAWA commented on YARN-1172: -- [~kkambatl], do you have feedbacks about this JIRA? Convert *SecretManagers in the RM to services - Key: YARN-1172 URL: https://issues.apache.org/jira/browse/YARN-1172 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Karthik Kambatla Assignee: Tsuyoshi OZAWA Attachments: YARN-1172.1.patch, YARN-1172.2.patch, YARN-1172.3.patch, YARN-1172.4.patch, YARN-1172.5.patch, YARN-1172.6.patch, YARN-1172.7.patch -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (YARN-1324) NodeManager should assign 1 local directory to a container
Bikas Saha created YARN-1324: Summary: NodeManager should assign 1 local directory directory to a container Key: YARN-1324 URL: https://issues.apache.org/jira/browse/YARN-1324 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.2.0 Reporter: Bikas Saha Currently, for every container, the NM creates a directory on every disk and expects the container-task to choose 1 of them and load balance the use of the disks across all containers. 1) This may have worked fine in the MR world where MR tasks would randomly choose dirs but in general we cannot expect every app/task writer to understand these nuances and randomly pick disks. So we could end up overloading the first disk if most people decide to use the first disk. 2) This makes a number of NM operations to scan every disk (thus randomizing that disk) to locate the dir which the task has actually chosen to use for its files. Makes all these operations expensive for the NM as well as disruptive for users of disks that did not have the real task working dirs. I propose that NM should up-front decide the disk it is assigning to tasks. It could choose to do so randomly or weighted-randomly by looking at space and load on each disk. So it could do a better job of load balancing. Then, it would associate the chosen working directory with the container context so that subsequent operations on the NM can directly seek to the correct location instead of having to seek on every disk. -- This message was sent by Atlassian JIRA (v6.1#6144)
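To make the weighted-random idea in the proposal concrete, a small stand-alone sketch (not NodeManager code) that picks a local directory with probability proportional to its free space:
{code}
import java.io.File;
import java.util.List;
import java.util.Random;

// Illustration of "weighted-random by free space" selection; this is not NM code,
// just the selection idea in isolation.
public class LocalDirPickerSketch {
  private final Random random = new Random();

  File pickDir(List<File> localDirs) {
    long totalFree = 0;
    for (File dir : localDirs) {
      totalFree += dir.getUsableSpace();
    }
    if (totalFree <= 0) {
      return localDirs.get(random.nextInt(localDirs.size()));  // fall back to uniform
    }
    // Pick a point in [0, totalFree) and walk the dirs: a disk with more free
    // space covers a larger slice and is therefore chosen more often.
    long point = (long) (random.nextDouble() * totalFree);
    long cumulative = 0;
    for (File dir : localDirs) {
      cumulative += dir.getUsableSpace();
      if (point < cumulative) {
        return dir;
      }
    }
    return localDirs.get(localDirs.size() - 1);
  }

  public static void main(String[] args) {
    LocalDirPickerSketch picker = new LocalDirPickerSketch();
    List<File> dirs = java.util.Arrays.asList(new File("/tmp"), new File("."));
    System.out.println("picked: " + picker.pickDir(dirs));
  }
}
{code}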
[jira] [Commented] (YARN-1305) RMHAProtocolService#serviceInit should handle HAUtil's IllegalArgumentException
[ https://issues.apache.org/jira/browse/YARN-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799638#comment-13799638 ] Tsuyoshi OZAWA commented on YARN-1305: -- After reading Karthik's comment, I was thinking use cases when we enable RM HA configuration without multiple RM ids. It's useful in following cases: 1. Developing. 2. Testing. 3. Manual failover(?) Therefore, we should support it IMO. I came up with another idea to support strict mode to stop RM with a wrong configuration when RM startup as Bikas mentioned. It's useful to detect wrong operations. However, it's not time to do this IMO, because we're still developing RM HA now. After getting stable, we should support the strict mode. Thoughts? RMHAProtocolService#serviceInit should handle HAUtil's IllegalArgumentException --- Key: YARN-1305 URL: https://issues.apache.org/jira/browse/YARN-1305 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.1 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Labels: ha Attachments: YARN-1305.1.patch, YARN-1305.2.patch, YARN-1305.3.patch, YARN-1305.4.patch When yarn.resourcemanager.ha.enabled is true, RMHAProtocolService#serviceInit calls HAUtil.setAllRpcAddresses. If the configuration values are null, it just throws IllegalArgumentException. It's messy to analyse which keys are null, so we should handle it and log the name of keys which are null. A current log dump is as follows: {code} 2013-10-15 06:24:53,431 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: registered UNIX signal handlers for [TERM, HUP, INT] 2013-10-15 06:24:54,203 INFO org.apache.hadoop.service.AbstractService: Service RMHAProtocolService failed in state INITED; cause: java.lang.IllegalArgumentException: Property value must not be null java.lang.IllegalArgumentException: Property value must not be null at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) at org.apache.hadoop.conf.Configuration.set(Configuration.java:816) at org.apache.hadoop.conf.Configuration.set(Configuration.java:798) at org.apache.hadoop.yarn.conf.HAUtil.setConfValue(HAUtil.java:100) at org.apache.hadoop.yarn.conf.HAUtil.setAllRpcAddresses(HAUtil.java:105) at org.apache.hadoop.yarn.server.resourcemanager.RMHAProtocolService.serviceInit(RMHAProtocolService.java:60) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:187) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:940) {code} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1305) RMHAProtocolService#serviceInit should handle HAUtil's IllegalArgumentException
[ https://issues.apache.org/jira/browse/YARN-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799643#comment-13799643 ] Bikas Saha commented on YARN-1305: -- Sure. we can add the multiple RM's check later on. please open a sub-task for it. RMHAProtocolService#serviceInit should handle HAUtil's IllegalArgumentException --- Key: YARN-1305 URL: https://issues.apache.org/jira/browse/YARN-1305 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.1 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Labels: ha Attachments: YARN-1305.1.patch, YARN-1305.2.patch, YARN-1305.3.patch, YARN-1305.4.patch When yarn.resourcemanager.ha.enabled is true, RMHAProtocolService#serviceInit calls HAUtil.setAllRpcAddresses. If the configuration values are null, it just throws IllegalArgumentException. It's messy to analyse which keys are null, so we should handle it and log the name of keys which are null. A current log dump is as follows: {code} 2013-10-15 06:24:53,431 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: registered UNIX signal handlers for [TERM, HUP, INT] 2013-10-15 06:24:54,203 INFO org.apache.hadoop.service.AbstractService: Service RMHAProtocolService failed in state INITED; cause: java.lang.IllegalArgumentException: Property value must not be null java.lang.IllegalArgumentException: Property value must not be null at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) at org.apache.hadoop.conf.Configuration.set(Configuration.java:816) at org.apache.hadoop.conf.Configuration.set(Configuration.java:798) at org.apache.hadoop.yarn.conf.HAUtil.setConfValue(HAUtil.java:100) at org.apache.hadoop.yarn.conf.HAUtil.setAllRpcAddresses(HAUtil.java:105) at org.apache.hadoop.yarn.server.resourcemanager.RMHAProtocolService.serviceInit(RMHAProtocolService.java:60) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:187) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:940) {code} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1305) RMHAProtocolService#serviceInit should handle HAUtil's IllegalArgumentException
[ https://issues.apache.org/jira/browse/YARN-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799655#comment-13799655 ] Tsuyoshi OZAWA commented on YARN-1305: -- Filed YARN-1325 for the multiple RM's check. RMHAProtocolService#serviceInit should handle HAUtil's IllegalArgumentException --- Key: YARN-1305 URL: https://issues.apache.org/jira/browse/YARN-1305 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.1 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Labels: ha Attachments: YARN-1305.1.patch, YARN-1305.2.patch, YARN-1305.3.patch, YARN-1305.4.patch When yarn.resourcemanager.ha.enabled is true, RMHAProtocolService#serviceInit calls HAUtil.setAllRpcAddresses. If the configuration values are null, it just throws IllegalArgumentException. It's messy to analyse which keys are null, so we should handle it and log the name of keys which are null. A current log dump is as follows: {code} 2013-10-15 06:24:53,431 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: registered UNIX signal handlers for [TERM, HUP, INT] 2013-10-15 06:24:54,203 INFO org.apache.hadoop.service.AbstractService: Service RMHAProtocolService failed in state INITED; cause: java.lang.IllegalArgumentException: Property value must not be null java.lang.IllegalArgumentException: Property value must not be null at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) at org.apache.hadoop.conf.Configuration.set(Configuration.java:816) at org.apache.hadoop.conf.Configuration.set(Configuration.java:798) at org.apache.hadoop.yarn.conf.HAUtil.setConfValue(HAUtil.java:100) at org.apache.hadoop.yarn.conf.HAUtil.setAllRpcAddresses(HAUtil.java:105) at org.apache.hadoop.yarn.server.resourcemanager.RMHAProtocolService.serviceInit(RMHAProtocolService.java:60) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:187) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:940) {code} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1288) Make Fair Scheduler ACLs more user friendly
[ https://issues.apache.org/jira/browse/YARN-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-1288: - Attachment: YARN-1288-3.patch Make Fair Scheduler ACLs more user friendly --- Key: YARN-1288 URL: https://issues.apache.org/jira/browse/YARN-1288 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: YARN-1288-1.patch, YARN-1288-2.patch, YARN-1288-3.patch, YARN-1288.patch The Fair Scheduler currently defaults the root queue's acl to empty and all other queues' acl to *. Now that YARN-1258 enables configuring the root queue, we should reverse this. This will also bring the Fair Scheduler in line with the Capacity Scheduler. We should also not trim the acl strings, which makes it impossible to only specify groups in an acl. -- This message was sent by Atlassian JIRA (v6.1#6144)
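For context on the trimming point: a queue ACL string is "&lt;users&gt; &lt;groups&gt;", so an ACL that names only groups has to keep its leading space, and trimming it turns the group into a user. A small illustration using Hadoop's AccessControlList parser (not Fair Scheduler code):
{code}
import org.apache.hadoop.security.authorize.AccessControlList;

// Shows why trimming breaks groups-only ACLs: " admins" means "no users, group admins",
// while the trimmed "admins" means "user admins, no groups".
public class QueueAclSketch {
  public static void main(String[] args) {
    AccessControlList groupsOnly = new AccessControlList(" admins");        // groups only
    AccessControlList trimmed = new AccessControlList(" admins".trim());

    System.out.println("groups-only acl  users=" + groupsOnly.getUsers()
        + " groups=" + groupsOnly.getGroups());
    System.out.println("after trimming   users=" + trimmed.getUsers()
        + " groups=" + trimmed.getGroups());
  }
}
{code}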
[jira] [Commented] (YARN-1324) NodeManager should assign 1 local directory to a container
[ https://issues.apache.org/jira/browse/YARN-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799667#comment-13799667 ] Vinod Kumar Vavilapalli commented on YARN-1324: --- It wasn't done like that with MR world only in mind. Even outside MR, many apps want to write data in parallel and want to take advantage of multiple disks. We cannot make NM to decide one disk because of that. Apps/containers that don't care about load-balancing or multiple disks can chose to always write to the first disk and NM will eventually load balance them. To have true load-balancing all the time (and not just post container finish), YARN needs cooperative containers. And the better solution for that is to make apps ask the number of disks to write when they launch containers. That way YARN isn't overriding users intention to use/not use multiple disks. The title should be changed with problem description (and not the solution). NodeManager should assign 1 local directory directory to a container Key: YARN-1324 URL: https://issues.apache.org/jira/browse/YARN-1324 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.2.0 Reporter: Bikas Saha Currently, for every container, the NM creates a directory on every disk and expects the container-task to choose 1 of them and load balance the use of the disks across all containers. 1) This may have worked fine in the MR world where MR tasks would randomly choose dirs but in general we cannot expect every app/task writer to understand these nuances and randomly pick disks. So we could end up overloading the first disk if most people decide to use the first disk. 2) This makes a number of NM operations to scan every disk (thus randomizing that disk) to locate the dir which the task has actually chosen to use for its files. Makes all these operations expensive for the NM as well as disruptive for users of disks that did not have the real task working dirs. I propose that NM should up-front decide the disk it is assigning to tasks. It could choose to do so randomly or weighted-randomly by looking at space and load on each disk. So it could do a better job of load balancing. Then, it would associate the chosen working directory with the container context so that subsequent operations on the NM can directly seek to the correct location instead of having to seek on every disk. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-884) AM expiry interval should be set to smaller of {am, nm}.liveness-monitor.expiry-interval-ms
[ https://issues.apache.org/jira/browse/YARN-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799675#comment-13799675 ] Vinod Kumar Vavilapalli commented on YARN-884: -- bq. Vinod Kumar Vavilapalli, partly agree with you that they are two different knobs. However, at least in the current implementation, restarting an NM cleans up all the containers on it (correct me if I am wrong) including the AM. In that scenario, having a higher value for AM_EXPIRY will only delay starting the AM. No? That is just a temporary artifact of us not having work-preserving restart. That shouldn't change our meaning of long term configuration properties. AM expiry interval should be set to smaller of {am, nm}.liveness-monitor.expiry-interval-ms --- Key: YARN-884 URL: https://issues.apache.org/jira/browse/YARN-884 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.0.4-alpha Reporter: Karthik Kambatla Assignee: Karthik Kambatla Labels: configuration Attachments: yarn-884-1.patch As the AM can't outlive the NM on which it is running, it is a good idea to disallow setting the am.liveness-monitor.expiry-interval-ms to a value higher than nm.liveness-monitor.expiry-interval-ms -- This message was sent by Atlassian JIRA (v6.1#6144)
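A minimal sketch of the clamping this JIRA proposes, not the attached yarn-884-1.patch; the key names assume the yarn.* properties from yarn-default.xml and the 600000 ms defaults.
{code}
import org.apache.hadoop.conf.Configuration;

// Cap the AM expiry interval at the NM expiry interval, since (without
// work-preserving restart) an AM cannot outlive the NM it runs on.
public class AmExpiryClampSketch {
  static final String AM_EXPIRY_MS = "yarn.am.liveness-monitor.expiry-interval-ms";
  static final String NM_EXPIRY_MS = "yarn.nm.liveness-monitor.expiry-interval-ms";

  static long effectiveAmExpiry(Configuration conf) {
    long amExpiry = conf.getLong(AM_EXPIRY_MS, 600000L);
    long nmExpiry = conf.getLong(NM_EXPIRY_MS, 600000L);
    // Use the smaller of the two; a real implementation would warn (or fail)
    // if the configured AM value was larger.
    return Math.min(amExpiry, nmExpiry);
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setLong(AM_EXPIRY_MS, 900000L);
    conf.setLong(NM_EXPIRY_MS, 600000L);
    System.out.println("effective AM expiry = " + effectiveAmExpiry(conf) + " ms");
  }
}
{code}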
[jira] [Commented] (YARN-1288) Make Fair Scheduler ACLs more user friendly
[ https://issues.apache.org/jira/browse/YARN-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799680#comment-13799680 ] Hadoop QA commented on YARN-1288: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12609237/YARN-1288-3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2224//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2224//console This message is automatically generated. Make Fair Scheduler ACLs more user friendly --- Key: YARN-1288 URL: https://issues.apache.org/jira/browse/YARN-1288 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: YARN-1288-1.patch, YARN-1288-2.patch, YARN-1288-3.patch, YARN-1288.patch The Fair Scheduler currently defaults the root queue's acl to empty and all other queues' acl to *. Now that YARN-1258 enables configuring the root queue, we should reverse this. This will also bring the Fair Scheduler in line with the Capacity Scheduler. We should also not trim the acl strings, which makes it impossible to only specify groups in an acl. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1305) RMHAProtocolService#serviceInit should handle HAUtil's IllegalArgumentException
[ https://issues.apache.org/jira/browse/YARN-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1305: - Attachment: YARN-1305.5.patch Updated patches based on Bikas' and Karthik's comments. * Moved all verifications to verifyConfiguration, and call it in RMHAProtocolService#serviceInit(). * Created verifyRMHAIds()/verifyRMHAId()/verifyAllRpcAddresses() methods. They verify configuration values and log verification error. * For now, a configuration contains only one RM-IDs, log it as warning as Karthik described. Log format is same at the last patch. Additionally, a case a configuration contains only one RM-IDs: {quote} 2013-10-19 00:18:29,698 WARN org.apache.hadoop.yarn.conf.HAUtil: Resource Manager HA is enabled, but yarn.resourcemanager.ha.rm-ids has only one id([rm1]) {quote} RMHAProtocolService#serviceInit should handle HAUtil's IllegalArgumentException --- Key: YARN-1305 URL: https://issues.apache.org/jira/browse/YARN-1305 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.1 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Labels: ha Attachments: YARN-1305.1.patch, YARN-1305.2.patch, YARN-1305.3.patch, YARN-1305.4.patch, YARN-1305.5.patch When yarn.resourcemanager.ha.enabled is true, RMHAProtocolService#serviceInit calls HAUtil.setAllRpcAddresses. If the configuration values are null, it just throws IllegalArgumentException. It's messy to analyse which keys are null, so we should handle it and log the name of keys which are null. A current log dump is as follows: {code} 2013-10-15 06:24:53,431 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: registered UNIX signal handlers for [TERM, HUP, INT] 2013-10-15 06:24:54,203 INFO org.apache.hadoop.service.AbstractService: Service RMHAProtocolService failed in state INITED; cause: java.lang.IllegalArgumentException: Property value must not be null java.lang.IllegalArgumentException: Property value must not be null at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) at org.apache.hadoop.conf.Configuration.set(Configuration.java:816) at org.apache.hadoop.conf.Configuration.set(Configuration.java:798) at org.apache.hadoop.yarn.conf.HAUtil.setConfValue(HAUtil.java:100) at org.apache.hadoop.yarn.conf.HAUtil.setAllRpcAddresses(HAUtil.java:105) at org.apache.hadoop.yarn.server.resourcemanager.RMHAProtocolService.serviceInit(RMHAProtocolService.java:60) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:187) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:940) {code} -- This message was sent by Atlassian JIRA (v6.1#6144)
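A rough, stand-alone sketch of the verification idea being discussed, not the attached patch: check the required HA keys up front and report every missing one by name, rather than letting Configuration.set() throw a bare IllegalArgumentException. The key list below is illustrative.
{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;

// Collects all missing keys and reports them together, which is what the
// improved log message should show.
public class HAConfigVerifierSketch {
  private static final String[] REQUIRED_KEYS = {
      "yarn.resourcemanager.ha.rm-ids",
      "yarn.resourcemanager.address",
      "yarn.resourcemanager.scheduler.address",
      "yarn.resourcemanager.resource-tracker.address",
      "yarn.resourcemanager.admin.address",
      "yarn.resourcemanager.webapp.address"
  };

  static void verifyConfiguration(Configuration conf) {
    List<String> missing = new ArrayList<String>();
    for (String key : REQUIRED_KEYS) {
      if (conf.getTrimmed(key) == null) {
        missing.add(key);
      }
    }
    if (!missing.isEmpty()) {
      throw new IllegalArgumentException(
          "RM HA is enabled but these properties are not set: " + missing);
    }
  }

  public static void main(String[] args) {
    verifyConfiguration(new Configuration());  // throws, listing all missing keys
  }
}
{code}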
[jira] [Commented] (YARN-1324) NodeManager should assign 1 local directory to a container
[ https://issues.apache.org/jira/browse/YARN-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799698#comment-13799698 ] Bikas Saha commented on YARN-1324: -- Are there current applications that want to write in parallel to multiple local disks? If not, then we should probably figure out how to support them well when they show up. In the meanwhile, we could look at the above mentioned drawbacks and decide whether the they are worth fixing or not, either by restricting solution above or some other solution. Are the above drawbacks worthwhile issues? If yes, are there alternative proposals for a solution? NodeManager should assign 1 local directory directory to a container Key: YARN-1324 URL: https://issues.apache.org/jira/browse/YARN-1324 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.2.0 Reporter: Bikas Saha Currently, for every container, the NM creates a directory on every disk and expects the container-task to choose 1 of them and load balance the use of the disks across all containers. 1) This may have worked fine in the MR world where MR tasks would randomly choose dirs but in general we cannot expect every app/task writer to understand these nuances and randomly pick disks. So we could end up overloading the first disk if most people decide to use the first disk. 2) This makes a number of NM operations to scan every disk (thus randomizing that disk) to locate the dir which the task has actually chosen to use for its files. Makes all these operations expensive for the NM as well as disruptive for users of disks that did not have the real task working dirs. I propose that NM should up-front decide the disk it is assigning to tasks. It could choose to do so randomly or weighted-randomly by looking at space and load on each disk. So it could do a better job of load balancing. Then, it would associate the chosen working directory with the container context so that subsequent operations on the NM can directly seek to the correct location instead of having to seek on every disk. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1324) NodeManager potentially causes unnecessary operations on all its disks
[ https://issues.apache.org/jira/browse/YARN-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated YARN-1324: - Summary: NodeManager potentially causes unnecessary operations on all its disks (was: NodeManager should assign 1 local directory directory to a container) NodeManager potentially causes unnecessary operations on all its disks -- Key: YARN-1324 URL: https://issues.apache.org/jira/browse/YARN-1324 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.2.0 Reporter: Bikas Saha Currently, for every container, the NM creates a directory on every disk and expects the container-task to choose 1 of them and load balance the use of the disks across all containers. 1) This may have worked fine in the MR world where MR tasks would randomly choose dirs but in general we cannot expect every app/task writer to understand these nuances and randomly pick disks. So we could end up overloading the first disk if most people decide to use the first disk. 2) This makes a number of NM operations to scan every disk (thus randomizing that disk) to locate the dir which the task has actually chosen to use for its files. Makes all these operations expensive for the NM as well as disruptive for users of disks that did not have the real task working dirs. I propose that NM should up-front decide the disk it is assigning to tasks. It could choose to do so randomly or weighted-randomly by looking at space and load on each disk. So it could do a better job of load balancing. Then, it would associate the chosen working directory with the container context so that subsequent operations on the NM can directly seek to the correct location instead of having to seek on every disk. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1185) FileSystemRMStateStore can leave partial files that prevent subsequent recovery
[ https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799707#comment-13799707 ] Vinod Kumar Vavilapalli commented on YARN-1185: --- Patch looks good to me. Can you address the test-issue? FileSystemRMStateStore can leave partial files that prevent subsequent recovery --- Key: YARN-1185 URL: https://issues.apache.org/jira/browse/YARN-1185 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Omkar Vinit Joshi Attachments: YARN-1185.1.patch, YARN-1185.2.patch FileSystemRMStateStore writes directly to the destination file when storing state. However if the RM were to crash in the middle of the write, the recovery method could encounter a partially-written file and either outright crash during recovery or silently load incomplete state. To avoid this, the data should be written to a temporary file and renamed to the destination file afterwards. -- This message was sent by Atlassian JIRA (v6.1#6144)
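A stand-alone sketch of the write-to-temp-then-rename approach described above, not the attached patch; the path names are made up for the example.
{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Write state to a temporary file first, then rename it into place, so a
// recovering RM never reads a partially written file.
public class AtomicStateWriteSketch {
  static void writeAtomically(FileSystem fs, Path finalFile, byte[] data)
      throws IOException {
    Path tmpFile = new Path(finalFile.getParent(), finalFile.getName() + ".tmp");
    FSDataOutputStream out = fs.create(tmpFile, true);  // overwrite any stale temp file
    try {
      out.write(data);
    } finally {
      out.close();
    }
    // Expose the data under its real name only after the write completed. Note that
    // rename does not overwrite an existing destination, so an update would have to
    // delete the old file first.
    if (!fs.rename(tmpFile, finalFile)) {
      throw new IOException("Could not rename " + tmpFile + " to " + finalFile);
    }
  }

  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.getLocal(new Configuration());
    writeAtomically(fs, new Path("/tmp/rmstore-example/app_0001"),
        "application state".getBytes("UTF-8"));
  }
}
{code}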
[jira] [Commented] (YARN-1305) RMHAProtocolService#serviceInit should handle HAUtil's IllegalArgumentException
[ https://issues.apache.org/jira/browse/YARN-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799713#comment-13799713 ] Hadoop QA commented on YARN-1305: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12609241/YARN-1305.5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.conf.TestHAUtil {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2225//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2225//console This message is automatically generated. RMHAProtocolService#serviceInit should handle HAUtil's IllegalArgumentException --- Key: YARN-1305 URL: https://issues.apache.org/jira/browse/YARN-1305 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.1 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Labels: ha Attachments: YARN-1305.1.patch, YARN-1305.2.patch, YARN-1305.3.patch, YARN-1305.4.patch, YARN-1305.5.patch When yarn.resourcemanager.ha.enabled is true, RMHAProtocolService#serviceInit calls HAUtil.setAllRpcAddresses. If the configuration values are null, it just throws IllegalArgumentException. It's messy to analyse which keys are null, so we should handle it and log the name of keys which are null. 
A current log dump is as follows: {code} 2013-10-15 06:24:53,431 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: registered UNIX signal handlers for [TERM, HUP, INT] 2013-10-15 06:24:54,203 INFO org.apache.hadoop.service.AbstractService: Service RMHAProtocolService failed in state INITED; cause: java.lang.IllegalArgumentException: Property value must not be null java.lang.IllegalArgumentException: Property value must not be null at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) at org.apache.hadoop.conf.Configuration.set(Configuration.java:816) at org.apache.hadoop.conf.Configuration.set(Configuration.java:798) at org.apache.hadoop.yarn.conf.HAUtil.setConfValue(HAUtil.java:100) at org.apache.hadoop.yarn.conf.HAUtil.setAllRpcAddresses(HAUtil.java:105) at org.apache.hadoop.yarn.server.resourcemanager.RMHAProtocolService.serviceInit(RMHAProtocolService.java:60) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:187) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:940) {code} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1185) FileSystemRMStateStore can leave partial files that prevent subsequent recovery
[ https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1185: Attachment: YARN-1185.3.patch FileSystemRMStateStore can leave partial files that prevent subsequent recovery --- Key: YARN-1185 URL: https://issues.apache.org/jira/browse/YARN-1185 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Omkar Vinit Joshi Attachments: YARN-1185.1.patch, YARN-1185.2.patch, YARN-1185.3.patch FileSystemRMStateStore writes directly to the destination file when storing state. However if the RM were to crash in the middle of the write, the recovery method could encounter a partially-written file and either outright crash during recovery or silently load incomplete state. To avoid this, the data should be written to a temporary file and renamed to the destination file afterwards. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1326) RM should log using RMStore at startup time
[ https://issues.apache.org/jira/browse/YARN-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1326: - Attachment: YARN-1326.1.patch This patch enables the RM to log which RMStore it is using, as follows. {code} org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore is used for ResourceManager HA {code} RM should log using RMStore at startup time --- Key: YARN-1326 URL: https://issues.apache.org/jira/browse/YARN-1326 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-1326.1.patch Original Estimate: 3h Remaining Estimate: 3h Currently there is no way to know which RMStore the RM uses. It is useful to log this information at RM startup time. -- This message was sent by Atlassian JIRA (v6.1#6144)
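The log statement itself is trivial; a stand-in sketch of the kind of line the patch adds (not the actual RMStateStore wiring, and the message text is only an approximation of the example above):
{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// Logs the concrete class of whichever store implementation was configured.
public class StoreLoggingSketch {
  private static final Log LOG = LogFactory.getLog(StoreLoggingSketch.class);

  static void logStoreClass(Object store) {
    LOG.info(store.getClass().getName() + " is used as the RM state store");
  }

  public static void main(String[] args) {
    logStoreClass(new java.util.Properties());  // placeholder object for the demo
  }
}
{code}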
[jira] [Commented] (YARN-1321) NMTokenCache should not be a singleton
[ https://issues.apache.org/jira/browse/YARN-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799736#comment-13799736 ] Omkar Vinit Joshi commented on YARN-1321: - Why are you running multiple AMs inside the same JVM in the first place? As far as YARN is concerned, multiple AMs per JVM/process were never expected. Definitely not a blocker. Please explain the use case for running multiple AMs inside the same process. If you really want to run them that way, why not just update NMTokenCache while keeping the single-AM case as the default? Still, I don't see why you are doing this. NMTokenCache should not be a singleton -- Key: YARN-1321 URL: https://issues.apache.org/jira/browse/YARN-1321 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Fix For: 2.2.1 NMTokenCache is a singleton. Because of this, if running multiple AMs in a single JVM NMTokens for the same node from different AMs step on each other and starting containers fail due to mismatch tokens. The error observed in the client side is something like: {code} ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:llama (auth:PROXY) via llama (auth:SIMPLE) cause:org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container. NMToken for application attempt : appattempt_1382038445650_0002_01 was used for starting container with container token issued for application attempt : appattempt_1382038445650_0001_01 {code} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1326) RM should log using RMStore at startup time
[ https://issues.apache.org/jira/browse/YARN-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799734#comment-13799734 ] Hadoop QA commented on YARN-1326: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12609248/YARN-1326.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2227//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2227//console This message is automatically generated. RM should log using RMStore at startup time --- Key: YARN-1326 URL: https://issues.apache.org/jira/browse/YARN-1326 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-1326.1.patch Original Estimate: 3h Remaining Estimate: 3h Currently there are no way to know which RMStore RM uses. It's useful to log the information at RM's startup time. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1321) NMTokenCache should not be a singleton
[ https://issues.apache.org/jira/browse/YARN-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799735#comment-13799735 ] Vinod Kumar Vavilapalli commented on YARN-1321: --- Why is this a blocker? Don't think it is, multiple AMs in a JVM wasn't supported in a first class way - I'm sure you'll find more issues here. Also, please edit the title with the problem statement instead of the solution. Now as to more details: Don't know the internal details, so is llama running with multiple AMs one after another or in parallel? And is the context an unmanaged AM? NMTokenCache should not be a singleton -- Key: YARN-1321 URL: https://issues.apache.org/jira/browse/YARN-1321 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Fix For: 2.2.1 NMTokenCache is a singleton. Because of this, if running multiple AMs in a single JVM NMTokens for the same node from different AMs step on each other and starting containers fail due to mismatch tokens. The error observed in the client side is something like: {code} ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:llama (auth:PROXY) via llama (auth:SIMPLE) cause:org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container. NMToken for application attempt : appattempt_1382038445650_0002_01 was used for starting container with container token issued for application attempt : appattempt_1382038445650_0001_01 {code} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1321) NMTokenCache is a singleton, prevents multiple AMs running in a single JVM to work correctly.
[ https://issues.apache.org/jira/browse/YARN-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alejandro Abdelnur updated YARN-1321: - Summary: NMTokenCache is a a singleton, prevents multiple AMs running in a single JVM to work correctly. (was: NMTokenCache should not be a singleton) NMTokenCache is a a singleton, prevents multiple AMs running in a single JVM to work correctly. --- Key: YARN-1321 URL: https://issues.apache.org/jira/browse/YARN-1321 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Fix For: 2.2.1 NMTokenCache is a singleton. Because of this, if running multiple AMs in a single JVM NMTokens for the same node from different AMs step on each other and starting containers fail due to mismatch tokens. The error observed in the client side is something like: {code} ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:llama (auth:PROXY) via llama (auth:SIMPLE) cause:org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container. NMToken for application attempt : appattempt_1382038445650_0002_01 was used for starting container with container token issued for application attempt : appattempt_1382038445650_0001_01 {code} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1321) NMTokenCache is a singleton, prevents multiple AMs running in a single JVM to work correctly.
[ https://issues.apache.org/jira/browse/YARN-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799747#comment-13799747 ] Alejandro Abdelnur commented on YARN-1321: -- We ran into this issue in Llama. Llama is a single JVM hosting multiple unmanaged ApplicationMasters that run at the same time (in parallel). Because NMTokenCache is a singleton, NMTokens for the same node from the different AMs step on each other. The patch I'm working on preserves the current behavior (singleton NMTokenCache) while allowing a client to set an NMTokenCache instance on the AMRMClient/NMClient (and Async versions). If an instance is set, then the NMTokens are stored in it instead of the singleton. This preserves backward compatibility both in behavior and in API. NMTokenCache is a singleton, prevents multiple AMs running in a single JVM to work correctly. --- Key: YARN-1321 URL: https://issues.apache.org/jira/browse/YARN-1321 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Fix For: 2.2.1 NMTokenCache is a singleton. Because of this, if running multiple AMs in a single JVM NMTokens for the same node from different AMs step on each other and starting containers fail due to mismatch tokens. The error observed in the client side is something like: {code} ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:llama (auth:PROXY) via llama (auth:SIMPLE) cause:org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container. NMToken for application attempt : appattempt_1382038445650_0002_01 was used for starting container with container token issued for application attempt : appattempt_1382038445650_0001_01 {code} -- This message was sent by Atlassian JIRA (v6.1#6144)
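A hypothetical usage sketch of the proposed API: each AM in the JVM gets its own NMTokenCache, shared by that AM's AMRMClient and NMClient. The NMTokenCache constructor and the setNMTokenCache setters follow the proposal in this comment and may differ from what the final patch ships.
{code}
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.client.api.NMTokenCache;

// Proposed (not yet committed) per-instance wiring: one NMTokenCache per AM,
// so tokens for the same node from different AMs no longer overwrite each other.
public class PerAmTokenCacheSketch {
  public static void main(String[] args) {
    NMTokenCache cacheForAm1 = new NMTokenCache();
    AMRMClient<ContainerRequest> amrm1 = AMRMClient.createAMRMClient();
    amrm1.setNMTokenCache(cacheForAm1);
    NMClient nm1 = NMClient.createNMClient();
    nm1.setNMTokenCache(cacheForAm1);              // same cache instance as amrm1

    NMTokenCache cacheForAm2 = new NMTokenCache(); // second AM, isolated tokens
    AMRMClient<ContainerRequest> amrm2 = AMRMClient.createAMRMClient();
    amrm2.setNMTokenCache(cacheForAm2);
    NMClient nm2 = NMClient.createNMClient();
    nm2.setNMTokenCache(cacheForAm2);
  }
}
{code}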
[jira] [Updated] (YARN-1321) NMTokenCache is a singleton, prevents multiple AMs running in a single JVM to work correctly.
[ https://issues.apache.org/jira/browse/YARN-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alejandro Abdelnur updated YARN-1321: - Attachment: YARN-1321.patch Attached a patch with the proposed solution. So far this is the only issue we've run while using multiple AMs in a single JVM. NMTokenCache is a a singleton, prevents multiple AMs running in a single JVM to work correctly. --- Key: YARN-1321 URL: https://issues.apache.org/jira/browse/YARN-1321 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Fix For: 2.2.1 Attachments: YARN-1321.patch NMTokenCache is a singleton. Because of this, if running multiple AMs in a single JVM NMTokens for the same node from different AMs step on each other and starting containers fail due to mismatch tokens. The error observed in the client side is something like: {code} ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:llama (auth:PROXY) via llama (auth:SIMPLE) cause:org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container. NMToken for application attempt : appattempt_1382038445650_0002_01 was used for starting container with container token issued for application attempt : appattempt_1382038445650_0001_01 {code} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1321) NMTokenCache is a singleton, prevents multiple AMs running in a single JVM to work correctly
[ https://issues.apache.org/jira/browse/YARN-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alejandro Abdelnur updated YARN-1321: - Summary: NMTokenCache is a a singleton, prevents multiple AMs running in a single JVM to work correctly (was: NMTokenCache is a a singleton, prevents multiple AMs running in a single JVM to work correctly.) NMTokenCache is a a singleton, prevents multiple AMs running in a single JVM to work correctly -- Key: YARN-1321 URL: https://issues.apache.org/jira/browse/YARN-1321 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Fix For: 2.2.1 Attachments: YARN-1321.patch NMTokenCache is a singleton. Because of this, if running multiple AMs in a single JVM NMTokens for the same node from different AMs step on each other and starting containers fail due to mismatch tokens. The error observed in the client side is something like: {code} ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:llama (auth:PROXY) via llama (auth:SIMPLE) cause:org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container. NMToken for application attempt : appattempt_1382038445650_0002_01 was used for starting container with container token issued for application attempt : appattempt_1382038445650_0001_01 {code} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1326) RM should log using RMStore at startup time
[ https://issues.apache.org/jira/browse/YARN-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799753#comment-13799753 ] Tsuyoshi OZAWA commented on YARN-1326: -- This patch just adds a log statement, so no additional tests are needed. RM should log using RMStore at startup time --- Key: YARN-1326 URL: https://issues.apache.org/jira/browse/YARN-1326 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-1326.1.patch Original Estimate: 3h Remaining Estimate: 3h Currently there is no way to know which RMStore the RM uses. It is useful to log this information at RM startup time. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1321) NMTokenCache is a singleton, prevents multiple AMs running in a single JVM to work correctly
[ https://issues.apache.org/jira/browse/YARN-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799770#comment-13799770 ] Hadoop QA commented on YARN-1321: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12609251/YARN-1321.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 1 warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2228//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/2228//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-client.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2228//console This message is automatically generated. NMTokenCache is a a singleton, prevents multiple AMs running in a single JVM to work correctly -- Key: YARN-1321 URL: https://issues.apache.org/jira/browse/YARN-1321 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Fix For: 2.2.1 Attachments: YARN-1321.patch NMTokenCache is a singleton. Because of this, if running multiple AMs in a single JVM NMTokens for the same node from different AMs step on each other and starting containers fail due to mismatch tokens. The error observed in the client side is something like: {code} ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:llama (auth:PROXY) via llama (auth:SIMPLE) cause:org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container. NMToken for application attempt : appattempt_1382038445650_0002_01 was used for starting container with container token issued for application attempt : appattempt_1382038445650_0001_01 {code} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1305) RMHAProtocolService#serviceInit should handle HAUtil's IllegalArgumentException
[ https://issues.apache.org/jira/browse/YARN-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1305: - Attachment: YARN-1305.6.patch RMHAProtocolService#serviceInit should handle HAUtil's IllegalArgumentException --- Key: YARN-1305 URL: https://issues.apache.org/jira/browse/YARN-1305 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.1 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Labels: ha Attachments: YARN-1305.1.patch, YARN-1305.2.patch, YARN-1305.3.patch, YARN-1305.4.patch, YARN-1305.5.patch, YARN-1305.6.patch When yarn.resourcemanager.ha.enabled is true, RMHAProtocolService#serviceInit calls HAUtil.setAllRpcAddresses. If the configuration values are null, it just throws IllegalArgumentException. It's messy to analyse which keys are null, so we should handle it and log the name of keys which are null. A current log dump is as follows: {code} 2013-10-15 06:24:53,431 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: registered UNIX signal handlers for [TERM, HUP, INT] 2013-10-15 06:24:54,203 INFO org.apache.hadoop.service.AbstractService: Service RMHAProtocolService failed in state INITED; cause: java.lang.IllegalArgumentException: Property value must not be null java.lang.IllegalArgumentException: Property value must not be null at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) at org.apache.hadoop.conf.Configuration.set(Configuration.java:816) at org.apache.hadoop.conf.Configuration.set(Configuration.java:798) at org.apache.hadoop.yarn.conf.HAUtil.setConfValue(HAUtil.java:100) at org.apache.hadoop.yarn.conf.HAUtil.setAllRpcAddresses(HAUtil.java:105) at org.apache.hadoop.yarn.server.resourcemanager.RMHAProtocolService.serviceInit(RMHAProtocolService.java:60) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:187) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:940) {code} -- This message was sent by Atlassian JIRA (v6.1#6144)