[jira] [Updated] (YARN-4499) Bad config values of "scheduler.maximum-allocation-vcores"
[ https://issues.apache.org/jira/browse/YARN-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tianyin Xu updated YARN-4499: - Description: Currently, the default value of {{yarn.scheduler.maximum-allocation-vcores}} is {{4}}, according to {{YarnConfiguration.java}} {code} public static final String RM_SCHEDULER_MAXIMUM_ALLOCATION_VCORES = YARN_PREFIX + "scheduler.maximum-allocation-vcores"; public static final int DEFAULT_RM_SCHEDULER_MAXIMUM_ALLOCATION_VCORES = 4; {code} However, according to [yarn-default.xml|https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml], this value should be {{32}}. Yes, this seems to be a doc error, but I feel that the default value should be the same as {{yarn.nodemanager.resource.cpu-vcores}} (whose default is {{8}}) ---if we have {{8}} cores available for scheduling, there is little reason to allow a maximum of only {{4}}... Cloudera's article on [Tuning the Cluster for MapReduce v2 (YARN)|http://www.cloudera.com/content/www/en-us/documentation/enterprise/5-3-x/topics/cdh_ig_yarn_tuning.html] also suggests that "the maximum value (of {{yarn.scheduler.maximum-allocation-vcores}}) is usually equal to {{yarn.nodemanager.resource.cpu-vcores}}..." At least, we should fix the doc. The error is pretty bad: a simple search on the Internet shows some people are confused by it, for example, https://community.cloudera.com/t5/Cloudera-Manager-Installation/yarn-nodemanager-resource-cpu-vcores-and-yarn-scheduler-maximum/td-p/31098 (but seriously, I think we should have an automatic default which is equal to the number of cores on the machine...) was: Currently, the default value of {{yarn.scheduler.maximum-allocation-vcores}} is {{4}}, according to {{YarnConfiguration.java}} {code} public static final String RM_SCHEDULER_MAXIMUM_ALLOCATION_VCORES = YARN_PREFIX + "scheduler.maximum-allocation-vcores"; public static final int DEFAULT_RM_SCHEDULER_MAXIMUM_ALLOCATION_VCORES = 4; {code} However, according to [yarn-default.xml|https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml], this value should be {{32}}. Yes, this seems to be a doc error, but I feel that the default value should be the same as {{yarn.nodemanager.resource.cpu-vcores}} (whose default is {{8}}) ---if we have {{8}} cores available for scheduling, there is little reason to allow a maximum of only {{4}}... Cloudera's article on [Tuning the Cluster for MapReduce v2 (YARN)|http://www.cloudera.com/content/www/en-us/documentation/enterprise/5-3-x/topics/cdh_ig_yarn_tuning.html] also suggests that "the maximum value (of {{yarn.scheduler.maximum-allocation-vcores}}) is usually equal to {{yarn.nodemanager.resource.cpu-vcores}}..." The doc error is pretty bad: a simple search on the Internet shows some people are confused by it, for example, https://community.cloudera.com/t5/Cloudera-Manager-Installation/yarn-nodemanager-resource-cpu-vcores-and-yarn-scheduler-maximum/td-p/31098 (but seriously, I think we should have an automatic default which is equal to the number of cores on the machine...)
> Bad config values of "scheduler.maximum-allocation-vcores" > -- > > Key: YARN-4499 > URL: https://issues.apache.org/jira/browse/YARN-4499 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.7.1, 2.6.2 >Reporter: Tianyin Xu > > Currently, the default value of {{yarn.scheduler.maximum-allocation-vcores}} > is {{4}}, according to {{YarnConfiguration.java}} > {code} > public static final String RM_SCHEDULER_MAXIMUM_ALLOCATION_VCORES = > YARN_PREFIX + "scheduler.maximum-allocation-vcores"; > public static final int DEFAULT_RM_SCHEDULER_MAXIMUM_ALLOCATION_VCORES = 4; > {code} > However, according to > [yarn-default.xml|https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml], > this value should be {{32}}. > Yes, this seems to be a doc error, but I feel that the default value should > be the same as {{yarn.nodemanager.resource.cpu-vcores}} (whose default is > {{8}}) ---if we have {{8}} cores available for scheduling, there is little reason > to allow a maximum of only {{4}}... > Cloudera's article on [Tuning the Cluster for MapReduce v2 (YARN) > |http://www.cloudera.com/content/www/en-us/documentation/enterprise/5-3-x/topics/cdh_ig_yarn_tuning.html] > also suggests that "the maximum value (of > {{yarn.scheduler.maximum-allocation-vcores}}) is usually equal to > {{yarn.nodemanager.resource.cpu-vcores}}..." > At least, we should fix the doc. The error is pretty bad: a simple search on > the Internet shows some people are confused by it, for example, > https://community.cloudera.com/t5/Cloudera-Manager-Installation/yarn-nodemanager-resource-cpu-vcores-and-yarn-scheduler-maximum/td-p/31098 > (but seriously, I think we should have an automatic default which is equal to > the number of cores on the machine...) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
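For anyone who wants to verify which default actually wins on a given cluster, the effective value can be read back through the normal configuration lookup. Below is a minimal sketch assuming the standard {{YarnConfiguration}} lookup pattern; it is not the scheduler's own code, just a way to observe the resolved value.
{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class MaxVcoresCheck {
  public static void main(String[] args) {
    // YarnConfiguration loads yarn-default.xml and yarn-site.xml from the classpath.
    // The hard-coded DEFAULT_* constant only applies when neither file defines the
    // property, which is exactly why the two disagreeing defaults are confusing.
    YarnConfiguration conf = new YarnConfiguration();
    int maxVcores = conf.getInt(
        YarnConfiguration.RM_SCHEDULER_MAXIMUM_ALLOCATION_VCORES,
        YarnConfiguration.DEFAULT_RM_SCHEDULER_MAXIMUM_ALLOCATION_VCORES);
    System.out.println("Effective maximum-allocation-vcores: " + maxVcores);
  }
}
{code}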
[jira] [Created] (YARN-4499) Bad config values of "scheduler.maximum-allocation-vcores"
Tianyin Xu created YARN-4499: Summary: Bad config values of "scheduler.maximum-allocation-vcores" Key: YARN-4499 URL: https://issues.apache.org/jira/browse/YARN-4499 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.6.2, 2.7.1 Reporter: Tianyin Xu Currently, the default value of {{yarn.scheduler.maximum-allocation-vcores}} is {{4}}, according to {{YarnConfiguration.java}} {code} public static final String RM_SCHEDULER_MAXIMUM_ALLOCATION_VCORES = YARN_PREFIX + "scheduler.maximum-allocation-vcores"; public static final int DEFAULT_RM_SCHEDULER_MAXIMUM_ALLOCATION_VCORES = 4; {code} However, according to [yarn-default.xml|https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml], this value should be {{32}}. Yes, this seems to be a doc error, but I feel that the default value should be the same as {{yarn.nodemanager.resource.cpu-vcores}} ---if we have {{8}} cores available for scheduling, there is little reason to allow a maximum of only {{4}}... Cloudera's article on [Tuning the Cluster for MapReduce v2 (YARN)|http://www.cloudera.com/content/www/en-us/documentation/enterprise/5-3-x/topics/cdh_ig_yarn_tuning.html] also suggests that "the maximum value (of {{yarn.scheduler.maximum-allocation-vcores}}) is usually equal to {{yarn.nodemanager.resource.cpu-vcores}}..." The doc error is pretty bad: a simple search on the Internet shows some people are confused by it, for example, https://community.cloudera.com/t5/Cloudera-Manager-Installation/yarn-nodemanager-resource-cpu-vcores-and-yarn-scheduler-maximum/td-p/31098 (seriously, I think we should have an automatic default which is equal to the number of cores on the machine...) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2882) Add ExecutionType to denote if a container execution is GUARANTEED or OPPORTUNISTIC
[ https://issues.apache.org/jira/browse/YARN-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069226#comment-15069226 ] Arun Suresh commented on YARN-2882: --- bq. However, I think we still need a flag in ResourceRequest to describe if it's an "opportunistic" container or not, correct? Otherwise RM/LocalRM cannot decide if it can take the risk to allocate a queueable/oversubscribed container. [~leftnoteasy], as [~ka...@cloudera.com] had mentioned, we had a pretty lengthy discussion about this; we finally came to the consensus that we disallow the AM from making that decision. # First, the AM might not be in a position to know whether a request can be handled in a guaranteed or opportunistic manner. Instead, a pluggable policy on the NM can filter some Resource Requests (e.g., Distributed Scheduling can first target non-strict locality requests) and can evolve into truly dynamic policies based on load, etc. # Second, this ensures that delinquent AMs won't game the scheduler by asking only for guaranteed resources. Hope this makes sense. > Add ExecutionType to denote if a container execution is GUARANTEED or > OPPORTUNISTIC > --- > > Key: YARN-2882 > URL: https://issues.apache.org/jira/browse/YARN-2882 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Konstantinos Karanasos >Assignee: Konstantinos Karanasos > Attachments: YARN-2882-yarn-2877.001.patch, > YARN-2882-yarn-2877.002.patch, YARN-2882-yarn-2877.003.patch, > YARN-2882-yarn-2877.004.patch, yarn-2882.patch > > > This JIRA introduces the notion of container types. > We propose two initial types of containers: guaranteed-start and queueable > containers. > Guaranteed-start are the existing containers, which are allocated by the > central RM and are instantaneously started, once allocated. > Queueable is a new type of container, which allows containers to be queued in > the NM, thus their execution may be arbitrarily delayed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
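To make the two container types in this thread concrete, here is an illustrative sketch of the distinction described in the issue; the real API (names, placement, and how the flag is carried in the protocol) is defined by the attached patches, not by this snippet.
{code}
/** Illustrative sketch only; see the YARN-2882 patches for the actual definition. */
public enum ExecutionType {
  GUARANTEED,    // allocated by the central RM and started as soon as it is allocated
  OPPORTUNISTIC  // may be queued at the NM, so its execution can be arbitrarily delayed
}
{code}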
[jira] [Commented] (YARN-3870) Providing raw container request information for fine scheduling
[ https://issues.apache.org/jira/browse/YARN-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069208#comment-15069208 ] Arun Suresh commented on YARN-3870: --- Thank you for starting this discussion, [~grey]. Correct me if I am wrong: what you are proposing, I guess, is some way for the Scheduler to correlate the expanded Resource Requests. I do feel this would be genuinely useful, not only from a scheduling perspective, e.g. for making affinity/anti-affinity scheduling decisions (viz. YARN-1042); it would also greatly help improve preemption decisions in the FairScheduler (viz. YARN-2154). This would be extremely useful for AMs too. Currently the MR AM does the bookkeeping and matches an allocated container to a ResourceRequest; AMs can generally be relieved of this job if an allocated Container Token can easily be matched against a Resource Request. One possible approach could be to have the AMClient generate a unique id for a Resource Request and tag each of the expanded requests (Node, Rack and ANY) with this id. This id can then be passed around in the Container/ContainerTokenIdentifier. [~ka...@cloudera.com], [~vinodkv], [~leftnoteasy], thoughts? > Providing raw container request information for fine scheduling > --- > > Key: YARN-3870 > URL: https://issues.apache.org/jira/browse/YARN-3870 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, applications, capacityscheduler, fairscheduler, > resourcemanager, scheduler, yarn >Reporter: Lei Guo > > Currently, when the AM sends container requests to the RM and scheduler, it expands > individual container requests into host/rack/any format. For instance, if I > am asking for a container request with preference "host1, host2, host3", > assuming all are in the same rack rack1, instead of sending one raw container > request to the RM/Scheduler with the raw preference list, it basically expands it > into 5 different objects with host1, host2, host3, rack1 and any in there. > When the scheduler receives this information, it has basically already lost the raw > request. This is OK for a single container request, but it will cause trouble > when dealing with multiple container requests from the same application. > Consider this case: > 6 hosts, two racks: > rack1 (host1, host2, host3) rack2 (host4, host5, host6) > When the application requests two containers with different data locality > preferences: > c1: host1, host2, host4 > c2: host2, host3, host5 > This will end up with the following container request list when the client sends the > request to the RM/Scheduler: > host1: 1 instance > host2: 2 instances > host3: 1 instance > host4: 1 instance > host5: 1 instance > rack1: 2 instances > rack2: 2 instances > any: 2 instances > Fundamentally, it is hard for the scheduler to make the right judgement without > knowing the raw container request. The situation will get worse when dealing > with affinity and anti-affinity or even gang scheduling etc. > We need some way to provide raw container request information for fine > scheduling purposes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
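The correlation-id idea above can be sketched as follows. All class and field names here are hypothetical placeholders (none of them exist in YARN); the point is only that every expanded node/rack/ANY request carries the id of the raw request it came from, so allocations can be matched back to it.
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

class TaggedRequest {
  final long rawRequestId;    // correlation id generated by the AM client
  final String resourceName;  // host, rack, or "*" (ANY)

  TaggedRequest(long rawRequestId, String resourceName) {
    this.rawRequestId = rawRequestId;
    this.resourceName = resourceName;
  }
}

class RawRequestExpander {
  private static final AtomicLong NEXT_ID = new AtomicLong();

  // Expand one raw request into node/rack/ANY entries that all share one id.
  static List<TaggedRequest> expand(List<String> preferredHosts, String rack) {
    long id = NEXT_ID.incrementAndGet();
    List<TaggedRequest> expanded = new ArrayList<>();
    for (String host : preferredHosts) {
      expanded.add(new TaggedRequest(id, host));
    }
    expanded.add(new TaggedRequest(id, rack));
    expanded.add(new TaggedRequest(id, "*"));
    return expanded;
  }
}
{code}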
[jira] [Commented] (YARN-4438) Implement RM leader election with curator
[ https://issues.apache.org/jira/browse/YARN-4438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069202#comment-15069202 ] Karthik Kambatla commented on YARN-4438: bq. And because ZKRMStateStore is currently in active service, it cannot be simply moved to AlwaysOn service. So, I'd like to do it separately to minimize the core change in this jira. Fine with a separate JIRA. Not sure I understand why ZKRMStateStore needs to be an AlwaysOn service. bq. I'd like to change this part for RM to not refresh the configs if shared storage based config provider is not enabled. I was never a fan of the shared-storage-configuration stuff. Now that we have it, I don't think we can get rid of it until Hadoop 4. How would this change look? The RM has an instance of the elector; every time we transition to active, will either the RM or the elector check if the shared-storage-config-provider is enabled and call refresh? But yeah, I do see the point of calling these methods directly from the RM. bq. To avoid a busy loop and rejoining immediately. If we rejoin immediately, one of the RMs would become Active. It is not like the RM is going to use the cycles for anything else if we sleep. Is the concern that Curator may be biased in picking an RM in certain conditions? bq. What do you mean by force give-up? exit RM? If leaderLatch.close() throws an exception, when does Curator realize the RM is not participating in the election anymore? If not, it might keep electing the same RM active? How do we handle this, and how long of a wait is okay? bq. Even though RM remains at standby, all services should be already shut down, so there's no harm to the end users? Agreed, there is no harm. My concern is about availability: having one of the RMs active "most" of the time. bq. I have one question about ActiveStandbyCheckThread. If we make zkStateStore and the elector share the same zkClient, do we still need the ActiveStandbyCheckThread? The elector itself should get a notification when the connection is lost. Are you referring to the VerifyActiveStatusThread? Even if the RM loses leadership, the connection can be restored. We could actively go and stop the store if it hasn't already stopped. The store would have already gotten fenced, so we don't run the risk of corrupting the store. So, you are right, we might not need that thread. bq. This is currently what EmbeddedElectorService is doing. If the leadership is already lost from zk's perspective, the other RM should take up the leadership. You are right, it isn't a big deal. I just realized EmbeddedElectorService does the same today. I haven't seen Curator's LeaderLatch code. What happens if this RM is subsequently elected leader? Does the transition to Active succeed just fine? Or is it possible it gets stuck in a way that it can't transition to active? If it gets into such a situation, we should consider crashing it altogether. bq. I think leaderLatch could never be null? Seeing all the NPEs we have in RM/Scheduler, I would like for us to err on the side of caution and do null-checks. If not, we at least need to make it consistent everywhere. bq. Why does it need to be called outside of if (state == HAServiceProtocol.HAServiceState.ACTIVE)? This is a fresh start, it does not need to call reinitialize. You are right. Sorry for the noise; clearly it has been a while since I looked at this code.
> Implement RM leader election with curator > - > > Key: YARN-4438 > URL: https://issues.apache.org/jira/browse/YARN-4438 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-4438.1.patch, YARN-4438.2.patch, YARN-4438.3.patch > > > This is to implement the leader election with curator instead of the > ActiveStandbyElector from common package, this also avoids adding more > configs in common to suit RM's own needs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
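For readers unfamiliar with the Curator recipe being discussed, a minimal standalone LeaderLatch sketch looks roughly like the following. It is not the RM's elector code; the ZooKeeper address and latch path are placeholders, and the real implementation also has to handle close() failures, fencing, and the config-refresh question raised above.
{code}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class RmElectorSketch {
  public static void main(String[] args) throws Exception {
    CuratorFramework zk = CuratorFrameworkFactory.newClient(
        "zk1:2181", new ExponentialBackoffRetry(1000, 3));
    zk.start();

    LeaderLatch latch = new LeaderLatch(zk, "/yarn-leader-election/cluster1");
    latch.addListener(new LeaderLatchListener() {
      @Override public void isLeader() {
        // the real RM would transition to Active here
        System.out.println("became leader");
      }
      @Override public void notLeader() {
        // the real RM would transition to Standby here
        System.out.println("lost leadership");
      }
    });
    latch.start();   // joins the election; close() leaves it
    Thread.sleep(Long.MAX_VALUE);
  }
}
{code}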
[jira] [Commented] (YARN-4224) Change the ATSv2 reader side REST interface to conform to current REST APIs' in YARN
[ https://issues.apache.org/jira/browse/YARN-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069190#comment-15069190 ] Varun Saxena commented on YARN-4224: bq. It's about locating resources in the system. /flows will be a query endpoint and we can send query parameters there, but /flows/{uid} will locate one single flow. What I'm confused about right now is, why do we need to have both plural and singular forms /run/{uid} and /runs/{uid}? Will they locate the same run, given the same UID? OK. The way the UID endpoints have been set up in the patch is {{\{resource to query\}/\{uid required to query that resource\}}}. So "run" means a single run to query and "runs" means multiple runs to query. How do we differentiate between the different endpoints then? A UID is not required to locate multiple flows, flowruns, apps and entities, is it? > Change the ATSv2 reader side REST interface to conform to current REST APIs' > in YARN > > > Key: YARN-4224 > URL: https://issues.apache.org/jira/browse/YARN-4224 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > Labels: yarn-2928-1st-milestone > Attachments: YARN-4224-YARN-2928.01.patch, > YARN-4224-feature-YARN-2928.wip.02.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3215) Respect labels in CapacityScheduler when computing headroom
[ https://issues.apache.org/jira/browse/YARN-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R reassigned YARN-3215: --- Assignee: Naganarasimha G R (was: Wangda Tan) > Respect labels in CapacityScheduler when computing headroom > --- > > Key: YARN-3215 > URL: https://issues.apache.org/jira/browse/YARN-3215 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Wangda Tan >Assignee: Naganarasimha G R > > In existing CapacityScheduler, when computing headroom of an application, it > will only consider "non-labeled" nodes of this application. > But it is possible the application is asking for labeled resources, so > headroom-by-label (like 5G resource available under node-label=red) is > required to get better resource allocation and avoid deadlocks such as > MAPREDUCE-5928. > This JIRA could involve both API changes (such as adding a > label-to-available-resource map in AllocateResponse) and also internal > changes in CapacityScheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3215) Respect labels in CapacityScheduler when computing headroom
[ https://issues.apache.org/jira/browse/YARN-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069175#comment-15069175 ] Naganarasimha G R commented on YARN-3215: - Hi [~wangda], as I was discussing with you offline w.r.t. YARN-4225, this JIRA would be important in a multi-tenant scenario, since tenants need to know how much headroom is available to them. So I want to assign this issue to myself. Current behavior: for apps, it gives the headroom of the default partition of a queue, and if the default partition has no nodes configured, then the minimum allocation vcores and MB are given as the headroom. An even more erroneous situation is when the default partition is much larger than a particular other partition; then the headroom sent to the app is larger than what it can actually use, which might lead to hanging. I initially propose to send the app the headroom for all partitions accessible to a queue. Will try to work on a POC patch and share it at the earliest. > Respect labels in CapacityScheduler when computing headroom > --- > > Key: YARN-3215 > URL: https://issues.apache.org/jira/browse/YARN-3215 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Wangda Tan >Assignee: Wangda Tan > > In existing CapacityScheduler, when computing headroom of an application, it > will only consider "non-labeled" nodes of this application. > But it is possible the application is asking for labeled resources, so > headroom-by-label (like 5G resource available under node-label=red) is > required to get better resource allocation and avoid deadlocks such as > MAPREDUCE-5928. > This JIRA could involve both API changes (such as adding a > label-to-available-resource map in AllocateResponse) and also internal > changes in CapacityScheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
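A rough sketch of the per-partition headroom proposal is below. The class and method names are hypothetical; today's AllocateResponse does not carry such a map, which is exactly what this JIRA is about.
{code}
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.yarn.api.records.Resource;

class PartitionHeadroom {
  // partition (node-label) name -> resources still available to this app
  private final Map<String, Resource> headroomByLabel = new HashMap<>();

  void update(String label, Resource available) {
    headroomByLabel.put(label, available);
  }

  Resource headroomFor(String label) {
    Resource headroom = headroomByLabel.get(label);
    // fall back to the default (empty-string) partition if the label is unknown
    return headroom != null ? headroom : headroomByLabel.get("");
  }
}
{code}
Usage would be along the lines of {{update("red", Resource.newInstance(5 * 1024, 5))}} for the "5G resource available under node-label=red" example from the issue description.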
[jira] [Commented] (YARN-4062) Add the flush and compaction functionality via coprocessors and scanners for flow run table
[ https://issues.apache.org/jira/browse/YARN-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069174#comment-15069174 ] Vrushali C commented on YARN-4062: -- I will fix the findbug warning and take a look at the whitespace lines. > Add the flush and compaction functionality via coprocessors and scanners for > flow run table > --- > > Key: YARN-4062 > URL: https://issues.apache.org/jira/browse/YARN-4062 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Vrushali C >Assignee: Vrushali C > Labels: yarn-2928-1st-milestone > Attachments: YARN-4062-YARN-2928.1.patch, > YARN-4062-feature-YARN-2928.01.patch > > > As part of YARN-3901, coprocessor and scanner is being added for storing into > the flow_run table. It also needs a flush & compaction processing in the > coprocessor and perhaps a new scanner to deal with the data during flushing > and compaction stages. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4224) Change the ATSv2 reader side REST interface to conform to current REST APIs' in YARN
[ https://issues.apache.org/jira/browse/YARN-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069172#comment-15069172 ] Varun Saxena commented on YARN-4224: bq. So the UID in /entities/{entitytype}/{uid}/ is actually app UID? This makes the whole endpoint look really weird... I thought it's an entity UID to locate one timeline entity. However, I think you raised a very useful use case to query a certain type of entity for one application. Maybe we'd like to change the format of this endpoint to address this case? I don't really feel like the current form of the endpoint... When I wrote the code, I was assuming the delimiter wouldn't be public, so the UID had to mandatorily come from the server, and we couldn't append the entity type there. As we have reached consensus on making the delimiter public, the UI can actually append the entity type in front of the UID. But this will have to be a special case for the entities endpoint. bq. So to find one entity with cluster, user, flow, flowrun, appid and entity id, we do not have the hierarchical endpoint, but can only get an entity through the UID interface? Do we need the hierarchical interface for CLIs? Not really. In the case of the hierarchical endpoint, flow context information can be supplied as part of optional query parameters. This would preclude the need to query the flow context. But for hierarchical queries we envisage direct queries too, rather than only the flows->flowruns->apps->entities sequence when querying via UID. But now that the delimiter can be public, the UID endpoint can also potentially have a direct query, so it can have a similar structure to the hierarchical query. I think we can discuss all these points in detail in today's call. > Change the ATSv2 reader side REST interface to conform to current REST APIs' > in YARN > > > Key: YARN-4224 > URL: https://issues.apache.org/jira/browse/YARN-4224 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > Labels: yarn-2928-1st-milestone > Attachments: YARN-4224-YARN-2928.01.patch, > YARN-4224-feature-YARN-2928.wip.02.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
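To illustrate the UID discussion, here is a hedged sketch of what client-side assembly and parsing could look like once the delimiter is public. The "!" delimiter and the field order are assumptions made only for illustration; the actual format is whatever the patch defines.
{code}
class TimelineUidSketch {
  private static final String DELIM = "!";  // assumed delimiter, for illustration only

  // e.g. an app-level UID: cluster!user!flow!run!appId
  static String appUid(String cluster, String user, String flow,
                       long runId, String appId) {
    return String.join(DELIM, cluster, user, flow, Long.toString(runId), appId);
  }

  // with a public delimiter, a client could append the entity type itself
  static String entitiesUid(String appUid, String entityType) {
    return appUid + DELIM + entityType;
  }

  static String[] parse(String uid) {
    return uid.split(DELIM);  // the caller decides which fields it expects
  }
}
{code}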
[jira] [Commented] (YARN-3480) Recovery may get very slow with lots of services with lots of app-attempts
[ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069147#comment-15069147 ] Hadoop QA commented on YARN-3480: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 5 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 13s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 34s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 13s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 41s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 17s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 20s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 32s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 38s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 31s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 31s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 33s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 33s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 15s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 41s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 16s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 28s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 68m 29s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 66m 49s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_91. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 23s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 155m 0s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_66 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | JDK v1.7.0_91 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12779182/YARN-3480.13.patch | | JIRA Issue | YARN-3480 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs ch
[jira] [Commented] (YARN-4304) AM max resource configuration per partition to be displayed/updated correctly in UI and in various partition related metrics
[ https://issues.apache.org/jira/browse/YARN-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069142#comment-15069142 ] Hadoop QA commented on YARN-4304: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 3 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 27s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 13s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 34s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 30s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 13s {color} | {color:red} Patch generated 20 new checkstyle issues in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager (total was 252, now 264). {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 37s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s {color} | {color:red} The patch has 4 line(s) with tabs. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 18s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 59m 5s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 60m 23s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_91. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 23s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 137m 6s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_66 Failed junit tests | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | JDK v1.7.0_91 Failed junit tests | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | | hadoop.yarn.server.resourcemanager.TestClientRMTokens | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12779183/0007-YARN-4304.pat
[jira] [Updated] (YARN-4498) Application level node labels stats to be available in UI/REST/CLI
[ https://issues.apache.org/jira/browse/YARN-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-4498: --- Summary: Application level node labels stats to be available in UI/REST/CLI (was: Application level running labels/stats to be available to UI/REST/CLI) > Application level node labels stats to be available in UI/REST/CLI > -- > > Key: YARN-4498 > URL: https://issues.apache.org/jira/browse/YARN-4498 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > > Currently, node-label stats per application are not available through > REST/CLI/UI, for example the labels currently used by all live containers and the total stats of > containers per label for the app. > Will split this into multiple tasks if the approach is fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4498) Application level running labels/stats to be available to UI/REST/CLI
Bibin A Chundatt created YARN-4498: -- Summary: Application level running labels/stats to be available to UI/REST/CLI Key: YARN-4498 URL: https://issues.apache.org/jira/browse/YARN-4498 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Currently, node-label stats per application are not available through REST/CLI/UI, for example the labels currently used by all live containers and the total stats of containers per label for the app. Will split this into multiple tasks if the approach is fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4327) RM can not renew TIMELINE_DELEGATION_TOKEN in secure clusters
[ https://issues.apache.org/jira/browse/YARN-4327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069121#comment-15069121 ] zhangshilong commented on YARN-4327: yeah,I tried yarn.timeline-service.http-authentication.type=kerberos. So jobs could be submitted, but users can not access application history from webapp. > RM can not renew TIMELINE_DELEGATION_TOKEN in secure clusters > -- > > Key: YARN-4327 > URL: https://issues.apache.org/jira/browse/YARN-4327 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, timelineserver >Affects Versions: 2.7.1 > Environment: hadoop 2.7.1hdfs,yarn, mrhistoryserver, ATS all use > kerberos security. > conf like this: > > hadoop.security.authorization > true > Is service-level authorization enabled? > > > hadoop.security.authentication > kerberos > Possible values are simple (no authentication), and kerberos > > >Reporter: zhangshilong > > bin hadoop 2.7.1 > ATS conf like this: > > yarn.timeline-service.http-authentication.type > simple > > > yarn.timeline-service.http-authentication.kerberos.principal > HTTP/_h...@xxx.com > > > yarn.timeline-service.http-authentication.kerberos.keytab > /etc/hadoop/keytabs/xxx.keytab > > > yarn.timeline-service.principal > xxx/_h...@xxx.com > > > yarn.timeline-service.keytab > /etc/hadoop/keytabs/xxx.keytab > > > yarn.timeline-service.best-effort > true > > > yarn.timeline-service.enabled > true > > > I'd like to allow everyone to access ATS from HTTP as RM,HDFS. > client can submit job to RM and add TIMELINE_DELEGATION_TOKEN to AM > Context, but RM can not renew TIMELINE_DELEGATION_TOKEN and make application > to failure. > RM logs: > 2015-11-03 11:58:38,191 WARN > org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer: > Unable to add the application to the delegation token renewer. 
> java.io.IOException: Failed to renew token: Kind: TIMELINE_DELEGATION_TOKEN, > Service: 10.12.38.4:8188, Ident: (owner=yarn-test, renewer=yarn-test, > realUser=, issueDate=1446523118046, maxDate=1447127918046, sequenceNumber=9, > masterKeyId=2) > at > org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:439) > at > org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:78) > at > org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.handleDTRenewerAppSubmitEvent(DelegationTokenRenewer.java:847) > at > org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:828) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: HTTP status [500], message [Null user] > at > org.apache.hadoop.util.HttpExceptionUtils.validateResponse(HttpExceptionUtils.java:169) > at > org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.doDelegationTokenOperation(DelegationTokenAuthenticator.java:287) > at > org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.renewDelegationToken(DelegationTokenAuthenticator.java:212) > at > org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL.renewDelegationToken(DelegationTokenAuthenticatedURL.java:414) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$3.run(TimelineClientImpl.java:396) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$3.run(TimelineClientImpl.java:378) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$5.run(TimelineClientImpl.java:451) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:183) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.operateDelegationToken(TimelineClientImpl.java:466) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.renewDelegationToken(TimelineClientImpl.java:400) > at > org.apache.hadoop.yarn.security.client.TimelineDelegationTokenIdentifier$Renewer.renew(TimelineDelegationTok
[jira] [Commented] (YARN-4109) Exception on RM scheduler page loading with labels
[ https://issues.apache.org/jira/browse/YARN-4109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069113#comment-15069113 ] Hudson commented on YARN-4109: -- FAILURE: Integrated in Hadoop-trunk-Commit #9017 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/9017/]) YARN-4109. Exception on RM scheduler page loading with labels. (Mohammad (rohithsharmaks: rev 8c180a13c82ab9d60f595e6942e35d51024dab53) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/CapacitySchedulerPage.java * hadoop-yarn-project/CHANGES.txt > Exception on RM scheduler page loading with labels > -- > > Key: YARN-4109 > URL: https://issues.apache.org/jira/browse/YARN-4109 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Mohammad Shahid Khan >Priority: Minor > Fix For: 2.9.0 > > Attachments: YARN-4109_1.patch > > > Configure node label and load scheduler Page > On each reload of the page the below exception gets thrown in logs > {code} > 2015-09-03 11:27:08,544 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error > handling URI: /cluster/scheduler > java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) > at > com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263) > at > com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178) > at > com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:139) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) > at > com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) > at > com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) > at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:663) > at > org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:291) > at > org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:615) > at > 
org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:82) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1211) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) > at > org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) > at > org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) > at > org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) >
[jira] [Created] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing
Jun Gong created YARN-4497: -- Summary: RM might fail to restart when recovering apps whose attempts are missing Key: YARN-4497 URL: https://issues.apache.org/jira/browse/YARN-4497 Project: Hadoop YARN Issue Type: Bug Reporter: Jun Gong Assignee: Jun Gong Found the following problem while discussing YARN-3480. If RM fails to store some attempts in RMStateStore, there will be missing attempts in RMStateStore: for the case of storing attempt1, attempt2 and attempt3, RM successfully stored attempt1 and attempt3 but failed to store attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one by one; for this case, we will recover attempt1, then attempt2. When recovering attempt2, we call *((RMAppAttemptImpl)this.currentAttempt).recover(state)*; it will first look up its ApplicationAttemptStateData, but it cannot find it, so an error occurs at *assert attemptState != null* (*RMAppAttemptImpl#recover*, line 880). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
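A hypothetical sketch of the defensive recovery loop implied by this report is below: a missing attempt record is skipped (or otherwise handled) instead of tripping the {{assert attemptState != null}} in {{RMAppAttemptImpl#recover}}. The names and the String stand-in for ApplicationAttemptStateData are illustrative only; the real fix belongs in RMAppImpl/RMAppAttemptImpl.
{code}
import java.util.Map;
import java.util.TreeMap;

class AttemptRecoverySketch {
  static void recover(Map<Integer, String> storedAttempts, int lastAttemptId) {
    for (int id = 1; id <= lastAttemptId; id++) {
      String attemptState = storedAttempts.get(id);
      if (attemptState == null) {
        System.err.println("Attempt " + id + " missing from the state store; skipping");
        continue;  // alternatively, synthesize a FAILED attempt record
      }
      System.out.println("Recovering " + attemptState);
    }
  }

  public static void main(String[] args) {
    Map<Integer, String> store = new TreeMap<>();
    store.put(1, "attempt1");
    store.put(3, "attempt3");  // attempt2 was never persisted
    recover(store, 3);
  }
}
{code}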
[jira] [Commented] (YARN-4492) Add documentation for queue level preemption which is supported in Capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069093#comment-15069093 ] Naganarasimha G R commented on YARN-4492: - Thanks [~templedf] for the review. Hope [~jlowe] / [~wangda] (reviewers of YARN-2056) can take a look at this? > Add documentation for queue level preemption which is supported in Capacity > scheduler > - > > Key: YARN-4492 > URL: https://issues.apache.org/jira/browse/YARN-4492 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R >Priority: Minor > Attachments: CapacityScheduler.html, YARN-4492.v1.001.patch, > YARN-4492.v1.002.patch, YARN-4492.v1.003.patch > > > As part of YARN-2056, Support has been added to disable preemption for a > specific queue. This is a useful feature in a multiload cluster but currently > missing documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4352) Timeout for tests in TestYarnClient, TestAMRMClient and TestNMClient
[ https://issues.apache.org/jira/browse/YARN-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069081#comment-15069081 ] Rohith Sharma K S commented on YARN-4352: - [~sunilg] can you check the test failures? I checked the logs, and it seems the failures are related to this patch. > Timeout for tests in TestYarnClient, TestAMRMClient and TestNMClient > > > Key: YARN-4352 > URL: https://issues.apache.org/jira/browse/YARN-4352 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Junping Du >Assignee: Sunil G > Labels: security > Attachments: 0001-YARN-4352.patch > > > From > https://builds.apache.org/job/PreCommit-YARN-Build/9661/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-client-jdk1.7.0_79.txt, > we can see the tests in TestYarnClient, TestAMRMClient and TestNMClient get > timeout which can be reproduced locally. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4195) Support of node-labels in the ReservationSystem "Plan"
[ https://issues.apache.org/jira/browse/YARN-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069080#comment-15069080 ] Wangda Tan commented on YARN-4195: -- Hi [~curino], Thanks for the responses. The "unified" label is a great idea; in the future we could add more features such as affinity/anti-affinity to it (a dynamic label to say whether an app can assign a container to a node or not). The dimensionality of node labels does not sound like an issue since it has a hard limit ({{|partitions|<=K}}, and YARN-4476). I still have some questions (just rephrasing my previous questions): 1) If we're configuring a scheduler, should we use the individual label (A/B) or the normalized label (A/B/A_B)? 2) If a resource request wants to allocate on "A", is it possible to get resources on "A_B"? If yes, does it mean the scheduler allocates more resources than required? (The app wants 3GB on "A" only, but the scheduler gives it 3GB on "A" and "B".) > Support of node-labels in the ReservationSystem "Plan" > -- > > Key: YARN-4195 > URL: https://issues.apache.org/jira/browse/YARN-4195 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Carlo Curino >Assignee: Carlo Curino > Attachments: YARN-4195.patch > > > As part of YARN-4193 we need to enhance the InMemoryPlan (and related > classes) to track the per-label available resources, as well as the per-label > reservation-allocations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4003) ReservationQueue inherit getAMResourceLimit() from LeafQueue, but behavior is not consistent
[ https://issues.apache.org/jira/browse/YARN-4003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069062#comment-15069062 ] Sunil G commented on YARN-4003: --- Hi [~curino] bq.Assume a reservation queue R1 launches tons of AMs, and now another reservation queue R2 is stuck not being able to run any job Yes, I agree that a sudden spike of usage from one queue can make other queues starve. In this case, I agree with the current solution; it can definitely help to overcome starvation and also adhere to a limit (if not in the worst case). The long term plan sounds interesting. {{RM scheduling bandwidth}} per queue is definitely a good metric here. I assume this metric should always be less than queue capacity (not max-capacity), is that so? {{cost of scheduler bandwidth}} is a metric which was long pending. However, it is also more like a ranking: rank all apps based on their demand (resource request) rate, and aggregate that total cost up to the queue level. RM can make use of this to know which queue has a lower cost and which has a higher one. So in the reservation case, AMs can be restricted more in a queue with a high scheduler-bandwidth cost. Thoughts? > ReservationQueue inherit getAMResourceLimit() from LeafQueue, but behavior is > not consistent > > > Key: YARN-4003 > URL: https://issues.apache.org/jira/browse/YARN-4003 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Carlo Curino >Assignee: Carlo Curino > Attachments: YARN-4003.patch > > > The inherited behavior from LeafQueue (limit AM % based on capacity) is not a > good fit for ReservationQueue (that have highly dynamic capacity). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4304) AM max resource configuration per partition to be displayed/updated correctly in UI and in various partition related metrics
[ https://issues.apache.org/jira/browse/YARN-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-4304: -- Attachment: 0007-YARN-4304.patch Hi [~leftnoteasy] Thank you for sharing the comments. Uploading a new patch addressing them. bq. And I think you can move following code to CapacitySchedulerLeafQueueInfo: I think if we move this to {{CapacitySchedulerLeafQueueInfo}}, we may need to have it calculated for all accessible labels in the queue, so it may need a new list in lqInfo. In this case, is it fine to keep it in {{CapacitySchedulerPage}} itself? Thoughts? As discussed offline, we will pre-compute the AMLimit for all labels in the queue prior to the loop in {{activateApplications}} > AM max resource configuration per partition to be displayed/updated correctly > in UI and in various partition related metrics > > > Key: YARN-4304 > URL: https://issues.apache.org/jira/browse/YARN-4304 > Project: Hadoop YARN > Issue Type: Sub-task > Components: webapp >Affects Versions: 2.7.1 >Reporter: Sunil G >Assignee: Sunil G > Attachments: 0001-YARN-4304.patch, 0002-YARN-4304.patch, > 0003-YARN-4304.patch, 0004-YARN-4304.patch, 0005-YARN-4304.patch, > 0005-YARN-4304.patch, 0006-YARN-4304.patch, 0007-YARN-4304.patch, > REST_and_UI.zip > > > As we are supporting per-partition level max AM resource percentage > configuration, UI and various metrics also need to display correct > configurations related to same. > For eg: Current UI still shows am-resource percentage per queue level. This > is to be updated correctly when label config is used. > - Display max-am-percentage per-partition in Scheduler UI (label also) and in > ClusterMetrics page > - Update queue/partition related metrics w.r.t per-partition > am-resource-percentage -- This message was sent by Atlassian JIRA (v6.3.4#6332)
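As a rough illustration of the "pre-compute AMLimit for all labels prior to the loop" idea mentioned above, here is a minimal sketch. The types and method names ({{Resource}}, {{computeAMLimitForLabel}}, {{activateApplications}}) are placeholders, not the actual CapacityScheduler code; the point is only that the per-label limit is computed once and then looked up inside the loop.
{code:java}
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: precompute the per-label AM limit once,
// instead of recomputing it for every application inside the loop.
class AmLimitSketch {
  static class Resource { long memory; int vcores; }

  Resource computeAMLimitForLabel(String label) {
    // Placeholder for the real per-partition AM limit calculation.
    return new Resource();
  }

  void activateApplications(List<String> pendingApps, Set<String> accessibleLabels) {
    Map<String, Resource> amLimitByLabel = new HashMap<>();
    for (String label : accessibleLabels) {
      amLimitByLabel.put(label, computeAMLimitForLabel(label));
    }
    for (String app : pendingApps) {
      // Look up the cached limit for the app's partition instead of recomputing it.
      Resource limit = amLimitByLabel.get(labelOf(app));
      // ... compare the app's AM resource against 'limit' and activate if it fits ...
    }
  }

  private String labelOf(String app) { return ""; } // placeholder
}
{code}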
[jira] [Commented] (YARN-3480) Recovery may get very slow with lots of services with lots of app-attempts
[ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069045#comment-15069045 ] Jun Gong commented on YARN-3480: Discussed with [~jianhe] offline, we think the implementation(YARN-3480.12.patch) is a bit complex and it’s OK that the number of attempts kept in store is not so accurate. So reattach previous patch(YARN-3480.11.patch, rename it to YARN-3480.13.patch). > Recovery may get very slow with lots of services with lots of app-attempts > -- > > Key: YARN-3480 > URL: https://issues.apache.org/jira/browse/YARN-3480 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3480.01.patch, YARN-3480.02.patch, > YARN-3480.03.patch, YARN-3480.04.patch, YARN-3480.05.patch, > YARN-3480.06.patch, YARN-3480.07.patch, YARN-3480.08.patch, > YARN-3480.09.patch, YARN-3480.10.patch, YARN-3480.11.patch, > YARN-3480.12.patch, YARN-3480.13.patch > > > When RM HA is enabled and running containers are kept across attempts, apps > are more likely to finish successfully with more retries(attempts), so it > will be better to set 'yarn.resourcemanager.am.max-attempts' larger. However > it will make RMStateStore(FileSystem/HDFS/ZK) store more attempts, and make > RM recover process much slower. It might be better to set max attempts to be > stored in RMStateStore. > BTW: When 'attemptFailuresValidityInterval'(introduced in YARN-611) is set to > a small value, retried attempts might be very large. So we need to delete > some attempts stored in RMStateStore and RMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4496) Improve HA ResourceManager Failover detection on the client
[ https://issues.apache.org/jira/browse/YARN-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069043#comment-15069043 ] Arun Suresh commented on YARN-4496: --- Had a discussion with [~subru], and we think this would also be useful for YARN federation (YARN-2915). > Improve HA ResourceManager Failover detection on the client > --- > > Key: YARN-4496 > URL: https://issues.apache.org/jira/browse/YARN-4496 > Project: Hadoop YARN > Issue Type: Improvement > Components: client, resourcemanager >Reporter: Arun Suresh >Assignee: Arun Suresh > > HDFS deployments can currently use the {{RequestHedgingProxyProvider}} to > improve Namenode failover detection in the client. It does this by > concurrently trying all namenodes and picks the namenode that returns the > fastest with a successful response as the active node. > It would be useful to have a similar ProxyProvider for the Yarn RM (it can > possibly be done by converging some of the class hierarchies to use the same > ProxyProvider) > This would especially be useful for large YARN deployments with multiple > standby RMs where clients will be able to pick the active RM without having > to traverse a list of configured RMs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4496) Improve HA ResourceManager Failover detection on the client
Arun Suresh created YARN-4496: - Summary: Improve HA ResourceManager Failover detection on the client Key: YARN-4496 URL: https://issues.apache.org/jira/browse/YARN-4496 Project: Hadoop YARN Issue Type: Improvement Components: client, resourcemanager Reporter: Arun Suresh Assignee: Arun Suresh HDFS deployments can currently use the {{RequestHedgingProxyProvider}} to improve Namenode failover detection in the client. It does this by concurrently trying all namenodes and picking the namenode that returns fastest with a successful response as the active node. It would be useful to have a similar ProxyProvider for the Yarn RM (it can possibly be done by converging some of the class hierarchies to use the same ProxyProvider). This would especially be useful for large YARN deployments with multiple standby RMs, where clients will be able to pick the active RM without having to traverse a list of configured RMs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
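The general request-hedging idea can be sketched as follows: invoke the same call concurrently against all candidate endpoints and return the first successful response, treating failures from standby nodes as non-fatal. This is an illustrative sketch only, not the HDFS {{RequestHedgingProxyProvider}} implementation or the proposed YARN ProxyProvider; all names are hypothetical.
{code:java}
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch of request hedging: submit the same call against all
// candidate endpoints and return the first result that completes successfully.
class HedgingSketch {
  static <T> T firstSuccessful(List<Callable<T>> candidates) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(candidates.size());
    ExecutorCompletionService<T> ecs = new ExecutorCompletionService<>(pool);
    try {
      for (Callable<T> c : candidates) {
        ecs.submit(c);
      }
      Exception last = null;
      for (int i = 0; i < candidates.size(); i++) {
        try {
          return ecs.take().get();   // first call that completes without error wins
        } catch (Exception e) {
          last = e;                  // standby endpoints typically fail; keep waiting
        }
      }
      throw last;                    // all candidates failed
    } finally {
      pool.shutdownNow();
    }
  }
}
{code}
The trade-off behind this pattern is extra load on the standby nodes in exchange for lower failover-detection latency, which matches the motivation given in the description above.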
[jira] [Updated] (YARN-3480) Recovery may get very slow with lots of services with lots of app-attempts
[ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong updated YARN-3480: --- Attachment: YARN-3480.13.patch > Recovery may get very slow with lots of services with lots of app-attempts > -- > > Key: YARN-3480 > URL: https://issues.apache.org/jira/browse/YARN-3480 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3480.01.patch, YARN-3480.02.patch, > YARN-3480.03.patch, YARN-3480.04.patch, YARN-3480.05.patch, > YARN-3480.06.patch, YARN-3480.07.patch, YARN-3480.08.patch, > YARN-3480.09.patch, YARN-3480.10.patch, YARN-3480.11.patch, > YARN-3480.12.patch, YARN-3480.13.patch > > > When RM HA is enabled and running containers are kept across attempts, apps > are more likely to finish successfully with more retries(attempts), so it > will be better to set 'yarn.resourcemanager.am.max-attempts' larger. However > it will make RMStateStore(FileSystem/HDFS/ZK) store more attempts, and make > RM recover process much slower. It might be better to set max attempts to be > stored in RMStateStore. > BTW: When 'attemptFailuresValidityInterval'(introduced in YARN-611) is set to > a small value, retried attempts might be very large. So we need to delete > some attempts stored in RMStateStore and RMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4494) Recover completed apps asynchronously
[ https://issues.apache.org/jira/browse/YARN-4494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069014#comment-15069014 ] Jun Gong commented on YARN-4494: [~sunilg] Yes, I think we all agree with the option of "an on-demand fast recovery". Sorry for the misunderstanding in my last response. > Recover completed apps asynchronously > - > > Key: YARN-4494 > URL: https://issues.apache.org/jira/browse/YARN-4494 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Jun Gong >Assignee: Jun Gong > > With RM HA enabled, when recovering apps, recover completed apps > asynchronously. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4459) container-executor might kill process wrongly
[ https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068996#comment-15068996 ] Jun Gong commented on YARN-4459: Thanks [~Naganarasimha] for the info. Yes, it seems to be the same problem. I think the problem does not only exist for 'DelayedProcessKiller'; it might occur for every call to 'signal_container_as_user'. > container-executor might kill process wrongly > - > > Key: YARN-4459 > URL: https://issues.apache.org/jira/browse/YARN-4459 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-4459.01.patch, YARN-4459.02.patch > > > When calling 'signal_container_as_user' in container-executor, it first > checks whether process group exists, if not, it will kill the process > itself(if it the process exists). It is not reasonable because that the > process group does not exist means corresponding container has finished, if > we kill the process itself, we just kill wrong process. > We found it happened in our cluster many times. We used same account for > starting NM and submitted app, and container-executor sometimes killed NM(the > wrongly killed process might just be a newly started thread and was NM's > child process). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4482) Default values of several config parameters are missing
[ https://issues.apache.org/jira/browse/YARN-4482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068959#comment-15068959 ] Tianyin Xu commented on YARN-4482: -- I got some time to look into this. Yes... the values changed... (that's why the values are commented...) The default of {{yarn.client.failover-max-attempts}} becomes {{-1}}, while the defaults of {{yarn.client.failover-sleep-base-ms}} and {{yarn.client.failover-sleep-max-ms}} are equivalent to the default value of {{yarn.resourcemanager.connect.retry-interval.ms}} (which is {{3}} currently). The code is in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java
{code:title=RMProxy.java|borderStyle=solid}
long rmConnectionRetryIntervalMS =
    conf.getLong(
        YarnConfiguration.RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_MS,
        YarnConfiguration
            .DEFAULT_RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_MS);
// ...
final long failoverSleepBaseMs = conf.getLong(
    YarnConfiguration.CLIENT_FAILOVER_SLEEPTIME_BASE_MS,
    rmConnectionRetryIntervalMS);

final long failoverSleepMaxMs = conf.getLong(
    YarnConfiguration.CLIENT_FAILOVER_SLEEPTIME_MAX_MS,
    rmConnectionRetryIntervalMS);

int maxFailoverAttempts = conf.getInt(
    YarnConfiguration.CLIENT_FAILOVER_MAX_ATTEMPTS, -1);
{code}
The information should be updated in the docs. > Default values of several config parameters are missing > > > Key: YARN-4482 > URL: https://issues.apache.org/jira/browse/YARN-4482 > Project: Hadoop YARN > Issue Type: Improvement > Components: client >Affects Versions: 2.6.2, 2.6.3 >Reporter: Tianyin Xu >Assignee: Mohammad Shahid Khan >Priority: Minor > > In {{yarn-default.xml}}, the default values of the following parameters are > commented out, > {{yarn.client.failover-max-attempts}} > {{yarn.client.failover-sleep-base-ms}} > {{yarn.client.failover-sleep-max-ms}} > Are these default values changed (I suppose so)? If so, we should update the > new ones in {{yarn-default.xml}}. Right now, I don't know the real "default" > values... > (yarn-default.xml) > https://hadoop.apache.org/docs/r2.6.2/hadoop-yarn/hadoop-yarn-common/yarn-default.xml > https://hadoop.apache.org/docs/r2.6.3/hadoop-yarn/hadoop-yarn-common/yarn-default.xml > Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
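A small demonstration of the fallback quoted above: with an otherwise empty configuration, the effective client failover sleep values inherit the RM connect retry interval, and failover-max-attempts defaults to {{-1}}. This is just a usage sketch built on the constants shown in the RMProxy snippet; it is not part of any patch.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Prints the effective client failover settings when none are set explicitly:
// the sleep values fall back to the RM connect retry interval, and
// failover-max-attempts falls back to -1, as described in the comment above.
public class FailoverDefaultsDemo {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    long retryIntervalMs = conf.getLong(
        YarnConfiguration.RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_MS,
        YarnConfiguration.DEFAULT_RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_MS);
    long failoverSleepBaseMs = conf.getLong(
        YarnConfiguration.CLIENT_FAILOVER_SLEEPTIME_BASE_MS, retryIntervalMs);
    long failoverSleepMaxMs = conf.getLong(
        YarnConfiguration.CLIENT_FAILOVER_SLEEPTIME_MAX_MS, retryIntervalMs);
    int failoverMaxAttempts = conf.getInt(
        YarnConfiguration.CLIENT_FAILOVER_MAX_ATTEMPTS, -1);
    System.out.println("failover-sleep-base-ms=" + failoverSleepBaseMs
        + " failover-sleep-max-ms=" + failoverSleepMaxMs
        + " failover-max-attempts=" + failoverMaxAttempts);
  }
}
{code}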
[jira] [Commented] (YARN-4224) Change the ATSv2 reader side REST interface to conform to current REST APIs' in YARN
[ https://issues.apache.org/jira/browse/YARN-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068950#comment-15068950 ] Li Lu commented on YARN-4224: - Thanks Varun. Some of my comments: bq. You mean just have one endpoint and based on delimiters in UID, decide whether to fetch single entity or multiple ? It's about locating resources in the system. /flows will be a query endpoint and we can send query parameters there, but /flows/\{uid\} will locate one single flow. What I'm confused about right now is why we need both plural and singular forms /run/\{uid\} and /runs/\{uid\}. Will they resolve to the same run, given the same UID? bq. Because, the query before query for entities, in which we return UID, will be query for apps. This query is from application table where we do not have entity related information. So I cannot send all the possible entity types for an entity or a list of entities in the response. Hence when we query list of entities, it is within the scope of app UID and entity type. Hence entity type has to be specified. bq. Yes, if we try to query all entity types and related entities, this would require scanning quite a bit of the entity table which can grow quite big. And from UI, I envisage only queries for APP_ATTEMPT and CONTAINER so we would know the entity type. So the UID in /entities/\{entitytype\}/\{uid\}/ is actually an app UID? This makes the whole endpoint look really weird... I thought it was an entity UID locating one timeline entity. However, I think you raised a very useful use case: querying a certain type of entity for one application. Maybe we'd like to change the format of this endpoint to address this case? I don't really feel like the current form of the endpoint... bq. Moreover, runs endpoint will do it for you i.e. fetch all flowruns for a flow. OK, this works for now. In the future, if flows are associated with flow level aggregation data, we will need endpoints to retrieve flow level data. We can skip this step for our first milestone though. bq. There is. The \/entity\/{uid}\/ endpoint. I hope this is what your question was. So to find one entity with cluster, user, flow, flowrun, appid and entity id, we do not have the hierarchical endpoint, but can only get an entity through the UID interface? Do we need the hierarchical interface for CLIs? We can certainly discuss more of this in our meeting tomorrow. > Change the ATSv2 reader side REST interface to conform to current REST APIs' > in YARN > > > Key: YARN-4224 > URL: https://issues.apache.org/jira/browse/YARN-4224 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > Labels: yarn-2928-1st-milestone > Attachments: YARN-4224-YARN-2928.01.patch, > YARN-4224-feature-YARN-2928.wip.02.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
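To make the addressing discussion above easier to follow, here is a purely hypothetical JAX-RS sketch contrasting the two styles being debated: a query endpoint plus a UID-based locator for a single flow, and an entity listing scoped by an application UID and an entity type. None of the paths, class names, or return types below are taken from the actual ATSv2 reader; they are illustrative placeholders only.
{code:java}
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;

// Hypothetical sketch of the two addressing styles discussed above.
@Path("/ws/v2/timeline")
public class TimelineReaderEndpointsSketch {

  // Query endpoint: /flows with query parameters returns a set of flows (with UIDs).
  @GET
  @Path("/flows")
  public String getFlows() {
    return "[]";
  }

  // Locator endpoint: one flow addressed directly by its UID.
  @GET
  @Path("/flow/{uid}")
  public String getFlow(@PathParam("uid") String uid) {
    return "{}";
  }

  // Entities of a given type within the scope of an application UID,
  // mirroring the /entities/{entitytype}/{uid} form questioned above.
  @GET
  @Path("/entities/{entitytype}/{appuid}")
  public String getEntities(@PathParam("entitytype") String entityType,
                            @PathParam("appuid") String appUid) {
    return "[]";
  }
}
{code}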
[jira] [Commented] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list
[ https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068940#comment-15068940 ] Daniel Templeton commented on YARN-4311: Oh, and I'm a big proponent of using blank lines to make the code easier to read. > Removing nodes from include and exclude lists will not remove them from > decommissioned nodes list > - > > Key: YARN-4311 > URL: https://issues.apache.org/jira/browse/YARN-4311 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.1 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: YARN-4311-v1.patch, YARN-4311-v2.patch, > YARN-4311-v3.patch > > > In order to fully forget about a node, removing the node from include and > exclude list is not sufficient. The RM lists it under Decomm-ed nodes. The > tricky part that [~jlowe] pointed out was the case when include lists are not > used, in that case we don't want the nodes to fall off if they are not active. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list
[ https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068939#comment-15068939 ] Daniel Templeton commented on YARN-4311: In the constructor, you check if the inactive nodes list entries are null before using them. Is that necessary? Further down in {{refreshNodes()}} you don't. Same thing in {{refreshNodesGracefully()}}. Inside the new for loop in {{refreshNodes()}} you should combine the nested ifs into a single conditional. The multiple ifs and returns in {{isUntrackedNode()}} should be combined:
{code}
public boolean isUntrackedNode(String hostName) {
  boolean untracked = false;
  String ip = resolver.resolve(hostName);

  includesFile = conf.get(YarnConfiguration.RM_NODES_INCLUDE_FILE_PATH,
      YarnConfiguration.DEFAULT_RM_NODES_INCLUDE_FILE_PATH);

  synchronized (hostsReader) {
    Set<String> hostsList = hostsReader.getHosts();
    Set<String> excludeList = hostsReader.getExcludedHosts();

    untracked = !hostsList.isEmpty() && !hostsList.contains(hostName)
        && !hostsList.contains(ip) && !excludeList.contains(hostName)
        && !excludeList.contains(ip);
  }

  return untracked;
}
{code}
Do note that with this solution, if a user does a node refresh at least once per node removal check interval, no nodes will ever be expunged because the timestamp will continually be updated and never exceed the interval. > Removing nodes from include and exclude lists will not remove them from > decommissioned nodes list > - > > Key: YARN-4311 > URL: https://issues.apache.org/jira/browse/YARN-4311 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.1 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: YARN-4311-v1.patch, YARN-4311-v2.patch, > YARN-4311-v3.patch > > > In order to fully forget about a node, removing the node from include and > exclude list is not sufficient. The RM lists it under Decomm-ed nodes. The > tricky part that [~jlowe] pointed out was the case when include lists are not > used, in that case we don't want the nodes to fall off if they are not active. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068934#comment-15068934 ] Jian He commented on YARN-4138: --- Looks like some files are created in a wrong directory, "hadoop-yarn/", instead of hadoop-yarn-project > Roll back container resource allocation after resource increase token expires > - > > Key: YARN-4138 > URL: https://issues.apache.org/jira/browse/YARN-4138 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Reporter: MENG DING >Assignee: MENG DING > Attachments: YARN-4138-YARN-1197.1.patch, > YARN-4138-YARN-1197.2.patch, YARN-4138.3.patch > > > In YARN-1651, after container resource increase token expires, the running > container is killed. > This ticket will change the behavior such that when a container resource > increase token expires, the resource allocation of the container will be > reverted back to the value before the increase. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2885) Create AMRMProxy request interceptor for distributed scheduling decisions for queueable containers
[ https://issues.apache.org/jira/browse/YARN-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun Suresh updated YARN-2885: -- Attachment: YARN-2885-yarn-2877.v4.patch Rebasing patch against latest YARN-2882 and YARN-4335 patches > Create AMRMProxy request interceptor for distributed scheduling decisions for > queueable containers > -- > > Key: YARN-2885 > URL: https://issues.apache.org/jira/browse/YARN-2885 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Konstantinos Karanasos >Assignee: Arun Suresh > Attachments: YARN-2885-yarn-2877.001.patch, > YARN-2885-yarn-2877.002.patch, YARN-2885-yarn-2877.full-2.patch, > YARN-2885-yarn-2877.full-3.patch, YARN-2885-yarn-2877.full.patch, > YARN-2885-yarn-2877.v4.patch, YARN-2885_api_changes.patch > > > We propose to add a Local ResourceManager (LocalRM) to the NM in order to > support distributed scheduling decisions. > Architecturally we leverage the RMProxy, introduced in YARN-2884. > The LocalRM makes distributed decisions for queuable containers requests. > Guaranteed-start requests are still handled by the central RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4397) if this addAll() function`s params is fault? @NodeListManager#getUnusableNodes()
[ https://issues.apache.org/jira/browse/YARN-4397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-4397: -- Fix Version/s: (was: 2.8.0) > if this addAll() function`s params is fault? > @NodeListManager#getUnusableNodes() > > > Key: YARN-4397 > URL: https://issues.apache.org/jira/browse/YARN-4397 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.6.0 >Reporter: Feng Yuan > > code in NodeListManager#144L: > /** >* Provides the currently unusable nodes. Copies it into provided > collection. >* @param unUsableNodes >* Collection to which the unusable nodes are added >* @return number of unusable nodes added >*/ > public int getUnusableNodes(Collection unUsableNodes) { > unUsableNodes.addAll(unusableRMNodesConcurrentSet); > return unusableRMNodesConcurrentSet.size(); > } > unUsableNodes and unusableRMNodesConcurrentSet's sequence is wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
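For reference, a minimal self-contained sketch of how the method quoted above is meant to be used: the caller supplies a collection, {{addAll}} copies the internal set of unusable nodes into it, and the size of the internal set is returned. The node type is simplified to {{String}} here purely for illustration; only the copy direction is the point.
{code:java}
import java.util.ArrayList;
import java.util.Collection;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Stand-alone sketch: the caller-provided collection receives a copy of the
// internal concurrent set, and the method returns the set's current size.
class NodeListSketch {
  private final Set<String> unusableRMNodesConcurrentSet =
      ConcurrentHashMap.newKeySet();

  public int getUnusableNodes(Collection<String> unUsableNodes) {
    unUsableNodes.addAll(unusableRMNodesConcurrentSet);
    return unusableRMNodesConcurrentSet.size();
  }

  public static void main(String[] args) {
    NodeListSketch manager = new NodeListSketch();
    manager.unusableRMNodesConcurrentSet.add("node1:8041");

    Collection<String> sink = new ArrayList<>();
    int count = manager.getUnusableNodes(sink);   // sink now contains "node1:8041"
    System.out.println(count + " " + sink);
  }
}
{code}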
[jira] [Commented] (YARN-4062) Add the flush and compaction functionality via coprocessors and scanners for flow run table
[ https://issues.apache.org/jira/browse/YARN-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068650#comment-15068650 ] Hadoop QA commented on YARN-4062: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 3 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 5s {color} | {color:green} feature-YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 16s {color} | {color:green} feature-YARN-2928 passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 19s {color} | {color:green} feature-YARN-2928 passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 10s {color} | {color:green} feature-YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 28s {color} | {color:green} feature-YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 17s {color} | {color:green} feature-YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 36s {color} | {color:green} feature-YARN-2928 passed {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 15s {color} | {color:red} hadoop-yarn-server-timelineservice in feature-YARN-2928 failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s {color} | {color:green} feature-YARN-2928 passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 22s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 19s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 1m 52s {color} | {color:red} hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-timelineservice-jdk1.8.0_66 with JDK v1.8.0_66 generated 1 new issues (was 0, now 1). {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 19s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 20s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 2m 12s {color} | {color:red} hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-timelineservice-jdk1.7.0_91 with JDK v1.7.0_91 generated 1 new issues (was 0, now 1). 
{color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 20s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 10s {color} | {color:red} Patch generated 21 new checkstyle issues in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice (total was 45, now 65). {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 27s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 16s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s {color} | {color:red} The patch has 12 line(s) that end in whitespace. Use git apply --whitespace=fix. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 0m 44s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice introduced 1 new FindBugs issues. {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 15s {color} | {color:red} hadoop-yarn-server-timelineservice in the patch failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 2m 1s {color} | {color:red} hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-timelineservice-jdk1.7.0_91 with JDK v1.7.0_91 generated 5 new issues (was 0, now 5). {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} unit
[jira] [Commented] (YARN-3480) Recovery may get very slow with lots of services with lots of app-attempts
[ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068589#comment-15068589 ] Hadoop QA commented on YARN-3480: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 5 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 46s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 31s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 13s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 16s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 12s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 34s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 30s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 13s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 19s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 63m 45s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 64m 53s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_91. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 26s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 146m 38s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_66 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | JDK v1.7.0_91 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12779088/YARN-3480.12.patch | | JIRA Issue | YARN-3480 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs c
[jira] [Updated] (YARN-4062) Add the flush and compaction functionality via coprocessors and scanners for flow run table
[ https://issues.apache.org/jira/browse/YARN-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vrushali C updated YARN-4062: - Attachment: YARN-4062-feature-YARN-2928.01.patch Thanks [~sjlee0], renamed the patch now. > Add the flush and compaction functionality via coprocessors and scanners for > flow run table > --- > > Key: YARN-4062 > URL: https://issues.apache.org/jira/browse/YARN-4062 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Vrushali C >Assignee: Vrushali C > Labels: yarn-2928-1st-milestone > Attachments: YARN-4062-YARN-2928.1.patch, > YARN-4062-feature-YARN-2928.01.patch > > > As part of YARN-3901, coprocessor and scanner is being added for storing into > the flow_run table. It also needs a flush & compaction processing in the > coprocessor and perhaps a new scanner to deal with the data during flushing > and compaction stages. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4098) Document ApplicationPriority feature
[ https://issues.apache.org/jira/browse/YARN-4098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068535#comment-15068535 ] Sunil G commented on YARN-4098: --- Hi [~rohithsharma] Thank you for sharing the patch. Some minor nits. CS Page: 1. {noformat} Higher the integer higher the priority of an applications. Note : Application priority is supported only for FIFO ordering policy. {noformat} Could we rephrase this as {{Higher integer value indicates higher priority for the application. Currently Application priority is supported only for FIFO ordering policy.}} 2. {{Application priority works along with FIFO ordering policy only}} can be {{Application priority works *only* along with FIFO ordering policy}} 3. {{Default priority for an applications can be at cluster level and queue level.}} It's better to phrase it as {{an application}}. 4. Missing */* in {{etc/hadoop/capacity-scheduler.xml}} 5. Typo in {{yarn.scheduler.capacity.root..default-application-priority}}. It can be {{}} > Document ApplicationPriority feature > > > Key: YARN-4098 > URL: https://issues.apache.org/jira/browse/YARN-4098 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Rohith Sharma K S >Assignee: Rohith Sharma K S > Attachments: 0001-YARN-4098.patch, 0001-YARN-4098.patch, YARN-4098.rar > > > This JIRA is to track documentation of application priority and its user, > admin and REST interfaces. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4352) Timeout for tests in TestYarnClient, TestAMRMClient and TestNMClient
[ https://issues.apache.org/jira/browse/YARN-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068509#comment-15068509 ] Hadoop QA commented on YARN-4352: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 4s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 46s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 9m 25s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 16s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 6s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 55s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 58s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 7s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 41s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 9m 13s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 9m 13s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 10m 2s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 10m 2s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 16s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 4s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 2s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 56s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 6s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 7m 1s {color} | {color:red} hadoop-common in the patch failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 7m 13s {color} | {color:red} hadoop-common in the patch failed with JDK v1.7.0_91. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 23s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 74m 12s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_66 Failed junit tests | hadoop.net.TestNetUtils | | JDK v1.7.0_91 Failed junit tests | hadoop.security.ssl.TestReloadingX509TrustManager | | | hadoop.net.TestNetUtils | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12779090/0001-YARN-4352.patch | | JIRA Issue | YARN-4352 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux dc18371b873e 3.13.0-36-l
[jira] [Commented] (YARN-4492) Add documentation for queue level preemption which is supported in Capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068503#comment-15068503 ] Daniel Templeton commented on YARN-4492: +1 (non-binding). Thanks, [~Naganarasimha]! > Add documentation for queue level preemption which is supported in Capacity > scheduler > - > > Key: YARN-4492 > URL: https://issues.apache.org/jira/browse/YARN-4492 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R >Priority: Minor > Attachments: CapacityScheduler.html, YARN-4492.v1.001.patch, > YARN-4492.v1.002.patch, YARN-4492.v1.003.patch > > > As part of YARN-2056, Support has been added to disable preemption for a > specific queue. This is a useful feature in a multiload cluster but currently > missing documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4352) Timeout for tests in TestYarnClient, TestAMRMClient and TestNMClient
[ https://issues.apache.org/jira/browse/YARN-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068493#comment-15068493 ] Rohith Sharma K S commented on YARN-4352: - Tx Sunil for providing the patch. The 3rd approach looks fine to me. I ran the tests before and after applying the patch on my laptop (Ubuntu installed), and the test cases are passing. But I am not sure why these test cases have been failing for the last couple of months!!! Maybe the Jenkins machines have changed. +1 lgtm, I will commit it tomorrow if there are no objections > Timeout for tests in TestYarnClient, TestAMRMClient and TestNMClient > > > Key: YARN-4352 > URL: https://issues.apache.org/jira/browse/YARN-4352 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Junping Du >Assignee: Sunil G > Labels: security > Attachments: 0001-YARN-4352.patch > > > From > https://builds.apache.org/job/PreCommit-YARN-Build/9661/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-client-jdk1.7.0_79.txt, > we can see the tests in TestYarnClient, TestAMRMClient and TestNMClient get > timeout which can be reproduced locally. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3816) [Aggregation] App-level aggregation and accumulation for YARN system metrics
[ https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068490#comment-15068490 ] Sangjin Lee commented on YARN-3816: --- Just to clarify, it appears both aggregated (but not accumulated over time) metrics and the aggregated *and* accumulated metrics (\*-AREA metrics) end up in these separate entities. While it might be fine for the \*-AREA metrics to be in the separate entities, I think it would be better for the regularly aggregated metrics to be in the application. > [Aggregation] App-level aggregation and accumulation for YARN system metrics > > > Key: YARN-3816 > URL: https://issues.apache.org/jira/browse/YARN-3816 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du > Labels: yarn-2928-1st-milestone > Attachments: Application Level Aggregation of Timeline Data.pdf, > YARN-3816-YARN-2928-v1.patch, YARN-3816-YARN-2928-v2.1.patch, > YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch, > YARN-3816-YARN-2928-v2.patch, YARN-3816-YARN-2928-v3.1.patch, > YARN-3816-YARN-2928-v3.patch, YARN-3816-YARN-2928-v4.patch, > YARN-3816-feature-YARN-2928.v4.1.patch, YARN-3816-poc-v1.patch, > YARN-3816-poc-v2.patch > > > We need application level aggregation of Timeline data: > - To present end user aggregated states for each application, include: > resource (CPU, Memory) consumption across all containers, number of > containers launched/completed/failed, etc. We need this for apps while they > are running as well as when they are done. > - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be > aggregated to show details of states in framework level. > - Other level (Flow/User/Queue) aggregation can be more efficient to be based > on Application-level aggregations rather than raw entity-level data as much > less raws need to scan (with filter out non-aggregated entities, like: > events, configurations, etc.). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3367) Replace starting a separate thread for post entity with event loop in TimelineClient
[ https://issues.apache.org/jira/browse/YARN-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068468#comment-15068468 ] Naganarasimha G R commented on YARN-3367: - Some checkstyle issues cannot be handled (line lengths > 80); some of the test case and findbugs issues are valid and can be fixed along with review comments on the approach. > Replace starting a separate thread for post entity with event loop in > TimelineClient > > > Key: YARN-3367 > URL: https://issues.apache.org/jira/browse/YARN-3367 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Junping Du >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > Attachments: YARN-3367-feature-YARN-2928.003.patch, > YARN-3367-feature-YARN-2928.v1.002.patch, > YARN-3367-feature-YARN-2928.v1.004.patch, YARN-3367.YARN-2928.001.patch > > > Since YARN-3039, we add loop in TimelineClient to wait for > collectorServiceAddress ready before posting any entity. In consumer of > TimelineClient (like AM), we are starting a new thread for each call to get > rid of potential deadlock in main thread. This way has at least 3 major > defects: > 1. The consumer need some additional code to wrap a thread before calling > putEntities() in TimelineClient. > 2. It cost many thread resources which is unnecessary. > 3. The sequence of events could be out of order because each posting > operation thread get out of waiting loop randomly. > We should have something like event loop in TimelineClient side, > putEntities() only put related entities into a queue of entities and a > separated thread handle to deliver entities in queue to collector via REST > call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
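The event-loop idea from the description above can be sketched roughly as follows: {{putEntities()}} only enqueues, and a single dispatcher thread drains the queue in order, so callers never block and event ordering is preserved. This is an illustrative sketch under those assumptions, not the actual TimelineClient code; entities are simplified to strings and the REST delivery is stubbed out.
{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of a single-threaded dispatch loop: producers enqueue, one thread delivers.
class TimelineDispatchSketch {
  private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
  private volatile boolean stopped = false;

  private final Thread dispatcher = new Thread(() -> {
    try {
      while (!stopped) {
        String entity = queue.take();      // FIFO order is preserved
        deliverToCollector(entity);        // e.g. a REST PUT, retried as needed
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();  // exit cleanly on stop()
    }
  }, "timeline-dispatcher");

  void start() { dispatcher.start(); }

  void putEntities(String entity) {
    queue.add(entity);                     // non-blocking for the caller
  }

  void stop() {
    stopped = true;
    dispatcher.interrupt();
  }

  private void deliverToCollector(String entity) {
    System.out.println("delivered " + entity);
  }
}
{code}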
[jira] [Commented] (YARN-4062) Add the flush and compaction functionality via coprocessors and scanners for flow run table
[ https://issues.apache.org/jira/browse/YARN-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068450#comment-15068450 ] Sangjin Lee commented on YARN-4062: --- You'll need to use a name like {{YARN-4062-feature-YARN-2928.01.patch}}. > Add the flush and compaction functionality via coprocessors and scanners for > flow run table > --- > > Key: YARN-4062 > URL: https://issues.apache.org/jira/browse/YARN-4062 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Vrushali C >Assignee: Vrushali C > Labels: yarn-2928-1st-milestone > Attachments: YARN-4062-YARN-2928.1.patch > > > As part of YARN-3901, coprocessor and scanner is being added for storing into > the flow_run table. It also needs a flush & compaction processing in the > coprocessor and perhaps a new scanner to deal with the data during flushing > and compaction stages. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4352) Timeout for tests in TestYarnClient, TestAMRMClient and TestNMClient
[ https://issues.apache.org/jira/browse/YARN-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-4352: - Labels: security (was: ) > Timeout for tests in TestYarnClient, TestAMRMClient and TestNMClient > > > Key: YARN-4352 > URL: https://issues.apache.org/jira/browse/YARN-4352 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Junping Du >Assignee: Sunil G > Labels: security > Attachments: 0001-YARN-4352.patch > > > From > https://builds.apache.org/job/PreCommit-YARN-Build/9661/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-client-jdk1.7.0_79.txt, > we can see the tests in TestYarnClient, TestAMRMClient and TestNMClient get > timeout which can be reproduced locally. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068420#comment-15068420 ] Junping Du commented on YARN-3223: -- Hi [~brookz], thanks for updating the patch. The current approach sounds OK to me. The only issue here is that there is a time window between completedContainer() and the RMNodeResourceUpdateEvent getting handled. So if a scheduling effort happens within this window, a new container could still get allocated on this node. An even worse case is if a scheduling effort happens after the RMNodeResourceUpdateEvent is sent out but before it propagates to SchedulerNode; then you will find the total resource is lower than the used resource, and the available resource is a negative value. IMO, a safer way is: besides your existing RMNodeResourceUpdateEvent update, in completedContainer() for decommissioning nodes we can hold off adding back availableResource in SchedulerNode, but continue to deduct usedResource. At this moment, SchedulerNode's total resource will be lower than usedResource + availableResource, but it will soon be corrected after the RMNodeResourceUpdateEvent comes. How does this sound? > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful, nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, > YARN-3223-v2.patch, YARN-3223-v3.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
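A rough sketch of the idea proposed above, with all names illustrative and resources simplified to a single memory dimension: on container completion for a decommissioning node, the used resource is deducted but the capacity is not added back to the available pool, and the later resource-update event reconciles the totals. This is not the actual SchedulerNode code.
{code:java}
// Illustrative accounting only; the real SchedulerNode tracks multi-dimensional
// Resource objects and is driven by scheduler events.
class SchedulerNodeSketch {
  long totalMB;
  long usedMB;
  long availableMB;
  boolean decommissioning;

  void completedContainer(long containerMB) {
    usedMB -= containerMB;
    if (!decommissioning) {
      availableMB += containerMB;   // normal path: capacity returns to the pool
    }
    // For a decommissioning node the capacity is withheld here, so the node's
    // accounting is temporarily inconsistent until the resource-update event
    // adjusts the total below.
  }

  void onResourceUpdate(long newTotalMB) {
    totalMB = newTotalMB;
    availableMB = Math.max(0, totalMB - usedMB);   // reconcile after the event
  }
}
{code}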
[jira] [Commented] (YARN-3542) Re-factor support for CPU as a resource using the new ResourceHandler mechanism
[ https://issues.apache.org/jira/browse/YARN-3542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068403#comment-15068403 ] Hadoop QA commented on YARN-3542: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 3 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 25s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 0s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 15s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 29s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 0s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 26s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 24s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 58s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 15s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 56s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 5s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 8m 17s {color} | {color:red} hadoop-yarn-project_hadoop-yarn-jdk1.8.0_66 with JDK v1.8.0_66 generated 1 new issues (was 9, now 10). {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 5s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 18s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 10m 35s {color} | {color:red} hadoop-yarn-project_hadoop-yarn-jdk1.7.0_91 with JDK v1.7.0_91 generated 2 new issues (was 10, now 12). {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 18s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 29s {color} | {color:red} Patch generated 4 new checkstyle issues in hadoop-yarn-project/hadoop-yarn (total was 269, now 265). 
{color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 0s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 44s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 1s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 15s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 24s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 21s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 36s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_91. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 37s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_91. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m
[jira] [Updated] (YARN-4352) Timeout for tests in TestYarnClient, TestAMRMClient and TestNMClient
[ https://issues.apache.org/jira/browse/YARN-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-4352: -- Attachment: 0001-YARN-4352.patch Attaching a simpler patch to handle the lookup when multiple loopback addresses are present in {{/etc/hosts}}. [~djp]/[~rohithsharma]/[~ozawa], could you please check? > Timeout for tests in TestYarnClient, TestAMRMClient and TestNMClient > > > Key: YARN-4352 > URL: https://issues.apache.org/jira/browse/YARN-4352 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Junping Du >Assignee: Sunil G > Attachments: 0001-YARN-4352.patch > > > From > https://builds.apache.org/job/PreCommit-YARN-Build/9661/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-client-jdk1.7.0_79.txt, > we can see the tests in TestYarnClient, TestAMRMClient and TestNMClient get > timeout which can be reproduced locally. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
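For context, a small self-contained sketch (not the attached patch) of the kind of lookup involved: enumerating every loopback address so that test setup can resolve a deterministic one even when {{/etc/hosts}} maps several loopback entries.
{code}
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

/** Hypothetical helper, not the attached patch. */
final class LoopbackAddresses {
  static List<InetAddress> all() throws SocketException {
    List<InetAddress> result = new ArrayList<>();
    Enumeration<NetworkInterface> ifaces = NetworkInterface.getNetworkInterfaces();
    while (ifaces.hasMoreElements()) {
      NetworkInterface nif = ifaces.nextElement();
      if (nif.isLoopback()) {
        // Collect every address bound to the loopback interface(s); the
        // caller can then pick one explicitly instead of relying on
        // whichever /etc/hosts entry the resolver happens to return first.
        result.addAll(Collections.list(nif.getInetAddresses()));
      }
    }
    return result;
  }
}
{code}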
[jira] [Commented] (YARN-3480) Recovery may get very slow with lots of services with lots of app-attempts
[ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068394#comment-15068394 ] Jun Gong commented on YARN-3480: Attached a new patch that moves the attempt-removal logic into RMStateStore, so it can handle both cases: failing to store attempts and failing to remove attempts. > Recovery may get very slow with lots of services with lots of app-attempts > -- > > Key: YARN-3480 > URL: https://issues.apache.org/jira/browse/YARN-3480 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3480.01.patch, YARN-3480.02.patch, > YARN-3480.03.patch, YARN-3480.04.patch, YARN-3480.05.patch, > YARN-3480.06.patch, YARN-3480.07.patch, YARN-3480.08.patch, > YARN-3480.09.patch, YARN-3480.10.patch, YARN-3480.11.patch, YARN-3480.12.patch > > > When RM HA is enabled and running containers are kept across attempts, apps > are more likely to finish successfully with more retries(attempts), so it > will be better to set 'yarn.resourcemanager.am.max-attempts' larger. However > it will make RMStateStore(FileSystem/HDFS/ZK) store more attempts, and make > RM recover process much slower. It might be better to set max attempts to be > stored in RMStateStore. > BTW: When 'attemptFailuresValidityInterval'(introduced in YARN-611) is set to > a small value, retried attempts might be very large. So we need to delete > some attempts stored in RMStateStore and RMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
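A sketch of the capping/cleanup idea under discussion, with made-up names (BoundedAttemptStore, removeAttemptFromStore(), the configurable cap); the real RMStateStore API differs.
{code}
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch only; names and structure do not match the real RMStateStore. */
class BoundedAttemptStore {
  private final int maxAttemptsToKeep;                 // e.g. from a new config key
  private final Deque<String> storedAttemptIds = new ArrayDeque<>();

  BoundedAttemptStore(int maxAttemptsToKeep) {
    this.maxAttemptsToKeep = maxAttemptsToKeep;
  }

  synchronized void attemptStored(String attemptId) {
    storedAttemptIds.addLast(attemptId);
    // Trim the oldest attempts so recovery only has to replay a bounded number.
    while (storedAttemptIds.size() > maxAttemptsToKeep) {
      String oldest = storedAttemptIds.peekFirst();
      try {
        removeAttemptFromStore(oldest);                // ZK/FS/HDFS delete
        storedAttemptIds.pollFirst();
      } catch (Exception e) {
        // Removal failed (e.g. store temporarily unreachable); keep the id
        // so a later call can retry instead of leaking state-store entries.
        break;
      }
    }
  }

  private void removeAttemptFromStore(String attemptId) throws Exception {
    // placeholder for the actual state-store delete
  }
}
{code}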
[jira] [Updated] (YARN-3480) Recovery may get very slow with lots of services with lots of app-attempts
[ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong updated YARN-3480: --- Attachment: YARN-3480.12.patch > Recovery may get very slow with lots of services with lots of app-attempts > -- > > Key: YARN-3480 > URL: https://issues.apache.org/jira/browse/YARN-3480 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3480.01.patch, YARN-3480.02.patch, > YARN-3480.03.patch, YARN-3480.04.patch, YARN-3480.05.patch, > YARN-3480.06.patch, YARN-3480.07.patch, YARN-3480.08.patch, > YARN-3480.09.patch, YARN-3480.10.patch, YARN-3480.11.patch, YARN-3480.12.patch > > > When RM HA is enabled and running containers are kept across attempts, apps > are more likely to finish successfully with more retries(attempts), so it > will be better to set 'yarn.resourcemanager.am.max-attempts' larger. However > it will make RMStateStore(FileSystem/HDFS/ZK) store more attempts, and make > RM recover process much slower. It might be better to set max attempts to be > stored in RMStateStore. > BTW: When 'attemptFailuresValidityInterval'(introduced in YARN-611) is set to > a small value, retried attempts might be very large. So we need to delete > some attempts stored in RMStateStore and RMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4494) Recover completed apps asynchronously
[ https://issues.apache.org/jira/browse/YARN-4494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068377#comment-15068377 ] Sunil G commented on YARN-4494: --- Thanks [~hex108] for filing this. It will speed up the recovery. For the specific scenario mentioned here, I feel a fast recovery is better than blocking the entire client call, since we are not sure how long recovering the completed apps may take. So an on-demand fast recovery looks like a good option, provided we keep some cached information about these fast-recovered apps so that they are not recovered again. > Recover completed apps asynchronously > - > > Key: YARN-4494 > URL: https://issues.apache.org/jira/browse/YARN-4494 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Jun Gong >Assignee: Jun Gong > > With RM HA enabled, when recovering apps, recover completed apps > asynchronously. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
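A rough illustration of the "recover completed apps in the background, but cache what has already been fast-recovered" idea; all names here are made up and this is not the RM recovery code.
{code}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Sketch only: background recovery plus a cache of recovered app ids. */
class CompletedAppRecovery {
  private final Set<String> recovered = ConcurrentHashMap.newKeySet();
  private final ExecutorService pool = Executors.newSingleThreadExecutor();

  void recoverAllAsync(Iterable<String> completedAppIds) {
    for (String appId : completedAppIds) {
      pool.submit(() -> recoverIfNeeded(appId));
    }
  }

  /** Called both by the background pass and by a client request that hits a
   *  not-yet-recovered completed app (the "on-demand fast recovery"). */
  void recoverIfNeeded(String appId) {
    if (recovered.add(appId)) {       // true only for the first caller
      replayFromStateStore(appId);
    }
  }

  private void replayFromStateStore(String appId) {
    // placeholder for loading the app's final state from the RMStateStore
  }
}
{code}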
[jira] [Commented] (YARN-3367) Replace starting a separate thread for post entity with event loop in TimelineClient
[ https://issues.apache.org/jira/browse/YARN-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068355#comment-15068355 ] Hadoop QA commented on YARN-3367: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 9m 53s {color} | {color:green} feature-YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 13s {color} | {color:green} feature-YARN-2928 passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 28s {color} | {color:green} feature-YARN-2928 passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 32s {color} | {color:green} feature-YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 35s {color} | {color:green} feature-YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 40s {color} | {color:green} feature-YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 43s {color} | {color:green} feature-YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 28s {color} | {color:green} feature-YARN-2928 passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 49s {color} | {color:green} feature-YARN-2928 passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 35s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 21s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 21s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 39s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 39s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 31s {color} | {color:red} Patch generated 13 new checkstyle issues in hadoop-yarn-project/hadoop-yarn (total was 48, now 58). {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 42s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 39s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 36s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common introduced 2 new FindBugs issues. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 28s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 52s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 24s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 10s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 8m 19s {color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 27s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_91. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 19s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.7.0_91. {color} | | {color:red}-1{color} | {color:red} unit {col
[jira] [Commented] (YARN-4373) Jobs can be temporarily forgotten during recovery
[ https://issues.apache.org/jira/browse/YARN-4373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068335#comment-15068335 ] Daniel Templeton commented on YARN-4373: I'm also incredulous. I'm still working to reproduce the issue. It was reported by our testing team. If/when I reproduce it, I'll post the details. > Jobs can be temporarily forgotten during recovery > - > > Key: YARN-4373 > URL: https://issues.apache.org/jira/browse/YARN-4373 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.1 >Reporter: Daniel Templeton >Assignee: Daniel Templeton >Priority: Critical > > The RM becomes available to service requests before state store recovery is > started. Before recovery and during the recovery period, it's possible for a > client to request an application report for a running application to which > the RM will respond that the application in unknown. > I'm seeing this issue with Oozie during an RM failover. Until the active > finishes recovery, it reports erroneous information to Oozie, which doesn't > have context to know that it should just try again later. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
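One possible mitigation, sketched with made-up names only (this is not a YARN API): while recovery is still in progress, answer unknown-app lookups with a retriable error instead of reporting the application as unknown.
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch only; class, method and exception choices are illustrative. */
class RecoveryAwareClientService {
  private final Map<String, Object> apps = new ConcurrentHashMap<>();
  private volatile boolean recoveryComplete;

  Object getApplicationReport(String appId) {
    Object app = apps.get(appId);
    if (app != null) {
      return app;
    }
    if (!recoveryComplete) {
      // Transient condition: clients such as Oozie should retry later
      // instead of treating the app as permanently unknown.
      throw new IllegalStateException("RM recovery in progress, retry: " + appId);
    }
    throw new IllegalArgumentException("Unknown application " + appId);
  }

  void markRecoveryComplete() {
    recoveryComplete = true;
  }
}
{code}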
[jira] [Commented] (YARN-3586) RM only get back addresses of Collectors that NM needs to know.
[ https://issues.apache.org/jira/browse/YARN-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068316#comment-15068316 ] Junping Du commented on YARN-3586: -- Thanks all for help in review! > RM only get back addresses of Collectors that NM needs to know. > --- > > Key: YARN-3586 > URL: https://issues.apache.org/jira/browse/YARN-3586 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager, timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Labels: yarn-2928-1st-milestone > Fix For: YARN-2928 > > Attachments: YARN-3586-demo.patch, YARN-3586-feature-YARN-2928.patch, > YARN-3586-feature-YARN-2928.v2.patch > > > After YARN-3445, RM cache runningApps for each NM. So RM heartbeat back to NM > should only include collectors' address for running applications against > specific NM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3542) Re-factor support for CPU as a resource using the new ResourceHandler mechanism
[ https://issues.apache.org/jira/browse/YARN-3542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-3542: Attachment: YARN-3542.007.patch Uploaded a new patch that'll apply on trunk. Just a note on the configurations - all the existing configurations for CPU isolation from CGroupsLCEResourceHandler carry over without any change. My expectation is that we'll continue to use them going forward. The only issue we haven't quite figured out is the configs for enabling the resource handlers. > Re-factor support for CPU as a resource using the new ResourceHandler > mechanism > --- > > Key: YARN-3542 > URL: https://issues.apache.org/jira/browse/YARN-3542 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Sidharta Seethana >Assignee: Varun Vasudev >Priority: Critical > Attachments: YARN-3542.001.patch, YARN-3542.002.patch, > YARN-3542.003.patch, YARN-3542.004.patch, YARN-3542.005.patch, > YARN-3542.006.patch, YARN-3542.007.patch > > > In YARN-3443 , a new ResourceHandler mechanism was added which enabled easier > addition of new resource types in the nodemanager (this was used for network > as a resource - See YARN-2140 ). We should refactor the existing CPU > implementation ( LinuxContainerExecutor/CgroupsLCEResourcesHandler ) using > the new ResourceHandler mechanism. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3586) RM only get back addresses of Collectors that NM needs to know.
[ https://issues.apache.org/jira/browse/YARN-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068238#comment-15068238 ] Varun Saxena commented on YARN-3586: Will commit it shortly. > RM only get back addresses of Collectors that NM needs to know. > --- > > Key: YARN-3586 > URL: https://issues.apache.org/jira/browse/YARN-3586 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager, timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Labels: yarn-2928-1st-milestone > Attachments: YARN-3586-demo.patch, YARN-3586-feature-YARN-2928.patch, > YARN-3586-feature-YARN-2928.v2.patch > > > After YARN-3445, RM cache runningApps for each NM. So RM heartbeat back to NM > should only include collectors' address for running applications against > specific NM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3367) Replace starting a separate thread for post entity with event loop in TimelineClient
[ https://issues.apache.org/jira/browse/YARN-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-3367: Attachment: YARN-3367-feature-YARN-2928.v1.004.patch WIP patch with fixes for test case > Replace starting a separate thread for post entity with event loop in > TimelineClient > > > Key: YARN-3367 > URL: https://issues.apache.org/jira/browse/YARN-3367 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Junping Du >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > Attachments: YARN-3367-feature-YARN-2928.003.patch, > YARN-3367-feature-YARN-2928.v1.002.patch, > YARN-3367-feature-YARN-2928.v1.004.patch, YARN-3367.YARN-2928.001.patch > > > Since YARN-3039, we add loop in TimelineClient to wait for > collectorServiceAddress ready before posting any entity. In consumer of > TimelineClient (like AM), we are starting a new thread for each call to get > rid of potential deadlock in main thread. This way has at least 3 major > defects: > 1. The consumer need some additional code to wrap a thread before calling > putEntities() in TimelineClient. > 2. It cost many thread resources which is unnecessary. > 3. The sequence of events could be out of order because each posting > operation thread get out of waiting loop randomly. > We should have something like event loop in TimelineClient side, > putEntities() only put related entities into a queue of entities and a > separated thread handle to deliver entities in queue to collector via REST > call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
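A minimal sketch of the event-loop design described in this issue, assuming placeholder types (the entity is just an Object here and postToCollector() stands in for the REST call); it is not the actual TimelineClient code.
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Sketch only: callers enqueue entities, a single thread delivers them. */
class TimelineEntityDispatcher {
  private final BlockingQueue<Object> queue = new LinkedBlockingQueue<>();
  private final Thread loop;
  private volatile boolean stopped;

  TimelineEntityDispatcher() {
    loop = new Thread(() -> {
      while (!stopped || !queue.isEmpty()) {
        try {
          Object entity = queue.take();        // blocks; preserves ordering
          postToCollector(entity);             // single REST delivery path
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          break;
        }
      }
    }, "timeline-entity-dispatcher");
    loop.setDaemon(true);
    loop.start();
  }

  /** Non-blocking for callers such as the AM: just enqueue and return. */
  void putEntitiesAsync(Object entity) {
    queue.add(entity);
  }

  void stop() {
    stopped = true;
    loop.interrupt();
  }

  private void postToCollector(Object entity) {
    // placeholder for the REST call to the collector service
  }
}
{code}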
[jira] [Commented] (YARN-3586) RM only get back addresses of Collectors that NM needs to know.
[ https://issues.apache.org/jira/browse/YARN-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068187#comment-15068187 ] Naganarasimha G R commented on YARN-3586: - +1, latest patch LGTM > RM only get back addresses of Collectors that NM needs to know. > --- > > Key: YARN-3586 > URL: https://issues.apache.org/jira/browse/YARN-3586 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager, timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Labels: yarn-2928-1st-milestone > Attachments: YARN-3586-demo.patch, YARN-3586-feature-YARN-2928.patch, > YARN-3586-feature-YARN-2928.v2.patch > > > After YARN-3445, RM cache runningApps for each NM. So RM heartbeat back to NM > should only include collectors' address for running applications against > specific NM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068068#comment-15068068 ] Naganarasimha G R commented on YARN-3995: - Hi [~sjlee0], as per the discussion we had in the status call, we planned to stop the collector 2 seconds after the AM container finishes, but we already have code that waits for one second and then closes the collector. Now, IIUC, the scope of this jira is: # Introduce a configurable period to wait # Instead of spawning multiple threads, maybe we can have a single thread which does this activity? Or do we need to introduce something else? bq. When RM finishes the attempt then it can send one finish event through timelineclient IMO this also will not guarantee that no event is missed, so I think a configurable wait period is better. Thoughts? > Some of the NM events are not getting published due race condition when AM > container finishes in NM > > > Key: YARN-3995 > URL: https://issues.apache.org/jira/browse/YARN-3995 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, timelineserver >Affects Versions: YARN-2928 >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > > As discussed in YARN-3045: While testing in TestDistributedShell found out > that few of the container metrics events were failing as there will be race > condition. When the AM container finishes and removes the collector for the > app, still there is possibility that all the events published for the app by > the current NM and other NM are still in pipeline, -- This message was sent by Atlassian JIRA (v6.3.4#6332)
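A sketch of the "single thread plus configurable linger period" option discussed in the comment above; the config key mentioned in the comment and the removeCollector() call here are hypothetical placeholders.
{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Sketch only; not the actual NM collector-manager code. */
class CollectorLinger {
  // e.g. read from a new, hypothetical property such as
  // "yarn.timeline-service.app-collector.linger-period-ms"
  private final long lingerMs;
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();   // one thread for all apps

  CollectorLinger(long lingerMs) {
    this.lingerMs = lingerMs;
  }

  void onAmContainerFinished(String appId) {
    // Delay the collector shutdown so late events still in the pipeline
    // from this NM and other NMs can still be published.
    scheduler.schedule(() -> removeCollector(appId), lingerMs, TimeUnit.MILLISECONDS);
  }

  private void removeCollector(String appId) {
    // placeholder for stopping/removing the per-app collector
  }
}
{code}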
[jira] [Commented] (YARN-4470) Application Master in-place upgrade
[ https://issues.apache.org/jira/browse/YARN-4470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068034#comment-15068034 ] Steve Loughran commented on YARN-4470: -- talk to [~gsaha] about the work. Thinking more about it, one thing we can't currently do is change the command line -jvm heap, cli arguments, etc. Having a way to update in-place the AM launch context prior to triggering an AM restart could address that > Application Master in-place upgrade > --- > > Key: YARN-4470 > URL: https://issues.apache.org/jira/browse/YARN-4470 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager >Reporter: Giovanni Matteo Fumarola >Assignee: Giovanni Matteo Fumarola > Attachments: AM in-place upgrade design. rev1.pdf > > > It would be nice if clients could ask for an AM in-place upgrade. > It will give to YARN the possibility to upgrade the AM, without losing the > work > done within its containers. This allows to deploy bug-fixes and new versions > of the AM incurring in long service downtimes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3863) Enhance filters in TimelineReader
[ https://issues.apache.org/jira/browse/YARN-3863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068029#comment-15068029 ] Varun Saxena commented on YARN-3863: Will update patch to fix checkstyle, whitespace and javac issues > Enhance filters in TimelineReader > - > > Key: YARN-3863 > URL: https://issues.apache.org/jira/browse/YARN-3863 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > Labels: yarn-2928-1st-milestone > Attachments: YARN-3863-feature-YARN-2928.wip.003.patch, > YARN-3863-feature-YARN-2928.wip.01.patch, > YARN-3863-feature-YARN-2928.wip.02.patch, > YARN-3863-feature-YARN-2928.wip.04.patch, > YARN-3863-feature-YARN-2928.wip.05.patch > > > Currently filters in timeline reader will return an entity only if all the > filter conditions hold true i.e. only AND operation is supported. We can > support OR operation for the filters as well. Additionally as primary backend > implementation is HBase, we can design our filters in a manner, where they > closely resemble HBase Filters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
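A generic illustration of the AND/OR filter-list idea (modeled loosely on HBase FilterList semantics); the class below is not the actual TimelineReader filter API. Usage would look like new CompositeFilter<>(Operator.MUST_PASS_ONE, Arrays.asList(f1, f2)).
{code}
import java.util.List;
import java.util.function.Predicate;

/** Sketch only: a composite filter supporting both AND and OR. */
class CompositeFilter<T> implements Predicate<T> {
  enum Operator { MUST_PASS_ALL, MUST_PASS_ONE }

  private final Operator op;
  private final List<Predicate<T>> filters;

  CompositeFilter(Operator op, List<Predicate<T>> filters) {
    this.op = op;
    this.filters = filters;
  }

  @Override
  public boolean test(T entity) {
    // MUST_PASS_ALL = AND across all conditions, MUST_PASS_ONE = OR.
    return op == Operator.MUST_PASS_ALL
        ? filters.stream().allMatch(f -> f.test(entity))
        : filters.stream().anyMatch(f -> f.test(entity));
  }
}
{code}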
[jira] [Commented] (YARN-4109) Exception on RM scheduler page loading with labels
[ https://issues.apache.org/jira/browse/YARN-4109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068001#comment-15068001 ] Mohammad Shahid Khan commented on YARN-4109: UT Failure are unrelated to current patch. UT addition/modification not required only small UI change. > Exception on RM scheduler page loading with labels > -- > > Key: YARN-4109 > URL: https://issues.apache.org/jira/browse/YARN-4109 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Mohammad Shahid Khan >Priority: Minor > Attachments: YARN-4109_1.patch > > > Configure node label and load scheduler Page > On each reload of the page the below exception gets thrown in logs > {code} > 2015-09-03 11:27:08,544 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error > handling URI: /cluster/scheduler > java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) > at > com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263) > at > com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178) > at > com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:139) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) > at > com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) > at > com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) > at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:663) > at > org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:291) > at > org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:615) > at > org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:82) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1211) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at 
org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) > at > org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) > at > org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) > at > org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) > at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) > at > org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) > at > org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) > at org.mortbay.jetty.Server.handle(Server.java:326) > at > org.mortbay.jetty.Htt
[jira] [Commented] (YARN-4459) container-executor might kill process wrongly
[ https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067999#comment-15067999 ] Naganarasimha G R commented on YARN-4459: - Is this issue similar to YARN-3678 ? > container-executor might kill process wrongly > - > > Key: YARN-4459 > URL: https://issues.apache.org/jira/browse/YARN-4459 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-4459.01.patch, YARN-4459.02.patch > > > When calling 'signal_container_as_user' in container-executor, it first > checks whether process group exists, if not, it will kill the process > itself(if it the process exists). It is not reasonable because that the > process group does not exist means corresponding container has finished, if > we kill the process itself, we just kill wrong process. > We found it happened in our cluster many times. We used same account for > starting NM and submitted app, and container-executor sometimes killed NM(the > wrongly killed process might just be a newly started thread and was NM's > child process). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4109) Exception on RM scheduler page loading with labels
[ https://issues.apache.org/jira/browse/YARN-4109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067974#comment-15067974 ] Hadoop QA commented on YARN-4109: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 48s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 12s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 37s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 11s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 33s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 30s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 18s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 58m 51s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 60m 6s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_91. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 24s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 136m 47s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_66 Failed junit tests | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | JDK v1.7.0_91 Failed junit tests | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | | hadoop.yarn.server.resourcemanager.TestClientRMTokens | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12775857/YARN-4109_1.patch
[jira] [Commented] (YARN-4224) Change the ATSv2 reader side REST interface to conform to current REST APIs' in YARN
[ https://issues.apache.org/jira/browse/YARN-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067939#comment-15067939 ] Varun Saxena commented on YARN-4224: Formatting issue for one of the comments. Writing it here again. bq. 3. Seems like there is no full path to locate one entity from the cluster, user, flow, run, app, entity type, and entity id. Are we omitting this endpoint deliberately? There is. The {{\/entity\/\{uid\}\/}} endpoint. I hope this is what your question was. > Change the ATSv2 reader side REST interface to conform to current REST APIs' > in YARN > > > Key: YARN-4224 > URL: https://issues.apache.org/jira/browse/YARN-4224 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > Labels: yarn-2928-1st-milestone > Attachments: YARN-4224-YARN-2928.01.patch, > YARN-4224-feature-YARN-2928.wip.02.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4224) Change the ATSv2 reader side REST interface to conform to current REST APIs' in YARN
[ https://issues.apache.org/jira/browse/YARN-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067935#comment-15067935 ] Varun Saxena commented on YARN-4224: [~gtCarrera9], {quote} 1. I understand we would like to have plural forms for listing (like /flows, /apps) and singular forms for detail (like /flow/{uid}). But then, why do we need both /runs/{uid} and /run/{uid}? The same question also applies to apps. {quote} You mean just have one endpoint and based on delimiters in UID, decide whether to fetch single entity or multiple ? {quote} For entities, we require both UID and type. Why type is not a part of UID (which means UID is not sufficient to identify an entity)? {quote} Because, the query before query for entities, in which we return UID, will be query for apps. This query is from application table where we do not have entity related information. So I cannot send all the possible entity types for an entity or a list of entities in the response. Hence when we query list of entities, it is within the scope of app UID and entity type. Hence entity type has to be specified. {quote} Or, are you planning to support operations like "list all entities in a given entity type"? {quote} Yes, if we try to query all entity types and related entities, this would require scanning quite a bit of the entity table which can grow quite big. And from UI, I envisage only queries for APP_ATTEMPT and CONTAINER so we would know the entity type. {quote} If it is the latter, then do we want to consider put type into query parameters on end point entities? {quote} Normally in REST, mandatory params are kept as part of path. I expect entity type to be a mandatory param. I am assuming we do not want to support queries like get entities for all possible entity types. Thoughts ? bq. For flows, why we' re not including an UID endpoint to locate one flow? This poses a challenge when we'd like to list all flow runs within one flow (or, do we have any other end points to do this work? ). When we query flows, we return all possible flow runs with it. So query for a single flow is not going to give any new information. Moreover, runs endpoint will do it for you i.e. fetch all flowruns for a flow. {quote} 3. Seems like there is no full path to locate one entity from the cluster, user, flow, run, app, entity type, and entity id. Are we omitting this endpoint deliberately? {quote} There is. The {{\/entity\/{uid}\/}} endpoint. I hope this is what your question was. {quote} As a side note, in this patch there are 3 types of "shortcuts" in the URL: omit the cluster id (with default cluster id), omit user id (with default user id) and directly access app id. I'm OK with direct accessing app ids (with cluster id), but do we want to omit the other two? Comments are more then welcome. {quote} Can you elaborate a bit on this ? We have 3 endpoints. One with cluster id, one without cluster id(default cluster from config is taken) and one with UID. For apps, the UID endpoint will contain flow context information. bq. I'm also debating with myself on this. Right now I'm leaning towards to make the UIDs transparent to the storage layer. I am fine either ways. I can see pros and cons for both. Only concern I see is that if we are fetching a lot of entities and specify a high limit, we need to iterate over all the entities again to fill the UID. If it is a mere 50-100 entities then should be negligible difference but what if its very high. 
Another thing we need to ponder is that whether we need to support pagination for UIs' and if it would be possible to support it. Because then we will have to store some contextual information in reader. Or we can send some info back in response and continue from there for next pagination request. That is handle pagination by ourselves. Not sure if we can do this before 1st milestone though. > Change the ATSv2 reader side REST interface to conform to current REST APIs' > in YARN > > > Key: YARN-4224 > URL: https://issues.apache.org/jira/browse/YARN-4224 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > Labels: yarn-2928-1st-milestone > Attachments: YARN-4224-YARN-2928.01.patch, > YARN-4224-feature-YARN-2928.wip.02.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4172) Extend DominantResourceCalculator to account for all resources
[ https://issues.apache.org/jira/browse/YARN-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067867#comment-15067867 ] Varun Vasudev commented on YARN-4172: - Looks like we need to rebase the YARN-3926 branch. Unfortunately, we can't do force pushes so we might have to use merge. > Extend DominantResourceCalculator to account for all resources > -- > > Key: YARN-4172 > URL: https://issues.apache.org/jira/browse/YARN-4172 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: YARN-4172-YARN-3926.001.patch, > YARN-4172-YARN-3926.002.patch > > > Now that support for multiple resources is present in the resource class, we > need to modify DominantResourceCalculator to account for the new resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-4027) ApplicationHistory web UI(timeline V1) apps column rendering is incorrect
[ https://issues.apache.org/jira/browse/YARN-4027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S resolved YARN-4027. - Resolution: Duplicate Closing as duplicate > ApplicationHistory web UI(timeline V1) apps column rendering is incorrect > - > > Key: YARN-4027 > URL: https://issues.apache.org/jira/browse/YARN-4027 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Rohith Sharma K S > > It is observed that after YARN-3948, timeline web UI does not rendering > correctly -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4172) Extend DominantResourceCalculator to account for all resources
[ https://issues.apache.org/jira/browse/YARN-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067861#comment-15067861 ] Hadoop QA commented on YARN-4172: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | {color:red} docker {color} | {color:red} 22m 34s {color} | {color:red} Docker failed to build yetus/hadoop:a890a31. {color} | \\ \\ || Subsystem || Report/Notes || | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12779018/YARN-4172-YARN-3926.002.patch | | JIRA Issue | YARN-4172 | | Powered by | Apache Yetus 0.2.0-SNAPSHOT http://yetus.apache.org | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/10072/console | This message was automatically generated. > Extend DominantResourceCalculator to account for all resources > -- > > Key: YARN-4172 > URL: https://issues.apache.org/jira/browse/YARN-4172 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: YARN-4172-YARN-3926.001.patch, > YARN-4172-YARN-3926.002.patch > > > Now that support for multiple resources is present in the resource class, we > need to modify DominantResourceCalculator to account for the new resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4234) New put APIs in TimelineClient for ats v1.5
[ https://issues.apache.org/jira/browse/YARN-4234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067853#comment-15067853 ] Junping Du commented on YARN-4234: -- Thanks [~xgong] for updating the patch. +1 on latest patch. I will commit it tomorrow if no further comments or objections from others. > New put APIs in TimelineClient for ats v1.5 > --- > > Key: YARN-4234 > URL: https://issues.apache.org/jira/browse/YARN-4234 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Xuan Gong >Assignee: Xuan Gong > Attachments: YARN-4234-2015-11-13.1.patch, > YARN-4234-2015-11-16.1.patch, YARN-4234-2015-11-16.2.patch, > YARN-4234-2015.2.patch, YARN-4234.1.patch, YARN-4234.2.patch, > YARN-4234.2015-11-12.1.patch, YARN-4234.2015-11-12.1.patch, > YARN-4234.2015-11-18.1.patch, YARN-4234.2015-11-18.2.patch, > YARN-4234.2015-11-18.patch, YARN-4234.2015-12-09.patch, > YARN-4234.2015-12-09.patch, YARN-4234.2015-12-17.1.patch, > YARN-4234.2015-12-18.1.patch, YARN-4234.2015-12-18.patch, > YARN-4234.2015-12-21.1.patch, YARN-4234.20151109.patch, > YARN-4234.20151110.1.patch, YARN-4234.2015.1.patch, YARN-4234.3.patch > > > In this ticket, we will add new put APIs in timelineClient to let > clients/applications have the option to use ATS v1.5 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4466) ResourceManager should tolerate unexpected exceptions to happen in non-critical subsystem/services like SystemMetricsPublisher
[ https://issues.apache.org/jira/browse/YARN-4466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067852#comment-15067852 ] Naganarasimha G R commented on YARN-4466: - Hi [~djp], I was looking into the exception trace which caused the RM to go down in the YARN-4452 issue: {code} java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.appAttemptRegistered(SystemMetricsPublisher.java:165) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1533) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1505) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:845) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:110) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:815) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:796) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109) {code} Here {{rmDispatcher-> ApplicationAttemptEventDispatcher-> ... -> transition in RMAppAttemptImpl -> SMP.appAttemptRegistered}} causes an NPE. In the {{AsyncDispatcher.dispatch}} method, on exception we check {{exitOnDispatchException}} and, if true, we call System.exit. So I was not able to figure out an approach here to add a check such that the RM doesn't fail; it was basically a bug. So IMO it's *not* always possible to cover all scenarios, but what we could do is *expose an interface in AsyncDispatcher to allow setting the exitOnDispatchException variable, and the dispatchers used in ATS can set it to false* so that the RM doesn't go down for exceptions related to dispatching ATS events. But even if we introduce some mechanism, we might still miss some hidden bugs. Apart from TimelinePublisher, the only other place in the RM where I could see AsyncDispatcher used and where we can neglect the failures/exceptions is {{RMApplicationHistoryWriter}}, which is already deprecated! Thoughts? > ResourceManager should tolerate unexpected exceptions to happen in > non-critical subsystem/services like SystemMetricsPublisher > -- > > Key: YARN-4466 > URL: https://issues.apache.org/jira/browse/YARN-4466 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Junping Du >Assignee: Naganarasimha G R > > From my comment in > YARN-4452(https://issues.apache.org/jira/browse/YARN-4452?focusedCommentId=15059805&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15059805), > we should make RM more robust with ignore (but log) unexpected exception in > its non-critical subsystems/services. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
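A sketch of the proposed knob, using made-up class and method names (the setter below does not exist upstream); it only illustrates how a non-critical dispatcher could opt out of the fail-fast behaviour.
{code}
/** Sketch of the proposal only; the real AsyncDispatcher is different. */
class TolerantDispatcher {
  private volatile boolean exitOnDispatchException = true;

  /** Dispatchers owned by non-critical subsystems (e.g. timeline/ATS
   *  publishing) would call this with false so their failures never take
   *  down the RM. */
  void setExitOnDispatchException(boolean exitOnDispatchException) {
    this.exitOnDispatchException = exitOnDispatchException;
  }

  void dispatch(Runnable handler) {
    try {
      handler.run();
    } catch (Throwable t) {
      if (exitOnDispatchException) {
        // Critical dispatcher: keep today's fail-fast behaviour.
        System.exit(-1);
      }
      // Non-critical dispatcher: log the error and keep the RM running.
      System.err.println("Ignoring error in non-critical dispatcher: " + t);
    }
  }
}
{code}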
[jira] [Commented] (YARN-2599) Standby RM should also expose some jmx and metrics
[ https://issues.apache.org/jira/browse/YARN-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067814#comment-15067814 ] Rohith Sharma K S commented on YARN-2599: - Apologies for missing the sequence of comments :-( I will try to make progress on this. > Standby RM should also expose some jmx and metrics > -- > > Key: YARN-2599 > URL: https://issues.apache.org/jira/browse/YARN-2599 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.5.1 >Reporter: Karthik Kambatla >Assignee: Rohith Sharma K S > > YARN-1898 redirects jmx and metrics to the Active. As discussed there, we > need to separate out metrics displayed so the Standby RM can also be > monitored. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4172) Extend DominantResourceCalculator to account for all resources
[ https://issues.apache.org/jira/browse/YARN-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-4172: Attachment: YARN-4172-YARN-3926.002.patch bq. 1) Who will update DRC.resourceNames, I found currently it hardcoded to memory and vcores. This will be handled as part of later patches. For now, only memory and vcores are supported. Eventually, the list will be generated via a config. bq. 2) DominantResourceCalculator#compare: Need handle the case when resourceName not existed at lhs or rhs? This would throw an exception in getResourceInformation. bq. Could use ResourceInformation.compareTo instead of ResourceInformation.compareValue since resourceName equals to each other? Is it still necessary to keep compareValue? Good point. Fixed. bq. 3) Suggest to use > = < to do compare since of switch (diff). Fixed. bq. 4) DRC#computeAvailableContainers need consider denominator == 0? The current implementation doesn't handle the case - how should I handle it? bq. 5) Is it possible we can return a dummy ResourceInformation (with value == 0) instead of check exception at lots of places. I would prefer to throw an exception - FairScheduler uses 0 Resource objects in a lot of place - I don't want to return something that indicates the operation was successful. > Extend DominantResourceCalculator to account for all resources > -- > > Key: YARN-4172 > URL: https://issues.apache.org/jira/browse/YARN-4172 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: YARN-4172-YARN-3926.001.patch, > YARN-4172-YARN-3926.002.patch > > > Now that support for multiple resources is present in the resource class, we > need to modify DominantResourceCalculator to account for the new resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
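For reference, an illustrative dominant-share comparison over arbitrary resource names, showing one way the missing-resource and zero-denominator cases discussed above could be handled; this is not the actual DominantResourceCalculator code.
{code}
import java.util.Map;

/** Sketch only: DRF-style ordering over an open-ended set of resources. */
final class DominantShare {
  /** Largest ratio of requested/clusterTotal across all resource names known
   *  to the cluster; a name missing from the request counts as a zero ask. */
  static double dominantShare(Map<String, Long> request, Map<String, Long> clusterTotal) {
    double max = 0.0;
    for (Map.Entry<String, Long> e : clusterTotal.entrySet()) {
      long total = e.getValue();
      if (total <= 0) {
        continue;                       // skip the denominator == 0 case
      }
      long asked = request.getOrDefault(e.getKey(), 0L);
      max = Math.max(max, (double) asked / total);
    }
    return max;
  }

  /** compare(lhs, rhs) > 0 means lhs dominates rhs, mirroring DRF ordering. */
  static int compare(Map<String, Long> lhs, Map<String, Long> rhs,
                     Map<String, Long> clusterTotal) {
    return Double.compare(dominantShare(lhs, clusterTotal),
                          dominantShare(rhs, clusterTotal));
  }
}
{code}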
[jira] [Updated] (YARN-4109) Exception on RM scheduler page loading with labels
[ https://issues.apache.org/jira/browse/YARN-4109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-4109: Priority: Minor (was: Major) > Exception on RM scheduler page loading with labels > -- > > Key: YARN-4109 > URL: https://issues.apache.org/jira/browse/YARN-4109 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Mohammad Shahid Khan >Priority: Minor > Attachments: YARN-4109_1.patch > > > Configure node label and load scheduler Page > On each reload of the page the below exception gets thrown in logs > {code} > 2015-09-03 11:27:08,544 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error > handling URI: /cluster/scheduler > java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) > at > com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263) > at > com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178) > at > com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:139) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) > at > com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) > at > com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) > at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:663) > at > org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:291) > at > org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:615) > at > org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:82) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1211) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > at > 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) > at > org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) > at > org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) > at > org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) > at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) > at > org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) > at > org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) > at org.mortbay.jetty.Server.handle(Server.java:326) > at > org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) > at > org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpCo
[jira] [Commented] (YARN-4109) Exception on RM scheduler page loading with labels
[ https://issues.apache.org/jira/browse/YARN-4109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067782#comment-15067782 ] Rohith Sharma K S commented on YARN-4109: - +1 lgtm, pending jenkins > Exception on RM scheduler page loading with labels > -- > > Key: YARN-4109 > URL: https://issues.apache.org/jira/browse/YARN-4109 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Mohammad Shahid Khan > Attachments: YARN-4109_1.patch > > > Configure node label and load scheduler Page > On each reload of the page the below exception gets thrown in logs > {code} > 2015-09-03 11:27:08,544 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error > handling URI: /cluster/scheduler > java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) > at > com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263) > at > com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178) > at > com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:139) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) > at > com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) > at > com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) > at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:663) > at > org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:291) > at > org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:615) > at > org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:82) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1211) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > at > 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) > at > org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) > at > org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) > at > org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) > at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) > at > org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) > at > org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) > at org.mortbay.jetty.Server.handle(Server.java:326) > at > org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) > at > org.mortbay.jetty.HttpConnection$RequestHandler.h
[jira] [Updated] (YARN-4324) AM hang more than 10 min was kill by RM
[ https://issues.apache.org/jira/browse/YARN-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-4324: --- Attachment: (was: am105361log.tar.gz) > AM hang more than 10 min was kill by RM > --- > > Key: YARN-4324 > URL: https://issues.apache.org/jira/browse/YARN-4324 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: tangshangwen > Attachments: logs.rar, yarn-nodemanager-dumpam.log > > > These are my logs: > 2015-11-02 01:14:54,175 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 2865 > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: > job_1446203652278_135526Job Transitioned from RUNNING to COMMITTING > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: > attempt_1446203652278_135526_m_001777_1 TaskAttempt Transitioned from UNASSIGNED to KILLED > 2015-11-02 01:14:54,176 INFO [CommitterEvent Processor #1] > org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing > the event EventType: JOB_COMMIT > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a > signal. Signaling RMCommunicator and JobHistoryEventHandler. > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that iSignalled is: true > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator > isAMLastRetry: true > The Hive job showed map 100%, then fell back to map 0%, and the job failed! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4324) AM hang more than 10 min was kill by RM
[ https://issues.apache.org/jira/browse/YARN-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-4324: --- Attachment: am105361log.tar.gz I have uploaded another AM log. > AM hang more than 10 min was kill by RM > --- > > Key: YARN-4324 > URL: https://issues.apache.org/jira/browse/YARN-4324 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: tangshangwen > Attachments: am105361log.tar.gz, logs.rar, yarn-nodemanager-dumpam.log > > > These are my logs: > 2015-11-02 01:14:54,175 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 2865 > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: > job_1446203652278_135526Job Transitioned from RUNNING to COMMITTING > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: > attempt_1446203652278_135526_m_001777_1 TaskAttempt Transitioned from UNASSIGNED to KILLED > 2015-11-02 01:14:54,176 INFO [CommitterEvent Processor #1] > org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing > the event EventType: JOB_COMMIT > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a > signal. Signaling RMCommunicator and JobHistoryEventHandler. > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that iSignalled is: true > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator > isAMLastRetry: true > The Hive job showed map 100%, then fell back to map 0%, and the job failed! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
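A note on the timeline in the log excerpt above: the JOB_COMMIT event is processed at 01:14:54 and the AM is signalled at 01:24:15, roughly ten minutes later, which matches the default AM liveness expiry interval (yarn.am.liveness-monitor.expiry-interval-ms, 600000 ms). Treating that interval as the mechanism behind this particular kill is an assumption, not a confirmed diagnosis; the sketch below (class name is illustrative) only shows how the setting is read, and raising it gives a hung AM more time rather than fixing the hang itself.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Illustrative sketch: reads the AM liveness expiry interval that the RM uses
// before declaring an AM dead (default 600000 ms = 10 minutes).
public class AmExpiryCheck {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    long expiryMs = conf.getLong(
        YarnConfiguration.RM_AM_EXPIRY_INTERVAL_MS,          // yarn.am.liveness-monitor.expiry-interval-ms
        YarnConfiguration.DEFAULT_RM_AM_EXPIRY_INTERVAL_MS); // 600000
    System.out.println("AM liveness expiry interval: " + expiryMs + " ms");
  }
}
{code}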
[jira] [Commented] (YARN-3458) CPU resource monitoring in Windows
[ https://issues.apache.org/jira/browse/YARN-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067745#comment-15067745 ] Inigo Goiri commented on YARN-3458: --- [~cnauroth], thank you very much for the review! > CPU resource monitoring in Windows > -- > > Key: YARN-3458 > URL: https://issues.apache.org/jira/browse/YARN-3458 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Affects Versions: 2.7.0 > Environment: Windows >Reporter: Inigo Goiri >Assignee: Inigo Goiri >Priority: Minor > Labels: BB2015-05-TBR, containers, metrics, windows > Fix For: 2.8.0 > > Attachments: YARN-3458-1.patch, YARN-3458-2.patch, YARN-3458-3.patch, > YARN-3458-4.patch, YARN-3458-5.patch, YARN-3458-6.patch, YARN-3458-7.patch, > YARN-3458-8.patch, YARN-3458-9.patch > > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of getCpuUsagePercent() for > WindowsBasedProcessTree reports the value as unavailable. I have attached a proposal for how to > implement it; it reuses the CpuTimeTracker with 1 jiffy = 1 ms. > This was left open by YARN-3122. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
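The approach above derives getCpuUsagePercent() from cumulative CPU time, treating 1 jiffy as 1 ms. The following is a minimal, self-contained sketch of that calculation; it is not the actual CpuTimeTracker or WindowsBasedProcessTree code from the patch, and the class, field, and method names are illustrative assumptions.
{code}
// Sketch of a CPU usage percentage computed from cumulative process CPU time,
// sampled in milliseconds (1 jiffy = 1 ms). Not the Hadoop API; names are hypothetical.
public class CpuUsageSketch {
  private long prevCumulativeCpuMs = -1; // total CPU time seen at the last sample
  private long prevSampleTimeMs = -1;    // wall-clock time of the last sample

  /**
   * @param cumulativeCpuMs total CPU time (ms) consumed so far by the process
   *                        tree, as reported by the OS
   * @return CPU usage in percent of one core since the previous sample,
   *         or -1 if there is no previous sample yet
   */
  public float getCpuUsagePercent(long cumulativeCpuMs) {
    long now = System.currentTimeMillis();
    float percent = -1f;
    if (prevCumulativeCpuMs >= 0 && now > prevSampleTimeMs) {
      // CPU time consumed in the interval divided by wall-clock time elapsed
      percent = 100f * (cumulativeCpuMs - prevCumulativeCpuMs)
          / (now - prevSampleTimeMs);
    }
    prevCumulativeCpuMs = cumulativeCpuMs;
    prevSampleTimeMs = now;
    return percent;
  }
}
{code}
Whether the result is reported per core or aggregated across all cores, and how very short sampling intervals are smoothed, are details the real CpuTimeTracker handles.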
[jira] [Commented] (YARN-3692) Allow REST API to set a user generated message when killing an application
[ https://issues.apache.org/jira/browse/YARN-3692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067730#comment-15067730 ] Rohith Sharma K S commented on YARN-3692: - Thanks [~rjainqb] for sharing your use case. It has been quite a long time since I last looked into this JIRA; I will make progress on it. Along the same line of thinking, we could also add a diagnostic message when failing an attempt. > Allow REST API to set a user generated message when killing an application > -- > > Key: YARN-3692 > URL: https://issues.apache.org/jira/browse/YARN-3692 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Rajat Jain >Assignee: Rohith Sharma K S > > Currently, YARN's REST API supports killing an application, but it does not allow setting a > diagnostic message. It would be good to provide that support. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
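To make the YARN-3692 proposal concrete: the RM already exposes PUT /ws/v1/cluster/apps/{appid}/state, which accepts a body of {"state":"KILLED"}. The sketch below adds a hypothetical "diagnostics" field to that body to carry the user-generated kill message; the field name, RM address, and application id are assumptions for illustration, not the final API.
{code}
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch of killing an application via the RM REST API with a user-supplied message.
// The "diagnostics" JSON field is the proposed (hypothetical) addition; the current
// API only honours the "state" field.
public class KillAppWithMessage {
  public static void main(String[] args) throws Exception {
    String rmWebApp = "http://rm-host:8088";            // assumed RM web address
    String appId = "application_1446203652278_135526";  // example application id
    URL url = new URL(rmWebApp + "/ws/v1/cluster/apps/" + appId + "/state");

    String body = "{\"state\":\"KILLED\","
        + "\"diagnostics\":\"Killed by operator: runaway query\"}";

    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("PUT");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);
    try (OutputStream out = conn.getOutputStream()) {
      out.write(body.getBytes(StandardCharsets.UTF_8));
    }
    System.out.println("Response: HTTP " + conn.getResponseCode());
  }
}
{code}
Storing the same string as the attempt's diagnostics would also cover the "fail the attempt with a message" case mentioned in the comment.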