[jira] [Commented] (YARN-6539) Create SecureLogin inside Router
[ https://issues.apache.org/jira/browse/YARN-6539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16904254#comment-16904254 ] Subru Krishnan commented on YARN-6539: -- [~yifan.stan], great to hear that you are running Federation in a secure cluster! I would love to hear more details about it. I thought I had mentioned it to [~shenyinjie] but guess not - I am not familiar with the security code. Hopefully [~bibinchundatt] or [~Prabhu Joseph] can help? Also, would it be possible to add a test? Thanks. > Create SecureLogin inside Router > > > Key: YARN-6539 > URL: https://issues.apache.org/jira/browse/YARN-6539 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Giovanni Matteo Fumarola >Assignee: Xie YiFan >Priority: Minor > Attachments: YARN-6359_1.patch, YARN-6359_2.patch, YARN-6539_3.patch > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
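For context, a minimal sketch of the usual YARN daemon secure-login pattern that a Router SecureLogin would follow — the config key names here are hypothetical placeholders, not necessarily the ones the patch defines:
{code}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.SecurityUtil;

public class RouterSecureLoginSketch {
  // Hypothetical key names for illustration only; the patch defines the real ones.
  static final String ROUTER_KEYTAB = "yarn.router.keytab.file";
  static final String ROUTER_PRINCIPAL = "yarn.router.kerberos.principal";

  // Mirrors the doSecureLogin() pattern the RM/NM use in serviceInit:
  // log the daemon in from its keytab before any RPC server starts.
  static void doSecureLogin(Configuration conf, String hostname) throws IOException {
    SecurityUtil.login(conf, ROUTER_KEYTAB, ROUTER_PRINCIPAL, hostname);
  }
}
{code}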
[jira] [Commented] (YARN-9425) Make initialDelay configurable for FederationStateStoreService#scheduledExecutorService
[ https://issues.apache.org/jira/browse/YARN-9425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852165#comment-16852165 ] Subru Krishnan commented on YARN-9425: -- [~giovanni.fumarola], can you take a look? [~shenyinjie], can you fix the Yetus warnings above? > Make initialDelay configurable for > FederationStateStoreService#scheduledExecutorService > --- > > Key: YARN-9425 > URL: https://issues.apache.org/jira/browse/YARN-9425 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.1.0 >Reporter: Shen Yinjie >Assignee: Shen Yinjie >Priority: Major > Attachments: YARN-9425_1.patch > > > When YARN federation is enabled, subcluster info in the Router Web UI cannot be > loaded immediately, and clients cannot find any active subclusters for 5 mins > by default, which is configured by > "yarn.federation.state-store.heartbeat-interval-secs". > IMO, we should separate 'initialDelay' and 'delay' for > FederationStateStoreService#scheduledExecutorService. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
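To make the ask concrete, a minimal sketch (plain JDK, not the actual patch) of separating initialDelay from delay in the scheduled heartbeat — the 10s initial delay is a hypothetical value for the proposed new config, while 300s mirrors the default heartbeat interval mentioned above:
{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HeartbeatScheduleSketch {
  public static void main(String[] args) {
    ScheduledExecutorService scheduledExecutorService =
        Executors.newSingleThreadScheduledExecutor();
    long initialDelaySecs = 10;       // hypothetical new config: fire the first heartbeat quickly
    long heartbeatIntervalSecs = 300; // default of yarn.federation.state-store.heartbeat-interval-secs
    // scheduleWithFixedDelay already takes the two values independently;
    // the fix is simply to stop passing the interval for both parameters.
    scheduledExecutorService.scheduleWithFixedDelay(
        () -> System.out.println("heartbeat to FederationStateStore"),
        initialDelaySecs, heartbeatIntervalSecs, TimeUnit.SECONDS);
  }
}
{code}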
[jira] [Commented] (YARN-9586) [QA] Need more doc for yarn.federation.policy-manager-params when LoadBasedRouterPolicy is used
[ https://issues.apache.org/jira/browse/YARN-9586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852164#comment-16852164 ] Subru Krishnan commented on YARN-9586: -- [~shenyinjie], please check the Javadocs for config information: https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/federation/policies/router/LoadBasedRouterPolicy.java#L38 > [QA] Need more doc for yarn.federation.policy-manager-params when > LoadBasedRouterPolicy is used > --- > > Key: YARN-9586 > URL: https://issues.apache.org/jira/browse/YARN-9586 > Project: Hadoop YARN > Issue Type: Wish > Components: federation >Reporter: Shen Yinjie >Priority: Major > > We picked LoadBasedRouterPolicy for YARN federation, but had no idea what to > set for yarn.federation.policy-manager-params. Is there a demo config or a more > detailed description for this? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
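For anyone landing here with the same question, a hedged sketch of building and serializing the WeightedPolicyInfo that LoadBasedRouterPolicy consumes (per the Javadoc linked above) — the subcluster ids and weights are made-up examples, and the serialized blob is what the policy-manager-params ultimately carries:
{code}
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.yarn.server.federation.policies.dao.SubClusterIdInfo;
import org.apache.hadoop.yarn.server.federation.policies.dao.WeightedPolicyInfo;

public class PolicyParamsSketch {
  public static void main(String[] args) throws Exception {
    Map<SubClusterIdInfo, Float> weights = new HashMap<>();
    weights.put(new SubClusterIdInfo("SC-1"), 1.0f); // weight > 0 marks the subcluster as usable
    weights.put(new SubClusterIdInfo("SC-2"), 1.0f);

    WeightedPolicyInfo policyInfo = new WeightedPolicyInfo();
    policyInfo.setRouterPolicyWeights(weights);
    policyInfo.setAMRMPolicyWeights(weights);
    policyInfo.setHeadroomAlpha(1.0f);

    // The serialized bytes are what ends up stored as policy params.
    ByteBuffer params = policyInfo.toByteBuffer();
    System.out.println("serialized policy params: " + params.remaining() + " bytes");
  }
}
{code}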
[jira] [Commented] (YARN-2915) Enable YARN RM scale out via federation using multiple RM's
[ https://issues.apache.org/jira/browse/YARN-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16783922#comment-16783922 ] Subru Krishnan commented on YARN-2915: -- [~liuxun323], great to hear about your interest in YARN Federation. It's available from 2.9+, so you are good with both 3.0.0 & 3.2.0 :). > Enable YARN RM scale out via federation using multiple RM's > --- > > Key: YARN-2915 > URL: https://issues.apache.org/jira/browse/YARN-2915 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Reporter: Sriram Rao >Assignee: Subru Krishnan >Priority: Major > Labels: federation > Fix For: 2.9.0, 3.0.0-beta1 > > Attachments: FEDERATION_CAPACITY_ALLOCATION_JIRA.pdf, > Federation-BoF.pdf, YARN-Federation-Hadoop-Summit_final.pptx, > Yarn_federation_design_v1.pdf, federation-prototype.patch > > > This is an umbrella JIRA that proposes to scale out YARN to support large > clusters comprising tens of thousands of nodes. That is, rather than > limiting a YARN managed cluster to about 4k in size, the proposal is to > enable the YARN managed cluster to be elastically scalable. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8898) Fix FederationInterceptor#allocate to set application priority in allocateResponse
[ https://issues.apache.org/jira/browse/YARN-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687190#comment-16687190 ] Subru Krishnan commented on YARN-8898: -- {quote}Unfortunately we didn't write ApplicationHomeSubCluster.getProto.getBytes to znode{quote} Thanks [~bibinchundatt] for bringing this to my attention. The intention was to persist _ApplicationHomeSubCluster_ and that's why it was defined as a proto object in the first place. So I feel it might be better to fix it, as at least the API is correct? I mean, add the trimmed _ApplicationSubmissionContext_ to _ApplicationHomeSubCluster_ and persist the entire _ApplicationHomeSubCluster_ in _ZK_. For SQL, it's adding a new column so it should be safe as well. > Fix FederationInterceptor#allocate to set application priority in > allocateResponse > -- > > Key: YARN-8898 > URL: https://issues.apache.org/jira/browse/YARN-8898 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Priority: Major > Attachments: YARN-8898.wip.patch > > > FederationInterceptor#mergeAllocateResponses skips application_priority in > the returned response -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
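To make the ZK half of the proposal concrete, a rough sketch — the connect string and znode path are hypothetical — of persisting the whole ApplicationHomeSubCluster proto rather than just the home subcluster id:
{code}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.server.federation.store.records.ApplicationHomeSubCluster;
import org.apache.hadoop.yarn.server.federation.store.records.SubClusterId;
import org.apache.hadoop.yarn.server.federation.store.records.impl.pb.ApplicationHomeSubClusterPBImpl;

public class ZkHomeSubClusterSketch {
  public static void main(String[] args) throws Exception {
    CuratorFramework zk = CuratorFrameworkFactory.newClient(
        "localhost:2181", new ExponentialBackoffRetry(1000, 3));
    zk.start();

    ApplicationId appId = ApplicationId.newInstance(System.currentTimeMillis(), 1);
    ApplicationHomeSubCluster record =
        ApplicationHomeSubCluster.newInstance(appId, SubClusterId.newInstance("SC-1"));

    // Persist the full proto bytes: a field added later (e.g. a trimmed
    // ApplicationSubmissionContext) is then stored without a schema change.
    byte[] bytes = ((ApplicationHomeSubClusterPBImpl) record).getProto().toByteArray();
    zk.create().creatingParentsIfNeeded()
        .forPath("/federationstore/applications/" + appId, bytes);
  }
}
{code}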
[jira] [Commented] (YARN-8898) Fix FederationInterceptor#allocate to set application priority in allocateResponse
[ https://issues.apache.org/jira/browse/YARN-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16686009#comment-16686009 ] Subru Krishnan commented on YARN-8898: -- Thanks [~bibinchundatt] for the detailed clarification. +1 on trimming as that's what we do for RM HA as well. We should be able to use _ApplicationHomeSubCluster_ itself as the addition of a field should still be backward compatible, right? > Fix FederationInterceptor#allocate to set application priority in > allocateResponse > -- > > Key: YARN-8898 > URL: https://issues.apache.org/jira/browse/YARN-8898 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-8898.wip.patch > > > FederationInterceptor#mergeAllocateResponses skips application_priority in > the returned response -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8898) Fix FederationInterceptor#allocate to set application priority in allocateResponse
[ https://issues.apache.org/jira/browse/YARN-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16684634#comment-16684634 ] Subru Krishnan commented on YARN-8898: -- Thanks [~bibinchundatt] and [~botong] for providing context. I feel the solution has 2 parts: # Save the {{ApplicationSubmissionContext}} in the _FederationStateStore_ and use it to submit UAMs. # Delegate certain APIs to _AMRMProxy_ via the Router, like we do presently for *killApplication*. So for the scope of this Jira I prefer solution 2 as: * it doesn't involve changes to the core wire protocol * it is future-proof if we require more (or different) fields in the future. [~bibinchundatt], does it make sense? I sincerely apologize for the delay as I see you already have a patch with solution 1. Also, it looks to me that only the _ApplicationSubmissionContext_ (in non-federated mode) is persisted in the _RMStateStore_, so if there's an update of an Application priority followed by RM failover, the priority will revert to the original one at submission? > Fix FederationInterceptor#allocate to set application priority in > allocateResponse > -- > > Key: YARN-8898 > URL: https://issues.apache.org/jira/browse/YARN-8898 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-8898.wip.patch > > > FederationInterceptor#mergeAllocateResponses skips application_priority in > the returned response -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-8898) Fix FederationInterceptor#allocate to set application priority in allocateResponse
[ https://issues.apache.org/jira/browse/YARN-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677499#comment-16677499 ] Subru Krishnan edited comment on YARN-8898 at 11/7/18 1:31 AM: --- [~bibinchundatt]/[~botong], thanks for working on this. I am trying to get up to speed and I have a basic question - what are the client APIs that you are referring to, which we need to support at AMRMProxy level? {quote} Initially i was under the impression that its only application priority and label, On further analysis found that we might require a few more for all client API's to work.{quote} was (Author: subru): [~bibinchundatt]/[~botong], thanks for working on this. I am trying to get up to speed and I have a basic question - what are the client APIs that you are referring to, which we need to support at AMRMProxy level? {quote} Initially i was under the impression that its only application priority and label, On further analysis found that we might require a few more for all client API's to work. > Fix FederationInterceptor#allocate to set application priority in > allocateResponse > -- > > Key: YARN-8898 > URL: https://issues.apache.org/jira/browse/YARN-8898 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Assignee: Bilwa S T >Priority: Major > > FederationInterceptor#mergeAllocateResponses skips application_priority in > the returned response -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-8898) Fix FederationInterceptor#allocate to set application priority in allocateResponse
[ https://issues.apache.org/jira/browse/YARN-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677499#comment-16677499 ] Subru Krishnan edited comment on YARN-8898 at 11/7/18 1:30 AM: --- [~bibinchundatt]/[~botong], thanks for working on this. I am trying to get up to speed and I have a basic question - what are the client APIs that you are referring to, which we need to support at AMRMProxy level? {quote} Initially i was under the impression that its only application priority and label, On further analysis found that we might require a few more for all client API's to work. was (Author: subru): [~bibinchundatt]/[~botong], thanks for working on this. I am trying to get up to speed and I have a basic question - what are the client APIs that you are referring to, which we need to support at AMRMProxy level? ?? Initially i was under the impression that its only application priority and label, On further analysis found that we might require a few more for all client API's to work.?? > Fix FederationInterceptor#allocate to set application priority in > allocateResponse > -- > > Key: YARN-8898 > URL: https://issues.apache.org/jira/browse/YARN-8898 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Assignee: Bilwa S T >Priority: Major > > FederationInterceptor#mergeAllocateResponses skips application_priority in > the returned response -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-8898) Fix FederationInterceptor#allocate to set application priority in allocateResponse
[ https://issues.apache.org/jira/browse/YARN-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677499#comment-16677499 ] Subru Krishnan edited comment on YARN-8898 at 11/7/18 1:30 AM: --- [~bibinchundatt]/[~botong], thanks for working on this. I am trying to get up to speed and I have a basic question - what are the client APIs that you are referring to, which we need to support at AMRMProxy level? {quote} Initially i was under the impression that its only application priority and label, On further analysis found that we might require a few more for all client API's to work. was (Author: subru): [~bibinchundatt]/[~botong], thanks for working on this. I am trying to get up to speed and I have a basic question - what are the client APIs that you are referring to, which we need to support at AMRMProxy level? {quote} Initially i was under the impression that its only application priority and label, On further analysis found that we might require a few more for all client API's to work. > Fix FederationInterceptor#allocate to set application priority in > allocateResponse > -- > > Key: YARN-8898 > URL: https://issues.apache.org/jira/browse/YARN-8898 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Assignee: Bilwa S T >Priority: Major > > FederationInterceptor#mergeAllocateResponses skips application_priority in > the returned response -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8898) Fix FederationInterceptor#allocate to set application priority in allocateResponse
[ https://issues.apache.org/jira/browse/YARN-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677499#comment-16677499 ] Subru Krishnan commented on YARN-8898: -- [~bibinchundatt]/[~botong], thanks for working on this. I am trying to get up to speed and I have a basic question - what are the client APIs that you are referring to, which we need to support at AMRMProxy level? ?? Initially i was under the impression that its only application priority and label, On further analysis found that we might require a few more for all client API's to work.?? > Fix FederationInterceptor#allocate to set application priority in > allocateResponse > -- > > Key: YARN-8898 > URL: https://issues.apache.org/jira/browse/YARN-8898 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Assignee: Bilwa S T >Priority: Major > > FederationInterceptor#mergeAllocateResponses skips application_priority in > the returned response -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8979) Spark on yarn job failed with yarn federation enabled
[ https://issues.apache.org/jira/browse/YARN-8979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677493#comment-16677493 ] Subru Krishnan commented on YARN-8979: -- [~shenyinjie], thanks for reporting this. This is a known issue caused by YARN-4083. We work around this by separating the client configuration (i.e. Spark, Tez, MR) from server configuration (i.e. NM, RM, Router, etc). Unfortunately this involves a code change that will require an independent conf dir for clients, which might break existing deployments (as everyone will need to clone their conf dirs), so it has never been committed. > Spark on yarn job failed with yarn federation enabled > -- > > Key: YARN-8979 > URL: https://issues.apache.org/jira/browse/YARN-8979 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0 >Reporter: Shen Yinjie >Priority: Major > > When I ran a Spark job on YARN with YARN federation enabled, the job failed > and threw an exception as shown in the attached snapshot. > PS: MR and Tez jobs are OK. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7592) yarn.federation.failover.enabled missing in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-7592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673832#comment-16673832 ] Subru Krishnan commented on YARN-7592: -- Thanks [~rahulanand90] for the clarification. Can you update the patch after removing the flag (which I should mention is great) and quickly revalidate that there's no regression? +1 from my side pending that. > yarn.federation.failover.enabled missing in yarn-default.xml > > > Key: YARN-7592 > URL: https://issues.apache.org/jira/browse/YARN-7592 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.0.0-beta1 >Reporter: Gera Shegalov >Priority: Major > Attachments: IssueReproduce.patch > > > yarn.federation.failover.enabled should be documented in yarn-default.xml. I > am also not sure why it should be true by default and force the HA retry > policy in {{RMProxy#createRMProxy}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6900) ZooKeeper based implementation of the FederationStateStore
[ https://issues.apache.org/jira/browse/YARN-6900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673830#comment-16673830 ] Subru Krishnan commented on YARN-6900: -- [~rahulanand90], I agree with you that the parameters are tricky to identify. Programmatically, what we need is a serialized conf as defined [here|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/federation/policies/manager/FederationPolicyManager.java#L101]. Manually, we could start with a key-value map where predefined keys could be router/amrmproxy weights or headroomAlpha. Thoughts? > ZooKeeper based implementation of the FederationStateStore > -- > > Key: YARN-6900 > URL: https://issues.apache.org/jira/browse/YARN-6900 > Project: Hadoop YARN > Issue Type: Sub-task > Components: federation, nodemanager, resourcemanager >Reporter: Subru Krishnan >Assignee: Íñigo Goiri >Priority: Major > Fix For: 2.9.0, 3.0.0-beta1 > > Attachments: YARN-6900-002.patch, YARN-6900-003.patch, > YARN-6900-004.patch, YARN-6900-005.patch, YARN-6900-006.patch, > YARN-6900-007.patch, YARN-6900-008.patch, YARN-6900-009.patch, > YARN-6900-010.patch, YARN-6900-011.patch, YARN-6900-YARN-2915-000.patch, > YARN-6900-YARN-2915-001.patch > > > YARN-5408 defines the unified {{FederationStateStore}} API. Currently we only > support SQL based stores; this JIRA tracks adding a ZooKeeper based > implementation to simplify deployment, as ZK is already popularly used for > {{RMStateStore}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6900) ZooKeeper based implementation of the FederationStateStore
[ https://issues.apache.org/jira/browse/YARN-6900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654036#comment-16654036 ] Subru Krishnan commented on YARN-6900: -- Thanks [~rahulanand90] and [~elgoiri] for raising this. I agree that we do need to add a tool to simplify updating parameters and YARN-3657 was created for the purpose. [~rahulanand90], any chance you are interested in working on it :)? > ZooKeeper based implementation of the FederationStateStore > -- > > Key: YARN-6900 > URL: https://issues.apache.org/jira/browse/YARN-6900 > Project: Hadoop YARN > Issue Type: Sub-task > Components: federation, nodemanager, resourcemanager >Reporter: Subru Krishnan >Assignee: Íñigo Goiri >Priority: Major > Fix For: 2.9.0, 3.0.0-beta1 > > Attachments: YARN-6900-002.patch, YARN-6900-003.patch, > YARN-6900-004.patch, YARN-6900-005.patch, YARN-6900-006.patch, > YARN-6900-007.patch, YARN-6900-008.patch, YARN-6900-009.patch, > YARN-6900-010.patch, YARN-6900-011.patch, YARN-6900-YARN-2915-000.patch, > YARN-6900-YARN-2915-001.patch > > > YARN-5408 defines the unified {{FederationStateStore}} API. Currently we only > support SQL based stores; this JIRA tracks adding a ZooKeeper based > implementation to simplify deployment, as ZK is already popularly used for > {{RMStateStore}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7592) yarn.federation.failover.enabled missing in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-7592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643949#comment-16643949 ] Subru Krishnan commented on YARN-7592: -- I want to make sure I fully understand the proposal - we will revert the changes in RMProxy and create the {{FederationClientRMProxy}} (I feel we can skip custom) directly if *yarn.federation.enabled* is set? I like the idea, can you ensure a couple of things: * This works with both HA enabled or not (for NM, router and AMRMProxy). * Assuming the above is true, can we remove the *yarn.federation.failover.enabled* flag completely? Thanks for working on this! > yarn.federation.failover.enabled missing in yarn-default.xml > > > Key: YARN-7592 > URL: https://issues.apache.org/jira/browse/YARN-7592 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.0.0-beta1 >Reporter: Gera Shegalov >Priority: Major > Attachments: IssueReproduce.patch > > > yarn.federation.failover.enabled should be documented in yarn-default.xml. I > am also not sure why it should be true by default and force the HA retry > policy in {{RMProxy#createRMProxy}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
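A toy sketch of the routing semantics being proposed, keying off *yarn.federation.enabled* alone — the provider names are illustrative strings, not a wiring of the real proxy factory:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public final class ProxyRouteSketch {

  // Illustrative only: pick the failover provider from the federation/HA
  // flags, without consulting yarn.federation.failover.enabled at all.
  static String chooseProxyProvider(Configuration conf) {
    if (conf.getBoolean(YarnConfiguration.FEDERATION_ENABLED,
        YarnConfiguration.DEFAULT_FEDERATION_ENABLED)) {
      return "FederationRMFailoverProxyProvider";
    }
    boolean haEnabled = conf.getBoolean(YarnConfiguration.RM_HA_ENABLED,
        YarnConfiguration.DEFAULT_RM_HA_ENABLED);
    return haEnabled ? "ConfiguredRMFailoverProxyProvider"
        : "DefaultNoHARMFailoverProxyProvider";
  }

  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    conf.setBoolean(YarnConfiguration.FEDERATION_ENABLED, true);
    System.out.println(chooseProxyProvider(conf));
  }
}
{code}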
[jira] [Commented] (YARN-8637) [GPG] Add FederationStateStore getAppInfo API for GlobalPolicyGenerator
[ https://issues.apache.org/jira/browse/YARN-8637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16615397#comment-16615397 ] Subru Krishnan commented on YARN-8637: -- +1 on your proposal [~botong] from my side as I feel we already have too many configs and your approach also ensures that we don't have to change the API. > [GPG] Add FederationStateStore getAppInfo API for GlobalPolicyGenerator > --- > > Key: YARN-8637 > URL: https://issues.apache.org/jira/browse/YARN-8637 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8637-YARN-7402.v1.patch > > > The core API for FederationStateStore is provided in _FederationStateStore_. > In this patch, we are adding a _FederationGPGStateStore_ API just for GPG. > Specifically, we are adding the API to get full application info from the > statestore with the starting timestamp of the app entry, so that the > _ApplicationCleaner_ (YARN-7599) in GPG can delete and cleanup old entries in > the table. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8755) Add clean up for FederationStore apps
[ https://issues.apache.org/jira/browse/YARN-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16614158#comment-16614158 ] Subru Krishnan commented on YARN-8755: -- Thanks [~bibinchundatt]! I see that [~botong] is working on addressing your feedback. I do have a request - can both of you check whether YARN-6648 needs to be updated with your comments, and also include that as part of YARN-7599? > Add clean up for FederationStore apps > - > > Key: YARN-8755 > URL: https://issues.apache.org/jira/browse/YARN-8755 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Priority: Major > > We should add cleanup logic for the applications to home cluster mapping in > the Federation state store. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7592) yarn.federation.failover.enabled missing in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-7592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16614143#comment-16614143 ] Subru Krishnan commented on YARN-7592: -- Thanks [~jira.shegalov] for raising this and [~bibinchundatt] and [~rahulanand90] for the detailed analysis. [~bibinchundatt], I agree that this is related to YARN-8434. Looks like in our test setup, we specify {{FederationRMFailoverProxyProvider}} for non-HA setup and ConfiguredRMFailoverProxyProvider for HA setup in yarn-site. Before we change Server/Client proxies, is it possible to remove the *yarn.federation.enabled* flag from yarn-site and check, as after (re)looking at the code, it may not be necessary in NMs (only in RMs)? > yarn.federation.failover.enabled missing in yarn-default.xml > > > Key: YARN-7592 > URL: https://issues.apache.org/jira/browse/YARN-7592 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.0.0-beta1 >Reporter: Gera Shegalov >Priority: Major > Attachments: IssueReproduce.patch > > > yarn.federation.failover.enabled should be documented in yarn-default.xml. I > am also not sure why it should be true by default and force the HA retry > policy in {{RMProxy#createRMProxy}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-8755) Add clean up for FederationStore apps
[ https://issues.apache.org/jira/browse/YARN-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan resolved YARN-8755. -- Resolution: Duplicate [~bibinchundatt], this should be addressed by YARN-6648 & YARN-7599. Your review of the latter will be appreciated. Thanks. > Add clean up for FederationStore apps > - > > Key: YARN-8755 > URL: https://issues.apache.org/jira/browse/YARN-8755 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Priority: Major > > We should add cleanup logic for the applications to home cluster mapping in > the Federation state store. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5597) YARN Federation improvements
[ https://issues.apache.org/jira/browse/YARN-5597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605094#comment-16605094 ] Subru Krishnan commented on YARN-5597: -- [~bibinchundatt], we use an RDBMS (SQL) for the Federation store and ZK for the RM store as 1) there's no leader election in Federation and 2) we only store metadata, for which a DB performs great and which is not what ZK is intended for (IMHO ZK has been abused/misused a lot). That said, [~elgoiri] has a deployment with ZK for both Federation and RM stores, so he should be able to guide you. > YARN Federation improvements > > > Key: YARN-5597 > URL: https://issues.apache.org/jira/browse/YARN-5597 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Subru Krishnan >Assignee: Subru Krishnan >Priority: Major > > This umbrella JIRA tracks set of improvements over the YARN Federation MVP > (YARN-2915) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7592) yarn.federation.failover.enabled missing in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-7592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605092#comment-16605092 ] Subru Krishnan commented on YARN-7592: -- [~bibinchundatt]/[~jira.shegalov], I have tested multiple times with a similar setup (for 2.9 release) and never faced any issues. FYI the FEDERATION_FAILOVER_ENABLED is automatically set by {{FederationProxyProviderUtil}} if HA is enabled as you can see [here|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/federation/failover/FederationProxyProviderUtil.java#L128]. > yarn.federation.failover.enabled missing in yarn-default.xml > > > Key: YARN-7592 > URL: https://issues.apache.org/jira/browse/YARN-7592 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.0.0-beta1 >Reporter: Gera Shegalov >Priority: Major > > yarn.federation.failover.enabled should be documented in yarn-default.xml. I > am also not sure why it should be true by default and force the HA retry > policy in {{RMProxy#createRMProxy}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8637) [GPG] Add FederationStateStore getAppInfo API for GlobalPolicyGenerator
[ https://issues.apache.org/jira/browse/YARN-8637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16574087#comment-16574087 ] Subru Krishnan commented on YARN-8637: -- Thanks [~botong] for the patch. It looks mostly good, I have a few minor comments: * Set the default to either in-memory or ZK (essentially keep it consistent with the current *FederationStateStore* configs) as a SQL dependency shouldn't be expected out of the box. Also, I don't see the need to add a value in yarn-default. * Implementation for ZK is missing. * Please add a test for the in-memory impl as well. * I feel we should use some other name than *ApplicationsInfo* (and corresponding getters/setters) as that looks too close to *AppInfo*? I am worried it may cause some confusion. > [GPG] Add FederationStateStore getAppInfo API for GlobalPolicyGenerator > --- > > Key: YARN-8637 > URL: https://issues.apache.org/jira/browse/YARN-8637 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8637-YARN-7402.v1.patch > > > The core API for FederationStateStore is provided in _FederationStateStore_. > In this patch, we are adding a _FederationGPGStateStore_ API just for GPG. > Specifically, we are adding the API to get full application info from the > statestore with the starting timestamp of the app entry, so that the > _ApplicationCleaner_ (YARN-7599) in GPG can delete and cleanup old entries in > the table. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8626) Create HomePolicyManager that sends all the requests to the home subcluster
[ https://issues.apache.org/jira/browse/YARN-8626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16570999#comment-16570999 ] Subru Krishnan commented on YARN-8626: -- Thanks [~elgoiri] for addressing my comments, +1 on the latest patch (v8) pending Yetus. > Create HomePolicyManager that sends all the requests to the home subcluster > --- > > Key: YARN-8626 > URL: https://issues.apache.org/jira/browse/YARN-8626 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Giovanni Matteo Fumarola >Assignee: Íñigo Goiri >Priority: Minor > Fix For: 3.2.0 > > Attachments: YARN-8626.000.patch, YARN-8626.001.patch, > YARN-8626.002.patch, YARN-8626.003.patch, YARN-8626.004.patch, > YARN-8626.005.patch, YARN-8626.006.patch, YARN-8626.007.patch, > YARN-8626.008.patch > > > To have the same behavior as a regular non-federated deployment, one should > be able to submit jobs to the local RM and get the job constrained to that > subcluster. > This JIRA creates an AMRMProxyPolicy that sends resources to the home > subcluster and mimics the behavior of a non-federated cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8626) Create LocalPolicyManager that sends all the requests to the home subcluster
[ https://issues.apache.org/jira/browse/YARN-8626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16570604#comment-16570604 ] Subru Krishnan commented on YARN-8626: -- Thanks [~elgoiri] for the patch. I looked at it, please find my comments below: * We generally use _home_ and not _local_. In this case I suggest replacing _local_ with either _home_ or _reflexive_? * If possible, can you remove the empty *notifyOfResponse* impls from the stateless AMRMProxyPolicies as those are now redundant. * In the {{LocalAMRMProxyPolicy}}, add a check to validate the _home SC_ is indeed active (and corresponding test). * The *FederationPolicyInitializationContext* will not have the _home SC_ set in {{LocalRouterPolicy}} as it's the responsibility of the router to do so (chicken or egg situation :)). If you have capacity reserved, then the ideal approach would be to query the _StateStore_ to figure out which SC has capacity and select that as the _home SC_. If you don't have capacity reserved, then you should use *UniformRandomRouterPolicy* directly. * Add a test for {{LocalRouterPolicy}} if it's still required based on above comment. > Create LocalPolicyManager that sends all the requests to the home subcluster > > > Key: YARN-8626 > URL: https://issues.apache.org/jira/browse/YARN-8626 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Giovanni Matteo Fumarola >Assignee: Íñigo Goiri >Priority: Minor > Fix For: 3.2.0 > > Attachments: YARN-8626.000.patch, YARN-8626.001.patch, > YARN-8626.002.patch, YARN-8626.003.patch, YARN-8626.004.patch, > YARN-8626.005.patch > > > To have the same behavior as a regular non-federated deployment, one should > be able to submit jobs to the local RM and get the job constrained to that > subcluster. > This JIRA creates an AMRMProxyPolicy that sends resources to the home > subcluster and mimics the behavior of a non-federated cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7833) [PERF/TEST] Extend SLS to support simulation of a Federated Environment
[ https://issues.apache.org/jira/browse/YARN-7833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568597#comment-16568597 ] Subru Krishnan commented on YARN-7833: -- [~tanujnay], thanks for the contribution as it's extremely useful. Unfortunately I am not familiar with the SLS codebase so hopefully [~leftnoteasy]/[~curino] have some bandwidth to take a look. > [PERF/TEST] Extend SLS to support simulation of a Federated Environment > --- > > Key: YARN-7833 > URL: https://issues.apache.org/jira/browse/YARN-7833 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Carlo Curino >Assignee: Tanuj Nayak >Priority: Major > Attachments: YARN-7833.v1.patch, YARN-7833.v2.patch, > YARN-7833.v3.patch, YARN-7833.v4.patch, YARN-7833.v5.patch, > YARN-7833.v6.patch, YARN-7833.v7.patch > > > To develop algorithms for federation, it would be of great help to have a > version of SLS that supports multiple RMs and GPG. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8434) Update federation documentation of Nodemanager configurations
[ https://issues.apache.org/jira/browse/YARN-8434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542264#comment-16542264 ] Subru Krishnan commented on YARN-8434: -- Thanks [~elgoiri] for your feedback. I agree that both the points you raised are valid and we do call out pointing clients to the {{AMRMProxy}} in the [doc|http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/Federation.html#Running_a_Sample_Job] . For the HADOOP_CLIENT_CONF, we should track it in the existing Jira - YARN-4083. [~bibinchundatt], do cherry-pick to branch-2/2.9 as well when you commit. Thanks! > Update federation documentation of Nodemanager configurations > - > > Key: YARN-8434 > URL: https://issues.apache.org/jira/browse/YARN-8434 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Minor > Attachments: YARN-8434.001.patch, YARN-8434.002.patch, > YARN-8434.003.patch > > > FederationRMFailoverProxyProvider doesn't handle connecting to active RM. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8434) Update federation documentation of Nodemanager configurations
[ https://issues.apache.org/jira/browse/YARN-8434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540428#comment-16540428 ] Subru Krishnan commented on YARN-8434: -- Thanks [~bibinchundatt] for understanding/verifying! +1 from my side on latest patch (v3). [~elgoiri], do you have any other documentation fixes before this goes in? > Update federation documentation of Nodemanager configurations > - > > Key: YARN-8434 > URL: https://issues.apache.org/jira/browse/YARN-8434 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Minor > Attachments: YARN-8434.001.patch, YARN-8434.002.patch, > YARN-8434.003.patch > > > FederationRMFailoverProxyProvider doesn't handle connecting to active RM. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8434) Nodemanager not registering to active RM in federation
[ https://issues.apache.org/jira/browse/YARN-8434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539415#comment-16539415 ] Subru Krishnan commented on YARN-8434: -- Thanks [~bibinchundatt] for the clarification, I understand the confusion now. That documentation is outdated and has to be fixed, as now we automatically set the {{FederationRMFailoverProxyProvider}} internally via {{FederationProxyProviderUtil}}, so the NM config overriding is not required. My bad, I apologize. If it works for you, can we re-purpose the Jira to fix the doc? > Nodemanager not registering to active RM in federation > -- > > Key: YARN-8434 > URL: https://issues.apache.org/jira/browse/YARN-8434 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Blocker > Attachments: YARN-8434.001.patch, YARN-8434.002.patch > > > FederationRMFailoverProxyProvider doesn't handle connecting to active RM. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8484) Fix NPE during ServiceStop in Router classes
[ https://issues.apache.org/jira/browse/YARN-8484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537794#comment-16537794 ] Subru Krishnan commented on YARN-8484: -- Thanks [~giovanni.fumarola] for the clarification, +1 from my side. > Fix NPE during ServiceStop in Router classes > > > Key: YARN-8484 > URL: https://issues.apache.org/jira/browse/YARN-8484 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Giovanni Matteo Fumarola >Assignee: Giovanni Matteo Fumarola >Priority: Minor > Attachments: YARN-8484.v1.patch > > > Fix NPE during ServiceStop in Router classes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7953) [GQ] Data structures for federation global queues calculations
[ https://issues.apache.org/jira/browse/YARN-7953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537437#comment-16537437 ] Subru Krishnan commented on YARN-7953: -- Thanks [~abmodi] for working on this and [~botong] for the review. I looked at the patch and have a quick comment - since we are fully wire compliant with YARN APIs in Federated mode, the data structures should be part of *GPG* and not *RM*. IIUC they are only used for convenience in GPG for recalculating queue hierarchies. > [GQ] Data structures for federation global queues calculations > -- > > Key: YARN-7953 > URL: https://issues.apache.org/jira/browse/YARN-7953 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Carlo Curino >Assignee: Abhishek Modi >Priority: Major > Attachments: YARN-7953-YARN-7402.v1.patch, > YARN-7953-YARN-7402.v2.patch, YARN-7953-YARN-7402.v3.patch, > YARN-7953-YARN-7402.v4.patch, YARN-7953-YARN-7402.v5.patch, > YARN-7953-YARN-7402.v6.patch, YARN-7953.v1.patch > > > This Jira tracks data structures and helper classes used by the core > algorithms of YARN-7402 umbrella Jira (currently YARN-7403, and YARN-7834). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8434) Nodemanager not registering to active RM in federation
[ https://issues.apache.org/jira/browse/YARN-8434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537402#comment-16537402 ] Subru Krishnan commented on YARN-8434: -- Thanks [~bibinchundatt] for the patch. In our deployment, we have separate config (directories itself) for client and server as this not only allows us to control client behavior independent of the server for scenarios exactly like this one, but is also more secure as the server configs are no longer leaked/shared with clients. Will a similar approach work for you? > Nodemanager not registering to active RM in federation > -- > > Key: YARN-8434 > URL: https://issues.apache.org/jira/browse/YARN-8434 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Blocker > Attachments: YARN-8434.001.patch, YARN-8434.002.patch > > > FederationRMFailoverProxyProvider doesn't handle connecting to active RM. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8484) Fix NPE during ServiceStop in Router classes
[ https://issues.apache.org/jira/browse/YARN-8484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537385#comment-16537385 ] Subru Krishnan commented on YARN-8484: -- Thanks [~giovanni.fumarola] for the quick fix. How did you validate it? Can you add a test? > Fix NPE during ServiceStop in Router classes > > > Key: YARN-8484 > URL: https://issues.apache.org/jira/browse/YARN-8484 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Giovanni Matteo Fumarola >Assignee: Giovanni Matteo Fumarola >Priority: Minor > Attachments: YARN-8484.v1.patch > > > Fix NPE during ServiceStop in Router classes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8434) Nodemanager not registering to active RM in federation
[ https://issues.apache.org/jira/browse/YARN-8434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520852#comment-16520852 ] Subru Krishnan commented on YARN-8434: -- [~bibinchundatt], thanks for reporting this. I would like to understand the context more, are you trying to use the {{FederationRMFailoverProxyProvider}} for NM - RM communication as we use {{RequestHedgingRMFailoverProxyProvider}}? We currently use {{FederationRMFailoverProxyProvider}} for AM - RM protocol. > Nodemanager not registering to active RM in federation > -- > > Key: YARN-8434 > URL: https://issues.apache.org/jira/browse/YARN-8434 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Blocker > > FederationRMFailoverProxyProvider doesn't handle connecting to active RM. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-7953) [GQ] Data structures for federation global queues calculations
[ https://issues.apache.org/jira/browse/YARN-7953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan reassigned YARN-7953: Assignee: Abhishek Modi (was: Carlo Curino) > [GQ] Data structures for federation global queues calculations > -- > > Key: YARN-7953 > URL: https://issues.apache.org/jira/browse/YARN-7953 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Carlo Curino >Assignee: Abhishek Modi >Priority: Major > Attachments: YARN-7953.v1.patch > > > This Jira tracks data structures and helper classes used by the core > algorithms of YARN-7402 umbrella Jira (currently YARN-7403, and YARN-7834). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-7405) [GQ] Bias container allocations based on global view
[ https://issues.apache.org/jira/browse/YARN-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan reassigned YARN-7405: Assignee: Abhishek Modi (was: Arun Suresh) > [GQ] Bias container allocations based on global view > > > Key: YARN-7405 > URL: https://issues.apache.org/jira/browse/YARN-7405 > Project: Hadoop YARN > Issue Type: Sub-task > Components: federation >Reporter: Carlo Curino >Assignee: Abhishek Modi >Priority: Major > > Each RM in a federation should bias its local allocations of containers based > on the global over/under utilization of queues. As part of this the local RM > should account for the work that other RMs will be doing in between the > updates we receive via the heartbeats of YARN-7404 (the mechanics used for > synchronization). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-7404) [GQ] propagate to GPG queue-level utilization/pending information
[ https://issues.apache.org/jira/browse/YARN-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan reassigned YARN-7404: Assignee: Abhishek Modi (was: Jose Miguel Arreola) > [GQ] propagate to GPG queue-level utilization/pending information > - > > Key: YARN-7404 > URL: https://issues.apache.org/jira/browse/YARN-7404 > Project: Hadoop YARN > Issue Type: Sub-task > Components: federation >Reporter: Carlo Curino >Assignee: Abhishek Modi >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7900) [AMRMProxy] AMRMClientRelayer for stateful FederationInterceptor
[ https://issues.apache.org/jira/browse/YARN-7900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16481046#comment-16481046 ] Subru Krishnan commented on YARN-7900: -- [~botong]/[~asuresh], don't we need this in branch-2 as well? > [AMRMProxy] AMRMClientRelayer for stateful FederationInterceptor > > > Key: YARN-7900 > URL: https://issues.apache.org/jira/browse/YARN-7900 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Fix For: 3.2.0 > > Attachments: YARN-7900.v1.patch, YARN-7900.v2.patch, > YARN-7900.v3.patch, YARN-7900.v4.patch, YARN-7900.v5.patch, > YARN-7900.v6.patch, YARN-7900.v7.patch, YARN-7900.v8.patch, YARN-7900.v9.patch > > > Inside the stateful FederationInterceptor (YARN-7899), we need a component > similar to AMRMClient that remembers all pending (outstanding) requests we've > sent to YarnRM, auto re-registers and does a full pending resend when YarnRM fails > over and throws ApplicationMasterNotRegisteredException back. This JIRA adds > this component as AMRMClientRelayer. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8110) AMRMProxy recover should catch for all throwable to avoid premature exit
[ https://issues.apache.org/jira/browse/YARN-8110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-8110: - Summary: AMRMProxy recover should catch for all throwable to avoid premature exit (was: AMRMProxy recover should catch for all throwable retrying to recover apps) > AMRMProxy recover should catch for all throwable to avoid premature exit > > > Key: YARN-8110 > URL: https://issues.apache.org/jira/browse/YARN-8110 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8110.v1.patch > > > In an NM work-preserving restart, when AMRMProxy recovers applications one by > one, the current catch only catches IOException. If one app's recovery throws > anything else (e.g. a RuntimeException), it will fail the entire AMRMProxy > recovery. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
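A minimal sketch of the fix described above: per-app recovery wrapped in a catch of Throwable so a single bad app cannot abort the loop (the app ids and the simulated failure are made up):
{code}
import java.util.Arrays;
import java.util.List;

public class RecoveryLoopSketch {
  public static void main(String[] args) {
    List<String> appIds = Arrays.asList("app_1", "app_2", "app_3");
    for (String appId : appIds) {
      try {
        recoverApp(appId);
      } catch (Throwable t) {
        // Catch everything, not just IOException: a RuntimeException from one
        // app's corrupt state must not fail the entire AMRMProxy recovery.
        System.err.println("Skipping recovery of " + appId + ": " + t);
      }
    }
  }

  private static void recoverApp(String appId) throws Exception {
    if ("app_2".equals(appId)) {
      throw new RuntimeException("corrupt recovery state"); // simulated failure
    }
    System.out.println("Recovered " + appId);
  }
}
{code}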
[jira] [Commented] (YARN-8010) Add config in FederationRMFailoverProxy to not bypass facade cache when failing over
[ https://issues.apache.org/jira/browse/YARN-8010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416529#comment-16416529 ] Subru Krishnan commented on YARN-8010: -- Thanks [~botong] for the contribution and [~giovanni.fumarola] for the review, I have committed this to trunk/branch-3.1/branch-2/branch-2.9. [~botong], it didn't apply cleanly to branch-3.0, so feel free to reopen and provide a patch if you want this in 3.0.2+. > Add config in FederationRMFailoverProxy to not bypass facade cache when > failing over > > > Key: YARN-8010 > URL: https://issues.apache.org/jira/browse/YARN-8010 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Minor > Attachments: YARN-8010.v1.patch, YARN-8010.v1.patch, > YARN-8010.v2.patch, YARN-8010.v3.patch > > > Today when YarnRM is failing over, the FederationRMFailoverProxy running in > AMRMProxy will perform failover, try to get the latest subcluster info from > the FederationStateStore, and then retry connecting to the latest YarnRM master. When > calling getSubCluster() on FederationStateStoreFacade, it bypasses the cache > with a flush flag. When YarnRM is failing over, every AM heartbeat thread > creates a different thread inside FederationInterceptor, each of which keeps > performing failover several times. This leads to a big spike of getSubCluster > calls to the FederationStateStore. > Depending on the cluster setup (e.g. putting a VIP before all YarnRMs), a > YarnRM master/slave change might not result in an RM addr change. In other > cases, a small delay in getting the latest subcluster information may be > acceptable. This patch thus creates a config option, so that it is possible > to ask the FederationRMFailoverProxy to not flush the cache when calling > getSubCluster(). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
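To sketch the shape of the option (the config key below is a hypothetical placeholder, not necessarily the one the patch adds):
{code}
import org.apache.hadoop.conf.Configuration;

public class FacadeCacheFlagSketch {
  // Hypothetical key name for illustration only.
  static final String FLUSH_CACHE_KEY = "yarn.federation.failover.flush-facade-cache";

  // When many AM heartbeat threads fail over at once, flushing the facade
  // cache on every lookup spikes reads against the FederationStateStore;
  // making the flush optional trades a small staleness window for that load.
  static boolean shouldFlushCache(Configuration conf) {
    return conf.getBoolean(FLUSH_CACHE_KEY, true);
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setBoolean(FLUSH_CACHE_KEY, false);
    // Inside the proxy this flag would feed the facade lookup, e.g.
    // facade.getSubCluster(subClusterId, shouldFlushCache(conf)).
    System.out.println("flushCache=" + shouldFlushCache(conf));
  }
}
{code}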
[jira] [Updated] (YARN-8010) Add config in FederationRMFailoverProxy to not bypass facade cache when failing over
[ https://issues.apache.org/jira/browse/YARN-8010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-8010: - Summary: Add config in FederationRMFailoverProxy to not bypass facade cache when failing over (was: add config in FederationRMFailoverProxy to not bypass facade cache when failing over) > Add config in FederationRMFailoverProxy to not bypass facade cache when > failing over > > > Key: YARN-8010 > URL: https://issues.apache.org/jira/browse/YARN-8010 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Minor > Attachments: YARN-8010.v1.patch, YARN-8010.v1.patch, > YARN-8010.v2.patch, YARN-8010.v3.patch > > > Today when YarnRM is failing over, the FederationRMFailoverProxy running in > AMRMProxy will perform failover, try to get the latest subcluster info from > the FederationStateStore, and then retry connecting to the latest YarnRM master. When > calling getSubCluster() on FederationStateStoreFacade, it bypasses the cache > with a flush flag. When YarnRM is failing over, every AM heartbeat thread > creates a different thread inside FederationInterceptor, each of which keeps > performing failover several times. This leads to a big spike of getSubCluster > calls to the FederationStateStore. > Depending on the cluster setup (e.g. putting a VIP before all YarnRMs), a > YarnRM master/slave change might not result in an RM addr change. In other > cases, a small delay in getting the latest subcluster information may be > acceptable. This patch thus creates a config option, so that it is possible > to ask the FederationRMFailoverProxy to not flush the cache when calling > getSubCluster(). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8010) add config in FederationRMFailoverProxy to not bypass facade cache when failing over
[ https://issues.apache.org/jira/browse/YARN-8010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16399177#comment-16399177 ] Subru Krishnan commented on YARN-8010: -- Thanks [~botong] for the patch and [~giovanni.fumarola] for reviewing it. Do you want this in trunk/branch-2 or YARN-7402? > add config in FederationRMFailoverProxy to not bypass facade cache when > failing over > > > Key: YARN-8010 > URL: https://issues.apache.org/jira/browse/YARN-8010 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Minor > Attachments: YARN-8010.v1.patch, YARN-8010.v1.patch, > YARN-8010.v2.patch, YARN-8010.v3.patch > > > Today when YarnRM is failing over, the FederationRMFailoverProxy running in > AMRMProxy will perform failover, try to get the latest subcluster info from > the FederationStateStore, and then retry connecting to the latest YarnRM master. When > calling getSubCluster() on FederationStateStoreFacade, it bypasses the cache > with a flush flag. When YarnRM is failing over, every AM heartbeat thread > creates a different thread inside FederationInterceptor, each of which keeps > performing failover several times. This leads to a big spike of getSubCluster > calls to the FederationStateStore. > Depending on the cluster setup (e.g. putting a VIP before all YarnRMs), a > YarnRM master/slave change might not result in an RM addr change. In other > cases, a small delay in getting the latest subcluster information may be > acceptable. This patch thus creates a config option, so that it is possible > to ask the FederationRMFailoverProxy to not flush the cache when calling > getSubCluster(). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-7405) [GQ] Bias container allocations based on global view
[ https://issues.apache.org/jira/browse/YARN-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan reassigned YARN-7405: Assignee: Arun Suresh (was: Subru Krishnan) > [GQ] Bias container allocations based on global view > > > Key: YARN-7405 > URL: https://issues.apache.org/jira/browse/YARN-7405 > Project: Hadoop YARN > Issue Type: Sub-task > Components: federation >Reporter: Carlo Curino >Assignee: Arun Suresh >Priority: Major > > Each RM in a federation should bias its local allocations of containers based > on the global over/under utilization of queues. As part of this the local RM > should account for the work that other RMs will be doing in between the > updates we receive via the heartbeats of YARN-7404 (the mechanics used for > synchronization). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7945) Java Doc error in UnmanagedAMPoolManager for branch-2
[ https://issues.apache.org/jira/browse/YARN-7945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370770#comment-16370770 ] Subru Krishnan commented on YARN-7945: -- [~rohithsharma]/[~jlowe], thanks for bringing it to my attention. [~jlowe], I am not sure how the import got dropped as it's in the patch and we specifically ran yetus against branch-2 successfully before committing. [~botong], do you want to provide the quick fix? > Java Doc error in UnmanagedAMPoolManager for branch-2 > - > > Key: YARN-7945 > URL: https://issues.apache.org/jira/browse/YARN-7945 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.10.0, 2.9.1 >Reporter: Rohith Sharma K S >Priority: Major > > In branch-2, I see a javadoc error while building the package. > {code} > [ERROR] > /Users/rsharmaks/Repos/Apache/Commit_Repos/branch-2/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/uam/UnmanagedAMPoolManager.java:151: > error: reference not found > [ERROR]* @see ApplicationSubmissionContext > [ERROR] ^ > [ERROR] > /Users/rsharmaks/Repos/Apache/Commit_Repos/branch-2/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/uam/UnmanagedAMPoolManager.java:204: > error: reference not found > [ERROR]* @see ApplicationSubmissionContext > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-7945) Java Doc error in UnmanagedAMPoolManager for branch-2
[ https://issues.apache.org/jira/browse/YARN-7945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370770#comment-16370770 ] Subru Krishnan edited comment on YARN-7945 at 2/21/18 12:02 AM: [~rohithsharma]/[~jlowe], thanks for bringing it to my attention. [~jlowe], I am not sure how the import got dropped as it's in the patch and we specifically ran yetus against branch-2 successfully before committing. The only likelihood is a regression caused by trying to fix an unused-import checkstyle warning at commit time. [~botong], do you want to provide the quick fix? was (Author: subru): [~rohithsharma]/[~jlowe], thanks for bringing it to my attention. [~jlowe], I am not sure how the import got dropped as it's in the patch and we specifically ran yetus against branch-2 successfully before committing. [~botong], do you want to provide the quick fix? > Java Doc error in UnmanagedAMPoolManager for branch-2 > - > > Key: YARN-7945 > URL: https://issues.apache.org/jira/browse/YARN-7945 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.10.0, 2.9.1 >Reporter: Rohith Sharma K S >Priority: Major > > In branch-2, I see a javadoc error while building the package. > {code} > [ERROR] > /Users/rsharmaks/Repos/Apache/Commit_Repos/branch-2/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/uam/UnmanagedAMPoolManager.java:151: > error: reference not found > [ERROR]* @see ApplicationSubmissionContext > [ERROR] ^ > [ERROR] > /Users/rsharmaks/Repos/Apache/Commit_Repos/branch-2/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/uam/UnmanagedAMPoolManager.java:204: > error: reference not found > [ERROR]* @see ApplicationSubmissionContext > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7934) [GQ] Refactor preemption calculators to allow overriding for Federation Global Algos
[ https://issues.apache.org/jira/browse/YARN-7934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368055#comment-16368055 ] Subru Krishnan commented on YARN-7934: -- Thanks [~curino] for updating the patch, the latest rev (v4) LGTM. The test failure looks unrelated, can you confirm? [~leftnoteasy], do you want to take a quick look before we commit? > [GQ] Refactor preemption calculators to allow overriding for Federation > Global Algos > > > Key: YARN-7934 > URL: https://issues.apache.org/jira/browse/YARN-7934 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Carlo Curino >Assignee: Carlo Curino >Priority: Major > Attachments: YARN-7934.v1.patch, YARN-7934.v2.patch, > YARN-7934.v3.patch, YARN-7934.v4.patch > > > This Jira tracks minimal changes in the capacity scheduler preemption > mechanics that allow for sub-classing and overriding of certain behaviors, > which we use to implement federation global algorithms, e.g., in YARN-7403. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7934) [GQ] Refactor preemption calculators to allow overriding for Federation Global Algos
[ https://issues.apache.org/jira/browse/YARN-7934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366351#comment-16366351 ] Subru Krishnan commented on YARN-7934: -- Thanks [~curino] for the patch; it looks fairly straightforward. I have only one nit - can you add Javadocs for the new public and protected methods (especially so that overriding expectations are clear)? Also, I don't see any consumers for the public methods; is that in a subsequent patch? > [GQ] Refactor preemption calculators to allow overriding for Federation > Global Algos > > > Key: YARN-7934 > URL: https://issues.apache.org/jira/browse/YARN-7934 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Carlo Curino >Assignee: Carlo Curino >Priority: Major > Attachments: YARN-7934.v1.patch, YARN-7934.v2.patch > > > This Jira tracks minimal changes in the capacity scheduler preemption > mechanics that allow for sub-classing and overriding of certain behaviors, > which we use to implement federation global algorithms, e.g., in YARN-7403. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7508) NPE in FiCaSchedulerApp when debug log enabled in async-scheduling mode
[ https://issues.apache.org/jira/browse/YARN-7508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-7508: - Target Version/s: (was: 3.1.0, 2.9.1) > NPE in FiCaSchedulerApp when debug log enabled in async-scheduling mode > --- > > Key: YARN-7508 > URL: https://issues.apache.org/jira/browse/YARN-7508 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha4 >Reporter: Tao Yang >Assignee: Tao Yang > Fix For: 3.1.0, 2.10.0, 2.9.1, 3.0.1 > > Attachments: YARN-7508.001.patch > > > YARN-6678 has fixed the IllegalStateException problem, but the debug log it > added may cause an NPE when trying to print the containerId of a nonexistent reserved > container on this node. Replacing > {{schedulerContainer.getSchedulerNode().getReservedContainer().getContainerId()}} > with {{schedulerContainer.getSchedulerNode().getReservedContainer()}} can > fix this problem. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7508) NPE in FiCaSchedulerApp when debug log enabled in async-scheduling mode
[ https://issues.apache.org/jira/browse/YARN-7508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-7508: - Fix Version/s: 2.9.1 > NPE in FiCaSchedulerApp when debug log enabled in async-scheduling mode > --- > > Key: YARN-7508 > URL: https://issues.apache.org/jira/browse/YARN-7508 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha4 >Reporter: Tao Yang >Assignee: Tao Yang > Fix For: 3.1.0, 2.10.0, 2.9.1, 3.0.1 > > Attachments: YARN-7508.001.patch > > > YARN-6678 has fixed the IllegalStateException problem, but the debug log it > added may cause an NPE when trying to print the containerId of a nonexistent reserved > container on this node. Replacing > {{schedulerContainer.getSchedulerNode().getReservedContainer().getContainerId()}} > with {{schedulerContainer.getSchedulerNode().getReservedContainer()}} can > fix this problem. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
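The fix described above amounts to logging the reserved container reference itself instead of dereferencing a possibly-null reservation; a minimal sketch of the guarded debug log (not the literal committed patch):
{code}
// getReservedContainer() may return null once the reservation is gone, so
// log the (possibly null) container object rather than calling
// getContainerId() on it.
if (LOG.isDebugEnabled()) {
  LOG.debug("Reserved container on node "
      + schedulerContainer.getSchedulerNode().getNodeID() + ": "
      + schedulerContainer.getSchedulerNode().getReservedContainer());
}
{code}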
[jira] [Commented] (YARN-1709) Admission Control: Reservation subsystem
[ https://issues.apache.org/jira/browse/YARN-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16302037#comment-16302037 ] Subru Krishnan commented on YARN-1709: -- [~xingbao], thanks for your interest. I have responded to you in YARN-1051 [here|https://issues.apache.org/jira/browse/YARN-1051?focusedCommentId=16302033]. > Admission Control: Reservation subsystem > > > Key: YARN-1709 > URL: https://issues.apache.org/jira/browse/YARN-1709 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Carlo Curino >Assignee: Subru Krishnan > Fix For: 2.6.0 > > Attachments: YARN-1709.patch, YARN-1709.patch, YARN-1709.patch, > YARN-1709.patch, YARN-1709.patch, YARN-1709.patch, YARN-1709.patch > > > This JIRA is about the key data structure used to track resources over time > to enable YARN-1051. The Reservation subsystem is conceptually a "plan" of > how the scheduler will allocate resources over-time. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-1051) YARN Admission Control/Planner: enhancing the resource allocation model with time.
[ https://issues.apache.org/jira/browse/YARN-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16302033#comment-16302033 ] Subru Krishnan edited comment on YARN-1051 at 12/22/17 10:42 PM: - [~xingbao], the behavior depends on whether there's any job that's using more than its guaranteed resources in the specific node and whether preemption is enabled in the cluster. If there's no job using excess resources in the specific node, then either: * relax locality to rack * wait for one of the running job AMs to release container(s) If there is at least one job which is using excess resources in the specific node, then: * If preemption is enabled (refer [here|http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html#Capacity_Scheduler_container_preemption] for how to enable it), the over-allocated container(s) will get preempted * wait for one of the running job AMs to release container(s) was (Author: subru): [~xingbao], the behavior depends on whether there's any job that's using more than it's guaranteed resources in the specific node and if preemption is enabled or not in the cluster. If there's no job using excess resources in the specific node, then either: * relax locality to rack * wait for one of the running job AMs to release container(s) If there is at least one job which is using excess resources in the specific node, then: * If you have preemption is enabled (refer [http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html#Capacity_Scheduler_container_preemption|here] on how to enable it), the over allocated container(s) will get preempted * wait for one of the running job AMs to release container(s) > YARN Admission Control/Planner: enhancing the resource allocation model with > time. > -- > > Key: YARN-1051 > URL: https://issues.apache.org/jira/browse/YARN-1051 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler, resourcemanager, scheduler >Reporter: Carlo Curino >Assignee: Carlo Curino > Fix For: 2.6.0 > > Attachments: YARN-1051-design.pdf, YARN-1051.1.patch, > YARN-1051.patch, curino_MSR-TR-2013-108.pdf, socc14-paper15.pdf, > techreport.pdf > > > In this umbrella JIRA we propose to extend the YARN RM to handle time > explicitly, allowing users to "reserve" capacity over time. This is an > important step towards SLAs, long-running services, workflows, and helps for > gang scheduling. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-1051) YARN Admission Control/Planner: enhancing the resource allocation model with time.
[ https://issues.apache.org/jira/browse/YARN-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16302033#comment-16302033 ] Subru Krishnan commented on YARN-1051: -- [~xingbao], the behavior depends on whether there's any job that's using more than its guaranteed resources in the specific node and whether preemption is enabled in the cluster. If there's no job using excess resources in the specific node, then either: * relax locality to rack * wait for one of the running job AMs to release container(s) If there is at least one job which is using excess resources in the specific node, then: * If preemption is enabled (refer [here|http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html#Capacity_Scheduler_container_preemption] for how to enable it), the over-allocated container(s) will get preempted * wait for one of the running job AMs to release container(s) > YARN Admission Control/Planner: enhancing the resource allocation model with > time. > -- > > Key: YARN-1051 > URL: https://issues.apache.org/jira/browse/YARN-1051 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler, resourcemanager, scheduler >Reporter: Carlo Curino >Assignee: Carlo Curino > Fix For: 2.6.0 > > Attachments: YARN-1051-design.pdf, YARN-1051.1.patch, > YARN-1051.patch, curino_MSR-TR-2013-108.pdf, socc14-paper15.pdf, > techreport.pdf > > > In this umbrella JIRA we propose to extend the YARN RM to handle time > explicitly, allowing users to "reserve" capacity over time. This is an > important step towards SLAs, long-running services, workflows, and helps for > gang scheduling. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
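For reference, enabling CapacityScheduler preemption programmatically looks roughly like this; the same key is normally set in yarn-site.xml, and the default monitor policy already points at the proportional capacity preemption policy:
{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Minimal sketch: turn on the scheduling monitor that drives container
// preemption (the yarn.resourcemanager.scheduler.monitor.enable key).
YarnConfiguration conf = new YarnConfiguration();
conf.setBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true);
{code}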
[jira] [Updated] (YARN-7630) Fix AMRMToken rollover handling in AMRMProxy
[ https://issues.apache.org/jira/browse/YARN-7630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-7630: - Summary: Fix AMRMToken rollover handling in AMRMProxy (was: Fix AMRMToken handling in AMRMProxy) > Fix AMRMToken rollover handling in AMRMProxy > > > Key: YARN-7630 > URL: https://issues.apache.org/jira/browse/YARN-7630 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Minor > Attachments: YARN-7630.v1.patch, YARN-7630.v1.patch > > > Symptom: after RM rolls over the master key for AMRMToken, whenever the RPC > connection from FederationInterceptor to RM breaks due to a transient network > issue and reconnects, the heartbeat to RM starts failing because of the “Invalid > AMRMToken” exception. Whenever it hits, it happens for both the home RM and > secondary RMs. > Related facts: > 1. When RM issues a new AMRMToken, it always sends it with the service name field as > an empty string. The RPC layer on the AM side will set it properly before starting to use > it. > 2. UGI keeps all tokens using a map from serviceName->Token. Initially > AMRMClientUtils.createRMProxy() is used to load the first token and start the > RM connection. > 3. When RM renews the token, YarnServerSecurityUtils.updateAMRMToken() is used > to load it into UGI and replace the existing token (with the same serviceName > key). > Bug: > The bug is that 2-AMRMClientUtils.createRMProxy() and > 3-YarnServerSecurityUtils.updateAMRMToken() are not handling the sequence > consistently. We always need to load the token (with the empty service name) into > UGI first before we set the serviceName, so that the previous AMRMToken will be > overridden. But 2 is doing it in reverse. That’s why after RM rolls the > amrmToken, the UGI ends up with two tokens. Whenever the RPC connection breaks > and reconnects, the wrong token can be picked, which triggers the > exception. > Fix: > We should load the AMRMToken into UGI first and then update the service name > field for RPC. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
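The ordering the description calls for can be sketched with Hadoop's stock security APIs; this is illustrative rather than the literal patch, and the method name is invented for the example:
{code}
import java.net.InetSocketAddress;
import org.apache.hadoop.security.SecurityUtil;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.yarn.security.AMRMTokenIdentifier;

// Sketch: add the token to the UGI while it still carries the empty service
// name the RM issued it with, so it overrides the previous AMRMToken under
// the same key; only afterwards point its service at the RM address for RPC.
void loadAMRMToken(UserGroupInformation ugi,
    Token<AMRMTokenIdentifier> amrmToken, InetSocketAddress rmAddress) {
  ugi.addToken(amrmToken);                            // 1. load/override in UGI
  SecurityUtil.setTokenService(amrmToken, rmAddress); // 2. then set service
}
{code}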
[jira] [Updated] (YARN-7652) Handle AM register requests asynchronously in FederationInterceptor
[ https://issues.apache.org/jira/browse/YARN-7652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-7652: - Description: We (cc [~goiri]/[~botong]) observed that the {{FederationInterceptor}} in {{AMRMProxy}} (and consequently the AM) is blocked if the _StateStore_ has outdated info about a _SubCluster_. This is because we handle AM register requests synchronously. This jira proposes to move to async similar to how we operate with allocate invocations. (was: We (cc [~goiri]/[~botong]) observed that the {{FederationInterceptor}} in {{AMRMProxy}} and consequently the application is blocked if the _StateStore_ has outdated info about a _SubCluster_. This is because we handle AM register requests synchronously. This jira proposes to move to async similar to how we operate with allocate invocations.) > Handle AM register requests asynchronously in FederationInterceptor > --- > > Key: YARN-7652 > URL: https://issues.apache.org/jira/browse/YARN-7652 > Project: Hadoop YARN > Issue Type: Bug > Components: amrmproxy, federation >Affects Versions: 2.9.0, 3.0.0 >Reporter: Subru Krishnan >Assignee: Botong Huang > > We (cc [~goiri]/[~botong]) observed that the {{FederationInterceptor}} in > {{AMRMProxy}} (and consequently the AM) is blocked if the _StateStore_ has > outdated info about a _SubCluster_. This is because we handle AM register > requests synchronously. This jira proposes to move to async similar to how we > operate with allocate invocations. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7642) Container execution type is not updated after promotion/demotion in NMContext
[ https://issues.apache.org/jira/browse/YARN-7642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-7642: - Target Version/s: 2.9.1, 3.0.1 > Container execution type is not updated after promotion/demotion in NMContext > - > > Key: YARN-7642 > URL: https://issues.apache.org/jira/browse/YARN-7642 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.9.0 >Reporter: Weiwei Yang >Assignee: Weiwei Yang >Priority: Critical > Attachments: YARN-7642.001.patch > > > Found this bug while working on YARN-7617. After calling API to promote a > container from OPPORTUNISTIC to GUARANTEED, node manager web page still shows > the container execution type as OPPORTUNISTIC. Looks like the container > execution type in NMContext was not updated accordingly. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7649) RMContainer state transition exception after container update
[ https://issues.apache.org/jira/browse/YARN-7649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-7649: - Target Version/s: 2.9.1, 3.0.1 > RMContainer state transition exception after container update > - > > Key: YARN-7649 > URL: https://issues.apache.org/jira/browse/YARN-7649 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.9.0 >Reporter: Weiwei Yang >Assignee: Arun Suresh > > I've been seeing this in a cluster deployment as well as in UT; running > {{TestAMRMClient#testAMRMClientWithContainerPromotion}} can reproduce it. > It doesn't fail the test case, but the following error message shows up in the > log > {noformat} > 2017-12-13 19:41:31,817 ERROR rmcontainer.RMContainerImpl > (RMContainerImpl.java:handle(480)) - Can't handle this event at current state > org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: > RELEASED at ALLOCATED > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:478) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:65) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:675) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:155) > at > org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) > at java.lang.Thread.run(Thread.java:748) > 2017-12-13 19:41:31,817 ERROR rmcontainer.RMContainerImpl > (RMContainerImpl.java:handle(481)) - Invalid event RELEASED on container > container_1513165290804_0001_01_03 > {noformat} > This seems to be related to YARN-6251. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7652) Handle AM register requests asynchronously in FederationInterceptor
Subru Krishnan created YARN-7652: Summary: Handle AM register requests asynchronously in FederationInterceptor Key: YARN-7652 URL: https://issues.apache.org/jira/browse/YARN-7652 Project: Hadoop YARN Issue Type: Bug Components: amrmproxy, federation Affects Versions: 2.9.0, 3.0.0 Reporter: Subru Krishnan Assignee: Botong Huang We (cc [~goiri]/[~botong]) observed that the {{FederationInterceptor}} in {{AMRMProxy}} and consequently the application is blocked if the _StateStore_ has outdated info about a _SubCluster_. This is because we handle AM register requests synchronously. This jira proposes to move to async similar to how we operate with allocate invocations. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
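As a rough illustration of the proposed async handling (the helper below is hypothetical, and the real change lives inside FederationInterceptor):
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative only: hand the possibly slow sub-cluster registration to a
// worker thread instead of blocking the AM's registerApplicationMaster()
// call while the StateStore entry is stale.
ExecutorService registerExecutor = Executors.newCachedThreadPool();
registerExecutor.submit(() -> registerWithSubCluster(subClusterId, request));
// The interceptor can acknowledge the AM immediately and merge this
// sub-cluster's responses into later allocate() heartbeats once it is up.
{code}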
[jira] [Updated] (YARN-7511) NPE in ContainerLocalizer when localization failed for running container
[ https://issues.apache.org/jira/browse/YARN-7511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-7511: - Target Version/s: 3.1.0, 2.9.1 > NPE in ContainerLocalizer when localization failed for running container > > > Key: YARN-7511 > URL: https://issues.apache.org/jira/browse/YARN-7511 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.0.0-alpha4, 2.9.1 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7511.001.patch > > > Error log: > {noformat} > 2017-09-30 20:14:32,839 FATAL [AsyncDispatcher event handler] > org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread > java.lang.NullPointerException > at > java.util.concurrent.ConcurrentHashMap.replaceNode(ConcurrentHashMap.java:1106) > at > java.util.concurrent.ConcurrentHashMap.remove(ConcurrentHashMap.java:1097) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceSet.resourceLocalizationFailed(ResourceSet.java:151) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ResourceLocalizationFailedWhileRunningTransition.transition(ContainerImpl.java:821) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ResourceLocalizationFailedWhileRunningTransition.transition(ContainerImpl.java:813) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1335) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:95) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1372) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1365) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:834) > 2017-09-30 20:14:32,842 INFO [AsyncDispatcher ShutDown handler] > org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye.. > {noformat} > Reproduce this problem: > 1. Container was running and ContainerManagerImpl#localize was called for > this container > 2. Localization failed in ResourceLocalizationService$LocalizerRunner#run and > sent out a ContainerResourceFailedEvent with a null LocalResourceRequest. > 3. NPE when ResourceLocalizationFailedWhileRunningTransition#transition --> > container.resourceSet.resourceLocalizationFailed(null) > I think we can fix this problem by ensuring that the request is not null > before removing it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
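The proposed fix is simply a null guard before the map removal; a sketch of its shape (the map name is an assumption based on the stack trace above, and details may differ from the actual patch):
{code}
// ConcurrentHashMap.remove(null) throws NullPointerException, so bail out
// when a failed localizer run reports no specific resource.
public void resourceLocalizationFailed(LocalResourceRequest request) {
  if (request == null) {
    return;
  }
  pendingResources.remove(request); // the map the stack trace points at
}
{code}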
[jira] [Updated] (YARN-7508) NPE in FiCaSchedulerApp when debug log enabled in async-scheduling mode
[ https://issues.apache.org/jira/browse/YARN-7508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-7508: - Target Version/s: 3.1.0, 2.9.1 > NPE in FiCaSchedulerApp when debug log enabled in async-scheduling mode > --- > > Key: YARN-7508 > URL: https://issues.apache.org/jira/browse/YARN-7508 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha4 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7508.001.patch > > > YARN-6678 has fixed the IllegalStateException problem, but the debug log it > added may cause an NPE when trying to print the containerId of a nonexistent reserved > container on this node. Replacing > {{schedulerContainer.getSchedulerNode().getReservedContainer().getContainerId()}} > with {{schedulerContainer.getSchedulerNode().getReservedContainer()}} can > fix this problem. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7591) NPE in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-7591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16284441#comment-16284441 ] Subru Krishnan commented on YARN-7591: -- Thanks [~Tao Yang] for the contribution and [~leftnoteasy] for the review/commit. [~leftnoteasy], I see the commit in trunk but not in branch-2/2.9, so are you planning to cherry-pick it down? > NPE in async-scheduling mode of CapacityScheduler > - > > Key: YARN-7591 > URL: https://issues.apache.org/jira/browse/YARN-7591 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.0.0-alpha4, 2.9.1 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Critical > Attachments: YARN-7591.001.patch, YARN-7591.002.patch > > > Currently in async-scheduling mode of CapacityScheduler, an NPE may be raised in > special scenarios as below. > (1) The user should be removed after its last application finished; an NPE may > be raised when getting something from the user object without a null check in > async-scheduling threads. > (2) An NPE may be raised when trying to fulfill a reservation for a finished > application in {{CapacityScheduler#allocateContainerOnSingleNode}}. > {code} > RMContainer reservedContainer = node.getReservedContainer(); > if (reservedContainer != null) { > FiCaSchedulerApp reservedApplication = getCurrentAttemptForContainer( > reservedContainer.getContainerId()); > // NPE here: reservedApplication could be null after this application > finished > // Try to fulfill the reservation > LOG.info( > "Trying to fulfill reservation for application " + > reservedApplication > .getApplicationId() + " on node: " + node.getNodeID()); > {code} > (3) If proposal1 (allocate containerX on node1) and proposal2 (reserve > containerY on node1) were generated by different async-scheduling threads > around the same time and proposal2 was submitted in front of proposal1, an NPE > is raised when trying to submit proposal2 in > {{FiCaSchedulerApp#commonCheckContainerAllocation}}. > {code} > if (reservedContainerOnNode != null) { > // NPE here: allocation.getAllocateFromReservedContainer() should be > null for proposal2 in this case > RMContainer fromReservedContainer = > allocation.getAllocateFromReservedContainer().getRmContainer(); > if (fromReservedContainer != reservedContainerOnNode) { > if (LOG.isDebugEnabled()) { > LOG.debug( > "Try to allocate from a non-existed reserved container"); > } > return false; > } > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
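For case (2), the shape of the fix is a null check on the looked-up attempt before using it; a sketch (not the committed patch):
{code}
// The application may finish between scheduling passes, in which case the
// attempt lookup returns null; skip fulfilling the reservation then.
FiCaSchedulerApp reservedApplication = getCurrentAttemptForContainer(
    reservedContainer.getContainerId());
if (reservedApplication == null) {
  return; // nothing to fulfill for a finished application
}
{code}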
[jira] [Updated] (YARN-6704) Add support for work preserving NM restart when FederationInterceptor is enabled in AMRMProxyService
[ https://issues.apache.org/jira/browse/YARN-6704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-6704: - Target Version/s: (was: 3.1.0, 2.9.1) > Add support for work preserving NM restart when FederationInterceptor is > enabled in AMRMProxyService > > > Key: YARN-6704 > URL: https://issues.apache.org/jira/browse/YARN-6704 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang > Fix For: 3.1.0, 2.10.0, 2.9.1 > > Attachments: YARN-6704-YARN-2915.v1.patch, > YARN-6704-YARN-2915.v2.patch, YARN-6704.v3.patch, YARN-6704.v4.patch, > YARN-6704.v5.patch, YARN-6704.v6.patch, YARN-6704.v7.patch, > YARN-6704.v8.patch, YARN-6704.v9.patch > > > YARN-1336 added the ability to restart NM without losing any running > containers. {{AMRMProxy}} restart is added in YARN-6127. In a Federated YARN > environment, there's additional state in the {{FederationInterceptor}} to > allow for spanning across multiple sub-clusters, so we need to enhance > {{FederationInterceptor}} to support work-preserving restart. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6704) Add support for work preserving NM restart when FederationInterceptor is enabled in AMRMProxyService
[ https://issues.apache.org/jira/browse/YARN-6704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-6704: - Summary: Add support for work preserving NM restart when FederationInterceptor is enabled in AMRMProxyService (was: Add Federation Interceptor restart when work preserving NM is enabled) > Add support for work preserving NM restart when FederationInterceptor is > enabled in AMRMProxyService > > > Key: YARN-6704 > URL: https://issues.apache.org/jira/browse/YARN-6704 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang > Attachments: YARN-6704-YARN-2915.v1.patch, > YARN-6704-YARN-2915.v2.patch, YARN-6704.v3.patch, YARN-6704.v4.patch, > YARN-6704.v5.patch, YARN-6704.v6.patch, YARN-6704.v7.patch, > YARN-6704.v8.patch, YARN-6704.v9.patch > > > YARN-1336 added the ability to restart NM without losing any running > containers. {{AMRMProxy}} restart is added in YARN-6127. In a Federated YARN > environment, there's additional state in the {{FederationInterceptor}} to > allow for spanning across multiple sub-clusters, so we need to enhance > {{FederationInterceptor}} to support work-preserving restart. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5871) Add support for reservation-based routing.
[ https://issues.apache.org/jira/browse/YARN-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-5871: - Parent Issue: YARN-7402 (was: YARN-5597) > Add support for reservation-based routing. > -- > > Key: YARN-5871 > URL: https://issues.apache.org/jira/browse/YARN-5871 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Affects Versions: YARN-2915 >Reporter: Carlo Curino >Assignee: Carlo Curino > Labels: federation > Attachments: YARN-5871-YARN-2915.01.patch, > YARN-5871-YARN-2915.01.patch, YARN-5871-YARN-2915.02.patch, > YARN-5871-YARN-2915.03.patch, YARN-5871-YARN-2915.04.patch > > > Adding policies that can route reservations, and that then route applications > to where the reservations have been placed. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6704) Add Federation Interceptor restart when work preserving NM is enabled
[ https://issues.apache.org/jira/browse/YARN-6704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16279354#comment-16279354 ] Subru Krishnan commented on YARN-6704: -- Thanks [~botong] for updating the patch and for the clarification. {code}I've changed the UAM token storage to use local NMSS instead when AMRMProxy HA is not enabled. {code} Can you update the test to assert for the above and we are good to go! > Add Federation Interceptor restart when work preserving NM is enabled > - > > Key: YARN-6704 > URL: https://issues.apache.org/jira/browse/YARN-6704 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang > Attachments: YARN-6704-YARN-2915.v1.patch, > YARN-6704-YARN-2915.v2.patch, YARN-6704.v3.patch, YARN-6704.v4.patch, > YARN-6704.v5.patch, YARN-6704.v6.patch, YARN-6704.v7.patch > > > YARN-1336 added the ability to restart NM without losing any running > containers. {{AMRMProxy}} restart is added in YARN-6127. In a Federated YARN > environment, there's additional state in the {{FederationInterceptor}} to > allow for spanning across multiple sub-clusters, so we need to enhance > {{FederationInterceptor}} to support work-preserving restart. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7591) NPE in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-7591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-7591: - Target Version/s: 2.9.1 > NPE in async-scheduling mode of CapacityScheduler > - > > Key: YARN-7591 > URL: https://issues.apache.org/jira/browse/YARN-7591 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.0.0-alpha4, 2.9.1 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Critical > Attachments: YARN-7591.001.patch > > > Currently in async-scheduling mode of CapacityScheduler, an NPE may be raised in > special scenarios as below. > (1) The user should be removed after its last application finished; an NPE may > be raised when getting something from the user object without a null check in > async-scheduling threads. > (2) An NPE may be raised when trying to fulfill a reservation for a finished > application in {{CapacityScheduler#allocateContainerOnSingleNode}}. > {code} > RMContainer reservedContainer = node.getReservedContainer(); > if (reservedContainer != null) { > FiCaSchedulerApp reservedApplication = getCurrentAttemptForContainer( > reservedContainer.getContainerId()); > // NPE here: reservedApplication could be null after this application > finished > // Try to fulfill the reservation > LOG.info( > "Trying to fulfill reservation for application " + > reservedApplication > .getApplicationId() + " on node: " + node.getNodeID()); > {code} > (3) If proposal1 (allocate containerX on node1) and proposal2 (reserve > containerY on node1) were generated by different async-scheduling threads > around the same time and proposal2 was submitted in front of proposal1, an NPE > is raised when trying to submit proposal2 in > {{FiCaSchedulerApp#commonCheckContainerAllocation}}. > {code} > if (reservedContainerOnNode != null) { > // NPE here: allocation.getAllocateFromReservedContainer() should be > null for proposal2 in this case > RMContainer fromReservedContainer = > allocation.getAllocateFromReservedContainer().getRmContainer(); > if (fromReservedContainer != reservedContainerOnNode) { > if (LOG.isDebugEnabled()) { > LOG.debug( > "Try to allocate from a non-existed reserved container"); > } > return false; > } > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7509) AsyncScheduleThread and ResourceCommitterService are still running after RM is transitioned to standby
[ https://issues.apache.org/jira/browse/YARN-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267425#comment-16267425 ] Subru Krishnan commented on YARN-7509: -- [~leftnoteasy], the fix version says 2.9.1 but it has not been cherry-picked to branch-2.9. Can you go ahead and do that? Thanks. > AsyncScheduleThread and ResourceCommitterService are still running after RM > is transitioned to standby > -- > > Key: YARN-7509 > URL: https://issues.apache.org/jira/browse/YARN-7509 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0-alpha4, 2.9.1 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Critical > Fix For: 3.1.0, 2.9.1, 3.0.1 > > Attachments: YARN-7509.001.patch > > > After RM is transitioned to standby, AsyncScheduleThread and > ResourceCommitterService will receive an interrupt signal. When the thread is sleeping, it will ignore the interrupt signal since the InterruptedException is caught inside and the interrupt signal is cleared. > For AsyncScheduleThread, the InterruptedException was caught and ignored in > CapacityScheduler#schedule. > For ResourceCommitterService, the InterruptedException was caught inside and > ignored in ResourceCommitterService#run. > We should let the interrupt signal out and make these threads exit. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7548) TestCapacityOverTimePolicy.testAllocation is flaky
[ https://issues.apache.org/jira/browse/YARN-7548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16261646#comment-16261646 ] Subru Krishnan commented on YARN-7548: -- Thanks [~haibo.chen] for reporting this. Adding [~curino] as he wrote the cool logic to generate allocations in tests :). > TestCapacityOverTimePolicy.testAllocation is flaky > -- > > Key: YARN-7548 > URL: https://issues.apache.org/jira/browse/YARN-7548 > Project: Hadoop YARN > Issue Type: Bug > Components: reservation system >Affects Versions: 3.0.0-beta1 >Reporter: Haibo Chen > > It failed in both YARN-7337 and YARN-6921 jenkins jobs. > org.apache.hadoop.yarn.server.resourcemanager.reservation.TestCapacityOverTimePolicy.testAllocation[Duration > 90,000,000, height 0.25, numSubmission 1, periodic 8640)] > *Stacktrace* > junit.framework.AssertionFailedError: null > at junit.framework.Assert.fail(Assert.java:55) > at junit.framework.Assert.fail(Assert.java:64) > at junit.framework.TestCase.fail(TestCase.java:235) > at > org.apache.hadoop.yarn.server.resourcemanager.reservation.BaseSharingPolicyTest.runTest(BaseSharingPolicyTest.java:146) > at > org.apache.hadoop.yarn.server.resourcemanager.reservation.TestCapacityOverTimePolicy.testAllocation(TestCapacityOverTimePolicy.java:136) > *Standard Output* > 2017-11-20 23:57:03,759 INFO [main] recovery.RMStateStore > (RMStateStore.java:transition(538)) - Storing reservation > allocation.reservation_-9026698577416205920_6337917439559340517 > 2017-11-20 23:57:03,759 INFO [main] recovery.RMStateStore > (MemoryRMStateStore.java:storeReservationState(247)) - Storing > reservationallocation for > reservation_-9026698577416205920_6337917439559340517 for plan dedicated > 2017-11-20 23:57:03,760 INFO [main] reservation.InMemoryPlan > (InMemoryPlan.java:addReservation(373)) - Successfully added reservation: > reservation_-9026698577416205920_6337917439559340517 to plan. > In-memory Plan: Parent Queue: dedicatedTotal Capacity: vCores:1000>Step: 1000reservation_-9026698577416205920_6337917439559340517 > user:u1 startTime: 0 endTime: 8640 Periodiciy: 8640 alloc: > [Period: 8640 > 0: > 3423748: > 86223748: > 8640: > 9223372036854775807: null > ] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7551) yarn.resourcemanager.reservation-system.max-periodicity is not in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16261641#comment-16261641 ] Subru Krishnan commented on YARN-7551: -- Generally, we have been following the practice of exposing only what we consider as core configs in yarn-default. All advanced configs, we skip as I feel that we have way too many knobs in the first place. > yarn.resourcemanager.reservation-system.max-periodicity is not in > yarn-default.xml > -- > > Key: YARN-7551 > URL: https://issues.apache.org/jira/browse/YARN-7551 > Project: Hadoop YARN > Issue Type: Bug > Components: reservation system >Affects Versions: 3.0.0 >Reporter: Daniel Templeton >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6704) Add Federation Interceptor restart when work preserving NM is enabled
[ https://issues.apache.org/jira/browse/YARN-6704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16261588#comment-16261588 ] Subru Krishnan commented on YARN-6704: -- Thanks [~botong] for updating the patch. I went through it and have a few minor comments: * Where are we cleaning up the registry/NMSS entries? This should be done when the AM completes and should be covered in tests. * Can we proceed with recovery for secondaries (even though it is not work-preserving) when there's no registry set up, as we now have HA and restart intermixed? * Nit: move the log level to debug for recovery of individual containers in the home SC. > Add Federation Interceptor restart when work preserving NM is enabled > - > > Key: YARN-6704 > URL: https://issues.apache.org/jira/browse/YARN-6704 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang > Attachments: YARN-6704-YARN-2915.v1.patch, > YARN-6704-YARN-2915.v2.patch, YARN-6704.v3.patch, YARN-6704.v4.patch, > YARN-6704.v5.patch > > > YARN-1336 added the ability to restart NM without losing any running > containers. {{AMRMProxy}} restart is added in YARN-6127. In a Federated YARN > environment, there's additional state in the {{FederationInterceptor}} to > allow for spanning across multiple sub-clusters, so we need to enhance > {{FederationInterceptor}} to support work-preserving restart. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7390) All reservation related test cases failed when TestYarnClient runs against Fair Scheduler.
[ https://issues.apache.org/jira/browse/YARN-7390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259991#comment-16259991 ] Subru Krishnan commented on YARN-7390: -- [~yufeigu]/[~haibo.chen], thanks for fixing this. Shouldn't it be included in branch-2/2.9 as you have 2.9.0 in the affected versions? It will be great if you can run the test against branch-2/2.9 before pushing. Thanks! > All reservation related test cases failed when TestYarnClient runs against > Fair Scheduler. > -- > > Key: YARN-7390 > URL: https://issues.apache.org/jira/browse/YARN-7390 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler, reservation system >Affects Versions: 2.9.0, 3.0.0, 3.1.0 >Reporter: Yufei Gu >Assignee: Yufei Gu > Fix For: 3.0.1 > > Attachments: YARN-7390.001.patch, YARN-7390.002.patch, > YARN-7390.003.patch, YARN-7390.004.patch, YARN-7390.005.patch > > > All reservation related test cases failed when {{TestYarnClient}} runs > against Fair Scheduler. To reproduce it, you need to set the scheduler class to > Fair Scheduler in yarn-default.xml. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6645) Bug fix in ContainerImpl when calling the symLink of LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-6645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-6645: - Fix Version/s: (was: 2.9.0) 2.9.1 > Bug fix in ContainerImpl when calling the symLink of LinuxContainerExecutor > --- > > Key: YARN-6645 > URL: https://issues.apache.org/jira/browse/YARN-6645 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Bingxue Qiu > Fix For: 2.9.1 > > Attachments: error when creating symlink.png > > > When creating a symlink after the resource is localized in our clusters, an > IOException is thrown because the nmPrivateDir doesn't exist. We add a > patch to fix it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7278) LinuxContainer in docker mode will be failed when nodemanager restart, because timeout for docker is too slow.
[ https://issues.apache.org/jira/browse/YARN-7278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-7278: - Fix Version/s: (was: 2.9.0) 2.9.1 > LinuxContainer in docker mode will be failed when nodemanager restart, > because timeout for docker is too slow. > -- > > Key: YARN-7278 > URL: https://issues.apache.org/jira/browse/YARN-7278 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.8.0 > Environment: CentOS >Reporter: zhengchenyu > Fix For: 2.9.1 > > Original Estimate: 1m > Remaining Estimate: 1m > > In our cluster, nodemanager recovery is turned on, and we use LinuxContainer > in docker mode. > Containers may fail when the nodemanager restarts; the exception is below: > {code} > [2017-09-29T15:47:14.433+08:00] [INFO] > containermanager.monitor.ContainersMonitorImpl.run(ContainersMonitorImpl.java > 472) [Container Monitor] : Memory usage of ProcessTree 120523 for > container-id container_1506600355508_0023_01_04: -1B of 10 GB physical > memory used; -1B of 31 GB virtual memory used > [2017-09-29T15:47:15.219+08:00] [ERROR] > containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java > 93) [ContainersLauncher #1] : Unable to recover container > container_1506600355508_0023_01_04 > java.io.IOException: Timeout while waiting for exit code from > container_1506600355508_0023_01_04 > [2017-09-29T15:47:15.220+08:00] [INFO] > containermanager.container.ContainerImpl.handle(ContainerImpl.java 1142) > [AsyncDispatcher event handler] : Container > container_1506600355508_0023_01_04 transitioned from RUNNING to > EXITED_WITH_FAILURE > [2017-09-29T15:47:15.221+08:00] [INFO] > containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java > 440) [AsyncDispatcher event handler] : Cleaning up container > container_1506600355508_0023_01_04 > {code} > I guess the process is done, but 2 seconds later (the variable is msecLeft), > the *.pid.exitcode wasn't created. Then I changed the variable to 2ms, and the > container succeeded when the nodemanager restarted. > So I think the timeout is too short for the docker container to complete the work. > In docker mode of LinuxContainer, the NM monitors the real task, which is launched > by the "docker run" command. Then the "docker wait" command waits for the exit code, > then "docker rm" deletes the docker container. Lastly, container-executor > writes the exit code. So if some docker command is slow enough, the NM > can't monitor the container. In fact, docker rm is always slow. > I think the exit code of docker rm doesn't matter to the real task, so I > think we could move the operation of writing "*.pid.exitcode" before the > docker rm command. Or monitor the docker wait process, but not the real task. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6606) The implementation of LocalizationStatus in ContainerStatusProto
[ https://issues.apache.org/jira/browse/YARN-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-6606: - Fix Version/s: (was: 2.9.0) 2.9.1 > The implementation of LocalizationStatus in ContainerStatusProto > > > Key: YARN-6606 > URL: https://issues.apache.org/jira/browse/YARN-6606 > Project: Hadoop YARN > Issue Type: Task > Components: nodemanager >Affects Versions: 2.9.0 >Reporter: Bingxue Qiu > Fix For: 2.9.1 > > Attachments: YARN-6606.1.patch, YARN-6606.2.patch > > > We have a use case where the full implementation of localization status in > ContainerStatusProto > ([Continuous-resource-localization|https://issues.apache.org/jira/secure/attachment/12825041/Continuous-resource-localization.pdf]) > needs to be done, so we made it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6661) Too much CLEANUP event hang ApplicationMasterLauncher thread pool
[ https://issues.apache.org/jira/browse/YARN-6661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-6661: - Fix Version/s: (was: 2.9.0) 2.9.1 > Too much CLEANUP event hang ApplicationMasterLauncher thread pool > - > > Key: YARN-6661 > URL: https://issues.apache.org/jira/browse/YARN-6661 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.2 > Environment: hadoop 2.7.2 >Reporter: JackZhou > Fix For: 2.9.1 > > > Someone else has already come up with a similar problem and fixed it. > See the jira (https://issues.apache.org/jira/browse/YARN-3809) for details. > But I think the fix has not solved the problem completely; below is the problem I encountered: > There are about 1000 nodes in my hadoop cluster, and I submitted about 1800 apps. > I failed over my active RM, and the RM will fail over all those 1800 apps. > When an application fails over, it will wait for the AM container to register itself. > But there is a bug in my AM (I did it intentionally), and it will not register > itself. > So the RM will wait about 10 minutes for the AM expiration, and it will send a CLEANUP event to the > ApplicationMasterLauncher thread pool. Because there are about 1800 apps, this will hang the ApplicationMasterLauncher > thread pool for a long time. I have already applied the > patch (https://issues.apache.org/jira/secure/attachment/12740804/YARN-3809.03.patch), so > a CLEANUP event will hang a thread for 10 * 20 = 200s. But I have 1800 apps, so each of my threads will > hang 1800 / 50 * 200s = 7200s = 2 hours. > Because the AM has not registered itself within 10 minutes, the RM will retry and create a new application attempt. > The application attempt will accept a container from the RM and send a LAUNCH event to the > ApplicationMasterLauncher thread pool. > Because the 1800 CLEANUP events will hang the 50-thread pool for about 2 hours, the application attempt will not > start the AM container within 10 minutes. > So it will expire and send a CLEANUP event to the ApplicationMasterLauncher > thread pool too. > As you can see, none of my applications can actually run. > Each of them has 5 application attempts as follows, and each of them keeps retrying. > appattempt_1495786030132_4000_05 > appattempt_1495786030132_4000_04 > appattempt_1495786030132_4000_03 > appattempt_1495786030132_4000_02 > appattempt_1495786030132_4000_01 > So all of my apps have hung for several hours, and none of them can really run. > I think this is a bug!!! We can treat CLEANUP and LAUNCH as different events, > and use a separate thread pool to deal with LAUNCH events, or take some other approach. > Sorry, my English is poor; I am not sure I have described it clearly. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
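The suggestion boils down to isolating LAUNCH from CLEANUP so slow cleanups cannot starve launches; a minimal hedged sketch with two pools (sizes are examples, and the Runnable stands in for ApplicationMasterLauncher's per-event launcher):
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncherEvent;
import org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncherEventType;

public class SplitLauncherPoolsSketch {
  // Separate pools so a flood of slow CLEANUP events cannot block LAUNCH.
  private final ExecutorService launchPool = Executors.newFixedThreadPool(50);
  private final ExecutorService cleanupPool = Executors.newFixedThreadPool(50);

  public void handle(AMLauncherEvent event, Runnable launcher) {
    if (event.getType() == AMLauncherEventType.LAUNCH) {
      launchPool.execute(launcher);
    } else {
      cleanupPool.execute(launcher);
    }
  }
}
{code}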
[jira] [Updated] (YARN-7190) Ensure only NM classpath in 2.x gets TSv2 related hbase jars, not the user classpath
[ https://issues.apache.org/jira/browse/YARN-7190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-7190: - Fix Version/s: (was: 2.9.0) 2.9.1 > Ensure only NM classpath in 2.x gets TSv2 related hbase jars, not the user > classpath > > > Key: YARN-7190 > URL: https://issues.apache.org/jira/browse/YARN-7190 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineclient, timelinereader, timelineserver >Reporter: Vrushali C >Assignee: Varun Saxena > Fix For: YARN-5355_branch2, 2.9.1 > > Attachments: YARN-7190-YARN-5355_branch2.01.patch, > YARN-7190-YARN-5355_branch2.02.patch, YARN-7190-YARN-5355_branch2.03.patch, > YARN-7190.01.patch > > > [~jlowe] had a good observation about the user classpath getting extra jars > in hadoop 2.x brought in with TSv2. If users start picking up Hadoop 2.x's > version of HBase jars instead of the ones they shipped with their job, it > could be a problem. > So when TSv2 is to be used in 2.x, the hbase related jars should go only into > the NM classpath, not the user classpath. > Here is a list of some jars > {code} > commons-csv-1.0.jar > commons-el-1.0.jar > commons-httpclient-3.1.jar > disruptor-3.3.0.jar > findbugs-annotations-1.3.9-1.jar > hbase-annotations-1.2.6.jar > hbase-client-1.2.6.jar > hbase-common-1.2.6.jar > hbase-hadoop2-compat-1.2.6.jar > hbase-hadoop-compat-1.2.6.jar > hbase-prefix-tree-1.2.6.jar > hbase-procedure-1.2.6.jar > hbase-protocol-1.2.6.jar > hbase-server-1.2.6.jar > htrace-core-3.1.0-incubating.jar > jamon-runtime-2.4.1.jar > jasper-compiler-5.5.23.jar > jasper-runtime-5.5.23.jar > jcodings-1.0.8.jar > joni-2.1.2.jar > jsp-2.1-6.1.14.jar > jsp-api-2.1-6.1.14.jar > jsr311-api-1.1.1.jar > metrics-core-2.2.0.jar > servlet-api-2.5-6.1.14.jar > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6128) Add support for AMRMProxy HA
[ https://issues.apache.org/jira/browse/YARN-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257858#comment-16257858 ] Subru Krishnan commented on YARN-6128: -- Thanks [~botong]. I have committed to trunk but couldn't cherry-pick cleanly to branch-2; can you please provide a branch-2 patch? > Add support for AMRMProxy HA > > > Key: YARN-6128 > URL: https://issues.apache.org/jira/browse/YARN-6128 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, nodemanager >Reporter: Subru Krishnan >Assignee: Botong Huang > Attachments: YARN-6128.v0.patch, YARN-6128.v1.patch, > YARN-6128.v1.patch, YARN-6128.v10.patch, YARN-6128.v10.patch, > YARN-6128.v2.patch, YARN-6128.v3.patch, YARN-6128.v3.patch, > YARN-6128.v4.patch, YARN-6128.v5.patch, YARN-6128.v6.patch, > YARN-6128.v7.patch, YARN-6128.v8.patch, YARN-6128.v9.patch > > > YARN-556 added the ability for RM failover without losing any running > applications. In a Federated YARN environment, there's additional state in > the {{AMRMProxy}} to allow for spanning across multiple sub-clusters, so we > need to enhance {{AMRMProxy}} to support HA. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5075) Fix findbugs warning in hadoop-yarn-common module
[ https://issues.apache.org/jira/browse/YARN-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-5075: - Fix Version/s: (was: 3.0.0) 3.0.0-alpha1 > Fix findbugs warning in hadoop-yarn-common module > - > > Key: YARN-5075 > URL: https://issues.apache.org/jira/browse/YARN-5075 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Akira Ajisaka >Assignee: Arun Suresh > Fix For: 2.9.0, 3.0.0-alpha1 > > Attachments: YARN-5075.001.patch, YARN-5075.002.patch, findbugs.html > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5049) Extend NMStateStore to save queued container information
[ https://issues.apache.org/jira/browse/YARN-5049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-5049: - Fix Version/s: (was: 3.0.0) 3.0.0-alpha1 > Extend NMStateStore to save queued container information > > > Key: YARN-5049 > URL: https://issues.apache.org/jira/browse/YARN-5049 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Konstantinos Karanasos >Assignee: Arun Suresh > Fix For: 2.9.0, 3.0.0-alpha1 > > Attachments: YARN-5049-addendum.branch-2.001.patch, > YARN-5049.001.patch, YARN-5049.002.patch, YARN-5049.003.patch > > > This JIRA is about extending the NMStateStore to save queued container > information whenever a new container is added to the NM queue. > It also removes the information from the state store when the queued > container starts its execution. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
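The shape of the YARN-5049 extension is roughly as follows; the method names are a sketch and may not match the committed patch exactly:
{code}
// Sketch of the NMStateStore extension described above (illustrative names).
import java.io.IOException;
import org.apache.hadoop.yarn.api.records.ContainerId;

public interface QueuedContainerStateStore {
  /** Record that a container has been added to the NM queue. */
  void storeContainerQueued(ContainerId containerId) throws IOException;

  /** Drop the queued marker once the container starts executing. */
  void removeContainerQueued(ContainerId containerId) throws IOException;
}
{code}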
[jira] [Updated] (YARN-5073) Refactor startContainerInternal() in ContainerManager to remove unused parameter
[ https://issues.apache.org/jira/browse/YARN-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-5073: - Fix Version/s: (was: 3.0.0) 3.0.0-alpha1 > Refactor startContainerInternal() in ContainerManager to remove unused > parameter > > > Key: YARN-5073 > URL: https://issues.apache.org/jira/browse/YARN-5073 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Konstantinos Karanasos >Assignee: Konstantinos Karanasos >Priority: Minor > Fix For: 2.9.0, 3.0.0-alpha1 > > Attachments: YARN-5073.001.patch > > > The nmTokenIdentifier is no longer needed as a parameter in the > startContainerInternal() method of the ContainerManager. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-4412) Create ClusterMonitor to compute ordered list of preferred NMs for OPPORTUNISTIC containers
[ https://issues.apache.org/jira/browse/YARN-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-4412: - Fix Version/s: (was: 3.0.0) 3.0.0-alpha1 > Create ClusterMonitor to compute ordered list of preferred NMs for > OPPORTUNISTIC containers > -- > > Key: YARN-4412 > URL: https://issues.apache.org/jira/browse/YARN-4412 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Arun Suresh >Assignee: Arun Suresh > Fix For: 2.9.0, 3.0.0-alpha1 > > Attachments: YARN-4412-yarn-2877.v1.patch, > YARN-4412-yarn-2877.v2.patch, YARN-4412-yarn-2877.v3.patch, > YARN-4412-yarn-2877.v4.patch, YARN-4412-yarn-2877.v5.patch, > YARN-4412-yarn-2877.v6.patch, YARN-4412.007.patch, YARN-4412.008.patch, > YARN-4412.009.patch, YARN-4412.addendum-001.patch, YARN-4412.find-bugs.patch > > > Introduce a Cluster Monitor that aggregates load information from individual > Node Managers and computes an ordered list of preferred Node Managers to be > used as target Nodes for OPPORTUNISTIC container allocations. > This list can be pushed out to the Node Manager (specifically the AMRMProxy > running on the Node) via the Allocate Response. This will be used to make > local scheduling decisions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
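The ordering step of YARN-4412 can be illustrated as follows (illustrative code, not the patch): given a per-node load signal such as the number of queued OPPORTUNISTIC containers, emit a least-loaded-first list:
{code}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Illustrative only: rank candidate nodes least-loaded first.
public class NodeRanker {
  /** @param queueLength node id -> number of queued OPPORTUNISTIC containers */
  public static List<String> preferredNodes(Map<String, Integer> queueLength) {
    List<String> nodes = new ArrayList<>(queueLength.keySet());
    nodes.sort(Comparator.comparingInt(queueLength::get));
    return nodes;
  }
}
{code}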
[jira] [Updated] (YARN-4738) Notify the RM about the status of OPPORTUNISTIC containers
[ https://issues.apache.org/jira/browse/YARN-4738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-4738: - Fix Version/s: (was: 3.0.0) 3.0.0-alpha1 > Notify the RM about the status of OPPORTUNISTIC containers > -- > > Key: YARN-4738 > URL: https://issues.apache.org/jira/browse/YARN-4738 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Konstantinos Karanasos >Assignee: Konstantinos Karanasos > Fix For: 2.9.0, 3.0.0-alpha1 > > Attachments: YARN-4738-yarn-2877.001.patch, > YARN-4738-yarn-2877.002.patch, YARN-4738.002.patch, YARN-4738.003.patch, > YARN-4738.004.patch, YARN-4738.005.patch, YARN-4738.006.patch > > > When an OPPORTUNISTIC container finishes its execution (either successfully > or because it failed/got killed), the RM needs to be notified. > This way the AM in turn also gets notified about successfully completed > tasks, as well as about failed/killed tasks so that it can reschedule them. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-4991) Fix ContainerRequest Constructor to set nodelabelExpression correctly
[ https://issues.apache.org/jira/browse/YARN-4991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-4991: - Fix Version/s: (was: 3.0.0) 3.0.0-alpha1 > Fix ContainerRequest Constructor to set nodelabelExpression correctly > - > > Key: YARN-4991 > URL: https://issues.apache.org/jira/browse/YARN-4991 > Project: Hadoop YARN > Issue Type: Sub-task > Components: test >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Fix For: 2.9.0, 3.0.0-alpha1 > > Attachments: 0001-YARN-4991.patch > > > TestAMRMClient#testAskWithInvalidNodeLabels and > TestAMRMClient#testAskWithNodeLabels are failing because > {{ContainerRequest}} labels are always set to {{null}}: > {code} > public ContainerRequest(Resource capability, String[] nodes, String[] > racks, > Priority priority, boolean relaxLocality, String > nodeLabelsExpression) { > this(capability, nodes, racks, priority, relaxLocality, null, > ExecutionType.GUARANTEED); > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
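The presumed fix, mirroring the constructor quoted above, is to forward the expression instead of {{null}} (a sketch; the committed patch may differ in detail):
{code}
// Sketch of the fix: pass nodeLabelsExpression through instead of null.
public ContainerRequest(Resource capability, String[] nodes, String[] racks,
    Priority priority, boolean relaxLocality, String nodeLabelsExpression) {
  this(capability, nodes, racks, priority, relaxLocality,
      nodeLabelsExpression, ExecutionType.GUARANTEED);
}
{code}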
[jira] [Updated] (YARN-2995) Enhance UI to show cluster resource utilization of various container Execution types
[ https://issues.apache.org/jira/browse/YARN-2995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-2995: - Fix Version/s: (was: 3.0.0) 3.0.0-alpha2 > Enhance UI to show cluster resource utilization of various container > Execution types > > > Key: YARN-2995 > URL: https://issues.apache.org/jira/browse/YARN-2995 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Sriram Rao >Assignee: Konstantinos Karanasos >Priority: Blocker > Fix For: 2.9.0, 3.0.0-alpha2 > > Attachments: YARN-2995.001.patch, YARN-2995.002.patch, > YARN-2995.003.patch, YARN-2995.004.patch, all-nodes.png, all-nodes.png, > opp-container.png > > > This JIRA proposes to extend the Resource manager UI to show how cluster > resources are being used to run *guaranteed start* and *queueable* > containers. For example, a graph that shows over time, the fraction of > running containers that are *guaranteed start* and the fraction of running > containers that are *queueable*. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-4335) Allow ResourceRequests to specify ExecutionType of a request ask
[ https://issues.apache.org/jira/browse/YARN-4335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-4335: - Fix Version/s: (was: 3.0.0) 3.0.0-alpha1 > Allow ResourceRequests to specify ExecutionType of a request ask > > > Key: YARN-4335 > URL: https://issues.apache.org/jira/browse/YARN-4335 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Konstantinos Karanasos >Assignee: Konstantinos Karanasos > Fix For: 2.9.0, 3.0.0-alpha1 > > Attachments: YARN-4335-yarn-2877.001.patch, YARN-4335.002.patch, > YARN-4335.003.patch > > > YARN-2882 introduced container types that are internal (not user-facing) and > are used by the ContainerManager during execution at the NM. > With this JIRA we are introducing (user-facing) resource request types that > are used by the AM to specify the type of the ResourceRequest. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
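For illustration, an AM-side ask could be marked opportunistic roughly like this; the exact plumbing into ResourceRequest/ContainerRequest varies by release, so treat this as a sketch:
{code}
import org.apache.hadoop.yarn.api.records.ExecutionType;
import org.apache.hadoop.yarn.api.records.ExecutionTypeRequest;

public class OpportunisticAsk {
  // true = enforce the execution type rather than treating it as a hint
  public static ExecutionTypeRequest opportunistic() {
    return ExecutionTypeRequest.newInstance(ExecutionType.OPPORTUNISTIC, true);
  }
}
{code}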
[jira] [Updated] (YARN-2888) Corrective mechanisms for rebalancing NM container queues
[ https://issues.apache.org/jira/browse/YARN-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-2888: - Fix Version/s: (was: 3.0.0) 3.0.0-alpha1 > Corrective mechanisms for rebalancing NM container queues > - > > Key: YARN-2888 > URL: https://issues.apache.org/jira/browse/YARN-2888 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Konstantinos Karanasos >Assignee: Arun Suresh > Fix For: 2.9.0, 3.0.0-alpha1 > > Attachments: YARN-2888-yarn-2877.001.patch, > YARN-2888-yarn-2877.002.patch, YARN-2888.003.patch, YARN-2888.004.patch, > YARN-2888.005.patch, YARN-2888.006.patch, YARN-2888.007.patch, > YARN-2888.008.patch, YARN-2888.009.patch, YARN-2888.010.patch, > YARN-2888.011.patch > > > Bad queuing decisions by the LocalRMs (e.g., due to the distributed nature of > the scheduling decisions or due to having a stale image of the system) may > lead to an imbalance in the waiting times of the NM container queues. This > can in turn have an impact on job execution times and cluster utilization. > To this end, we introduce corrective mechanisms that may remove (whenever > needed) container requests from overloaded queues, adding them to less-loaded > ones. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-2885) Create AMRMProxy request interceptor for distributed scheduling decisions for queueable containers
[ https://issues.apache.org/jira/browse/YARN-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-2885: - Fix Version/s: (was: 3.0.0) 3.0.0-alpha1 > Create AMRMProxy request interceptor for distributed scheduling decisions for > queueable containers > -- > > Key: YARN-2885 > URL: https://issues.apache.org/jira/browse/YARN-2885 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Konstantinos Karanasos >Assignee: Arun Suresh > Fix For: 2.9.0, 3.0.0-alpha1 > > Attachments: YARN-2885-yarn-2877.001.patch, > YARN-2885-yarn-2877.002.patch, YARN-2885-yarn-2877.full-2.patch, > YARN-2885-yarn-2877.full-3.patch, YARN-2885-yarn-2877.full.patch, > YARN-2885-yarn-2877.v4.patch, YARN-2885-yarn-2877.v5.patch, > YARN-2885-yarn-2877.v6.patch, YARN-2885-yarn-2877.v7.patch, > YARN-2885-yarn-2877.v8.patch, YARN-2885-yarn-2877.v9.patch, > YARN-2885.010.patch, YARN-2885.011.patch, YARN-2885.012.patch, > YARN-2885_api_changes.patch > > > We propose to add a Local ResourceManager (LocalRM) to the NM in order to > support distributed scheduling decisions. > Architecturally we leverage the RMProxy, introduced in YARN-2884. > The LocalRM makes distributed decisions for queueable container requests. > Guaranteed-start requests are still handled by the central RM. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
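The control flow can be pictured as a toy interceptor chain (all names illustrative, not the actual YARN classes): queueable asks are answered locally on the NM, everything else flows on toward the central RM:
{code}
// Toy sketch of the LocalRM interceptor idea (illustrative names).
public class LocalRMSketch {
  interface Request { boolean isQueueable(); }
  interface Response {}
  interface Interceptor { Response handle(Request req); }

  static class LocalScheduler implements Interceptor {
    private final Interceptor next; // chain ends at the central-RM proxy

    LocalScheduler(Interceptor next) { this.next = next; }

    @Override
    public Response handle(Request req) {
      // Queueable asks get a distributed decision on the NM;
      // guaranteed-start asks are forwarded unchanged to the central RM.
      return req.isQueueable() ? scheduleLocally(req) : next.handle(req);
    }

    private Response scheduleLocally(Request req) {
      return new Response() {}; // placeholder for a local allocation
    }
  }
}
{code}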
[jira] [Updated] (YARN-2883) Queuing of container requests in the NM
[ https://issues.apache.org/jira/browse/YARN-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-2883: - Fix Version/s: (was: 3.0.0) 3.0.0-alpha1 > Queuing of container requests in the NM > --- > > Key: YARN-2883 > URL: https://issues.apache.org/jira/browse/YARN-2883 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Konstantinos Karanasos >Assignee: Konstantinos Karanasos > Fix For: 2.9.0, 3.0.0-alpha1 > > Attachments: YARN-2883-trunk.004.patch, YARN-2883-trunk.005.patch, > YARN-2883-trunk.006.patch, YARN-2883-trunk.007.patch, > YARN-2883-trunk.008.patch, YARN-2883-trunk.009.patch, > YARN-2883-trunk.010.patch, YARN-2883-trunk.011.patch, > YARN-2883-trunk.012.patch, YARN-2883-trunk.013.patch, > YARN-2883-yarn-2877.001.patch, YARN-2883-yarn-2877.002.patch, > YARN-2883-yarn-2877.003.patch, YARN-2883-yarn-2877.004.patch, > YARN-2883.013.patch, YARN-2883.014.patch, YARN-2883.015.patch > > > We propose to add a queue in each NM, where queueable container requests can > be held. > Based on the available resources in the node and the containers in the queue, > the NM will decide when to allow the execution of a queued container. > In order to ensure the instantaneous start of a guaranteed-start container, > the NM may decide to pre-empt/kill running queueable containers. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
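A bare-bones version of the queueing decision described in YARN-2883 (illustrative, not the patch): hold asks in FIFO order and start them only while free resources cover them:
{code}
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative NM-side queue: start queued containers while resources allow.
public class NmQueueSketch {
  static final class Ask {
    final long memMb; final int vcores;
    Ask(long memMb, int vcores) { this.memMb = memMb; this.vcores = vcores; }
  }

  private final Deque<Ask> queue = new ArrayDeque<>();

  void enqueue(Ask ask) { queue.add(ask); }

  /** Start queued containers in FIFO order while free resources cover them. */
  void maybeStart(long freeMemMb, int freeVcores) {
    while (!queue.isEmpty()
        && queue.peek().memMb <= freeMemMb
        && queue.peek().vcores <= freeVcores) {
      Ask a = queue.poll();
      freeMemMb -= a.memMb;
      freeVcores -= a.vcores;
      launch(a);
    }
  }

  void launch(Ask a) { /* hand off to the container executor */ }
}
{code}
Preempting running queueable containers to admit a guaranteed-start one would sit on top of this admission check.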
[jira] [Updated] (YARN-2884) Proxying all AM-RM communications
[ https://issues.apache.org/jira/browse/YARN-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-2884: - Fix Version/s: (was: 3.0.0) 2.8.0 3.0.0-alpha1 > Proxying all AM-RM communications > - > > Key: YARN-2884 > URL: https://issues.apache.org/jira/browse/YARN-2884 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Carlo Curino >Assignee: Kishore Chaliparambil > Fix For: 2.8.0, 2.9.0, 3.0.0-alpha1 > > Attachments: YARN-2884-V1.patch, YARN-2884-V10.patch, > YARN-2884-V11.patch, YARN-2884-V12.patch, YARN-2884-V13.patch, > YARN-2884-V2.patch, YARN-2884-V3.patch, YARN-2884-V4.patch, > YARN-2884-V5.patch, YARN-2884-V6.patch, YARN-2884-V7.patch, > YARN-2884-V8.patch, YARN-2884-V9.patch > > > We introduce the notion of an RMProxy, running on each node (or once per > rack). Upon start, the AM is forced (via tokens and configuration) to direct > all its requests to a new service running on the NM that provides a proxy to > the central RM. > This gives us a place to: > 1) perform distributed scheduling decisions, > 2) throttle misbehaving AMs, and > 3) mask access to a federation of RMs. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-6918) Remove acls after queue delete to avoid memory leak
[ https://issues.apache.org/jira/browse/YARN-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16249897#comment-16249897 ] Subru Krishnan edited comment on YARN-6918 at 11/13/17 6:04 PM: Thanks [~bibinchundatt] for raising this and [~sunilg] for bringing this to my attention. I took a look at it and feel it's relevant but not really a blocker: * The new queue management is an _experimental_ feature that users/admin have to explicitly *opt-in*. * Deleting a queue should be a rare event (unlike application management) so is not in the routine or normal code path of app execution. Accordingly I have set the priority of this JIRA to major and target version to 2.9.1. was (Author: subru): Thanks [~bibinchundatt] for raising this and [~sunilg] for bringing this to my attention. I took a look at it and feel it's relevant but not really a blocker: * The new queue management is an _experimental_ feature that users/admin have to explicitly *opt-in*. * Deleting a queue should be a rare event (unlike application management). Accordingly I have set the priority of this JIRA to major and target version to 2.9.1. > Remove acls after queue delete to avoid memory leak > --- > > Key: YARN-6918 > URL: https://issues.apache.org/jira/browse/YARN-6918 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: YARN-6918.001.patch > > > Acl for deleted queue need to removed from allAcls to avoid leak > (Priority,YarnAuthorizer) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6918) Remove acls after queue delete to avoid memory leak
[ https://issues.apache.org/jira/browse/YARN-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-6918: - Target Version/s: 2.9.1 > Remove acls after queue delete to avoid memory leak > --- > > Key: YARN-6918 > URL: https://issues.apache.org/jira/browse/YARN-6918 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: YARN-6918.001.patch > > > Acl for deleted queue need to removed from allAcls to avoid leak > (Priority,YarnAuthorizer) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6918) Remove acls after queue delete to avoid memory leak
[ https://issues.apache.org/jira/browse/YARN-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16249897#comment-16249897 ] Subru Krishnan commented on YARN-6918: -- Thanks [~bibinchundatt] for raising this and [~sunilg] for bringing this to my attention. I took a look at it and feel it's relevant but not really a blocker: * The new queue management is an _experimental_ feature that users/admin have to explicitly *opt-in*. * Deleting a queue should be a rare event (unlike application management). Accordingly I have set the priority of this JIRA to major and target version to 2.9.1. > Remove acls after queue delete to avoid memory leak > --- > > Key: YARN-6918 > URL: https://issues.apache.org/jira/browse/YARN-6918 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: YARN-6918.001.patch > > > Acl for deleted queue need to removed from allAcls to avoid leak > (Priority,YarnAuthorizer) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
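The leak and its fix boil down to a map cleanup on queue deletion. A minimal sketch, assuming an allAcls-style map keyed by queue name (names illustrative):
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the fix: drop a queue's ACL entry when the queue is deleted.
public class QueueAclStore {
  private final Map<String, Object> allAcls = new ConcurrentHashMap<>();

  void addQueue(String queue, Object acls) { allAcls.put(queue, acls); }

  /** Without this removal, entries for deleted queues accumulate forever. */
  void deleteQueue(String queue) { allAcls.remove(queue); }
}
{code}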
[jira] [Updated] (YARN-6918) Remove acls after queue delete to avoid memory leak
[ https://issues.apache.org/jira/browse/YARN-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-6918: - Priority: Major (was: Critical) > Remove acls after queue delete to avoid memory leak > --- > > Key: YARN-6918 > URL: https://issues.apache.org/jira/browse/YARN-6918 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: YARN-6918.001.patch > > > Acl for deleted queue need to removed from allAcls to avoid leak > (Priority,YarnAuthorizer) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-7478) TEST-cetest fails in branch-2
[ https://issues.apache.org/jira/browse/YARN-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan resolved YARN-7478. -- Resolution: Implemented Fix Version/s: 2.9.0 Thanks [~varun_saxena] for bringing this up and [~leftnoteasy] for pointing out that YARN-7412 fixes it, so I simply cherry-picked YARN-7412 to branch-2/branch-2.9/branch-2.9.0. > TEST-cetest fails in branch-2 > - > > Key: YARN-7478 > URL: https://issues.apache.org/jira/browse/YARN-7478 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Subru Krishnan >Priority: Minor > Fix For: 2.9.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-7476) Fix miscellaneous issues in ATSv2 after merge to branch-2
[ https://issues.apache.org/jira/browse/YARN-7476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16248932#comment-16248932 ] Subru Krishnan edited comment on YARN-7476 at 11/12/17 6:03 PM: Thanks [~varun_saxena] for the fix, created YARN-7478 to track the test failure. I have committed this to branch-2/branch-2.9/branch-2.9.0. was (Author: subru): Thanks [~varun_saxena] for the fix, I have committed this to branch-2/branch-2.9/branch-2.9.0. > Fix miscellaneous issues in ATSv2 after merge to branch-2 > - > > Key: YARN-7476 > URL: https://issues.apache.org/jira/browse/YARN-7476 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Varun Saxena >Assignee: Varun Saxena > Fix For: 2.9.0 > > Attachments: YARN-7476-branch-2.01.patch > > > a) We are still using Resource#getMemory in > NMTimelinePublisher#publishContainerCreatedEvent. This has been deprecated > since YARN-4844. Better to use getMemorySize instead. > b) Post YARN-5865, application priority should be fetched from RMAppImpl > instead of app submission context. But we are still fetching it from > submission context while publishing entities to timeline service. This would > mean that if priority is updated, it will not be published to timeline > service. > c) The order of app_collectors in NodeHeartbeatResponseProto is different > from trunk. Better to make it consistent. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
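Point (a) above in code form (a sketch of the substitution, not the patch itself):
{code}
import org.apache.hadoop.yarn.api.records.Resource;

public class MemoryRead {
  static long memoryOf(Resource r) {
    // was: r.getMemory() -- deprecated since YARN-4844, int-valued
    return r.getMemorySize(); // long-valued replacement
  }
}
{code}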
[jira] [Commented] (YARN-7478) TEST-cetest fails in branch-2
[ https://issues.apache.org/jira/browse/YARN-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16248923#comment-16248923 ] Subru Krishnan commented on YARN-7478: -- Refer to Yetus report in YARN-7476/YARN-5049 etc. > TEST-cetest fails in branch-2 > - > > Key: YARN-7478 > URL: https://issues.apache.org/jira/browse/YARN-7478 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Subru Krishnan >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7475) Fix Container log link in new YARN UI
[ https://issues.apache.org/jira/browse/YARN-7475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-7475: - Summary: Fix Container log link in new YARN UI (was: container log link is not working in new YARN UI) > Fix Container log link in new YARN UI > - > > Key: YARN-7475 > URL: https://issues.apache.org/jira/browse/YARN-7475 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-beta1 >Reporter: Sunil G >Assignee: Sunil G > Attachments: YARN-7475.001.patch > > > Container log link is broken -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5049) Extend NMStateStore to save queued container information
[ https://issues.apache.org/jira/browse/YARN-5049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16248919#comment-16248919 ] Subru Krishnan commented on YARN-5049: -- +1 to the addendum patch. Thanks [~varun_saxena] for reporting it and [~asuresh] for fixing it. The test failure is unrelated and tracked in YARN-7478. > Extend NMStateStore to save queued container information > > > Key: YARN-5049 > URL: https://issues.apache.org/jira/browse/YARN-5049 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Konstantinos Karanasos >Assignee: Arun Suresh > Fix For: 2.9.0, 3.0.0 > > Attachments: YARN-5049-addendum.branch-2.001.patch, > YARN-5049.001.patch, YARN-5049.002.patch, YARN-5049.003.patch > > > This JIRA is about extending the NMStateStore to save queued container > information whenever a new container is added to the NM queue. > It also removes the information from the state store when the queued > container starts its execution. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7478) TEST-cetest fails in branch-2
Subru Krishnan created YARN-7478: Summary: TEST-cetest fails in branch-2 Key: YARN-7478 URL: https://issues.apache.org/jira/browse/YARN-7478 Project: Hadoop YARN Issue Type: Bug Reporter: Subru Krishnan Priority: Minor -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-7476) Fix miscellaneous issues in ATSv2 after merge to branch-2
[ https://issues.apache.org/jira/browse/YARN-7476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16248715#comment-16248715 ] Subru Krishnan edited comment on YARN-7476 at 11/11/17 11:07 PM: - Thanks [~varun_saxena] for the thorough investigation to dig this up. I compared the files between trunk and branch-2 and you are spot on, +1 on the patch (pending Yetus). was (Author: subru): Thanks [~varun_saxena] for the thorough investigation to dig this up. +1 on the patch (pending Yetus). > Fix miscellaneous issues in ATSv2 after merge to branch-2 > - > > Key: YARN-7476 > URL: https://issues.apache.org/jira/browse/YARN-7476 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Varun Saxena >Assignee: Varun Saxena > Attachments: YARN-7476-branch-2.01.patch > > > a) We are still using Resource#getMemory in > NMTimelinePublisher#publishContainerCreatedEvent. This has been deprecated > since YARN-4844. Better to use getMemorySize instead. > b) Post YARN-5865, application priority should be fetched from RMAppImpl > instead of app submission context. But we are still fetching it from > submission context while publishing entities to timeline service. This would > mean that if priority is updated, it will not be published to timeline > service. > c) The order of app_collectors in NodeHeartbeatResponseProto is different > from trunk. Better to make it consistent. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7476) Fix miscellaneous issues in ATSv2 after merge to branch-2
[ https://issues.apache.org/jira/browse/YARN-7476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16248715#comment-16248715 ] Subru Krishnan commented on YARN-7476: -- Thanks [~varun_saxena] for the thorough investigation to dig this up. +1 on the patch (pending Yetus). > Fix miscellaneous issues in ATSv2 after merge to branch-2 > - > > Key: YARN-7476 > URL: https://issues.apache.org/jira/browse/YARN-7476 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Varun Saxena >Assignee: Varun Saxena > Attachments: YARN-7476-branch-2.01.patch > > > a) We are still using Resource#getMemory in > NMTimelinePublisher#publishContainerCreatedEvent. This has been deprecated > since YARN-4844. Better to use getMemorySize instead. > b) Post YARN-5865, application priority should be fetched from RMAppImpl > instead of app submission context. But we are still fetching it from > submission context while publishing entities to timeline service. This would > mean that if priority is updated, it will not be published to timeline > service. > c) The order of app_collectors in NodeHeartbeatResponseProto is different > from trunk. Better to make it consistent. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org