[jira] [Updated] (YARN-9270) Minor cleanup in TestFpgaDiscoverer
[ https://issues.apache.org/jira/browse/YARN-9270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj K updated YARN-9270:
    Priority: Minor (was: Major)
    Hadoop Flags: Reviewed

+1, latest patch looks good to me.

> Minor cleanup in TestFpgaDiscoverer
> ---
>
> Key: YARN-9270
> URL: https://issues.apache.org/jira/browse/YARN-9270
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Minor
> Attachments: YARN-9270-001.patch, YARN-9270-002.patch, YARN-9270-003.patch, YARN-9270-004.patch, YARN-9270-005.patch
>
> Let's do some cleanup in this class.
> * {{testLinuxFpgaResourceDiscoverPluginConfig}} - this test should be split up into 5 different tests, because it tests 5 different scenarios.
> * remove {{setNewEnvironmentHack()}} - too complicated. We can introduce a {{Function<String, String>}} in the plugin class like {{Function<String, String> envProvider = System::getenv}} plus a setter method which allows the test to modify {{envProvider}}. Much simpler and more straightforward.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
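The environment-provider idea from the issue description can be sketched roughly as follows. This is only an illustration, not the code from the attached patches: the class name `EnvAwarePlugin`, the setter `setEnvProvider`, and the `lookup` method are hypothetical. Only `java.util.function.Function` and `System.getenv(String)` are taken from the JDK.

```java
import java.util.function.Function;

// Hypothetical plugin class; the names are illustrative, not from the patch.
class EnvAwarePlugin {
  // By default, variables are read from the real process environment.
  private Function<String, String> envProvider = System::getenv;

  // Intended for tests: swaps the environment lookup for a fake one.
  void setEnvProvider(Function<String, String> envProvider) {
    this.envProvider = envProvider;
  }

  // Returns the variable's value, or the fallback when it is not set.
  String lookup(String variable, String fallback) {
    String value = envProvider.apply(variable);
    return value != null ? value : fallback;
  }
}
```

A test can then call `setEnvProvider(name -> ...)` with a lambda or a `Map::get` reference instead of modifying the JVM's environment through reflection.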
[jira] [Commented] (YARN-9270) Minor cleanup in TestFpgaDiscoverer
[ https://issues.apache.org/jira/browse/YARN-9270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16803090#comment-16803090 ]

Devaraj K commented on YARN-9270:

[~pbacsko], can you rebase this patch?

> Minor cleanup in TestFpgaDiscoverer
> ---
>
> Key: YARN-9270
> URL: https://issues.apache.org/jira/browse/YARN-9270
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Attachments: YARN-9270-001.patch, YARN-9270-002.patch, YARN-9270-003.patch
>
> Let's do some cleanup in this class.
> * {{testLinuxFpgaResourceDiscoverPluginConfig}} - this test should be split up into 5 different tests, because it tests 5 different scenarios.
> * remove {{setNewEnvironmentHack()}} - too complicated. We can introduce a {{Function<String, String>}} in the plugin class like {{Function<String, String> envProvider = System::getenv}} plus a setter method which allows the test to modify {{envProvider}}. Much simpler and more straightforward.
[jira] [Updated] (YARN-9269) Minor cleanup in FpgaResourceAllocator
[ https://issues.apache.org/jira/browse/YARN-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj K updated YARN-9269:
    Priority: Minor (was: Major)
    Hadoop Flags: Reviewed

+1, latest patch looks good to me, committing it shortly.

> Minor cleanup in FpgaResourceAllocator
> ---
>
> Key: YARN-9269
> URL: https://issues.apache.org/jira/browse/YARN-9269
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Minor
> Attachments: YARN-9269-001.patch, YARN-9269-002.patch, YARN-9269-003.patch, YARN-9269-004.patch, YARN-9269-005.patch
>
> Some stuff that we observed:
> * {{addFpga()}} - we check for duplicate devices, but we don't print any error/warning if there are any.
> * {{findMatchedFpga()}} should be called {{findMatchingFpga()}}. Also, is this method even needed? We already receive an {{FpgaDevice}} instance in {{updateFpga()}} which I believe is the same that we're looking up.
> * variable {{IPIDpreference}} is confusing
> * {{availableFpga}} / {{usedFpgaByRequestor}} are instances of {{LinkedHashMap}}. What's the rationale behind this? Doesn't a simple {{HashMap}} suffice?
> * {{usedFpgaByRequestor}} should be renamed, naming is a bit unclear
> * {{allowedFpgas}} should be an immutable list
> * {{@VisibleForTesting}} methods should be package private
> * get rid of {{*}} imports
[jira] [Updated] (YARN-9268) General improvements in FpgaDevice
[ https://issues.apache.org/jira/browse/YARN-9268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj K updated YARN-9268:
    Hadoop Flags: Reviewed

> General improvements in FpgaDevice
> ---
>
> Key: YARN-9268
> URL: https://issues.apache.org/jira/browse/YARN-9268
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Attachments: YARN-9268-001.patch, YARN-9268-002.patch, YARN-9268-003.patch, YARN-9268-004.patch, YARN-9268-005.patch, YARN-9268-006.patch, YARN-9268-007.patch
>
> Need to fix the following in the class {{FpgaDevice}}:
> * It implements {{Comparable}}, but returns 0 in every case. There is no natural ordering among FPGA devices, perhaps "acl0" comes before "acl1", but this seems too forced and unnecessary. We think this class should not implement {{Comparable}} at all, at least not like that.
> * Stores unnecessary fields: devName, busNum, temperature, power usage. For one, these are never needed in the code. Secondly, temp and power usage change constantly. It's pointless to store these in this POJO.
> * {{serialVersionUID}} is 1L - let's generate a number for this
> * Use {{int}} instead of {{Integer}} - don't allow nulls. If major/minor uniquely identifies the card, then let's demand them in the constructor and don't store Integers that can be null.
[jira] [Commented] (YARN-9268) General improvements in FpgaDevice
[ https://issues.apache.org/jira/browse/YARN-9268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16801025#comment-16801025 ]

Devaraj K commented on YARN-9268:

+1, latest patch looks good to me.

> General improvements in FpgaDevice
> ---
>
> Key: YARN-9268
> URL: https://issues.apache.org/jira/browse/YARN-9268
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Attachments: YARN-9268-001.patch, YARN-9268-002.patch, YARN-9268-003.patch, YARN-9268-004.patch, YARN-9268-005.patch, YARN-9268-006.patch, YARN-9268-007.patch
>
> Need to fix the following in the class {{FpgaDevice}}:
> * It implements {{Comparable}}, but returns 0 in every case. There is no natural ordering among FPGA devices, perhaps "acl0" comes before "acl1", but this seems too forced and unnecessary. We think this class should not implement {{Comparable}} at all, at least not like that.
> * Stores unnecessary fields: devName, busNum, temperature, power usage. For one, these are never needed in the code. Secondly, temp and power usage change constantly. It's pointless to store these in this POJO.
> * {{serialVersionUID}} is 1L - let's generate a number for this
> * Use {{int}} instead of {{Integer}} - don't allow nulls. If major/minor uniquely identifies the card, then let's demand them in the constructor and don't store Integers that can be null.
[jira] [Commented] (YARN-9268) General improvements in FpgaDevice
[ https://issues.apache.org/jira/browse/YARN-9268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16798723#comment-16798723 ]

Devaraj K commented on YARN-9268:

Thanks [~pbacsko] for quickly updating the patch.

* FpgaResourceAllocator.java
** {{aliasDevName}} is used in {{hashCode()}} but not in {{equals()}}.
** Some fields are not used in {{hashCode()}} and {{equals()}}; don't we need to include them here?
** Can you correct the typo here:
{code}
//key is requetor, aka. container ID
{code}
* TestFpgaResourceHandler.java
** This change does not seem to be needed; the same applies to all occurrences in this test class.
{code}
- for (FpgaDevice device : allowedDevices) {
+ for (FpgaResourceAllocator.FpgaDevice device : allowedDevices) {
{code}

> General improvements in FpgaDevice
> ---
>
> Key: YARN-9268
> URL: https://issues.apache.org/jira/browse/YARN-9268
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Attachments: YARN-9268-001.patch, YARN-9268-002.patch, YARN-9268-003.patch, YARN-9268-004.patch, YARN-9268-005.patch
>
> Need to fix the following in the class {{FpgaDevice}}:
> * It implements {{Comparable}}, but returns 0 in every case. There is no natural ordering among FPGA devices, perhaps "acl0" comes before "acl1", but this seems too forced and unnecessary. We think this class should not implement {{Comparable}} at all, at least not like that.
> * Stores unnecessary fields: devName, busNum, temperature, power usage. For one, these are never needed in the code. Secondly, temp and power usage change constantly. It's pointless to store these in this POJO.
> * {{serialVersionUID}} is 1L - let's generate a number for this
> * Use {{int}} instead of {{Integer}} - don't allow nulls. If major/minor uniquely identifies the card, then let's demand them in the constructor and don't store Integers that can be null.
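The review point about {{hashCode()}} and {{equals()}} using different fields can be illustrated with a small sketch. It assumes, as the issue description suggests, that type plus major/minor number identify a card and that the alias name is only descriptive; the class `SimpleFpgaDevice` below is hypothetical and is not the `FpgaDevice` from the patch.

```java
import java.util.Objects;

// Hypothetical, simplified device class for illustration only.
final class SimpleFpgaDevice {
  private final String type;
  private final int major;  // primitive int, so it can never be null
  private final int minor;
  private final String aliasDevName;  // descriptive only, not part of identity

  SimpleFpgaDevice(String type, int major, int minor, String aliasDevName) {
    this.type = Objects.requireNonNull(type, "type");
    this.major = major;
    this.minor = minor;
    this.aliasDevName = aliasDevName;
  }

  // equals() and hashCode() use exactly the same fields: type, major, minor.
  @Override
  public boolean equals(Object obj) {
    if (this == obj) {
      return true;
    }
    if (!(obj instanceof SimpleFpgaDevice)) {
      return false;
    }
    SimpleFpgaDevice other = (SimpleFpgaDevice) obj;
    return major == other.major && minor == other.minor
        && type.equals(other.type);
  }

  @Override
  public int hashCode() {
    return Objects.hash(type, major, minor);
  }
}
```

If a field such as the alias were hashed but not compared, two objects that are equal could end up in different hash buckets, which breaks lookups in `HashMap` and `HashSet`.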
[jira] [Commented] (YARN-9267) General improvements in FpgaResourceHandlerImpl
[ https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16798312#comment-16798312 ]

Devaraj K commented on YARN-9267:

+1, latest patch looks good to me, committing it shortly.

> General improvements in FpgaResourceHandlerImpl
> ---
>
> Key: YARN-9267
> URL: https://issues.apache.org/jira/browse/YARN-9267
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Attachments: YARN-9267-001.patch, YARN-9267-002.patch, YARN-9267-003.patch, YARN-9267-004.patch, YARN-9267-005.patch, YARN-9267-006.patch, YARN-9267-007.patch, YARN-9267-008.patch, YARN-9267-009.patch, YARN-9267-010.patch
>
> Fix some problems in {{FpgaResourceHandlerImpl}}:
> * {{preStart()}} does not reconfigure card with the same IP - we see it as a problem. If you recompile the FPGA application, you must rename the aocx file because the card will not be reprogrammed. Suggestion: instead of storing Node<\->IPID mapping, store Node<\->IPID hash (like the SHA-256 of the localized file).
> * Switch to slf4j from Apache Commons Logging
> * Some unused imports
[jira] [Commented] (YARN-9267) General improvements in FpgaResourceHandlerImpl
[ https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16798229#comment-16798229 ]

Devaraj K commented on YARN-9267:

Thanks [~pbacsko] for updating the patch, can you also take care of this checkstyle warning?
{code}
-0 checkstyle 0m 23s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 1 new + 46 unchanged - 6 fixed = 47 total (was 52)
{code}
{code}
./hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/fpga/TestFpgaResourceHandler.java:322: throws ResourceHandlerException, PrivilegedOperationException, IOException {: Line is longer than 80 characters (found 82). [LineLength]
{code}

> General improvements in FpgaResourceHandlerImpl
> ---
>
> Key: YARN-9267
> URL: https://issues.apache.org/jira/browse/YARN-9267
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Attachments: YARN-9267-001.patch, YARN-9267-002.patch, YARN-9267-003.patch, YARN-9267-004.patch, YARN-9267-005.patch, YARN-9267-006.patch, YARN-9267-007.patch, YARN-9267-008.patch, YARN-9267-009.patch
>
> Fix some problems in {{FpgaResourceHandlerImpl}}:
> * {{preStart()}} does not reconfigure card with the same IP - we see it as a problem. If you recompile the FPGA application, you must rename the aocx file because the card will not be reprogrammed. Suggestion: instead of storing Node<\->IPID mapping, store Node<\->IPID hash (like the SHA-256 of the localized file).
> * Switch to slf4j from Apache Commons Logging
> * Some unused imports
[jira] [Commented] (YARN-9267) General improvements in FpgaResourceHandlerImpl
[ https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16797741#comment-16797741 ]

Devaraj K commented on YARN-9267:

Please remove this log message; it causes the error to be logged twice.
{code}
+ LOG.error("Could not calculate SHA-256", e);
{code}
Other than that, the patch looks good to me.

> General improvements in FpgaResourceHandlerImpl
> ---
>
> Key: YARN-9267
> URL: https://issues.apache.org/jira/browse/YARN-9267
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Attachments: YARN-9267-001.patch, YARN-9267-002.patch, YARN-9267-003.patch, YARN-9267-004.patch, YARN-9267-005.patch, YARN-9267-006.patch, YARN-9267-007.patch, YARN-9267-008.patch
>
> Fix some problems in {{FpgaResourceHandlerImpl}}:
> * {{preStart()}} does not reconfigure card with the same IP - we see it as a problem. If you recompile the FPGA application, you must rename the aocx file because the card will not be reprogrammed. Suggestion: instead of storing Node<\->IPID mapping, store Node<\->IPID hash (like the SHA-256 of the localized file).
> * Switch to slf4j from Apache Commons Logging
> * Some unused imports
[jira] [Commented] (YARN-9267) General improvements in FpgaResourceHandlerImpl
[ https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16797302#comment-16797302 ]

Devaraj K commented on YARN-9267:

bq. It's usually a good practice to do in unit tests, but I'm not a fundamentalist so I can go a file-creation way if you think it's better.

It logs the original cause and then creates a {{ResourceHandlerException}} to throw without the original reason. Please update it to invoke {{getSha256ofFile}} directly.

> General improvements in FpgaResourceHandlerImpl
> ---
>
> Key: YARN-9267
> URL: https://issues.apache.org/jira/browse/YARN-9267
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Attachments: YARN-9267-001.patch, YARN-9267-002.patch, YARN-9267-003.patch, YARN-9267-004.patch, YARN-9267-005.patch, YARN-9267-006.patch, YARN-9267-007.patch
>
> Fix some problems in {{FpgaResourceHandlerImpl}}:
> * {{preStart()}} does not reconfigure card with the same IP - we see it as a problem. If you recompile the FPGA application, you must rename the aocx file because the card will not be reprogrammed. Suggestion: instead of storing Node<\->IPID mapping, store Node<\->IPID hash (like the SHA-256 of the localized file).
> * Switch to slf4j from Apache Commons Logging
> * Some unused imports
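The request to throw the exception with its original reason, instead of logging it and throwing a bare one, can be sketched as below. The classes `HandlerException` and `HashingStep` and the method `hashOf` are hypothetical stand-ins written for this illustration; they are not Hadoop classes and not the code from the patch.

```java
import java.io.IOException;

// Hypothetical stand-in for a handler-level checked exception.
class HandlerException extends Exception {
  HandlerException(String message, Throwable cause) {
    super(message, cause);
  }
}

class HashingStep {
  // Stand-in for a file-hashing call that may fail with an IOException.
  static String hashOf(String path) throws IOException {
    if (path.isEmpty()) {
      throw new IOException("empty path");
    }
    return Integer.toHexString(path.hashCode());
  }

  // Wraps the low-level failure and keeps it as the cause. Nothing is logged
  // here, so the error is reported once by whoever finally handles it.
  static String hashOrFail(String path) throws HandlerException {
    try {
      return hashOf(path);
    } catch (IOException e) {
      throw new HandlerException("Could not calculate hash of " + path, e);
    }
  }
}
```

Because the `IOException` travels along as the cause, the final stack trace still shows why the hash could not be calculated.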
[jira] [Commented] (YARN-9267) General improvements in FpgaResourceHandlerImpl
[ https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16796800#comment-16796800 ]

Devaraj K commented on YARN-9267:

Thanks [~pbacsko] for updating the patch.

* FpgaResourceHandlerImpl.java
** I am not sure whether this is really needed. I think {{getSha256ofFile}} can be invoked directly, and with that {{if (!hashOpt.isPresent()) {}} can also be avoided.
{code:xml}
+ private Function> digestProvider =
+ this::getSha256ofFile;
{code}
** With the above fix, can you also update this to throw the exception directly as a wrapped one and avoid logging?
{code:xml}
+ LOG.error("Could not calculate SHA-256", e);
{code}
* TestFpgaResourceHandler.java
** Can we have a loop here to add the {{FpgaDevice}} objects to {{deviceList}}, so that this duplicate code can be removed?
{code:xml}
+deviceList.add(new FpgaDevice(vendorType, 247, 0, null));
.
+deviceList.add(new FpgaDevice(vendorType, 247, 4, null));
{code}

> General improvements in FpgaResourceHandlerImpl
> ---
>
> Key: YARN-9267
> URL: https://issues.apache.org/jira/browse/YARN-9267
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Attachments: YARN-9267-001.patch, YARN-9267-002.patch, YARN-9267-003.patch, YARN-9267-004.patch, YARN-9267-005.patch, YARN-9267-006.patch, YARN-9267-007.patch
>
> Fix some problems in {{FpgaResourceHandlerImpl}}:
> * {{preStart()}} does not reconfigure card with the same IP - we see it as a problem. If you recompile the FPGA application, you must rename the aocx file because the card will not be reprogrammed. Suggestion: instead of storing Node<\->IPID mapping, store Node<\->IPID hash (like the SHA-256 of the localized file).
> * Switch to slf4j from Apache Commons Logging
> * Some unused imports
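The loop suggested for the test setup might look roughly like this. The nested `Device` class is a hypothetical stand-in for `FpgaDevice` with the same four constructor arguments as in the quoted diff, and `createDevices` is an illustrative helper name, not something taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;

class DeviceListExample {
  // Minimal hypothetical stand-in for FpgaDevice (type, major, minor, IP id).
  static final class Device {
    final String type;
    final int major;
    final int minor;
    final String ipId;

    Device(String type, int major, int minor, String ipId) {
      this.type = type;
      this.major = major;
      this.minor = minor;
      this.ipId = ipId;
    }
  }

  // Builds `count` devices that share type and major number and have
  // minor numbers 0 .. count-1, replacing the repeated add() calls.
  static List<Device> createDevices(String vendorType, int major, int count) {
    List<Device> deviceList = new ArrayList<>();
    for (int minor = 0; minor < count; minor++) {
      deviceList.add(new Device(vendorType, major, minor, null));
    }
    return deviceList;
  }
}
```

Calling `createDevices(vendorType, 247, 5)` yields the minor numbers 0 through 4 shown in the quoted lines.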
[jira] [Commented] (YARN-9268) General improvements in FpgaDevice
[ https://issues.apache.org/jira/browse/YARN-9268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795671#comment-16795671 ]

Devaraj K commented on YARN-9268:

Thanks [~pbacsko] for the patch. The latest patch does not apply to trunk, please update it.

> General improvements in FpgaDevice
> ---
>
> Key: YARN-9268
> URL: https://issues.apache.org/jira/browse/YARN-9268
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Attachments: YARN-9268-001.patch, YARN-9268-002.patch, YARN-9268-003.patch
>
> Need to fix the following in the class {{FpgaDevice}}:
> * It implements {{Comparable}}, but returns 0 in every case. There is no natural ordering among FPGA devices, perhaps "acl0" comes before "acl1", but this seems too forced and unnecessary. We think this class should not implement {{Comparable}} at all, at least not like that.
> * Stores unnecessary fields: devName, busNum, temperature, power usage. For one, these are never needed in the code. Secondly, temp and power usage change constantly. It's pointless to store these in this POJO.
> * {{serialVersionUID}} is 1L - let's generate a number for this
> * Use {{int}} instead of {{Integer}} - don't allow nulls. If major/minor uniquely identifies the card, then let's demand them in the constructor and don't store Integers that can be null.
[jira] [Commented] (YARN-9267) General improvements in FpgaResourceHandlerImpl
[ https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795669#comment-16795669 ]

Devaraj K commented on YARN-9267:

Thanks [~pbacsko] for the patch. The latest patch has gone stale, can you update it?

> General improvements in FpgaResourceHandlerImpl
> ---
>
> Key: YARN-9267
> URL: https://issues.apache.org/jira/browse/YARN-9267
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Attachments: YARN-9267-001.patch, YARN-9267-002.patch, YARN-9267-003.patch, YARN-9267-004.patch, YARN-9267-005.patch, YARN-9267-006.patch
>
> Fix some problems in {{FpgaResourceHandlerImpl}}:
> * {{preStart()}} does not reconfigure card with the same IP - we see it as a problem. If you recompile the FPGA application, you must rename the aocx file because the card will not be reprogrammed. Suggestion: instead of storing Node<\->IPID mapping, store Node<\->IPID hash (like the SHA-256 of the localized file).
> * Switch to slf4j from Apache Commons Logging
> * Some unused imports
[jira] [Commented] (YARN-9270) Minor cleanup in TestFpgaDiscoverer
[ https://issues.apache.org/jira/browse/YARN-9270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795675#comment-16795675 ]

Devaraj K commented on YARN-9270:

Thanks [~pbacsko] for the patch. The latest patch does not apply, please update it.

> Minor cleanup in TestFpgaDiscoverer
> ---
>
> Key: YARN-9270
> URL: https://issues.apache.org/jira/browse/YARN-9270
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Attachments: YARN-9270-001.patch, YARN-9270-002.patch, YARN-9270-003.patch
>
> Let's do some cleanup in this class.
> * {{testLinuxFpgaResourceDiscoverPluginConfig}} - this test should be split up into 5 different tests, because it tests 5 different scenarios.
> * remove {{setNewEnvironmentHack()}} - too complicated. We can introduce a {{Function<String, String>}} in the plugin class like {{Function<String, String> envProvider = System::getenv}} plus a setter method which allows the test to modify {{envProvider}}. Much simpler and more straightforward.
[jira] [Commented] (YARN-9269) Minor cleanup in FpgaResourceAllocator
[ https://issues.apache.org/jira/browse/YARN-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795674#comment-16795674 ]

Devaraj K commented on YARN-9269:

Thanks [~pbacsko] for the patch. The latest patch does not apply, please update it.

> Minor cleanup in FpgaResourceAllocator
> ---
>
> Key: YARN-9269
> URL: https://issues.apache.org/jira/browse/YARN-9269
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Attachments: YARN-9269-001.patch, YARN-9269-002.patch, YARN-9269-003.patch
>
> Some stuff that we observed:
> * {{addFpga()}} - we check for duplicate devices, but we don't print any error/warning if there are any.
> * {{findMatchedFpga()}} should be called {{findMatchingFpga()}}. Also, is this method even needed? We already receive an {{FpgaDevice}} instance in {{updateFpga()}} which I believe is the same that we're looking up.
> * variable {{IPIDpreference}} is confusing
> * {{availableFpga}} / {{usedFpgaByRequestor}} are instances of {{LinkedHashMap}}. What's the rationale behind this? Doesn't a simple {{HashMap}} suffice?
> * {{usedFpgaByRequestor}} should be renamed, naming is a bit unclear
> * {{allowedFpgas}} should be an immutable list
> * {{@VisibleForTesting}} methods should be package private
> * get rid of {{*}} imports
[jira] [Commented] (YARN-9267) General improvements in FpgaResourceHandlerImpl
[ https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793322#comment-16793322 ]

Devaraj K commented on YARN-9267:

Sorry for coming in late. Thanks [~pbacsko] for the patch and [~snemeth] & [~tangzhankun] for the reviews.

The patch overall looks good to me.
* Have you thought of using an existing library API like org.apache.commons.codec.digest.DigestUtils.sha256Hex(InputStream data), so that we don't have to add Sha256Calculator and tests for it?

> General improvements in FpgaResourceHandlerImpl
> ---
>
> Key: YARN-9267
> URL: https://issues.apache.org/jira/browse/YARN-9267
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Attachments: YARN-9267-001.patch, YARN-9267-002.patch, YARN-9267-003.patch, YARN-9267-004.patch
>
> Fix some problems in {{FpgaResourceHandlerImpl}}:
> * {{preStart()}} does not reconfigure card with the same IP - we see it as a problem. If you recompile the FPGA application, you must rename the aocx file because the card will not be reprogrammed. Suggestion: instead of storing Node<\->IPID mapping, store Node<\->IPID hash (like the SHA-256 of the localized file).
> * Switch to slf4j from Apache Commons Logging
> * Some unused imports
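For context, this is roughly the work that a hand-written SHA-256 helper has to do and that the library call mentioned in the comment would replace. The sketch uses only the JDK ({{java.security.MessageDigest}}) so that it is self-contained; the class name `Sha256Sketch` is hypothetical and this is not the `Sha256Calculator` from the patch. Unlike the Commons Codec method, it wraps I/O errors in an unchecked exception to keep the example short.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Sha256Sketch {
  // Reads the stream to the end and returns its SHA-256 as lowercase hex.
  static String sha256Hex(InputStream in) {
    MessageDigest digest;
    try {
      digest = MessageDigest.getInstance("SHA-256");
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 is a mandatory algorithm on every Java platform.
      throw new IllegalStateException(e);
    }
    byte[] buffer = new byte[8192];
    int read;
    try {
      while ((read = in.read(buffer)) != -1) {
        digest.update(buffer, 0, read);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : digest.digest()) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }
}
```

With Commons Codec on the classpath, all of this reduces to a single `DigestUtils.sha256Hex(inputStream)` call, which is the point of the suggestion.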
[jira] [Commented] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16402596#comment-16402596 ]

Devaraj K commented on YARN-5764:

Thanks [~miklos.szeg...@cloudera.com] for the review and commit, and [~leftnoteasy] and others for the reviews.

[~miklos.szeg...@cloudera.com], is there any reason this is still 'Unresolved'?

> NUMA awareness support for launching containers
> ---
>
> Key: YARN-5764
> URL: https://issues.apache.org/jira/browse/YARN-5764
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager, yarn
> Reporter: Olasoji
> Assignee: Devaraj K
> Priority: Major
> Attachments: NUMA Awareness for YARN Containers.pdf, NUMA Performance Results.pdf, YARN-5764-v0.patch, YARN-5764-v1.patch, YARN-5764-v10.patch, YARN-5764-v11.patch, YARN-5764-v2.patch, YARN-5764-v3.patch, YARN-5764-v4.patch, YARN-5764-v5.patch, YARN-5764-v6.patch, YARN-5764-v7.patch, YARN-5764-v8.patch, YARN-5764-v9.patch
>
> The purpose of this feature is to improve Hadoop performance by minimizing costly remote memory accesses on non SMP systems. Yarn containers, on launch, will be pinned to a specific NUMA node and all subsequent memory allocations will be served by the same node, reducing remote memory accesses. The current default behavior is to spread memory across all NUMA nodes.
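The pinning idea in the description (serve a container's memory and CPU from a single NUMA node) can be illustrated with a simple first-fit bookkeeping sketch. Everything here is an assumption made for illustration: the class `NumaFirstFit`, its methods, and the first-fit policy are hypothetical and do not describe how the attached patches allocate nodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalInt;

// Hypothetical bookkeeping for per-node free resources.
class NumaFirstFit {
  private static final class Node {
    final int id;
    long freeMemoryMb;
    int freeCpus;

    Node(int id, long freeMemoryMb, int freeCpus) {
      this.id = id;
      this.freeMemoryMb = freeMemoryMb;
      this.freeCpus = freeCpus;
    }
  }

  private final List<Node> nodes = new ArrayList<>();

  void addNode(int id, long memoryMb, int cpus) {
    nodes.add(new Node(id, memoryMb, cpus));
  }

  // Picks the first node that can hold the whole request and reserves the
  // resources there, so the container never spans two nodes. Returns an
  // empty result when no single node is large enough.
  OptionalInt assign(long memoryMb, int cpus) {
    for (Node node : nodes) {
      if (node.freeMemoryMb >= memoryMb && node.freeCpus >= cpus) {
        node.freeMemoryMb -= memoryMb;
        node.freeCpus -= cpus;
        return OptionalInt.of(node.id);
      }
    }
    return OptionalInt.empty();
  }
}
```

A real implementation would also have to release resources when a container finishes and decide what to do when no single node fits the request.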
[jira] [Updated] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj K updated YARN-5764:
    Attachment: YARN-5764-v11.patch

> NUMA awareness support for launching containers
> ---
>
> Key: YARN-5764
> URL: https://issues.apache.org/jira/browse/YARN-5764
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager, yarn
> Reporter: Olasoji
> Assignee: Devaraj K
> Priority: Major
> Attachments: NUMA Awareness for YARN Containers.pdf, NUMA Performance Results.pdf, YARN-5764-v0.patch, YARN-5764-v1.patch, YARN-5764-v10.patch, YARN-5764-v11.patch, YARN-5764-v2.patch, YARN-5764-v3.patch, YARN-5764-v4.patch, YARN-5764-v5.patch, YARN-5764-v6.patch, YARN-5764-v7.patch, YARN-5764-v8.patch, YARN-5764-v9.patch
>
> The purpose of this feature is to improve Hadoop performance by minimizing costly remote memory accesses on non SMP systems. Yarn containers, on launch, will be pinned to a specific NUMA node and all subsequent memory allocations will be served by the same node, reducing remote memory accesses. The current default behavior is to spread memory across all NUMA nodes.
[jira] [Updated] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj K updated YARN-5764:
    Attachment: YARN-5764-v10.patch

> NUMA awareness support for launching containers
> ---
>
> Key: YARN-5764
> URL: https://issues.apache.org/jira/browse/YARN-5764
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager, yarn
> Reporter: Olasoji
> Assignee: Devaraj K
> Priority: Major
> Attachments: NUMA Awareness for YARN Containers.pdf, NUMA Performance Results.pdf, YARN-5764-v0.patch, YARN-5764-v1.patch, YARN-5764-v10.patch, YARN-5764-v2.patch, YARN-5764-v3.patch, YARN-5764-v4.patch, YARN-5764-v5.patch, YARN-5764-v6.patch, YARN-5764-v7.patch, YARN-5764-v8.patch, YARN-5764-v9.patch
>
> The purpose of this feature is to improve Hadoop performance by minimizing costly remote memory accesses on non SMP systems. Yarn containers, on launch, will be pinned to a specific NUMA node and all subsequent memory allocations will be served by the same node, reducing remote memory accesses. The current default behavior is to spread memory across all NUMA nodes.
[jira] [Updated] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj K updated YARN-5764:
    Attachment: YARN-5764-v9.patch

> NUMA awareness support for launching containers
> ---
>
> Key: YARN-5764
> URL: https://issues.apache.org/jira/browse/YARN-5764
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager, yarn
> Reporter: Olasoji
> Assignee: Devaraj K
> Priority: Major
> Attachments: NUMA Awareness for YARN Containers.pdf, NUMA Performance Results.pdf, YARN-5764-v0.patch, YARN-5764-v1.patch, YARN-5764-v2.patch, YARN-5764-v3.patch, YARN-5764-v4.patch, YARN-5764-v5.patch, YARN-5764-v6.patch, YARN-5764-v7.patch, YARN-5764-v8.patch, YARN-5764-v9.patch
>
> The purpose of this feature is to improve Hadoop performance by minimizing costly remote memory accesses on non SMP systems. Yarn containers, on launch, will be pinned to a specific NUMA node and all subsequent memory allocations will be served by the same node, reducing remote memory accesses. The current default behavior is to spread memory across all NUMA nodes.
[jira] [Updated] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-5764: Attachment: YARN-5764-v8.patch > NUMA awareness support for launching containers > --- > > Key: YARN-5764 > URL: https://issues.apache.org/jira/browse/YARN-5764 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, yarn >Reporter: Olasoji >Assignee: Devaraj K >Priority: Major > Attachments: NUMA Awareness for YARN Containers.pdf, NUMA Performance > Results.pdf, YARN-5764-v0.patch, YARN-5764-v1.patch, YARN-5764-v2.patch, > YARN-5764-v3.patch, YARN-5764-v4.patch, YARN-5764-v5.patch, > YARN-5764-v6.patch, YARN-5764-v7.patch, YARN-5764-v8.patch > > > The purpose of this feature is to improve Hadoop performance by minimizing > costly remote memory accesses on non SMP systems. Yarn containers, on launch, > will be pinned to a specific NUMA node and all subsequent memory allocations > will be served by the same node, reducing remote memory accesses. The current > default behavior is to spread memory across all NUMA nodes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372275#comment-16372275 ] Devaraj K commented on YARN-5764: - [~miklos.szeg...@cloudera.com] Thanks for the comments. bq. Is MB not supported? The code converts the value to MB, and takes the value directly if it is already in MB. bq. Containers can change their resource usage. I do not see that supported, yet. It may need another jira. Agreed, I will create another JIRA to handle this. I have addressed the other comments in the patch; please take a look. > NUMA awareness support for launching containers > --- > > Key: YARN-5764 > URL: https://issues.apache.org/jira/browse/YARN-5764 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, yarn >Reporter: Olasoji >Assignee: Devaraj K >Priority: Major > Attachments: NUMA Awareness for YARN Containers.pdf, NUMA Performance > Results.pdf, YARN-5764-v0.patch, YARN-5764-v1.patch, YARN-5764-v2.patch, > YARN-5764-v3.patch, YARN-5764-v4.patch, YARN-5764-v5.patch, > YARN-5764-v6.patch, YARN-5764-v7.patch > > > The purpose of this feature is to improve Hadoop performance by minimizing > costly remote memory accesses on non SMP systems. Yarn containers, on launch, > will be pinned to a specific NUMA node and all subsequent memory allocations > will be served by the same node, reducing remote memory accesses. The current > default behavior is to spread memory across all NUMA nodes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-5764: Attachment: YARN-5764-v7.patch > NUMA awareness support for launching containers > --- > > Key: YARN-5764 > URL: https://issues.apache.org/jira/browse/YARN-5764 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, yarn >Reporter: Olasoji >Assignee: Devaraj K >Priority: Major > Attachments: NUMA Awareness for YARN Containers.pdf, NUMA Performance > Results.pdf, YARN-5764-v0.patch, YARN-5764-v1.patch, YARN-5764-v2.patch, > YARN-5764-v3.patch, YARN-5764-v4.patch, YARN-5764-v5.patch, > YARN-5764-v6.patch, YARN-5764-v7.patch > > > The purpose of this feature is to improve Hadoop performance by minimizing > costly remote memory accesses on non SMP systems. Yarn containers, on launch, > will be pinned to a specific NUMA node and all subsequent memory allocations > will be served by the same node, reducing remote memory accesses. The current > default behavior is to spread memory across all NUMA nodes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-5764: Attachment: YARN-5764-v6.patch > NUMA awareness support for launching containers > --- > > Key: YARN-5764 > URL: https://issues.apache.org/jira/browse/YARN-5764 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, yarn >Reporter: Olasoji >Assignee: Devaraj K >Priority: Major > Attachments: NUMA Awareness for YARN Containers.pdf, NUMA Performance > Results.pdf, YARN-5764-v0.patch, YARN-5764-v1.patch, YARN-5764-v2.patch, > YARN-5764-v3.patch, YARN-5764-v4.patch, YARN-5764-v5.patch, YARN-5764-v6.patch > > > The purpose of this feature is to improve Hadoop performance by minimizing > costly remote memory accesses on non SMP systems. Yarn containers, on launch, > will be pinned to a specific NUMA node and all subsequent memory allocations > will be served by the same node, reducing remote memory accesses. The current > default behavior is to spread memory across all NUMA nodes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361509#comment-16361509 ] Devaraj K commented on YARN-5764: - [~miklos.szeg...@cloudera.com] Sorry for the delay, I will update the patch. Thanks for reminding me. > NUMA awareness support for launching containers > --- > > Key: YARN-5764 > URL: https://issues.apache.org/jira/browse/YARN-5764 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, yarn >Reporter: Olasoji >Assignee: Devaraj K >Priority: Major > Attachments: NUMA Awareness for YARN Containers.pdf, NUMA Performance > Results.pdf, YARN-5764-v0.patch, YARN-5764-v1.patch, YARN-5764-v2.patch, > YARN-5764-v3.patch, YARN-5764-v4.patch, YARN-5764-v5.patch > > > The purpose of this feature is to improve Hadoop performance by minimizing > costly remote memory accesses on non SMP systems. Yarn containers, on launch, > will be pinned to a specific NUMA node and all subsequent memory allocations > will be served by the same node, reducing remote memory accesses. The current > default behavior is to spread memory across all NUMA nodes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16206622#comment-16206622 ] Devaraj K commented on YARN-5764: - Thanks [~miklos.szeg...@cloudera.com] for the review. bq. I see commented code in the body of the function and also in the unit tests. I commented out the recovery code and the related test code because YARN-7033 had not been committed when the patch was created. bq. is package-info.java necessary? It is necessary; checkstyle reports an error if it is missing. I will update the patch with the comments addressed and the code uncommented. > NUMA awareness support for launching containers > --- > > Key: YARN-5764 > URL: https://issues.apache.org/jira/browse/YARN-5764 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, yarn >Reporter: Olasoji >Assignee: Devaraj K > Attachments: NUMA Awareness for YARN Containers.pdf, NUMA Performance > Results.pdf, YARN-5764-v0.patch, YARN-5764-v1.patch, YARN-5764-v2.patch, > YARN-5764-v3.patch, YARN-5764-v4.patch, YARN-5764-v5.patch > > > The purpose of this feature is to improve Hadoop performance by minimizing > costly remote memory accesses on non SMP systems. Yarn containers, on launch, > will be pinned to a specific NUMA node and all subsequent memory allocations > will be served by the same node, reducing remote memory accesses. The current > default behavior is to spread memory across all NUMA nodes. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6620) [YARN-6223] NM Java side code changes to support isolate GPU devices by using CGroups
[ https://issues.apache.org/jira/browse/YARN-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16166780#comment-16166780 ] Devaraj K commented on YARN-6620: - Thanks [~leftnoteasy] for the quick patch. bq. Switched JAXB to handle XML parsing instead of check tags. Overall looking good, it would be better if you could group the adapter and other supporting classes as inner classes in PerGpuDeviceInformation. > [YARN-6223] NM Java side code changes to support isolate GPU devices by using > CGroups > - > > Key: YARN-6620 > URL: https://issues.apache.org/jira/browse/YARN-6620 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-6620.001.patch, YARN-6620.002.patch, > YARN-6620.003.patch, YARN-6620.004.patch, YARN-6620.005.patch, > YARN-6620.006-WIP.patch > > > This JIRA plan to add support of: > 1) GPU configuration for NodeManagers > 2) Isolation in CGroups. (Java side). > 3) NM restart and recovery allocated GPU devices -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6620) [YARN-6223] NM Java side code changes to support isolate GPU devices by using CGroups
[ https://issues.apache.org/jira/browse/YARN-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165119#comment-16165119 ] Devaraj K commented on YARN-6620: - Thanks [~leftnoteasy] for the responses. bq. My understanding of JAXBContext is mostly used when we need to convert between object and XML/JSON. Since output of nvidia-smi is a customized XML format, which doesn't follow JAXB standard. Is it still best practice to use JAXBContext under such use case? For example, FairScheduler parses XML file directly: AllocationFileLoaderService#reloadAllocations. JAXBContext can be used for any XML format; the document does not have to follow a specific format. The sample format in the patch can be converted to a Java object, so we can eliminate the traversing and parsing logic in GpuDeviceInformationParser.java. bq. I considered this option before, unless there's strong need for this to run different command or call Nvidia native APIs directly, I would prefer to hard code to use nvidia-smi instead of introducing another abstraction layer. I'm open to do refactoring to support this case once we have such requirements. I think it would be useful when users have symlinks created with names different from the hard-coded name. I feel we don't have to add a new configuration for the executable; instead, the binary name can be part of DEFAULT_NM_GPU_PATH_TO_EXEC, and users can provide the path including the executable name in the configuration 'yarn.nodemanager.resource.gpu.path-to-executables'. 
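The configuration suggestion in the comment above could look roughly like the following sketch: the configured value, when present, is treated as the full path including the executable name, and the bare binary name is only the default. The class and method names are hypothetical, and the fallback to a PATH lookup for the bare name is an assumption, not something the patch is known to do.

```java
// Hypothetical helper illustrating the suggestion; not code from the patch.
class GpuExecutableSketch {

  // Default when nothing is configured: a bare name, assumed to be found on PATH.
  static final String DEFAULT_NM_GPU_PATH_TO_EXEC = "nvidia-smi";

  // `configured` stands for the value of
  // yarn.nodemanager.resource.gpu.path-to-executables (may be null or blank).
  static String resolveExecutable(String configured) {
    if (configured == null || configured.trim().isEmpty()) {
      return DEFAULT_NM_GPU_PATH_TO_EXEC;
    }
    // The user supplies the full path including the executable name,
    // so a symlink with a different name works without extra configuration.
    return configured.trim();
  }

  public static void main(String[] args) {
    System.out.println(resolveExecutable(null));
    System.out.println(resolveExecutable("/opt/gpu/bin/my-smi-link"));
  }
}
```

With this shape there is a single setting to document, and the hard-coded name survives only as its default value.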
> [YARN-6223] NM Java side code changes to support isolate GPU devices by using > CGroups > - > > Key: YARN-6620 > URL: https://issues.apache.org/jira/browse/YARN-6620 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-6620.001.patch, YARN-6620.002.patch, > YARN-6620.003.patch, YARN-6620.004.patch, YARN-6620.005.patch > > > This JIRA plan to add support of: > 1) GPU configuration for NodeManagers > 2) Isolation in CGroups. (Java side). > 3) NM restart and recovery allocated GPU devices -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6620) [YARN-6223] NM Java side code changes to support isolate GPU devices by using CGroups
[ https://issues.apache.org/jira/browse/YARN-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162491#comment-16162491 ] Devaraj K commented on YARN-6620: - Thanks [~leftnoteasy] for the patch, great work! Here are some comments on the patch.

1. For the XML reading in GpuDeviceInformationParser.java, can we use an existing library like javax.xml.bind.JAXBContext to unmarshal the XML document to a Java object instead of reading it tag by tag?

2. If you don't agree to use an existing library for reading the XML file, the 'in' stream may have to be closed after reading/parsing.
{code:xml}
InputStream in = IOUtils.toInputStream(sanitizeXmlInput(xmlStr), "UTF-8");
doc = builder.parse(in);
{code}

3. Instead of hardcoding BINARY_NAME, can it be included as part of DEFAULT_NM_GPU_PATH_TO_EXEC as the default value, so that it also becomes configurable in case users want to change it?
{code:xml}
public static final String DEFAULT_NM_GPU_PATH_TO_EXEC = "";
protected static final String BINARY_NAME = "nvidia-smi";
{code}

4. Please change the inline comment here accordingly.
{code:xml}
+ /**
+ * Disk as a resource is disabled by default.
+ **/
+ @Private
+ public static final boolean DEFAULT_NM_GPU_RESOURCE_ENABLED = false;
{code}

5. Can we use spaces instead of tab characters for indentation in nvidia-smi-sample-output.xml?

6. Are we going to support multiple containers/processes (a limited number) sharing the same GPU device?

7.
{code:title=GpuResourceAllocator.java|borderStyle=solid}
for (int deviceNum : allowedGpuDevices) {
  if (!usedDevices.containsKey(deviceNum)) {
    usedDevices.put(deviceNum, containerId);
    assignedGpus.add(deviceNum);
    if (assignedGpus.size() == numRequestedGpuDevices) {
      break;
    }
  }
}

// Record in state store if we allocated anything
if (!assignedGpus.isEmpty()) {
  List allocatedDevices = new ArrayList<>();
  for (int gpu : assignedGpus) {
    allocatedDevices.add(String.valueOf(gpu));
  }
{code}
Can you merge these two for loops into one, like below?
{code:xml}
usedDevices.put(deviceNum, containerId);
assignedGpus.add(deviceNum);
allocatedDevices.add(String.valueOf(deviceNum));
{code}
Also, if the condition *if (assignedGpus.size() == numRequestedGpuDevices)* is never met, do we need to throw an exception or log an error?

8. I see that getGpuDeviceInformation() is invoked twice, which in turn executes a shell command and parses the XML output; both are costly operations. Do we need to execute it twice here?
{code:title=GpuResourceDiscoverPlugin.java|borderStyle=solid}
GpuDeviceInformation info = getGpuDeviceInformation();

LOG.info("Trying to discover GPU information ...");
GpuDeviceInformation info = getGpuDeviceInformation();
{code}
Also, I am not convinced that setConf() should contain logic other than assigning conf.
{code:xml}
public synchronized void setConf(Configuration conf) {
  this.conf = conf;
  numOfErrorExecutionSinceLastSucceed = 0;
  featureEnabled = conf.getBoolean(YarnConfiguration.NM_GPU_RESOURCE_ENABLED,
      YarnConfiguration.DEFAULT_NM_GPU_RESOURCE_ENABLED);
  if (featureEnabled) {
    String dir = conf.get(YarnConfiguration.NM_GPU_PATH_TO_EXEC, .
{code}
There are also comments reported by Hadoop QA that need to be fixed.
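As a sketch of what review item 7 asks for, the following self-contained example merges the two loops into one and makes the "not enough devices" case explicit. It reuses the variable names quoted from the patch, but the class, the method signature, and the choice to throw `IllegalStateException` before assigning anything are assumptions made for illustration; the real allocator may handle the shortfall differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the allocator logic discussed in item 7.
class GpuAllocationSketch {

  static List<String> allocate(List<Integer> allowedGpuDevices,
      Map<Integer, String> usedDevices, String containerId,
      int numRequestedGpuDevices) {
    // Count free devices first so a failed request leaves no partial assignment.
    int free = 0;
    for (int deviceNum : allowedGpuDevices) {
      if (!usedDevices.containsKey(deviceNum)) {
        free++;
      }
    }
    if (free < numRequestedGpuDevices) {
      throw new IllegalStateException("Requested " + numRequestedGpuDevices
          + " GPU devices but only " + free + " are free");
    }

    // Single loop: mark the device as used and record it for the state store.
    List<String> allocatedDevices = new ArrayList<>();
    for (int deviceNum : allowedGpuDevices) {
      if (allocatedDevices.size() == numRequestedGpuDevices) {
        break;
      }
      if (!usedDevices.containsKey(deviceNum)) {
        usedDevices.put(deviceNum, containerId);
        allocatedDevices.add(String.valueOf(deviceNum));
      }
    }
    return allocatedDevices;
  }

  public static void main(String[] args) {
    Map<Integer, String> used = new HashMap<>();
    used.put(0, "container_a"); // device 0 is already taken
    System.out.println(allocate(Arrays.asList(0, 1, 2, 3), used, "container_b", 2));
  }
}
```

Checking the size before adding a device, rather than after, also means a request for zero devices assigns nothing.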
> [YARN-6223] NM Java side code changes to support isolate GPU devices by using > CGroups > - > > Key: YARN-6620 > URL: https://issues.apache.org/jira/browse/YARN-6620 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-6620.001.patch, YARN-6620.002.patch, > YARN-6620.003.patch, YARN-6620.004.patch, YARN-6620.005.patch > > > This JIRA plan to add support of: > 1) GPU configuration for NodeManagers > 2) Isolation in CGroups. (Java side). > 3) NM restart and recovery allocated GPU devices -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7033) Add support for NM Recovery of assigned resources(GPU's, NUMA, FPGA's) to container
[ https://issues.apache.org/jira/browse/YARN-7033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16156530#comment-16156530 ] Devaraj K commented on YARN-7033: - Thanks [~sunilg] for the review and agreeing with us. [~leftnoteasy], can you commit this if you don't have any further comments? Thanks > Add support for NM Recovery of assigned resources(GPU's, NUMA, FPGA's) to > container > --- > > Key: YARN-7033 > URL: https://issues.apache.org/jira/browse/YARN-7033 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: Devaraj K >Assignee: Devaraj K > Attachments: YARN-7033-v0.patch, YARN-7033-v1.patch, > YARN-7033-v2.patch, YARN-7033-v3.patch, YARN-7033-v4.patch > > > This JIRA adds the common logic to store the assigned resources to container > such as GPU's(YARN-6620), NUMA(YARN-5764) and FPGA's(YARN-5983) etc. and > recover upon restart of NM. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7033) Add support for NM Recovery of assigned resources(GPU's, NUMA, FPGA's) to container
[ https://issues.apache.org/jira/browse/YARN-7033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150734#comment-16150734 ] Devaraj K commented on YARN-7033: - [~sunilg], Thanks again for looking into this. bq. In NMLeveldbStateStoreService, CONTAINER_ASSIGNED_RESOURCES_KEY_SUFFIX will be updated only in case of GPU's, NUMA, FPGA's cases correct. If some one adds a custom resource after YARN-3926, will this code hit ? Here CONTAINER_ASSIGNED_RESOURCES_KEY_SUFFIX is combined with the resourceType and used as the key for each container's assigned resources of that type, so it should work for any resourceType. Nothing binds it to only the GPU, NUMA, and FPGA types. Please let me know if this doesn't clarify it or if you have any other thoughts. > Add support for NM Recovery of assigned resources(GPU's, NUMA, FPGA's) to > container > --- > > Key: YARN-7033 > URL: https://issues.apache.org/jira/browse/YARN-7033 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: Devaraj K >Assignee: Devaraj K > Attachments: YARN-7033-v0.patch, YARN-7033-v1.patch, > YARN-7033-v2.patch, YARN-7033-v3.patch, YARN-7033-v4.patch > > > This JIRA adds the common logic to store the assigned resources to container > such as GPU's(YARN-6620), NUMA(YARN-5764) and FPGA's(YARN-5983) etc. and > recover upon restart of NM. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
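The key scheme described in the reply above can be sketched as follows: the suffix constant is concatenated with the resource type, so every resource type gets its own state-store key under the container. Only the constant name CONTAINER_ASSIGNED_RESOURCES_KEY_SUFFIX comes from the thread; the class, the method, the prefix constant, and both string values are assumptions chosen for illustration.

```java
// Hypothetical illustration of per-resource-type state-store keys.
class AssignedResourcesKeySketch {

  // Assumed values; the actual strings in NMLeveldbStateStoreService may differ.
  static final String CONTAINERS_KEY_PREFIX = "ContainerManager/containers/";
  static final String CONTAINER_ASSIGNED_RESOURCES_KEY_SUFFIX = "/assignedResources_";

  // One key per (container, resourceType) pair; nothing here is GPU/NUMA/FPGA specific.
  static String keyFor(String containerId, String resourceType) {
    return CONTAINERS_KEY_PREFIX + containerId
        + CONTAINER_ASSIGNED_RESOURCES_KEY_SUFFIX + resourceType;
  }

  public static void main(String[] args) {
    System.out.println(keyFor("container_1", "gpu"));
    System.out.println(keyFor("container_1", "my-custom-resource"));
  }
}
```

Because the resource type is just a string appended to the key, a custom resource type added later gets a distinct key without any change to the store.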
[jira] [Commented] (YARN-7033) Add support for NM Recovery of assigned resources(GPU's, NUMA, FPGA's) to container
[ https://issues.apache.org/jira/browse/YARN-7033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149295#comment-16149295 ] Devaraj K commented on YARN-7033: - [~sunilg], can you check the latest patch? Thanks > Add support for NM Recovery of assigned resources(GPU's, NUMA, FPGA's) to > container > --- > > Key: YARN-7033 > URL: https://issues.apache.org/jira/browse/YARN-7033 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: Devaraj K >Assignee: Devaraj K > Attachments: YARN-7033-v0.patch, YARN-7033-v1.patch, > YARN-7033-v2.patch, YARN-7033-v3.patch, YARN-7033-v4.patch > > > This JIRA adds the common logic to store the assigned resources to container > such as GPU's(YARN-6620), NUMA(YARN-5764) and FPGA's(YARN-5983) etc. and > recover upon restart of NM. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7033) Add support for NM Recovery of assigned resources(GPU's, NUMA, FPGA's) to container
[ https://issues.apache.org/jira/browse/YARN-7033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-7033: Attachment: YARN-7033-v4.patch The previous patch could not be applied due to the recent commits, attaching the rebased patch with latest changes. > Add support for NM Recovery of assigned resources(GPU's, NUMA, FPGA's) to > container > --- > > Key: YARN-7033 > URL: https://issues.apache.org/jira/browse/YARN-7033 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: Devaraj K >Assignee: Devaraj K > Attachments: YARN-7033-v0.patch, YARN-7033-v1.patch, > YARN-7033-v2.patch, YARN-7033-v3.patch, YARN-7033-v4.patch > > > This JIRA adds the common logic to store the assigned resources to container > such as GPU's(YARN-6620), NUMA(YARN-5764) and FPGA's(YARN-5983) etc. and > recover upon restart of NM. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7033) Add support for NM Recovery of assigned resources(GPU's, NUMA, FPGA's) to container
[ https://issues.apache.org/jira/browse/YARN-7033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-7033: Attachment: YARN-7033-v3.patch > Add support for NM Recovery of assigned resources(GPU's, NUMA, FPGA's) to > container > --- > > Key: YARN-7033 > URL: https://issues.apache.org/jira/browse/YARN-7033 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: Devaraj K >Assignee: Devaraj K > Attachments: YARN-7033-v0.patch, YARN-7033-v1.patch, > YARN-7033-v2.patch, YARN-7033-v3.patch > > > This JIRA adds the common logic to store the assigned resources to container > such as GPU's(YARN-6620), NUMA(YARN-5764) and FPGA's(YARN-5983) etc. and > recover upon restart of NM. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7033) Add support for NM Recovery of assigned resources(GPU's, NUMA, FPGA's) to container
[ https://issues.apache.org/jira/browse/YARN-7033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146663#comment-16146663 ] Devaraj K commented on YARN-7033: - Thanks [~leftnoteasy] and [~sunilg] for the confirmation, I will update the patch with the revert of enum change. > Add support for NM Recovery of assigned resources(GPU's, NUMA, FPGA's) to > container > --- > > Key: YARN-7033 > URL: https://issues.apache.org/jira/browse/YARN-7033 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: Devaraj K >Assignee: Devaraj K > Attachments: YARN-7033-v0.patch, YARN-7033-v1.patch, > YARN-7033-v2.patch > > > This JIRA adds the common logic to store the assigned resources to container > such as GPU's(YARN-6620), NUMA(YARN-5764) and FPGA's(YARN-5983) etc. and > recover upon restart of NM. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7033) Add support for NM Recovery of assigned resources(GPU's, NUMA, FPGA's) to container
[ https://issues.apache.org/jira/browse/YARN-7033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-7033: Attachment: YARN-7033-v2.patch > Add support for NM Recovery of assigned resources(GPU's, NUMA, FPGA's) to > container > --- > > Key: YARN-7033 > URL: https://issues.apache.org/jira/browse/YARN-7033 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: Devaraj K >Assignee: Devaraj K > Attachments: YARN-7033-v0.patch, YARN-7033-v1.patch, > YARN-7033-v2.patch > > > This JIRA adds the common logic to store the assigned resources to container > such as GPU's(YARN-6620), NUMA(YARN-5764) and FPGA's(YARN-5983) etc. and > recover upon restart of NM. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-5764: Attachment: YARN-5764-v5.patch > NUMA awareness support for launching containers > --- > > Key: YARN-5764 > URL: https://issues.apache.org/jira/browse/YARN-5764 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, yarn >Reporter: Olasoji >Assignee: Devaraj K > Attachments: NUMA Awareness for YARN Containers.pdf, NUMA Performance > Results.pdf, YARN-5764-v0.patch, YARN-5764-v1.patch, YARN-5764-v2.patch, > YARN-5764-v3.patch, YARN-5764-v4.patch, YARN-5764-v5.patch > > > The purpose of this feature is to improve Hadoop performance by minimizing > costly remote memory accesses on non SMP systems. Yarn containers, on launch, > will be pinned to a specific NUMA node and all subsequent memory allocations > will be served by the same node, reducing remote memory accesses. The current > default behavior is to spread memory across all NUMA nodes. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-5764: Attachment: YARN-5764-v4.patch Updated the patch to use ResourceHandlerModule API's. > NUMA awareness support for launching containers > --- > > Key: YARN-5764 > URL: https://issues.apache.org/jira/browse/YARN-5764 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, yarn >Reporter: Olasoji >Assignee: Devaraj K > Attachments: NUMA Awareness for YARN Containers.pdf, NUMA Performance > Results.pdf, YARN-5764-v0.patch, YARN-5764-v1.patch, YARN-5764-v2.patch, > YARN-5764-v3.patch, YARN-5764-v4.patch > > > The purpose of this feature is to improve Hadoop performance by minimizing > costly remote memory accesses on non SMP systems. Yarn containers, on launch, > will be pinned to a specific NUMA node and all subsequent memory allocations > will be served by the same node, reducing remote memory accesses. The current > default behavior is to spread memory across all NUMA nodes. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7079) to support nodemanager ports management
[ https://issues.apache.org/jira/browse/YARN-7079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138895#comment-16138895 ] Devaraj K commented on YARN-7079: - [~tianjuan428], Thanks for the patch. The patch seems quite large, and I think you've put good effort into this. Can you also upload the design draft or details if you have any? > to support nodemanager ports management > - > > Key: YARN-7079 > URL: https://issues.apache.org/jira/browse/YARN-7079 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: 田娟娟 > Attachments: YARN_7079.001.patch > > > Just like the vcores and memory, ports is also important resource > information to job allocation . So we add the ports management logic to yarn. > It can meet the user jobs' ports request, and never allocate two jobs(with > same port requirement) to one machine. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7033) Add support for NM Recovery of assigned resources(GPU's, NUMA, FPGA's) to container
[ https://issues.apache.org/jira/browse/YARN-7033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-7033: Attachment: YARN-7033-v1.patch Attaching patch with checkstyle and whitespace errors fixes. > Add support for NM Recovery of assigned resources(GPU's, NUMA, FPGA's) to > container > --- > > Key: YARN-7033 > URL: https://issues.apache.org/jira/browse/YARN-7033 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: Devaraj K >Assignee: Devaraj K > Attachments: YARN-7033-v0.patch, YARN-7033-v1.patch > > > This JIRA adds the common logic to store the assigned resources to container > such as GPU's(YARN-6620), NUMA(YARN-5764) and FPGA's(YARN-5983) etc. and > recover upon restart of NM. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7033) Add support for NM Recovery of assigned resources(GPU's, NUMA, FPGA's) to container
[ https://issues.apache.org/jira/browse/YARN-7033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-7033: Attachment: YARN-7033-v0.patch Attaching the patch which contains the common code from YARN-6620 to handle the assigned resources recovery. > Add support for NM Recovery of assigned resources(GPU's, NUMA, FPGA's) to > container > --- > > Key: YARN-7033 > URL: https://issues.apache.org/jira/browse/YARN-7033 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: Devaraj K >Assignee: Devaraj K > Attachments: YARN-7033-v0.patch > > > This JIRA adds the common logic to store the assigned resources to container > such as GPU's(YARN-6620), NUMA(YARN-5764) and FPGA's(YARN-5983) etc. and > recover upon restart of NM. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16129994#comment-16129994 ] Devaraj K commented on YARN-5764: - I've created YARN-7033 to move the common logic from YARN-6620 to handle the recovery of assigned resources. > NUMA awareness support for launching containers > --- > > Key: YARN-5764 > URL: https://issues.apache.org/jira/browse/YARN-5764 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, yarn >Reporter: Olasoji >Assignee: Devaraj K > Attachments: NUMA Awareness for YARN Containers.pdf, NUMA Performance > Results.pdf, YARN-5764-v0.patch, YARN-5764-v1.patch, YARN-5764-v2.patch, > YARN-5764-v3.patch > > > The purpose of this feature is to improve Hadoop performance by minimizing > costly remote memory accesses on non SMP systems. Yarn containers, on launch, > will be pinned to a specific NUMA node and all subsequent memory allocations > will be served by the same node, reducing remote memory accesses. The current > default behavior is to spread memory across all NUMA nodes. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7033) Add support for NM Recovery of assigned resources(GPU's, NUMA, FPGA's) to container
Devaraj K created YARN-7033: --- Summary: Add support for NM Recovery of assigned resources(GPU's, NUMA, FPGA's) to container Key: YARN-7033 URL: https://issues.apache.org/jira/browse/YARN-7033 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Devaraj K Assignee: Devaraj K This JIRA adds the common logic to store the assigned resources to container such as GPU's(YARN-6620), NUMA(YARN-5764) and FPGA's(YARN-5983) etc. and recover upon restart of NM. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16129294#comment-16129294 ] Devaraj K commented on YARN-5764: - Thanks [~leftnoteasy] for the details and the direction. bq. however since ResourceHandler API is not added to DefaultContainerExecutor, it needs some extra effort to bring ResourceHandlerModule API to DefaultContainerExecutor, which I'm not sure if it worths If it is not worth making changes to support DefaultContainerExecutor, we can proceed with LinuxContainerExecutor now and look into the feasibility for DefaultContainerExecutor in the future. bq. If you plan to work on this feature in short term (say 1 month), we may need to split common libraries to a separate JIRA and commit to trunk first to unblock this one. I can do it two weeks after, if you want to speed it up, please feel free to take it up. I can take up this task and split the common code from YARN-6620 into a separate JIRA to handle the NM recovery of assigned resources. > NUMA awareness support for launching containers > --- > > Key: YARN-5764 > URL: https://issues.apache.org/jira/browse/YARN-5764 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, yarn >Reporter: Olasoji >Assignee: Devaraj K > Attachments: NUMA Awareness for YARN Containers.pdf, NUMA Performance > Results.pdf, YARN-5764-v0.patch, YARN-5764-v1.patch, YARN-5764-v2.patch, > YARN-5764-v3.patch > > > The purpose of this feature is to improve Hadoop performance by minimizing > costly remote memory accesses on non SMP systems. Yarn containers, on launch, > will be pinned to a specific NUMA node and all subsequent memory allocations > will be served by the same node, reducing remote memory accesses. The current > default behavior is to spread memory across all NUMA nodes. 
[jira] [Commented] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128521#comment-16128521 ] Devaraj K commented on YARN-5764: - Hi [~leftnoteasy], bq. It added numa controller for both default container executor and linux container executor, does it make sense to use this feature under default container executor since CPU asks might be ignored in RM side (so asking 100 vcores is same as asking 1 vcores). I think it would be useful when the user uses default container executor with DominantResourceCalculator, please correct me if I am wrong. Thanks > NUMA awareness support for launching containers > --- > > Key: YARN-5764 > URL: https://issues.apache.org/jira/browse/YARN-5764 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, yarn >Reporter: Olasoji >Assignee: Devaraj K > Attachments: NUMA Awareness for YARN Containers.pdf, NUMA Performance > Results.pdf, YARN-5764-v0.patch, YARN-5764-v1.patch, YARN-5764-v2.patch, > YARN-5764-v3.patch > > > The purpose of this feature is to improve Hadoop performance by minimizing > costly remote memory accesses on non SMP systems. Yarn containers, on launch, > will be pinned to a specific NUMA node and all subsequent memory allocations > will be served by the same node, reducing remote memory accesses. The current > default behavior is to spread memory across all NUMA nodes. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128107#comment-16128107 ] Devaraj K commented on YARN-5764: - Thanks [~leftnoteasy] for looking into the patch and for the suggestions, will update the patch with the suggestions. > NUMA awareness support for launching containers > --- > > Key: YARN-5764 > URL: https://issues.apache.org/jira/browse/YARN-5764 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, yarn >Reporter: Olasoji >Assignee: Devaraj K > Attachments: NUMA Awareness for YARN Containers.pdf, NUMA Performance > Results.pdf, YARN-5764-v0.patch, YARN-5764-v1.patch, YARN-5764-v2.patch, > YARN-5764-v3.patch > > > The purpose of this feature is to improve Hadoop performance by minimizing > costly remote memory accesses on non SMP systems. Yarn containers, on launch, > will be pinned to a specific NUMA node and all subsequent memory allocations > will be served by the same node, reducing remote memory accesses. The current > default behavior is to spread memory across all NUMA nodes. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-5764: Attachment: YARN-5764-v3.patch > NUMA awareness support for launching containers > --- > > Key: YARN-5764 > URL: https://issues.apache.org/jira/browse/YARN-5764 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, yarn >Reporter: Olasoji >Assignee: Devaraj K > Attachments: NUMA Awareness for YARN Containers.pdf, NUMA Performance > Results.pdf, YARN-5764-v0.patch, YARN-5764-v1.patch, YARN-5764-v2.patch, > YARN-5764-v3.patch > > > The purpose of this feature is to improve Hadoop performance by minimizing > costly remote memory accesses on non SMP systems. Yarn containers, on launch, > will be pinned to a specific NUMA node and all subsequent memory allocations > will be served by the same node, reducing remote memory accesses. The current > default behavior is to spread memory across all NUMA nodes. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5983) [Umbrella] Support for FPGA as a Resource in YARN
[ https://issues.apache.org/jira/browse/YARN-5983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15988319#comment-15988319 ] Devaraj K commented on YARN-5983: - Thanks [~tangzhankun] and [~zyluo] for the design doc and hard work, and [~leftnoteasy] for the discussion. 1. {code:xml} The scheduler only considers non-exclusive resource. The exclusive resources may have extra attributes needs to be matched when scheduling. Not just simply add or reduce a number. For instance, in our PoC, a FPGA slot in one node may already have one IP flashed so that the scheduler should try to match this IP attribute to reuse it. {code} If you are passing all the attributes of the FPGA resources to the RM scheduler, why do you want NM-side resource management as well? Can you describe, in abstract terms, which attributes are passed to the RM and which details are maintained by the NM-side resource management? 2. {code:xml} Device resource needs additional preparation and isolation before container launch. For instance, FPGA device may need to download an IP file from a repo then flash to an allocated FPGA slot. {code} Does this need to be done for each container, or can it be done once during cluster installation? 3. Can FPGA slots be shared by multiple containers? How do we prevent a container or application that has not been allocated FPGA resources from using them? 4. Are there any changes to ContainerExecutor? How does the application code running in the container learn which FPGA resource was allocated so that it can access the FPGA? 5. What configuration does the user need to set for an application to use FPGA resources? 
> [Umbrella] Support for FPGA as a Resource in YARN > - > > Key: YARN-5983 > URL: https://issues.apache.org/jira/browse/YARN-5983 > Project: Hadoop YARN > Issue Type: New Feature > Components: yarn >Reporter: Zhankun Tang >Assignee: Zhankun Tang > Attachments: YARN-5983-Support-FPGA-resource-on-NM-side_v1.pdf > > > As various big data workload running on YARN, CPU will no longer scale > eventually and heterogeneous systems will become more important. ML/DL is a > rising star in recent years, applications focused on these areas have to > utilize GPU or FPGA to boost performance. Also, hardware vendors such as > Intel also invest in such hardware. It is most likely that FPGA will become > popular in data centers like CPU in the near future. > So YARN as a resource managing and scheduling system, would be great to > evolve to support this. This JIRA proposes FPGA to be a first-class citizen. > The changes roughly includes: > 1. FPGA resource detection and heartbeat > 2. Scheduler changes > 3. FPGA related preparation and isolation before launch container > We know that YARN-3926 is trying to extend current resource model. But still > we can leave some FPGA related discussion here -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963602#comment-15963602 ] Devaraj K commented on YARN-5764: - Thanks [~rajesh.balamohan] for taking a look into this. bq. Was this flag (-XX:useNUMA) enabled in the tasks when running the benchmark? Yes, I used {{-XX:+UseNUMA}} for running the benchmark. bq. Hive on MR is outdated, network intensive and slow. It would be great, if BB benchmark can be run with Hive on Tez which optimizes queries to a great extent. It has much better resource utilization and also elimiates a lot of IO barriers and would be a lot efficient than MR codebase. I haven't tried BB with Hive on Tez. Here we are not evaluating the performance of the BB execution engines, and I think 'Hive on MR' or any other component would be fine to showcase the performance benefits of the NUMA patch. > NUMA awareness support for launching containers > --- > > Key: YARN-5764 > URL: https://issues.apache.org/jira/browse/YARN-5764 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, yarn >Reporter: Olasoji >Assignee: Devaraj K > Attachments: NUMA Awareness for YARN Containers.pdf, NUMA Performance > Results.pdf, YARN-5764-v0.patch, YARN-5764-v1.patch, YARN-5764-v2.patch > > > The purpose of this feature is to improve Hadoop performance by minimizing > costly remote memory accesses on non SMP systems. Yarn containers, on launch, > will be pinned to a specific NUMA node and all subsequent memory allocations > will be served by the same node, reducing remote memory accesses. The current > default behavior is to spread memory across all NUMA nodes. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-5764: Attachment: YARN-5764-v2.patch Updating the patch with the 'interleave' option for memory. > NUMA awareness support for launching containers > --- > > Key: YARN-5764 > URL: https://issues.apache.org/jira/browse/YARN-5764 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, yarn >Reporter: Olasoji >Assignee: Devaraj K > Attachments: NUMA Awareness for YARN Containers.pdf, NUMA Performance > Results.pdf, YARN-5764-v0.patch, YARN-5764-v1.patch, YARN-5764-v2.patch > > > The purpose of this feature is to improve Hadoop performance by minimizing > costly remote memory accesses on non SMP systems. Yarn containers, on launch, > will be pinned to a specific NUMA node and all subsequent memory allocations > will be served by the same node, reducing remote memory accesses. The current > default behavior is to spread memory across all NUMA nodes. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-5764: Attachment: NUMA Performance Results.pdf > NUMA awareness support for launching containers > --- > > Key: YARN-5764 > URL: https://issues.apache.org/jira/browse/YARN-5764 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, yarn >Reporter: Olasoji >Assignee: Devaraj K > Attachments: NUMA Awareness for YARN Containers.pdf, NUMA Performance > Results.pdf, YARN-5764-v0.patch, YARN-5764-v1.patch > > > The purpose of this feature is to improve Hadoop performance by minimizing > costly remote memory accesses on non SMP systems. Yarn containers, on launch, > will be pinned to a specific NUMA node and all subsequent memory allocations > will be served by the same node, reducing remote memory accesses. The current > default behavior is to spread memory across all NUMA nodes. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15819132#comment-15819132 ] Devaraj K commented on YARN-5764: - bq. Do you have any benchmarks results that would illustrate the kind of performance gains that could potentially be realised with this patch? Thanks [~raviprak] for going through this. I will share the performance results here. Thanks [~sunilg] for the comments. bq. if NM is taking the decision based on cores (NUMA cpus), it ll be more container specific. Could we apply it more of application specific where few apps containers only will be NUMA aware. bq. Also I think such NUMA aware nodes could be controlled within a specific nodelabel, I think it may yield better use cases for NUMA. So during NM init, such awareness info could be passed to RM and it can be made as node attribute. Such nodes could then be labelled together as well. If we want to run an application only on NUMA aware nodes, we can group the NUMA aware nodes under a node-label and specify this node-label for the application. As for making this application specific, I am wondering why an application would not want to run with NUMA awareness when the NM supports it and it gives some perf gain. We can also expose this as an attribute once the constraint node labels (YARN-3409) feature gets in. > NUMA awareness support for launching containers > --- > > Key: YARN-5764 > URL: https://issues.apache.org/jira/browse/YARN-5764 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, yarn >Reporter: Olasoji >Assignee: Devaraj K > Attachments: NUMA Awareness for YARN Containers.pdf, > YARN-5764-v0.patch, YARN-5764-v1.patch > > > The purpose of this feature is to improve Hadoop performance by minimizing > costly remote memory accesses on non SMP systems. 
Yarn containers, on launch, > will be pinned to a specific NUMA node and all subsequent memory allocations > will be served by the same node, reducing remote memory accesses. The current > default behavior is to spread memory across all NUMA nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15816512#comment-15816512 ] Devaraj K commented on YARN-5764: - Thanks a lot [~leftnoteasy] for review and comments. bq. What is the benefit to manually specify NUMA node? Since this is potentially complex for end user to specify, I think it's better to directly read data from OS. If users want to share the NUMA resources of the Node Manager machine with non-Yarn applications, they can use this declaration to specify which NUMA nodes, and how much of each node's capacity, Yarn may use. I understand there are configurations for specifying the NUMA nodes and each node's memory and CPUs. But if we have no provision for setting aside NUMA resources for Yarn, the resources used by Yarn and non-Yarn applications could end up overlapping. bq. Does the changes work on platform other than Linux? This patch works on Linux; if the approach is agreeable, I will update it for Windows as well. bq. I'm not quite sure about if this could happen: with this patch, YARN will launch process one by one on each NUMA node to bind memory/cpu. Is it possible that there's another process (outside of YARN) uses memory of NUMA node which causes processes launched by YARN failed to bind or run? I do think it could happen for memory. We can avoid it by using the NUMA node topology declaration to specify the NUMA resources available to Yarn applications. It would also not be an issue with the soft binding option you mention in the comment below. bq. This patch uses hard binding (get allocated resource on specified node or fail), is it better to specify soft binding (prefer to allocate and can also accept other node). I think soft binding should be default behavior to support NUMA. I think it is a good suggestion. I can update the patch accordingly by changing '\--membind=nodes' to '\--preferred=node'. I look forward to your further comments. 
> NUMA awareness support for launching containers > --- > > Key: YARN-5764 > URL: https://issues.apache.org/jira/browse/YARN-5764 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, yarn >Reporter: Olasoji >Assignee: Devaraj K > Attachments: NUMA Awareness for YARN Containers.pdf, > YARN-5764-v0.patch, YARN-5764-v1.patch > > > The purpose of this feature is to improve Hadoop performance by minimizing > costly remote memory accesses on non SMP systems. Yarn containers, on launch, > will be pinned to a specific NUMA node and all subsequent memory allocations > will be served by the same node, reducing remote memory accesses. The current > default behavior is to spread memory across all NUMA nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
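The hard versus soft binding discussed in the comment above can be illustrated with a minimal sketch. It assumes the container launch command is simply prefixed with numactl; the class name NumaCommandSketch, the MemoryPolicy enum, and the launch_container.sh argument are hypothetical and are not taken from the YARN-5764 patch. Only the numactl options --cpunodebind, --membind and --preferred are real.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not part of the YARN-5764 patch. */
final class NumaCommandSketch {

  /** HARD fails the allocation if the node is full; SOFT falls back to other nodes. */
  enum MemoryPolicy { HARD, SOFT }

  /** Prefixes a container command with numactl options for one NUMA node. */
  static List<String> buildCommand(int numaNode, MemoryPolicy policy,
      List<String> containerCommand) {
    List<String> command = new ArrayList<>();
    command.add("numactl");
    // Run the process only on the CPUs of the chosen NUMA node.
    command.add("--cpunodebind=" + numaNode);
    if (policy == MemoryPolicy.HARD) {
      // Hard binding: memory may only come from this node.
      command.add("--membind=" + numaNode);
    } else {
      // Soft binding: prefer this node, but accept memory from others.
      command.add("--preferred=" + numaNode);
    }
    command.addAll(containerCommand);
    return command;
  }

  public static void main(String[] args) {
    List<String> launch = Arrays.asList("bash", "launch_container.sh");
    System.out.println(String.join(" ", buildCommand(0, MemoryPolicy.HARD, launch)));
    System.out.println(String.join(" ", buildCommand(0, MemoryPolicy.SOFT, launch)));
  }
}
```

With soft binding, a non-Yarn process that has already consumed memory on the chosen node would not make the container launch fail, which is the concern raised in the review.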
[jira] [Commented] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15806024#comment-15806024 ] Devaraj K commented on YARN-5764: - Thanks [~rohithsharma] for going through this. bq. NUMA resources is scheduled by by NodeManager. Why can't RM make the decision of scheduling NUMA resources using resource profilers.? With NUMA, the memory blocks and processors in a single machine are divided into NUMA nodes, and the processors in a NUMA node can access the memory local to that node faster. If we want the RM to do this scheduling, each NM has to send its NUMA node information (i.e. List{(numanode-id, processors, memory),..}) to the RM, and the RM has to maintain this information, including the usage details, for scheduling. At present the RM already schedules NM memory and vcores as a whole, and I think it is cumbersome to move NUMA node scheduling, which is scheduling at a more granular level, to the RM. bq. Could you elaborate, why there are multiple numa-awareness.node-ids in single machine? In the Non-Uniform Memory Access (NUMA) model, the memory blocks and processors in a single machine are divided into multiple NUMA nodes, and each NUMA node has an id assigned to it. When the user/application wants to make use of the NUMA resources, the process should be bound to those NUMA node ids. > NUMA awareness support for launching containers > --- > > Key: YARN-5764 > URL: https://issues.apache.org/jira/browse/YARN-5764 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, yarn >Reporter: Olasoji >Assignee: Devaraj K > Attachments: NUMA Awareness for YARN Containers.pdf, > YARN-5764-v0.patch, YARN-5764-v1.patch > > > The purpose of this feature is to improve Hadoop performance by minimizing > costly remote memory accesses on non SMP systems. 
Yarn containers, on launch, > will be pinned to a specific NUMA node and all subsequent memory allocations > will be served by the same node, reducing remote memory accesses. The current > default behavior is to spread memory across all NUMA nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
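The per-node bookkeeping described in the comment above (a list of NUMA node id, processors, memory, plus usage details) could look roughly like the following on the NM side. This is only a sketch under stated assumptions: the class NumaAllocatorSketch, its first-fit policy, and all field names are hypothetical, and the actual patch may track and choose nodes differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical NM-side bookkeeping for NUMA nodes; not the patch's implementation. */
final class NumaAllocatorSketch {

  /** One NUMA node: its id, its capacity, and what is currently in use. */
  static final class NumaNode {
    final int id;
    final long totalMemoryMb;
    final int totalCpus;
    long usedMemoryMb;
    int usedCpus;

    NumaNode(int id, long totalMemoryMb, int totalCpus) {
      this.id = id;
      this.totalMemoryMb = totalMemoryMb;
      this.totalCpus = totalCpus;
    }

    boolean fits(long memoryMb, int cpus) {
      return totalMemoryMb - usedMemoryMb >= memoryMb && totalCpus - usedCpus >= cpus;
    }
  }

  private final List<NumaNode> nodes = new ArrayList<>();

  void addNode(int id, long memoryMb, int cpus) {
    nodes.add(new NumaNode(id, memoryMb, cpus));
  }

  /** First-fit (an assumed policy): returns the chosen node id, or -1 if none fits. */
  int allocate(long memoryMb, int cpus) {
    for (NumaNode node : nodes) {
      if (node.fits(memoryMb, cpus)) {
        node.usedMemoryMb += memoryMb;
        node.usedCpus += cpus;
        return node.id;
      }
    }
    return -1;
  }

  /** Returns the resources of a finished container to its node. */
  void release(int nodeId, long memoryMb, int cpus) {
    for (NumaNode node : nodes) {
      if (node.id == nodeId) {
        node.usedMemoryMb -= memoryMb;
        node.usedCpus -= cpus;
        return;
      }
    }
  }

  public static void main(String[] args) {
    NumaAllocatorSketch allocator = new NumaAllocatorSketch();
    allocator.addNode(0, 4096, 2);
    allocator.addNode(1, 4096, 2);
    System.out.println("container A on node " + allocator.allocate(3072, 1));
    System.out.println("container B on node " + allocator.allocate(2048, 1));
  }
}
```

Keeping this state in the NM means the RM continues to see only the node-wide memory and vcores, which is the trade-off argued for in the comment.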
[jira] [Commented] (YARN-6061) Add a customized uncaughtexceptionhandler for fair scheduler
[ https://issues.apache.org/jira/browse/YARN-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15805654#comment-15805654 ] Devaraj K commented on YARN-6061: - Shouldn't this already handle it? https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java#L1378 > Add a customized uncaughtexceptionhandler for fair scheduler > > > Key: YARN-6061 > URL: https://issues.apache.org/jira/browse/YARN-6061 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler, yarn >Reporter: Yufei Gu >Assignee: Yufei Gu > Labels: fairscheduler > > There are several threads in fair scheduler. The thread will quit when there > is a runtime exception inside it. We should bring down the RM when that > happens. Otherwise, there may be some weird behavior in RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
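For readers unfamiliar with the mechanism under discussion, here is a minimal sketch of a per-thread uncaught exception handler using only the JDK API (Thread.setUncaughtExceptionHandler). The class SchedulerThreadHandlerSketch and the onFatal hook are hypothetical; in a real ResourceManager the hook would presumably trigger a shutdown, which this sketch deliberately does not do.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical illustration of guarding a background thread; not Hadoop code. */
final class SchedulerThreadHandlerSketch {

  /**
   * Runs the work on a new thread and waits for it to finish. If the work
   * throws, the handler records the error and invokes the onFatal hook.
   * Returns the recorded error, or null if the thread ended normally.
   */
  static Throwable runGuarded(Runnable work, Runnable onFatal) {
    AtomicReference<Throwable> failure = new AtomicReference<>();
    Thread thread = new Thread(work, "scheduler-thread-sketch");
    thread.setUncaughtExceptionHandler((t, error) -> {
      failure.set(error);
      onFatal.run(); // assumption: a real service would stop itself here
    });
    thread.start();
    try {
      thread.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return failure.get();
  }

  public static void main(String[] args) {
    Throwable error = runGuarded(
        () -> { throw new IllegalStateException("update thread failed"); },
        () -> System.out.println("fatal hook invoked"));
    System.out.println("captured: " + error);
  }
}
```

Without such a handler, or a process-wide default handler, a runtime exception ends only the failing thread and the rest of the process keeps running, which is the situation the issue description warns about.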
[jira] [Updated] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-5764: Attachment: YARN-5764-v1.patch > NUMA awareness support for launching containers > --- > > Key: YARN-5764 > URL: https://issues.apache.org/jira/browse/YARN-5764 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, yarn >Reporter: Olasoji >Assignee: Devaraj K > Attachments: NUMA Awareness for YARN Containers.pdf, > YARN-5764-v0.patch, YARN-5764-v1.patch > > > The purpose of this feature is to improve Hadoop performance by minimizing > costly remote memory accesses on non SMP systems. Yarn containers, on launch, > will be pinned to a specific NUMA node and all subsequent memory allocations > will be served by the same node, reducing remote memory accesses. The current > default behavior is to spread memory across all NUMA nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-5764: Attachment: YARN-5764-v0.patch Attaching the patch for this. > NUMA awareness support for launching containers > --- > > Key: YARN-5764 > URL: https://issues.apache.org/jira/browse/YARN-5764 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, yarn >Reporter: Olasoji >Assignee: Devaraj K > Attachments: NUMA Awareness for YARN Containers.pdf, > YARN-5764-v0.patch > > > The purpose of this feature is to improve Hadoop performance by minimizing > costly remote memory accesses on non SMP systems. Yarn containers, on launch, > will be pinned to a specific NUMA node and all subsequent memory allocations > will be served by the same node, reducing remote memory accesses. The current > default behavior is to spread memory across all NUMA nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-5764: Attachment: NUMA Awareness for YARN Containers.pdf Please find the attached proposal document and provide your feedback/suggestions. I will upload a patch soon with this approach for better understanding. > NUMA awareness support for launching containers > --- > > Key: YARN-5764 > URL: https://issues.apache.org/jira/browse/YARN-5764 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, yarn >Reporter: Olasoji >Assignee: Devaraj K > Attachments: NUMA Awareness for YARN Containers.pdf > > > The purpose of this feature is to improve Hadoop performance by minimizing > costly remote memory accesses on non SMP systems. Yarn containers, on launch, > will be pinned to a specific NUMA node and all subsequent memory allocations > will be served by the same node, reducing remote memory accesses. The current > default behavior is to spread memory across all NUMA nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K reassigned YARN-5764: --- Assignee: Devaraj K I will upload the design proposal for this. > NUMA awareness support for launching containers > --- > > Key: YARN-5764 > URL: https://issues.apache.org/jira/browse/YARN-5764 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, yarn > Environment: SW: CentOS 6.7, Hadoop 2.6.0 > Processors: Intel Xeon CPU E5-2699 v4 @2.20GHz > Memory: 256GB 4 NUMA nodes >Reporter: Olasoji >Assignee: Devaraj K > > The purpose of this feature is to improve Hadoop performance by minimizing > costly remote memory accesses on non SMP systems. Yarn containers, on launch, > will be pinned to a specific NUMA node and all subsequent memory allocations > will be served by the same node, reducing remote memory accesses. The current > default behavior is to spread memory across all NUMA nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-5764: Environment: (was: SW: CentOS 6.7, Hadoop 2.6.0 Processors: Intel Xeon CPU E5-2699 v4 @2.20GHz Memory: 256GB 4 NUMA nodes) > NUMA awareness support for launching containers > --- > > Key: YARN-5764 > URL: https://issues.apache.org/jira/browse/YARN-5764 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, yarn >Reporter: Olasoji >Assignee: Devaraj K > > The purpose of this feature is to improve Hadoop performance by minimizing > costly remote memory accesses on non SMP systems. Yarn containers, on launch, > will be pinned to a specific NUMA node and all subsequent memory allocations > will be served by the same node, reducing remote memory accesses. The current > default behavior is to spread memory across all NUMA nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-3409) Add constraint node labels
[ https://issues.apache.org/jira/browse/YARN-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15697368#comment-15697368 ] Devaraj K commented on YARN-3409: - Thanks [~Naganarasimha]/[~varun_saxena] for the document and others for the discussion. - {code:xml} String labelExpression, String constraintLabelExpression, // New modification in the interface {code} - As Bibin mentioned above, the name 'constraintLabelExpression' is confusing because it raises the question of why we need two label expressions. I too think we need different naming if we are going to have this param/config. - Can NodeManagers have attribute names that are the same as a label/partition name in the cluster? Did you consider having one expression (the existing one) that handles both the node label expression and the constraints expression, without a delimiter between them? Then support for constraints expressions could be added without any new configurations/interfaces. - Can we have some details about how the NodeManager reports these attributes to the ResourceManager? > Add constraint node labels > -- > > Key: YARN-3409 > URL: https://issues.apache.org/jira/browse/YARN-3409 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, capacityscheduler, client >Reporter: Wangda Tan >Assignee: Naganarasimha G R > Attachments: Constraint-Node-Labels-Requirements-Design-doc_v1.pdf > > > Specify only one label for each node (IAW, partition a cluster) is a way to > determinate how resources of a special set of nodes could be shared by a > group of entities (like teams, departments, etc.). Partitions of a cluster > has following characteristics: > - Cluster divided to several disjoint sub clusters. > - ACL/priority can apply on partition (Only market team / marke team has > priority to use the partition). > - Percentage of capacities can apply on partition (Market team has 40% > minimum capacity and Dev team has 60% of minimum capacity of the partition). 
> Constraints are orthogonal to partition, they’re describing attributes of > node’s hardware/software just for affinity. Some example of constraints: > - glibc version > - JDK version > - Type of CPU (x86_64/i686) > - Type of OS (windows, linux, etc.) > With this, application can be able to ask for resource has (glibc.version >= > 2.20 && JDK.version >= 8u20 && x86_64). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-3732) Change NodeHeartbeatResponse.java and RegisterNodeManagerResponse.java as abstract classes
[ https://issues.apache.org/jira/browse/YARN-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616520#comment-15616520 ] Devaraj K commented on YARN-3732: - Thanks [~rohithsharma] for the review and commit. > Change NodeHeartbeatResponse.java and RegisterNodeManagerResponse.java as > abstract classes > -- > > Key: YARN-3732 > URL: https://issues.apache.org/jira/browse/YARN-3732 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Devaraj K >Assignee: Devaraj K >Priority: Minor > Fix For: 3.0.0-alpha2 > > Attachments: YARN-3732-1.patch, YARN-3732-2.patch, YARN-3732.patch > > > All the other protocol record classes are abstract classes. Change > NodeHeartbeatResponse.java and RegisterNodeManagerResponse.java as abstract > classes to make it consistent with other protocol record classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-3732) Change NodeHeartbeatResponse.java and RegisterNodeManagerResponse.java as abstract classes
[ https://issues.apache.org/jira/browse/YARN-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15614197#comment-15614197 ] Devaraj K commented on YARN-3732: - ASF License warnings are not related to the patch. > Change NodeHeartbeatResponse.java and RegisterNodeManagerResponse.java as > abstract classes > -- > > Key: YARN-3732 > URL: https://issues.apache.org/jira/browse/YARN-3732 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Devaraj K >Assignee: Devaraj K >Priority: Minor > Attachments: YARN-3732-1.patch, YARN-3732-2.patch, YARN-3732.patch > > > All the other protocol record classes are abstract classes. Change > NodeHeartbeatResponse.java and RegisterNodeManagerResponse.java as abstract > classes to make it consistent with other protocol record classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-3732) Change NodeHeartbeatResponse.java and RegisterNodeManagerResponse.java as abstract classes
[ https://issues.apache.org/jira/browse/YARN-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-3732: Attachment: YARN-3732-2.patch Updated the patch against trunk. > Change NodeHeartbeatResponse.java and RegisterNodeManagerResponse.java as > abstract classes > -- > > Key: YARN-3732 > URL: https://issues.apache.org/jira/browse/YARN-3732 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Devaraj K >Assignee: Devaraj K >Priority: Minor > Attachments: YARN-3732-1.patch, YARN-3732-2.patch, YARN-3732.patch > > > All the other protocol record classes are abstract classes. Change > NodeHeartbeatResponse.java and RegisterNodeManagerResponse.java as abstract > classes to make it consistent with other protocol record classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-3861) Add fav icon to YARN & MR daemons web UI
[ https://issues.apache.org/jira/browse/YARN-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15613564#comment-15613564 ] Devaraj K commented on YARN-3861: - Thanks [~rchiang] for the comment, can you provide the icon if you have that? > Add fav icon to YARN & MR daemons web UI > > > Key: YARN-3861 > URL: https://issues.apache.org/jira/browse/YARN-3861 > Project: Hadoop YARN > Issue Type: Improvement > Components: webapp >Reporter: Devaraj K >Assignee: Devaraj K > Labels: oct16-easy > Attachments: RM UI in Chrome-With Patch.png, RM UI in Chrome-Without > Patch.png, RM UI in IE-With Patch.png, RM UI in IE-Without Patch.png.png, > YARN-3861.patch, hadoop-fav.png > > > Add fav icon image to all YARN & MR daemons web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-3732) Change NodeHeartbeatResponse.java and RegisterNodeManagerResponse.java as abstract classes
[ https://issues.apache.org/jira/browse/YARN-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611093#comment-15611093 ] Devaraj K commented on YARN-3732: - Thanks [~rohithsharma] for checking this, will update the patch for the trunk. > Change NodeHeartbeatResponse.java and RegisterNodeManagerResponse.java as > abstract classes > -- > > Key: YARN-3732 > URL: https://issues.apache.org/jira/browse/YARN-3732 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Devaraj K >Assignee: Devaraj K >Priority: Minor > Attachments: YARN-3732-1.patch, YARN-3732.patch > > > All the other protocol record classes are abstract classes. Change > NodeHeartbeatResponse.java and RegisterNodeManagerResponse.java as abstract > classes to make it consistent with other protocol record classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-5764: Affects Version/s: (was: 2.6.0) > NUMA awareness support for launching containers > --- > > Key: YARN-5764 > URL: https://issues.apache.org/jira/browse/YARN-5764 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, yarn > Environment: SW: CentOS 6.7, Hadoop 2.6.0 > Processors: Intel Xeon CPU E5-2699 v4 @2.20GHz > Memory: 256GB 4 NUMA nodes >Reporter: Olasoji > > The purpose of this feature is to improve Hadoop performance by minimizing > costly remote memory accesses on non SMP systems. Yarn containers, on launch, > will be pinned to a specific NUMA node and all subsequent memory allocations > will be served by the same node, reducing remote memory accesses. The current > default behavior is to spread memory across all NUMA nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5764) NUMA awareness support for launching containers
[ https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-5764: Fix Version/s: (was: 2.6.0) > NUMA awareness support for launching containers > --- > > Key: YARN-5764 > URL: https://issues.apache.org/jira/browse/YARN-5764 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, yarn >Affects Versions: 2.6.0 > Environment: SW: CentOS 6.7, Hadoop 2.6.0 > Processors: Intel Xeon CPU E5-2699 v4 @2.20GHz > Memory: 256GB 4 NUMA nodes >Reporter: Olasoji > > The purpose of this feature is to improve Hadoop performance by minimizing > costly remote memory accesses on non SMP systems. Yarn containers, on launch, > will be pinned to a specific NUMA node and all subsequent memory allocations > will be served by the same node, reducing remote memory accesses. The current > default behavior is to spread memory across all NUMA nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4547) LeafQueue#getApplications() is read-only interface, but it provides reference to caller
[ https://issues.apache.org/jira/browse/YARN-4547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15150015#comment-15150015 ]

Devaraj K commented on YARN-4547:
---------------------------------

Shouldn't the resolution be Duplicate instead of Done?

> LeafQueue#getApplications() is read-only interface, but it provides reference to caller
> ---
>
> Key: YARN-4547
> URL: https://issues.apache.org/jira/browse/YARN-4547
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler
> Reporter: Rohith Sharma K S
> Assignee: Rohith Sharma K S
>
> The API below is a read-only interface, but it returns a reference to the caller.
> This allows the caller to modify the orderingPolicy entities. If a reference to the
> ordering policy is required, the caller can use
> {{LeafQueue#getOrderingPolicy()#getSchedulableEntities()}}
> The returned object should be a clone of
> orderingPolicy.getSchedulableEntities()
> {code}
> /**
>  * Obtain (read-only) collection of active applications.
>  */
> public Collection getApplications() {
>   return orderingPolicy.getSchedulableEntities();
> }
> {code}
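The defensive-copy idea from the YARN-4547 description can be sketched as follows. This is only an illustration under stated assumptions, not the actual LeafQueue code: the class `QueueSketch`, its `String` element type, and the methods `addApplication` and `getApplicationsLeaky` are hypothetical stand-ins for the real scheduler types.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a queue that tracks active applications.
class QueueSketch {
  private final List<String> schedulableEntities = new ArrayList<>();

  public void addApplication(String appId) {
    schedulableEntities.add(appId);
  }

  // Leaky variant: hands the internal list itself to the caller,
  // so the caller can modify the queue's state.
  public Collection<String> getApplicationsLeaky() {
    return schedulableEntities;
  }

  // Safer variant: returns an unmodifiable snapshot (a copy) of the entities.
  public Collection<String> getApplications() {
    return Collections.unmodifiableCollection(
        new ArrayList<>(schedulableEntities));
  }

  public static void main(String[] args) {
    QueueSketch queue = new QueueSketch();
    queue.addApplication("app_1");

    Collection<String> snapshot = queue.getApplications();
    queue.addApplication("app_2"); // does not affect the earlier snapshot

    try {
      snapshot.add("app_3"); // rejected: the snapshot is read-only
    } catch (UnsupportedOperationException e) {
      System.out.println("snapshot is read-only");
    }

    // The leaky variant lets a caller change the queue's internal list.
    queue.getApplicationsLeaky().clear();
    System.out.println(queue.getApplications().isEmpty());
  }
}
```

Copying on every call costs an allocation per invocation; wrapping the live collection in an unmodifiable view without copying would be cheaper, but the caller would then observe later changes.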
[jira] [Commented] (YARN-4624) NPE in PartitionQueueCapacitiesInfo while accessing Schduler UI
[ https://issues.apache.org/jira/browse/YARN-4624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143175#comment-15143175 ] Devaraj K commented on YARN-4624: - Thanks [~brahmareddy] for the updated patch. {code:xml} + capacities.getMaxAMLimitPercentage() == 0 + ? 0 : capacities.getMaxAMLimitPercentage())). {code} Don't we need to check for null instead of 0 here? Please verify the scenario with the patch changes. > NPE in PartitionQueueCapacitiesInfo while accessing Schduler UI > --- > > Key: YARN-4624 > URL: https://issues.apache.org/jira/browse/YARN-4624 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Brahma Reddy Battula >Assignee: Brahma Reddy Battula > Attachments: YARN-2674-002.patch, YARN-4624.patch > > > Scenario: > === > Configure nodelables and add to cluster > Start the cluster > {noformat} > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.PartitionQueueCapacitiesInfo.getMaxAMLimitPercentage(PartitionQueueCapacitiesInfo.java:114) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(CapacitySchedulerPage.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderLeafQueueInfoWithPartition(CapacitySchedulerPage.java:105) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.render(CapacitySchedulerPage.java:94) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43) > at > org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$LI._(Hamlet.java:7702) > at > 
org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueueBlock.render(CapacitySchedulerPage.java:293) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43) > at > org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$LI._(Hamlet.java:7702) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:447) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
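The review question about checking for null rather than 0 can be illustrated with a small sketch. It assumes, hypothetically, that the getter returns a boxed `Float` that may be null when nothing is configured; the real PartitionQueueCapacitiesInfo may differ, and `CapacitiesSketch`, `unsafePercentage`, and `safePercentage` are names invented for this example.

```java
// Hypothetical stand-in; not the real PartitionQueueCapacitiesInfo.
class CapacitiesSketch {
  // Assumption: boxed value that stays null when nothing is configured.
  private final Float maxAMLimitPercentage;

  CapacitiesSketch(Float maxAMLimitPercentage) {
    this.maxAMLimitPercentage = maxAMLimitPercentage;
  }

  public Float getMaxAMLimitPercentage() {
    return maxAMLimitPercentage;
  }

  // Comparing a boxed Float with 0 unboxes it first,
  // so a null value throws NullPointerException here.
  public static float unsafePercentage(CapacitiesSketch c) {
    return c.getMaxAMLimitPercentage() == 0 ? 0 : c.getMaxAMLimitPercentage();
  }

  // Checking for null before unboxing avoids the exception.
  public static float safePercentage(CapacitiesSketch c) {
    Float value = c.getMaxAMLimitPercentage();
    return value == null ? 0f : value;
  }

  public static void main(String[] args) {
    CapacitiesSketch unset = new CapacitiesSketch(null);
    System.out.println(safePercentage(unset));
    try {
      unsafePercentage(unset);
    } catch (NullPointerException e) {
      System.out.println("comparison with 0 failed on a null value");
    }
  }
}
```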
[jira] [Commented] (YARN-2266) Add an application timeout service in RM to kill applications which are not getting resources
[ https://issues.apache.org/jira/browse/YARN-2266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143169#comment-15143169 ]

Devaraj K commented on YARN-2266:
---------------------------------

Duplicate of YARN-3813

> Add an application timeout service in RM to kill applications which are not getting resources
> -
>
> Key: YARN-2266
> URL: https://issues.apache.org/jira/browse/YARN-2266
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager
> Reporter: Ashutosh Jindal
>
> Currently, if an application is submitted to the RM, the app keeps waiting until
> resources are allocated for the AM. Such an application may be stuck until a
> resource is allocated for the AM, which may be due to over-utilization of
> queue or user limits, etc. In a production cluster, some periodically running
> applications may have a smaller cluster share. So after waiting for some time,
> if resources are still not available, such applications can be marked as failed.
[jira] [Updated] (YARN-4667) RM Admin CLI for refreshNodesResources throws NPE when nothing is configured
[ https://issues.apache.org/jira/browse/YARN-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-4667: Hadoop Flags: Reviewed +1, lgtm, committing it. > RM Admin CLI for refreshNodesResources throws NPE when nothing is configured > > > Key: YARN-4667 > URL: https://issues.apache.org/jira/browse/YARN-4667 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: YARN-4667.v1.001.patch > > > {quote} > $ ./yarn rmadmin -refreshNodesResources > 16/02/03 10:54:27 INFO client.RMProxy: Connecting to ResourceManager at > /0.0.0.0:8033 > refreshNodesResources: java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshNodesResources(AdminService.java:655) > at > org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshNodesResources(ResourceManagerAdministrationProtocolPBServiceImpl.java:246) > at > org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:287) > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-65) Reduce RM app memory footprint once app has completed
[ https://issues.apache.org/jira/browse/YARN-65?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K reassigned YARN-65: - Assignee: (was: Devaraj K) > Reduce RM app memory footprint once app has completed > - > > Key: YARN-65 > URL: https://issues.apache.org/jira/browse/YARN-65 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 0.23.3 >Reporter: Jason Lowe > > The ResourceManager holds onto a configurable number of completed > applications (yarn.resource.max-completed-applications, defaults to 1), > and the memory footprint of these completed applications can be significant. > For example, the {{submissionContext}} in RMAppImpl contains references to > protocolbuffer objects and other items that probably aren't necessary to keep > around once the application has completed. We could significantly reduce the > memory footprint of the RM by releasing objects that are no longer necessary > once an application completes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4624) NPE in PartitionQueueCapacitiesInfo while accessing Schduler UI
[ https://issues.apache.org/jira/browse/YARN-4624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15129844#comment-15129844 ] Devaraj K commented on YARN-4624: - Thanks [~brahmareddy] for reporting and providing patch. Would you mind adding a test for this as part of the patch? > NPE in PartitionQueueCapacitiesInfo while accessing Schduler UI > --- > > Key: YARN-4624 > URL: https://issues.apache.org/jira/browse/YARN-4624 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Brahma Reddy Battula >Assignee: Brahma Reddy Battula >Priority: Blocker > Attachments: YARN-4624.patch > > > Scenario: > === > Configure nodelables and add to cluster > Start the cluster > {noformat} > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.PartitionQueueCapacitiesInfo.getMaxAMLimitPercentage(PartitionQueueCapacitiesInfo.java:114) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(CapacitySchedulerPage.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderLeafQueueInfoWithPartition(CapacitySchedulerPage.java:105) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.render(CapacitySchedulerPage.java:94) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43) > at > org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$LI._(Hamlet.java:7702) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueueBlock.render(CapacitySchedulerPage.java:293) > at > 
org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43) > at > org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$LI._(Hamlet.java:7702) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:447) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4624) NPE in PartitionQueueCapacitiesInfo while accessing Schduler UI
[ https://issues.apache.org/jira/browse/YARN-4624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-4624: Priority: Major (was: Blocker) > NPE in PartitionQueueCapacitiesInfo while accessing Schduler UI > --- > > Key: YARN-4624 > URL: https://issues.apache.org/jira/browse/YARN-4624 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Brahma Reddy Battula >Assignee: Brahma Reddy Battula > Attachments: YARN-4624.patch > > > Scenario: > === > Configure nodelables and add to cluster > Start the cluster > {noformat} > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.PartitionQueueCapacitiesInfo.getMaxAMLimitPercentage(PartitionQueueCapacitiesInfo.java:114) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(CapacitySchedulerPage.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderLeafQueueInfoWithPartition(CapacitySchedulerPage.java:105) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.render(CapacitySchedulerPage.java:94) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43) > at > org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$LI._(Hamlet.java:7702) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueueBlock.render(CapacitySchedulerPage.java:293) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at 
org.apache.hadoop.yarn.webapp.View.render(View.java:235) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43) > at > org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$LI._(Hamlet.java:7702) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:447) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4100) Add Documentation for Distributed and Delegated-Centralized Node Labels feature
[ https://issues.apache.org/jira/browse/YARN-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-4100: Hadoop Flags: Reviewed +1, lgtm, will commit it shortly. > Add Documentation for Distributed and Delegated-Centralized Node Labels > feature > --- > > Key: YARN-4100 > URL: https://issues.apache.org/jira/browse/YARN-4100 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: NodeLabel.html, YARN-4100.v1.001.patch, > YARN-4100.v1.002.patch, YARN-4100.v1.003.patch, YARN-4100.v1.004.patch, > YARN-4100.v1.005.patch > > > Add Documentation for Distributed Node Labels feature -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4100) Add Documentation for Distributed and Delegated-Centralized Node Labels feature
[ https://issues.apache.org/jira/browse/YARN-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15125994#comment-15125994 ]

Devaraj K commented on YARN-4100:
---------------------------------

Thanks [~Naganarasimha] for the updated patch with the comment fixes. The latest patch looks good to me; I will commit it tomorrow if there are no further comments from others.

> Add Documentation for Distributed and Delegated-Centralized Node Labels feature
> ---
>
> Key: YARN-4100
> URL: https://issues.apache.org/jira/browse/YARN-4100
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: api, client, resourcemanager
> Reporter: Naganarasimha G R
> Assignee: Naganarasimha G R
> Attachments: NodeLabel.html, YARN-4100.v1.001.patch, YARN-4100.v1.002.patch, YARN-4100.v1.003.patch, YARN-4100.v1.004.patch, YARN-4100.v1.005.patch
>
> Add Documentation for Distributed Node Labels feature
[jira] [Updated] (YARN-4411) RMAppAttemptImpl#createApplicationAttemptReport throws IllegalArgumentException
[ https://issues.apache.org/jira/browse/YARN-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj K updated YARN-4411:
---------------------------
Hadoop Flags: Reviewed
Summary: RMAppAttemptImpl#createApplicationAttemptReport throws IllegalArgumentException (was: ResourceManager IllegalArgumentException error)

+1, lgtm, will commit it shortly.

> RMAppAttemptImpl#createApplicationAttemptReport throws IllegalArgumentException
> ---
>
> Key: YARN-4411
> URL: https://issues.apache.org/jira/browse/YARN-4411
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.1
> Reporter: yarntime
> Assignee: Bibin A Chundatt
> Attachments: 0002-YARN-4411.patch, 0003-YARN-4411.patch, YARN-4411.001.patch
>
> In version 2.7.1, line 1914 of RMAppAttemptImpl may throw an IllegalArgumentException:
> YarnApplicationAttemptState.valueOf(this.getState().toString())
> This happens because this.getState() returns an RMAppAttemptState, which may not be
> convertible to YarnApplicationAttemptState.
> {noformat} > java.lang.IllegalArgumentException: No enum constant > org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.LAUNCHED_UNMANAGED_SAVING > at java.lang.Enum.valueOf(Enum.java:236) > at > org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.valueOf(YarnApplicationAttemptState.java:27) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.createApplicationAttemptReport(RMAppAttemptImpl.java:1870) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationAttemptReport(ClientRMService.java:355) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationAttemptReport(ApplicationClientProtocolPBServiceImpl.java:355) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:425) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
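The failure mode in this trace, a name-based `valueOf` conversion between two enums that do not share all constants, can be reproduced with a small sketch. The enums below are trimmed-down, hypothetical stand-ins for RMAppAttemptState and YarnApplicationAttemptState, and the explicit mapping (reporting FINAL_SAVING as RUNNING) is an assumption made for illustration, not necessarily what the committed patch does.

```java
// Hypothetical, trimmed-down enums; the real YARN enums have more states.
class AttemptStateSketch {
  enum InternalState { RUNNING, FINAL_SAVING, FINISHED }

  enum PublicState { RUNNING, FINISHED }

  // Name-based conversion: throws IllegalArgumentException for any
  // internal state that has no constant of the same name in PublicState.
  public static PublicState convertByName(InternalState state) {
    return PublicState.valueOf(state.toString());
  }

  // Explicit mapping: every internal state is assigned a public state.
  public static PublicState convertExplicitly(InternalState state) {
    switch (state) {
      case RUNNING:
      case FINAL_SAVING: // assumption: still reported as running while saving
        return PublicState.RUNNING;
      case FINISHED:
        return PublicState.FINISHED;
      default:
        throw new IllegalStateException("Unmapped state: " + state);
    }
  }

  public static void main(String[] args) {
    System.out.println(convertExplicitly(InternalState.FINAL_SAVING));
    try {
      convertByName(InternalState.FINAL_SAVING);
    } catch (IllegalArgumentException e) {
      System.out.println("name-based conversion failed: " + e.getMessage());
    }
  }
}
```

An explicit mapping keeps the conversion total: adding a new internal state fails loudly in the `default` branch rather than surfacing only when a client happens to request a report in that state.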
[jira] [Commented] (YARN-4411) ResourceManager IllegalArgumentException error
[ https://issues.apache.org/jira/browse/YARN-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15121029#comment-15121029 ]

Devaraj K commented on YARN-4411:
---------------------------------

Thanks [~bibinchundatt] for the explanation.

I don't see any problem if we remove this condition; the test still passes. I see you are trying to test the FINAL_SAVING state explicitly, but my point is that there is no need to restrict the createApplicationAttemptReport() invocation here when the state is FINAL_SAVING; the test can check all the states, including FINAL_SAVING.
{code:xml}
+ if (!rmAppAttemptState.equals(RMAppAttemptState.FINAL_SAVING)) {
{code}

> ResourceManager IllegalArgumentException error
> --
>
> Key: YARN-4411
> URL: https://issues.apache.org/jira/browse/YARN-4411
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.1
> Reporter: yarntime
> Assignee: Bibin A Chundatt
> Attachments: 0002-YARN-4411.patch, YARN-4411.001.patch
>
> In version 2.7.1, line 1914 of RMAppAttemptImpl may throw an IllegalArgumentException:
> YarnApplicationAttemptState.valueOf(this.getState().toString())
> This happens because this.getState() returns an RMAppAttemptState, which may not be
> convertible to YarnApplicationAttemptState.
> {noformat} > java.lang.IllegalArgumentException: No enum constant > org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.LAUNCHED_UNMANAGED_SAVING > at java.lang.Enum.valueOf(Enum.java:236) > at > org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.valueOf(YarnApplicationAttemptState.java:27) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.createApplicationAttemptReport(RMAppAttemptImpl.java:1870) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationAttemptReport(ClientRMService.java:355) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationAttemptReport(ApplicationClientProtocolPBServiceImpl.java:355) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:425) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4411) ResourceManager IllegalArgumentException error
[ https://issues.apache.org/jira/browse/YARN-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15120871#comment-15120871 ]

Devaraj K commented on YARN-4411:
---------------------------------

Thanks [~bibinchundatt] for the updated patch.

- I don't understand why we need this. Do you see any problem if we invoke attempt.createApplicationAttemptReport() when the state is other than RMAppAttemptState.FINAL_SAVING? I think we can create the ApplicationAttemptReport irrespective of the state.
{code:xml}
+ if (!rmAppAttemptState.equals(RMAppAttemptState.FINAL_SAVING)) {
{code}
- Can you tell me when the application attempt state would be null? If it is not really needed, we can remove this assertion; if you decide to keep this statement, please add an assertion message.
{code:xml}
+ assertTrue(null != attemptreport.getYarnApplicationAttemptState());
{code}

> ResourceManager IllegalArgumentException error
> --
>
> Key: YARN-4411
> URL: https://issues.apache.org/jira/browse/YARN-4411
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.1
> Reporter: yarntime
> Assignee: Bibin A Chundatt
> Attachments: 0002-YARN-4411.patch, YARN-4411.001.patch
>
> In version 2.7.1, line 1914 of RMAppAttemptImpl may throw an IllegalArgumentException:
> YarnApplicationAttemptState.valueOf(this.getState().toString())
> This happens because this.getState() returns an RMAppAttemptState, which may not be
> convertible to YarnApplicationAttemptState.
> {noformat} > java.lang.IllegalArgumentException: No enum constant > org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.LAUNCHED_UNMANAGED_SAVING > at java.lang.Enum.valueOf(Enum.java:236) > at > org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.valueOf(YarnApplicationAttemptState.java:27) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.createApplicationAttemptReport(RMAppAttemptImpl.java:1870) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationAttemptReport(ClientRMService.java:355) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationAttemptReport(ApplicationClientProtocolPBServiceImpl.java:355) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:425) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4587) IllegalArgumentException in RMAppAttemptImpl#createApplicationAttemptReport
[ https://issues.apache.org/jira/browse/YARN-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118793#comment-15118793 ] Devaraj K commented on YARN-4587: - [~bibinchundatt], thanks for the quick response and the updated patch. I see you are uploading patches to both JIRAs. Please close one of them as a duplicate and continue with the other. Thanks > IllegalArgumentException in RMAppAttemptImpl#createApplicationAttemptReport > --- > > Key: YARN-4587 > URL: https://issues.apache.org/jira/browse/YARN-4587 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4587.patch > > > {noformat} > it status: -102 > 2016-01-13 13:35:42,281 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: > appattempt_1452672118921_0002_04 State change from RUNNING to FINAL_SAVING > 2016-01-13 13:35:42,286 ERROR org.apache.hadoop.yarn.server.webapp.AppBlock: > Failed to read the attempts of the application application_1452672118921_0002. 
> java.lang.IllegalArgumentException: No enum constant > org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.FINAL_SAVING > at java.lang.Enum.valueOf(Enum.java:238) > at > org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.valueOf(YarnApplicationAttemptState.java:27) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.createApplicationAttemptReport(RMAppAttemptImpl.java:2073) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationAttempts(ClientRMService.java:436) > at > org.apache.hadoop.yarn.server.webapp.AppBlock$2.run(AppBlock.java:230) > at > org.apache.hadoop.yarn.server.webapp.AppBlock$2.run(AppBlock.java:227) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1705) > at > org.apache.hadoop.yarn.server.webapp.AppBlock.render(AppBlock.java:226) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMAppBlock.render(RMAppBlock.java:65) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > at > org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) > at > org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845) > at > org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71) > at > org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82) > at > org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.app(RmController.java:54) > at sun.reflect.GeneratedMethodAccessor89.invoke(Unknown Source) > {noformat} > At 
{{RMAppAttemptImpl#createApplicationAttemptReport}} > {noformat} >attemptReport = ApplicationAttemptReport.newInstance(this > .getAppAttemptId(), this.getHost(), this.getRpcPort(), this > .getTrackingUrl(), this.getOriginalTrackingUrl(), > this.getDiagnostics(), > YarnApplicationAttemptState.valueOf(this.getState().toString()), > amId, this.startTime, this.finishTime); > {noformat} > {{YarnApplicationAttemptState}} mismatch with {{RMAppAttemptState}} for > FINAL_SAVING -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4100) Add Documentation for Distributed and Delegated-Centralized Node Labels feature
[ https://issues.apache.org/jira/browse/YARN-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118778#comment-15118778 ] Devaraj K commented on YARN-4100: - Thanks [~Naganarasimha] for the patch, and sorry for the late reply. The latest patch looks fine to me except for the points below. - Can you rephrase the sentence below to something like "Administrators can configure the provider for the node labels by configuring this parameter in NM"? {code:xml} +in RM, Administrators can configure in NM the provider for the node labels by configuring this parameter. {code} - {{This would be helpfull}}: can you correct this to "helpful"? - {{If user don’t specify “(exclusive=…)”, execlusive}}: please change "execlusive" to "exclusive". - Can you remove the space between the package name and the class name in {{org.apache.hadoop.yarn.server.resourcemanager.nodelabels. RMNodeLabelsMappingProvider}}? > Add Documentation for Distributed and Delegated-Centralized Node Labels > feature > --- > > Key: YARN-4100 > URL: https://issues.apache.org/jira/browse/YARN-4100 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: NodeLabel.html, YARN-4100.v1.001.patch, > YARN-4100.v1.002.patch, YARN-4100.v1.003.patch, YARN-4100.v1.004.patch > > > Add Documentation for Distributed Node Labels feature -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4587) IllegalArgumentException in RMAppAttemptImpl#createApplicationAttemptReport
[ https://issues.apache.org/jira/browse/YARN-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15114998#comment-15114998 ] Devaraj K commented on YARN-4587: - As per the conversation in YARN-4411, I see they both agreed that Bibin would provide a patch with a test. Providing a patch with a test in YARN-4411 or in this JIRA would be OK for me. > IllegalArgumentException in RMAppAttemptImpl#createApplicationAttemptReport > --- > > Key: YARN-4587 > URL: https://issues.apache.org/jira/browse/YARN-4587 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4587.patch > > > {noformat} > it status: -102 > 2016-01-13 13:35:42,281 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: > appattempt_1452672118921_0002_04 State change from RUNNING to FINAL_SAVING > 2016-01-13 13:35:42,286 ERROR org.apache.hadoop.yarn.server.webapp.AppBlock: > Failed to read the attempts of the application application_1452672118921_0002. 
> java.lang.IllegalArgumentException: No enum constant > org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.FINAL_SAVING > at java.lang.Enum.valueOf(Enum.java:238) > at > org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.valueOf(YarnApplicationAttemptState.java:27) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.createApplicationAttemptReport(RMAppAttemptImpl.java:2073) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationAttempts(ClientRMService.java:436) > at > org.apache.hadoop.yarn.server.webapp.AppBlock$2.run(AppBlock.java:230) > at > org.apache.hadoop.yarn.server.webapp.AppBlock$2.run(AppBlock.java:227) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1705) > at > org.apache.hadoop.yarn.server.webapp.AppBlock.render(AppBlock.java:226) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMAppBlock.render(RMAppBlock.java:65) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > at > org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) > at > org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845) > at > org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71) > at > org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82) > at > org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.app(RmController.java:54) > at sun.reflect.GeneratedMethodAccessor89.invoke(Unknown Source) > {noformat} > At 
{{RMAppAttemptImpl#createApplicationAttemptReport}} > {noformat} >attemptReport = ApplicationAttemptReport.newInstance(this > .getAppAttemptId(), this.getHost(), this.getRpcPort(), this > .getTrackingUrl(), this.getOriginalTrackingUrl(), > this.getDiagnostics(), > YarnApplicationAttemptState.valueOf(this.getState().toString()), > amId, this.startTime, this.finishTime); > {noformat} > {{YarnApplicationAttemptState}} mismatch with {{RMAppAttemptState}} for > FINAL_SAVING -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4587) IllegalArgumentException in RMAppAttemptImpl#createApplicationAttemptReport
[ https://issues.apache.org/jira/browse/YARN-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15114911#comment-15114911 ] Devaraj K commented on YARN-4587: - Thanks [~bibinchundatt] for the patch. The changes look good to me except for these points in the test. 1. Here I think we don't need to catch the Exception and fail the test explicitly; instead we can drop the try/catch and let the test fail with the original exception. {code:xml} } catch (Exception e) { Assert.fail("Exception not expected-->" + stateChecked); } {code} 2. Can we remove this condition here and test all the states without the if check? {code:xml} +if (rmAppAttemptState.equals(RMAppAttemptState.FINAL_SAVING)) { {code} 3. I think there is some unnecessary code {+allocateApplicationAttempt();} and duplicate checking; you can remove these. > IllegalArgumentException in RMAppAttemptImpl#createApplicationAttemptReport > --- > > Key: YARN-4587 > URL: https://issues.apache.org/jira/browse/YARN-4587 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4587.patch > > > {noformat} > it status: -102 > 2016-01-13 13:35:42,281 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: > appattempt_1452672118921_0002_04 State change from RUNNING to FINAL_SAVING > 2016-01-13 13:35:42,286 ERROR org.apache.hadoop.yarn.server.webapp.AppBlock: > Failed to read the attempts of the application application_1452672118921_0002. 
> java.lang.IllegalArgumentException: No enum constant > org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.FINAL_SAVING > at java.lang.Enum.valueOf(Enum.java:238) > at > org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.valueOf(YarnApplicationAttemptState.java:27) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.createApplicationAttemptReport(RMAppAttemptImpl.java:2073) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationAttempts(ClientRMService.java:436) > at > org.apache.hadoop.yarn.server.webapp.AppBlock$2.run(AppBlock.java:230) > at > org.apache.hadoop.yarn.server.webapp.AppBlock$2.run(AppBlock.java:227) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1705) > at > org.apache.hadoop.yarn.server.webapp.AppBlock.render(AppBlock.java:226) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMAppBlock.render(RMAppBlock.java:65) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > at > org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) > at > org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845) > at > org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71) > at > org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82) > at > org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.app(RmController.java:54) > at sun.reflect.GeneratedMethodAccessor89.invoke(Unknown Source) > {noformat} > At 
{{RMAppAttemptImpl#createApplicationAttemptReport}} > {noformat} >attemptReport = ApplicationAttemptReport.newInstance(this > .getAppAttemptId(), this.getHost(), this.getRpcPort(), this > .getTrackingUrl(), this.getOriginalTrackingUrl(), > this.getDiagnostics(), > YarnApplicationAttemptState.valueOf(this.getState().toString()), > amId, this.startTime, this.finishTime); > {noformat} > {{YarnApplicationAttemptState}} mismatch with {{RMAppAttemptState}} for > FINAL_SAVING -- This message was sent by Atlassian JIRA (v6.3.4#6332)
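The review points above (no try/catch around the call, no `if` filter on the state) can be sketched as follows. This is an illustration under assumptions, not the test from the patch: the two enums are hypothetical stand-ins for the Hadoop ones, and `createReportState` is a made-up helper that resolves the transient `FINAL_SAVING` state to the state that preceded it. The check runs over every state and lets any exception propagate with its own stack trace.

```java
public class AllStatesReportCheck {
    // Hypothetical, trimmed-down stand-ins for the Hadoop enums.
    public enum InternalState { NEW, RUNNING, FINAL_SAVING, FINISHED }
    public enum PublicState { NEW, RUNNING, FINISHED }

    // Illustrative report creation (an assumption, not the real method):
    // the transient FINAL_SAVING state is reported as the previous state.
    public static PublicState createReportState(InternalState current, InternalState previous) {
        InternalState effective = current == InternalState.FINAL_SAVING ? previous : current;
        return PublicState.valueOf(effective.name());
    }

    // Test-style check: iterates over all states with no if-filter and no
    // try/catch. An IllegalArgumentException from createReportState would
    // propagate to the caller and fail the test with the original trace.
    public static int checkAllStates() {
        int checked = 0;
        for (InternalState state : InternalState.values()) {
            PublicState reported = createReportState(state, InternalState.RUNNING);
            if (reported == null) {
                throw new AssertionError("No public state reported for " + state);
            }
            checked++;
        }
        return checked;
    }

    public static void main(String[] args) {
        System.out.println("States checked: " + checkAllStates());
    }
}
```

In a JUnit test the same loop would sit directly in the test method, declared with `throws Exception`, so an unexpected exception is reported by the framework without an explicit `Assert.fail`.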
[jira] [Assigned] (YARN-4480) Clean up some inappropriate imports
[ https://issues.apache.org/jira/browse/YARN-4480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K reassigned YARN-4480: --- Assignee: Kai Zheng > Clean up some inappropriate imports > --- > > Key: YARN-4480 > URL: https://issues.apache.org/jira/browse/YARN-4480 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Kai Zheng >Assignee: Kai Zheng > Fix For: 2.8.0 > > Attachments: YARN-4480-v1.patch, YARN-4480-v2.patch > > > It was noticed there are some unnecessary dependency into Directory classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3964) Support NodeLabelsProvider at Resource Manager side
[ https://issues.apache.org/jira/browse/YARN-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-3964: Hadoop Flags: Reviewed +1, committing it shortly. > Support NodeLabelsProvider at Resource Manager side > --- > > Key: YARN-3964 > URL: https://issues.apache.org/jira/browse/YARN-3964 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Dian Fu >Assignee: Dian Fu > Attachments: YARN-3964 design doc.pdf, YARN-3964.002.patch, > YARN-3964.003.patch, YARN-3964.004.patch, YARN-3964.005.patch, > YARN-3964.006.patch, YARN-3964.007.patch, YARN-3964.007.patch, > YARN-3964.008.patch, YARN-3964.009.patch, YARN-3964.010.patch, > YARN-3964.011.patch, YARN-3964.012.patch, YARN-3964.013.patch, > YARN-3964.014.patch, YARN-3964.015.patch, YARN-3964.016.patch, > YARN-3964.1.patch > > > Currently, CLI/REST API is provided in Resource Manager to allow users to > specify labels for nodes. For labels which may change over time, users will > have to start a cron job to update the labels. This has the following > limitations: > - The cron job needs to be run in the YARN admin user. > - This makes it a little complicate to maintain as users will have to make > sure this service/daemon is alive. > Adding a Node Labels Provider in Resource Manager will provide user more > flexibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3964) Support NodeLabelsProvider at Resource Manager side
[ https://issues.apache.org/jira/browse/YARN-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949941#comment-14949941 ] Devaraj K commented on YARN-3964: - Thanks [~dian.fu] for the updated patch. Latest patch looks good to me. I will commit it tomorrow if there are no further comments/objections. > Support NodeLabelsProvider at Resource Manager side > --- > > Key: YARN-3964 > URL: https://issues.apache.org/jira/browse/YARN-3964 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Dian Fu >Assignee: Dian Fu > Attachments: YARN-3964 design doc.pdf, YARN-3964.002.patch, > YARN-3964.003.patch, YARN-3964.004.patch, YARN-3964.005.patch, > YARN-3964.006.patch, YARN-3964.007.patch, YARN-3964.007.patch, > YARN-3964.008.patch, YARN-3964.009.patch, YARN-3964.010.patch, > YARN-3964.011.patch, YARN-3964.012.patch, YARN-3964.013.patch, > YARN-3964.014.patch, YARN-3964.015.patch, YARN-3964.016.patch, > YARN-3964.1.patch > > > Currently, CLI/REST API is provided in Resource Manager to allow users to > specify labels for nodes. For labels which may change over time, users will > have to start a cron job to update the labels. This has the following > limitations: > - The cron job needs to be run in the YARN admin user. > - This makes it a little complicate to maintain as users will have to make > sure this service/daemon is alive. > Adding a Node Labels Provider in Resource Manager will provide user more > flexibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3964) Support NodeLabelsProvider at Resource Manager side
[ https://issues.apache.org/jira/browse/YARN-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14948192#comment-14948192 ] Devaraj K commented on YARN-3964: - Thanks [~leftnoteasy] for the review and confirmation, and [~Naganarasimha] and [~sunilg] for the reviews. Thanks [~dian.fu] for the patch. It mostly looks good to me except for these minor comments. 1. Can you update the descriptions for the new configs added in yarn-default.xml? {code:xml} +The class to use as the node labels fetcher by ResourceManager. It should +extend org.apache.hadoop.yarn.server.resourcemanager.nodelabels. +RMNodeLabelsMappingProvider. {code} Can you update the description as below: 'When node labels "yarn.node-labels.configuration-type" is of type "delegated-centralized", Administrators can configure the class for fetching node labels by ResourceManager. Configured class needs to extend org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsMappingProvider.' {code:xml} +The interval to use to update node labels by ResourceManager. {code} Can we word it like 'This interval is used to update the node labels by ResourceManager.'? Can we also describe here that if the value is '-1', no timer task will be created? 2. In TestRMDelegatedNodeLabelsUpdater.java, can we have an assertion in the catch block to check the expected exception message? {code:xml} } catch (Exception e) { // expected } {code} 3. Can you file a JIRA to update the documentation for this? 
> Support NodeLabelsProvider at Resource Manager side > --- > > Key: YARN-3964 > URL: https://issues.apache.org/jira/browse/YARN-3964 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Dian Fu >Assignee: Dian Fu > Attachments: YARN-3964 design doc.pdf, YARN-3964.002.patch, > YARN-3964.003.patch, YARN-3964.004.patch, YARN-3964.005.patch, > YARN-3964.006.patch, YARN-3964.007.patch, YARN-3964.007.patch, > YARN-3964.008.patch, YARN-3964.009.patch, YARN-3964.010.patch, > YARN-3964.011.patch, YARN-3964.012.patch, YARN-3964.013.patch, > YARN-3964.014.patch, YARN-3964.015.patch, YARN-3964.1.patch > > > Currently, CLI/REST API is provided in Resource Manager to allow users to > specify labels for nodes. For labels which may change over time, users will > have to start a cron job to update the labels. This has the following > limitations: > - The cron job needs to be run in the YARN admin user. > - This makes it a little complicate to maintain as users will have to make > sure this service/daemon is alive. > Adding a Node Labels Provider in Resource Manager will provide user more > flexibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
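Review point 2 above asks for an assertion inside the catch block rather than an empty `// expected` body. A minimal sketch of that pattern follows; `validateProvider` and its message text are hypothetical and only stand in for whatever the code under test actually throws.

```java
public class ExpectedExceptionCheck {
    // Hypothetical validation standing in for the code under test; the
    // method name and the message text are assumptions for illustration.
    public static void validateProvider(String className) {
        if (className == null || className.isEmpty()) {
            throw new IllegalArgumentException(
                "Node labels mapping provider class is not configured");
        }
    }

    // Returns true only when the expected exception is thrown AND its
    // message matches, so an unrelated failure cannot pass silently.
    public static boolean rejectsMissingProvider() {
        try {
            validateProvider("");
        } catch (IllegalArgumentException e) {
            if (!e.getMessage().contains("not configured")) {
                throw new AssertionError("Unexpected message: " + e.getMessage(), e);
            }
            return true;
        }
        // No exception was thrown; the caller should treat this as a failure.
        return false;
    }

    public static void main(String[] args) {
        System.out.println("Missing provider rejected: " + rejectsMissingProvider());
    }
}
```

Checking the message narrows the test to the intended failure; an empty catch block would also pass if the exception came from an unrelated setup problem.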
[jira] [Commented] (YARN-3964) Support NodeLabelsProvider at Resource Manager side
[ https://issues.apache.org/jira/browse/YARN-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900290#comment-14900290 ] Devaraj K commented on YARN-3964: - [~leftnoteasy], Sure, Thanks for your interest. > Support NodeLabelsProvider at Resource Manager side > --- > > Key: YARN-3964 > URL: https://issues.apache.org/jira/browse/YARN-3964 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Dian Fu >Assignee: Dian Fu > Attachments: YARN-3964 design doc.pdf, YARN-3964.002.patch, > YARN-3964.003.patch, YARN-3964.004.patch, YARN-3964.005.patch, > YARN-3964.006.patch, YARN-3964.1.patch > > > Currently, CLI/REST API is provided in Resource Manager to allow users to > specify labels for nodes. For labels which may change over time, users will > have to start a cron job to update the labels. This has the following > limitations: > - The cron job needs to be run in the YARN admin user. > - This makes it a little complicate to maintain as users will have to make > sure this service/daemon is alive. > Adding a Node Labels Provider in Resource Manager will provide user more > flexibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3964) Support NodeLabelsProvider at Resource Manager side
[ https://issues.apache.org/jira/browse/YARN-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900227#comment-14900227 ] Devaraj K commented on YARN-3964: - Thanks [~dian.fu] for the patch. The patch has gone stale; can you please update it? Please also take care of the above Jenkins warnings in the updated patch. > Support NodeLabelsProvider at Resource Manager side > --- > > Key: YARN-3964 > URL: https://issues.apache.org/jira/browse/YARN-3964 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Dian Fu >Assignee: Dian Fu > Attachments: YARN-3964 design doc.pdf, YARN-3964.002.patch, > YARN-3964.003.patch, YARN-3964.004.patch, YARN-3964.005.patch, > YARN-3964.1.patch > > > Currently, CLI/REST API is provided in Resource Manager to allow users to > specify labels for nodes. For labels which may change over time, users will > have to start a cron job to update the labels. This has the following > limitations: > - The cron job needs to be run in the YARN admin user. > - This makes it a little complicate to maintain as users will have to make > sure this service/daemon is alive. > Adding a Node Labels Provider in Resource Manager will provide user more > flexibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-842) Resource Manager & Node Manager UI's doesn't work with IE
[ https://issues.apache.org/jira/browse/YARN-842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K resolved YARN-842. Resolution: Not A Problem It is working fine in the latest, closing it now. Please reopen if you still see this issue. Thanks. > Resource Manager & Node Manager UI's doesn't work with IE > - > > Key: YARN-842 > URL: https://issues.apache.org/jira/browse/YARN-842 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, resourcemanager >Affects Versions: 2.0.4-alpha >Reporter: Devaraj K > > {code:xml} > Webpage error details > User Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; > SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media > Center PC 6.0) > Timestamp: Mon, 17 Jun 2013 12:06:03 UTC > Message: 'JSON' is undefined > Line: 41 > Char: 218 > Code: 0 > URI: http://10.18.40.24:8088/cluster/apps > {code} > RM & NM UI's are not working with IE and showing the above error for every > link on the UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3953) Nodemanager is shutting down while executing application
[ https://issues.apache.org/jira/browse/YARN-3953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636482#comment-14636482 ] Devaraj K commented on YARN-3953: - [~hemenglong] It could be due to a jar mismatch. Have you changed the jars in the installation by any chance? > Nodemanager is shutting down while executing application > > > Key: YARN-3953 > URL: https://issues.apache.org/jira/browse/YARN-3953 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: hemenglong > > Container expired since it was unused > cleanup failed for container container_1437442699625_0472_01_13 : > java.net.ConnectException: Call From hadoop2/192.168.16.2 to hadoop5:59546 > failed on connection exception: java.net.ConnectException: 拒绝连接; For more > details see: http://wiki.apache.org/hadoop/ConnectionRefused > {code:xml} > 2015-07-22 11:02:43,969 ERROR org.apache.hadoop.yarn.event.AsyncDispatcher: > Error in dispatcher thread > java.lang.NoSuchMethodError: > org.apache.hadoop.yarn.util.FSDownload.createStatusCacheLoader(Lorg/apache/hadoop/conf/Configuration;)Lcom/google/common/cache/CacheLoader; > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.handleInitContainerResources(ResourceLocalizationService.java:445) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.handle(ResourceLocalizationService.java:398) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.handle(ResourceLocalizationService.java:135) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
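One way to investigate a suspected jar mismatch behind a `NoSuchMethodError` is to ask the JVM where a class was loaded from and whether it declares the method in question. The sketch below uses only standard reflection; the class name `ClassOriginCheck` and both helpers are illustrative, and it inspects JDK classes so that it runs without Hadoop on the classpath. On a real node one would pass the class named in the error (here, `FSDownload`) instead.

```java
import java.security.CodeSource;

public class ClassOriginCheck {
    // Reports the jar or directory a class was loaded from. Classes from
    // the Java runtime itself have no code source, hence the fallback text.
    public static String originOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        return source == null ? "bootstrap/runtime image" : String.valueOf(source.getLocation());
    }

    // Checks whether the loaded class declares a method with the given
    // name and parameter types, which is what NoSuchMethodError disputes.
    public static boolean hasMethod(Class<?> type, String name, Class<?>... params) {
        try {
            type.getDeclaredMethod(name, params);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("Loaded from: " + originOf(ClassOriginCheck.class));
        System.out.println("String.isEmpty present: " + hasMethod(String.class, "isEmpty"));
    }
}
```

If the reported location points at a jar from a different Hadoop version than the caller's jar, that would be consistent with the mismatch suggested in the comment above.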
[jira] [Updated] (YARN-3953) Nodemanager is shutting down while executing application
[ https://issues.apache.org/jira/browse/YARN-3953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-3953: Fix Version/s: (was: 2.5.0) > Nodemanager is shutting down while executing application > > > Key: YARN-3953 > URL: https://issues.apache.org/jira/browse/YARN-3953 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: hemenglong > > Container expired since it was unused > cleanup failed for container container_1437442699625_0472_01_13 : > java.net.ConnectException: Call From hadoop2/192.168.16.2 to hadoop5:59546 > failed on connection exception: java.net.ConnectException: 拒绝连接; For more > details see: http://wiki.apache.org/hadoop/ConnectionRefused > {code:xml} > 2015-07-22 11:02:43,969 ERROR org.apache.hadoop.yarn.event.AsyncDispatcher: > Error in dispatcher thread > java.lang.NoSuchMethodError: > org.apache.hadoop.yarn.util.FSDownload.createStatusCacheLoader(Lorg/apache/hadoop/conf/Configuration;)Lcom/google/common/cache/CacheLoader; > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.handleInitContainerResources(ResourceLocalizationService.java:445) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.handle(ResourceLocalizationService.java:398) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.handle(ResourceLocalizationService.java:135) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3953) Nodemanager is shutting down while executing application
[ https://issues.apache.org/jira/browse/YARN-3953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-3953: Release Note: (was: 2015-07-22 11:02:43,969 ERROR org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread java.lang.NoSuchMethodError: org.apache.hadoop.yarn.util.FSDownload.createStatusCacheLoader(Lorg/apache/hadoop/conf/Configuration;)Lcom/google/common/cache/CacheLoader; at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.handleInitContainerResources(ResourceLocalizationService.java:445) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.handle(ResourceLocalizationService.java:398) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.handle(ResourceLocalizationService.java:135) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745)) > Nodemanager is shutting down while executing application > > > Key: YARN-3953 > URL: https://issues.apache.org/jira/browse/YARN-3953 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: hemenglong > > Container expired since it was unused > cleanup failed for container container_1437442699625_0472_01_13 : > java.net.ConnectException: Call From hadoop2/192.168.16.2 to hadoop5:59546 > failed on connection exception: java.net.ConnectException: 拒绝连接; For more > details see: http://wiki.apache.org/hadoop/ConnectionRefused -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3953) Nodemanager is shutting down while executing application
[ https://issues.apache.org/jira/browse/YARN-3953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-3953: Description: Container expired since it was unused cleanup failed for container container_1437442699625_0472_01_13 : java.net.ConnectException: Call From hadoop2/192.168.16.2 to hadoop5:59546 failed on connection exception: java.net.ConnectException: 拒绝连接; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused {code:xml} 2015-07-22 11:02:43,969 ERROR org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread java.lang.NoSuchMethodError: org.apache.hadoop.yarn.util.FSDownload.createStatusCacheLoader(Lorg/apache/hadoop/conf/Configuration;)Lcom/google/common/cache/CacheLoader; at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.handleInitContainerResources(ResourceLocalizationService.java:445) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.handle(ResourceLocalizationService.java:398) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.handle(ResourceLocalizationService.java:135) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) {code} was: Container expired since it was unused cleanup failed for container container_1437442699625_0472_01_13 : java.net.ConnectException: Call From hadoop2/192.168.16.2 to hadoop5:59546 failed on connection exception: java.net.ConnectException: 拒绝连接; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused > Nodemanager is shutting down while executing application > > > Key: YARN-3953 > URL: https://issues.apache.org/jira/browse/YARN-3953 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: hemenglong > > 
Container expired since it was unused > cleanup failed for container container_1437442699625_0472_01_13 : > java.net.ConnectException: Call From hadoop2/192.168.16.2 to hadoop5:59546 > failed on connection exception: java.net.ConnectException: 拒绝连接; For more > details see: http://wiki.apache.org/hadoop/ConnectionRefused > {code:xml} > 2015-07-22 11:02:43,969 ERROR org.apache.hadoop.yarn.event.AsyncDispatcher: > Error in dispatcher thread > java.lang.NoSuchMethodError: > org.apache.hadoop.yarn.util.FSDownload.createStatusCacheLoader(Lorg/apache/hadoop/conf/Configuration;)Lcom/google/common/cache/CacheLoader; > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.handleInitContainerResources(ResourceLocalizationService.java:445) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.handle(ResourceLocalizationService.java:398) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.handle(ResourceLocalizationService.java:135) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3212) RMNode State Transition Update with DECOMMISSIONING state
[ https://issues.apache.org/jira/browse/YARN-3212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636421#comment-14636421 ] Devaraj K commented on YARN-3212: - [~djp], can you update the patch for this Jira now that YARN-3445 has been committed, so that we can see this feature working? > RMNode State Transition Update with DECOMMISSIONING state > - > > Key: YARN-3212 > URL: https://issues.apache.org/jira/browse/YARN-3212 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Junping Du >Assignee: Junping Du > Attachments: RMNodeImpl - new.png, YARN-3212-v1.patch, > YARN-3212-v2.patch, YARN-3212-v3.patch, YARN-3212-v4.patch > > > As proposed in YARN-914, a new “DECOMMISSIONING” state will be added, > reachable from the “running” state when triggered by a new event, > “decommissioning”. > This new state can transition to “decommissioned” on > Resource_Update if there are no running apps on this NM, when the NM reconnects after a restart, > or when it receives a DECOMMISSIONED event (after timeout from CLI). > In addition, it can go back to “running” if the user decides to cancel the previous > decommission by calling recommission on the same node. The reaction to other > events is similar to the RUNNING state.
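The transitions described in the YARN-3212 summary above can be sketched as a small transition function. This is only an illustration under stated assumptions: the class name, the enum names and the `runningApps` parameter are hypothetical, and the real RMNodeImpl is built on YARN's own state machine classes rather than a switch statement.

```java
// Hypothetical sketch of the transitions proposed in YARN-3212.
// None of these names come from the patch; they are for illustration only.
class NodeLifecycleSketch {
    enum State { RUNNING, DECOMMISSIONING, DECOMMISSIONED }

    enum Event { DECOMMISSIONING, RESOURCE_UPDATE, RECONNECT, DECOMMISSIONED, RECOMMISSION }

    // Returns the state after handling the event; unknown combinations keep the current state.
    static State next(State current, Event event, int runningApps) {
        switch (current) {
            case RUNNING:
                // The new "decommissioning" event moves a running node to DECOMMISSIONING.
                return event == Event.DECOMMISSIONING ? State.DECOMMISSIONING : current;
            case DECOMMISSIONING:
                switch (event) {
                    case RESOURCE_UPDATE:
                        // Only finish decommissioning once no apps are running on the node.
                        return runningApps == 0 ? State.DECOMMISSIONED : current;
                    case RECONNECT:       // NM reconnects after a restart
                    case DECOMMISSIONED:  // timeout from the CLI
                        return State.DECOMMISSIONED;
                    case RECOMMISSION:    // user cancels the decommission
                        return State.RUNNING;
                    default:
                        return current;
                }
            default:
                // DECOMMISSIONED is treated as final in this sketch.
                return current;
        }
    }
}
```

For example, a RESOURCE_UPDATE with two running apps leaves the node in DECOMMISSIONING, while the same event with zero running apps moves it to DECOMMISSIONED.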
[jira] [Commented] (YARN-3896) RMNode transitioned from RUNNING to REBOOTED because its response id had not been reset
[ https://issues.apache.org/jira/browse/YARN-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633469#comment-14633469 ] Devaraj K commented on YARN-3896: - Thanks [~hex108] for the updated patch. I have some comments about the test. # Can we have a separate new test for this case instead of adding it to an existing test? # Can you avoid mentioning the JIRA ID in the comment? {code:xml}+// Simulate scenario from YARN-3896:{code} # There are multiple sleep statements with hard-coded values in the newly added test code. Can you avoid these sleeps with hard-coded timeouts? # Also, if I run the test without the source changes, it fails with the message "node shouldn't be null". Can we check for the REBOOTED state here? > RMNode transitioned from RUNNING to REBOOTED because its response id had not > been reset > --- > > Key: YARN-3896 > URL: https://issues.apache.org/jira/browse/YARN-3896 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3896.01.patch, YARN-3896.02.patch, > YARN-3896.03.patch, YARN-3896.04.patch > > > {noformat} > 2015-07-03 16:49:39,075 INFO org.apache.hadoop.yarn.util.RackResolver: > Resolved 10.208.132.153 to /default-rack > 2015-07-03 16:49:39,075 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: > Reconnect from the node at: 10.208.132.153 > 2015-07-03 16:49:39,075 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: > NodeManager from node 10.208.132.153(cmPort: 8041 httpPort: 8080) registered > with capability: , assigned nodeId > 10.208.132.153:8041 > 2015-07-03 16:49:39,104 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Too far > behind rm response id:2506413 nm response id:0 > 2015-07-03 16:49:39,137 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating > Node 10.208.132.153:8041 as it is now 
REBOOTED > 2015-07-03 16:49:39,137 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: > 10.208.132.153:8041 Node Transitioned from RUNNING to REBOOTED > {noformat} > The node (10.208.132.153) reconnected with the RM. When it registered with the RM, the RM > set its lastNodeHeartbeatResponse's id to 0 asynchronously, but the node's > heartbeat arrived before the RM had finished setting the id to 0.
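One common way to address the review comment about hard-coded sleeps is to poll for the expected condition with an upper bound on the wait. The helper below is a minimal sketch of that idea; `WaitSketch` and its parameter names are hypothetical and are not taken from the YARN-3896 patch. Hadoop's test utilities offer a comparable helper (GenericTestUtils.waitFor), which a real test would more likely use.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

// Hypothetical polling helper: returns as soon as the condition holds,
// instead of always sleeping for a fixed, hard-coded duration.
class WaitSketch {
    static void waitFor(BooleanSupplier condition, long pollMillis, long timeoutMillis)
            throws InterruptedException, TimeoutException {
        // nanoTime is monotonic, so the deadline is not affected by wall-clock changes.
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException(
                    "condition not met within " + timeoutMillis + " ms");
            }
            Thread.sleep(pollMillis);
        }
    }
}
```

A test could then wait for a node to reach the REBOOTED state with a call such as `WaitSketch.waitFor(() -> nodeIsRebooted(), 100, 10000)`, where `nodeIsRebooted()` stands for whatever check the test already has.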
[jira] [Commented] (YARN-3857) Memory leak in ResourceManager with SIMPLE mode
[ https://issues.apache.org/jira/browse/YARN-3857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621947#comment-14621947 ] Devaraj K commented on YARN-3857: - Thanks [~mujunchao] for the updated patch with the test. Please address these comments along with the fixes for [~zxu]'s comments. 1. I don't think adding this new method is required. Can we just use ClientToAMTokenSecretManagerInRM#getMasterKey() to check whether the master key is present? {code:xml} + + @VisibleForTesting + public synchronized boolean hasMasterKey( + ApplicationAttemptId applicationAttemptID) { + return this.masterKeys.containsKey(applicationAttemptID); + } {code} 2. I see some formatting issues in the patch w.r.t. braces and indentation with spaces. Please go through the 'Making Changes' section in https://wiki.apache.org/hadoop/HowToContribute and configure your IDE accordingly. It is a one-time job, and you won't have to worry about it the next time you create patches. {code:xml} +if(isSecurityEnabled) +{ {code} {code:xml} +} +else +{ {code} 3. Remove unused imports in RMAppAttemptImpl.java. > Memory leak in ResourceManager with SIMPLE mode > --- > > Key: YARN-3857 > URL: https://issues.apache.org/jira/browse/YARN-3857 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 >Reporter: mujunchao >Assignee: mujunchao >Priority: Critical > Attachments: YARN-3857-1.patch, YARN-3857-2.patch, > hadoop-yarn-server-resourcemanager.patch > > > We register the ClientTokenMasterKey so that a client does not hold an invalid > ClientToken after the RM restarts. In SIMPLE mode, we register > Pair , but we never remove it from the HashMap, because > unregister only runs in Security mode, which causes a memory leak.
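The leak described in YARN-3857 and the first review comment can be illustrated with a simplified stand-in for the key registry. This is a sketch under assumptions, not the actual patch: `MasterKeyRegistrySketch` and `onAttemptFinished` are hypothetical, and attempt IDs and keys are plain `String` and `byte[]` values rather than the YARN types. It shows cleanup running regardless of the security mode, and a test checking for the key through `getMasterKey` instead of a dedicated `hasMasterKey` method.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for the RM-side master key registry.
class MasterKeyRegistrySketch {
    private final Map<String, byte[]> masterKeys = new HashMap<>();

    synchronized void registerApplication(String attemptId, byte[] key) {
        masterKeys.put(attemptId, key);
    }

    synchronized void unRegisterApplication(String attemptId) {
        masterKeys.remove(attemptId);
    }

    // Returns null when no key is registered, so callers can test for presence.
    synchronized byte[] getMasterKey(String attemptId) {
        return masterKeys.get(attemptId);
    }

    // Assumed cleanup hook for a finished attempt. The key is removed in both
    // SIMPLE and secure mode, so entries cannot accumulate in the map.
    static void onAttemptFinished(MasterKeyRegistrySketch registry,
                                  String attemptId,
                                  boolean securityEnabled) {
        // securityEnabled is deliberately not consulted before unregistering.
        registry.unRegisterApplication(attemptId);
    }
}
```

With this shape, a test can assert that `getMasterKey(attemptId)` returns null after the attempt finishes, which covers the reviewer's point that an extra `hasMasterKey` method may not be needed.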
[jira] [Commented] (YARN-3409) Add constraint node labels
[ https://issues.apache.org/jira/browse/YARN-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621811#comment-14621811 ] Devaraj K commented on YARN-3409: - [~leftnoteasy], thanks for the details. Are you going to include the scenario of having a service API to retrieve the labels in the Resource Manager, as discussed in YARN-3557, as part of this jira? Can we have a separate jira to discuss/handle the centralized configuration using a service API to retrieve the labels for nodes in the Resource Manager? > Add constraint node labels > -- > > Key: YARN-3409 > URL: https://issues.apache.org/jira/browse/YARN-3409 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, capacityscheduler, client >Reporter: Wangda Tan >Assignee: Wangda Tan > > Specifying only one label for each node (in other words, partitioning a cluster) is a way to > determine how the resources of a particular set of nodes can be shared by a > group of entities (like teams, departments, etc.). Partitions of a cluster > have the following characteristics: > - The cluster is divided into several disjoint sub-clusters. > - ACL/priority can apply to a partition (Only the market team / the market team has > priority to use the partition). > - Percentages of capacity can apply to a partition (the Market team has 40% > minimum capacity and the Dev team has 60% minimum capacity of the partition). > Constraints are orthogonal to partitions; they describe attributes of a > node's hardware/software just for affinity. Some examples of constraints: > - glibc version > - JDK version > - Type of CPU (x86_64/i686) > - Type of OS (windows, linux, etc.) > With this, an application can ask for a resource that has (glibc.version >= > 2.20 && JDK.version >= 8u20 && x86_64).
[jira] [Commented] (YARN-3813) Support Application timeout feature in YARN.
[ https://issues.apache.org/jira/browse/YARN-3813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619755#comment-14619755 ] Devaraj K commented on YARN-3813: - Thanks [~nijel] and [~rohithsharma] for the design proposal. {quote} New auxiliary service : RMAppTimeOutService Responsibility is to track the running application. Simple logic //if job is running and the time elapsed kill if ((RMAppState == SUBMITTED/ACCEPTED/RUNNING) && (currentTime - app.getSubmitTime()) >= timeout) {quote} How frequently are you going to check this condition for each application? Can we have a monitor, something like RMAppTimeOutMonitor, which extends AbstractLivelinessMonitor? When the application gets submitted to the RM, we can register it with RMAppTimeOutMonitor using the user-specified timeout. When the timeout is reached, RMAppTimeOutMonitor can trigger an event to take further action. bq. Yes, having a separate TIMEOUT event and TIMEOUT state is a good approach and another option. Initially we considered having a new TIMEOUT state, which requires very large changes across all the modules. I feel having a TIMEOUT state for RMAppImpl would be proper here. When RMAppTimeOutMonitor triggers an event on timeout for an application, RMAppImpl can move to the TIMEOUT state from any of the non-final states, and during the transition it can handle stopping the running attempt and the containers. I don't see that this would require so many changes. > Support Application timeout feature in YARN. > - > > Key: YARN-3813 > URL: https://issues.apache.org/jira/browse/YARN-3813 > Project: Hadoop YARN > Issue Type: New Feature > Components: scheduler >Reporter: nijel > Attachments: YARN Application Timeout .pdf > > > It will be useful to support Application Timeout in YARN. Some use cases are > not worried about the output of the applications if the application is not > completed in a specific time. 
> *Background:* > The requirement is to show the CDR statistics of the last few minutes, say > every 5 minutes. The same job will run continuously with a different dataset, > so one job will be started every 5 minutes. The estimated time for this > task is 2 minutes or less. > If the application does not complete in the given time, the output is not > useful. > *Proposal* > The idea is to support an application timeout, where a timeout parameter is > given while submitting the job. > Here, the user expects the application to finish (complete or be killed) within the > given time. > One option is to move this logic to the application client (which submits the > job). > But it would be nice if this could be generic logic, which would make it more robust. > Kindly provide your suggestions/opinions on this feature. If it sounds good, I > will update the design doc and prototype patch.
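The monitor suggested in the comment above (register an application with its timeout at submission, then raise an event once the timeout is reached) can be sketched without any YARN dependencies. This is a minimal illustration under assumptions: `AppTimeoutMonitorSketch` and its methods are hypothetical, it does not extend AbstractLivelinessMonitor, and the caller is expected to invoke `expire` periodically and turn each returned ID into a TIMEOUT event.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the proposed RMAppTimeOutMonitor.
class AppTimeoutMonitorSketch {
    // Application ID -> absolute deadline in milliseconds.
    private final Map<String, Long> deadlines = new ConcurrentHashMap<>();

    // Called at submission time with the user-specified timeout.
    void register(String appId, long submitTimeMillis, long timeoutMillis) {
        deadlines.put(appId, submitTimeMillis + timeoutMillis);
    }

    // Called when an application reaches a final state before its deadline.
    void unregister(String appId) {
        deadlines.remove(appId);
    }

    // Returns and forgets every application for which
    // (now - submitTime) >= timeout, matching the condition quoted above.
    List<String> expire(long nowMillis) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = deadlines.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis >= entry.getValue()) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

Passing the current time in as a parameter keeps the sketch deterministic in tests; how often `expire` runs answers the "how frequently" question and would be a configurable monitoring interval in a real implementation.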