[jira] [Created] (YARN-9154) Fix itemization in YARN service quickstart document
Akira Ajisaka created YARN-9154:
-----------------------------------

Summary: Fix itemization in YARN service quickstart document
Key: YARN-9154
URL: https://issues.apache.org/jira/browse/YARN-9154
Project: Hadoop YARN
Issue Type: Bug
Components: documentation
Reporter: Akira Ajisaka

{noformat:title=QuickStart.md}
Params:
- SERVICE_NAME: The name of the service. Note that this needs to be unique across running services for the current user.
- PATH_TO_SERVICE_DEF: The path to the service definition file in JSON format.
{noformat}
should be
{noformat}
Params:

- SERVICE_NAME: The name of the service. Note that this needs to be unique across running services for the current user.
- PATH_TO_SERVICE_DEF: The path to the service definition file in JSON format.
{noformat}
(i.e. with a blank line after "Params:") to render the itemization correctly.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-9154) Fix itemization in YARN service quickstart document
[ https://issues.apache.org/jira/browse/YARN-9154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725624#comment-16725624 ]

Akira Ajisaka edited comment on YARN-9154 at 12/20/18 6:55 AM:
---------------------------------------------------------------

Attached a screenshot: https://hadoop.apache.org/docs/r3.1.1/hadoop-yarn/hadoop-yarn-site/yarn-service/QuickStart.html#Deploy_a_service

!Screen Shot 2018-12-20 at 15.54.16.png!

was (Author: ajisakaa):
Attached a screenshot: https://hadoop.apache.org/docs/r3.1.1/hadoop-yarn/hadoop-yarn-site/yarn-service/QuickStart.html#Deploy_a_service

> Fix itemization in YARN service quickstart document
> ---------------------------------------------------
>
> Key: YARN-9154
> URL: https://issues.apache.org/jira/browse/YARN-9154
> Project: Hadoop YARN
> Issue Type: Bug
> Components: documentation
> Reporter: Akira Ajisaka
> Priority: Minor
> Labels: newbie
> Attachments: Screen Shot 2018-12-20 at 15.54.16.png
>
> {noformat:title=QuickStart.md}
> Params:
> - SERVICE_NAME: The name of the service. Note that this needs to be unique across running services for the current user.
> - PATH_TO_SERVICE_DEF: The path to the service definition file in JSON format.
> {noformat}
> should be
> {noformat}
> Params:
>
> - SERVICE_NAME: The name of the service. Note that this needs to be unique across running services for the current user.
> - PATH_TO_SERVICE_DEF: The path to the service definition file in JSON format.
> {noformat}
> to render correctly.
[jira] [Updated] (YARN-9154) Fix itemization in YARN service quickstart document
[ https://issues.apache.org/jira/browse/YARN-9154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Akira Ajisaka updated YARN-9154:
--------------------------------
Attachment: Screen Shot 2018-12-20 at 15.54.16.png
[jira] [Commented] (YARN-9154) Fix itemization in YARN service quickstart document
[ https://issues.apache.org/jira/browse/YARN-9154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725624#comment-16725624 ]

Akira Ajisaka commented on YARN-9154:
-------------------------------------

Attached a screenshot: https://hadoop.apache.org/docs/r3.1.1/hadoop-yarn/hadoop-yarn-site/yarn-service/QuickStart.html#Deploy_a_service
[jira] [Commented] (YARN-9038) [CSI] Add ability to publish/unpublish volumes on node managers
[ https://issues.apache.org/jira/browse/YARN-9038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725606#comment-16725606 ]

Hadoop QA commented on YARN-9038:
---------------------------------

(x) *-1 overall*

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 20s | Docker mode activated. |
|| Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 6 new or modified test files. |
|| trunk Compile Tests ||
| 0 | mvndep | 0m 36s | Maven dependency ordering for branch |
| +1 | mvninstall | 20m 15s | trunk passed |
| +1 | compile | 8m 17s | trunk passed |
| +1 | checkstyle | 1m 36s | trunk passed |
| +1 | mvnsite | 4m 25s | trunk passed |
| +1 | shadedclient | 18m 22s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 7m 5s | trunk passed |
| +1 | javadoc | 3m 18s | trunk passed |
|| Patch Compile Tests ||
| 0 | mvndep | 0m 12s | Maven dependency ordering for patch |
| -1 | mvninstall | 0m 28s | hadoop-yarn-common in the patch failed. |
| -1 | mvninstall | 0m 24s | hadoop-yarn-csi in the patch failed. |
| -1 | compile | 0m 45s | hadoop-yarn in the patch failed. |
| -1 | cc | 0m 45s | hadoop-yarn in the patch failed. |
| -1 | javac | 0m 45s | hadoop-yarn in the patch failed. |
| -0 | checkstyle | 1m 25s | hadoop-yarn-project/hadoop-yarn: The patch generated 5 new + 494 unchanged - 0 fixed = 499 total (was 494) |
| -1 | mvnsite | 0m 30s | hadoop-yarn-common in the patch failed. |
| -1 | mvnsite | 0m 24s | hadoop-yarn-csi in the patch failed. |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | xml | 0m 2s | The patch has no ill-formed XML file. |
| -1 | shadedclient | 3m 28s | patch has errors when building and testing our client artifacts. |
| -1 | findbugs | 0m 22s | hadoop-yarn-common in the patch failed. |
| -1 | findbugs | 0m 24s | hadoop-yarn-csi in the patch failed. |
| +1 | javadoc | 2m 20s | the patch passed |
|| Other Tests ||
| +1 | unit | 0m 38s | hadoop-yarn-api in the patch passed. |
| -1 | unit | 0m 30s | hadoop-yarn-common in the patch failed. |
| +1 | unit | 19m 17s | hadoop-yarn-server-nodemanager in the patch passed. |
| -1 | unit | 92m 22s | hadoop-yarn-server-resourcemanager in the patch failed.
[jira] [Updated] (YARN-5168) Add port mapping handling when docker container use bridge network
[ https://issues.apache.org/jira/browse/YARN-5168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xun Liu updated YARN-5168:
--------------------------
Attachment: YARN-5168.019.patch

> Add port mapping handling when docker container use bridge network
> ------------------------------------------------------------------
>
> Key: YARN-5168
> URL: https://issues.apache.org/jira/browse/YARN-5168
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Jun Gong
> Assignee: Xun Liu
> Priority: Major
> Labels: Docker
> Attachments: YARN-5168.001.patch, YARN-5168.002.patch, YARN-5168.003.patch, YARN-5168.004.patch, YARN-5168.005.patch, YARN-5168.006.patch, YARN-5168.007.patch, YARN-5168.008.patch, YARN-5168.009.patch, YARN-5168.010.patch, YARN-5168.011.patch, YARN-5168.012.patch, YARN-5168.013.patch, YARN-5168.014.patch, YARN-5168.015.patch, YARN-5168.016.patch, YARN-5168.017.patch, YARN-5168.018.patch, YARN-5168.019.patch, exposedPorts1.png, exposedPorts2.png
>
> YARN-4007 addresses different network setups when launching the docker container. We need to support port mapping when the docker container uses a bridge network.
> The problems we faced are the following:
> 1. Add "-P" to map the docker container's exposed ports automatically.
> 2. Add "-p" to let the user specify specific ports to map.
> 3. Add service registry support for the bridge network case, so that apps can find each other. It could be done outside of YARN; however, it might be more convenient to support it natively in YARN.
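The "-P" / "-p" handling the issue describes translates into docker CLI arguments. A minimal sketch of the two modes, using a hypothetical helper class (this is not YARN's actual DockerRunCommand API; names here are illustrative only):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the two port-mapping modes from the issue:
// "-P" publishes all exposed container ports to random host ports,
// "-p hostPort:containerPort" publishes user-specified mappings.
public class BridgePortArgs {

    // Mode 1: publish all exposed ports automatically.
    public static List<String> publishAll() {
        List<String> args = new ArrayList<>();
        args.add("-P");
        return args;
    }

    // Mode 2: publish specific host->container port mappings.
    public static List<String> publishSpecific(Map<Integer, Integer> hostToContainer) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : hostToContainer.entrySet()) {
            args.add("-p");
            args.add(e.getKey() + ":" + e.getValue());
        }
        return args;
    }

    public static void main(String[] argv) {
        Map<Integer, Integer> ports = new LinkedHashMap<>();
        ports.put(8080, 80);
        // prints: -p 8080:80
        System.out.println(String.join(" ", publishSpecific(ports)));
    }
}
```

The third point in the description (service registry) is orthogonal to these CLI flags: once ports are mapped, the registry would record the host port so other apps can discover it.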
[jira] [Commented] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725584#comment-16725584 ]

Yuqi Wang commented on YARN-9151:
---------------------------------

Thanks [~elgoiri], let me try to fix the test and style issues and add a UT for UnknownHostException.

> Standby RM hangs (not retry or crash) forever due to forever lost from leader election
> --------------------------------------------------------------------------------------
>
> Key: YARN-9151
> URL: https://issues.apache.org/jira/browse/YARN-9151
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.9.2
> Reporter: Yuqi Wang
> Assignee: Yuqi Wang
> Priority: Major
> Labels: patch
> Fix For: 3.1.1
>
> Attachments: YARN-9151.001.patch
>
> {color:#205081}*Issue Summary:*{color}
> The standby RM hangs forever (neither retrying nor crashing) because it is permanently dropped from the leader election.
>
> {color:#205081}*Issue Repro Steps:*{color}
> # Start multiple RMs in HA mode
> # Modify all hostnames in the zk connect string to different values in DNS. (In reality, we needed to replace old/bad zk machines with new/good zk machines, so their DNS hostnames changed.)
>
> {color:#205081}*Issue Logs:*{color}
> See the full RM log in the attachment, yarn_rm.zip (the RM is BN4SCH101222318).
> To make it clear, the whole story is:
> {noformat}
> Join Election
> Win the leader (ZK Node Creation Callback)
> Start to becomeActive
> Start RMActiveServices
> Start CommonNodeLabelsManager failed due to zk connect UnknownHostException
> Stop CommonNodeLabelsManager
> Stop RMActiveServices
> Create and Init RMActiveServices
> Fail to becomeActive
> ReJoin Election
> Failed to Join Election due to zk connect UnknownHostException (here the exception is swallowed and only an event is sent)
> Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
> Start StandByTransitionThread
> Already in standby state
> ReJoin Election
> Failed to Join Election due to zk connect UnknownHostException (here the exception is swallowed and only an event is sent)
> Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
> Start StandByTransitionThread
> Found RMActiveServices's StandByTransitionRunnable object has already run previously, so immediately return
> {noformat}
> The standby RM failed to rejoin the election, and it never retries or crashes afterwards, *so there are no further zk-related logs and the standby RM hangs forever, even if the zk connect string hostnames are changed back to the original ones in DNS.*
> So, this should be a bug in RM, because *RM should always try to join the election* (giving up on joining the election should only happen when RM decides to crash); otherwise, an RM that is not in the election can never become active again and do real work.
>
> {color:#205081}*Caused By:*{color}
> It was introduced by YARN-3742.
> What that JIRA wanted to improve is that when a STATE_STORE_OP_FAILED RMFatalEvent happens, RM should transition to standby instead of crashing.
> *However, in fact, the change makes ALL kinds of RMFatalEvent ONLY transition to standby instead of crashing.* (In contrast, before this change, RM crashed on all of them instead of going to standby.)
> So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, the standby RM is left in a non-working state, such as staying in standby forever.
> And as the author said:
> {quote}I think a good approach here would be to change the RMFatalEvent handler to transition to standby as the default reaction, *with shutdown as a special case for certain types of failures.*
> {quote}
> But the author was *too optimistic when implementing the patch.*
>
> {color:#205081}*What the Patch's solution is:*{color}
> So, to be *conservative*, we had better *only transition to standby for the failures in the {color:#14892c}whitelist{color}* (whitelisted entries in green):
> public enum RMFatalEventType {
>   {color:#14892c}// Source <- Store{color}
>   {color:#14892c}STATE_STORE_FENCED,{color}
>   {color:#14892c}STATE_STORE_OP_FAILED,{color}
>   // Source <- Embedded Elector
>   EMBEDDED_ELECTOR_FAILED,
>   {color:#14892c}// Source <- Admin Service{color}
>   {color:#14892c}TRANSITION_TO_ACTIVE_FAILED,{color}
>   // Source <- Critical Thread Crash
>   CRITICAL_THREAD_CRASH
> }
> All others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH, as well as failure types added in the future (until we triage them into the whitelist), should crash RM, because we *cannot ensure* that they will *never* leave RM unable to work in the standby state, and the *conservative* way is to crash RM.
> Besides, after the crash, the RM's external watchdog service can notice this and try to repair the RM machine, send alerts, etc.
> And the RM can reload the latest
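The whitelist proposal quoted above can be sketched as a small policy check. The enum constants mirror the RMFatalEventType listing from the description; the shouldCrash helper is a hypothetical illustration of the proposed behavior, not the committed patch:

```java
import java.util.EnumSet;

// Sketch of the proposed whitelist policy from the issue description:
// only a triaged set of fatal events transitions the RM to standby;
// everything else (including any future event type) crashes the RM so
// an external watchdog can repair the machine or send alerts.
public class RMFatalEventPolicy {

    enum RMFatalEventType {
        STATE_STORE_FENCED,          // Source <- Store
        STATE_STORE_OP_FAILED,       // Source <- Store
        EMBEDDED_ELECTOR_FAILED,     // Source <- Embedded Elector
        TRANSITION_TO_ACTIVE_FAILED, // Source <- Admin Service
        CRITICAL_THREAD_CRASH        // Source <- Critical Thread Crash
    }

    // Whitelist: events believed safe to handle by transitioning to standby.
    private static final EnumSet<RMFatalEventType> STANDBY_WHITELIST =
        EnumSet.of(RMFatalEventType.STATE_STORE_FENCED,
                   RMFatalEventType.STATE_STORE_OP_FAILED,
                   RMFatalEventType.TRANSITION_TO_ACTIVE_FAILED);

    // Conservative default: crash on anything not explicitly whitelisted.
    public static boolean shouldCrash(RMFatalEventType type) {
        return !STANDBY_WHITELIST.contains(type);
    }
}
```

Under this sketch, EMBEDDED_ELECTOR_FAILED and CRITICAL_THREAD_CRASH crash the RM, which is exactly the case the reporter argues was mishandled by YARN-3742's blanket transition-to-standby.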
[jira] [Commented] (YARN-9129) Ensure flush after printing to log plus additional cleanup
[ https://issues.apache.org/jira/browse/YARN-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725510#comment-16725510 ]

Hadoop QA commented on YARN-9129:
---------------------------------

(/) *+1 overall*

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 22s | Docker mode activated. |
|| Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| trunk Compile Tests ||
| 0 | mvndep | 0m 16s | Maven dependency ordering for branch |
| +1 | mvninstall | 20m 52s | trunk passed |
| +1 | compile | 8m 41s | trunk passed |
| +1 | checkstyle | 1m 40s | trunk passed |
| +1 | mvnsite | 1m 27s | trunk passed |
| +1 | shadedclient | 16m 4s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 2m 5s | trunk passed |
| +1 | javadoc | 1m 4s | trunk passed |
|| Patch Compile Tests ||
| 0 | mvndep | 0m 15s | Maven dependency ordering for patch |
| +1 | mvninstall | 1m 11s | the patch passed |
| +1 | compile | 7m 48s | the patch passed |
| +1 | cc | 7m 48s | the patch passed |
| +1 | javac | 7m 48s | the patch passed |
| +1 | checkstyle | 1m 29s | the patch passed |
| +1 | mvnsite | 1m 19s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 13m 29s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 2m 7s | the patch passed |
| +1 | javadoc | 0m 55s | the patch passed |
|| Other Tests ||
| +1 | unit | 19m 21s | hadoop-yarn-server-nodemanager in the patch passed. |
| +1 | unit | 25m 38s | hadoop-yarn-client in the patch passed. |
| +1 | asflicense | 0m 40s | The patch does not generate ASF License warnings. |
| | | 125m 43s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-9129 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12952417/YARN-9129.003.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle cc |
| uname | Linux 2b7ce7082449 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / e815fd9 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC1 |
| Test
[jira] [Commented] (YARN-5168) Add port mapping handling when docker container use bridge network
[ https://issues.apache.org/jira/browse/YARN-5168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725557#comment-16725557 ]

Eric Yang commented on YARN-5168:
---------------------------------

[~liuxun323] Thank you for the patch. I think the ContainerReport newInstance method does not need to have exposedPorts as a parameter. This will minimize the changes.
[jira] [Commented] (YARN-9116) Capacity Scheduler: add the default maximum-allocation-mb and maximum-allocation-vcores for the queues
[ https://issues.apache.org/jira/browse/YARN-9116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725560#comment-16725560 ]

Hadoop QA commented on YARN-9116:
---------------------------------

(x) *-1 overall*

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 13s | Docker mode activated. |
|| Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 2 new or modified test files. |
|| trunk Compile Tests ||
| 0 | mvndep | 0m 35s | Maven dependency ordering for branch |
| +1 | mvninstall | 18m 32s | trunk passed |
| +1 | compile | 7m 45s | trunk passed |
| +1 | checkstyle | 1m 13s | trunk passed |
| +1 | mvnsite | 2m 1s | trunk passed |
| +1 | shadedclient | 13m 48s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 3m 12s | trunk passed |
| +1 | javadoc | 1m 33s | trunk passed |
|| Patch Compile Tests ||
| 0 | mvndep | 0m 12s | Maven dependency ordering for patch |
| +1 | mvninstall | 1m 44s | the patch passed |
| +1 | compile | 7m 8s | the patch passed |
| +1 | javac | 7m 8s | the patch passed |
| -0 | checkstyle | 1m 12s | hadoop-yarn-project/hadoop-yarn: The patch generated 10 new + 327 unchanged - 1 fixed = 337 total (was 328) |
| +1 | mvnsite | 1m 57s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 11m 24s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 3m 57s | the patch passed |
| +1 | javadoc | 1m 56s | the patch passed |
|| Other Tests ||
| +1 | unit | 3m 42s | hadoop-yarn-common in the patch passed. |
| -1 | unit | 90m 1s | hadoop-yarn-server-resourcemanager in the patch failed. |
| +1 | unit | 0m 49s | hadoop-yarn-submarine in the patch passed. |
| +1 | asflicense | 0m 40s | The patch does not generate ASF License warnings. |
| | | 172m 34s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-9116 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12952434/YARN-9116.1.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 7d0bf6844083 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / e815fd9 |
[jira] [Commented] (YARN-9130) Add Bind_HOST configuration for Yarn Web Proxy
[ https://issues.apache.org/jira/browse/YARN-9130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725542#comment-16725542 ]

Íñigo Goiri commented on YARN-9130:
-----------------------------------

Thanks [~trjianjianjiao] for the patch and [~surmountian] for the review! Committed to trunk.

> Add Bind_HOST configuration for Yarn Web Proxy
> ----------------------------------------------
>
> Key: YARN-9130
> URL: https://issues.apache.org/jira/browse/YARN-9130
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: yarn
> Affects Versions: 3.1.1
> Reporter: Rong Tang
> Assignee: Rong Tang
> Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9130.002.patch, YARN-9130.003.patch, Yarn-9130.001.patch
>
> Allow a configurable bind-host for the Yarn Web Proxy, so the host name on which the server accepts connections can be overridden.
> It is similar to what has been done for the JournalNode and the RM, e.g. https://issues.apache.org/jira/browse/HDFS-13462
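For illustration, a yarn-site.xml fragment in the spirit of this change might look like the following. The yarn.web-proxy.bind-host property name is an assumption by analogy with the existing yarn.resourcemanager.bind-host; verify the exact name against the committed patch:

```xml
<!-- Sketch only: the bind-host property name below is assumed by analogy
     with yarn.resourcemanager.bind-host, not confirmed from the patch. -->
<property>
  <!-- Advertised address of the web proxy (existing property). -->
  <name>yarn.web-proxy.address</name>
  <value>proxyhost.example.com:9099</value>
</property>
<property>
  <!-- Bind the proxy's server socket to all interfaces while still
       advertising the address above. -->
  <name>yarn.web-proxy.bind-host</name>
  <value>0.0.0.0</value>
</property>
```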
[jira] [Updated] (YARN-8833) Avoid potential integer overflow when computing fair shares
[ https://issues.apache.org/jira/browse/YARN-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyakun updated YARN-8833: -- Description: When use w2rRatio compute fair share, there may be a chance triggering the problem of Int overflow, and entering an infinite loop. Since the compute share thread holds the writeLock, it may blocking scheduling thread. This issue occurs in a production environment. And we have already fixed it. added 2018-10-29: elaborate the problem /** * Compute the resources that would be used given a weight-to-resource ratio * w2rRatio, for use in the computeFairShares algorithm as described in # */ private static int resourceUsedWithWeightToResourceRatio(double w2rRatio, Collection schedulables, String type) \{ int resourcesTaken = 0; for (Schedulable sched : schedulables) { int share = computeShare(sched, w2rRatio, type); resourcesTaken += share; } return resourcesTaken; } The variable resourcesTaken is an integer type. And it also is accumulated value of result of computeShare(Schedulable sched, double w2rRatio,String type) which is a value between the min share and max share of a queue. For example, when there are 3 queues, each has min share = max share = Integer.MAX_VALUE, the resourcesTaken will be out of Integer bound, and it will be a negative number. when resourceUsedWithWeightToResourceRatio(double w2rRatio, Collection schedulables, String type) return a negative number, the loop in computeSharesInternal() may never out which got the scheduler lock. //org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type) < totalResource) { rMax *= 2.0; } This may blocking scheduling thread. was: When use w2rRatio compute fair share, there may be a chance triggering the problem of Int overflow, and entering an infinite loop. Since the compute share thread holds the writeLock, it may blocking scheduling thread. 
This issue occurred in a production environment with 8500 nodes, and we have already fixed it.

added 2018-10-29: elaborating on the problem:

/**
 * Compute the resources that would be used given a weight-to-resource ratio
 * w2rRatio, for use in the computeFairShares algorithm as described in #
 */
private static int resourceUsedWithWeightToResourceRatio(double w2rRatio,
    Collection<? extends Schedulable> schedulables, String type) {
  int resourcesTaken = 0;
  for (Schedulable sched : schedulables) {
    int share = computeShare(sched, w2rRatio, type);
    resourcesTaken += share;
  }
  return resourcesTaken;
}

The variable resourcesTaken is an int. It accumulates the results of computeShare(Schedulable sched, double w2rRatio, String type), each of which is a value between the min share and max share of a queue. For example, when there are 3 queues, each with min share = max share = Integer.MAX_VALUE, resourcesTaken overflows the int range and becomes a negative number. When resourceUsedWithWeightToResourceRatio(double w2rRatio, Collection schedulables, String type) returns a negative number, the loop in computeSharesInternal(), which holds the scheduler lock, may never exit:

// org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares
while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type) < totalResource) {
  rMax *= 2.0;
}

This may block the scheduling thread.

> Avoid potential integer overflow when computing fair shares
> ---
>
> Key: YARN-8833
> URL: https://issues.apache.org/jira/browse/YARN-8833
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler
>Reporter: liyakun
>Assignee: liyakun
>Priority: Major
> Fix For: 3.0.4, 3.1.2, 3.3.0, 3.2.1
>
> Attachments: YARN-8833.1.patch, YARN-8833.2.patch, YARN-8833.3.patch, YARN-8833.patch
>
>
> When using w2rRatio to compute fair shares, there is a chance of triggering an
> int overflow and entering an infinite loop.
> Since the compute share thread holds the writeLock, this may block the
> scheduling thread.
> This issue occurred in a production environment, and we have already fixed it.
>
> added 2018-10-29: elaborating on the problem:
> /**
>  * Compute the resources that would be used given a weight-to-resource ratio
>  * w2rRatio, for use in the computeFairShares algorithm as described in #
>  */
> private static int resourceUsedWithWeightToResourceRatio(double w2rRatio,
>     Collection<? extends Schedulable> schedulables, String type) {
>   int resourcesTaken = 0;
>   for (Schedulable sched : schedulables) {
>     int share = computeShare(sched, w2rRatio, type);
>     resourcesTaken += share;
>   }
>   return resourcesTaken;
> }
> The variable resourcesTaken is an int. It accumulates the results of
> computeShare(Schedulable sched, double w2rRatio, String type)
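The overflow scenario YARN-8833 describes can be sketched independently of Hadoop: accumulate in a long and saturate at Integer.MAX_VALUE so that a caller's `total < totalResource` loop never sees a negative sum. This is only an illustration of the overflow-avoidance idea, under assumed names; it is not the actual committed patch.

```java
public class FairShareOverflowSketch {
    /**
     * Sums per-queue shares without wrapping: a long accumulator holds the
     * true sum, and the result saturates at Integer.MAX_VALUE instead of
     * becoming negative on int overflow.
     */
    static int saturatingTotal(int[] shares) {
        long resourcesTaken = 0L;
        for (int share : shares) {
            resourcesTaken += share; // long accumulation cannot overflow here
        }
        return (int) Math.min(resourcesTaken, Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        // The failure case from the description: 3 queues, each with
        // min share = max share = Integer.MAX_VALUE.
        int[] shares = {Integer.MAX_VALUE, Integer.MAX_VALUE, Integer.MAX_VALUE};
        // A plain int sum would wrap negative; the saturating sum stays
        // positive, so the doubling loop in the caller can still terminate.
        System.out.println(saturatingTotal(shares));
    }
}
```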
[jira] [Commented] (YARN-9130) Add Bind_HOST configuration for Yarn Web Proxy
[ https://issues.apache.org/jira/browse/YARN-9130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725536#comment-16725536 ] Íñigo Goiri commented on YARN-9130: --- [^YARN-9130.003.patch] LGTM. The approach mimics what is available for other components for bind-host. +1 Committing to trunk.

> Add Bind_HOST configuration for Yarn Web Proxy
> --
>
> Key: YARN-9130
> URL: https://issues.apache.org/jira/browse/YARN-9130
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: yarn
>Affects Versions: 3.1.1
>Reporter: Rong Tang
>Assignee: Rong Tang
>Priority: Major
> Attachments: YARN-9130.002.patch, YARN-9130.003.patch, Yarn-9130.001.patch
>
>
> Allow a configurable bind-host for the Yarn Web Proxy to allow overriding the host name on which the server accepts connections.
> It is similar to what has been done in the JournalNode and RM, like
> https://issues.apache.org/jira/browse/HDFS-13462
>
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9038) [CSI] Add ability to publish/unpublish volumes on node managers
[ https://issues.apache.org/jira/browse/YARN-9038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-9038: -- Attachment: YARN-9038.004.patch

> [CSI] Add ability to publish/unpublish volumes on node managers
> ---
>
> Key: YARN-9038
> URL: https://issues.apache.org/jira/browse/YARN-9038
> Project: Hadoop YARN
> Issue Type: Sub-task
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
> Labels: CSI
> Attachments: YARN-9038.001.patch, YARN-9038.002.patch, YARN-9038.003.patch, YARN-9038.004.patch
>
>
> We need to add the ability to publish volumes on node managers into a staging area under the NM's local dir, and then mount that path into the docker container to make it visible inside the container. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9038) [CSI] Add ability to publish/unpublish volumes on node managers
[ https://issues.apache.org/jira/browse/YARN-9038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725519#comment-16725519 ] Weiwei Yang commented on YARN-9038: --- Oops, the v3 patch includes some unexpected changes. Correcting them now...

> [CSI] Add ability to publish/unpublish volumes on node managers
> ---
>
> Key: YARN-9038
> URL: https://issues.apache.org/jira/browse/YARN-9038
> Project: Hadoop YARN
> Issue Type: Sub-task
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
> Labels: CSI
> Attachments: YARN-9038.001.patch, YARN-9038.002.patch, YARN-9038.003.patch
>
>
> We need to add the ability to publish volumes on node managers into a staging area under the NM's local dir, and then mount that path into the docker container to make it visible inside the container. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5168) Add port mapping handling when docker container use bridge network
[ https://issues.apache.org/jira/browse/YARN-5168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725513#comment-16725513 ] Xun Liu commented on YARN-5168: --- [~eyang], I checked the code carefully, and I believe the 3 errors reported by Jenkins have nothing to do with my code. {quote}[https://builds.apache.org/job/PreCommit-YARN-Build/22923/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt] [https://builds.apache.org/job/PreCommit-YARN-Build/22923/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-applications_hadoop-yarn-applications-distributedshell.txt] [https://builds.apache.org/job/PreCommit-YARN-Build/22923/artifact/out/patch-unit-hadoop-tools_hadoop-sls.txt] {quote} I kept the container newInstance call with 7 parameters; the exposed ports are set by the setExposedPort function. Please help me review the code, thank you!

> Add port mapping handling when docker container use bridge network
> --
>
> Key: YARN-5168
> URL: https://issues.apache.org/jira/browse/YARN-5168
> Project: Hadoop YARN
> Issue Type: Sub-task
>Reporter: Jun Gong
>Assignee: Xun Liu
>Priority: Major
> Labels: Docker
> Attachments: YARN-5168.001.patch, YARN-5168.002.patch, YARN-5168.003.patch, YARN-5168.004.patch, YARN-5168.005.patch, YARN-5168.006.patch, YARN-5168.007.patch, YARN-5168.008.patch, YARN-5168.009.patch, YARN-5168.010.patch, YARN-5168.011.patch, YARN-5168.012.patch, YARN-5168.013.patch, YARN-5168.014.patch, YARN-5168.015.patch, YARN-5168.016.patch, YARN-5168.017.patch, YARN-5168.018.patch, exposedPorts1.png, exposedPorts2.png
>
>
> YARN-4007 addresses different network setups when launching the docker container. We need to support port mapping when the docker container uses a bridge network.
> The following problems are what we faced:
> 1. Add "-P" to map the docker container's exposed ports automatically.
> 2. Add "-p" to let the user specify specific ports to map.
> 3.
Add service registry support for the bridge network case, so apps can find each other. It could be done outside of YARN; however, it might be more convenient to support it natively in YARN. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
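For illustration, the two flag styles in points 1 and 2 of the YARN-5168 description could be assembled like this. This is a hypothetical helper with made-up names, not Hadoop's actual docker command builder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class DockerPortArgsSketch {
    /**
     * Builds the docker CLI port flags for a bridge-network container:
     * "-P" publishes every exposed port automatically; each entry in
     * hostToContainer adds an explicit "-p host:container" mapping.
     */
    static List<String> portArgs(boolean publishAll, Map<Integer, Integer> hostToContainer) {
        List<String> args = new ArrayList<>();
        if (publishAll) {
            args.add("-P");
        }
        for (Map.Entry<Integer, Integer> e : hostToContainer.entrySet()) {
            args.add("-p");
            args.add(e.getKey() + ":" + e.getValue());
        }
        return args;
    }

    public static void main(String[] args) {
        // e.g. docker run -P -p 8080:80 ... for a web server in bridge mode
        System.out.println(portArgs(true, Map.of(8080, 80)));
    }
}
```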
[jira] [Commented] (YARN-9130) Add Bind_HOST configuration for Yarn Web Proxy
[ https://issues.apache.org/jira/browse/YARN-9130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725511#comment-16725511 ] Xiao Liang commented on YARN-9130: -- Thanks [~trjianjianjiao] for the patch. This configuration is necessary in certain cases, and [^YARN-9130.003.patch] looks good to me, +1 for it.

> Add Bind_HOST configuration for Yarn Web Proxy
> --
>
> Key: YARN-9130
> URL: https://issues.apache.org/jira/browse/YARN-9130
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: yarn
>Affects Versions: 3.1.1
>Reporter: Rong Tang
>Assignee: Rong Tang
>Priority: Major
> Attachments: YARN-9130.002.patch, YARN-9130.003.patch, Yarn-9130.001.patch
>
>
> Allow a configurable bind-host for the Yarn Web Proxy to allow overriding the host name on which the server accepts connections.
> It is similar to what has been done in the JournalNode and RM, like
> https://issues.apache.org/jira/browse/HDFS-13462
>
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9130) Add Bind_HOST configuration for Yarn Web Proxy
[ https://issues.apache.org/jira/browse/YARN-9130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725507#comment-16725507 ] Hadoop QA commented on YARN-9130: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 13s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 34s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 47s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 12s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 15s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 51s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 3s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 22s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 28s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 13s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 33s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 30s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 7m 30s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 18s{color} | {color:green} hadoop-yarn-project/hadoop-yarn: The patch generated 0 new + 223 unchanged - 1 fixed = 223 total (was 224) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 52s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 2s{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 7s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 38s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 32s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 46s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 28s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 51s{color} | {color:green} hadoop-yarn-server-web-proxy in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 34s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 83m 59s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f | | JIRA Issue | YARN-9130 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12952431/YARN-9130.003.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml | | uname | Linux 981aab09e4f8 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64
[jira] [Resolved] (YARN-8523) Interactive docker shell
[ https://issues.apache.org/jira/browse/YARN-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang resolved YARN-8523. - Resolution: Fixed Fix Version/s: 3.3.0 Resolved by YARN-8762.

> Interactive docker shell
>
>
> Key: YARN-8523
> URL: https://issues.apache.org/jira/browse/YARN-8523
> Project: Hadoop YARN
> Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Zian Chen
>Priority: Major
> Labels: Docker
> Fix For: 3.3.0
>
>
> Some applications might require interactive unix command execution to carry out operations. Container-executor can interface with docker exec to debug or analyze docker containers while the application is running. It would be nice to support an API that invokes docker exec to perform unix commands and reports the output back to the application master. The application master can distribute and aggregate execution of the commands and record them in the application master log file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-8762) [Umbrella] Support Interactive Docker Shell to running Containers
[ https://issues.apache.org/jira/browse/YARN-8762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang resolved YARN-8762. - Resolution: Fixed Fix Version/s: 3.3.0 Release Note: - Add shell access to YARN containers All tasks are done. Thank you [~Zian Chen] for the contribution. Thank you [~billie.rinaldi] for the detailed reviews.

> [Umbrella] Support Interactive Docker Shell to running Containers
> -
>
> Key: YARN-8762
> URL: https://issues.apache.org/jira/browse/YARN-8762
> Project: Hadoop YARN
> Issue Type: New Feature
>Reporter: Zian Chen
>Assignee: Eric Yang
>Priority: Major
> Labels: Docker
> Fix For: 3.3.0
>
> Attachments: Interactive Docker Shell design doc.pdf
>
>
> Debugging distributed applications can be challenging on Hadoop. Hadoop provides limited debugging capability through application log files. One of the most frequently requested features is an interactive shell to assist real-time debugging. This feature is inspired by docker exec, which provides the ability to run arbitrary commands in a docker container. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-8762) [Umbrella] Support Interactive Docker Shell to running Containers
[ https://issues.apache.org/jira/browse/YARN-8762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang reassigned YARN-8762: --- Assignee: Eric Yang

> [Umbrella] Support Interactive Docker Shell to running Containers
> -
>
> Key: YARN-8762
> URL: https://issues.apache.org/jira/browse/YARN-8762
> Project: Hadoop YARN
> Issue Type: New Feature
>Reporter: Zian Chen
>Assignee: Eric Yang
>Priority: Major
> Labels: Docker
> Attachments: Interactive Docker Shell design doc.pdf
>
>
> Debugging distributed applications can be challenging on Hadoop. Hadoop provides limited debugging capability through application log files. One of the most frequently requested features is an interactive shell to assist real-time debugging. This feature is inspired by docker exec, which provides the ability to run arbitrary commands in a docker container. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9129) Ensure flush after printing to log plus additional cleanup
[ https://issues.apache.org/jira/browse/YARN-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725488#comment-16725488 ] Hudson commented on YARN-9129: -- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #15639 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/15639/]) YARN-9129. Ensure flush after printing to log plus additional cleanup. (billie: rev 2e544dc921afeaa02e731cb273ac7776eec6e49d) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/ContainerShellWebSocket.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/impl/YarnClientImpl.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/test-container-executor.c * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c > Ensure flush after printing to log plus additional cleanup > -- > > Key: YARN-9129 > URL: https://issues.apache.org/jira/browse/YARN-9129 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Billie Rinaldi >Assignee: Eric Yang >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-9129.001.patch, YARN-9129.002.patch, > YARN-9129.003.patch > > > Following up on findings in YARN-8962, I noticed the following issues in > container-executor and main.c: > - There seem to be some vars that are not cleaned up in container_executor: > In run_docker else: free docker_binary > In exec_container: > before return INVALID_COMMAND_FILE: free docker_binary > 3x return DOCKER_EXEC_FAILED: set exit code and goto cleanup instead > cleanup needed before exit calls? 
> - In YARN-8777 we added several fprintf(stderr calls, but the convention in > container-executor.c appears to be fprintf(ERRORFILE followed by > fflush(ERRORFILE). > - There are leaks in TestDockerUtil_test_add_ports_mapping_to_command_Test. > - There are additional places where flush is not performed after writing to > stderr, including main.c display_feature_disabled_message. This can result in > the client not receiving the error message if the connection is closed too > quickly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
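The flush problem described in YARN-9129 is easy to demonstrate outside container-executor. This small Java sketch (hypothetical names, not Hadoop code) shows that a message written through a buffered writer is not visible downstream until an explicit flush, which is why an early connection close can drop the error message.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintWriter;

public class FlushDemo {
    /** Returns {bytes visible before flush, bytes visible after flush}. */
    static int[] visibleBytes(String message) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // A PrintWriter over an OutputStream buffers internally (no autoflush).
        PrintWriter err = new PrintWriter(sink);
        err.print(message);
        int before = sink.size(); // message still sits in the writer's buffer
        err.flush();              // analogous to fflush(ERRORFILE) in the C code
        int after = sink.size();  // now the bytes have reached the sink
        return new int[] {before, after};
    }

    public static void main(String[] args) {
        int[] v = visibleBytes("error: feature disabled");
        System.out.println(v[0] + " -> " + v[1]);
    }
}
```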
[jira] [Commented] (YARN-9129) Ensure flush after printing to log plus additional cleanup
[ https://issues.apache.org/jira/browse/YARN-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725469#comment-16725469 ] Billie Rinaldi commented on YARN-9129: -- +1 for patch 3. Thanks, [~eyang]! > Ensure flush after printing to log plus additional cleanup > -- > > Key: YARN-9129 > URL: https://issues.apache.org/jira/browse/YARN-9129 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Billie Rinaldi >Assignee: Eric Yang >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-9129.001.patch, YARN-9129.002.patch, > YARN-9129.003.patch > > > Following up on findings in YARN-8962, I noticed the following issues in > container-executor and main.c: > - There seem to be some vars that are not cleaned up in container_executor: > In run_docker else: free docker_binary > In exec_container: > before return INVALID_COMMAND_FILE: free docker_binary > 3x return DOCKER_EXEC_FAILED: set exit code and goto cleanup instead > cleanup needed before exit calls? > - In YARN-8777 we added several fprintf(stderr calls, but the convention in > container-executor.c appears to be fprintf(ERRORFILE followed by > fflush(ERRORFILE). > - There are leaks in TestDockerUtil_test_add_ports_mapping_to_command_Test. > - There are additional places where flush is not performed after writing to > stderr, including main.c display_feature_disabled_message. This can result in > the client not receiving the error message if the connection is closed too > quickly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9129) Ensure flush after printing to log plus additional cleanup
[ https://issues.apache.org/jira/browse/YARN-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Billie Rinaldi updated YARN-9129: - Summary: Ensure flush after printing to log plus additional cleanup (was: Ensure flush after printing to stderr plus additional cleanup) > Ensure flush after printing to log plus additional cleanup > -- > > Key: YARN-9129 > URL: https://issues.apache.org/jira/browse/YARN-9129 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Billie Rinaldi >Assignee: Eric Yang >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-9129.001.patch, YARN-9129.002.patch, > YARN-9129.003.patch > > > Following up on findings in YARN-8962, I noticed the following issues in > container-executor and main.c: > - There seem to be some vars that are not cleaned up in container_executor: > In run_docker else: free docker_binary > In exec_container: > before return INVALID_COMMAND_FILE: free docker_binary > 3x return DOCKER_EXEC_FAILED: set exit code and goto cleanup instead > cleanup needed before exit calls? > - In YARN-8777 we added several fprintf(stderr calls, but the convention in > container-executor.c appears to be fprintf(ERRORFILE followed by > fflush(ERRORFILE). > - There are leaks in TestDockerUtil_test_add_ports_mapping_to_command_Test. > - There are additional places where flush is not performed after writing to > stderr, including main.c display_feature_disabled_message. This can result in > the client not receiving the error message if the connection is closed too > quickly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-9116) Capacity Scheduler: add the default maximum-allocation-mb and maximum-allocation-vcores for the queues
[ https://issues.apache.org/jira/browse/YARN-9116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725466#comment-16725466 ] Aihua Xu edited comment on YARN-9116 at 12/20/18 12:41 AM: --- Patch-1: this patch adds the simple logic to give default memory/vcore values to queues that have no such configuration set. A new configuration, "yarn.scheduler.capacity.default-queue-maximum-allocation", is added to set the queue default for maximum allocation. I didn't implement queue inheritance since I feel this keeps the configuration simpler. Let me know if it's needed and I can do that in a followup. was (Author: aihuaxu): Patch-1: this patch adds the simple logic to give default memory/vcore values to queues that have no such configuration set. A new configuration, "yarn.scheduler.capacity.default-queue-maximum-allocation", is added to set the queue default for maximum allocation in the configuration.

> Capacity Scheduler: add the default maximum-allocation-mb and maximum-allocation-vcores for the queues
> --
>
> Key: YARN-9116
> URL: https://issues.apache.org/jira/browse/YARN-9116
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: capacity scheduler
>Affects Versions: 2.7.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
>Priority: Major
> Attachments: YARN-9116.1.patch
>
>
> YARN-1582 adds support for a per-queue maximum-allocation-mb configuration, which targets larger-container features on dedicated queues (a larger maximum-allocation-mb/maximum-allocation-vcores for such a queue).
> To achieve a larger container configuration, we need to increase the global maximum-allocation-mb/maximum-allocation-vcores (e.g. 120G/256) and then override those configurations with the desired values on the queues, since a queue configuration can't be larger than the cluster configuration.
There are many queues in the system, and if we forget to configure such values when adding a new queue, that queue gets the default 120G/256, which typically is not what we want.
> We can come up with a queue-default configuration (set to a normal queue configuration like 16G/8), so the leaf queues get such values by default. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
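For context, the interplay of the global and per-queue settings discussed in YARN-9116 might look like this in configuration. The first two properties are the standard yarn-site.xml cluster-wide limits, and the per-queue override comes from YARN-1582 (with a hypothetical queue name "bigjobs"); the last property is the name proposed in this patch, and its value syntax here is purely illustrative since the committed format may differ.

```xml
<!-- yarn-site.xml: cluster-wide ceilings, raised to allow large containers -->
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>122880</value> <!-- 120 GB -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>256</value>
</property>

<!-- capacity-scheduler.xml: per-queue override from YARN-1582;
     "bigjobs" is a hypothetical queue name -->
<property>
  <name>yarn.scheduler.capacity.root.bigjobs.maximum-allocation-mb</name>
  <value>122880</value>
</property>

<!-- capacity-scheduler.xml: queue default proposed in this patch;
     the value syntax below is a hypothetical illustration only -->
<property>
  <name>yarn.scheduler.capacity.default-queue-maximum-allocation</name>
  <value>memory-mb=16384,vcores=8</value>
</property>
```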
[jira] [Updated] (YARN-9116) Capacity Scheduler: add the default maximum-allocation-mb and maximum-allocation-vcores for the queues
[ https://issues.apache.org/jira/browse/YARN-9116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aihua Xu updated YARN-9116: --- Attachment: YARN-9116.1.patch

> Capacity Scheduler: add the default maximum-allocation-mb and maximum-allocation-vcores for the queues
> --
>
> Key: YARN-9116
> URL: https://issues.apache.org/jira/browse/YARN-9116
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: capacity scheduler
>Affects Versions: 2.7.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
>Priority: Major
> Attachments: YARN-9116.1.patch
>
>
> YARN-1582 adds support for a per-queue maximum-allocation-mb configuration, which targets larger-container features on dedicated queues (a larger maximum-allocation-mb/maximum-allocation-vcores for such a queue).
> To achieve a larger container configuration, we need to increase the global maximum-allocation-mb/maximum-allocation-vcores (e.g. 120G/256) and then override those configurations with the desired values on the queues, since a queue configuration can't be larger than the cluster configuration. There are many queues in the system, and if we forget to configure such values when adding a new queue, that queue gets the default 120G/256, which typically is not what we want.
> We can come up with a queue-default configuration (set to a normal queue configuration like 16G/8), so the leaf queues get such values by default. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9153) Diagnostics in the container status doesn't get reset after re-init
Chandni Singh created YARN-9153: --- Summary: Diagnostics in the container status doesn't get reset after re-init Key: YARN-9153 URL: https://issues.apache.org/jira/browse/YARN-9153 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, yarn Reporter: Chandni Singh Assignee: Chandni Singh When a container is reinitialized, its diagnostics are set to a long string - "Reinitializing await...". Even after the container starts running, these diagnostics are not cleared. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9130) Add Bind_HOST configuration for Yarn Web Proxy
[ https://issues.apache.org/jira/browse/YARN-9130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725455#comment-16725455 ] Rong Tang commented on YARN-9130: - [~elgoiri] Fixed the checkstyle.

> Add Bind_HOST configuration for Yarn Web Proxy
> --
>
> Key: YARN-9130
> URL: https://issues.apache.org/jira/browse/YARN-9130
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: yarn
>Affects Versions: 3.1.1
>Reporter: Rong Tang
>Assignee: Rong Tang
>Priority: Major
> Attachments: YARN-9130.002.patch, YARN-9130.003.patch, Yarn-9130.001.patch
>
>
> Allow a configurable bind-host for the Yarn Web Proxy to allow overriding the host name on which the server accepts connections.
> It is similar to what has been done in the JournalNode and RM, like
> https://issues.apache.org/jira/browse/HDFS-13462
>
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9130) Add Bind_HOST configuration for Yarn Web Proxy
[ https://issues.apache.org/jira/browse/YARN-9130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rong Tang updated YARN-9130: Attachment: YARN-9130.003.patch

> Add Bind_HOST configuration for Yarn Web Proxy
> --
>
> Key: YARN-9130
> URL: https://issues.apache.org/jira/browse/YARN-9130
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: yarn
>Affects Versions: 3.1.1
>Reporter: Rong Tang
>Assignee: Rong Tang
>Priority: Major
> Attachments: YARN-9130.002.patch, YARN-9130.003.patch, Yarn-9130.001.patch
>
>
> Allow a configurable bind-host for the Yarn Web Proxy to allow overriding the host name on which the server accepts connections.
> It is similar to what has been done in the JournalNode and RM, like
> https://issues.apache.org/jira/browse/HDFS-13462
>
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9129) Ensure flush after printing to stderr plus additional cleanup
[ https://issues.apache.org/jira/browse/YARN-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725451#comment-16725451 ] Hadoop QA commented on YARN-9129: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 29s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 29s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 24m 48s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 11m 38s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 42s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 45s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 52s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 8s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 6s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 14s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 17s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 9m 51s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 9m 51s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 9m 51s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 36s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 32s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 17s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 3s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 56s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 19m 21s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 25m 44s{color} | {color:green} hadoop-yarn-client in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 39s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}137m 14s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f | | JIRA Issue | YARN-9129 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12952417/YARN-9129.003.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle cc | | uname | Linux 6e7470c5bfd8 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / e815fd9 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_181 | | findbugs | v3.1.0-RC1 | | Test
[jira] [Assigned] (YARN-9152) Auxiliary service REST API query does not return running services
[ https://issues.apache.org/jira/browse/YARN-9152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang reassigned YARN-9152: --- Assignee: Billie Rinaldi > Auxiliary service REST API query does not return running services > - > > Key: YARN-9152 > URL: https://issues.apache.org/jira/browse/YARN-9152 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Yang >Assignee: Billie Rinaldi >Priority: Major > > Auxiliary service is configured with: > {code} > { > "services": [ > { > "name": "mapreduce_shuffle", > "version": "2", > "configuration": { > "properties": { > "class.name": "org.apache.hadoop.mapred.ShuffleHandler", > "mapreduce.shuffle.transfer.buffer.size": "102400", > "mapreduce.shuffle.port": "13563" > } > } > } > ] > } > {code} > Node manager log shows the service is registered: > {code} > 2018-12-19 22:38:57,466 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: > Reading auxiliary services manifest hdfs:/tmp/aux.json > 2018-12-19 22:38:57,827 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: > Initialized auxiliary service mapreduce_shuffle > 2018-12-19 22:38:57,828 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: > Adding auxiliary service mapreduce_shuffle version 2 > {code} > REST API query shows: > {code} > $ curl --negotiate -u : > http://eyang-3.openstacklocal:8042/ws/v1/node/auxiliaryservices > {"services":{}} > {code}
[jira] [Created] (YARN-9152) Auxiliary service REST API query does not return running services
Eric Yang created YARN-9152: --- Summary: Auxiliary service REST API query does not return running services Key: YARN-9152 URL: https://issues.apache.org/jira/browse/YARN-9152 Project: Hadoop YARN Issue Type: Sub-task Reporter: Eric Yang Auxiliary service is configured with: {code} { "services": [ { "name": "mapreduce_shuffle", "version": "2", "configuration": { "properties": { "class.name": "org.apache.hadoop.mapred.ShuffleHandler", "mapreduce.shuffle.transfer.buffer.size": "102400", "mapreduce.shuffle.port": "13563" } } } ] } {code} Node manager log shows the service is registered: {code} 2018-12-19 22:38:57,466 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Reading auxiliary services manifest hdfs:/tmp/aux.json 2018-12-19 22:38:57,827 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Initialized auxiliary service mapreduce_shuffle 2018-12-19 22:38:57,828 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Adding auxiliary service mapreduce_shuffle version 2 {code} REST API query shows: {code} $ curl --negotiate -u : http://eyang-3.openstacklocal:8042/ws/v1/node/auxiliaryservices {"services":{}} {code}
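One quick check worth running with a manifest like the one above is confirming it is well-formed JSON before uploading it to HDFS; this does not diagnose the empty REST response reported here, it only rules out a parse error on the manifest itself. A minimal local sketch (the /tmp path is illustrative):

```shell
# Write a local copy of the manifest quoted in the report and verify it
# parses as JSON before pushing it to hdfs:/tmp/aux.json.
cat > /tmp/aux.json <<'EOF'
{
  "services": [
    {
      "name": "mapreduce_shuffle",
      "version": "2",
      "configuration": {
        "properties": {
          "class.name": "org.apache.hadoop.mapred.ShuffleHandler",
          "mapreduce.shuffle.transfer.buffer.size": "102400",
          "mapreduce.shuffle.port": "13563"
        }
      }
    }
  ]
}
EOF
python3 -m json.tool < /tmp/aux.json > /dev/null && echo "manifest OK"
```

If the manifest parses and the NM log still shows the service registered while the endpoint returns {"services":{}}, the problem is on the REST reporting side, which is what this sub-task tracks.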
[jira] [Updated] (YARN-9129) Ensure flush after printing to stderr plus additional cleanup
[ https://issues.apache.org/jira/browse/YARN-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-9129: Attachment: YARN-9129.003.patch > Ensure flush after printing to stderr plus additional cleanup > - > > Key: YARN-9129 > URL: https://issues.apache.org/jira/browse/YARN-9129 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Billie Rinaldi >Assignee: Eric Yang >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-9129.001.patch, YARN-9129.002.patch, > YARN-9129.003.patch > > > Following up on findings in YARN-8962, I noticed the following issues in > container-executor and main.c: > - There seem to be some vars that are not cleaned up in container_executor: > In run_docker else: free docker_binary > In exec_container: > before return INVALID_COMMAND_FILE: free docker_binary > 3x return DOCKER_EXEC_FAILED: set exit code and goto cleanup instead > cleanup needed before exit calls? > - In YARN-8777 we added several fprintf(stderr calls, but the convention in > container-executor.c appears to be fprintf(ERRORFILE followed by > fflush(ERRORFILE). > - There are leaks in TestDockerUtil_test_add_ports_mapping_to_command_Test. > - There are additional places where flush is not performed after writing to > stderr, including main.c display_feature_disabled_message. This can result in > the client not receiving the error message if the connection is closed too > quickly.
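The fprintf(ERRORFILE)/fflush(ERRORFILE) convention described in the issue can be sketched as follows. This is an illustration, not code from container-executor.c: ERRORFILE is simply aliased to stderr here, and report_error is a hypothetical helper.

```c
#include <stdio.h>

/* ERRORFILE stands in for container-executor's error stream; in the real
 * code it is the FILE* for the error log, here simply stderr. */
#define ERRORFILE stderr

/* Hypothetical helper showing the convention: every write to the error
 * stream is immediately followed by fflush, so the message reaches the
 * client even if the process exits (or the connection is closed) before
 * stdio would otherwise drain its buffers. */
static void report_error(const char *msg) {
    fprintf(ERRORFILE, "ERROR: %s\n", msg);
    fflush(ERRORFILE);
}
```

The same issue also suggests replacing the early `return DOCKER_EXEC_FAILED;` statements with setting an exit code and `goto cleanup;`, so that buffers such as docker_binary are freed on every exit path rather than only on the success path.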
[jira] [Commented] (YARN-5168) Add port mapping handling when docker container use bridge network
[ https://issues.apache.org/jira/browse/YARN-5168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725325#comment-16725325 ] Hadoop QA commented on YARN-5168: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 16s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 13 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 22s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 9s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 16m 22s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 3m 43s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 7m 34s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 22m 54s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Skipped patched modules with no Java source: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 9m 42s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 45s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 24s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 30s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 14m 33s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 14m 33s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 14m 33s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 3m 12s{color} | {color:green} root: The patch generated 0 new + 999 unchanged - 7 fixed = 999 total (was 1006) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 7m 24s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 8s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Skipped patched modules with no Java source: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 10m 51s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 31s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 41s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 26s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 28s{color} | {color:green} hadoop-yarn-server-common in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 19m 13s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 20s{color} | {color:green} hadoop-yarn-server-applicationhistoryservice in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red}
[jira] [Commented] (YARN-9126) Container reinit always fails in branch-3.2 and trunk
[ https://issues.apache.org/jira/browse/YARN-9126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725318#comment-16725318 ] Hudson commented on YARN-9126: -- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #15638 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/15638/]) YARN-9126. Fix container clean up for reinitialization. (eyang: rev e815fd9c49e80b9200dd8852abe74fe219ad9110) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainersLauncher.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainersLauncher.java > Container reinit always fails in branch-3.2 and trunk > - > > Key: YARN-9126 > URL: https://issues.apache.org/jira/browse/YARN-9126 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Yang >Assignee: Chandni Singh >Priority: Major > Labels: docker > Fix For: 3.3.0, 3.2.1 > > Attachments: YARN-9126.001.patch, YARN-9126.002.patch, > YARN-9126.003.patch > > > When upgrading a container, container reinitialization always failed with code > 33. This error code means the file being localized already exists while copying > resource files. The container will retry with another container ID, hence > the problem is masked. > Hadoop 3.1.x relaunch logic seems to have some way to prevent this bug from > happening. The same logic might be useful in branch-3.2 and trunk.
[jira] [Commented] (YARN-9126) Container reinit always fails in branch-3.2 and trunk
[ https://issues.apache.org/jira/browse/YARN-9126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725274#comment-16725274 ] Hadoop QA commented on YARN-9126: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 19s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 50s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 1s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 27s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 38s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 2s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 56s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 37s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 57s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 57s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 23s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 3 new + 114 unchanged - 9 fixed = 117 total (was 123) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 34s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 29s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 1s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 18m 56s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 25s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 73m 26s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f | | JIRA Issue | YARN-9126 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12952388/YARN-9126.003.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 34cc13e7ba8a 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / cf57113 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_191 | | findbugs | v3.1.0-RC1 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/22926/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/22926/testReport/ | | Max. process+thread count | 340 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U:
[jira] [Commented] (YARN-9131) Document usage of Dynamic auxiliary services
[ https://issues.apache.org/jira/browse/YARN-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725273#comment-16725273 ] Hadoop QA commented on YARN-9131: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 34s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 36s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 39m 27s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 24m 13s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 3m 25s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 0s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 18m 41s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Skipped patched modules with no Java source: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 12s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 18s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 20s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 22s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 14m 37s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 14m 37s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 3m 26s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 58s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 59s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Skipped patched modules with no Java source: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 31s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 16s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 18m 51s{color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 4m 16s{color} | {color:green} hadoop-mapreduce-client-core in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 22s{color} | {color:green} hadoop-yarn-site in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 41s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}152m 32s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.nodemanager.amrmproxy.TestFederationInterceptor | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce
[jira] [Commented] (YARN-9038) [CSI] Add ability to publish/unpublish volumes on node managers
[ https://issues.apache.org/jira/browse/YARN-9038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725234#comment-16725234 ] Hadoop QA commented on YARN-9038: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 20s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 6 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 41s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 5s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 51s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 38s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 4m 30s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 18m 37s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 7m 17s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 19s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 14s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 44s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 46s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 7m 46s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 7m 46s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 1m 35s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch generated 22 new + 494 unchanged - 0 fixed = 516 total (was 494) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 4m 13s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 2s{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 28s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 20s{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0) {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 20s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 49s{color} | {color:red} hadoop-yarn-api in the patch failed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 31s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 19m 18s{color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 93m 35s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 16m 23s{color} | {color:red} hadoop-yarn-services-core in the patch failed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 44s{color} | {color:red} hadoop-yarn-csi in the patch failed. {color} |
[jira] [Updated] (YARN-9126) Container reinit always fails in branch-3.2 and trunk
[ https://issues.apache.org/jira/browse/YARN-9126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-9126: Attachment: YARN-9126.003.patch > Container reinit always fails in branch-3.2 and trunk > - > > Key: YARN-9126 > URL: https://issues.apache.org/jira/browse/YARN-9126 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Yang >Assignee: Chandni Singh >Priority: Major > Labels: docker > Attachments: YARN-9126.001.patch, YARN-9126.002.patch, > YARN-9126.003.patch > > > When upgrading a container, container reinitialization always failed with code > 33. This error code means the file being localized already exists while copying > resource files. The container will retry with another container ID, hence > the problem is masked. > Hadoop 3.1.x relaunch logic seems to have some way to prevent this bug from > happening. The same logic might be useful in branch-3.2 and trunk.
[jira] [Commented] (YARN-9132) Add file permission check for auxiliary services manifest file
[ https://issues.apache.org/jira/browse/YARN-9132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725215#comment-16725215 ] Hadoop QA commented on YARN-9132: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 57s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 26m 11s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 12s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 27s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 44s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 37s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 10s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 41s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 1s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 1s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 22s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 43s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 11s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 3s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 18m 51s{color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 27s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 81m 33s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.nodemanager.amrmproxy.TestFederationInterceptor | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f | | JIRA Issue | YARN-9132 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12952255/YARN-9132.2.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux e33551813adf 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / cf57113 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_191 | | findbugs | v3.1.0-RC1 | | unit | https://builds.apache.org/job/PreCommit-YARN-Build/22924/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/22924/testReport/ | | Max. process+thread count | 307 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U:
[jira] [Commented] (YARN-9098) Separate mtab file reader code and cgroups file system hierarchy parser code from CGroupsHandlerImpl and ResourceHandlerModule
[ https://issues.apache.org/jira/browse/YARN-9098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725212#comment-16725212 ] Jim Brennan commented on YARN-9098: --- [~snemeth] I found the bug (or at least one bug). In findControllerPathInMountConfig(), you are not looping: {noformat} public static String findControllerPathInMountConfig(String controller, CGroupsMountConfig mountConfig) { String path = mountConfig.getPathForController(controller); if (path != null) { if (new File(path).canRead()) { return path; } else { LOG.warn(String.format( "Skipping inaccessible cgroup mount point %s", path)); } } return null; } {noformat} If the bad entry for the CPU controller comes before the good entry, then you will "skip" and return null for the CPU controller. This code path should loop through all of the mountconfig entries so that it will properly skip bad entries and find good ones. It also makes me concerned about other uses of mountConfig.getPathForController() - the current code essentially assumes that there is only one entry for each controller. The reason I think I am hitting it and you (and precommit) are not is because this is a hash map, so the ordering is essentially random depending on the file paths. Since our file paths are different, we get different orderings, and since in my case the bad entry comes first, the tests fail for me. I found this while debugging the testMtabParsing() test. 
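A minimal sketch of the looping fix suggested in the comment above. The mount-config type and method names here are made up for illustration (the real CGroupsMountConfig API differs); the point is that the lookup iterates over *all* candidate mount paths for a controller instead of giving up after the first unreadable one:

```java
import java.io.File;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class ControllerPathFinder {
    // Stand-in for CGroupsMountConfig: every candidate mount path per
    // controller, not just the first one the parser happened to record.
    private final Map<String, List<String>> pathsByController;

    ControllerPathFinder(Map<String, List<String>> pathsByController) {
        this.pathsByController = pathsByController;
    }

    String findControllerPath(String controller) {
        for (String path : pathsByController.getOrDefault(
                controller, Collections.emptyList())) {
            if (new File(path).canRead()) {
                return path; // first readable candidate wins
            }
            // keep looking instead of returning null on the first bad entry
            System.err.printf(
                "Skipping inaccessible cgroup mount point %s%n", path);
        }
        return null;
    }
}
```

With this shape, a bad entry ordered before a good one (the HashMap-ordering problem described above) no longer masks the good entry.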
> Separate mtab file reader code and cgroups file system hierarchy parser code > from CGroupsHandlerImpl and ResourceHandlerModule > -- > > Key: YARN-9098 > URL: https://issues.apache.org/jira/browse/YARN-9098 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-9098.002.patch, YARN-9098.003.patch, > YARN-9098.004.patch > > > Separate mtab file reader code and cgroups file system hierarchy parser code > from CGroupsHandlerImpl and ResourceHandlerModule > CGroupsHandlerImpl has a method parseMtab that parses an mtab file and stores > cgroups data. > CGroupsLCEResourcesHandler also has a method with the same name, with > identical code. > The parser code should be extracted from these places and be added in a new > class as this is a separate responsibility. > As the output of the file parser is a Map>, it's better > to encapsulate it in a domain object, named 'CGroupsMountConfig' for instance. > ResourceHandlerModule has a method named parseConfiguredCGroupPath, that is > responsible for producing the same results (Map>) to > store cgroups data; it does not operate on the mtab file but looks at the > filesystem for cgroup settings. As the output is the same, CGroupsMountConfig > should be used here, too. > Again, this code should not be part of ResourceHandlerModule as it is a > different responsibility. > One more thing which is strongly related to the methods above is > CGroupsHandlerImpl.initializeFromMountConfig: This method processes the > result of a parsed mtab file or parsed cgroups filesystem data and stores > file system paths for all available controllers. This method invokes > findControllerPathInMountConfig, which is duplicated in CGroupsHandlerImpl > and CGroupsLCEResourcesHandler, so it should be moved to a single place. To > store filesystem path and controller mappings, a new domain object could be > introduced. 
[jira] [Commented] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725199#comment-16725199 ] Íñigo Goiri commented on YARN-9151: --- Thanks [~yqwang] for the patch. I think we want to add a specific test which triggers the actual exception (i.e., {{UnknownHostException}}) and catches it. It should be a matter of adding a weird host to the connect string. Regarding the checkstyle, I'm not very sure how it checks indentation for switch/case, but as this is the first place it is used, let's follow the recommendation from Yetus and move all the {{case}} labels to the left. > Standby RM hangs (not retry or crash) forever due to forever lost from leader > election > -- > > Key: YARN-9151 > URL: https://issues.apache.org/jira/browse/YARN-9151 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.9.2 >Reporter: Yuqi Wang >Assignee: Yuqi Wang >Priority: Major > Labels: patch > Fix For: 3.1.1 > > Attachments: YARN-9151.001.patch > > > {color:#205081}*Issue Summary:*{color} > Standby RM hangs (not retry or crash) forever due to forever lost from > leader election > > {color:#205081}*Issue Repro Steps:*{color} > # Start multiple RMs in HA mode > # Modify all hostnames in the zk connect string to different values in DNS. > (In reality, we need to replace old/bad zk machines with new/good zk machines, > so their DNS hostnames will be changed.) > > {color:#205081}*Issue Logs:*{color} > See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318). 
> To make it clear, the whole story is: > {noformat} > Join Election > Win the leader (ZK Node Creation Callback) > Start to becomeActive > Start RMActiveServices > Start CommonNodeLabelsManager failed due to zk connect > UnknownHostException > Stop CommonNodeLabelsManager > Stop RMActiveServices > Create and Init RMActiveServices > Fail to becomeActive > ReJoin Election > Failed to Join Election due to zk connect UnknownHostException (Here the > exception is eaten and only an event is sent) > Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby > Transitioning RM to Standby > Start StandByTransitionThread > Already in standby state > ReJoin Election > Failed to Join Election due to zk connect UnknownHostException (Here the > exception is eaten and only an event is sent) > Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby > Transitioning RM to Standby > Start StandByTransitionThread > Found RMActiveServices's StandByTransitionRunnable object has already run > previously, so immediately return > > {noformat} > The standby RM failed to rejoin the election, but it will never retry or > crash later, *so afterwards there are no zk-related logs and the standby RM hangs > forever, even if the zk connect string hostnames are changed back to the original > ones in DNS.* > So, this should be a bug in RM, because *RM should always try to join the > election* (giving up on joining the election should only happen when the RM decides to crash); > otherwise, an RM that is not in the election can never become active again and > start real work. > > {color:#205081}*Caused By:*{color} > It was introduced by YARN-3742 > What that JIRA wanted to improve is that, when a STATE_STORE_OP_FAILED RMFatalEvent > happens, the RM should transition to standby instead of crashing. 
> *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition > to standby, instead of crash.* (In contrast, before this change, the RM crashed on > all of them instead of going to standby) > So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it > leaves the standby RM unable to work, e.g. stuck in standby forever. > And as the author said: > {quote}I think a good approach here would be to change the RMFatalEvent > handler to transition to standby as the default reaction, *with shutdown as a > special case for certain types of failures.* > {quote} > But the author was *too optimistic when implementing the patch.* > > {color:#205081}*What the Patch's solution:*{color} > So, to be *conservative*, we had better *only transition to standby for the > failures in the {color:#14892c}whitelist{color}:* > public enum RMFatalEventType { > {color:#14892c}// Source <- Store{color} > {color:#14892c}STATE_STORE_FENCED,{color} > {color:#14892c}STATE_STORE_OP_FAILED,{color} > // Source <- Embedded Elector > EMBEDDED_ELECTOR_FAILED, > {color:#14892c}// Source <- Admin Service{color} > {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color} > // Source <- Critical Thread Crash > CRITICAL_THREAD_CRASH > } > And others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and > future added failure types (until we
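The reviewer's suggestion of a test that adds a weird host to the ZK connect string hinges on provoking an {{UnknownHostException}}. A small, self-contained sketch of that condition (class and method names are made up; a real test would instead drive the embedded elector with such a connect string and assert the RM retries or crashes rather than hangs):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

class ConnectStringProbe {
    // True when the host part of a host:port entry cannot be resolved --
    // the condition the proposed regression test needs to provoke.
    static boolean isUnresolvable(String hostPort) {
        try {
            InetAddress.getByName(hostPort.split(":")[0]);
            return false;
        } catch (UnknownHostException e) {
            return true;
        }
    }
}
```

The reserved `.invalid` TLD is a convenient way to get a guaranteed-unresolvable host in such a test.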
[jira] [Comment Edited] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724903#comment-16724903 ] Íñigo Goiri edited comment on YARN-9151 at 12/19/18 5:16 PM: - BTW, [~jianhe], for YARN-4438, you said: {quote}_If it is due to close(), don't we want to force give-up so the other RM becomes active? If it is on initAndStartLeaderLatch(), *this RM will never become active; don't we want to just die?*_ What do you mean by force give-up ? exit RM ? The underlying curator implementation *will retry the connection in background*, even though the exception is thrown. See *Guaranteeable* interface in Curator. I think exit RM is too harsh here. Even though RM remains at standby, all services should be already shutdown, so there's no harm to the end users ? {quote} However, for this case, if we are using CuratorBasedElectorService, I think curator will *NOT* retry the connection, because I saw below things in the log and checked curator's code: *Background exception was not retry-able or retry gave up for UnknownHostException* {code:java} 2018-12-14 14:14:20,847 ERROR [Curator-Framework-0] org.apache.curator.framework.imps.CuratorFrameworkImpl: Background exception was not retry-able or retry gave up java.net.UnknownHostException: hostxyz at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method) at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928) at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323) at java.net.InetAddress.getAllByName0(InetAddress.java:1276) at java.net.InetAddress.getAllByName(InetAddress.java:1192) at java.net.InetAddress.getAllByName(InetAddress.java:1126) at org.apache.zookeeper.client.StaticHostProvider.(StaticHostProvider.java:61) at org.apache.zookeeper.ZooKeeper.(ZooKeeper.java:461) at org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29) at 
org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:146) at org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:94) at org.apache.curator.HandleHolder.getZooKeeper(HandleHolder.java:55) at org.apache.curator.ConnectionState.reset(ConnectionState.java:218) at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:193) at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87) at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115) at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:806) at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:792) at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:62) at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:257) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} Besides, in YARN-4438, I did not see you used the Curator *Guaranteeable* interface. Could you please confirm above things? So, in the patch, if rejoin election throws exception, it will send EMBEDDED_ELECTOR_FAILED, and then RM will crash. was (Author: yqwang): BTW, [~jianhe], for YARN-4438, you said: {quote}_If it is due to close(), don't we want to force give-up so the other RM becomes active? If it is on initAndStartLeaderLatch(), *this RM will never become active; don't we want to just die?*_ What do you mean by force give-up ? exit RM ? The underlying curator implementation *will retry the connection in background*, even though the exception is thrown. See *Guaranteeable* interface in Curator. 
I think exit RM is too harsh here. Even though RM remains at standby, all services should be already shutdown, so there's no harm to the end users ? {quote} However, for this case, if we are using CuratorBasedElectorService, I think curator will *NOT* retry the connection, because I saw below things in the log and checked curator's code: *Background exception was not retry-able or retry gave up for UnknownHostException* {code:java} 2018-12-14 14:14:20,847 ERROR [Curator-Framework-0] org.apache.curator.framework.imps.CuratorFrameworkImpl: Background exception was not retry-able or retry gave up java.net.UnknownHostException: BN2AAP10C07C229 at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method) at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928) at
[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Íñigo Goiri updated YARN-9151: -- Attachment: (was: yarn_rm.zip) > Standby RM hangs (not retry or crash) forever due to forever lost from leader > election > -- > > Key: YARN-9151 > URL: https://issues.apache.org/jira/browse/YARN-9151 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.9.2 >Reporter: Yuqi Wang >Assignee: Yuqi Wang >Priority: Major > Labels: patch > Fix For: 3.1.1 > > Attachments: YARN-9151.001.patch > > > {color:#205081}*Issue Summary:*{color} > Standby RM hangs (not retry or crash) forever due to forever lost from > leader election > > {color:#205081}*Issue Repro Steps:*{color} > # Start multiple RMs in HA mode > # Modify all hostnames in the zk connect string to different values in DNS. > (In reality, we need to replace old/bad zk machines to new/good zk machines, > so their DNS hostname will be changed.) > > {color:#205081}*Issue Logs:*{color} > See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318). 
> To make it clear, the whole story is: > {noformat} > Join Election > Win the leader (ZK Node Creation Callback) > Start to becomeActive > Start RMActiveServices > Start CommonNodeLabelsManager failed due to zk connect > UnknownHostException > Stop CommonNodeLabelsManager > Stop RMActiveServices > Create and Init RMActiveServices > Fail to becomeActive > ReJoin Election > Failed to Join Election due to zk connect UnknownHostException (Here the > exception is eaten and only an event is sent) > Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby > Transitioning RM to Standby > Start StandByTransitionThread > Already in standby state > ReJoin Election > Failed to Join Election due to zk connect UnknownHostException (Here the > exception is eaten and only an event is sent) > Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby > Transitioning RM to Standby > Start StandByTransitionThread > Found RMActiveServices's StandByTransitionRunnable object has already run > previously, so immediately return > > {noformat} > The standby RM failed to rejoin the election, but it will never retry or > crash later, *so afterwards there are no zk-related logs and the standby RM hangs > forever, even if the zk connect string hostnames are changed back to the original > ones in DNS.* > So, this should be a bug in RM, because *RM should always try to join the > election* (giving up on joining the election should only happen when the RM decides to crash); > otherwise, an RM that is not in the election can never become active again and > start real work. > > {color:#205081}*Caused By:*{color} > It was introduced by YARN-3742 > What that JIRA wanted to improve is that, when a STATE_STORE_OP_FAILED RMFatalEvent > happens, the RM should transition to standby instead of crashing. 
> *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition > to standby, instead of crash.* (In contrast, before this change, the RM crashed on > all of them instead of going to standby) > So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it > leaves the standby RM unable to work, e.g. stuck in standby forever. > And as the author said: > {quote}I think a good approach here would be to change the RMFatalEvent > handler to transition to standby as the default reaction, *with shutdown as a > special case for certain types of failures.* > {quote} > But the author was *too optimistic when implementing the patch.* > > {color:#205081}*What the Patch's solution:*{color} > So, to be *conservative*, we had better *only transition to standby for the > failures in the {color:#14892c}whitelist{color}:* > public enum RMFatalEventType { > {color:#14892c}// Source <- Store{color} > {color:#14892c}STATE_STORE_FENCED,{color} > {color:#14892c}STATE_STORE_OP_FAILED,{color} > // Source <- Embedded Elector > EMBEDDED_ELECTOR_FAILED, > {color:#14892c}// Source <- Admin Service{color} > {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color} > // Source <- Critical Thread Crash > CRITICAL_THREAD_CRASH > } > And others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and > future added failure types (until we have triaged them into the whitelist), should crash the RM, because we *cannot ensure* that they will *never* leave the RM unable to work in standby state, and the *conservative* way is to crash the RM. > Besides, after a crash, the RM's external watchdog service can know this and > try to repair the RM machine, send alerts, etc. > And the RM can reload the latest zk connect string config with the latest > hostnames. > For more details, please
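The whitelist approach the description proposes can be sketched as a simple policy function over the event types listed above (an illustrative sketch, not the patch's actual code): whitelisted failures demote the RM to standby, everything else crashes it so an external watchdog can intervene.

```java
enum RMFatalEventType {
    STATE_STORE_FENCED,
    STATE_STORE_OP_FAILED,
    EMBEDDED_ELECTOR_FAILED,
    TRANSITION_TO_ACTIVE_FAILED,
    CRITICAL_THREAD_CRASH
}

class RMFatalEventPolicy {
    // Whitelist from the description: only well-understood failures demote
    // the RM to standby; all other (and future) types crash the RM, the
    // conservative default.
    static boolean shouldTransitionToStandby(RMFatalEventType type) {
        switch (type) {
            case STATE_STORE_FENCED:
            case STATE_STORE_OP_FAILED:
            case TRANSITION_TO_ACTIVE_FAILED:
                return true;
            default:
                return false;
        }
    }
}
```

Note the `default: return false` arm: any failure type added later crashes the RM until it is explicitly triaged into the whitelist, which is the point of the proposal.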
[jira] [Commented] (YARN-9132) Add file permission check for auxiliary services manifest file
[ https://issues.apache.org/jira/browse/YARN-9132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725160#comment-16725160 ] Billie Rinaldi commented on YARN-9132: -- There's a ticket open already about the flaky test failure. I've rerun the precommit as well. > Add file permission check for auxiliary services manifest file > -- > > Key: YARN-9132 > URL: https://issues.apache.org/jira/browse/YARN-9132 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Yang >Assignee: Billie Rinaldi >Priority: Major > Attachments: YARN-9132.1.patch, YARN-9132.2.patch > > > The manifest file in HDFS must be owned by YARN admin or YARN service user > only. This check helps to prevent loading of malware into node manager JVM. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9131) Document usage of Dynamic auxiliary services
[ https://issues.apache.org/jira/browse/YARN-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Billie Rinaldi updated YARN-9131: - Attachment: YARN-9131.4.patch > Document usage of Dynamic auxiliary services > > > Key: YARN-9131 > URL: https://issues.apache.org/jira/browse/YARN-9131 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Yang >Assignee: Billie Rinaldi >Priority: Major > Attachments: YARN-9131.1.patch, YARN-9131.2.patch, YARN-9131.3.patch, > YARN-9131.4.patch > > > This is a follow up issue to document YARN-9075 for admin to control which > aux service to add or remove. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9098) Separate mtab file reader code and cgroups file system hierarchy parser code from CGroupsHandlerImpl and ResourceHandlerModule
[ https://issues.apache.org/jira/browse/YARN-9098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725136#comment-16725136 ] Jim Brennan commented on YARN-9098: --- [~snemeth] thanks for updating the patch. I will download and retest. > Separate mtab file reader code and cgroups file system hierarchy parser code > from CGroupsHandlerImpl and ResourceHandlerModule > -- > > Key: YARN-9098 > URL: https://issues.apache.org/jira/browse/YARN-9098 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-9098.002.patch, YARN-9098.003.patch, > YARN-9098.004.patch > > > Separate mtab file reader code and cgroups file system hierarchy parser code > from CGroupsHandlerImpl and ResourceHandlerModule > CGroupsHandlerImpl has a method parseMtab that parses an mtab file and stores > cgroups data. > CGroupsLCEResourcesHandler also has a method with the same name, with > identical code. > The parser code should be extracted from these places and be added in a new > class as this is a separate responsibility. > As the output of the file parser is a Map>, it's better > to encapsulate it in a domain object, named 'CGroupsMountConfig' for instance. > ResourceHandlerModule has a method named parseConfiguredCGroupPath, that is > responsible for producing the same results (Map>) to > store cgroups data; it does not operate on the mtab file but looks at the > filesystem for cgroup settings. As the output is the same, CGroupsMountConfig > should be used here, too. > Again, this code should not be part of ResourceHandlerModule as it is a > different responsibility. > One more thing which is strongly related to the methods above is > CGroupsHandlerImpl.initializeFromMountConfig: This method processes the > result of a parsed mtab file or parsed cgroups filesystem data and stores > file system paths for all available controllers. 
This method invokes > findControllerPathInMountConfig, which is duplicated in CGroupsHandlerImpl > and CGroupsLCEResourcesHandler, so it should be moved to a single place. To > store filesystem path and controller mappings, a new domain object could be > introduced. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9150) Making TimelineSchemaCreator support different backends for Timeline Schema Creation in ATSv2
[ https://issues.apache.org/jira/browse/YARN-9150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sushil Ks updated YARN-9150: Description: h3. Currently the TimelineSchemaCreator has a concrete implementation for creating Timeline Schema's only for HBase, Hence creating this JIRA for supporting multiple back-ends that ATSv2 can support. *Usage:* Add the following property in *yarn-site.xml* {code:java} yarn.timeline-service.schema-creator.class YOUR_TIMELINE_SCHEMA_CREATOR_CLASS {code} The Command needed to run the TimelineSchemaCreator need not be changed i.e the below existing command can be used irrespective of the backend configured. {code:java} bin/hadoop org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator -create {code} was: h3. Currently the TimelineSchemaCreator has a concrete implementation for creating Timeline Schema's only for HBase, Hence creating this JIRA for supporting multiple back-ends that ATSv2 can support. *Usage:* Add the following property in *yarn-site.xml* {code:java} yarn.timeline-service.schema-creator.class YOUR_TIMELINE_SCHEMA_CREATOR_CLASS {code} ** The Command needed to run the TimelineSchemaCreator need not be changed i.e the below existing command can be used irrespective of the backend configured. {code:java} bin/hadoop org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator -create {code} > Making TimelineSchemaCreator support different backends for Timeline Schema > Creation in ATSv2 > - > > Key: YARN-9150 > URL: https://issues.apache.org/jira/browse/YARN-9150 > Project: Hadoop YARN > Issue Type: Improvement > Components: ATSv2 >Reporter: Sushil Ks >Assignee: Sushil Ks >Priority: Major > Attachments: YARN-9150.001.patch > > > h3. Currently the TimelineSchemaCreator has a concrete implementation for > creating Timeline Schema's only for HBase, Hence creating this JIRA for > supporting multiple back-ends that ATSv2 can support. 
> *Usage:* > Add the following property in *yarn-site.xml* > {code:java} > > > yarn.timeline-service.schema-creator.class > YOUR_TIMELINE_SCHEMA_CREATOR_CLASS > > {code} > The Command needed to run the TimelineSchemaCreator need not be changed > i.e the below existing command can be used irrespective of the backend > configured. > {code:java} > bin/hadoop > org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator > -create > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
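The mail renderer stripped the XML tags inside the {code} block quoted above; in yarn-site.xml the property presumably looks like the following (the value shown is the placeholder from the description, to be replaced with a concrete schema-creator class name):

```xml
<property>
  <name>yarn.timeline-service.schema-creator.class</name>
  <value>YOUR_TIMELINE_SCHEMA_CREATOR_CLASS</value>
</property>
```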
[jira] [Updated] (YARN-9150) Making TimelineSchemaCreator support different backends for Timeline Schema Creation in ATSv2
[ https://issues.apache.org/jira/browse/YARN-9150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sushil Ks updated YARN-9150: Summary: Making TimelineSchemaCreator support different backends for Timeline Schema Creation in ATSv2 (was: Making TimelineSchemaCreator to support different backends for Timeline Schema Creation in ATSv2) > Making TimelineSchemaCreator support different backends for Timeline Schema > Creation in ATSv2 > - > > Key: YARN-9150 > URL: https://issues.apache.org/jira/browse/YARN-9150 > Project: Hadoop YARN > Issue Type: Improvement > Components: ATSv2 >Reporter: Sushil Ks >Assignee: Sushil Ks >Priority: Major > Attachments: YARN-9150.001.patch > > > h3. Currently the TimelineSchemaCreator has a concrete implementation for > creating Timeline Schema's only for HBase, Hence creating this JIRA for > supporting multiple back-ends that ATSv2 can support. > *Usage:* > Add the following property in *yarn-site.xml* > {code:java} > > > yarn.timeline-service.schema-creator.class > YOUR_TIMELINE_SCHEMA_CREATOR_CLASS > > {code} > ** > The Command needed to run the TimelineSchemaCreator need not be changed > i.e the below existing command can be used irrespective of the backend > configured. > {code:java} > bin/hadoop > org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator > -create > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9150) Making TimelineSchemaCreator to support different backends for Timeline Schema Creation in ATSv2
[ https://issues.apache.org/jira/browse/YARN-9150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725084#comment-16725084 ] Sushil Ks commented on YARN-9150: - Hi [~rohithsharma] and [~vrushalic], Kindly review this patch. Have created this JIRA for making *TimelineSchemaCreator* support multiple back-ends as discussed when you reviewed [YARN-9016|https://issues.apache.org/jira/browse/YARN-9016]. Not sure if the -1 for the *compile* and *javac* tests posted from Jenkins above is related to my patch. > Making TimelineSchemaCreator to support different backends for Timeline > Schema Creation in ATSv2 > > > Key: YARN-9150 > URL: https://issues.apache.org/jira/browse/YARN-9150 > Project: Hadoop YARN > Issue Type: Improvement > Components: ATSv2 >Reporter: Sushil Ks >Assignee: Sushil Ks >Priority: Major > Attachments: YARN-9150.001.patch > > > h3. Currently the TimelineSchemaCreator has a concrete implementation for > creating Timeline Schema's only for HBase, Hence creating this JIRA for > supporting multiple back-ends that ATSv2 can support. > *Usage:* > Add the following property in *yarn-site.xml* > {code:java} > > > yarn.timeline-service.schema-creator.class > YOUR_TIMELINE_SCHEMA_CREATOR_CLASS > > {code} > ** > The Command needed to run the TimelineSchemaCreator need not be changed > i.e the below existing command can be used irrespective of the backend > configured. > {code:java} > bin/hadoop > org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator > -create > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
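A pluggable backend driven by a class-name property usually boils down to a small interface plus reflective instantiation. The interface and class names below are hypothetical (the actual patch's API may differ); this only sketches the configuration-driven loading idea behind `yarn.timeline-service.schema-creator.class`:

```java
// Hypothetical shape of the pluggable schema-creator backend.
interface TimelineSchemaBackend {
    void createTimelineSchemas();
}

class SchemaCreatorLoader {
    // Instantiate whichever backend class the configuration names,
    // mirroring the yarn.timeline-service.schema-creator.class idea.
    static TimelineSchemaBackend load(String className) {
        try {
            return (TimelineSchemaBackend) Class.forName(className)
                .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException("Cannot instantiate " + className, e);
        }
    }
}

class NoopSchemaBackend implements TimelineSchemaBackend {
    public void createTimelineSchemas() {
        // a real backend (e.g. HBase) would issue its schema DDL here
    }
}
```

The `-create` command line then stays the same for every backend: the tool reads the property, loads the named class, and calls its schema-creation entry point.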
[jira] [Commented] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725080#comment-16725080 ] Hadoop QA commented on YARN-9151: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 20s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 20s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 43s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 29s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 29s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 34s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 38s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 7s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 3s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 14s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 16s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 38s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 7m 38s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 1m 27s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch generated 9 new + 48 unchanged - 0 fixed = 57 total (was 48) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 28s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 17s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 22s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 1s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 93m 33s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 25m 36s{color} | {color:red} hadoop-yarn-client in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 43s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}199m 21s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerSurgicalPreemption | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f | | JIRA Issue | YARN-9151 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12952343/YARN-9151.001.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 3cc045645602 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / cf57113 | | maven |
[jira] [Updated] (YARN-5168) Add port mapping handling when docker container use bridge network
[ https://issues.apache.org/jira/browse/YARN-5168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xun Liu updated YARN-5168: -- Attachment: YARN-5168.018.patch
> Add port mapping handling when docker container use bridge network
> --
>
> Key: YARN-5168
> URL: https://issues.apache.org/jira/browse/YARN-5168
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Jun Gong
> Assignee: Xun Liu
> Priority: Major
> Labels: Docker
> Attachments: YARN-5168.001.patch, YARN-5168.002.patch, YARN-5168.003.patch, YARN-5168.004.patch, YARN-5168.005.patch, YARN-5168.006.patch, YARN-5168.007.patch, YARN-5168.008.patch, YARN-5168.009.patch, YARN-5168.010.patch, YARN-5168.011.patch, YARN-5168.012.patch, YARN-5168.013.patch, YARN-5168.014.patch, YARN-5168.015.patch, YARN-5168.016.patch, YARN-5168.017.patch, YARN-5168.018.patch, exposedPorts1.png, exposedPorts2.png
>
>
> YARN-4007 addresses different network setups when launching the docker container. We need to support port mapping when the docker container uses a bridge network.
> The problems we faced are the following:
> 1. Add "-P" to automatically map the docker container's exposed ports.
> 2. Add "-p" to let the user specify specific ports to map.
> 3. Add service registry support for the bridge network case, so apps can find each other. This could be done outside of YARN; however, it might be more convenient to support it natively in YARN.
[jira] [Updated] (YARN-9038) [CSI] Add ability to publish/unpublish volumes on node managers
[ https://issues.apache.org/jira/browse/YARN-9038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-9038: -- Attachment: YARN-9038.003.patch
> [CSI] Add ability to publish/unpublish volumes on node managers
> ---
>
> Key: YARN-9038
> URL: https://issues.apache.org/jira/browse/YARN-9038
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Weiwei Yang
> Assignee: Weiwei Yang
> Priority: Major
> Labels: CSI
> Attachments: YARN-9038.001.patch, YARN-9038.002.patch, YARN-9038.003.patch
>
>
> We need to add the ability to publish volumes on node managers in a staging area under the NM's local dir, and then mount that path into the docker container to make it visible inside the container.
[jira] [Commented] (YARN-9033) ResourceHandlerChain#bootstrap is invoked twice during NM start if LinuxContainerExecutor enabled
[ https://issues.apache.org/jira/browse/YARN-9033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725010#comment-16725010 ] Zhankun Tang commented on YARN-9033: [~snemeth], thanks for looking at this. {quote}"But actually, the "updateContainer" invocation in YARN-7715 depend on containerId's cgroups path creation in "preStart" method which only happens when we use "LinuxContainerExecutor"." Where can I find this code part / what should I check? {quote} You can test with LCE disabled but CGroupsMemoryResourceHandlerImpl enabled to check whether YARN-7715 works; per my testing, it doesn't. Or consider that "updateContainer" in YARN-7715 is actually doing a cgroups update, and this cgroups update depends on an existing cgroups path. Take CGroupsMemoryResourceHandlerImpl for instance: CGroupsMemoryResourceHandlerImpl#preStart creates the memory cgroups path, and CGroupsMemoryResourceHandlerImpl#updateContainer updates the cgroups value in this path. But preStart can only be invoked by LCE through ResourceHandlerChain's preStart, so YARN-7715 depends on LCE being enabled, and it shouldn't bootstrap ResourceHandlerChain again. Not sure if this makes sense to you.
> ResourceHandlerChain#bootstrap is invoked twice during NM start if
> LinuxContainerExecutor enabled
> -
>
> Key: YARN-9033
> URL: https://issues.apache.org/jira/browse/YARN-9033
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Reporter: Zhankun Tang
> Assignee: Zhankun Tang
> Priority: Major
> Attachments: YARN-9033-trunk.001.patch, YARN-9033-trunk.002.patch
>
>
> ResourceHandlerChain#bootstrap will always be invoked in the NM's ContainerScheduler#serviceInit (introduced by YARN-7715).
> So if LCE is enabled, ResourceHandlerChain#bootstrap will be invoked first and then invoked again in ContainerScheduler#serviceInit.
> But actually, the "updateContainer" invocation in YARN-7715 depends on the containerId's cgroups path creation in the "preStart" method, which only happens when we use "LinuxContainerExecutor". So the bootstrap of ResourceHandlerChain shouldn't happen in ContainerScheduler#serviceInit.
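The ordering constraint discussed above (updateContainer can only succeed against a cgroups path that preStart already created, and preStart only runs under LCE) can be modeled with a small in-memory sketch. This is illustrative only, not Hadoop's ResourceHandler code; the map stands in for the cgroups filesystem.

```java
import java.util.HashMap;
import java.util.Map;

public class CgroupsOrderingModel {
  // Stand-in for the cgroups filesystem: path -> current limit value.
  private final Map<String, String> cgroupsFs = new HashMap<>();

  private String path(String containerId) {
    return "/sys/fs/cgroup/memory/hadoop-yarn/" + containerId;
  }

  // Models preStart: creates the container's memory cgroup path.
  // In YARN this only runs when LinuxContainerExecutor drives the
  // ResourceHandlerChain.
  public void preStart(String containerId) {
    cgroupsFs.put(path(containerId), "");
  }

  // Models updateContainer: writing a new limit requires the path to
  // exist already, which is why the YARN-7715 update cannot take effect
  // when LCE is disabled and preStart never ran.
  public boolean updateContainer(String containerId, String limit) {
    String p = path(containerId);
    if (!cgroupsFs.containsKey(p)) {
      return false; // path never created: update cannot be applied
    }
    cgroupsFs.put(p, limit);
    return true;
  }
}
```

Calling updateContainer before preStart fails; after preStart it succeeds, mirroring the dependency described in the comment.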
[jira] [Commented] (YARN-9098) Separate mtab file reader code and cgroups file system hierarchy parser code from CGroupsHandlerImpl and ResourceHandlerModule
[ https://issues.apache.org/jira/browse/YARN-9098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724967#comment-16724967 ] Hadoop QA commented on YARN-9098: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 14s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 4 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 0s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 59s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 24s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 35s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 27s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 53s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 34s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 56s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 56s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 19s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 4 new + 11 unchanged - 0 fixed = 15 total (was 11) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 33s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 39s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 59s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 19m 11s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 27s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 70m 0s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f | | JIRA Issue | YARN-9098 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12952342/YARN-9098.004.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux d61f07cd88cc 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / cf57113 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_191 | | findbugs | v3.1.0-RC1 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/22920/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/22920/testReport/ | | Max. process+thread count | 424 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U:
[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuqi Wang updated YARN-9151: Description: {color:#205081}*Issue Summary:*{color} Standby RM hangs forever (it neither retries nor crashes) because it is permanently lost from the leader election {color:#205081}*Issue Repro Steps:*{color} # Start multiple RMs in HA mode # Modify all hostnames in the zk connect string to different values in DNS. (In reality, we need to replace old/bad zk machines with new/good zk machines, so their DNS hostnames will change.) {color:#205081}*Issue Logs:*{color} See the full RM log in the attachment, yarn_rm.zip (the RM is BN4SCH101222318). To make it clear, the whole story is: {noformat} Join Election Win the leader (ZK Node Creation Callback) Start to becomeActive Start RMActiveServices Start CommonNodeLabelsManager failed due to zk connect UnknownHostException Stop CommonNodeLabelsManager Stop RMActiveServices Create and Init RMActiveServices Fail to becomeActive ReJoin Election Failed to Join Election due to zk connect UnknownHostException (Here the exception is eaten and only an event is sent) Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby Transitioning RM to Standby Start StandByTransitionThread Already in standby state ReJoin Election Failed to Join Election due to zk connect UnknownHostException (Here the exception is eaten and only an event is sent) Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby Transitioning RM to Standby Start StandByTransitionThread Found RMActiveServices's StandByTransitionRunnable object has already run previously, so immediately return {noformat} The standby RM failed to rejoin the election, and it never retries or crashes afterwards, *so there are no further zk-related logs and the standby RM hangs forever, even if the zk connect string hostnames are changed back to the original ones in DNS.* So, this should be a bug in RM, because *RM should always try to join the election* (giving up on joining the election should only happen when RM decides to crash); otherwise, an RM outside the election can never become active again and do real work. {color:#205081}*Caused By:*{color} It was introduced by YARN-3742. What that JIRA wanted to improve was that, when a STATE_STORE_OP_FAILED RMFatalEvent happens, RM should transition to standby instead of crashing. *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition to standby, instead of crashing.* (In contrast, before this change, RM crashed on all of them instead of going to standby.) So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it leaves the standby RM not working, e.g. staying in standby forever. And as the author said: {quote}I think a good approach here would be to change the RMFatalEvent handler to transition to standby as the default reaction, *with shutdown as a special case for certain types of failures.* {quote} But the author was *too optimistic when implementing the patch.* {color:#205081}*The Patch's Solution:*{color} So, to be *conservative*, we had better *only transition to standby for the failures in the {color:#14892c}whitelist{color}:* public enum RMFatalEventType { {color:#14892c}// Source <- Store{color} {color:#14892c}STATE_STORE_FENCED,{color} {color:#14892c}STATE_STORE_OP_FAILED,{color} // Source <- Embedded Elector EMBEDDED_ELECTOR_FAILED, {color:#14892c}// Source <- Admin Service{color} {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color} // Source <- Critical Thread Crash CRITICAL_THREAD_CRASH } Other types, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and failure types added in the future (until we triage them into the whitelist), should crash RM, because we *cannot ensure* that they will *never* leave RM unable to work in standby state, and the *conservative* way is to crash RM. Besides, after a crash, the RM's external watchdog service can learn of it and try to repair the RM machine, send alerts, etc. And the RM can reload the latest zk connect string config with the latest hostnames. For more details, please check the patch. was: {color:#205081}*Issue Summary:*{color} Standby RM hangs (not retry or crash) forever due to forever lost from leader election {color:#205081}*Issue Repro Steps:*{color} # Start multiple RMs in HA mode # Modify all hostnames in the zk connect string to different values in DNS. (In reality, we need to replace old/bad zk machines to new/good zk machines, so their DNS hostname will be changed.) {color:#205081}*Issue Logs:*{color} See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318). To make it clear, the whole story is: {noformat} Join Election Win the leader (ZK Node Creation Callback) Start to becomeActive Start RMActiveServices Start CommonNodeLabelsManager failed due to zk connect UnknownHostException Stop
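The whitelist idea in the YARN-9151 description (transition to standby only for triaged failure types, crash on everything else) can be sketched as follows. The enum values mirror the issue text; the policy method itself is an illustration of the proposed behavior, not the patch's actual code.

```java
import java.util.EnumSet;

public class RMFatalEventPolicy {
  public enum RMFatalEventType {
    STATE_STORE_FENCED,          // Source <- Store
    STATE_STORE_OP_FAILED,       // Source <- Store
    EMBEDDED_ELECTOR_FAILED,     // Source <- Embedded Elector
    TRANSITION_TO_ACTIVE_FAILED, // Source <- Admin Service
    CRITICAL_THREAD_CRASH        // Source <- Critical Thread Crash
  }

  // Only these triaged failure types are considered safe to handle by
  // staying up in standby; per the description, everything else should
  // crash the process.
  static final EnumSet<RMFatalEventType> STANDBY_WHITELIST = EnumSet.of(
      RMFatalEventType.STATE_STORE_FENCED,
      RMFatalEventType.STATE_STORE_OP_FAILED,
      RMFatalEventType.TRANSITION_TO_ACTIVE_FAILED);

  // Default reaction is to crash, so an external watchdog can repair and
  // restart the RM with fresh configuration (e.g. a new zk connect string).
  public static String react(RMFatalEventType type) {
    return STANDBY_WHITELIST.contains(type) ? "transition-to-standby" : "crash";
  }
}
```

Under this policy, EMBEDDED_ELECTOR_FAILED would crash the RM instead of leaving it stuck in standby, which is the behavior change the issue argues for.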
[jira] [Comment Edited] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724903#comment-16724903 ] Yuqi Wang edited comment on YARN-9151 at 12/19/18 11:48 AM: BTW, [~jianhe], for YARN-4438, you said: {quote}_If it is due to close(), don't we want to force give-up so the other RM becomes active? If it is on initAndStartLeaderLatch(), *this RM will never become active; don't we want to just die?*_ What do you mean by force give-up ? exit RM ? The underlying curator implementation *will retry the connection in background*, even though the exception is thrown. See *Guaranteeable* interface in Curator. I think exit RM is too harsh here. Even though RM remains at standby, all services should be already shutdown, so there's no harm to the end users ? {quote} However, for this case, if we are using CuratorBasedElectorService, I think curator will *NOT* retry the connection, because I saw the following in the log and checked curator's code: *Background exception was not retry-able or retry gave up for UnknownHostException* {code:java} 2018-12-14 14:14:20,847 ERROR [Curator-Framework-0] org.apache.curator.framework.imps.CuratorFrameworkImpl: Background exception was not retry-able or retry gave up java.net.UnknownHostException: BN2AAP10C07C229 at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method) at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928) at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323) at java.net.InetAddress.getAllByName0(InetAddress.java:1276) at java.net.InetAddress.getAllByName(InetAddress.java:1192) at java.net.InetAddress.getAllByName(InetAddress.java:1126) at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61) at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:461) at org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29) at 
org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:146) at org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:94) at org.apache.curator.HandleHolder.getZooKeeper(HandleHolder.java:55) at org.apache.curator.ConnectionState.reset(ConnectionState.java:218) at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:193) at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87) at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115) at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:806) at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:792) at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:62) at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:257) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} Besides, in YARN-4438, I did not see you use the Curator *Guaranteeable* interface. Could you please confirm the above? So, in the patch, if rejoining the election throws an exception, it will send EMBEDDED_ELECTOR_FAILED, and then RM will crash. was (Author: yqwang): BTW, [~jianhe], for YARN-4438, you said: {quote}_If it is due to close(), don't we want to force give-up so the other RM becomes active? If it is on initAndStartLeaderLatch(), *this RM will never become active; don't we want to just die?*_ What do you mean by force give-up ? exit RM ? The underlying curator implementation *will retry the connection in background*, even though the exception is thrown. See *Guaranteeable* interface in Curator. 
I think exit RM is too harsh here. Even though RM remains at standby, all services should be already shutdown, so there's no harm to the end users ? {quote} However, for this case, if we are using CuratorBasedElectorService, I think curator will *NOT* retry the connection, because I saw below things in the log and checked curator's code: *Background exception was not retry-able or retry gave up for UnknownHostException* {code:java} 2018-12-14 14:14:20,847 ERROR [Curator-Framework-0] org.apache.curator.framework.imps.CuratorFrameworkImpl: Background exception was not retry-able or retry gave up java.net.UnknownHostException: BN2AAP10C07C229 at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method) at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuqi Wang updated YARN-9151: Description: {color:#205081}*Issue Summary:*{color} Standby RM hangs (not retry or crash) forever due to forever lost from leader election {color:#205081}*Issue Repro Steps:*{color} # Start multiple RMs in HA mode # Modify all hostnames in the zk connect string to different values in DNS. (In reality, we need to replace old/bad zk machines to new/good zk machines, so their DNS hostname will be changed.) {color:#205081}*Issue Logs:*{color} See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318). To make it clear, the whole story is: {noformat} Join Election Win the leader (ZK Node Creation Callback) Start to becomeActive Start RMActiveServices Start CommonNodeLabelsManager failed due to zk connect UnknownHostException Stop CommonNodeLabelsManager Stop RMActiveServices Create and Init RMActiveServices Fail to becomeActive ReJoin Election Failed to Join Election due to zk connect UnknownHostException (Here the exception is eat and just send event) Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby Transitioning RM to Standby Start StandByTransitionThread Already in standby state ReJoin Election Failed to Join Election due to zk connect UnknownHostException (Here the exception is eat and just send event) Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby Transitioning RM to Standby Start StandByTransitionThread Found RMActiveServices's StandByTransitionRunnable object has already run previously, so immediately return (The standby RM failed to rejoin the election, but it will never retry or crash later, so afterwards no zk related logs and the standby RM is forever hang, even if the zk connect string hostnames are changed back the orignal ones in DNS.) 
{noformat} So, this should be a bug in RM, because *RM should always try to join election* (give up join election should only happen on RM decide to crash), otherwise, a RM without inside the election can never become active again and start real works. {color:#205081}*Caused By:*{color} It is introduced by YARN-3742 The JIRA want to improve is that, when STATE_STORE_OP_FAILED RMFatalEvent happens, RM should transition to standby, instead of crash. *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition to standby, instead of crash.* (In contrast, before this change, RM makes all to crash instead of to standby) So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it will leave the standby RM continue not work, such as stay in standby forever. And as the author [said|https://issues.apache.org/jira/browse/YARN-3742?focusedCommentId=15891385=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15891385]: {quote}I think a good approach here would be to change the RMFatalEvent handler to transition to standby as the default reaction, *with shutdown as a special case for certain types of failures.* {quote} But the author is *too optimistic when implement the patch.* {color:#205081}*What the Patch's solution:*{color} So, for *conservative*, we would better *only transition to standby for the failures in {color:#14892c}whitelist{color}:* public enum RMFatalEventType { {color:#14892c}// Source <- Store{color} {color:#14892c}STATE_STORE_FENCED,{color} {color:#14892c}STATE_STORE_OP_FAILED,{color} // Source <- Embedded Elector EMBEDDED_ELECTOR_FAILED, {color:#14892c}// Source <- Admin Service{color} {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color} // Source <- Critical Thread Crash CRITICAL_THREAD_CRASH } And others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and future added failure types (until we triaged it to be in whitelist), should crash RM, because we *cannot ensure* that they will *never* cause 
RM cannot work in standby state, and the *conservative* way is to crash RM. Besides, after crash, the RM's external watchdog service can know this and try to repair the RM machine, send alerts, etc. And the RM can reload the latest zk connect string config with the latest hostnames. For more details, please check the patch. was: {color:#205081}*Issue Summary:*{color} Standby RM hangs (not retry or crash) forever due to forever lost from leader election {color:#205081}*Issue Repro Steps:*{color} # Start multiple RMs in HA mode # Modify all hostnames in the zk connect string to different values in DNS. (In reality, we need to replace old/bad zk machines to new/good zk machines, so their DNS hostname will be changed.) {color:#205081}*Issue Logs:*{color} See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318). To make it clear, the whole story is: {noformat} Join Election Win the leader (ZK Node Creation
[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuqi Wang updated YARN-9151: Description: {color:#205081}*Issue Summary:*{color} Standby RM hangs (not retry or crash) forever due to forever lost from leader election {color:#205081}*Issue Repro Steps:*{color} # Start multiple RMs in HA mode # Modify all hostnames in the zk connect string to different values in DNS. (In reality, we need to replace old/bad zk machines to new/good zk machines, so their DNS hostname will be changed.) {color:#205081}*Issue Logs:*{color} See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318). To make it clear, the whole story is: {noformat} Join Election Win the leader (ZK Node Creation Callback) Start to becomeActive Start RMActiveServices Start CommonNodeLabelsManager failed due to zk connect UnknownHostException Stop CommonNodeLabelsManager Stop RMActiveServices Create and Init RMActiveServices Fail to becomeActive ReJoin Election Failed to Join Election due to zk connect UnknownHostException (Here the exception is eat and just send event) Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby Transitioning RM to Standby Start StandByTransitionThread Already in standby state ReJoin Election Failed to Join Election due to zk connect UnknownHostException (Here the exception is eat and just send event) Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby Transitioning RM to Standby Start StandByTransitionThread Found RMActiveServices's StandByTransitionRunnable object has already run previously, so immediately return (The standby RM failed to rejoin the election, but it will never retry or crash later, so afterwards no zk related logs and the standby RM is forever hang, even if the zk connect string hostnames are changed back the orignal ones in DNS.) 
{noformat} So, this should be a bug in RM, because *RM should always try to rejoin the election* (giving up on the election should only happen when the RM decides to crash); otherwise, an RM that is no longer in the election can never become active again and do real work. {color:#205081}*Caused By:*{color} It was introduced by YARN-3742. That JIRA's intended improvement was that, when a STATE_STORE_OP_FAILED RMFatalEvent happens, the RM should transition to standby instead of crashing. *However, in fact, that JIRA makes ALL kinds of RMFatalEvent ONLY transition to standby, instead of crashing.* (In contrast, before that change, the RM crashed on all of them instead of transitioning to standby.) So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it leaves the standby RM unable to work, e.g. stuck in standby forever. And as the author [said|https://issues.apache.org/jira/browse/YARN-3742?focusedCommentId=15891385=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15891385]: {quote}I think a good approach here would be to change the RMFatalEvent handler to transition to standby as the default reaction, *with shutdown as a special case for certain types of failures.* {quote} But the author was *too optimistic when implementing the patch.* {color:#205081}*The Patch's Solution:*{color} So, to be *conservative*, we had better *only transition to standby for the failures in a {color:#14892c}whitelist{color}:*
public enum RMFatalEventType {
{color:#14892c}// Source <- Store{color}
{color:#14892c}STATE_STORE_FENCED,{color}
{color:#14892c}STATE_STORE_OP_FAILED,{color}
// Source <- Embedded Elector
EMBEDDED_ELECTOR_FAILED,
{color:#14892c}// Source <- Admin Service{color}
{color:#14892c}TRANSITION_TO_ACTIVE_FAILED,{color}
// Source <- Critical Thread Crash
CRITICAL_THREAD_CRASH
}
And the others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH, as well as failure types added in the future (until they are triaged into the whitelist), should crash the RM, because we *cannot ensure* that they will *never* leave the RM unable to work in standby state, and the *conservative* way is to crash the RM. Besides, after a crash, the RM's external watchdog service can detect it and try to repair the RM machine, send alerts, etc. And on restart the RM can reload the latest zk connect string config with the latest hostnames. For more details, please check the patch. was: {color:#205081}*Issue Summary:*{color} Standby RM hangs (not retry or crash) forever due to forever lost from leader election {color:#205081}*Issue Repro Steps:*{color} # Start multiple RMs in HA mode # Modify all hostnames in the zk connect string to different values in DNS. (In reality, we need to replace old/bad zk machines to new/good zk machines, so their DNS hostname will be changed.) {color:#205081}*Issue Logs:*{color} See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318). To make it clear, the whole story is: {noformat} Join Election Win the leader (ZK Node Creation
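The whitelist policy described above can be sketched as follows. This is a minimal illustration only; the class and method names (`RMFatalEventPolicy`, `shouldTransitionToStandby`) are hypothetical and not taken from the actual patch:

```java
// Sketch of the proposed whitelist policy (illustrative names, not the
// actual YARN-9151 patch). Only failure types triaged as safe end in
// standby; everything else crashes the RM so an external watchdog can
// repair/restart it with fresh configuration.
final class RMFatalEventPolicy {
    public enum RMFatalEventType {
        STATE_STORE_FENCED,          // Source <- Store (whitelisted)
        STATE_STORE_OP_FAILED,       // Source <- Store (whitelisted)
        EMBEDDED_ELECTOR_FAILED,     // Source <- Embedded Elector
        TRANSITION_TO_ACTIVE_FAILED, // Source <- Admin Service (whitelisted)
        CRITICAL_THREAD_CRASH        // Source <- Critical Thread Crash
    }

    /** True only for triaged failure types known not to wedge a standby RM. */
    public static boolean shouldTransitionToStandby(RMFatalEventType type) {
        switch (type) {
            case STATE_STORE_FENCED:
            case STATE_STORE_OP_FAILED:
            case TRANSITION_TO_ACTIVE_FAILED:
                return true;
            default:
                // EMBEDDED_ELECTOR_FAILED, CRITICAL_THREAD_CRASH, and any
                // future types crash the RM until triaged into the whitelist.
                return false;
        }
    }
}
```

The point of the `default` branch is exactly the conservatism argued for above: an untriaged failure type falls through to a crash, never to a silent standby.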
[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuqi Wang updated YARN-9151: Attachment: (was: YARN-9151.001.patch) > Standby RM hangs (not retry or crash) forever due to forever lost from leader > election > -- > > Key: YARN-9151 > URL: https://issues.apache.org/jira/browse/YARN-9151 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.9.2 >Reporter: Yuqi Wang >Assignee: Yuqi Wang >Priority: Major > Labels: patch > Fix For: 3.1.1 > > Attachments: YARN-9151.001.patch, yarn_rm.zip
[jira] [Commented] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724913#comment-16724913 ] Yuqi Wang commented on YARN-9151: - [~elgoiri], could you please also check it. > Standby RM hangs (not retry or crash) forever due to forever lost from leader > election > -- > > Key: YARN-9151 > URL: https://issues.apache.org/jira/browse/YARN-9151 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.9.2 >Reporter: Yuqi Wang >Assignee: Yuqi Wang >Priority: Major > Labels: patch > Fix For: 3.1.1 > > Attachments: YARN-9151.001.patch, yarn_rm.zip
[jira] [Comment Edited] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724913#comment-16724913 ] Yuqi Wang edited comment on YARN-9151 at 12/19/18 11:37 AM: [~elgoiri], could you please also check it. :) was (Author: yqwang): [~elgoiri], could you please also check it. > Standby RM hangs (not retry or crash) forever due to forever lost from leader > election > -- > > Key: YARN-9151 > URL: https://issues.apache.org/jira/browse/YARN-9151 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.9.2 >Reporter: Yuqi Wang >Assignee: Yuqi Wang >Priority: Major > Labels: patch > Fix For: 3.1.1 > > Attachments: YARN-9151.001.patch, yarn_rm.zip
[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuqi Wang updated YARN-9151: Description: {color:#205081}*Issue Summary:*{color} Standby RM hangs (not retry or crash) forever due to forever lost from leader election {color:#205081}*Issue Repro Steps:*{color} # Start multiple RMs in HA mode # Modify all hostnames in the zk connect string to different values in DNS. (In reality, we need to replace old/bad zk machines with new/good zk machines, so their DNS hostnames will change.) {color:#205081}*Issue Logs:*{color} See the full RM log in the attachment, yarn_rm.zip (the RM is BN4SCH101222318). To make it clear, the whole story is:
{noformat}
Join Election
Win the leader (ZK Node Creation Callback)
Start to becomeActive
Start RMActiveServices
Start CommonNodeLabelsManager failed due to zk connect UnknownHostException
Stop CommonNodeLabelsManager
Stop RMActiveServices
Create and Init RMActiveServices
Fail to becomeActive
ReJoin Election
Failed to Join Election due to zk connect UnknownHostException
  (Here the exception is eaten and only an event is sent)
Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
Transitioning RM to Standby
Start StandByTransitionThread
Already in standby state
ReJoin Election
Failed to Join Election due to zk connect UnknownHostException
  (Here the exception is eaten and only an event is sent)
Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
Transitioning RM to Standby
Start StandByTransitionThread
Found RMActiveServices's StandByTransitionRunnable object has already run previously, so immediately return

(The standby RM failed to rejoin the election, but it never retries or crashes later, so afterwards there are no zk related logs and the standby RM hangs forever, even if the zk connect string hostnames are changed back to the original ones in DNS.)
{noformat}
So, this should be a bug in RM, because *RM should always try to rejoin the election* (giving up on the election should only happen when the RM decides to crash); otherwise, an RM that is no longer in the election can never become active again and do real work. {color:#205081}*Caused By:*{color} It was introduced by YARN-3742. That JIRA's intended improvement was that, when a STATE_STORE_OP_FAILED RMFatalEvent happens, the RM should transition to standby instead of crashing. *However, in fact, that JIRA makes ALL kinds of RMFatalEvent ONLY transition to standby, instead of crashing.* (In contrast, before that change, the RM crashed on all of them instead of transitioning to standby.) So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it leaves the standby RM unable to work, e.g. stuck in standby forever. And as the author [said|https://issues.apache.org/jira/browse/YARN-3742?focusedCommentId=15891385=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15891385]: {quote}I think a good approach here would be to change the RMFatalEvent handler to transition to standby as the default reaction, *with shutdown as a special case for certain types of failures.* {quote} But the author was *too optimistic when implementing the patch.* {color:#205081}*The Patch's Solution:*{color} So, to be *conservative*, we had better *only transition to standby for the failures in a {color:#14892c}whitelist{color}:*
public enum RMFatalEventType {
{color:#14892c}// Source <- Store{color}
{color:#14892c}STATE_STORE_FENCED,{color}
{color:#14892c}STATE_STORE_OP_FAILED,{color}
// Source <- Embedded Elector
EMBEDDED_ELECTOR_FAILED,
{color:#14892c}// Source <- Admin Service{color}
{color:#14892c}TRANSITION_TO_ACTIVE_FAILED,{color}
// Source <- Critical Thread Crash
CRITICAL_THREAD_CRASH
}
And the others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH, as well as failure types added in the future, should crash the RM, because we *cannot ensure* that they will *never* leave the RM unable to work in standby state, and the *conservative* way is to crash the RM. Besides, after a crash, the RM's external watchdog service can detect it and try to repair the RM machine, send alerts, etc. And on restart the RM can reload the latest zk connect string config with the latest hostnames. For more details, please check the patch. was: {color:#205081}*Issue Summary:*{color} Standby RM hangs (not retry or crash) forever due to forever lost from leader election {color:#205081}*Issue Repro Steps:*{color} # Start multiple RMs in HA mode # Modify all hostnames in the zk connect string to different values in DNS. (In reality, we need to replace old/bad zk machines to new/good zk machines, so their DNS hostname will be changed.) {color:#205081}*Issue Logs:*{color} See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318). To make it clear, the whole story is: {noformat} Join Election Win the leader (ZK Node Creation Callback) Start to becomeActive Start
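The "has already run previously, so immediately return" step in the story above is the crux of the hang. A simplified sketch of that run-once guard follows (the names here are illustrative, not the actual ResourceManager source):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Simplified sketch (illustrative names, not the actual RM source). The
// standby transition runnable is guarded by a run-once flag per
// RMActiveServices instance, so when a second EMBEDDED_ELECTOR_FAILED
// arrives, the runnable returns immediately and nothing ever rejoins the
// election -- which is exactly the permanent hang described above.
final class StandByTransitionRunnable implements Runnable {
    private final AtomicBoolean hasAlreadyRun = new AtomicBoolean(false);
    int transitionCount = 0; // exposed only for this sketch

    @Override
    public void run() {
        // compareAndSet flips the flag exactly once; later calls bail out,
        // matching "has already run previously, so immediately return".
        if (!hasAlreadyRun.compareAndSet(false, true)) {
            return;
        }
        transitionCount++; // transitionToStandby() + rejoin election go here
    }
}
```

Running the runnable twice leaves `transitionCount` at 1: the second elector failure is silently dropped, so the standby RM never attempts the election again.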
[jira] [Comment Edited] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724903#comment-16724903 ] Yuqi Wang edited comment on YARN-9151 at 12/19/18 11:28 AM: BTW, [~jianhe], for YARN-4438, you said: {quote}_If it is due to close(), don't we want to force give-up so the other RM becomes active? If it is on initAndStartLeaderLatch(), *this RM will never become active; don't we want to just die?*_ What do you mean by force give-up ? exit RM ? The underlying curator implementation *will retry the connection in background*, even though the exception is thrown. See *Guaranteeable* interface in Curator. I think exit RM is too harsh here. Even though RM remains at standby, all services should be already shutdown, so there's no harm to the end users ? {quote} However, in this case, if we are using CuratorBasedElectorService, I think Curator will *NOT* retry the connection, because I saw the following in the log and confirmed it in Curator's code: *Background exception was not retry-able or retry gave up for UnknownHostException*
{code:java}
2018-12-14 14:14:20,847 ERROR [Curator-Framework-0] org.apache.curator.framework.imps.CuratorFrameworkImpl: Background exception was not retry-able or retry gave up
java.net.UnknownHostException: BN2AAP10C07C229
    at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
    at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
    at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
    at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
    at java.net.InetAddress.getAllByName(InetAddress.java:1192)
    at java.net.InetAddress.getAllByName(InetAddress.java:1126)
    at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
    at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:461)
    at org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:146)
    at org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:94)
    at org.apache.curator.HandleHolder.getZooKeeper(HandleHolder.java:55)
    at org.apache.curator.ConnectionState.reset(ConnectionState.java:218)
    at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:193)
    at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
    at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:806)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:792)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:62)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:257)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
{code}
Besides, in YARN-4438, I did not see the Curator *Guaranteeable* interface being used. So, in the patch, if rejoining the election throws an exception, it will send EMBEDDED_ELECTOR_FAILED, and then the RM will crash and reload the latest zk connect string config.
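The "not retry-able" behavior in the stack trace above can be mimicked with a small classifier. This is purely illustrative; `BackgroundFailureClassifier` is a hypothetical helper written for this sketch, not Curator API:

```java
import java.net.UnknownHostException;

// Hypothetical helper (not Curator API) mirroring the behavior observed in
// the log: transient connection errors can be retried in the background,
// but an UnknownHostException means the zk connect string hostnames no
// longer resolve, so retrying with the same stale config can never succeed
// and the process should crash to pick up fresh configuration on restart.
final class BackgroundFailureClassifier {
    static boolean shouldRetryInBackground(Throwable t) {
        // Walk the cause chain: the UnknownHostException may be wrapped.
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof UnknownHostException) {
                return false; // DNS change: stale hostnames cannot recover
            }
        }
        return true;
    }
}
```

Under this classification, the `UnknownHostException: BN2AAP10C07C229` from the log is exactly the non-retryable case that justifies crashing the RM rather than staying in standby.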
[jira] [Comment Edited] (YARN-9098) Separate mtab file reader code and cgroups file system hierarchy parser code from CGroupsHandlerImpl and ResourceHandlerModule
[ https://issues.apache.org/jira/browse/YARN-9098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724900#comment-16724900 ] Szilard Nemeth edited comment on YARN-9098 at 12/19/18 11:26 AM: - Hi [~Jim_Brennan]! Thanks for looking at my patch and taking the time to test it! First of all: the static import is fixed with the new patch. I understand your concern regarding this refactoring, but it eliminates some code duplication and improves the quality of the code significantly. About the test errors: These are very strange errors. They all seem to be test framework issues, as somehow, on your computer, the temp directory storing cgroups and the cpu controller underneath it have not been created, which is very strange since the tests assert on those file and directory creations. I rebuilt and reran the testcases locally with these commands: 1. {code:java} mvn clean package -Pdist -DskipTests -Dmaven.javadoc.skip=true{code} 2. {code:java} mvn test -pl org.apache.hadoop:hadoop-yarn-server-nodemanager -fae | tee ~/maventest`date +%Y%m%d` {code} I don't have test failures in any of the classes where you did; this is an excerpt of the output: {noformat} [INFO] Running org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.TestMtabFileParser [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.138 s - in org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.TestMtabFileParser ... [INFO] Running org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.TestCGroupsHandlerImpl [INFO] Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.538 s - in org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.TestCGroupsHandlerImpl ... 
[INFO] Running org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.TestCGroupsControllerPaths [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.127 s - in org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.TestCGroupsControllerPaths {noformat} This is the file listing of my target directory: {noformat} ??-( szilardnemeth@snemeth-MBP[10:17:19] <0> @YARN-9098 )--( ~/development/apache/hadoop )-- └-$ cat /Users/szilardnemeth/development/apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/test-dir/cgroups/025802ba-4abe-4862-942b-beef2d279ca7 none /Users/szilardnemeth/development/apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/test-dir/cgroups/cp cgroup rw,relatime,cpu 0 0 none /Users/szilardnemeth/development/apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/test-dir/cgroups/cpu cgroup rw,relatime,cpu 0 0 none /Users/szilardnemeth/development/apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/test-dir/cgroups/blkio cgroup rw,relatime,blkio 0 0{noformat} I even tried to remove the test directory (with {{rm -rf /Users/szilardnemeth/development/apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/test-dir}}) and re-execute the tests but they work fine. Anyway, I added some code that better describes the assertion errors plus added some logging of cgroup paths to 2 testcases. Could you please rerun the tests with the new code and do a file listing like I did? This must be some platform issue I suppose. The test failures for {{TestCGroupsHandlerImpl}}: {{TestCGroupsHandlerImpl#createPremountedCgroups}} calls: {noformat} File cpuCgroup = new File(parentDir, "cpu"); //and later on... 
assertTrue("Directory should be created", cpuCgroup.mkdirs()); {noformat} This should create cgroups for cpu, and as you can see, it is even asserted properly. Could you please re-test? Thanks! was (Author: snemeth): Hi [~Jim_Brennan]! Thanks for looking at my patch and taking steps on testing! First of all: the static import is fixed with the new patch. Actually, I understand your fear regarding this refactoring but this eliminates some code duplication and improves the quality of the code significantly. About the test errors: These are very strange errors. At this level, they seem to be all test framework issues, as somehow on your computer, the temp directory storing cgroups and the cpu controller underneath have not created, which is very strange as the tests are having assertions on the file and directory creations. I re-built and ran the testcases with these commands again on my computer, locally: 1. {code:java} mvn clean package -Pdist -DskipTests -Dmaven.javadoc.skip=true{code} 2. {code:java} mvn test -pl
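As a rough illustration of what "code that better describes the assertion errors" can look like for this kind of directory-creation check (a hypothetical helper, not the actual patch code): on failure, report what the parent directory actually contains instead of the bare "Directory should be created" message.

```java
import java.io.File;
import java.nio.file.Files;
import java.util.Arrays;

// Hypothetical test helper: mkdirs() returning false says nothing about why.
// Including the parent's listing in the failure message makes platform
// issues (missing temp dir, permissions) visible directly in the test output.
public class CgroupTestHelper {
    static void assertDirCreated(File dir) {
        if (!dir.mkdirs() && !dir.isDirectory()) {
            File parent = dir.getParentFile();
            String listing = (parent != null && parent.list() != null)
                ? Arrays.toString(parent.list()) : "<unlistable>";
            throw new AssertionError("Could not create " + dir
                + "; parent contains: " + listing);
        }
    }

    public static void main(String[] args) throws Exception {
        File tmp = Files.createTempDirectory("cgroups-test").toFile();
        assertDirCreated(new File(tmp, "cpu"));  // succeeds on a writable tmpdir
        System.out.println(new File(tmp, "cpu").isDirectory());
    }
}
```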
[jira] [Updated] (YARN-9098) Separate mtab file reader code and cgroups file system hierarchy parser code from CGroupsHandlerImpl and ResourceHandlerModule
[ https://issues.apache.org/jira/browse/YARN-9098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-9098: - Attachment: YARN-9098.004.patch > Separate mtab file reader code and cgroups file system hierarchy parser code > from CGroupsHandlerImpl and ResourceHandlerModule > -- > > Key: YARN-9098 > URL: https://issues.apache.org/jira/browse/YARN-9098 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-9098.002.patch, YARN-9098.003.patch, > YARN-9098.004.patch > > > Separate mtab file reader code and cgroups file system hierarchy parser code > from CGroupsHandlerImpl and ResourceHandlerModule > CGroupsHandlerImpl has a method parseMtab that parses an mtab file and stores > cgroups data. > CGroupsLCEResourcesHandler also has a method with the same name, with > identical code. > The parser code should be extracted from these places and be added in a new > class as this is a separate responsibility. > As the output of the file parser is a Map<String, Set<String>>, it's better > to encapsulate it in a domain object, named 'CGroupsMountConfig' for instance. > ResourceHandlerModule has a method named parseConfiguredCGroupPath that is > responsible for producing the same results (Map<String, Set<String>>) to > store cgroups data; it does not operate on an mtab file, but looks at the > filesystem for cgroup settings. As the output is the same, CGroupsMountConfig > should be used here, too. > Again, this code should not be part of ResourceHandlerModule as it is a > different responsibility. > One more thing which is strongly related to the methods above is > CGroupsHandlerImpl.initializeFromMountConfig: this method processes the > result of a parsed mtab file or parsed cgroups filesystem data and stores > file system paths for all available controllers. 
This method invokes > findControllerPathInMountConfig, which is duplicated in CGroupsHandlerImpl > and CGroupsLCEResourcesHandler, so it should be moved to a single place. To > store filesystem path and controller mappings, a new domain object could be > introduced. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
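The extracted parser responsibility described above can be sketched roughly as follows. Class and method names here are illustrative (the real Hadoop classes and the exact map type may differ), and the real code restricts the options to known cgroup controllers, which this sketch skips:

```java
import java.util.*;

// Illustrative sketch of an mtab parser in the spirit of this JIRA: each
// mtab line is "device mountPoint fsType options dump pass"; for cgroup
// mounts, every option token is mapped to the set of mount paths.
public class MtabParser {
    static Map<String, Set<String>> parse(List<String> mtabLines) {
        Map<String, Set<String>> result = new HashMap<>();
        for (String line : mtabLines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 4 || !"cgroup".equals(fields[2])) {
                continue;  // skip malformed and non-cgroup mounts
            }
            for (String option : fields[3].split(",")) {
                result.computeIfAbsent(option, k -> new TreeSet<>()).add(fields[1]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> mtab = Arrays.asList(
            "none /sys/fs/cgroup/cpu cgroup rw,relatime,cpu 0 0",
            "none /sys/fs/cgroup/blkio cgroup rw,relatime,blkio 0 0",
            "tmpfs /dev/shm tmpfs rw 0 0");  // non-cgroup line is skipped
        System.out.println(parse(mtab).get("cpu"));    // [/sys/fs/cgroup/cpu]
        System.out.println(parse(mtab).get("blkio"));  // [/sys/fs/cgroup/blkio]
    }
}
```

Wrapping the returned map in a small domain object (the 'CGroupsMountConfig' idea from the description) would then give both the mtab-based and the filesystem-scanning producers a single shared result type.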
[jira] [Commented] (YARN-9098) Separate mtab file reader code and cgroups file system hierarchy parser code from CGroupsHandlerImpl and ResourceHandlerModule
[ https://issues.apache.org/jira/browse/YARN-9098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724900#comment-16724900 ] Szilard Nemeth commented on YARN-9098: -- Hi [~Jim_Brennan]! Thanks for looking at my patch and taking the time to test it! First of all: the static import is fixed with the new patch. I understand your concern regarding this refactoring, but it eliminates some code duplication and improves the quality of the code significantly. About the test errors: These are very strange errors. They all seem to be test framework issues, as somehow, on your computer, the temp directory storing cgroups and the cpu controller underneath it have not been created, which is very strange since the tests assert on those file and directory creations. I rebuilt and reran the testcases locally with these commands: 1. {code:java} mvn clean package -Pdist -DskipTests -Dmaven.javadoc.skip=true{code} 2. {code:java} mvn test -pl org.apache.hadoop:hadoop-yarn-server-nodemanager -fae | tee ~/maventest`date +%Y%m%d` {code} I don't have test failures in any of the classes where you did; this is an excerpt of the output: {noformat} [INFO] Running org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.TestMtabFileParser [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.138 s - in org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.TestMtabFileParser ... [INFO] Running org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.TestCGroupsHandlerImpl [INFO] Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.538 s - in org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.TestCGroupsHandlerImpl ... 
[INFO] Running org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.TestCGroupsControllerPaths [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.127 s - in org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.TestCGroupsControllerPaths {noformat} This is the file listing of my target directory: {noformat} ??-( szilardnemeth@snemeth-MBP[10:17:19] <0> @YARN-9098 )--( ~/development/apache/hadoop )-- └-$ cat /Users/szilardnemeth/development/apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/test-dir/cgroups/025802ba-4abe-4862-942b-beef2d279ca7 none /Users/szilardnemeth/development/apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/test-dir/cgroups/cp cgroup rw,relatime,cpu 0 0 none /Users/szilardnemeth/development/apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/test-dir/cgroups/cpu cgroup rw,relatime,cpu 0 0 none /Users/szilardnemeth/development/apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/test-dir/cgroups/blkio cgroup rw,relatime,blkio 0 0{noformat} I even tried to remove the test directory (with {{rm -rf /Users/szilardnemeth/development/apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/test-dir}}) and re-execute the tests but they work fine. Anyway, I added some code that better describes the assertion errors plus added some logging of cgroup paths to 2 testcases. Could you please rerun the tests with the new code and do a file listing like I did? This must be some platform issue I suppose. The test failures for {{TestCGroupsHandlerImpl}}: {{TestCGroupsHandlerImpl#createPremountedCgroups}} calls: {noformat} File cpuCgroup = new File(parentDir, "cpu"); //and later on... 
assertTrue("Directory should be created", cpuCgroup.mkdirs()); {noformat} This should create cgroups for cpu, and as you can see, it is even asserted properly. Thanks! > Separate mtab file reader code and cgroups file system hierarchy parser code > from CGroupsHandlerImpl and ResourceHandlerModule > -- > > Key: YARN-9098 > URL: https://issues.apache.org/jira/browse/YARN-9098 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-9098.002.patch, YARN-9098.003.patch, > YARN-9098.004.patch > > > Separate mtab file reader code and cgroups file system hierarchy parser code > from CGroupsHandlerImpl and ResourceHandlerModule > CGroupsHandlerImpl has a method parseMtab that parses an mtab file and stores > cgroups data. > CGroupsLCEResourcesHandler also has a method with the same name, with > identical code. > The parser code
[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuqi Wang updated YARN-9151: Description: {color:#205081}*Issue Summary:*{color} Standby RM hangs (not retry or crash) forever due to forever lost from leader election {color:#205081}*Issue Repro Steps:*{color} # Start multiple RMs in HA mode # Modify all hostnames in the zk connect string to different values in DNS. (In reality, we need to replace old/bad zk machines with new/good zk machines, so their DNS hostname will be changed.) {color:#205081}*Issue Logs:*{color} See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318). To make it clear, the whole story is: {noformat} Join Election Win the leader (ZK Node Creation Callback) Start to becomeActive Start RMActiveServices Start CommonNodeLabelsManager failed due to zk connect UnknownHostException Stop CommonNodeLabelsManager Stop RMActiveServices Create and Init RMActiveServices Fail to becomeActive ReJoin Election Failed to Join Election due to zk connect UnknownHostException (Here the exception is eaten and just an event is sent) Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby Transitioning RM to Standby Start StandByTransitionThread Already in standby state ReJoin Election Failed to Join Election due to zk connect UnknownHostException (Here the exception is eaten and just an event is sent) Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby Transitioning RM to Standby Start StandByTransitionThread Found RMActiveServices's StandByTransitionRunnable object has already run previously, so immediately return (The standby RM failed to rejoin the election, but it will never retry or crash later, so afterwards there are no zk related logs and the standby RM hangs forever, even if the zk connect string hostnames are changed back to the original ones in DNS.) 
{noformat} So, this should be a bug in RM, because *RM should always try to join the election* (giving up on joining the election should only happen when RM decides to crash); otherwise, an RM that is no longer inside the election can never become active again and start real work. {color:#205081}*Caused By:*{color} It is introduced by YARN-3742. What the JIRA wants to improve is that, when a STATE_STORE_OP_FAILED RMFatalEvent happens, RM should transition to standby instead of crashing. *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition to standby, instead of crash.* (In contrast, before this change, RM made all of them crash instead of going to standby.) So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it will leave the standby RM not working, such as staying in standby forever. And as the author [said|https://issues.apache.org/jira/browse/YARN-3742?focusedCommentId=15891385=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15891385]: {quote}I think a good approach here would be to change the RMFatalEvent handler to transition to standby as the default reaction, *with shutdown as a special case for certain types of failures.* {quote} But the author was *too optimistic when implementing the patch.* {color:#205081}*What the Patch's solution:*{color} So, to be *conservative*, we had better *only transition to standby for the failures in a {color:#14892c}whitelist{color}:* public enum RMFatalEventType { {color:#14892c}// Source <- Store{color} {color:#14892c}STATE_STORE_FENCED,{color} {color:#14892c}STATE_STORE_OP_FAILED,{color} // Source <- Embedded Elector EMBEDDED_ELECTOR_FAILED, {color:#14892c}// Source <- Admin Service{color} {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color} // Source <- Critical Thread Crash CRITICAL_THREAD_CRASH } And others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and future added failure types, should crash RM, because we *cannot ensure* that they will *never* leave RM unable to work in standby state, and the 
*conservative* way is to crash RM. Besides, after crash, the RM's external watchdog service can know this and try to repair the RM machine, send alerts, etc. For more details, please check the patch. was: {color:#205081}*Issue Summary:*{color} Standby RM hangs (not retry or crash) forever due to forever lost from leader election {color:#205081}*Issue Repro Steps:*{color} # Start multiple RMs in HA mode # Modify all hostnames in the zk connect string to different values in DNS. (In reality, we need to replace old/bad zk machines to new/good zk machines, so their DNS hostname will be changed.) {color:#205081}*Issue Logs:*{color} See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318). To make it clear, the whole story is: {noformat} Join Election Win the leader (ZK Node Creation Callback) Start to becomeActive Start RMActiveServices Start CommonNodeLabelsManager failed due to zk connect
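The whitelist idea in the description above can be sketched as a small decision helper. The enum values mirror the list in the description; the helper itself is hypothetical, not the actual patch code:

```java
import java.util.EnumSet;

// Sketch of the proposed policy: only event types known to be safe end in
// standby; everything else crashes the RM so an external watchdog can
// notice, repair the machine, and send alerts.
public class FatalEventPolicy {
    enum RMFatalEventType {
        STATE_STORE_FENCED,          // Source <- Store
        STATE_STORE_OP_FAILED,       // Source <- Store
        EMBEDDED_ELECTOR_FAILED,     // Source <- Embedded Elector
        TRANSITION_TO_ACTIVE_FAILED, // Source <- Admin Service
        CRITICAL_THREAD_CRASH        // Source <- Critical Thread Crash
    }

    // Whitelist: failures for which standby is a known-safe end state.
    static final EnumSet<RMFatalEventType> STANDBY_WHITELIST = EnumSet.of(
        RMFatalEventType.STATE_STORE_FENCED,
        RMFatalEventType.STATE_STORE_OP_FAILED,
        RMFatalEventType.TRANSITION_TO_ACTIVE_FAILED);

    static boolean shouldCrash(RMFatalEventType type) {
        // Conservative default: crash for anything not explicitly whitelisted,
        // including event types added in the future.
        return !STANDBY_WHITELIST.contains(type);
    }

    public static void main(String[] args) {
        System.out.println(shouldCrash(RMFatalEventType.STATE_STORE_OP_FAILED));   // false
        System.out.println(shouldCrash(RMFatalEventType.EMBEDDED_ELECTOR_FAILED)); // true
    }
}
```

The point of the whitelist shape is that the conservative behavior (crash) is the default, so a newly added event type cannot silently inherit the standby path.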
[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuqi Wang updated YARN-9151: Description: {color:#205081}*Issue Summary:*{color} Standby RM hangs (not retry or crash) forever due to forever lost from leader election {color:#205081}*Issue Repro Steps:*{color} # Start multiple RMs in HA mode # Modify all hostnames in the zk connect string to different values in DNS. (In reality, we need to replace old/bad zk machines with new/good zk machines, so their DNS hostname will be changed.) {color:#205081}*Issue Logs:*{color} See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318). To make it clear, the whole story is: {noformat} Join Election Win the leader (ZK Node Creation Callback) Start to becomeActive Start RMActiveServices Start CommonNodeLabelsManager failed due to zk connect UnknownHostException Stop CommonNodeLabelsManager Stop RMActiveServices Create and Init RMActiveServices Fail to becomeActive ReJoin Election Failed to Join Election due to zk connect UnknownHostException (Here the exception is eaten and just an event is sent) Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby Transitioning RM to Standby Start StandByTransitionThread Already in standby state ReJoin Election Failed to Join Election due to zk connect UnknownHostException (Here the exception is eaten and just an event is sent) Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby Transitioning RM to Standby Start StandByTransitionThread Found RMActiveServices's StandByTransitionRunnable object has already run previously, so immediately return (The standby RM failed to rejoin the election, but it will never retry or crash later, so afterwards there are no zk related logs and the standby RM hangs forever, even if the zk connect string hostnames are changed back to the original ones in DNS.) 
{noformat} So, this should be a bug in RM, because *RM should always try to join the election* (giving up on joining the election should only happen when RM decides to crash); otherwise, an RM that is no longer inside the election can never become active again and start real work. {color:#205081}*Caused By:*{color} It is introduced by YARN-3742. What the JIRA wants to improve is that, when a STATE_STORE_OP_FAILED RMFatalEvent happens, RM should transition to standby instead of crashing. *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition to standby, instead of crash.* (In contrast, before this change, RM made all of them crash instead of going to standby.) So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it will leave the standby RM not working, such as staying in standby forever. And as the author said: {quote}I think a good approach here would be to change the RMFatalEvent handler to transition to standby as the default reaction, *with shutdown as a special case for certain types of failures.* {quote} But the author was *too optimistic when implementing the patch.* {color:#205081}*What the Patch's solution:*{color} So, to be *conservative*, we had better *only transition to standby for the failures in a {color:#14892c}whitelist{color}:* public enum RMFatalEventType { {color:#14892c}// Source <- Store{color} {color:#14892c}STATE_STORE_FENCED,{color} {color:#14892c}STATE_STORE_OP_FAILED,{color} // Source <- Embedded Elector EMBEDDED_ELECTOR_FAILED, {color:#14892c}// Source <- Admin Service{color} {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color} // Source <- Critical Thread Crash CRITICAL_THREAD_CRASH } And others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and future added failure types, should crash RM, because we *cannot ensure* that they will *never* leave RM unable to work in standby state, and the *conservative* way is to crash RM. Besides, after a crash, the RM's external watchdog service can notice this and try to repair the RM machine, send alerts, etc. 
For more details, please check the patch. was: {color:#205081}*Issue Summary:*{color} Standby RM hangs (not retry or crash) forever due to forever lost from leader election {color:#205081}*Issue Repro Steps:*{color} # Start multiple RMs in HA mode # Modify all hostnames in the zk connect string to different values in DNS. (In reality, we need to replace old/bad zk machines to new/good zk machines, so their DNS hostname will be changed.) {color:#205081}*Issue Logs:*{color} The RM is BN4SCH101222318 You can check the full RM log in attachment, yarn_rm.zip. To make it clear, the whole story is: {noformat} Join Election Win the leader (ZK Node Creation Callback) Start to becomeActive Start RMActiveServices Start CommonNodeLabelsManager failed due to zk connect UnknownHostException Stop CommonNodeLabelsManager Stop RMActiveServices Create and Init RMActiveServices Fail to becomeActive ReJoin Election
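The "RM should always try to join election" position argued in the description amounts to retrying the rejoin instead of giving up after a single failed attempt. A toy sketch of that idea under a capped exponential backoff (hypothetical code, not the actual patch):

```java
// Illustrative only: retry election join until it succeeds, so a transient
// failure (e.g. UnknownHostException while DNS is wrong) cannot leave the
// RM permanently outside the election.
public class RejoinLoop {
    interface Elector { void joinElection() throws Exception; }

    // Retries joinElection with capped exponential backoff; returns the
    // number of attempts it took to succeed.
    static int joinWithRetry(Elector elector, long baseBackoffMs, long maxBackoffMs)
            throws InterruptedException {
        int attempts = 0;
        long backoff = baseBackoffMs;
        while (true) {
            attempts++;
            try {
                elector.joinElection();
                return attempts;
            } catch (Exception e) {
                Thread.sleep(backoff);
                backoff = Math.min(backoff * 2, maxBackoffMs);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Fails twice (DNS still wrong), then succeeds (DNS fixed).
        Elector flaky = () -> { if (++calls[0] < 3) throw new Exception("zk connect failed"); };
        System.out.println(joinWithRetry(flaky, 1, 8)); // 3
    }
}
```

An unbounded loop like this is exactly what "always try to join" implies; the alternative the description argues for, when retry-forever is not acceptable for an event type, is to crash so a watchdog can intervene.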
[jira] [Commented] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724895#comment-16724895 ] Yuqi Wang commented on YARN-9151: - [~kasha] and [~templedf], could you please look at this issue and fix. > Standby RM hangs (not retry or crash) forever due to forever lost from leader > election > -- > > Key: YARN-9151 > URL: https://issues.apache.org/jira/browse/YARN-9151 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.9.2 >Reporter: Yuqi Wang >Assignee: Yuqi Wang >Priority: Major > Labels: patch > Fix For: 3.1.1 > > Attachments: YARN-9151.001.patch, yarn_rm.zip > > > {color:#205081}*Issue Summary:*{color} > Standby RM hangs (not retry or crash) forever due to forever lost from > leader election > > {color:#205081}*Issue Repro Steps:*{color} > # Start multiple RMs in HA mode > # Modify all hostnames in the zk connect string to different values in DNS. > (In reality, we need to replace old/bad zk machines to new/good zk machines, > so their DNS hostname will be changed.) > > {color:#205081}*Issue Logs:*{color} > See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318). 
> To make it clear, the whole story is: > {noformat} > Join Election > Win the leader (ZK Node Creation Callback) > Start to becomeActive > Start RMActiveServices > Start CommonNodeLabelsManager failed due to zk connect > UnknownHostException > Stop CommonNodeLabelsManager > Stop RMActiveServices > Create and Init RMActiveServices > Fail to becomeActive > ReJoin Election > Failed to Join Election due to zk connect UnknownHostException (Here the > exception is eaten and just an event is sent) > Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby > Transitioning RM to Standby > Start StandByTransitionThread > Already in standby state > ReJoin Election > Failed to Join Election due to zk connect UnknownHostException (Here the > exception is eaten and just an event is sent) > Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby > Transitioning RM to Standby > Start StandByTransitionThread > Found RMActiveServices's StandByTransitionRunnable object has already run > previously, so immediately return > > (The standby RM failed to rejoin the election, but it will never retry or > crash later, so afterwards there are no zk related logs and the standby RM > hangs forever, even if the zk connect string hostnames are changed back to > the original ones in DNS.) > {noformat} > So, this should be a bug in RM, because *RM should always try to join the > election* (giving up on joining the election should only happen when RM > decides to crash); otherwise, an RM that is no longer inside the election can > never become active again and start real work. > > {color:#205081}*Caused By:*{color} > It is introduced by YARN-3742. > What the JIRA wants to improve is that, when a STATE_STORE_OP_FAILED > RMFatalEvent happens, RM should transition to standby instead of crashing. 
> *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition > to standby, instead of crash.* (In contrast, before this change, RM made all > of them crash instead of going to standby.) > So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it > will leave the standby RM not working, such as staying in standby forever. > And as the author said: > {quote}I think a good approach here would be to change the RMFatalEvent > handler to transition to standby as the default reaction, *with shutdown as a > special case for certain types of failures.* > {quote} > But the author was *too optimistic when implementing the patch.* > > {color:#205081}*What the Patch's solution:*{color} > So, to be *conservative*, we had better *only transition to standby for the > failures in a {color:#14892c}whitelist{color}:* > public enum RMFatalEventType { > {color:#14892c}// Source <- Store{color} > {color:#14892c}STATE_STORE_FENCED,{color} > {color:#14892c}STATE_STORE_OP_FAILED,{color} > // Source <- Embedded Elector > EMBEDDED_ELECTOR_FAILED, > {color:#14892c}// Source <- Admin Service{color} > {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color} > // Source <- Critical Thread Crash > CRITICAL_THREAD_CRASH > } > And others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and > future added failure types, should crash RM, because we *cannot ensure* that > they will *never* leave RM unable to work in standby state, and the > *conservative* way is to crash RM. Besides, after a crash, the RM's external > watchdog service can notice this and try to repair the RM machine, send > alerts, etc. > For more details, please check the patch. -- This message was sent by
[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuqi Wang updated YARN-9151: Description: {color:#205081}*Issue Summary:*{color} Standby RM hangs (not retry or crash) forever due to forever lost from leader election {color:#205081}*Issue Repro Steps:*{color} # Start multiple RMs in HA mode # Modify all hostnames in the zk connect string to different values in DNS. (In reality, we need to replace old/bad zk machines with new/good zk machines, so their DNS hostname will be changed.) {color:#205081}*Issue Logs:*{color} The RM is BN4SCH101222318. You can check the full RM log in the attachment, yarn_rm.zip. To make it clear, the whole story is: {noformat} Join Election Win the leader (ZK Node Creation Callback) Start to becomeActive Start RMActiveServices Start CommonNodeLabelsManager failed due to zk connect UnknownHostException Stop CommonNodeLabelsManager Stop RMActiveServices Create and Init RMActiveServices Fail to becomeActive ReJoin Election Failed to Join Election due to zk connect UnknownHostException (Here the exception is eaten and just a transition-to-Standby event is sent) Send RMFatalEvent to transition RM to standby Transitioning RM to Standby Start StandByTransitionThread Already in standby state ReJoin Election Failed to Join Election due to zk connect UnknownHostException (Here the exception is eaten and just a transition-to-Standby event is sent) Send RMFatalEvent to transition RM to standby Transitioning RM to Standby Start StandByTransitionThread Found RMActiveServices's StandByTransitionRunnable object has already run previously, so immediately return (The standby RM failed to re-join the election, but it will never retry or crash later, so afterwards there are no zk related logs and the standby RM hangs forever.) 
{noformat} So, this should be a bug in RM, because *RM should always try to join the election* (giving up on joining the election should only happen when RM decides to crash); otherwise, an RM that is no longer inside the election can never become active again and start real work. {color:#205081}*Caused By:*{color} It is introduced by YARN-3742. What the JIRA wants to improve is that, when a STATE_STORE_OP_FAILED RMFatalEvent happens, RM should transition to standby instead of crashing. *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition to standby, instead of crash.* (In contrast, before this change, RM made all of them crash instead of going to standby.) So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it will leave the standby RM not working, such as staying in standby forever. And as the author said: {quote}I think a good approach here would be to change the RMFatalEvent handler to transition to standby as the default reaction, *with shutdown as a special case for certain types of failures.* {quote} But the author was *too optimistic when implementing the patch.* {color:#205081}*What the Patch's solution:*{color} So, to be *conservative*, we had better *only transition to standby for the failures in a {color:#14892c}whitelist{color}:* public enum RMFatalEventType { {color:#14892c}// Source <- Store{color} {color:#14892c}STATE_STORE_FENCED,{color} {color:#14892c}STATE_STORE_OP_FAILED,{color} // Source <- Embedded Elector EMBEDDED_ELECTOR_FAILED, {color:#14892c}// Source <- Admin Service{color} {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color} // Source <- Critical Thread Crash CRITICAL_THREAD_CRASH } And others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and future added failure types, should crash RM, because we *cannot ensure* that they will never leave RM unable to work in standby state; the *conservative* way is to crash RM. Besides, after a crash, the RM watchdog can notice this and try to repair the RM machine, send alerts, etc. 
For more details, please check the patch. was: {color:#205081}*Issue Summary:*{color} Standby RM hangs (not retry or crash) forever due to forever lost from leader election {color:#205081}*Issue Repro Steps:*{color} # Start multiple RMs in HA mode # Modify all hostnames in the zk connect string to different values in DNS. (In reality, we need to replace old/bad zk machines to new/good zk machines, so their DNS hostname will be changed.) {color:#205081}*Issue Logs:*{color} The RM is BN4SCH101222318 You can check the full RM log in attachment, yarn_rm.zip. To make it clear, the whole story is: {noformat} Join Election Win the leader (ZK Node Creation Callback) Start to becomeActive Start RMActiveServices Start CommonNodeLabelsManager failed due to zk connect UnknownHostException Stop CommonNodeLabelsManager Stop RMActiveServices Create and Init RMActiveServices Fail to becomeActive ReJoin Election Failed to Join Election due to zk connect UnknownHostException (Here the exception is eat and just
[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuqi Wang updated YARN-9151: Fix Version/s: 2.9.2 > Standby RM hangs (not retry or crash) forever due to forever lost from leader > election > -- > > Key: YARN-9151 > URL: https://issues.apache.org/jira/browse/YARN-9151 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager > Affects Versions: 2.9.2 > Reporter: Yuqi Wang > Assignee: Yuqi Wang > Priority: Major > Labels: patch > Fix For: 3.1.1, 2.9.2 > > Attachments: YARN-9151-branch-2.9.2.001.patch, yarn_rm.zip
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
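The log line "Found RMActiveServices's StandByTransitionRunnable object has already run previously, so immediately return" is the crux of the hang: the standby-transition runnable is guarded so its body executes at most once per RMActiveServices instance. A minimal model of that guard (hypothetical names; the real guard lives inside the ResourceManager's standby-transition runnable) shows why the second failed rejoin is silently dropped:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal model of the run-once guard behind the "already run previously,
// so immediately return" log line. Names are illustrative.
public class StandByTransitionModel {
    // Flips to true on the first run and is never reset for the
    // lifetime of this object.
    private final AtomicBoolean hasAlreadyRun = new AtomicBoolean(false);
    public int transitionsExecuted = 0;

    /** Models the runnable body: transition to standby + rejoin election. */
    public void run() {
        // compareAndSet succeeds only the first time; every later request
        // to transition/rejoin is dropped here, so a rejoin that failed
        // the first time is never retried and the RM hangs in standby.
        if (!hasAlreadyRun.compareAndSet(false, true)) {
            return;
        }
        transitionsExecuted++;
    }
}
```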
[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuqi Wang updated YARN-9151: Release Note: Fix standby RM hangs (not retry or crash) forever due to forever lost from leader election. And now, RM will only transition to standby for known safe fatal events. (was: Fix standby RM hangs (not retry or crash) forever due to forever lost from leader election. And now, RM will only transition to standby for known fatal events.)
[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuqi Wang updated YARN-9151: Description:

{color:#205081}*Issue Summary:*{color}
Standby RM hangs forever (it neither retries nor crashes) because it is forever lost from the leader election.

{color:#205081}*Issue Repro Steps:*{color}
# Start multiple RMs in HA mode.
# Change all hostnames in the zk connect string to different values in DNS. (In reality, we needed to replace old/bad zk machines with new/good zk machines, so their DNS hostnames changed.)

{color:#205081}*Issue Logs:*{color}
The RM is BN4SCH101222318. You can check the full RM log in the attachment, yarn_rm.zip.
To make it clear, the whole story is:
{noformat}
Join Election
Win the leader (ZK Node Creation Callback)
Start to becomeActive
Start RMActiveServices
Start CommonNodeLabelsManager failed due to zk connect UnknownHostException
Stop CommonNodeLabelsManager
Stop RMActiveServices
Create and Init RMActiveServices
Fail to becomeActive
ReJoin Election
Failed to Join Election due to zk connect UnknownHostException
(Here the exception is eaten and only a transition-to-Standby event is sent)
Send RMFatalEvent to transition RM to standby
Transitioning RM to Standby
Start StandByTransitionThread
Already in standby state
ReJoin Election
Failed to Join Election due to zk connect UnknownHostException
(Here the exception is eaten and only a transition-to-Standby event is sent)
Send RMFatalEvent to transition RM to standby
Transitioning RM to Standby
Start StandByTransitionThread
Found RMActiveServices's StandByTransitionRunnable object has already run previously, so immediately return

(The standby RM failed to re-join the election, but it never retries or crashes afterwards, so there are no further zk-related logs and the standby RM hangs forever.)
{noformat}
So, this should be a bug in RM, because *RM should always try to join the election* (giving up on the election should only happen when RM has decided to crash); otherwise, an RM that is outside the election can never become active again and do real work.

{color:#205081}*Caused By:*{color}
It was introduced by YARN-3742.
That JIRA's intended improvement was: when a STATE_STORE_OP_FAILED RMFatalEvent happens, RM should transition to standby instead of crashing.
*However, in fact, the change makes ALL kinds of RMFatalEvent ONLY transition to standby, never crash.* (In contrast, before this change, RM crashed on all of them instead of going to standby.)
So even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, the standby RM is left unable to work, e.g. stuck in standby forever.
And as the author said:
{quote}I think a good approach here would be to change the RMFatalEvent handler to transition to standby as the default reaction, *with shutdown as a special case for certain types of failures.*
{quote}
But the author was *too optimistic when implementing the patch.*

{color:#205081}*The Patch's Solution:*{color}
To be *conservative*, we had better *only transition to standby for the failures in the {color:#14892c}whitelist{color}* (shown in green):
public enum RMFatalEventType {
{color:#14892c}// Source <- Store{color}
{color:#14892c}STATE_STORE_FENCED,{color}
{color:#14892c}STATE_STORE_OP_FAILED,{color}
// Source <- Embedded Elector
EMBEDDED_ELECTOR_FAILED,
{color:#14892c}// Source <- Admin Service{color}
{color:#14892c}TRANSITION_TO_ACTIVE_FAILED,{color}
// Source <- Critical Thread Crash
CRITICAL_THREAD_CRASH
}
All the others, such as EMBEDDED_ELECTOR_FAILED, CRITICAL_THREAD_CRASH, and any failure types added in the future, should crash RM, because we *cannot ensure* that they will never leave RM unable to work in standby state; the *conservative* way is to crash. Besides, after a crash, the RM watchdog can notice it and try to repair the RM machine, send alerts, etc.
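The premise above — *RM should always try to join election* — implies that a failed rejoin should be treated as retryable, and only exhausted retries should escalate to a crash. The following is a hypothetical sketch of that idea (illustrative names, not the patch's actual code; the patch instead narrows which fatal events are allowed to stay in standby):

```java
// Hypothetical rejoin-with-retry sketch; names are illustrative and not
// taken from the patch.
public class RejoinSketch {
    /** Something that can attempt to join the leader election. */
    public interface Elector {
        // e.g. may throw on a zk connect UnknownHostException
        void joinElection() throws Exception;
    }

    /**
     * Keeps retrying joinElection instead of swallowing the failure.
     * Returns the attempt number that succeeded; if all attempts fail,
     * escalates by throwing, so the RM crashes and a watchdog can act,
     * rather than hanging in standby forever.
     */
    public static int joinWithRetry(Elector elector, int maxAttempts) {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                elector.joinElection();
                return attempt; // joined the election successfully
            } catch (Exception e) {
                last = e; // retry (a real implementation would back off here)
            }
        }
        throw new IllegalStateException("could not rejoin election", last);
    }
}
```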
For more details, please check the patch.
[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuqi Wang updated YARN-9151: Fix Version/s: (was: 2.9.2)
[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuqi Wang updated YARN-9151: Description: (updated)
[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuqi Wang updated YARN-9151: Attachment: (was: YARN-9151-branch-2.9.2.001.patch)
[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuqi Wang updated YARN-9151: Attachment: YARN-9151.001.patch
[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuqi Wang updated YARN-9151: Fix Version/s: (was: 3.1.1) > Standby RM hangs (not retry or crash) forever due to forever lost from leader > election > -- > > Key: YARN-9151 > URL: https://issues.apache.org/jira/browse/YARN-9151 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.9.2 >Reporter: Yuqi Wang >Assignee: Yuqi Wang >Priority: Major > Labels: patch > Fix For: 3.1.1 > > Attachments: YARN-9151-branch-2.9.2.001.patch, yarn_rm.zip > > > *Issue Summary:* > Standby RM hangs (not retry or crash) forever due to forever lost from > leader election > > *Issue Repro Steps:* > # Start multiple RMs in HA mode > # Modify all hostnames in the zk connect string to different values in DNS. > (In reality, we need to replace old/bad zk machines to new/good zk machines, > so their DNS hostname will be changed.) > > *Issue Logs:* > The RM is BN4SCH101222318 > You can check the full RM log in attachment, yarn_rm.zip. 
[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuqi Wang updated YARN-9151: Attachment: YARN-9151-branch-2.9.2.001.patch
[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuqi Wang updated YARN-9151: Fix Version/s: (was: 2.9.2) 3.1.1
[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuqi Wang updated YARN-9151: Fix Version/s: 2.9.2
[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuqi Wang updated YARN-9151: Description: *Issue Summary:* Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[jira] [Created] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
Yuqi Wang created YARN-9151: --- Summary: Standby RM hangs (not retry or crash) forever due to forever lost from leader election Key: YARN-9151 URL: https://issues.apache.org/jira/browse/YARN-9151 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.9.2 Reporter: Yuqi Wang Assignee: Yuqi Wang Fix For: 3.1.1 Attachments: yarn_rm.zip
[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuqi Wang updated YARN-9151: Description: *Issue Summary:* Standby RM hangs (not retry or crash) forever due to forever lost from leader election
[jira] [Commented] (YARN-9150) Making TimelineSchemaCreator to support different backends for Timeline Schema Creation in ATSv2
[ https://issues.apache.org/jira/browse/YARN-9150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724867#comment-16724867 ] Hadoop QA commented on YARN-9150: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 19s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 3 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 42s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 11s{color} | {color:green} trunk passed {color} | | {color:red}-1{color} | {color:red} compile {color} | {color:red} 3m 58s{color} | {color:red} hadoop-yarn in trunk failed. {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 47s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 41s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 17m 49s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Skipped patched modules with no Java source: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice-hbase-tests {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 4s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 24s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 13s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 49s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 58s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 7m 58s{color} | {color:red} hadoop-yarn-project_hadoop-yarn generated 46 new + 87 unchanged - 0 fixed = 133 total (was 87) {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 22s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 15s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 34s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Skipped patched modules with no Java source: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice-hbase-tests {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 46s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 23s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 42s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 10s{color} | {color:green} hadoop-yarn-server-timelineservice in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 29s{color} | {color:green} hadoop-yarn-server-timelineservice-hbase-client in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 10m 33s{color} | {color:green} hadoop-yarn-server-timelineservice-hbase-tests in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 34s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} |
[jira] [Updated] (YARN-9150) Making TimelineSchemaCreator to support different backends for Timeline Schema Creation in ATSv2
[ https://issues.apache.org/jira/browse/YARN-9150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sushil Ks updated YARN-9150: Description:
h3. Currently the TimelineSchemaCreator has a concrete implementation for creating the Timeline schema only for HBase; hence this JIRA is for supporting the multiple back-ends that ATSv2 can support.

*Usage:*
Add the following property in *yarn-site.xml*
{code:xml}
<property>
  <name>yarn.timeline-service.schema-creator.class</name>
  <value>YOUR_TIMELINE_SCHEMA_CREATOR_CLASS</value>
</property>
{code}
The command used to run the TimelineSchemaCreator need not be changed, i.e. the existing command below can be used irrespective of the backend configured:
{code:java}
bin/hadoop org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator -create
{code}

was: (the same description, with the property name mangled by repeated HTML-entity escaping)

> Making TimelineSchemaCreator to support different backends for Timeline
> Schema Creation in ATSv2
>
> Key: YARN-9150
> URL: https://issues.apache.org/jira/browse/YARN-9150
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: ATSv2
> Reporter: Sushil Ks
> Assignee: Sushil Ks
> Priority: Major
> Attachments: YARN-9150.001.patch
>
> h3. Currently the TimelineSchemaCreator has a concrete implementation for
> creating the Timeline schema only for HBase; hence this JIRA is for
> supporting the multiple back-ends that ATSv2 can support.
> *Usage:*
> Add the following property in *yarn-site.xml*
> {code:xml}
> <property>
>   <name>yarn.timeline-service.schema-creator.class</name>
>   <value>YOUR_TIMELINE_SCHEMA_CREATOR_CLASS</value>
> </property>
> {code}
> The command used to run the TimelineSchemaCreator need not be changed, i.e.
> the existing command below can be used irrespective of the backend
> configured:
> {code:java}
> bin/hadoop org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator -create
> {code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
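A pluggable schema creator of the kind proposed here can be sketched as below. This is a hypothetical illustration, not the YARN-9150 patch itself: the `SchemaCreator` interface, `StubHBaseSchemaCreator` default, and `TimelineSchemaCreatorLoader` class are assumed names; only the `yarn.timeline-service.schema-creator.class` property key comes from the issue.

```java
// Hypothetical plugin interface a backend-specific schema creator would implement.
interface SchemaCreator {
  void createTimelineSchema(String[] args) throws Exception;
}

// Stand-in default backend; a real implementation would create HBase tables here.
class StubHBaseSchemaCreator implements SchemaCreator {
  public void createTimelineSchema(String[] args) {
    // no-op in this sketch
  }
}

// Loads whatever class the configuration names via reflection, falling back
// to the default backend when the property is unset. This is why the
// bin/hadoop command line stays identical regardless of backend.
class TimelineSchemaCreatorLoader {
  static final String SCHEMA_CREATOR_CLASS =
      "yarn.timeline-service.schema-creator.class";

  static SchemaCreator load(String configuredClass) throws Exception {
    String className =
        (configuredClass != null) ? configuredClass : "StubHBaseSchemaCreator";
    return (SchemaCreator) Class.forName(className)
        .getDeclaredConstructor().newInstance();
  }
}
```

The reflection-based lookup is what keeps the entry-point command unchanged: the main class stays the same and only the configured implementation class varies.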
[jira] [Commented] (YARN-5168) Add port mapping handling when docker container use bridge network
[ https://issues.apache.org/jira/browse/YARN-5168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724815#comment-16724815 ] Hadoop QA commented on YARN-5168: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 20s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 13 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 9s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 16s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 14m 55s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 3m 12s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 7m 53s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 23m 1s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Skipped patched modules with no Java source: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 9m 51s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 5m 3s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 22s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 27s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 14m 6s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 14m 6s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 14m 6s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 3m 9s{color} | {color:green} root: The patch generated 0 new + 1004 unchanged - 7 fixed = 1004 total (was 1011) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 7m 55s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 13s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Skipped patched modules with no Java source: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 11m 16s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 5m 6s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 45s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 32s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 35s{color} | {color:green} hadoop-yarn-server-common in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 19m 17s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 27s{color} | {color:green} hadoop-yarn-server-applicationhistoryservice in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red}
[jira] [Assigned] (YARN-9149) yarn container -status misses logUrl when integrated with ATSv2
[ https://issues.apache.org/jira/browse/YARN-9149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S reassigned YARN-9149: --- Assignee: Rohith Sharma K S

> yarn container -status misses logUrl when integrated with ATSv2
> ---
>
> Key: YARN-9149
> URL: https://issues.apache.org/jira/browse/YARN-9149
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Rohith Sharma K S
> Assignee: Rohith Sharma K S
> Priority: Major
>
> Post YARN-8303, the yarn client can be integrated with ATSv2, but the log
> URL and the start and finish times are printed wrongly:
> {code}
> Container Report :
> 	Container-Id : container_1545035586969_0001_01_01
> 	Start-Time : 0
> 	Finish-Time : 0
> 	State : COMPLETE
> 	Execution-Type : GUARANTEED
> 	LOG-URL : null
> 	Host : localhost:25006
> 	NodeHttpAddress : localhost:25008
> 	Diagnostics :
> {code}
> # TimelineEntityV2Converter#convertToContainerReport sets logUrl to *null*.
> It needs to be set to a proper log URL based on
> yarn.log.server.web-service.url.
> # TimelineEntityV2Converter#convertToContainerReport parses the start/end
> time wrongly. The comparison should happen with entityType, but the code
> below uses entityId:
> {code}
> if (events != null) {
>   for (TimelineEvent event : events) {
>     if (event.getId().equals(
>         ContainerMetricsConstants.CREATED_IN_RM_EVENT_TYPE)) {
>       createdTime = event.getTimestamp();
>     } else if (event.getId().equals(
>         ContainerMetricsConstants.FINISHED_IN_RM_EVENT_TYPE)) {
>       finishedTime = event.getTimestamp();
>     }
>   }
> }
> {code}
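The time-extraction logic the issue is discussing can be reproduced in a minimal, self-contained sketch. The `TimelineEvent` and `ContainerMetricsConstants` types below are simplified stand-ins for the real ATSv2 classes, and the constant string values are placeholders, not the actual Hadoop event-type strings; the `ContainerReportTimes` class is an assumed name for illustration only.

```java
import java.util.List;

// Simplified stand-in for the ATSv2 TimelineEvent record.
class TimelineEvent {
  private final String id;
  private final long timestamp;
  TimelineEvent(String id, long timestamp) {
    this.id = id;
    this.timestamp = timestamp;
  }
  String getId() { return id; }
  long getTimestamp() { return timestamp; }
}

// Placeholder values; the real constants live in ContainerMetricsConstants.
class ContainerMetricsConstants {
  static final String CREATED_IN_RM_EVENT_TYPE = "CREATED_IN_RM";
  static final String FINISHED_IN_RM_EVENT_TYPE = "FINISHED_IN_RM";
}

// Extracts created/finished timestamps by matching each event against the
// expected event-type constants, so Start-Time and Finish-Time come out
// non-zero whenever the matching events are present.
class ContainerReportTimes {
  static long[] extractTimes(List<TimelineEvent> events) {
    long createdTime = 0;
    long finishedTime = 0;
    if (events != null) {
      for (TimelineEvent event : events) {
        if (ContainerMetricsConstants.CREATED_IN_RM_EVENT_TYPE
            .equals(event.getId())) {
          createdTime = event.getTimestamp();
        } else if (ContainerMetricsConstants.FINISHED_IN_RM_EVENT_TYPE
            .equals(event.getId())) {
          finishedTime = event.getTimestamp();
        }
      }
    }
    return new long[] { createdTime, finishedTime };
  }
}
```

With a correct match the report would show real timestamps; the bug report indicates the shipped converter compares the wrong field, so no event ever matches and both times stay at their 0 defaults.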